Missing value analysis
Missing observations are coded as NA in R. Missing data can have different implications for data summaries, analyses, and the conclusions drawn from data with missing values. In this post, different types of missing data are reviewed and explored in data examples.
The example data are generated with the mvrnorm function from the MASS package. Missing values are then created with the MCAR function from the dmo package, so the data are missing completely at random.
cov <- matrix(c(rep(0.7, 25)), nrow = 5)
diag(cov) <- 1
dat <- MASS::mvrnorm(n = 25, rep(0,5), Sigma = cov)
# one row per missing data pattern: 0 = variable is made missing, 1 = variable stays observed
patterns <- matrix(c(0,1,1,0,0,
1,0,1,1,0), nrow = 2, byrow = TRUE)
set.seed(23421)
datm <- dmo::MCAR(data.frame(dat), alpha = 0.50, pattern = patterns, f = c(0.5, 0.5))
datm
## X1 X2 X3 X4 X5
## 1 0.48254355 NA 1.22503920 0.94830819 NA
## 2 -0.76593931 -0.58947972 0.23381032 -0.92179261 0.3973875
## 3 -1.27156163 -1.16344909 -1.32924974 -0.18096602 -0.9381264
## 4 NA 0.51468335 0.69357134 NA NA
## 5 -0.87224773 -1.32694752 -2.10149597 -2.07193068 -0.5811637
## 6 NA -1.29365430 -0.23190870 NA NA
## 7 -0.63441654 0.63175205 0.88374347 0.49630705 0.8510002
## 8 -1.03565673 NA -0.50081545 0.22400813 NA
## 9 -1.45818759 NA -0.93729349 -1.46735772 NA
## 10 -0.10557647 -0.42966109 -1.94293461 0.23174833 -1.4388635
## 11 0.40391895 0.45833012 0.18081480 0.81902851 1.1755890
## 12 0.25211079 0.62311673 1.34122157 1.16316744 0.8369275
## 13 NA 0.45321311 0.02143118 NA NA
## 14 -0.01333257 NA -0.24037155 -0.06467755 NA
## 15 -0.65493146 NA -0.08024527 -1.97446923 NA
## 16 0.20929929 0.05071251 -0.08652511 0.02855808 0.9680992
## 17 0.21796759 0.31319513 -0.12847812 0.06131959 0.5502352
## 18 1.91382365 1.15337328 1.28467733 2.38323910 1.1387217
## 19 NA 0.57147096 -0.73910012 NA NA
## 20 -1.13260425 -0.50749712 0.56741058 0.18329881 -0.8296074
## 21 -0.28239083 -0.34652718 -0.79294057 -0.55645359 -1.2081233
## 22 0.36607300 NA 0.50513287 0.69087834 NA
## 23 1.95303223 2.35087305 2.21686758 2.12909474 2.2496803
## 24 -0.51180792 -0.50407539 0.42609062 -0.75193336 -0.6362983
## 25 -0.19527936 -1.06955603 0.38553712 0.63139846 -0.7634680
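To make the idea of "completely at random" concrete, here is a minimal base R sketch of how such patterns could be imposed. It is not the dmo implementation: make_mcar is a hypothetical helper, the toy data are made up, and the arguments alpha and f only loosely borrow the names used above without assuming their exact meaning in dmo::MCAR. The key point is that rows and patterns are picked without looking at any data value.

```r
# Hypothetical helper (not part of dmo or mice): impose patterns completely at random.
# patterns: one row per pattern, 0 = make missing, 1 = keep observed.
# alpha: share of rows that receive a pattern; f: relative pattern frequencies.
make_mcar <- function(data, patterns, alpha = 0.5, f = NULL) {
  n <- nrow(data)
  k <- nrow(patterns)
  if (is.null(f)) f <- rep(1 / k, k)
  # Rows are drawn independently of the data values: this is what makes it MCAR
  rows <- sample(n, size = round(alpha * n))
  assigned <- sample(k, size = length(rows), replace = TRUE, prob = f)
  for (i in seq_along(rows)) {
    data[rows[i], patterns[assigned[i], ] == 0] <- NA
  }
  data
}

set.seed(1)
toy <- data.frame(matrix(rnorm(100), nrow = 20))   # toy data, 20 rows x 5 columns
pats <- matrix(c(0, 1, 1, 0, 0,
                 1, 0, 1, 1, 0), nrow = 2, byrow = TRUE)
toy_m <- make_mcar(toy, pats, alpha = 0.5)
mean(!complete.cases(toy_m))   # share of incomplete rows
```

In this sketch exactly round(alpha * n) rows become incomplete, which is a simplification chosen for clarity.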
Explore the amount of missing data
The amount of missing data can be expressed from different points of view. We can count the number of missing entries in the entire data set with the is.na function. In total there are 24 missing data entries. The data frame contains 5 variables for 25 subjects, which makes a total of 125 data entries. Accordingly, 24 / 125 = 19.2% of the data entries are missing.
sum(is.na(datm))
## [1] 24
sum(is.na(datm))/length(is.na(datm)) * 100
## [1] 19.2
Another way to look at the amount of missing data is to summarize the missing observations per variable. For each variable we can count the number of missing observations (n) and calculate the proportion missing (p).
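library(dplyr) # %>%, summarise_all(), mutate(), group_by(), summarise()
library(tidyr) # pivot_longer()
datm %>%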
is.na %>%
data.frame() %>%
summarise_all(list(n = sum, p = mean)) %>%
pivot_longer(everything(),
names_to = c("variable", ".value"),
names_pattern = "(.*)_(.)")
## # A tibble: 5 x 3
## variable n p
## <chr> <int> <dbl>
## 1 X1 4 0.16
## 2 X2 6 0.24
## 3 X3 0 0
## 4 X4 4 0.16
## 5 X5 10 0.4
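The same per-variable summary can be sketched in base R without the tidyverse pipeline. The toy data frame below is an assumption for illustration only; it is not the datm object used above.

```r
# Toy data (made up for illustration)
toy <- data.frame(x = c(1, NA, 3, NA), y = c(NA, 2, 3, 4), z = 1:4)

miss <- is.na(toy)            # logical matrix: TRUE marks a missing entry
per_var <- data.frame(
  variable = colnames(miss),
  n = colSums(miss),          # number of missing values per variable
  p = colMeans(miss),         # proportion missing per variable
  row.names = NULL
)
per_var

overall <- mean(miss) * 100   # percentage of all entries that are missing
overall
```

Because is.na returns a logical matrix, sums count the missing entries and means give proportions directly.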
When data with missing values are analyzed, many analysis methods use only the rows that are fully observed. This is called a complete-case analysis; the incomplete rows are listwise deleted. To assess the implications of the missing values for this analysis strategy, we have to count the rows that contain missing values and the rows that contain none.
datm %>%
is.na %>%
data.frame() %>%
mutate(n_miss = rowSums(.),
missing = ifelse(n_miss > 0, "rows with missings", "rows without missings")) %>%
group_by(missing) %>%
summarise(n = n(),
p = n / 25)
## # A tibble: 2 x 3
## missing n p
## <chr> <int> <dbl>
## 1 rows with missings 10 0.4
## 2 rows without missings 15 0.6
The cci function in the mice package creates a logical indicator that is TRUE for each fully observed row.
mice::cci(datm)
## [1] FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
## [13] FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
## [25] TRUE
The nic function in the mice package counts the number of incomplete cases, i.e. cases with missing values.
mice::nic(datm)
## [1] 10
The ncc function in the mice package counts the number of complete cases, i.e. fully observed rows.
mice::ncc(datm)
## [1] 15
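For comparison, the same three quantities can be obtained in base R with complete.cases. This is a small sketch on made-up toy data, not on the datm object above.

```r
# Toy data (made up for illustration)
toy <- data.frame(x = c(1, NA, 3, 4), y = c(1, 2, NA, 4), z = 1:4)

complete <- complete.cases(toy)   # TRUE for rows without any NA
complete

n_complete <- sum(complete)       # number of complete cases
n_incomplete <- sum(!complete)    # number of incomplete cases

# Listwise deletion keeps only the complete rows
toy_cc <- toy[complete, ]
toy_cc
```

The last step is what a complete-case analysis does implicitly: every row with at least one NA is dropped before the model is fitted.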
Missing data patterns
A missing data pattern is the combination of observed and unobserved values that occur together in a row. A pattern is generally written with a 0 for a missing value and a 1 for an observed value. Data often contain several different missing data patterns. The way we generated the missing data in our example produces three patterns. The first pattern is fully observed, so a row of only ones. The second pattern has three variables observed and two missing, and the third has two variables observed and three missing. The md.pattern function from the mice package displays the missing data patterns in the data. The row names show the number of times each pattern occurs in the data. The final column shows the number of missing values in each pattern, and the final row shows the number of missing values per variable.
mice::md.pattern(datm, plot = FALSE)
## X3 X1 X4 X2 X5
## 15 1 1 1 1 1 0
## 6 1 1 1 0 0 2
## 4 1 0 0 1 0 3
## 0 4 4 6 10 24
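The idea behind such a pattern table can be sketched in base R: code each row as a string of 0s and 1s and tabulate the strings. This is only an illustration on made-up toy data, and it does not reproduce the sorting or the margins of the md.pattern output.

```r
# Toy data (made up for illustration)
toy <- data.frame(x = c(1, NA, 3, NA, 5), y = c(1, 2, NA, 4, 5), z = 1:5)

r <- (!is.na(toy)) * 1                        # 1 = observed, 0 = missing
pattern <- apply(r, 1, paste, collapse = "")  # one string per row, e.g. "101"
pattern_counts <- table(pattern)              # how often each pattern occurs
pattern_counts

n_missing <- rowSums(r == 0)                  # number of missing values per row
```

Each distinct string is one missing data pattern, and its count corresponds to the row names in the output above.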
The missing data patterns can also be shown per variable pair. The number of times two variables are missing together or observed together can inform us about how many cases we can actually use for imputation. The md.pairs function from the mice package returns four matrices. Each matrix gives us information about combinations of missing values in our data.
rr: response-response, the count of how often two variables are both observed.
rm: response-missing, the count of how often the row variable is observed and the column variable is missing.
mr: missing-response, the count of how often the row variable is missing and the column variable is observed.
mm: missing-missing, the count of how often two variables are both missing.
pat <- mice::md.pairs(datm)
pat
## $rr
## X1 X2 X3 X4 X5
## X1 21 15 21 21 15
## X2 15 19 19 15 15
## X3 21 19 25 21 15
## X4 21 15 21 21 15
## X5 15 15 15 15 15
##
## $rm
## X1 X2 X3 X4 X5
## X1 0 6 0 0 6
## X2 4 0 0 4 4
## X3 4 6 0 4 10
## X4 0 6 0 0 6
## X5 0 0 0 0 0
##
## $mr
## X1 X2 X3 X4 X5
## X1 0 4 4 0 0
## X2 6 0 6 6 0
## X3 0 0 0 0 0
## X4 0 4 4 0 0
## X5 6 4 10 6 0
##
## $mm
## X1 X2 X3 X4 X5
## X1 4 0 0 4 4
## X2 0 6 0 0 6
## X3 0 0 0 0 0
## X4 4 0 0 4 4
## X5 4 6 0 4 10
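For each pair of variables, the missing-response count divided by the sum of the missing-response and missing-missing counts gives the percentage of usable cases for imputing the row variable from the column variable: among the cases where the row variable is missing, the share in which the column variable is observed.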
round(100 * pat$mr / (pat$mr + pat$mm))
## X1 X2 X3 X4 X5
## X1 0 100 100 0 0
## X2 100 0 100 100 0
## X3 NaN NaN NaN NaN NaN
## X4 0 100 100 0 0
## X5 60 40 100 60 0
Note that X3 does not have any missing data, so its row is 0/0 and is displayed as NaN.
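The four pairwise counts and the usable-case percentage can be sketched in base R with matrix cross-products of the response indicator. This is an illustration on made-up toy data and makes no claim about how mice computes these matrices internally.

```r
# Toy data (made up for illustration)
toy <- data.frame(x = c(1, NA, 3, NA, 5), y = c(1, 2, NA, NA, 5), z = 1:5)

r <- (!is.na(toy)) * 1   # 1 = observed
m <- 1 - r               # 1 = missing

# crossprod(a, b) is t(a) %*% b: it counts rows where both indicators equal 1
counts <- list(
  rr = crossprod(r),     # row variable observed, column variable observed
  rm = crossprod(r, m),  # row variable observed, column variable missing
  mr = crossprod(m, r),  # row variable missing,  column variable observed
  mm = crossprod(m)      # row variable missing,  column variable missing
)

# Percentage of usable cases for imputing the row variable from the column variable
usable <- round(100 * counts$mr / (counts$mr + counts$mm))
usable
```

For every pair, the four counts add up to the number of rows, which is a quick sanity check on the result.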
Types of missing data
In research, missing data occur when a data value is unavailable. Many empirical studies encounter missing data. Missing data can occur in many stages of research due to many different causes in many different forms.
Missing data can occur because an invited respondent does not participate in the study: non-response. In case of non-response there is often no information available about the respondents who did not participate, besides the information used to select study participants.
Missing data can occur on one or more of the measured variables that are used as a predictor, covariate or outcome. This is often referred to as intermittent missing data.
When participants in a longitudinal study do not show up at one or more repeated measurement occasions, the missing data are often referred to as drop-out or loss to follow-up.
Each type of missing data may have different causes, and also different implications for the methods used to deal with the missing data.