Missing value analysis

Missing observations are represented as NA in R. Missing values can affect data summaries, analyses, and the conclusions drawn from the data. In this post, different types of missing data are reviewed and explored in example data.

The example data are generated with the mvrnorm function from the MASS package, and missing values are then created with the MCAR function from the dmo package. The missing data are created completely at random.

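# Covariance matrix: unit variances, all pairwise covariances 0.7
cov <- matrix(rep(0.7, 25), nrow = 5)
diag(cov) <- 1
# Note: the seed is set after this draw, so the simulated values themselves are not reproducible
dat <- MASS::mvrnorm(n = 25, mu = rep(0, 5), Sigma = cov)

# Each row is a missing data pattern: 0 = missing, 1 = observed
patterns <- matrix(c(0,1,1,0,0,
                     1,0,1,1,0), nrow = 2, byrow = TRUE)
set.seed(23421)
datm <- dmo::MCAR(data.frame(dat), alpha = 0.50, pattern = patterns, f = c(0.5, 0.5))
datm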
##             X1          X2          X3          X4         X5
## 1   0.48254355          NA  1.22503920  0.94830819         NA
## 2  -0.76593931 -0.58947972  0.23381032 -0.92179261  0.3973875
## 3  -1.27156163 -1.16344909 -1.32924974 -0.18096602 -0.9381264
## 4           NA  0.51468335  0.69357134          NA         NA
## 5  -0.87224773 -1.32694752 -2.10149597 -2.07193068 -0.5811637
## 6           NA -1.29365430 -0.23190870          NA         NA
## 7  -0.63441654  0.63175205  0.88374347  0.49630705  0.8510002
## 8  -1.03565673          NA -0.50081545  0.22400813         NA
## 9  -1.45818759          NA -0.93729349 -1.46735772         NA
## 10 -0.10557647 -0.42966109 -1.94293461  0.23174833 -1.4388635
## 11  0.40391895  0.45833012  0.18081480  0.81902851  1.1755890
## 12  0.25211079  0.62311673  1.34122157  1.16316744  0.8369275
## 13          NA  0.45321311  0.02143118          NA         NA
## 14 -0.01333257          NA -0.24037155 -0.06467755         NA
## 15 -0.65493146          NA -0.08024527 -1.97446923         NA
## 16  0.20929929  0.05071251 -0.08652511  0.02855808  0.9680992
## 17  0.21796759  0.31319513 -0.12847812  0.06131959  0.5502352
## 18  1.91382365  1.15337328  1.28467733  2.38323910  1.1387217
## 19          NA  0.57147096 -0.73910012          NA         NA
## 20 -1.13260425 -0.50749712  0.56741058  0.18329881 -0.8296074
## 21 -0.28239083 -0.34652718 -0.79294057 -0.55645359 -1.2081233
## 22  0.36607300          NA  0.50513287  0.69087834         NA
## 23  1.95303223  2.35087305  2.21686758  2.12909474  2.2496803
## 24 -0.51180792 -0.50407539  0.42609062 -0.75193336 -0.6362983
## 25 -0.19527936 -1.06955603  0.38553712  0.63139846 -0.7634680
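
To make the idea of "completely at random" concrete, here is a minimal base R sketch of how such missingness could be generated by hand. It is not the implementation used by dmo::MCAR; the data frame `toy`, the object names, and the assignment scheme are assumptions made for illustration only. Each row is kept complete or assigned to one of the patterns purely by chance, without looking at any data values.

```r
set.seed(1)
# Hypothetical toy data, unrelated to the example data above
toy <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))

# Missing data patterns: 0 = set to missing, 1 = keep observed
toy_patterns <- matrix(c(0, 1, 1,
                         1, 0, 1), nrow = 2, byrow = TRUE)
alpha <- 0.5      # assumed probability that a row becomes incomplete
f <- c(0.5, 0.5)  # assumed relative frequency of each pattern

# Draw a label per row: 0 = stay complete, k = receive pattern k.
# The draw ignores the data values, which is what makes it MCAR.
assigned <- sample(0:nrow(toy_patterns), size = nrow(toy), replace = TRUE,
                   prob = c(1 - alpha, alpha * f))

toy_mis <- toy
for (k in seq_len(nrow(toy_patterns))) {
  rows <- which(assigned == k)
  if (length(rows) > 0) {
    toy_mis[rows, toy_patterns[k, ] == 0] <- NA
  }
}
toy_mis
```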

Explore the amount of missing data

The amount of missing data can be expressed from different points of view. We can count the number of missing entries in the entire data set by using the is.na function. In total there are 24 missing data entries. The data frame contains 5 variables for 25 subjects, which makes a total of 125 data entries. Accordingly, 19.2% of the data entries are missing.

sum(is.na(datm))
## [1] 24
sum(is.na(datm))/length(is.na(datm)) * 100
## [1] 19.2
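
Because is.na returns a logical matrix, the same percentage can also be read off as its mean. A small sketch on a hypothetical data frame (`toy` is made up for illustration and is not part of the example data):

```r
# Hypothetical toy data: 3 rows x 2 columns, 2 of the 6 entries are missing
toy <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

n_missing   <- sum(is.na(toy))         # number of missing entries
pct_missing <- mean(is.na(toy)) * 100  # percentage of missing entries
n_missing
pct_missing
```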

Another way to look at the amount of missing data is to summarize the missing observations per variable. For each variable we can count the number of missing observations (n) and calculate the proportion (p).

library(dplyr)  # %>%, summarise_all, mutate, group_by, summarise
library(tidyr)  # pivot_longer

datm %>%
  is.na %>%
  data.frame() %>%
  summarise_all(list(n = sum, p = mean)) %>%
  pivot_longer(everything(), 
               names_to = c("variable", ".value"),
               names_pattern = "(.*)_(.)")
## # A tibble: 5 x 3
##   variable     n     p
##   <chr>    <int> <dbl>
## 1 X1           4  0.16
## 2 X2           6  0.24
## 3 X3           0  0   
## 4 X4           4  0.16
## 5 X5          10  0.4
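
The same per-variable summary can be sketched in base R with colSums and colMeans, which may be handy when the tidyverse is not loaded. The data frame `toy` and the name `miss_per_var` are hypothetical and only serve as illustration.

```r
# Hypothetical toy data with 2, 1 and 0 missing values per column
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c(NA, 5, 6, 7),
                  c = c(1, 2, 3, 4))

miss_per_var <- data.frame(
  variable = names(toy),
  n = unname(colSums(is.na(toy))),   # number of missing values per variable
  p = unname(colMeans(is.na(toy)))   # proportion of missing values per variable
)
miss_per_var
```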

When data with missing values are analyzed, many analysis methods only use the rows that are fully observed. This is called a complete-case analysis: rows with any missing value are deleted listwise. To assess what this strategy means for our data, we have to count the rows that contain missing values and the rows that contain none.

datm %>% 
  is.na %>%
  data.frame() %>%
  mutate(n_miss = rowSums(.),
         missing = ifelse(n_miss > 0, "rows with missing", "rows without missing")) %>%
  group_by(missing) %>%
  summarise(n = n(),
            p = n / nrow(datm))
## # A tibble: 2 x 3
##   missing                  n     p
##   <chr>                <int> <dbl>
## 1 rows with missing       10   0.4
## 2 rows without missing    15   0.6

The cci function in the mice package creates an indicator that is TRUE for fully observed rows and FALSE for rows with missing values.

mice::cci(datm)
##  [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## [13] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
## [25]  TRUE

The nic function in the mice package counts the number of incomplete cases, i.e. cases with missing values.

mice::nic(datm)
## [1] 10

The ncc function in the mice package counts the number of complete cases, i.e. fully observed rows.

mice::ncc(datm)
## [1] 15
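
Without the mice package, base R's complete.cases function gives the same kind of row indicator, from which both counts follow. A minimal sketch on a hypothetical data frame (`toy` and the object names are made up for illustration):

```r
# Hypothetical toy data: rows 1 and 2 each contain one missing value
toy <- data.frame(a = c(1, NA, 3, 4), b = c(NA, 5, 6, 7))

complete     <- complete.cases(toy)  # TRUE for fully observed rows
n_complete   <- sum(complete)        # number of complete cases
n_incomplete <- sum(!complete)       # number of incomplete cases
complete
n_complete
n_incomplete
```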

Missing data patterns

A missing data pattern is the combination of observed and unobserved values that occur together in a row. A missing data pattern is generally notated as having a 0 for a missing value and a 1 for an observed value. Data often contain multiple different missing data patterns. The way we generated the missing data in our example results in three missing data patterns. The first pattern is where the data are fully observed, so a row of only ones. The second pattern has three variables observed and two missing, and the third has two variables observed and three missing. The md.pattern function from the mice package displays the missing data patterns in the data. The row names show the number of times each pattern occurs in the data. The final column shows the number of missing values in each pattern, and the bottom row shows the number of missing values per variable.

mice::md.pattern(datm, plot = FALSE)
##    X3 X1 X4 X2 X5   
## 15  1  1  1  1  1  0
## 6   1  1  1  0  0  2
## 4   1  0  0  1  0  3
##     0  4  4  6 10 24
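
To see where these counts come from, the patterns can also be tabulated by hand: turn every row into a string of 0s and 1s and count how often each string occurs. This is only a sketch of the idea in base R, not how md.pattern is implemented; `toy` and the object names are hypothetical.

```r
# Hypothetical toy data with three distinct missing data patterns
toy <- data.frame(a = c(1, NA, 3, NA, 5),
                  b = c(1, 2, NA, 4, 5),
                  c = c(1, 2, 3, 4, 5))

r <- (!is.na(toy)) * 1L                       # response indicator: 1 = observed, 0 = missing
pattern <- apply(r, 1, paste, collapse = "")  # one string per row, e.g. "011"
pattern_counts <- table(pattern)              # how often each pattern occurs
pattern_counts
```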

The missing data patterns can also be shown per variable pair. The number of times two variables are either missing together or observed together can inform us about how many cases we can actually use for imputation. The md.pairs function from the mice package returns four matrices. Each matrix gives us information about combinations of missing values in our data.

  • rr: response-response, the count of how often two variables are both observed.
  • rm: response-missing, the count of how often the row-variable is observed and the column-variable is missing.
  • mr: missing-response, the count of how often the row-variable is missing and the column-variable is observed.
  • mm: missing-missing, the count of how often two variables are both missing.

pat <- mice::md.pairs(datm)
pat
## $rr
##    X1 X2 X3 X4 X5
## X1 21 15 21 21 15
## X2 15 19 19 15 15
## X3 21 19 25 21 15
## X4 21 15 21 21 15
## X5 15 15 15 15 15
## 
## $rm
##    X1 X2 X3 X4 X5
## X1  0  6  0  0  6
## X2  4  0  0  4  4
## X3  4  6  0  4 10
## X4  0  6  0  0  6
## X5  0  0  0  0  0
## 
## $mr
##    X1 X2 X3 X4 X5
## X1  0  4  4  0  0
## X2  6  0  6  6  0
## X3  0  0  0  0  0
## X4  0  4  4  0  0
## X5  6  4 10  6  0
## 
## $mm
##    X1 X2 X3 X4 X5
## X1  4  0  0  4  4
## X2  0  6  0  0  6
## X3  0  0  0  0  0
## X4  4  0  0  4  4
## X5  4  6  0  4 10

The percentage of usable cases is the missing-response count divided by the sum of the missing-response and missing-missing counts. Among the rows where the row variable is missing, it shows in what share the column variable is observed and can therefore be used to impute the row variable.

round(100 * pat$mr / (pat$mr + pat$mm))
##     X1  X2  X3  X4  X5
## X1   0 100 100   0   0
## X2 100   0 100 100   0
## X3 NaN NaN NaN NaN NaN
## X4   0 100 100   0   0
## X5  60  40 100  60   0

Note that X3 does not have any missing data, so every entry in its row is 0 divided by 0 and is displayed as NaN.
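
The four pairwise matrices and the usable-case percentages can be reproduced with a few matrix products on the response indicator. The sketch below uses a hypothetical data frame `toy` and made-up object names; it illustrates the counting logic described above and is not taken from the mice source code.

```r
# Hypothetical toy data: a and b have missing values, c is fully observed
toy <- data.frame(a = c(1, NA, 3, NA, 5),
                  b = c(1, 2, NA, NA, 5),
                  c = c(1, 2, 3, 4, 5))

r <- (!is.na(toy)) * 1  # 1 = observed, 0 = missing
m <- 1 - r              # 1 = missing,  0 = observed

pairs_rr <- t(r) %*% r  # both variables observed
pairs_rm <- t(r) %*% m  # row variable observed, column variable missing
pairs_mr <- t(m) %*% r  # row variable missing, column variable observed
pairs_mm <- t(m) %*% m  # both variables missing

# Percentage of usable cases for imputing the row variable from the column variable
usable <- round(100 * pairs_mr / (pairs_mr + pairs_mm))
usable
```

In this toy example a is missing in two rows and b is observed in one of them, so 50% of those rows can use b to impute a. The row for c contains only NaN because c is never missing.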

Types of missing data

In research, missing data occur when a data value is unavailable. Many empirical studies encounter missing data. Missing data can occur in many stages of research due to many different causes in many different forms.

Missing data can occur because an invited respondent does not participate in the study: non-response. In case of non-response there is often no information available about the respondents that did not participate, besides the information used to select study participants.

Missing data can occur on one or more of the measured variables that are used as a predictor, covariate or outcome. This is often referred to as intermittent missing data.

When participants in a longitudinal study do not show up at one or more repeated measurement occasions, the missing data are often referred to as drop-out or loss to follow up.

Each type of missing data may have different causes, and also different implications for the methods used to deal with the missing data.

Iris Eekhout, PhD
Statistician

Iris works on a variety of projects as methodologist and statistical analyst related to child health, e.g. measuring child development (D-score) and adaptive screenings for psycho-social problems (psycat).
