icc_varcomp.Rmd
This document describes the computational background and the use of the icc() function from the Agree package. We developed the icc() functions for this package in connection with a simulation study about sample size requirements for studies on reliability and measurement error (Mokkink, de Vet, and Eekhout 2022a) and a methodological paper about how to design and conduct a study on reliability and measurement error (Mokkink, de Vet, and Eekhout 2022b).
library(Agree)
>
> Attaching package: 'Agree'
> The following object is masked from 'package:base':
>
> kappa
The intra-class correlation (ICC) is usually obtained for continuous ratings. As an example we can use data from the study by Dikmans et al. (2017). This data is based on photographs of breasts of 50 women after breast reconstruction. The photographs are independently scored by five surgeons, the patients, and three mammography nurses. They each rated the quality of the reconstruction on a 5-point ordinal scale, with the verbal anchors 'very dissatisfied' on the left end and 'very satisfied' on the right end. They specifically rated the volume, shape, symmetry, scars and nipple. For the icc() examples we can use the sum scores for volume, shape, symmetry, scars and nipple as an overall rating from each rater.
library(dplyr)
breast_scores <- Agree::breast %>%
  dplyr::select(Patient_score, PCH1_score, PCH2_score, PCH3_score, PCH4_score,
                PCH5_score, Mam1_score, Mam2_score, Mam3_score)
head(breast_scores)
> Patient_score PCH1_score PCH2_score PCH3_score PCH4_score PCH5_score
> 1 10.0 9 9 8 7.0 8
> 2 NA 7 NA NA 8.0 7
> 3 5.5 8 7 7 7.0 8
> 4 7.0 NA 5 4 4.0 8
> 5 1.0 1 2 2 3.0 1
> 6 8.0 8 7 NA 7.5 8
> Mam1_score Mam2_score Mam3_score
> 1 8 8 8
> 2 9 7 NA
> 3 8 NA 7
> 4 NA 5 NA
> 5 3 3 4
> 6 9 7 7
The example data shows missings. The icc() function can deal with these missings, because a mixed model is used to estimate the variances from which the icc is computed.
For a mixed model, the data needs to be restructured to a long format. We can use the pivot_longer() function from the tidyr package to do that:
library(tidyr)
breast_long <- breast_scores %>%
  mutate(id = 1:nrow(breast_scores)) %>% # add id column
  pivot_longer(cols = -id, names_to = "rater", values_to = "score")
breast_long
> # A tibble: 450 x 3
>       id rater         score
>    <int> <chr>         <dbl>
>  1     1 Patient_score    10
>  2     1 PCH1_score        9
>  3     1 PCH2_score        9
>  4     1 PCH3_score        8
>  5     1 PCH4_score        7
>  6     1 PCH5_score        8
>  7     1 Mam1_score        8
>  8     1 Mam2_score        8
>  9     1 Mam3_score        8
> 10     2 Patient_score    NA
> # ... with 440 more rows
The variances that are used to compute the icc are obtained from a linear mixed model. This model is estimated with the lmer() function from the lme4 package. In the breast example we have two levels: patients are level 1 and raters/observers are level 2. The two-level multilevel model is defined as \(Y_{ijr} = \beta_0 + b_{0j} + b_{0r} + \epsilon_{ijr}\), where \(b_{0j}\) is the random intercept at the subject level and \(b_{0r}\) the random intercept at the rater/observer level. The \(\epsilon_{ijr}\) is the residual error. The R code for this model in lme4 is:
lmer(score ~ (1|id) + (1|rater), data = breast_long, REML = TRUE)
The exact specification of the multilevel model depends on the design of the study and the type of ICC that one wants to compute.
There are three types of ICC incorporated in the icc() function: the ICC oneway, the ICC agreement, and the ICC consistency.
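As a compact overview (the same calls are worked out one by one below), the three types correspond to the following varcomp() model specifications:
# ICC oneway:      only subjects random, raters ignored
varcomp(score ~ (1|id), data = breast_long)
# ICC agreement:   subjects and raters both random
varcomp(score ~ (1|id) + (1|rater), data = breast_long)
# ICC consistency: subjects random, raters fixed
varcomp(score ~ (1|id) + rater, data = breast_long)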
The ICC type oneway is the variance between the subjects (\(\sigma^2_j\)) divided by the sum of the subject variance (\(\sigma^2_j\)) and the residual variance (\(\sigma^2_{\epsilon}\)). The \(ICC_{oneway}\) is computed as follows:
\(ICC_{oneway} = \frac{\sigma^2_j}{\sigma^2_j + \sigma^2_{\epsilon}}\)
The ICC oneway assumes that each subject is rated by a different set of raters, who are randomly selected from a larger population of judges (Shrout and Fleiss 1979). The icc_oneway() function uses the varcomp() function to compute the variance components. These variances are estimated from a lmer model with a random intercept for the subjects: \(Y_{ij} = \beta_0 + b_{0j} + \epsilon_{ij}\)
The standard error of measurement (\(sem\)) is the square root of the error variance (i.e. \(sem = \sqrt{\sigma^2_{\epsilon}}\)). The confidence intervals are computed with the exact F method: \(F = \frac{k \sigma^2_{j} + \sigma^2_{\epsilon}}{\sigma^2_{\epsilon}}\), with \(df1 = n - 1\) and \(df2 = n (k - 1)\) (Shrout and Fleiss 1979).
For the oneway ICC, only level 1, the patient level, is random. The rater variance is not used.
The R code to extract the variance components with the varcomp() function is:
varcomp(score ~ (1|id), data = breast_long)
> grp vcov
> id id 2.051072
> Residual Residual 1.047193
The variance components can be used to compute the ICC oneway:
vc <- varcomp(score ~ (1|id), data = breast_long)
vc["id", "vcov"] / sum(vc[, "vcov"])
> [1] 0.6620067
We have also incorporated a function that computes the ICC oneway directly from the data in the wide format, using the same steps. This is the icc_oneway() function.
icc_oneway(breast_scores)
> icc L_icc U_icc sem varj_oneway varerr_oneway
> 1 0.6620067 0.5638296 0.7598147 1.023324 2.051072 1.047193
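As an illustration of the exact F method described above, the sem and confidence interval reported by icc_oneway() could be reproduced by hand from the variance components. The sketch below is not the package's internal code; it assumes n = 50 subjects and k = 9 raters, as in the breast data.
# Sketch: oneway sem and exact-F confidence interval (Shrout and Fleiss, 1979)
varj <- 2.051072   # subject variance from varcomp() above
vare <- 1.047193   # residual (error) variance from varcomp() above
n <- 50            # number of subjects
k <- 9             # number of raters
sqrt(vare)                           # sem, reported as 1.023324 above
Fval <- (k * varj + vare) / vare     # exact F statistic
FL <- Fval / qf(0.975, n - 1, n * (k - 1))
FU <- Fval * qf(0.975, n * (k - 1), n - 1)
(FL - 1) / (FL + k - 1)              # lower bound, should match L_icc above
(FU - 1) / (FU + k - 1)              # upper bound, should match U_icc above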
The ICC type agreement is the variance between the subjects (\(\sigma^2_j\)) divided by the sum of the subject variance (\(\sigma^2_j\)), the rater variance (\(\sigma^2_r\)) and the residual variance (\(\sigma^2_\epsilon\)). The \(ICC_{agreement}\) is computed as follows:
\(ICC_{agreement} = \frac{\sigma^2_j}{\sigma^2_j + \sigma^2_r + \sigma^2_{\epsilon}}\)
The ICC for agreement generalizes to other raters within a population (Shrout and Fleiss 1979). All subjects are rated by the same set of raters, and the rater variance is taken into account in the calculation of the ICC. The variance components are computed with the icc_model() function. This is a lmer model with a random intercept for the subjects and for the raters. The \(sem\) is the square root of the sum of the rater variance and the error variance (i.e. \(sem = \sqrt{\sigma^2_r + \sigma^2_\epsilon}\)). The confidence intervals are approximated to account for the three independent variance components, as defined by Satterthwaite (1946) and Shrout and Fleiss (1979).
For the ICC for agreement, both level 1 and level 2 are random.
The R code to extract the variance components with the varcomp() function is:
varcomp(score ~ (1|id) + (1|rater), data = breast_long)
> grp vcov
> id id 2.0069835
> rater rater 0.1167749
> Residual Residual 0.9424505
The variance components can be used to compute the ICC agreement:
vc <- varcomp(score ~ (1|id) + (1|rater), data = breast_long)
vc["id", "vcov"] / sum(vc[, "vcov"])
> [1] 0.6545488
We have also incorporated a function that computes the ICC for agreement directly from the data in the wide format, using the same steps. This is the icc_agreement() function.
icc_agreement(breast_scores)
> icc L_icc U_icc sem varj_agr varr_agr varerr_agr
> 1 0.6545488 0.5554292 0.75384 1.029187 2.006984 0.1167749 0.9424505
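The sem reported above can be reproduced by hand from the rater and error variance components (a quick check, not the package's internal code):
# sem for the ICC agreement: square root of rater variance plus error variance
vc <- varcomp(score ~ (1|id) + (1|rater), data = breast_long)
sqrt(vc["rater", "vcov"] + vc["Residual", "vcov"])   # equals the sem of 1.029187 above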
The ICC type consistency is the variance between the subjects (\(\sigma^2_j\)) divided by the sum of the subject variance (\(\sigma^2_j\)) and the residual variance (\(\sigma^2_\epsilon\)). The rater variance is not used to calculate the ICC, and the raters can therefore also be considered a fixed effect. The \(ICC_{consistency}\) is computed as follows:
\(ICC_{consistency} = \frac{\sigma^2_j}{\sigma^2_j + \sigma^2_{\epsilon}}\)
The ICC for consistency generalizes only to the set of raters in the data (Shrout and Fleiss 1979). The varcomp() function is used to compute the variance components. These variances are computed from a lmer model with a random intercept for the subjects and a fixed effect for the raters. The sem is the square root of the error variance. The confidence intervals are computed with the exact F method: \(F = \frac{k \sigma^2_j + \sigma^2_\epsilon}{\sigma^2_\epsilon}\), with \(df1 = n - 1\) and \(df2 = (n - 1)(k - 1)\) (Shrout and Fleiss 1979).
For the ICC for consistency, level 1 is a random effect and level 2 is fixed.
The R code to extract the variance components with the varcomp() function is:
varcomp(score ~ (1|id) + rater, data = breast_long)
> grp vcov
> id id 1.9956492
> Residual Residual 0.9428479
The variance components can be used to compute the ICC consistency:
vc <- varcomp(score ~ (1|id) + rater, data = breast_long)
vc["id", "vcov"] / sum(vc[, "vcov"])
> [1] 0.6791394
We have also incorporated a function that computes the ICC consistency directly from the data in the wide format, using the same steps. This is the icc_consistency() function.
icc_consistency(breast_scores)
> icc L_icc U_icc sem varj_cons varerr_cons
> 1 0.6791394 0.5831551 0.7734836 0.9710035 1.995649 0.9428479
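Compared to the oneway ICC, the only change in the exact F interval is the second degrees of freedom, \(df2 = (n - 1)(k - 1)\). A quick hand computation with the components above (again a sketch, not the package's internal code):
# Sketch: exact-F confidence interval for the ICC consistency
varj <- 1.995649; vare <- 0.9428479   # variance components from above
n <- 50; k <- 9
Fval <- (k * varj + vare) / vare
FL <- Fval / qf(0.975, n - 1, (n - 1) * (k - 1))
FU <- Fval * qf(0.975, (n - 1) * (k - 1), n - 1)
(FL - 1) / (FL + k - 1)   # lower bound, should match L_icc above
(FU - 1) / (FU + k - 1)   # upper bound, should match U_icc above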
We have one general icc() function that computes all three ICC types for a data set. The differences in computations between the ICC types can quickly be seen in the variance components returned by the icc() function. We can obtain the variances by using var = TRUE in the icc() function; the var_level2 column shows the variance between the raters. Only for the ICC for agreement is this variance component estimated.
# ICC for all methods
icc(breast_scores, var = TRUE)
> icc lower upper sem var_level1 var_level2
> oneway 0.6620067 0.5638296 0.7598147 1.0233245 2.051072 NA
> agreement 0.6545488 0.5554292 0.7538400 1.0291868 2.006984 0.1167749
> consistency 0.6791394 0.5831551 0.7734836 0.9710035 1.995649 NA
> var_Residual
> oneway 1.0471930
> agreement 0.9424505
> consistency 0.9428479
When we estimate the ICC for the surgeons only, we can see that the variance at the rater level decreases. This effect is directly reflected in the ICC.
In the icc() function we can also use the data in wide format and use the cols argument to define the rater columns that we want to use.
# ICC for all methods
icc(breast_scores,
cols = c("PCH1_score", "PCH2_score", "PCH3_score", "PCH4_score", "PCH5_score"),
var = TRUE)
> icc lower upper sem var_level1 var_level2
> oneway 0.7615871 0.6710817 0.8403886 0.8591071 2.357677 NA
> agreement 0.7593693 0.6682981 0.8387870 0.8611481 2.340225 0.05026114
> consistency 0.7711953 0.6829366 0.8473933 0.8317713 2.331886 NA
> var_Residual
> oneway 0.7380650
> agreement 0.6913149
> consistency 0.6918435
When we estimate the ICC for the mammography nurses only, we see that the variance at the rater level increases. This effect is directly reflected in the ICC.
# ICC for all methods
icc(breast_scores,
cols = c("Mam1_score", "Mam2_score", "Mam3_score"),
var = TRUE)
> icc lower upper sem var_level1 var_level2
> oneway 0.6001478 0.4493754 0.7309390 1.1515241 1.990237 NA
> agreement 0.5698257 0.4137129 0.7079019 1.1919337 1.881923 0.4443499
> consistency 0.6558989 0.5162467 0.7724701 0.9899547 1.868020 NA
> var_Residual
> oneway 1.3260076
> agreement 0.9763561
> consistency 0.9800104
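The effect of the larger rater variance for the mammography nurses is easy to see when the ICC agreement and ICC consistency are recomputed by hand from the variance components reported above:
# ICC agreement includes the rater variance in the denominator,
# ICC consistency does not
1.881923 / (1.881923 + 0.4443499 + 0.9763561)   # agreement, approximately 0.570
1.868020 / (1.868020 + 0.9800104)               # consistency, approximately 0.656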
Term | Description |
---|---|
\(\beta_0\) | Fixed intercept |
\(b_{0j}\) | Random intercept at subject level |
\(b_{0r}\) | Random intercept at rater level |
\(\epsilon_{ijr}\) | Residual error |
\(\sigma^2_j\) | Variance between subjects |
\(\sigma^2_r\) | Variance between raters |
\(\sigma^2_\epsilon\) | Residual error variance |
\(k\) | Number of raters/observers |
\(n\) | Number of subjects |