This vignette shows how accuracy data can be analysed with afex using either ANOVA or a binomial generalized linear mixed model (i.e., a mixed model that uses the appropriate distributional family for such data). Accuracy data means data where each observation can be categorized as either a 0, which indicates failure, a miss, or an erroneous response, or a 1, which indicates success or a correct response.
We begin by loading the packages needed here and setting a somewhat nicer ggplot2 theme.
The data we are looking at is from Lin and colleagues (2020) and investigates ego depletion using a novel paradigm. Ego depletion is a (social) psychological concept originating from Roy Baumeister’s work which can be summed up by the phrase ‘self-control/willpower is a muscle’: An initial use of self-control, such as performing a demanding task or resisting a temptation, depletes the available self-control resources so that subsequent tasks only have limited self-control resources available. The paper introducing the concept was published in 1998 (references can be found on Wikipedia). The ego depletion concept was extremely popular until the height of the replication crisis in psychology, in which researchers also struggled to replicate the key ego depletion phenomenon just outlined. Recently, Lin and colleagues developed a new paradigm for investigating ego depletion. Their primary interest was in analyzing the data using a diffusion model, which will not be of interest here.
The task of Lin and colleagues (2020) followed the usual approach for ego depletion tasks. In a first phase, participants either worked on a high-demand task or a low-demand task (variable condition). Then, participants had to work on a Stroop task. The Stroop task consists of colour words (i.e., “red”, “blue”, or “yellow”) displayed in a colour (i.e., red, blue, or yellow). In each trial, one of the three words is displayed in one of the three colours (e.g., the word “red” could be displayed in red or blue). The task of the participants is to press the button corresponding to the colour the word is displayed in (i.e., participants have to ignore the meaning of the word but focus on the colour the letters have).
The Stroop effect is the finding that it is easier to select the correct colour in a congruent trial in which the meaning of the word and the colour the word is displayed in match (e.g., the word “red” in red). In contrast, people are slower and make more errors in an incongruent trial in which there is a mismatch between word meaning and word colour (e.g., the word “red” in blue). In other words, it is difficult for people to ignore the meaning of a word when having to decide which colour the word is displayed in. The variable determining the match between word meaning and word colour is congruency.
The hypothesis derived from the ego depletion concept is that depletion moderates, or interacts with, the congruency effect (i.e., the difference in performance between congruent and incongruent trials). In particular, the congruency effect is predicted to be smaller if participants are ego depleted compared to when they are not. We analyse this Stroop data in the following.
One of the new features of Lin et al.’s (2020) study compared to earlier ego-depletion studies is that both variables, condition and congruency, vary within-subjects. That is, each participant performed a high-demand task followed by the Stroop task and also a low-demand task followed by the Stroop task. This was usually done on two different days, one week apart, in counterbalanced order.
We then load the data, called stroop, which comes with afex. For simplicity we only focus on the data from Lin et al.’s (2020) Experiment 1 (the data of all four of their studies is part of stroop). We focus on the accuracy of participants, which is coded as either 0 (error) or 1 (correct response) in column acc. We also remove all NAs in the response variable and drop unused factor levels. This streamlines a few of the calls later on.
data("stroop")
## extract data from experiment 1 and remove NAs
stroop_e1 <- stroop %>%
  filter(!is.na(acc)) %>%
  filter(study == "1") %>%
  droplevels()
A look at the resulting data.frame reveals that we have trial-wise data. That is, each observation (i.e., row) is one response to a Stroop trial. The data is not yet aggregated so that each participant has one observation per design cell (i.e., denoting the average accuracy for each participant in this cell). We also see we still have quite a bit of data left, from 253 participants.
head(stroop_e1)
## pno condition study trialnum congruency acc rt
## 1 s1_1 deplete 1 1 congruent 1 0.626
## 2 s1_1 deplete 1 2 congruent 1 0.550
## 3 s1_1 deplete 1 3 incongruent 1 0.872
## 4 s1_1 deplete 1 4 congruent 1 0.635
## 5 s1_1 deplete 1 5 congruent 1 0.660
## 6 s1_1 deplete 1 6 incongruent 1 0.667
str(stroop_e1)
## 'data.frame': 84667 obs. of 7 variables:
## $ pno : Factor w/ 253 levels "s1_1","s1_2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ condition : Factor w/ 2 levels "control","deplete": 2 2 2 2 2 2 2 2 2 2 ...
## $ study : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
## $ trialnum : num 1 2 3 4 5 6 7 8 9 10 ...
## $ congruency: Factor w/ 2 levels "congruent","incongruent": 1 1 2 1 1 2 1 2 1 1 ...
## $ acc : num 1 1 1 1 1 1 1 1 1 1 ...
## $ rt : num 0.626 0.55 0.872 0.635 0.66 0.667 0.538 0.927 0.583 0.588 ...
We begin with an analysis of the data using standard repeated-measures ANOVA. ANOVA is probably the most common approach to analyse such data in psychology, but its use is somewhat questionable. The reason is that ANOVA assumes that the response variable is normally distributed (or more precisely, the conditional distribution of the data after taking the model into account, i.e., the residuals, is assumed to follow a normal distribution). Here, our data are only 0s and 1s which do not follow a normal but rather a binomial or Bernoulli distribution. Nevertheless, analysis of such data using models assuming a normal distribution such as ANOVA is not uncommon and in many cases leads to the same conclusions as the more complicated model discussed below. However, as we will see below, the results can also change.
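One way to see why the normal assumption is questionable for 0/1 data is that the variance of a Bernoulli variable is fully determined by its mean, namely p(1 - p). The following base R sketch simulates responses for two hypothetical accuracy levels (the values in p_true and the seed are assumptions chosen for illustration, not estimates from the Stroop data) and compares the empirical variance with this formula.

```r
# Sketch: the variance of 0/1 data is tied to its mean, unlike for normal data.
set.seed(1)                              # arbitrary seed, for reproducibility only
p_true <- c(0.98, 0.89)                  # hypothetical cell accuracies

# Simulate 100,000 Bernoulli responses per accuracy level
sim <- lapply(p_true, function(p) rbinom(1e5, size = 1, prob = p))

emp_mean   <- sapply(sim, mean)          # empirical proportion correct
emp_var    <- sapply(sim, var)           # empirical variance
theory_var <- p_true * (1 - p_true)      # Bernoulli variance p(1 - p)

round(rbind(emp_mean, emp_var, theory_var), 4)
```

The closer accuracy is to 1, the smaller the variance, so cells with different accuracy levels cannot share a common residual variance.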
We set up the model using aov_ez and specify both factors, congruency and condition, correctly as within factors.
e1_anova <- aov_ez(
  id = "pno",
  dv = "acc",
  data = stroop_e1,
  within = c("congruency", "condition")
)
## Warning: More than one observation per design cell, aggregating data using `fun_aggregate = mean`.
## To turn off this warning, pass `fun_aggregate = mean` explicitly.
Fitting this model produces a warning message telling us what we already know. There is more than one observation per participant and cell of the design (i.e., combination of our two factors) and the data is averaged before calculating the ANOVA.
Note that if we had not removed the NAs from the data, this call would have failed with an error as all participants have NAs so all would have been removed. In that case, we could have added na.rm = TRUE to the aov_ez() call to ignore the NAs when aggregating the data.
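To make the aggregation step concrete, here is a small base R sketch of what averaging within each participant and design cell amounts to, and of why a single NA matters. The toy data frame and its values are hypothetical, and tapply() is used only for illustration; it is not what aov_ez() calls internally.

```r
# Toy trial-wise data: 2 hypothetical participants x 2 congruency levels x 3 trials
toy <- data.frame(
  pno        = rep(c("p1", "p2"), each = 6),
  congruency = rep(rep(c("congruent", "incongruent"), each = 3), times = 2),
  acc        = c(1, 1, 1,  1, 0, 1,  1, 1, NA,  0, 0, 1)
)

# One mean per participant and cell; a cell containing an NA becomes NA
cell_means_na <- with(toy, tapply(acc, list(pno, congruency), mean))

# With na.rm = TRUE the NA is ignored and the cell mean is retained
cell_means <- with(toy, tapply(acc, list(pno, congruency), mean, na.rm = TRUE))

cell_means_na
cell_means
```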
If we take a look at the ANOVA table, we see a clear main effect of congruency, a somewhat weaker main effect of condition, but no interaction.
e1_anova
## Anova Table (Type 3 tests)
##
## Response: acc
## Effect df MSE F ges p.value
## 1 congruency 1, 252 0.01 242.95 *** .247 <.001
## 2 condition 1, 252 0.00 5.43 * .003 .021
## 3 congruency:condition 1, 252 0.00 0.10 <.001 .757
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
The observed pattern of effects is in line with most recent failures to replicate ego depletion. The main effect of condition suggests there may be some general effect, such as more errors after a demanding task, but not the predicted interaction of condition with the congruency effect. The original prediction of ego-depletion is that it reduces self-control, thus resulting in attenuated Stroop effects after depletion. This pattern would result in a significant interaction.
We can look at the effects in turn using emmeans. We begin by looking at the Stroop effect, that is the effect of congruency.
emmeans(e1_anova, "congruency")
## congruency emmean SE df lower.CL upper.CL
## congruent 0.9763 0.001201 252 0.9739 0.9787
## incongruent 0.8867 0.005876 252 0.8752 0.8983
##
## Results are averaged over the levels of: condition
## Confidence level used: 0.95
We can see the usual Stroop effect. Accuracy is higher for congruent compared to incongruent items.
We also look at the effect of condition.
emmeans(e1_anova, "condition")
## condition emmean SE df lower.CL upper.CL
## control 0.936 0.00345 252 0.929 0.943
## deplete 0.927 0.00375 252 0.920 0.935
##
## Results are averaged over the levels of: congruency
## Confidence level used: 0.95
We see that accuracy is roughly 1 percentage point lower in the ego-depletion compared to the control condition.
We can also plot both effects.
A model with a more appropriate conditional distribution (instead of the normal distribution used by ANOVA) is a generalized linear model with a binomial family. Because we have repeated measures both within participants within each design cell and across cells (i.e., we have within-subject factors) we need to use a mixed model. Thus, we have to employ a generalized linear mixed model (GLMM) with a binomial family. Here, we use the canonical link function, the logit (an alternative would be probit).
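As a brief reminder of what the logit link does, the following base R sketch maps a few example probabilities (chosen arbitrarily) to the logit scale and back. The functions qlogis() and plogis() from the stats package implement the logit and its inverse.

```r
p <- c(0.5, 0.9, 0.98)                 # example success probabilities

# Logit link: probability -> log odds (the scale of the linear predictor)
eta <- qlogis(p)
all.equal(eta, log(p / (1 - p)))       # same as the textbook definition

# Inverse link: log odds -> probability
p_back <- plogis(eta)
all.equal(p_back, p)
```

The model is linear on the logit scale, so equal differences in the linear predictor correspond to unequal differences in probability, especially near 0 or 1.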
In our call to mixed we indicate the fact that we want to estimate a binomial GLMM (in contrast to a regular LMM) by passing the correct family, family = binomial. We could further specify the link function (e.g., family = binomial(link = "probit")), but because we use the canonical link here, this is not necessary.
As for all mixed models, we should begin our analysis with the “maximal random effect structure justified by the design” (Barr et al., 2013), the maximal model. Because all factors vary within-subjects in the present case and we only consider one random-effects grouping factor (participant), the maximal random effect structure involves by-participant random intercepts as well as by-participant random slopes for factors congruency, condition, and their interaction, as well as the correlation among the random effect parameters.
If you need an introduction to the topic of the random effect structure, consider reading our introduction to mixed models (Singmann & Kellen, 2019). A worked example of how to reduce the maximal model in case of convergence warnings is provided in the mixed model vignette.
Another important decision when fitting a mixed model with afex::mixed is the choice of the method for testing model terms. The default method is the Kenward-Roger approximation. However, this is not possible for GLMMs (i.e., it only applies to LMMs). For GLMMs we only have two methods available, likelihood ratio tests (method = "LRT") and parametric bootstrap (method = "PB"). Parametric bootstrap is generally the preferable procedure, especially in settings with small N, specifically low numbers of levels for the random effects grouping factor (i.e., few participants in the present case). However, parametric bootstrap can be prohibitively time consuming as it requires refitting the model many times (at least 1000 times per model term/effect is suggested). The main factor influencing the time is therefore the complexity of the random effect structure. Even with only a moderately complex structure, such as in the present case, it can take days (even though computations can be distributed on a multi-core machine as shown below).
One rule of thumb (for which I have never seen any actual data) is that with more than around 50 levels for each random effects grouping factor, likelihood ratio tests are probably okay and parametric bootstrap not necessary. Luckily, in the present case we have over 250 participants (i.e., way more than 50), so we decide to use likelihood ratio tests as a method for testing model terms (i.e., method = "LRT").
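For readers unfamiliar with likelihood ratio tests, the principle can be sketched with an ordinary binomial GLM in base R: a full model is compared with a restricted model that omits the term of interest, and twice the difference in log-likelihoods is referred to a chi-square distribution. The simulated data, the factor x, and the probabilities below are assumptions for illustration; the sketch contains no random effects and is not the code that mixed runs.

```r
set.seed(2)                                  # arbitrary seed
n <- 400
x <- gl(2, n / 2, labels = c("a", "b"))      # hypothetical two-level factor
y <- rbinom(n, size = 1, prob = ifelse(x == "a", 0.95, 0.85))

full    <- glm(y ~ x, family = binomial)     # model including the term
reduced <- glm(y ~ 1, family = binomial)     # model with the term removed

# Likelihood ratio statistic and its chi-square p-value
chisq <- as.numeric(2 * (logLik(full) - logLik(reduced)))
df    <- attr(logLik(full), "df") - attr(logLik(reduced), "df")
p_val <- pchisq(chisq, df = df, lower.tail = FALSE)
c(Chisq = chisq, df = df, p.value = p_val)

# The same comparison via anova()
anova(reduced, full, test = "Chisq")
```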
There are two ways to set up such models in afex::mixed (lme4 provides a third variant that is not possible in afex::mixed). Before describing both ways, note the somewhat surprising fact that the second way, based on the aggregated data, is considerably faster than the first way. In the present case the first way takes more than an hour on my laptop whereas the second way takes only around one minute!
To understand the two different ways, it might help to recall some basics about the binomial distribution. It is defined by the number of observations and the number of successes and has one parameter, the success probability. Because we want to estimate the parameter of the binomial distribution, the success probability, we have to pass two pieces of information to the mixed function: the number of successes and the total number of trials. The easiest way to do so with the current data is by directly passing the accuracy column acc. It only consists of zeros and ones so the total number of observations is the number of trials, and the number of ones is the number of successes. The following code sets up the maximal model as discussed above using this approach. Note that fitting this model takes more than an hour (likely several hours).
e1_mixed1_v1 <- mixed(
  acc ~ congruency*condition + (congruency*condition|pno),
  data = stroop_e1,
  method = "LRT",
  family = binomial
)
Before looking at the results, let us first discuss the alternative way of specifying the model. Note that the second way is usually considerably faster. This way consists of two steps. We first need to aggregate the data within each cell of the design and participant by calculating the proportion of correct responses and the number of trials for each cell. Then we can pass these two pieces of information to mixed, the proportion of successes as the dependent variable and the total number of trials as a weight. These two pieces of information again provide all information required for the binomial distribution. However, note that it is important that the proportion times the weight yields an integer, the actual number of successes, which cannot have a decimal part.
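The equivalence of the two specifications can be checked on a small scale with an ordinary binomial GLM in base R. The sketch below uses simulated data with hypothetical columns id, cond, and acc and omits random effects entirely; it only illustrates that trial-wise 0/1 responses and proportions with the number of trials as weights carry the same information for the binomial likelihood.

```r
set.seed(3)                                           # arbitrary seed
trials <- data.frame(
  id   = rep(1:20, each = 30),                        # hypothetical units
  cond = rep(rep(c("a", "b"), each = 15), times = 20) # hypothetical factor
)
trials$acc <- rbinom(nrow(trials), 1, ifelse(trials$cond == "a", 0.95, 0.85))

# Way 1: trial-wise 0/1 responses
fit_trial <- glm(acc ~ cond, data = trials, family = binomial)

# Way 2: proportion correct per id and cell, number of trials as weights
agg   <- aggregate(acc ~ id + cond, data = trials, FUN = mean)
agg$n <- aggregate(acc ~ id + cond, data = trials, FUN = length)$acc
fit_agg <- glm(acc ~ cond, data = agg, family = binomial, weights = n)

# Proportion times weight recovers the integer number of successes
successes <- agg$acc * agg$n

all.equal(coef(fit_trial), coef(fit_agg), tolerance = 1e-6)
```

Because the aggregated data has far fewer rows, each evaluation of the likelihood is cheaper, which is in line with the speed difference described above.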
First, we need to aggregate the data. For this, we can use the following dplyr syntax.
stroop_e1_agg <- stroop_e1 %>%
  group_by(condition, congruency, pno) %>%
  summarise(acc = mean(acc),
            n = n())
Then, we can set up the model. Note that we need to pass the total number of trials as the weight argument here. mixed now supports the exact same syntax as glmer so we can simply pass the unquoted name of the column holding the weights, weight = n.
The call to mixed is then pretty much the same as above; the only differences are that we have to use the data.frame with the aggregated data, potentially a new name for the response variable (here I decided to use the same name, acc, for the aggregated accuracy), and the weight argument.
Before looking at the results, we see that fitting produces quite a few warnings. These are reproduced below in the way we would also get them when simply typing in the object name (e.g., e1_mixed1_v2). The following are the warnings for the first variant, e1_mixed1_v1.
## Warning: lme4 reported (at least) the following warnings for 'full':
## * Model failed to converge with max|grad| = 0.00378454 (tol = 0.002, component 1)
## Warning: lme4 reported (at least) the following warnings for 'congruency':
## * Model failed to converge with max|grad| = 0.146845 (tol = 0.002, component 1)
## Warning: lme4 reported (at least) the following warnings for 'condition':
## * Model failed to converge with max|grad| = 0.0860405 (tol = 0.002, component 1)
## Warning: lme4 reported (at least) the following warnings for 'congruency:condition':
## * Model failed to converge with max|grad| = 0.0181415 (tol = 0.002, component 1)
And these are the warnings for the second variant, e1_mixed1_v2.
## Warning: lme4 reported (at least) the following warnings for 'full':
## * Model failed to converge with max|grad| = 0.0142323 (tol = 0.002, component 1)
## Warning: lme4 reported (at least) the following warnings for 'congruency':
## * Model failed to converge with max|grad| = 0.134205 (tol = 0.002, component 1)
## * Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?
## Warning: lme4 reported (at least) the following warnings for 'condition':
## * Model failed to converge with max|grad| = 0.120362 (tol = 0.002, component 1)
## * Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?
## Warning: lme4 reported (at least) the following warnings for 'congruency:condition':
## * Model failed to converge with max|grad| = 0.0534769 (tol = 0.002, component 1)
Both variants show quite a few warnings. The warnings are given separately for the different models estimated (i.e., full model with all terms and one model for each term removed from the full model). However, none of the warnings is the dreaded “singular fit” warning that clearly suggests an over-parameterised model. We therefore will be working with this random effect structure going forward.
Here, we have two different warnings suggesting some numerical issue at the maximum likelihood estimate. One warning tells us that for the result returned from the fitting (i.e., optimization) algorithm the absolute gradient of the deviance function is not small (i.e., there might be a better global optimum that was not discovered). Another warning relates to the size of the eigenvalue which suggests something similar, there might be better solutions in the parameter space. More on these warnings can be found in the lme4 help at ?troubleshooting (or help("troubleshooting", package = "lme4")).
Given that both types of warnings are not too dramatic, one way to address them is by trying different optimizers. mixed makes this easy via the all_fit argument: if TRUE, several optimizers are applied to the output from the regular run (make sure to install packages optimx and dfoptim to use all available optimizers).
e1_mixed1_v2_allfit <- mixed(
  acc ~ congruency*condition + (congruency*condition|pno),
  data = stroop_e1_agg,
  method = "LRT",
  family = binomial,
  weight = n,
  all_fit = TRUE
)
## Warning: lme4 reported (at least) the following warnings for 'full':
## * Model failed to converge with max|grad| = 0.123381 (tol = 0.002, component 1)
## * Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?
## Warning: lme4 reported (at least) the following warnings for 'congruency':
## * Model failed to converge with max|grad| = 0.137517 (tol = 0.002, component 1)
## * Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?
## Warning: lme4 reported (at least) the following warnings for 'condition':
## * Model failed to converge with max|grad| = 0.128127 (tol = 0.002, component 1)
## * Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?
## Warning: lme4 reported (at least) the following warnings for 'congruency:condition':
## * Model failed to converge with max|grad| = 0.130338 (tol = 0.002, component 1)
## * Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?
Somewhat surprisingly the fit using all optimizers produces more convergence warnings than the previous fit. However, all warnings are of the same type as for the other variants.
Another way to deal with the warnings is by fitting a variant of the model with reduced random effect structure. Hopefully this model will not produce the warning, but similar results. If warnings are more severe than the current warning, fitting a reduced model is surely indicated. Critical warnings that make it necessary to reduce the model are for example a singular fit warning or a warning that the algorithm did not converge within the allowed number of iterations. Nevertheless, if this were a real data analysis for a manuscript it would probably still make sense to reduce the random effect structure, beginning with the correlation among the random effect parameters, until the warnings are gone. Then, we would compare the final (i.e., reduced) model with the maximal model to ensure that the pattern of significant and non-significant effects is the same in both cases (if not, this needs to be reported transparently). An example of such an iterative reduction of the random effect structure is given in the mixed model vignette.
We can now print the model to get the resulting ANOVA table, that is the table of model terms (i.e., main effects and interactions), as for the regular ANOVA model. Doing so would normally also reproduce the warnings shown above, but these are suppressed here.
We show the results of the three variants in turn.
## Mixed Model Anova Table (Type 3 tests, LRT-method)
##
## Model: acc ~ congruency * condition + (congruency * condition | pno)
## Data: stroop_e1
## Df full model: 14
## Effect df Chisq p.value
## 1 congruency 1 321.00 *** <.001
## 2 condition 1 11.08 *** <.001
## 3 congruency:condition 1 4.23 * .040
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1
## Mixed Model Anova Table (Type 3 tests, LRT-method)
##
## Model: acc ~ congruency * condition + (congruency * condition | pno)
## Data: stroop_e1_agg
## Df full model: 14
## Effect df Chisq p.value
## 1 congruency 1 321.05 *** <.001
## 2 condition 1 11.10 *** <.001
## 3 congruency:condition 1 4.23 * .040
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1
## Mixed Model Anova Table (Type 3 tests, LRT-method)
##
## Model: acc ~ congruency * condition + (congruency * condition | pno)
## Data: stroop_e1_agg
## Df full model: 14
## Effect df Chisq p.value
## 1 congruency 1 320.96 *** <.001
## 2 condition 1 11.06 *** <.001
## 3 congruency:condition 1 4.23 * .040
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1
As can be seen, the results of the three different variants are extremely similar. Especially the fact that the results from variant 2 and the variant with all_fit = TRUE are pretty much the same suggests that we can have some confidence in the results.
As for the ANOVA we see significant main effects of congruency and condition. However, we also see a significant congruency by condition interaction. As a reminder, this interaction is the precondition of the central ego depletion prediction.
Let us first look at the main effects as for the ANOVA model. For this, we will always use the model with all_fit = TRUE, e1_mixed1_v2_allfit. For easier interpretability, we also always set type = "response" in the call to emmeans. This provides estimated marginal means on the response scale (i.e., in probability units after applying the inverse link function). In case of the default (i.e., type = "link") marginal means are given on the linear scale, which in this case is the logit scale.
## NOTE: Results may be misleading due to involvement in interactions
## congruency prob SE df asymp.LCL asymp.UCL
## congruent 0.982434 1.44579e-05 Inf 0.982406 0.982463
## incongruent 0.909378 6.89756e-05 Inf 0.909242 0.909513
##
## Results are averaged over the levels of: condition
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
We again see the Stroop effect. Accuracy is higher for congruent compared to incongruent items.
We also look at the effect of condition.
## NOTE: Results may be misleading due to involvement in interactions
## condition prob SE df asymp.LCL asymp.UCL
## control 0.96299 2.9834e-05 Inf 0.96293 0.96305
## deplete 0.95570 3.5437e-05 Inf 0.95563 0.95577
##
## Results are averaged over the levels of: congruency
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
We see that accuracy is roughly 1 percentage point lower in the ego-depletion compared to the control condition.
We can also plot both effects as above. We essentially only have to replace the model name in the calls to afex_plot. However, we also remove the error = "within" to show model-based error bars.
plot_grid(
  afex_plot(e1_mixed1_v2_allfit, "congruency",
            data_geom = geom_quasirandom, data_alpha = 0.3) +
    coord_cartesian(ylim = c(0.25, 1)),
  afex_plot(e1_mixed1_v2_allfit, "condition",
            data_geom = geom_quasirandom, data_alpha = 0.3) +
    coord_cartesian(ylim = c(0.25, 1))
)
## Aggregating data over: pno
## NOTE: Results may be misleading due to involvement in interactions
## Aggregating data over: pno
## NOTE: Results may be misleading due to involvement in interactions
Finally, we can also have a look at the interaction. One way to do so is by looking at the reference grid of the two variables.
## congruency condition prob SE df asymp.LCL asymp.UCL
## congruent control 0.984783 1.77534e-05 Inf 0.984748 0.984817
## incongruent control 0.912742 9.42464e-05 Inf 0.912558 0.912927
## congruent deplete 0.979731 2.35033e-05 Inf 0.979685 0.979777
## incongruent deplete 0.905897 1.00894e-04 Inf 0.905699 0.906094
##
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
Alternatively, we can look at the effect of congruency conditional on condition. As will be shown below, this has some benefits. And because we will use this emmeans object later, we assign it to emm_inter_1.
emm_inter_1 <- emmeans(e1_mixed1_v2_allfit, "congruency",
                       by = "condition", type = "response")
emm_inter_1
## condition = control:
## congruency prob SE df asymp.LCL asymp.UCL
## congruent 0.984783 1.77534e-05 Inf 0.984748 0.984817
## incongruent 0.912742 9.42464e-05 Inf 0.912558 0.912927
##
## condition = deplete:
## congruency prob SE df asymp.LCL asymp.UCL
## congruent 0.979731 2.35033e-05 Inf 0.979685 0.979777
## incongruent 0.905897 1.00894e-04 Inf 0.905699 0.906094
##
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
With this object, it is now particularly easy to see if there are indeed differences in the Stroop effects across conditions. We can simply use the pairs function which will only calculate pairwise comparisons for each level of the by factor condition. In line with the significant interaction, this shows that there is evidence for a reduced Stroop effect in the deplete condition. The odds ratio is around 5 in the deplete condition and over 6 in the control condition.
## condition = control:
## contrast odds.ratio SE df null z.ratio p.value
## congruent / incongruent 6.187 0.010362 Inf 1 1088.115 <.0001
##
## condition = deplete:
## contrast odds.ratio SE df null z.ratio p.value
## congruent / incongruent 5.021 0.008404 Inf 1 964.109 <.0001
##
## Tests are performed on the log odds ratio scale
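The odds ratios in this output can be recovered by hand from the estimated marginal probabilities printed for emm_inter_1 above. The following base R sketch copies those four rounded probabilities; the helper odds() and the object names are introduced here only for illustration.

```r
# Odds corresponding to a probability
odds <- function(p) p / (1 - p)

# Estimated marginal probabilities, copied (rounded) from the emmeans output
prob <- list(
  control = c(congruent = 0.984783, incongruent = 0.912742),
  deplete = c(congruent = 0.979731, incongruent = 0.905897)
)

# Odds ratio congruent / incongruent within each condition
or <- sapply(prob, function(p) odds(p[["congruent"]]) / odds(p[["incongruent"]]))
round(or, 2)
```

Up to rounding of the inputs, the two values agree with the odds.ratio column, which shows that the contrast is a ratio of odds rather than a difference in probabilities.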
We can of course also plot the interaction. We use violin plots for the data to make this plot somewhat appealing.
## Aggregating data over: pno
For reference, we can also make the plot based on the model using variant 1 (i.e., based on trial-wise non-aggregated data) which produces the same figure.
## Aggregating data over: pno
The stroop data set contains all four experiments of Lin et al. (2020). So far we only looked at Experiment 1. Here, we want to look at Experiments 1 and 2 to show some additional functionality of analysing accuracy data with GLMMs. Let us begin by preparing the data.
## extract data from experiments 1 and 2 and remove NAs
stroop_e12 <- stroop %>%
  filter(!is.na(acc)) %>%
  filter(study %in% c("1", "2")) %>%
  droplevels()
We then have a look at the number of participants per experiment. As can be seen below, there is a clear imbalance with Experiment 1 having a lot more participants than Experiment 2. We will see below what we can do with this information.
stroop_e12 %>%
  group_by(study) %>%
  summarise(n = length(unique(pno)))
## # A tibble: 2 × 2
## study n
## <fct> <int>
## 1 1 253
## 2 2 132
But first, we aggregate the data for analysing it using the second variant introduced above as it is dramatically faster.
stroop_e12_agg <- stroop_e12 %>%
  group_by(study, condition, congruency, pno) %>%
  summarise(acc = mean(acc),
            n = n())
We are now ready to fit the model. However, because we now have to fit 8 models in total and we also have more data at hand, fitting will take quite a bit longer even when using the aggregated data. This is especially so because, based on our experience with the data from Experiment 1, it makes sense to set all_fit = TRUE here.
Therefore, it makes sense to reduce our fitting time by distributing the estimation across the available CPU cores using the parallel package support provided by mixed. Note that this simply distributes the different models across cores (i.e., each individual fit is still run on a single core). For this, we need to set up a cluster, cl, and then pass this cluster to mixed.
library("parallel")
nc <- detectCores() # number of cores
cl <- makeCluster(rep("localhost", nc)) # make cluster
We can then fit the model using the new data in a similar manner as above. The main changes are that we add *study to the fixed effects part of the model formula which estimates main effects and all interactions with study. We do not estimate any random slopes for study because participants are in only one study (i.e., study is a between-subjects factor). We also enable multi-core estimation by setting cl = cl.
The default type argument when using mixed (and the other afex functions) is 3, which estimates so-called Type III sums of squares tests of model terms. There is a somewhat heated discussion on these topics in the literature which I do not want to rehash here (more can be found in our chapter). In short, the distinction between Type III (default in afex) and Type II (recommended by some vocal statisticians) is about how to deal with imbalance (i.e., different group sizes). Type III sums of squares assume that the imbalance is random and estimate a model in which all groups are assumed to have equal sizes. In contrast, using Type II sums of squares the differences in group sizes are assumed to be meaningful (e.g., as a consequence of different group sizes in the environment) and the model is set up such that the differences in group sizes are represented.
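As a rough intuition only (this is not how either type of test is computed), the distinction resembles the difference between an equal-weight average and a size-weighted average across groups. The base R sketch below uses the participant counts of the two studies reported above together with made-up per-study effect values; the objects n_pp and effect are hypothetical.

```r
n_pp   <- c(study1 = 253, study2 = 132)     # participants per study (from the data)
effect <- c(study1 = 0.20, study2 = 0.05)   # hypothetical per-study effects

# Equal weighting: every study counts the same, regardless of its size
equal_weight <- mean(effect)

# Size weighting: the larger study has more influence on the average
size_weight <- weighted.mean(effect, w = n_pp)

c(equal = equal_weight, weighted = size_weight)
```

With these made-up values the size-weighted average lies closer to the value of the larger study, which is the kind of difference that can arise between the two types of tests when groups are imbalanced.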
Remember that the first experiment (study 1) had a lot more participants than the second experiment (study 2). However, the default Type III tests treat both studies as having the same sample size. In this case it might therefore make sense to also look at the results with Type II tests. For this, we simply need to add type = 2 (or equivalently type = "II") to the call as shown below.
e12_mixed1_t2 <- mixed(
  acc ~ congruency*condition*study + (congruency*condition|pno),
  data = stroop_e12_agg,
  method = "LRT",
  family = binomial,
  weight = n,
  all_fit = TRUE,
  cl = cl,
  type = 2
)
Fitting both models shows a number of warnings similar to the previous model which we do not reproduce here. Because we are done with fitting models in parallel, we stop the cluster.
Let us instead take a look at the results.
We begin with the Type III model.
## Mixed Model Anova Table (Type 3 tests, LRT-method)
##
## Model: acc ~ congruency * condition * study + (congruency * condition |
## Model: pno)
## Data: stroop_e12_agg
## Df full model: 18
## Effect df Chisq p.value
## 1 congruency 1 453.65 *** <.001
## 2 condition 1 10.43 ** .001
## 3 study 1 0.14 .705
## 4 congruency:condition 1 1.72 .190
## 5 congruency:study 1 0.33 .566
## 6 condition:study 1 1.60 .205
## 7 congruency:condition:study 1 1.46 .226
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1
The results are now more in line with the results from the ANOVA analysis. We only see significant main effects of congruency and condition, but no congruency by condition interaction. Furthermore, no effect involving study is significant.
Let us now take a look at the Type II results.
## Mixed Model Anova Table (Type 2 tests, LRT-method)
##
## Model: acc ~ congruency * condition * study + (congruency * condition |
## Model: pno)
## Data: stroop_e12_agg
## Df full model(s): 14, 14, 14, 17, 17, 17, 18
## Effect df Chisq p.value
## 1 congruency 1 482.39 *** <.001
## 2 condition 1 11.57 *** <.001
## 3 study 1 0.21 .646
## 4 congruency:condition 1 2.86 + .091
## 5 congruency:study 1 0.28 .598
## 6 condition:study 1 1.42 .234
## 7 congruency:condition:study 1 1.46 .226
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1
These look quite similar, with one difference: the congruency by condition interaction is still not significant, but its p-value is now below .1 (p = .091).
Taken together, the results somewhat suggest that the congruency by condition interaction is a pattern mainly found in (the larger) Experiment 1, but not in Experiment 2. In the Type III analysis, in which both studies are weighted equally, the interaction therefore does not reach significance. In contrast, in the Type II analysis, in which the additional participants of Experiment 1 have a larger influence on the interaction, the p-value drops and is nearer the .05 cut-off. However, the difference in the interaction between the experiments is also not too large, as the three-way interaction of congruency by condition by study is clearly not significant (and as the highest-order effect, its test has to be the same for both types of tests, which the two tables confirm: Chisq = 1.46, p = .226).
We can now of course also plot the data. Let us plot the congruency by condition interaction as before. We begin with the plot of the Type III model.
## Aggregating data over: pno
## NOTE: Results may be misleading due to involvement in interactions
We then plot the interaction of the Type II model.
## Aggregating data over: pno
## emmeans are based on full model which includes all effects.
## NOTE: Results may be misleading due to involvement in interactions
Both look pretty much the same and actually are the same. The reason
for this is also given as a message when producing the second plot. When
using mixed Type II models, all follow-up tests (which include plotting
via afex_plot) are based on the full model. However, Type II model tests
are not all based on the full model. Rather, for tests of lower-order
effects, higher-order effects are not part of the comparison (i.e., tests
of two-way interactions such as the congruency by condition interaction
are compared against a reference model that only includes all two-way
interactions and not the three-way interaction).
We can also see this when comparing the corresponding emmeans for both models:
## NOTE: Results may be misleading due to involvement in interactions
## congruency condition prob SE df asymp.LCL asymp.UCL
## congruent control 0.983752 1.54389e-05 Inf 0.983722 0.983782
## incongruent control 0.913573 7.61899e-05 Inf 0.913423 0.913722
## congruent deplete 0.980078 1.88459e-05 Inf 0.980040 0.980114
## incongruent deplete 0.905551 8.25395e-05 Inf 0.905390 0.905713
##
## Results are averaged over the levels of: study
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
## NOTE: Results may be misleading due to involvement in interactions
## congruency condition prob SE df asymp.LCL asymp.UCL
## congruent control 0.983752 1.54389e-05 Inf 0.983722 0.983782
## incongruent control 0.913573 7.61899e-05 Inf 0.913423 0.913722
## congruent deplete 0.980078 1.88459e-05 Inf 0.980040 0.980114
## incongruent deplete 0.905551 8.25395e-05 Inf 0.905390 0.905713
##
## Results are averaged over the levels of: study
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
We again see that they are identical. Thus, the follow-up results obtained from the Type II object do not actually reflect the Type II tests.
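To make the distinction concrete, the following is a minimal sketch of the kind of model comparison that underlies a Type II test of a two-way interaction. It is a simplified stand-in under stated assumptions: it uses base R's glm on simulated data without random effects instead of afex::mixed, and the data frame d and the objects ref, red, and lrt are hypothetical names introduced only for this illustration.

```r
# Simulated stand-in data (hypothetical); one binary response per row.
set.seed(123)
d <- expand.grid(
  congruency = c("congruent", "incongruent"),
  condition  = c("control", "deplete"),
  study      = c("1", "2"),
  rep        = 1:100
)
p <- ifelse(d$congruency == "congruent", 0.95, 0.85)
d$acc <- rbinom(nrow(d), size = 1, prob = p)

# Reference model for the Type II test of a two-way interaction:
# all main effects and all two-way interactions, but NOT the three-way one.
ref <- glm(acc ~ (congruency + condition + study)^2,
           family = binomial, data = d)

# Restricted model: the reference model without the tested interaction.
red <- update(ref, . ~ . - congruency:condition)

# Likelihood ratio test with 1 degree of freedom.
lrt <- anova(red, ref, test = "Chisq")
lrt
```

The point of the sketch is only that the three-way interaction appears in neither of the two models being compared, whereas follow-up tests on a Type II mixed object use the full model that does include it.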
A more appropriate way to look at this two-way interaction would be
to use a model that reflects the Type II tests; that is, a model without
the three-way interaction. A variant of this model is part of the
e12_mixed1_t2 object. However, because of the way afex::mixed creates
the set of tests, this model is parameterized in a different way so we
cannot pass it to emmeans for tests. Therefore, we have to refit the
model without the three-way interaction. We do so by changing the fixed
effects part of the formula to (congruency+condition+study)^2, which
means all main effects and up to two-way interactions.
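What this formula shorthand expands to can be checked with base R alone. The sketch below only inspects the formulas and does not fit any model; the object names f_red, f_full, labels_red, and labels_full are hypothetical and used only here.

```r
# Fixed-effects part of the reduced and the full model formula.
f_red  <- acc ~ (congruency + condition + study)^2
f_full <- acc ~ congruency * condition * study

# Expand both formulas into their individual model terms.
labels_red  <- attr(terms(f_red), "term.labels")
labels_full <- attr(terms(f_full), "term.labels")

labels_red

# The only term that is in the full but not in the reduced formula
# is the three-way interaction.
setdiff(labels_full, labels_red)
```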
e12_mixed1_t2_red <- mixed(
  acc ~ (congruency+condition+study)^2 + (congruency*condition|pno),
  data = stroop_e12_agg,
  method = "LRT",
  family = binomial,
  weight = n,
  all_fit = TRUE,
  cl = cl,  # needs an active cluster; recreate cl if it was stopped above
  type = 2
)
## Mixed Model Anova Table (Type 2 tests, LRT-method)
##
## Model: acc ~ (congruency + condition + study)^2 + (congruency * condition |
## Model: pno)
## Data: stroop_e12_agg
## Df full model(s): 14, 14, 14, 17, 17, 17
## Effect df Chisq p.value
## 1 congruency 1 482.39 *** <.001
## 2 condition 1 11.57 *** <.001
## 3 study 1 0.21 .646
## 4 congruency:condition 1 2.86 + .091
## 5 congruency:study 1 0.28 .598
## 6 condition:study 1 1.42 .234
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1
We first have a look at the resulting ANOVA table shown above (note, we
suppress the warnings again here). As expected, the Type II tests are the
same as in e12_mixed1_t2, because they are based on the exact same
models.
We then take a look at the marginal means for the two-way interaction of congruency by condition.
## emmeans are based on full model which includes all effects.
## congruency condition prob SE df asymp.LCL asymp.UCL
## congruent control 0.983835 1.51410e-05 Inf 0.983806 0.983865
## incongruent control 0.912916 7.55995e-05 Inf 0.912768 0.913064
## congruent deplete 0.979895 1.87376e-05 Inf 0.979858 0.979931
## incongruent deplete 0.905955 8.10300e-05 Inf 0.905796 0.906114
##
## Results are averaged over the levels of: study
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
We can see that these are now a bit different from the previous ones, but not by a lot. A better way to see the difference is to use the same approach as above and look at the estimated Stroop effects for each condition.
For the correct Type II model these are given by the following call:
## emmeans are based on full model which includes all effects.
## condition = control:
## contrast odds.ratio SE df null z.ratio p.value
## congruent / incongruent 5.806 0.007814 Inf 1 1306.888 <.0001
##
## condition = deplete:
## contrast odds.ratio SE df null z.ratio p.value
## congruent / incongruent 5.059 0.006805 Inf 1 1205.405 <.0001
##
## Results are averaged over the levels of: study
## Tests are performed on the log odds ratio scale
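The odds ratios in this output can be reproduced by hand from the marginal probabilities of the refitted model reported further above. The following sketch hard-codes those four printed probabilities, so it only agrees with the output up to rounding; the helper odds() and the other object names are hypothetical and not part of afex or emmeans.

```r
# Marginal probabilities copied from the emmeans output of the refitted
# model (rounded to six decimals there).
p_control <- c(congruent = 0.983835, incongruent = 0.912916)
p_deplete <- c(congruent = 0.979895, incongruent = 0.905955)

# Convert a probability into odds.
odds <- function(p) p / (1 - p)

# Stroop effect as odds ratio (congruent / incongruent) per condition.
or_control <- unname(odds(p_control["congruent"]) / odds(p_control["incongruent"]))
or_deplete <- unname(odds(p_deplete["congruent"]) / odds(p_deplete["incongruent"]))

round(c(control = or_control, deplete = or_deplete), 3)
```

The two values should come out close to the odds ratios of 5.806 and 5.059 printed above, which shows that the contrasts are simply ratios of the odds implied by the marginal means.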
For the wrong Type II model these are given below:
## emmeans are based on full model which includes all effects.
## NOTE: Results may be misleading due to involvement in interactions
## condition = control:
## contrast odds.ratio SE df null z.ratio p.value
## congruent / incongruent 5.728 0.007817 Inf 1 1278.953 <.0001
##
## condition = deplete:
## contrast odds.ratio SE df null z.ratio p.value
## congruent / incongruent 5.131 0.006999 Inf 1 1198.815 <.0001
##
## Results are averaged over the levels of: study
## Tests are performed on the log odds ratio scale
We can see that, as expected, the magnitude of the difference between
the Stroop effects is smaller in e12_mixed1_t2 (5.73 vs. 5.13) than in
e12_mixed1_t2_red (5.81 vs. 5.06).
One way to approximate the latter behaviour without actually
refitting the model is by passing submodel = "minimal" (which in the
present case is identical to submodel = ~congruency*condition, see also
the corresponding emmeans vignette). This does not produce the exact
Type II marginal means obtained when actually refitting the model, but
it at least approximates them.
emmeans(e12_mixed1_t2, "congruency", by = "condition", type = "response",
        submodel = "minimal") %>%
  pairs()
## emmeans are based on full model which includes all effects.
## NOTE: Results may be misleading due to involvement in interactions
## condition = control:
## contrast odds.ratio SE df null z.ratio p.value
## congruent / incongruent 5.828 0.008192 Inf 1 1253.988 <.0001
##
## condition = deplete:
## contrast odds.ratio SE df null z.ratio p.value
## congruent / incongruent 5.110 0.007365 Inf 1 1131.878 <.0001
##
## Results are averaged over the levels of: study
## Tests are performed on the log odds ratio scale
Finally, we can also use this model to plot the two-way interaction.
## Aggregating data over: pno
## emmeans are based on full model which includes all effects.