Why You Should Always Include a Random Slope for the Lower-Level Variable Involved in a Cross-Level Interaction

Mixed-effects multilevel models are often used to investigate cross-level interactions, a specific type of context effect that may be understood as an upper-level variable moderating the association between a lower-level predictor and the outcome. We argue that multilevel models involving cross-level interactions should always include random slopes on the lower-level components of those interactions. Failure to do so will usually result in severely anti-conservative statistical inference. We illustrate the problem with extensive Monte Carlo simulations and examine its practical relevance by studying 30 prototypical cross-level interactions with European Social Survey data for 28 countries. In these empirical applications, introducing a random slope term reduces the absolute t-ratio of the cross-level interaction term by 31 per cent or more in three quarters of cases, with an average reduction of 42 per cent. Many practitioners seem to be unaware of these issues. Roughly half of the cross-level interaction estimates published in the European Sociological Review between 2011 and 2016 are based on models that omit the crucial random slope term. Detailed analysis of the associated test statistics suggests that many of the estimates would not reach conventional thresholds for statistical significance in correctly specified models that include the random slope. This raises the question of how much robust evidence of cross-level interactions sociology has actually produced over the past decades.

might therefore expect their models to include so-called random slope terms that capture unexplained contextual variation in these relationships (see Equation 3 below for a formal representation). In our example, one would include a random slope to account for cross-country differences in the relationship between education and fear of crime that are not explained by country differences in human development.
A review of published research, however, reveals that in many analyses of cross-level interactions the corresponding random slope is missing. Between 2011 and 2016 the European Sociological Review (ESR) published 28 studies that investigated cross-level interactions using (two-level) mixed effects multilevel models (24 of these studies were country comparisons). More than half of these studies (17/28, or 61%) only specified random intercept models without any random slopes (for details, see the 'Cross-level Interactions in the ESR' section).
Given that empirical practice is so inconsistent, one may wonder whether the inclusion of random slope terms on the lower-level components of cross-level interactions is a matter of taste or whether one approach will usually be preferable to the other. A review of prominent textbooks on multilevel modeling does not provide a clear answer. In one widely read book, Snijders and Bosker (2012) note that 'tested fixed effects' should be accompanied by 'an appropriate error term [...] For cross-level interactions, it is the random slope of the level-one [i.e., lower-level] variable involved in the interaction' (p.104). Other authors take a more ambiguous position. For example, Raudenbush and Bryk's (2002) book includes a section on 'A Model with Nonrandomly Varying Slopes' where they suggest that a model with a cross-level interaction may omit the corresponding random slope if 'little or no variance in the slopes remains to be explained' (p.28). They provide no precise definition of 'little or no variance', however. In their chapter on 'Random-coefficient models', Rabe-Hesketh and Skrondal (2012) generally include random slope terms alongside cross-level interactions, but they also note that the decision whether to do so often seems to be driven by technicalities of the software used: 'Papers using HLM tend to include more cross-level interactions and more random coefficients in the models (because the level-2 [i.e., upper-level] models look odd without residuals) than papers using, for instance, Stata' (p.212f.). This certainly does not sound like an emphatic recommendation to include the random slope for statistical reasons.
In this article, we argue that such a recommendation should be given. We explain and demonstrate that the omission of random slopes in the analysis of cross-level interactions constitutes a specification error that will often have severe consequences for statistical inference about the coefficient of the cross-level interaction term (i.e., in our running example the interaction between education and HDI) and about the main effect of the lower-level predictor involved in the interaction (i.e., the main effect of education). Only the main effect of the upper-level predictor remains unaffected (provided that the model includes a random intercept, as is generally the case in applied research).
In the next section, we briefly introduce mixed effects models with cross-level interactions. In the 'Why Always a Random Slope?' section, we then explain that random slopes capture cluster-driven heteroskedasticity and autocorrelation related to the lower-level component of the cross-level interaction (and hence to the cross-level interaction term itself). As in standard linear regression, ignoring heteroskedasticity and autocorrelation by failing to specify the appropriate random slope term will typically lead to downward bias in standard error estimates.
The two subsequent sections present Monte Carlo simulations and illustrative empirical analyses that support our claims. The simulations show that (correctly specified) mixed effects models with a random intercept and a random slope on the lower-level component of a cross-level interaction generally achieve accurate statistical inference for all coefficients of interest. By contrast, random intercept models that omit the random slope term produce severely anti-conservative inference for the cross-level interaction term and the main effect of its lower-level component. The proportion of 95% confidence intervals that cover the true effect size (i.e., the actual coverage rate) is generally smaller than the nominal rate, and often by a substantial margin. We find that the extent of undercoverage increases with the extent of variation in the (unmodeled) random slope, the variance of the lower-level component, and the number of lower-level observations per cluster. Illustrative empirical analyses of European Social Survey (ESS) data for 28 countries indicate that the consequences of omitting the random slope on the lower-level component are severe in real-life settings. We examine a total of 30 cross-level interactions and find that inclusion of the random slope term deflates the t-ratio on the cross-level interaction term by 31% or more in three quarters of cases, with an average reduction of 42%.
We then review studies of cross-level interactions published in the ESR between 2011 and 2016. Unsurprisingly, we find that authors were more likely to report statistically significant cross-level interactions when they used a misspecified model that omitted the corresponding random slope. Consistent with 'p-hacking' (Simonsohn et al., 2014), the distribution of absolute t-ratios for models estimated without a random slope exhibits a marked peak just above the critical value of 1.96, whereas the distribution for models that include a random slope does not.
In combination with the results of our Monte Carlo simulations and empirical illustrations, our review therefore suggests that most estimates based on models omitting the random slope would not have reached conventional levels of statistical significance in a correctly specified model.
The subsequent and penultimate section presents a further result of our analysis: the omission of a relevant random slope also leads to anti-conservative inference for a corresponding 'pure' lower-level effect. That is, even if the model does not contain any cross-level interactions involving education, accurate inference for the average effect of education on fear of crime across the 28 ESS countries would require the inclusion of a random slope on education, provided that such a slope is present in the process that gave rise to the data. While this result is troubling, there are two reasons to be less concerned than in the cross-level interaction case. First, most sociologists who use multilevel models are primarily interested in context effects rather than pure lower-level effects, as we confirm through a systematic analysis of the titles, abstracts, and formal hypotheses of research published in the ESR. Second, lower-level effects can typically be estimated with much greater precision (and correspondingly higher absolute t-statistics) than cross-level interactions. As a consequence, estimated lower-level effects should often stay statistically highly significant even if the associated t-ratio declines by 50% or more. In the cross-level interaction case, such a decrease will often mean the difference between moderately strong and no statistically meaningful evidence against the null hypothesis.
The concluding section discusses the primary implications of our study. Looking backward, our findings suggest that the empirical basis for many seemingly well-established findings in (country-)comparative research may be much shakier than previously thought. Looking forward, a minimum requirement for future studies that examine cross-level interactions using multilevel models is that they include a random slope on the corresponding lower-level variable. However, our findings suggest that fully accurate statistical inference for all coefficients, including pure lower-level effects, requires the inclusion of additional random slopes or alternative methods of inference, an important issue that should be addressed in future work.

Mixed Effects Models with Cross-Level Interactions
In a first step, we briefly review the general logic of mixed effects models with cross-level interactions (for comprehensive introductions, see, for example, Rabe-Hesketh and Skrondal, 2012; Raudenbush and Bryk, 2002; Snijders and Bosker, 2012). We begin with the following lower-level equation for the (lower-level) outcome y_ij (e.g., fear of crime):

y_ij = β^(c)_j + β^(x)_j x_ij + ε_ij,    (1)

where i indexes lower-level observations (e.g., individuals) and j indexes upper-level observations or clusters (e.g., countries). β^(c)_j is the constant (i.e., intercept) and β^(x)_j is the coefficient of the lower-level predictor x_ij (e.g., education). The subscript j on the two parameters, β^(c)_j and β^(x)_j, indicates that both are considered as potentially varying across clusters. In terms of our example, the j on β^(x)_j thus means that the degree to which better-educated people are less afraid of crime might vary across countries. The model could be extended to include additional lower-level predictors x_2ij to x_kij, but for our analysis this is not necessary. ε_ij is a lower-level error often assumed to follow ε_ij ~ N(0, σ²), that is, to be normally distributed with a mean of zero and constant variance σ² (homoskedasticity).
In a cross-level interaction model, β^(x)_j is specified as dependent on at least one cluster-level (i.e., contextual) variable z_j (e.g., the HDI). Typically, the model will (and should) also allow for a relationship between the constant β^(c)_j and z_j.
One way to formalize this is to write

β^(c)_j = γ^(c) + γ^(cz) z_j + u^(c)_j    (2)

and

β^(x)_j = γ^(x) + γ^(xz) z_j + u^(x)_j.    (3)

Here, u^(c)_j and u^(x)_j are cluster-level error terms or 'random effects', with the former often referred to as a 'random intercept' and the latter as a 'random slope' term. It is natural to think of these terms as capturing the effects of unmodeled cluster-level variables on the intercept and the slope, respectively. Substituting Equations 2 and 3 into Equation 1 yields the reduced form

y_ij = γ^(c) + γ^(cz) z_j + γ^(x) x_ij + γ^(xz) z_j x_ij + u^(c)_j + u^(x)_j x_ij + ε_ij.    (4)

Equation 4 shows why γ^(xz) is referred to as a 'cross-level interaction effect': it is the coefficient on a multiplicative interaction term between the lower-level predictor x_ij and the cluster-level predictor z_j; in our running example, it is the interaction between the individual characteristic education and the country attribute HDI. The first part of the right-hand expression, consisting of the linear combination of the constant and the lower- and upper-level predictors, multiplied by their respective coefficients (or 'fixed effects'), is also referred to as the fixed part of the model. Crucially, the second part shows that the model has a complex error term v_ij = u^(c)_j + u^(x)_j x_ij + ε_ij that consists of three components: the random intercept term u^(c)_j, the lower-level residual error ε_ij, and the product of the random slope term with the lower-level predictor, u^(x)_j x_ij.

Why Always a Random Slope?
The formal exposition of the multilevel model in the previous section provides an intuitive reason for why one should always include the random slope term: omitting u^(x)_j from Equation 3 amounts to assuming that z_j explains all cross-cluster variation in the slope of x_ij or, equivalently, that the R² of the cluster-level regression for β^(x)_j equals 1. As noted above, Raudenbush and Bryk (2002) do indeed discuss the possibility that 'little or no variance in the slopes remains to be explained' (p.28) after accounting for the cluster-level predictor z_j.
Yet we would argue that this is an unlikely scenario in the vast majority of sociological applications.

A more formal reason for always including the random slope emerges when we consider the composite error term v_ij from Equation 4. Its variance, conditional on the predictors, is

Var(v_ij) = Var(u^(c)_j) + 2 Cov(u^(c)_j, u^(x)_j) x_ij + Var(u^(x)_j) x_ij² + Var(ε_ij).    (5)

The covariance of the error terms for two different individuals (say, i and i′) belonging to the same cluster will be (Snijders and Bosker, 2012, Equation 5.6):

Cov(v_ij, v_i′j) = Var(u^(c)_j) + Cov(u^(c)_j, u^(x)_j)(x_ij + x_i′j) + Var(u^(x)_j) x_ij x_i′j.    (6)

These equations highlight that v_ij will be heteroskedastic even if u^(c)_j, u^(x)_j, and ε_ij are all homoskedastic, and that errors will be correlated within clusters. More specifically, if the true model includes the random slope term u^(x)_j, but the estimated model does not, there will be: a) unmodeled heteroskedasticity in the error term (due to the second and third terms on the right-hand side of Equation 5) and b) unmodeled covariation among the errors for lower-level observations belonging to the same cluster (due to the second and third terms on the right-hand side of Equation 6).

Figure 1 illustrates the problem graphically. To construct the figure, we first simulated a data set according to Equations 1 to 3, assuming substantial cross-cluster variation in the slope of x_ij. We set the number of clusters to 25 and the number of lower-level observations per cluster to 100 (see the note to Figure 1 for further information on how the data were generated). We then fitted a multilevel model with and a multilevel model without a random slope on x_ij to the simulated data and obtained the lower-level residuals for each.

[Figure 1 note: Residuals are from linear mixed effects models. The data are simulated according to Equations 1 to 3 with 25 clusters and 100 lower-level observations per cluster. The cluster- and lower-level predictors, z_j and x_ij, are both normally distributed with means of 0 and standard deviations of 1, and their coefficients are set to 1; u^(c)_j and u^(x)_j are multivariate normal with means of 0, standard deviations of .6 and 2, respectively, and with a correlation of .3; the lower-level error ε_ij is normally distributed with a standard deviation of 2.]
The figure plots these residuals against x_ij and z_j x_ij, after partialing out the cluster-level predictor z_j.
We focus on three representative clusters: one whose slope β^(x)_j deviates strongly upward from the average slope, one whose slope is close to the average, and one whose slope deviates strongly downward from the average. Regression lines have been added to approximate the conditional mean of the residuals for each of the three clusters.
The graphs in the left column of Figure 1 show that the lower-level residuals from the correctly specified model conform to the assumptions of the model: the cluster-specific means of the residuals are unrelated to either predictor and their variance is constant. The picture looks very different for the residuals from the misspecified model (i.e., the one omitting the random slope) in the right column. Consistent with the above discussion, the variance of the residuals is markedly higher for extreme values of x_ij (heteroskedasticity). Moreover, the residuals for lower-level observations belonging to the same cluster are highly positively (auto-)correlated when they have similar values on x_ij and z_j x_ij.
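The heteroskedasticity and within-cluster correlation just described follow directly from Equations 5 and 6 and are easy to verify numerically. The following sketch (Python with numpy rather than the authors' R code; the variance components are taken from the Figure 1 note) draws the random effects many times and compares the simulated variance and within-cluster covariance of the composite error with the model-implied formulas:

```python
import numpy as np

rng = np.random.default_rng(0)

# Variance components implied by the Figure 1 note:
# SD(u_c) = .6, SD(u_x) = 2, correlation .3, SD(eps) = 2
var_uc, var_ux, var_e = 0.6**2, 2.0**2, 2.0**2
cov_ucux = 0.3 * 0.6 * 2.0

def var_v(x):
    """Model-implied conditional error variance, Equation 5."""
    return var_uc + 2 * cov_ucux * x + var_ux * x**2 + var_e

def cov_v(x1, x2):
    """Model-implied within-cluster error covariance, Equation 6."""
    return var_uc + cov_ucux * (x1 + x2) + var_ux * x1 * x2

# Monte Carlo check: draw the random effects for many clusters and
# compare simulated error moments with the formulas above
R = 200_000
Sigma = np.array([[var_uc, cov_ucux], [cov_ucux, var_ux]])
u = rng.multivariate_normal([0.0, 0.0], Sigma, R)
x1, x2 = 0.5, -1.5  # two fixed lower-level observations in the same cluster
v1 = u[:, 0] + u[:, 1] * x1 + rng.normal(0, 2.0, R)
v2 = u[:, 0] + u[:, 1] * x2 + rng.normal(0, 2.0, R)
```

Here var_v(0.5) evaluates to 5.72 while var_v(0.0) is only 4.36, exactly the kind of x-dependent error variance that the misspecified model ignores.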
Omitting a random slope that actually belongs in the model thus leads to unmodeled heteroskedasticity and autocorrelation. This will typically lead to the underestimation of standard errors and thereby to anti-conservative inference. This is well known not only from the multilevel modeling literature, but also from the literature on cluster-robust inference in econometrics (for a recent overview, see Cameron and Miller, 2015). 2 In fact, the goal of achieving accurate inference in the presence of cluster-induced heteroskedasticity and autocorrelation is a common motivation for both multilevel modeling and cluster-robust methods. The former approach seeks to address the interdependencies among observations belonging to the same cluster through the inclusion of random intercept and slope terms (see Equations 1 to 6 above). The latter uses special 'sandwich-type' estimators of the coefficient covariance matrix that remain consistent even in the presence of heteroskedasticity and (within-cluster) autocorrelation.
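The 'sandwich' idea can be illustrated with a minimal sketch. The function below implements the basic CR0 cluster-robust variance estimator for OLS in numpy (a toy illustration of the approach surveyed by Cameron and Miller, 2015; the toy DGP and function name are ours, and real applications should use a tested implementation with small-sample corrections):

```python
import numpy as np

def cluster_robust_se(X, resid, groups):
    """CR0 cluster-robust ('sandwich') standard errors for OLS.

    No small-sample correction is applied; this is a bare-bones sketch.
    """
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(groups):
        score = X[groups == g].T @ resid[groups == g]  # cluster score vector
        meat += np.outer(score, score)
    return np.sqrt(np.diag(bread @ meat @ bread))

# Toy data with a shared error component within each of 30 clusters
rng = np.random.default_rng(1)
J, n = 30, 50
groups = np.repeat(np.arange(J), n)
x = rng.normal(size=J * n)
e = rng.normal(size=J * n) + np.repeat(rng.normal(size=J), n)
X = np.column_stack([np.ones(J * n), x])
y = 1.0 + 2.0 * x + e

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
se_cr = cluster_robust_se(X, resid, groups)
se_naive = np.sqrt(np.diag(np.linalg.inv(X.T @ X)) * resid.var())
```

In this toy DGP the cluster-level error component inflates the sampling variance of the intercept, so its cluster-robust standard error comes out several times larger than the naive OLS one.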
When will omitting the random slope term be particularly consequential? Inspection of Equations 5 and 6 (as well as Figure 1) suggests two relevant factors.
First, the consequences of omitting the random slope should become more severe as the variance of u^(x)_j increases. This is because both the conditional variance (Equation 5) and the within-cluster covariance (Equation 6) depend on Var(u^(x)_j). The second factor is the extent of variation in the lower-level predictor, that is, the variance of x_ij. As Var(x_ij) increases, so will the extent of (unmodeled) variation in the conditional error variance across observations. In terms of our running example, failure to model cross-cluster differences in the coefficient of education will be more consequential when individuals differ a lot in terms of their level of education.
The parallels to the literature on cluster-robust inference suggest a third factor that does not immediately follow from the above equations. The consequences of erroneously omitting the random slope term should also be related to the number of observations per cluster, that is, to the (average) cluster size. For the case of linear regression with clustered data, it is well known that the conventional (uncorrected) ordinary least squares variance estimate for a regressor x understates the true variance approximately by the factor (Cameron and Miller, 2015, 322)

1 + ρ^(x) ρ^(u) (N̄_g − 1),    (7)

where ρ^(x) is the within-cluster correlation of x, ρ^(u) is the within-cluster error correlation, and N̄_g is the average cluster size. Intuitively, the underlying reason is that the actual number of cases available for estimating the cross-level interaction is the number of clusters, because the cross-level interaction is about a cluster-level relationship. This is immediately clear from the 'slope-as-outcome' formulation of the model (see Equation 3 above). By omitting the random slope term, this cluster-level nature of the cross-level interaction is ignored and observations from the same cluster are treated as contributing independent information about the moderating effect of z_j on the slope of x_ij. This illusory increase in the number of cases available for estimating the cross-level interaction is larger when clusters are large.
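Equation 7 is simple enough to compute directly. The sketch below uses hypothetical values for the two correlations (they are not estimates from the article); note that standard errors are understated by the square root of this factor:

```python
def moulton_factor(rho_x, rho_u, avg_cluster_size):
    """Approximate factor by which the uncorrected OLS variance estimate
    understates the true variance (cf. Cameron and Miller, 2015)."""
    return 1 + rho_x * rho_u * (avg_cluster_size - 1)

# Hypothetical within-cluster correlations; the inflation grows
# linearly with the average cluster size
for n_bar in (100, 500, 1000):
    print(n_bar, moulton_factor(0.5, 0.2, n_bar))
```

With ρ^(x) = .5 and ρ^(u) = .2, for instance, the variance is understated roughly 11-fold at 100 observations per cluster and roughly 101-fold at 1,000, which is why larger clusters make the omission more consequential.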
Another way to understand why the average cluster size matters is to consider how it affects the overall prevalence of autocorrelation in the data (see Cameron and Miller, 2015, 319ff.). Both the multilevel and the cluster-robust approach assume that errors are correlated within, but not across, clusters. Thus off-diagonal elements of the error covariance matrix will be zero for pairs of observations that belong to different clusters and will generally be non-zero for pairs of observations that belong to the same cluster. For a given lower-level sample size, this means that the number of non-zero off-diagonal elements increases with the average cluster size (and decreases with the number of clusters). 3

In summary, the above discussion suggests that practitioners should always specify a random slope for the lower-level variable of a cross-level interaction in mixed effects models. Failure to include a random slope means disregarding cluster-driven heteroskedasticity and autocorrelation, violating fundamental model assumptions. Omitting the random slope term associated with a cross-level interaction will not, in general, introduce systematic bias into coefficient estimates. 4 But it will lead to overly optimistic statistical inference for the cross-level interaction term and the coefficient (i.e., the 'main effect') of the lower-level variable involved in the interaction. All other coefficient estimates and their standard errors, including the main effect of the contextual predictor involved in the cross-level interaction as well as any additional lower- and upper-level predictors, should largely remain unaffected. 5 The consequences of omitting the random slope term should become more severe a) as the unaccounted variation in the cluster-specific slopes grows, b) as the variance of the involved lower-level variable increases, and c) as the average cluster size becomes larger.

Inference For 'Pure' Lower-Level Effects
Against the background of the preceding discussion, one may wonder if the incorporation of random slopes is also important for achieving correct inference on the coefficients of lower-level variables that are not involved in a cross-level interaction term, that is, on 'pure' lower-level effects (cf. Barr et al., 2013; Bell et al., 2016). In terms of our running example, this means: does it remain crucial to include the random slope if we are interested in the overall (average) effect of education on fear of crime rather than the interaction between human development and education? After all, it is the presence of an unmodeled random slope term u^(x)_j, and not the interaction between a cluster-level and a lower-level predictor, that introduces heteroskedasticity (Equation 5) and autocorrelation (Equation 6) into the overall error term v_ij. To foreshadow our results, we do indeed find that the omission of a relevant random slope also leads to anti-conservative inference for pure lower-level effects.
That being said, we maintain and demonstrate below that there are at least two important reasons why the cross-level interaction case deserves special attention. The first is that, at least in sociology, the overwhelming majority of studies that use mixed effects models with multilevel data are primarily interested in context effects, including cross-level interactions. The second reason is that the erroneous omission of a random slope term tends to be less consequential in the pure lower-level effect than in the cross-level interaction case. The crucial reason for this is that, compared to a lower-level effect, much more data will usually be needed to achieve the same level of statistical power for identifying a cross-level interaction (Gelman and Hill, 2007, Ch. 20). As a consequence, the same relative increase in the standard error (due to omitting a random slope term) will often make the difference between moderately strong and no meaningful evidence against the null hypothesis in the cross-level interaction case (say, between p < .05 and p > .1). In the case of pure lower-level effects, the difference is more likely to be between different degrees of strong evidence (say, between p < .001 and p < .01). We explore these issues in detail in the 'Random Slopes and 'Pure' Lower-level Effects' section. In a first step, we now focus on the cross-level interaction case.
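The contrast between the two cases is a matter of simple arithmetic on t-ratios. The sketch below applies the 42% average reduction found in our ESS examples to two hypothetical t-ratios, using a normal approximation for the p-values:

```python
import math

def two_sided_p(t):
    """Two-sided p-value for a t-ratio under a normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2))

reduction = 0.42  # average t-ratio deflation in the ESS examples
for t in (2.2, 10.0):  # hypothetical t-ratios from a misspecified model
    t_adj = t * (1 - reduction)
    print(t, two_sided_p(t), t_adj, two_sided_p(t_adj))
```

A t-ratio of 2.2 (p ≈ .03) falls to about 1.28 (p ≈ .20), that is, from significant to clearly non-significant, whereas a t-ratio of 10 falls to 5.8 and remains highly significant.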

Simulation Evidence Simulation Set-up
We now present Monte Carlo simulations to illustrate the importance of including random slopes alongside cross-level interaction terms. In Monte Carlo analysis, the statistical properties of competing estimators are evaluated under controlled conditions by repeatedly sampling data from a known data-generating process (DGP) and applying the estimators to each simulated dataset. By modifying key aspects of the DGP (e.g., the number of clusters) one can investigate how they shape the relative performance of the competing estimators.
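In code, the Monte Carlo logic amounts to a simple loop. The skeleton below (Python rather than the authors' R; the function names and toy estimators are ours) draws repeated datasets from a known DGP and applies each competing estimator to every draw:

```python
import numpy as np

rng = np.random.default_rng(3)

def run_monte_carlo(n_reps, draw_data, estimators):
    """Generic Monte Carlo loop: repeatedly draw data from a known DGP
    and apply each competing estimator to every simulated dataset."""
    results = {name: [] for name in estimators}
    for _ in range(n_reps):
        data = draw_data(rng)
        for name, est in estimators.items():
            results[name].append(est(data))
    return {name: np.array(vals) for name, vals in results.items()}

# Toy illustration: estimating a known mean with two competing estimators
draw = lambda rng: rng.normal(5.0, 2.0, size=50)
ests = {"mean": np.mean, "median": np.median}
out = run_monte_carlo(2000, draw, ests)
```

In the simulations reported below, draw_data would generate multilevel data according to Equations 1 to 3, and the estimators would be the two mixed effects specifications with and without the random slope.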
The general form of the DGP for the simulations is given in Equations 1, 2, and 3 above. That is, we consider a simple case with one lower-level predictor x_ij and one upper-level predictor z_j, with the latter affecting both the constant and the slope of x_ij. In our running example, x_ij would be education, z_j would be human development, and the dependent variable y_ij would be fear of crime.
We examine several variants of this DGP which, in keeping with standard terminology, we also refer to as 'experimental conditions'. In particular, we vary the number of clusters, the number of (lower-level) observations per cluster, the standard deviation of the random slope term u^(x)_j, and the standard deviation of the lower-level predictor x_ij. Table 1 lists the dimensions that we manipulate, along with the different values that we consider; among them, SD(u^(x)_j) takes values of .3333, 1.0000, and 3.0000 (implying R² values of .90, .50, and .10 for the cluster-level regression, shown in parentheses in the table), and SD(x_ij) takes values of .50, 1.00, and 2.00. In total, we analyze 162 (= 3 × 3 × 6 × 3) experimental conditions. The coefficients on all predictors (i.e., γ^(cz), γ^(x), and γ^(xz)) are set to 1 and the overall constant γ^(c) is set to 0.
We obtain 10,000 replications (i.e., 10,000 simulated datasets) per experimental condition and fit two mixed effects models to each simulated data set. Consistent with the DGP, both models include the cluster-level predictor z_j, the lower-level predictor x_ij, and their cross-level interaction z_j x_ij. Both also include a random intercept term corresponding to u^(c)_j in Equation 2. The only difference between the two models is that the first further includes a random slope term corresponding to u^(x)_j in Equation 3, whereas the second does not.

We focus on statistical inference. There is no reason to expect that the omission versus inclusion of the random slope term affects parameter bias. 6 To assess inferential accuracy, we examine the actual coverage rates of two-sided 95% confidence intervals. Accurate inference (for an unbiased estimator) requires that the actual coverage rate equals the nominal rate. We therefore examine whether two-sided 95% confidence intervals cover the true parameter in more or less than 95% of the 10,000 Monte Carlo replications. Let C_95(r) = 1 if the two-sided 95% confidence interval for the r-th replication includes the true value of the parameter of interest and zero otherwise. Then coverage is defined as

Coverage = (1/10,000) × Σ_r C_95(r).

If coverage is greater than 95%, confidence intervals are too large and overconservative; hypothesis tests will retain the null hypothesis of no effect too often. If coverage is smaller than 95%, confidence intervals are too narrow and anti-conservative; hypothesis tests will reject a true null hypothesis too often. We conducted all simulations in R (R Core Team, 2017), using the lmer function of the lme4 package (Bates et al., 2017) to estimate the mixed effects models.

Table 2 shows actual coverage rates for models that omit versus models that include a random slope term on the lower-level component of the cross-level interaction. Results are displayed along two dimensions: the amount of unexplained variation in the cluster-specific slope β^(x)_j and the extent of variation in x_ij. The number of clusters is 15 and the number of lower-level observations per cluster is 500 throughout the table.
We explore the impact of varying these factors below.
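The coverage criterion is straightforward to compute. The sketch below (Python, with an idealized normal sampling distribution rather than actual mixed model fits) shows how the C_95(r) indicators are aggregated, and how understated standard errors translate into undercoverage:

```python
import numpy as np

rng = np.random.default_rng(7)

def coverage(estimates, ses, truth, z=1.96):
    """Share of two-sided 95% CIs (estimate ± z*SE) covering the true
    value, i.e. the mean of the C_95(r) indicators defined above."""
    lo = estimates - z * ses
    hi = estimates + z * ses
    return np.mean((lo <= truth) & (truth <= hi))

# Idealized sampling: unbiased estimates with correct standard errors
R, truth, true_se = 10_000, 1.0, 0.2
est = rng.normal(truth, true_se, R)
print(coverage(est, np.full(R, true_se), truth))      # close to .95

# Anti-conservative case: SEs understated by half, as when a relevant
# random slope is omitted (hypothetical magnitude for illustration)
print(coverage(est, np.full(R, true_se / 2), truth))  # well below .95
```

With correct standard errors the computed coverage is close to .95; when the standard errors are understated by half (a hypothetical magnitude), coverage drops to roughly two-thirds.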

Simulation Results
The central result in Table 2 is that coverage rates of confidence intervals based on models that omit the random slope term are inaccurate. As expected, this does not apply to inference for the main effect of the contextual predictor z j where coverage rates fall within the range of 95 ± 0.427% for all experimental conditions.
But the coverage rates of confidence intervals for the cross-level interaction term and for the main effect of the lower-level predictor are too low, and the extent of undercoverage is generally substantial. To understand the implications, note that an actual coverage rate of 90% means that nominal significance on the 5% level would actually only mean 'marginal' significance on the 10% level. 7 Yet, most actual coverage rates displayed in Table 2 are even substantially smaller than 90%.
Our simulation results therefore suggest that omitting the random slope term can easily turn coefficient estimates that are actually far from any conventional level of statistical significance into ones that seemingly surpass the corresponding thresholds.
By contrast, coverage rates of confidence intervals based on models that include a random slope term are by and large accurate for all three coefficients and across all displayed experimental conditions. Only when variation is low for both the lower-level predictor (i.e., SD(x_ij)) and the random slope term (i.e., SD(u^(x)_j)) do the results show a tendency for overly conservative inference, meaning that confidence intervals might be somewhat too wide. We return to this unexpected result at the end of this section.

The next important question is: what drives the extent of miscoverage? As expected, the extent of undercoverage grows with the unaccounted cluster-specific variation of β^(x)_j in the true model (i.e., with SD(u^(x)_j)) and also with the extent of variation in x_ij (i.e., with SD(x_ij)). The reason is that the extent of heteroskedasticity and autocorrelation that remains unmodeled in the specification that omits the random slope is a function of the product of these two factors (see Equations 5 and 6 above), which is also why each dimension on its own can drive the extent of undercoverage to completely unacceptable levels.
We further argued that the (average) size of the upper-level units or 'clusters' should exacerbate the consequences of omitting a random slope term because models without a random slope term assume too much independence among observations (see the discussion of Equation 7 above). We explore this issue in Table 3, which shows actual coverage rates by the number of clusters and the number of observations per cluster. SD(x_ij) is set to 1 and the implicit cluster-level R² for β^(x)_j to .5 (i.e., SD(u^(x)_j) = 1.00); that is, we hold both factors at the intermediate levels considered in Table 2 above. Table 3 confirms that inference based on models that include a random slope is generally accurate. As before, we also see that omitting the random slope term does not, in general, compromise inference for γ^(cz). As expected, the problem gets worse as the cluster size (i.e., the number of lower-level observations per cluster) increases. For every given number of clusters, undercoverage is most severe for 1,000 observations per cluster as compared to 500 and especially 100 observations per cluster.
The upshot of our Monte Carlo simulations thus is that omitting the random slope term on the lower-level component of a cross-level interaction can lead to dramatically anti-conservative statistical inference for the interaction term and the main effect of the lower-level variable. In line with our expectations, undercoverage increases with the extent of variation in the lower-level variable, the extent of variation in the unmodeled random slope term, and with the (average) size of the clusters.
Before we investigate the severity of the problem using real-life data from the European Social Survey, we summarize the main results of two additional sets of simulations.
In Online Supplement Appendix B we further investigate the unexpected result that the (correctly specified) model including the random slope term yields overconservative statistical inference in some situations. We present additional simulations that consider even lower values of .14 and .10 for the standard deviation of the random slope term, implying values of .98 and .99 for the R² of the cluster-level regression. The additional simulations confirm that very low variation in the random slope term can lead to substantial overcoverage, especially when the number of clusters is also very low. While these results do warrant a note of caution, their practical relevance is limited. In the vast majority of applications the number of clusters is at least in the tens, and cross-cluster variation in random slopes is typically substantial, at least in country-comparative settings. This is confirmed by the empirical examples presented in the next section and in the Online Supplement (see, in particular, the final columns of Tables D1 to D6). Moreover, practitioners can easily verify whether they are dealing with a situation where the random slope variation is close to zero.

In a second set of supplementary analyses, presented in Online Supplement Appendix C, we investigate the performance of a data-driven approach to model selection. As noted in the introduction, Raudenbush and Bryk (2002, p. 28) suggest that it might be appropriate to omit the random slope when its variance is very close to zero, and established model selection criteria are obvious candidates for making this determination. The simulations in Appendix C show that such data-driven specification searches are no remedy for the problems documented here.

Nearly all arrows point downwards, indicating that absolute t-ratios for the models including the random slope term are lower, and often very substantially so. Take our running example, for instance, which is represented by the second arrow from the right: the model that does not contain a random slope on high education yields an absolute t-ratio of 9.7 for the cross-level interaction between high education and the human development index. Note that the y-scale is logged, so arrows of similar length indicate similar relative changes. Over the 30 different models, the average reduction in the absolute t-ratio for the cross-level interaction effect due to including the random slope is 42.4%.

Note: The triangular arrowheads show the absolute t-ratio from the specification including a random slope for the lower-level predictor of a cross-level interaction. The start points of the arrows indicate the absolute t-ratio from the specification omitting the random slope. The labels name the outcome (e.g., fear of crime) and the lower-level predictor involved in the cross-level interaction (e.g., unemployed). The country-level moderator is always the human development index.

In light of these results, we conclude that not specifying random slopes on the lower-level components leads to invalid statistical inference about cross-level interactions, and that the magnitude of the problem will be considerable in many sociological applications.

Cross-level Interactions in the ESR
Given our findings, one may wonder whether current multilevel modeling practice meets the requirements for correct inference by including random slopes on the lower-level components of cross-level interactions. To answer this question, we reviewed all articles that investigate a cross-level interaction and that were published in the ESR between 2011 and 2016. For simplicity, we confined ourselves to studies using simple two-level models where lower-level observations are nested in exactly one type of upper-level unit. Overall, we identified 28 studies, the vast majority of which (24 or 86%) were country comparisons (one of the remaining studies treated individuals as nested in combinations of countries and survey years). The 28 studies reported a total of 150 estimates of cross-level interactions. When a paper provided multiple estimates of the same cross-level interaction (e.g., for different model specifications or subsamples), we chose one estimate at random. Note that we now disregard the main effects of the cluster- and lower-level components because the cross-level interaction terms tend to be of primary interest to authors.
The discomforting result of our review is that not even half of the studies (11/28 or 39%) specify random slopes on the lower-level components of the cross-level interactions they investigate. Figure 3 displays the percentage of studies that include random slope terms by year of publication. It provides no evidence that correct specifications have become more popular over time; if anything, the contrary seems true. Since there is little reason to suspect that these problems are confined to articles that appeared in the ESR, we conclude that a large number of published sociological studies fail to meet the requirements for correct statistical inference about cross-level interactions. Note: Results are based on 28 articles reporting 150 cross-level interactions from two-level mixed effects models published in the ESR 2011-2016. Since many articles did not report levels of significance beyond p < 0.01, we restrict our review to this threshold as the highest level of significance.
We have shown that inclusion of random slopes on the lower-level components of cross-level interactions results in larger standard errors and smaller absolute t-ratios, so studies using the correct random effects structure should be less likely to find statistically significant effects. To investigate this implication, we surveyed inferential statistics for the 150 cross-level interactions estimated in the 28 ESR articles. Where available, we collected the t-ratio; otherwise we used the p-value or the point estimate and standard error to compute the t-ratio from these statistics. 9 Unfortunately, several studies only report whether the estimated cross-level interactions attain a certain level of statistical significance, such as the 5% level, as commonly indicated by a single asterisk *. 10 Another problem is the rounding of point estimates and standard errors: coefficients with many leading zeros often come with tiny standard errors, both of which are then rounded and reported as '0.00'. In such extreme cases, it is impossible to reliably approximate the t-statistic, and we again surveyed the level of significance of the cross-level interaction term. Table 4 displays the percentage of estimated cross-level interaction effects that attain a given level of statistical significance according to whether the model did or did not include a random slope on the lower-level component. It shows a consistent pattern of more insignificant and marginally significant, but fewer significant and highly significant, cross-level interaction terms for models with the correct random effects specification (i.e., models that include a random slope). Put differently, cross-level interactions that are erroneously estimated without a random slope on the lower-level component more often reach conventional levels of statistical significance. 11 The distribution of the collected t-ratios additionally reveals a marked clustering just above the conventional critical value of 1.96, especially for the incorrectly specified models, which is suggestive of p-hacking.
Online Supplement Appendix E further investigates this issue and finds some aggregate-level evidence for p-hacking among studies that did not specify random slopes for their cross-level interactions, but not among those that correctly included a random slope.
What matters here more immediately is another implication of the clustering of t-ratios just above 1.96: in light of the above evidence, it seems almost certain that the solid line for cross-level interactions tested without a random slope needs to shift substantially to the left. That is, the true t-ratios for the cross-level interactions that were estimated using such models will often be much smaller.
If we take the illustrative empirical analyses at face value, the correct t-ratios will be at least 31% smaller for three quarters of these estimates (cf. the percentiles of the relative reductions in t-ratios reported above). This suggests that many of the cross-level interaction effects based on misspecified models are not actually statistically significant at conventional levels. Thus, they should probably not have made it into the ESR or at least should have been interpreted very cautiously.
This conclusion is further reinforced if we take into account that critical values based on the normal distribution (i.e., t = 1.96 and t = 2.58) are questionable when cluster-level samples are small. Elff et al. (2016) elaborate that critical values for cross-level interaction terms should instead be derived from a t-distribution, with the appropriate degrees of freedom typically being smaller than the number of clusters. Given that many of the surveyed studies work with cluster-level sample sizes in the tens or twenties, this recommendation would often result in substantially larger critical values. As this problem also applies to the cross-level interaction terms that were estimated including a random slope, one has to wonder how much robust evidence of cross-level interactions European sociology has generated at all.
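The size of this correction is easy to check. The snippet below is an illustrative sketch (ours, not from the article) of the m − l − 1 degrees-of-freedom rule discussed by Elff et al. (2016), where m is the number of clusters and l the number of cluster-level predictors; the choice l = 1 is hypothetical:

```python
from scipy.stats import norm, t

# Two-sided 5% critical values: normal approximation vs. a t-distribution
# with m - l - 1 degrees of freedom (Elff et al., 2016). Here l = 1.
z_crit = norm.ppf(0.975)
for m in (28, 20, 10):
    df = m - 1 - 1
    print(f"m = {m:2d} clusters (df = {df:2d}): "
          f"t_crit = {t.ppf(0.975, df):.3f} vs. z_crit = {z_crit:.3f}")
```

With 28 clusters the critical value rises from 1.96 to about 2.06, and with 10 clusters to about 2.31, so t-ratios that clear the normal-based threshold only narrowly will often fail the corrected test.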

Random Slopes and 'Pure' Lower-level Effects
The results so far compellingly demonstrate that inclusion of a random slope term on the lower-level component is crucial for achieving correct statistical inference about the cross-level interaction term and the main effect of the lower-level variable. A natural follow-up question is whether the random slope term is also important for inference on the coefficients of lower-level variables that are not involved in a cross-level interaction, that is, for 'pure' lower-level effects. We showed above that omitting a random slope that is actually present in the DGP introduces heteroskedasticity (Equation 5) and autocorrelation (Equation 6) into the overall error term v_ij; importantly, this fact does not hinge on the presence of a cross-level interaction term in the DGP.
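This variance structure can be verified numerically. Below is a small sketch (our own illustration; the SDs of 0.5 for the random intercept and slope and 1 for the residual are hypothetical) of the composite error v_ij = u0_j + u1_j · x_ij + e_ij. Its variance grows with x_ij² (heteroskedasticity), and observations sharing a cluster share u0_j and u1_j (autocorrelation):

```python
import numpy as np

rng = np.random.default_rng(7)

# Composite error v_ij = u0_j + u1_j * x_ij + e_ij with SD(u0) = SD(u1) = 0.5
# and SD(e) = 1, so Var(v_ij) = 0.25 + 0.25 * x_ij**2 + 1.
J = 200_000  # one draw of (u0, u1) per simulated cluster
u0 = rng.normal(scale=0.5, size=J)
u1 = rng.normal(scale=0.5, size=J)
variances = {}
for xv in (0.0, 1.0, 2.0):
    v = u0 + u1 * xv + rng.normal(size=J)
    variances[xv] = v.var()
    print(f"x = {xv}: Var(v) = {variances[xv]:.2f}")
```

The simulated variances track the formula 0.25 + 0.25x² + 1, confirming that the error variance depends on the value of the lower-level predictor once the random slope is pushed into the error term.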
Further Monte Carlo simulations indeed show that the inclusion of random slope terms is also essential for inference about pure lower-level effects. The basic DGP and the experimental conditions considered in these further analyses are identical to those presented in the 'Simulation Evidence' section above. There is only one crucial difference, namely that β^(x)_j, the coefficient on the lower-level variable x_ij, no longer depends on the cluster-level predictor z_j (in other words, the DGP no longer includes a cross-level interaction):

β^(x)_j = γ^(x) + u^(x)_j    (8)

Table 5 shows results for the same experimental conditions as Table 2. It yields virtually identical conclusions. When the coefficient of a lower-level variable varies across clusters, statistical inference for the coefficient will be anti-conservative unless that variation is captured by a random slope term. As in the cross-level interaction case, the problem becomes worse as the extent of cross-cluster variation in the lower-level effect increases (i.e., the higher SD(u^(x)_j)). Moreover, because the source of the problem is heteroskedasticity that correlates with x_ij, more variation in x_ij amplifies the inaccuracy of statistical inference with respect to γ^(x). Online Supplement Table F1 further reaffirms that the average cluster size exacerbates the problem, just as in the cross-level interaction case (see Table 3 above). Across all experimental conditions, the extent of statistical overconfidence, as measured by the undercoverage of two-sided 95% confidence intervals, is generally very similar to the corresponding results for the cross-level interaction case.
Despite these results we maintain that the cross-level interaction case is more problematic and deserves special attention for at least two reasons. First, practitioners who analyze multilevel data with mixed effects models are primarily interested in context effects. Second, lower-level effects tend to be so precisely estimated that inaccurate inference is less likely to lead to qualitatively different conclusions. We now elaborate on both of these issues.
Our reading of applied research using mixed effects multilevel models is that practitioners predominantly use these models to test hypotheses about context effects. Typically, lower-level variables are mainly included to adjust for compositional differences across clusters. For each article, we coded whether a) the title, b) the abstract, and c) (where present) explicitly formulated hypotheses stress 1) individual-level relationships, 2) contextual relationships (direct context effects and/or cross-level interactions), or 3) both. A similar pattern emerges if we consider the abstracts of the articles. In some sense, these figures may even overstate the salience of pure lower-level effects in the surveyed studies. Our impression from coding the articles is that hypotheses about lower-level relationships are often the ones that are least novel and that authors take the least interest in. This is also why, as we turn to titles, where authors are forced to stress the central contribution of their paper, the mixed category shrinks to ca. 15%, mostly because articles tend to highlight only context effects in their title. Two thirds of all articles fall into this category.
A second reason why omitting the random slope tends to be much less consequential for the pure lower-level effect case is the much higher overall precision (expressed for instance in higher absolute t-ratios) with which such effects tend to be estimated. Identification of a pure lower-level effect is about estimating the average strength of a lower-level relationship across a set of clusters. Identification of cross-level interactions is about explaining cross-cluster variation in the strength of a relationship. Much more data will usually be needed to gain the statistical power for drawing firm conclusions concerning the second type of effect (Gelman and Hill, 2007, Ch. 20). In consequence, considering random slopes or not will rarely make a difference with respect to conventional levels of significance in the case of pure lower-level effects.
To illustrate this point, we collected the absolute t-ratios for the 60 cross-level interaction estimates reported in Online Supplement Tables D1 to D6 (5 dependent variables × 6 lower-level predictors × 2 specifications, one including and one omitting the random slope). In addition, we estimated the same models without the cross-level interaction terms and collected the absolute t-ratios of the 60 pure lower-level effects (i.e., the absolute t-ratios pertaining to the uninteracted coefficients of high education, intermediate education, gender, unemployment, age, and marital status). Figure 5 shows these absolute t-ratios, ranked by their size and differentiated by whether the model entailed a random slope on the respective lower-level predictor or not. The t-ratios for the cross-level interaction terms, displayed in the top graph, are mostly smaller than 5, and where a random slope was specified, the vast majority are smaller than the critical value of 2.056 (df = 26; see Elff et al., 2016).
Because of these generally small t-ratios, the inclusion of the random slope term would often lead to qualitatively different conclusions concerning the strength of evidence against the null hypothesis.

Figure 5: Distribution of absolute t-ratios. Note: The 60 absolute t-ratios for cross-level interactions are estimated by the linear mixed effects models displayed in Tables D1 to D6. The 60 absolute t-ratios for lower-level effects are estimated by identical linear mixed effects models that simply omit the cross-level interaction terms. Two of the 60 t-ratios for pure lower-level effects are omitted from the bottom panel because they are extreme outliers, with absolute t-ratios of approximately 142 for the model omitting and 66 for the model including the random slope. The dashed horizontal line demarcates 2.056, the threshold for statistical significance at the five per cent level (two-tailed test), based on a t-distribution with 26 (= 28 − 2) degrees of freedom, as suggested by Elff et al. (2016).
The picture looks very different for the absolute t-ratios of the pure lower-level effects, displayed in the bottom graph. Including the random slope shifts the distribution of t-ratios downward substantially. However, the t-ratios remain very high and far above conventional thresholds for statistical significance in the vast majority of cases. Of the 26 lower-level effects that are significant at the five per cent level according to a model that omits the respective random slope, 24 remained significant after its inclusion. In the cross-level interaction case, by contrast, we observe a change from statistical significance to insignificance in 7 out of initially 15 cases (see also Figure 2 above). Thus, even though statistical inference for lower-level effects will be overconfident when the corresponding random slope is not included, chances are high that any given effect would remain (highly) significant in the correctly specified model. This is the decisive difference from the cross-level interaction case, where switching to the correct specification will often wash away any robust evidence against the null hypothesis.

Conclusions
Our study was motivated by the observation that published research using mixed effects multilevel models is strikingly inconsistent when it comes to the inclusion of random slopes on the lower-level components of cross-level interactions. Several leading textbooks on multilevel modeling fail to give a clear recommendation on this issue as well.
We have argued, and demonstrated with Monte Carlo simulations, that cross-level interactions generally require the inclusion of the associated random slope.
Omission of the random slope term results in unmodeled cluster-driven heteroskedasticity and autocorrelation, thus violating fundamental model assumptions and assuming too much independence among observations. The most important consequence is that statistical inference for the cross-level interaction term and the main effect of its lower-level component becomes overly optimistic: t-ratios will be too high, confidence intervals too narrow, and standard errors as well as p-values too low, leading to overrejection of the null hypothesis of no effect. The problem becomes more severe a) as unmodeled variation in the cluster-specific slopes increases, b) as the variance of the lower-level variable involved in the interaction increases, and c) as the cluster size grows (i.e., the more lower-level observations there are per cluster). Mixed effects models that include a random slope term on the lower-level component of cross-level interaction terms generally performed very well in our simulations. Only in a few situations did we find them to produce over-conservative inference, and these issues arose only under conditions of little practical relevance.
A total of 30 illustrative applications based on European Social Survey data indicate that the consequences of omitting the random slope can be dramatic in real-life settings. In three quarters of cases, the absolute t-statistic on the cross-level interaction term was at least 31% lower for the model including the random slope than for the model omitting it. These results are highly discomforting, since our review of ESR articles indicates that many published cross-level interactions estimated without the associated random slope are barely statistically significant.
It is quite likely that most of these estimates could not be considered as robust evidence for the relationship in question if they were estimated using the correct specification.
Looking backward, our results thus cast doubt on many findings that are potentially considered well-established. We encourage researchers to take our results into account when reviewing previous studies. Results on cross-level interactions that were estimated without the crucial random slope term should be interpreted with caution and considered as preliminary. Their validity should be checked through replication and the results of replication attempts should be publicly reported to promote a balanced assessment of the empirical evidence for a given cross-level relationship.
Looking forward, our findings suggest that researchers who investigate cross-level interactions using mixed effects multilevel models should always include a random slope for the lower-level component of the interaction. Editors and referees should insist that authors adhere to this rule.
That being said, our results highlight another, broader challenge faced by those who want to analyze multilevel data with mixed effects models. We found that random slopes are similarly required for accurate inference about 'pure' lower-level effects, provided, of course, that the effect truly varies across clusters (see also Barr et al., 2013; Bell et al., 2016). We believe this issue to be less troubling than the cross-level interaction case because researchers using multilevel modeling are rarely interested in pure lower-level effects and because many of these effects would remain highly statistically significant even if the associated absolute t-statistic declined by 50% or more. Nevertheless, the idea that statistical inference on lower-level predictors will typically be anti-conservative is unattractive, even if these predictors are usually only considered as control variables.
How, then, can this issue be resolved? Simply specifying random slopes on all lower-level predictors will rarely be a solution. Such models would typically suffer from overspecification, an issue discussed in great detail by Bates et al. (2015) and Heisig et al. (2017). The strategy of specifying additional random slopes in the interest of accurate inference would at some point become self-defeating, leading to the very problem it seeks to solve: anti-conservative inference (Heisig et al., 2017). One viable, albeit not fully satisfactory, solution will be to focus on achieving correct inference for the coefficients of interest and to take inference for the other predictors with a large grain of salt. One might also consider fitting the same fixed effects specification (i.e., the same model in terms of the set of predictors included) with several random effects specifications, including the random slope terms one at a time (i.e., first for x_1, then for x_2, and so forth), to get a sense of the correct standard errors for the different lower-level predictors. A fully convincing solution, however, will probably require methods such as bootstrapping, profile likelihood methods, or generalizations of 'sandwich-type' methods for cluster-robust inference (as implemented, for example, in the vce(cluster clustvar) option for Stata's mixed command). 12 There is good justification for all of the latter methods, but further research is needed to learn about their performance in typical sociological applications that often involve a limited number of large clusters and many lower-level controls.
Notes

1 The careful reader might notice that Equations 5 and 6 in Snijders and Bosker (2012) refer to the conditional variance of the outcome Y_ij rather than the overall error v_ij. However, this is fully consistent with the formulation given here because, conditional on x_ij, variation in Y_ij can only come from the random part of the model, that is, from v_ij.
2 We do not study the performance of cluster-robust methods in this paper because mixed effects models are by far the most widely used method for investigating context effects in sociology (Heisig et al., 2017) and because the cluster-robust approach has its own set of pitfalls, especially when the number of clusters is small or when the data are characterized by multiple (non-hierarchical) levels of clustering (for further discussion, see Cameron and Miller, 2015).
3 It also suggests that the size of τ depends on how 'unbalanced' the cluster sizes are, that is, on how much the number of lower-level observations differs across clusters. We do not pursue this point further in this paper.
4 This is not to say that point estimates will never differ according to whether a random slope is included or not. This is easiest to see in the case of a 'pure' lower-level effect, that is, of a coefficient of a lower-level variable that is not interacted with an upper-level predictor. In the model that includes the random slope, the coefficient estimate on the lower-level predictor is an estimate of the unweighted average of the cluster-specific slopes. This follows from the fact that the random slope is assumed to be normally distributed with a mean of zero. In the model that does not include a random slope, the coefficient will be a weighted average of the cluster-specific coefficients. Therefore, the difference will be particularly large when the magnitude of the cluster-specific coefficients is strongly related to cluster size. It is not clear whether one would necessarily want to describe one of these estimates as 'biased', however, as the two approaches really estimate different quantities. To see that similar issues arise in the estimation of cross-level interactions, one simply has to note that the coefficient on the cross-level interaction term can be conceptualized as the effect of the cluster-level variable on the conditional average slope of the lower-level variable. Equation 3 makes this very clear.

5 We have conducted additional Monte Carlo simulations which support this claim. These results are available upon request.

6 The simulation results indeed show that both types of models produce unbiased coefficient estimates. These results can be obtained from the replication files which are part of the online supporting material. As discussed in footnote 4, there may be cases when a model with and a model without a random slope produce systematically different estimates, but the reason here would be that the former estimates an unweighted whereas the latter estimates a weighted average effect.
7 For example, while the nominal probability of rejecting the null hypothesis of no effect although it is true would be .05, the true probability would be .10.
8 Replication code for the analyses in Heisig et al. (2017) is available at http://journals. sagepub.com/doi/suppl/10.1177/0003122417717901. Together with the replication code for the present article, it can be used to replicate all analyses reported in this section.
9 When relying on the p-value, we assumed a normally distributed test statistic, consistent with the approach taken by the majority of authors. Elff et al. (2016) show this assumption to be problematic when the number of clusters is small, but we nevertheless use it here to treat the different studies consistently.
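The conversion described in this note amounts to inverting the two-sided normal test. The following is an illustrative sketch of that conversion (our own helper, not the authors' code):

```python
from scipy.stats import norm

def t_from_p(p_value):
    """Recover the absolute t-ratio implied by a two-sided p-value,
    assuming a standard normal test statistic."""
    return norm.isf(p_value / 2)

for p in (0.05, 0.01, 0.001):
    print(f"p = {p}: |t| = {t_from_p(p):.2f}")
```

For example, a reported p-value of 0.05 implies an absolute t-ratio of about 1.96 under the normality assumption.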
10 For a thorough review and critical discussion of reporting practices and significance testing in the ESR, see Bernardi et al. (2017).

11 We focus on statistical significance because of the important role that it continues to play in the publication process and in the evaluation of empirical evidence. We do not mean to imply that statistical significance is the best and/or should be the only criterion used to assess statistical uncertainty. Our conclusions would clearly be similar for alternative measures of uncertainty such as standard errors or confidence intervals.
12 Another option might be to avoid mixed effects models altogether and use conventional regression techniques with cluster-robust variance estimation or two-step approaches (Heisig et al., 2017). However, conventional corrections for clustering are known to perform poorly with the small cluster-level samples that sociologists often deal with. More recent methods developed specifically for the few-clusters case show better performance-often involving a form of bootstrapping too-but this still is an area of active research (for details, see Cameron and Miller, 2015;Esarey and Menger, 2018).

Appendix A Standard Error Bias as an Alternative Outcome
In the main article, we focus on the actual coverage rates of two-sided 95% confidence intervals in assessing the inferential accuracy of the different estimators.
An alternative approach, taken, for example, by Schmidt-Catran and Fairbrother, is to report the bias of the estimated standard errors. However, reporting coverage has a considerable advantage. It is well known that the standard error is a downward-biased estimator of the standard deviation of the sampling distribution when samples are small. Consider, for instance, the standard error of the mean: SE(x̄) = SD(x)/√n. This estimator of the standard error relies on the sample standard deviation SD(x). Unfortunately, the latter is known to be a (downward) biased estimator of the population standard deviation in small samples, even if it is based on an unbiased estimator of the population variance, as provided by the usual estimator Σ(x_i − x̄)²/(N − 1) (Gurland and Tripathi, 1971).
The well-established solution to this problem, going back to the work of William Gosset (1908), is to use a t-distribution with appropriate degrees of freedom for statistical inference.
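The small-sample bias of the sample standard deviation is easy to demonstrate by simulation. The sketch below (our own illustration; the sample size of 5 is an arbitrary choice) draws many small samples from a standard normal population and averages the sample SDs:

```python
import numpy as np

rng = np.random.default_rng(1908)

# Average sample SD over many draws of n = 5 observations from a standard
# normal population (true SD = 1), using the usual N - 1 denominator.
n, reps = 5, 200_000
samples = rng.normal(size=(reps, n))
mean_sd = samples.std(axis=1, ddof=1).mean()
print(f"average sample SD for n = {n}: {mean_sd:.3f} (population SD = 1)")
```

Even though the N − 1 denominator makes the variance estimator unbiased, the square root is concave, so the sample SD systematically underestimates the population SD of 1 at this sample size.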
Similar issues arise in the context of multilevel mixed effects regression. In particular, Elff et al. (2016) show that a t-distribution with appropriate degrees of freedom leads to accurate statistical inference for contextual (cluster-level) variables in multilevel models with few clusters. The focus on actual coverage rates in the main article allows us to implement this correction. If we focused on standard error bias, we would not be able to do this. Specifically, we would find apparent optimism of the standard errors and might misleadingly conclude that inference is anti-conservative when accurate inference is actually perfectly possible, provided that the appropriate t-distribution is used. This concern is obviously most serious for experimental conditions with few clusters.
A comparison of Tables A1 and A2 with the corresponding tables in the main article (Tables 2 and 3) illustrates this point. Tables A1 and A2 report relative standard error bias, that is, the difference between the average standard error estimate SE(γ̂) and the actual standard deviation of the coefficient estimates SD(γ̂) across the R Monte Carlo replications, expressed in % of SD(γ̂), or formally: 100 × (SE(γ̂) − SD(γ̂))/SD(γ̂). Results concerning the relative performance of the two models do not differ from the main article: standard error estimates for the cross-level interaction and the main effect of the lower-level variable generally show stronger negative bias for the model excluding the random slope than for the model including the random slope associated with the cross-level interaction. However, even the standard errors for the latter model appear to suffer from substantial negative bias, especially in the experimental conditions with only five clusters in Table A2. This contrasts very markedly with the corresponding results in the main article, where we find confidence interval coverage to be largely accurate (and even slightly over-conservative in some of the more extreme experimental conditions; see Appendix B below). As discussed above, the reason for these differences is that the use of the t-distribution corrects for the substantial downward bias of the standard errors in small (cluster-level) samples.

Appendix B Overcoverage with Very Low Random Slope Variation

As noted in the main article, the correctly specified model can produce over-conservative confidence intervals when the variation in the random slope is very small. With respect to significance testing, this means that a true null hypothesis will be rejected less frequently than the nominal level of the test suggests.
Do these results warrant a qualification of the recommendation to always include a random slope on the lower-level component of a cross-level interaction?
We would argue that the answer is almost always no, because overcoverage only arises under extreme conditions that have little practical relevance. This reassuring result notwithstanding, this appendix presents additional analyses that reduce the variability of the random slope even further, pushing R²(β^(x)_j) toward 1. As the random slope variation approaches zero, the errors of observations belonging to the same cluster become essentially uncorrelated. At least, that is, to the extent that such correlation is due to unobserved cluster-specific differences in the relationship between y_ij and x_ij; there may still be cluster-correlated errors due to a random intercept term or to random slope terms on other lower-level variables. When clustering becomes negligible in this way, the m − l − 1 rule for approximating the degrees of freedom for confidence intervals and t-tests may no longer work well, because it is based on the idea that m − l − 1 would be the correct degrees of freedom in the implicit cluster-/aggregate-level regression (Elff et al., 2016). Therefore, we also consider an alternative, computationally more intensive approximation: a generalization of the Satterthwaite (1946) method that was first proposed by Giesbrecht and Burns (1985); for an overview of degree-of-freedom approximations in the mixed effects context, see Schaalje et al. (2002). Elff et al. (2016) find the Satterthwaite method to perform very similarly to the m − l − 1 rule, but they do not consider the kinds of extreme situations where the above analysis shows the latter approach to produce overconservative inference. 13

Figure B1 plots the actual coverage rates of confidence intervals for the cross-level interaction term under these conditions of extremely low random slope variation, arguably much less than one could expect to encounter in most social science applications.
Figure B1 confirms the one qualification of our recommendation to always include a random slope on the lower-level component of a cross-level interaction term: in cases where the variance of the random slope term is extremely small, following this recommendation can result in overconservative inference, especially if the number of clusters is also very low. The problem seems to stem, at least partly, from the inaccuracy of the m − l − 1 approximation to the degrees of freedom, as confidence intervals based on the Satterthwaite method perform much better under extreme conditions. However, even the Satterthwaite method fails in the most extreme scenarios.
One might alternatively suspect that convergence problems are responsible for the overcoverage, because estimating a near-zero variance component can create problems for the optimization process. Yet disaggregated analysis of Monte Carlo trials with and without convergence warnings provides no support for this explanation. These results can be obtained from the replication files, which are part of the online supporting material.
While the findings of this section warrant a note of caution, we would like to emphasize again that both methods yield accurate statistical inference under practically relevant conditions (15 or more clusters and R 2 (β (x) j ) ≤ 0.9), whereas the results presented in the main article show models that omit the crucial random slope term to produce overly optimistic results in such situations.
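The role of the t-distribution in these results is easy to make concrete: for small cluster-level samples, the critical value from a t-distribution with m − l − 1 degrees of freedom is noticeably larger than the normal-based value, widening the confidence intervals and offsetting the downward bias of the standard errors. A quick illustration in Python (the values of m and l are hypothetical, and scipy is assumed to be available):

```python
from scipy.stats import norm, t

m, l = 15, 2          # hypothetical: 15 clusters, 2 cluster-level regressors
df = m - l - 1        # degrees of freedom under the m - l - 1 rule

z_crit = norm.ppf(0.975)   # critical value for a 95% CI, normal approximation
t_crit = t.ppf(0.975, df)  # critical value for a 95% CI, t with m - l - 1 df

# The t-based value (about 2.18 with 12 df) exceeds the normal value (1.96),
# so t-based confidence intervals are roughly 11% wider here.
print(round(z_crit, 3), round(t_crit, 3))
```

As the number of clusters grows, the two critical values converge, which is why the choice of reference distribution matters most in small cluster-level samples.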

Appendix C Model Selection Criteria are no Remedy
The simulation results in the main article suggest that practitioners who analyze cross-level interactions using mixed effects models are well advised to always include a random slope on the lower-level component. However, instead of opting for a random slope on a priori grounds, one might take a more data-driven approach and rely on standard model selection criteria, such as likelihood ratio tests or information criteria like AIC and BIC, in choosing a random effects specification. For example, as noted in the introduction, Raudenbush and Bryk (2002, p. 28) suggest that it might be appropriate to omit the random slope if its variance is 'very close to zero'. For want of an exact definition of 'very close', established model selection criteria are obvious candidates when it comes to determining whether a given slope variance is small enough to warrant omission.
Perhaps unsurprisingly, we would not recommend relying on model selection criteria to determine whether to include the random slope associated with a cross-level interaction. For the reasons given in Section 'Why Always a Random Slope?' in the main article, we would argue that these random slopes should be included in all practically relevant situations. To support this claim, this appendix summarizes additional Monte Carlo evidence demonstrating that data-driven approaches based on model selection criteria will lead to substantial undercoverage in at least some situations. We investigate this issue as follows: for each simulated data set, we determine whether a given model selection criterion favors the model with or without the random slope on the lower-level component. To assess the performance of a given selection criterion, we then calculate the actual coverage rate of the models thus selected.
We consider four model selection criteria. The first two are variants of a likelihood ratio/deviance test (LRT). Both are based on the difference in the deviance statistic (i.e., −2 times the log-likelihood) between the random intercept model and the model that includes the random slope term as well as its covariance with the random intercept. The first variant compares the difference in deviance against a Chi-square distribution with two degrees of freedom (one for the slope variance and one for the covariance). The null hypothesis of the test is that the variance and covariance parameters are jointly zero, so we choose the model including the random slope when the test result is significant (p < .05) and the random intercept model otherwise. It is well known that this test is overconservative (i.e., underrejects the null hypothesis) because the variance parameter cannot be smaller than zero. The second variant therefore uses the average of the p-values obtained from Chi-square distributions with one and two degrees of freedom (see, for example, Snijders and Bosker, 2012, 98f.). In addition to the two variants of the LRT, we consider Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC) as alternative selection criteria. We used R's anova function to calculate the deviance statistics and information criteria; it bases these on the (full) maximum likelihood rather than the restricted maximum likelihood. Confidence intervals for the calculation of coverage rates are based on restricted maximum likelihood estimates, however.

Figure C1 plots the actual coverage rates of the confidence intervals for the cross-level interaction term and the main effect of the lower-level component because only these are affected by omitting the random slope term (see Table 2 in the main article). We focus on a subset of the experimental conditions.
In particular, we show results for 15 clusters and a standard deviation of 1 on the lower-level predictor. Results for the other experimental conditions do not lead to qualitatively different conclusions and can be obtained from the replication files which are part of the online supporting material.
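The boundary-corrected p-value for the second LRT variant is straightforward to compute; a sketch in Python (the deviance difference of 5.5 is an arbitrary illustrative value, and scipy is assumed to be available):

```python
from scipy.stats import chi2

def lrt_pvalues(deviance_diff):
    """P-values for the LRT of a random slope (variance plus covariance).

    Returns the naive 2-df p-value and the boundary-corrected version that
    averages the 1-df and 2-df p-values (cf. Snijders and Bosker, 2012).
    """
    p_naive = chi2.sf(deviance_diff, df=2)
    p_mixture = 0.5 * (chi2.sf(deviance_diff, df=1) + chi2.sf(deviance_diff, df=2))
    return p_naive, p_mixture

# Hypothetical deviance difference between the two nested models
p2, pmix = lrt_pvalues(5.5)
print(round(p2, 4), round(pmix, 4))
```

For this illustrative value the naive test is insignificant (p ≈ .064) while the mixture test is significant (p ≈ .041), showing how the correction makes the test more likely to retain the random slope.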
The overall message emerging from Figure C1 is clear: when the goal is to achieve correct statistical inference for a cross-level interaction effect, it is not advisable to rely on model selection criteria in deciding whether to include a random slope on the lower-level predictor. For all four selection criteria, we find settings where reliance on the criterion results in noteworthy levels of undercoverage. This is not surprising: as we saw above, models that include the random slope term on the lower-level component generally lead to accurate inference, whereas models that omit the term suffer from undercoverage, with the extent of undercoverage depending on various aspects of the DGP. The model selection criteria investigated here will sometimes favor the model including the random slope, and sometimes the one omitting it. The coverage rate for a given model selection strategy will thus be a weighted average of the coverage rates for the correct model (i.e., the one with a random intercept and slope) and for the misspecified model (i.e., the one with only a random intercept). Taking a data-driven approach to model selection will therefore generally be better than selecting the model without a random slope a priori, but only because it sometimes favors the model including the random slope.
Detailed inspection of Figure C1 reveals some interesting patterns. The first is that model selection based on BIC performs worst and model selection based on AIC best, with the two variants of the LRT falling in between. LRTs using a mixture of Chi-square distributions with one and two degrees of freedom have a slight edge over the alternative because they more often reject the random intercept model. The reason why BIC performs more poorly than the other criteria is that it penalizes additional parameters more harshly, particularly in large samples, so it more often favors the model omitting the random slope, which is more parsimonious (BIC uses a penalty of log(n) per parameter, whereas AIC uses a constant penalty of 2; Burnham and Anderson, 2004). Another noteworthy pattern is that the performance of all four model selection strategies improves as the (implicit) R 2 (β (x) j ) of the cluster-level regression for the slope of x ij declines or, equivalently, as the standard deviation of the random slope (i.e., SD(u (x) j )) in the DGP increases. Intuitively, this is because all model selection strategies become more likely to favor the model that includes the random slope, the more variation the latter shows.
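The differing penalties are easy to make concrete; a sketch in Python, where the log-likelihoods, parameter counts, and sample size are made-up illustrative values:

```python
import math

def aic(loglik, k):
    # AIC: constant penalty of 2 per parameter
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    # BIC: penalty of log(n) per parameter, growing with sample size
    return -2 * loglik + math.log(n) * k

# Hypothetical fits: the random slope model adds 2 parameters (slope
# variance plus intercept-slope covariance) and improves the
# log-likelihood by 4 points; sample size n = 3000.
ll_intercept, k_intercept = -5000.0, 5
ll_slope, k_slope = -4996.0, 7
n = 3000

print(aic(ll_slope, k_slope) < aic(ll_intercept, k_intercept))        # True: AIC prefers the slope model
print(bic(ll_slope, k_slope, n) < bic(ll_intercept, k_intercept, n))  # False: BIC prefers omitting the slope
```

With log(3000) ≈ 8, the same improvement in fit that satisfies AIC is not enough to overcome BIC's heavier per-parameter penalty, which is exactly the mechanism driving BIC's poorer coverage performance here.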
Finally, the performance of the different model selection strategies depends on the number of observations per cluster. Model selection based on AIC and the two variants of the LRT tends to improve as the number of observations per cluster increases (except when the random slope shows very little variation, with an implied cluster-level R 2 (β (x) j ) of 0.95). This is because both the LRT and AIC more often favor the model that includes the random slope in larger samples. The reason why BIC behaves differently is, again, that it penalizes additional parameters using a factor that depends on the sample size.
Overall, the impact of the cluster-level R 2 (β (x) j ) and the lower-level sample size on the performance of the different model selection strategies should be taken as illustrative. Their performance in applied settings will depend on various other (and partly unobservable) aspects of a given analysis. The main message to take away from Figure C1 is that there are practically relevant situations where reliance on model selection criteria will lead to anti-conservative inference for the cross-level interaction. These results make very clear that one should not blindly rely on model selection criteria in determining whether to include a random slope on the lower-level component of a cross-level interaction. Rather, as we emphasize in the main article, the default should be to specify the random slope term; for all practical purposes, we recommend always including it. There may be a very limited role for model selection criteria in situations characterized by negligible slope variation (see Appendix B above), but the results presented in this section show that selection criteria must not be the only factor taken into account, as they can easily lead to severely anti-conservative inference; in particular, the substantive magnitude of cross-cluster variation in the slope should be considered as well. Moreover, as also emphasized in the main article, we believe that situations where slope variation is so low that omitting the random slope might be a reasonable choice are rare exceptions in practice, at least for typical sociological applications. Our empirical examples (see Appendix D below), where we generally find substantial variation in the random slopes even after including the cross-level interactions with the HDI, support this view (see the final columns of Tables D1 to D6 below).
However, while we strongly suspect that these findings generalize to most other applications, we do not hesitate to admit that this is ultimately an empirical question that we cannot answer within the confines of our study.

Appendix D Additional Illustrative Empirical Analyses
To get a sense of how serious the consequences of omitting the random slope term for a cross-level interaction are in real-world applications, we conduct a series of illustrative analyses based on European Social Survey data (ESS Round 6, 2016).

Replication code for the analyses in Heisig et al. (2017) is part of the online supporting material. The education variables enter the models in weighted effect coding (Grotenhuis et al., 2016). The coefficient of the high education indicator, for instance, captures the (adjusted) difference in the respective outcome variable (e.g., fear of crime) between high-educated individuals and individuals whose level of education equals that of the average European. Its cross-level interaction with the HDI indicates whether this difference changes with a society's level of human development. Due to the presence of the cross-level interaction term, the main effect of the high education indicator must be interpreted as the conditional effect of having high education for a country with an HDI of zero, that is, for a country with an average level of human development. In addition to the lower-level predictor of interest, the HDI, and their cross-level interaction, the models always contain the other lower-level predictors as control variables. These controls are not interacted with other (lower- or upper-level) predictors. Further details are described in Heisig et al. (2017).
Tables D1 to D6 present a summary of the main results, omitting coefficient estimates for control variables. Results for fear of crime at the top of Table D1 show that the cross-level interaction between the HDI and having high education is negative and statistically significant, irrespective of whether we include a random slope term or not. The same holds for the main effect of being highly educated. Thus, the highly educated tend to be less afraid of crime than Europeans with average education, and their advantage in (perceived) security is particularly strong in countries with a high degree of human development.
Qualitatively, this conclusion does not depend on the random effects specification, but the model that does not contain a random slope for high education strongly overstates the precision with which we can estimate the cross-level interaction and the main effect of high education. The third column shows that the estimated standard errors (in parentheses) for these coefficients are substantially larger in the correctly specified model that includes the random slope: by 67.4% for the main effect and by 82.3% for the cross-level interaction. Accordingly, the absolute t-ratios (in brackets) are much smaller when the model is correctly specified: by 40.8% for the main effect and by 46.8% for the cross-level interaction. Over the 30 different models (5 dependent variables × 6 lower-level predictors), the reduction in the absolute t-ratio for the cross-level interaction effect due to including the random slope is 42.4% on average. The median reduction is 48.3%, and the 25th and 75th percentiles are 31.3% and 60.9%, respectively.

Note: Estimates are from linear mixed effects models. All estimates are controlled for age, marital status, unemployment, and intermediate and high (compared to low) education. Standard errors in parentheses, absolute t-statistics in brackets. + p < 0.1; * p < 0.05; ** p < 0.01; *** p < 0.001. The p-values for the HDI and, in the models including a random slope, also those for high education are based on the t-distribution with degrees of freedom approximated by the m − l − 1 rule (cf. Elff et al., 2016). The p-value for high education in the model without a random slope is based on the normal distribution.
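If the coefficient estimate itself were unchanged between specifications, the reduction in the absolute t-ratio implied by a given standard error inflation would follow mechanically. A back-of-the-envelope check in Python (the small discrepancies from the figures reported above arise because the coefficient estimates also change slightly when the random slope is added):

```python
def t_reduction_from_se_inflation(se_increase_pct):
    """Percent reduction in |t| implied by a given percent increase in the
    standard error, holding the coefficient estimate fixed."""
    factor = 1 + se_increase_pct / 100
    return 100 * (1 - 1 / factor)

# SE increases reported for the fear-of-crime example
print(round(t_reduction_from_se_inflation(67.4), 1))  # ~40.3 (reported: 40.8)
print(round(t_reduction_from_se_inflation(82.3), 1))  # ~45.1 (reported: 46.8)
```

The close match confirms that the drop in the t-ratios is driven almost entirely by the larger (and more honest) standard errors of the correctly specified model.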
If the studies in a collection have evidential basis, that is, if they investigate a true effect, the resulting p-curve will be right-skewed: highly significant results (e.g., p < .01) will occur more often than 'just-significant' results with a p-value of, say, .049. By contrast, if an effect does not really exist (i.e., if the null hypothesis is correct), the p-curve will be uniform. A uniform p-curve hence indicates publication bias: the published significant studies lack evidential basis, and the apparent positive empirical support for an effect arises only because insignificant results are rarely published.
The practice of p-hacking should have a different effect on the shape of the p-curve: authors who have successfully broken (hacked) the .05 threshold have little incentive to reduce the p-value further (to, say, p < 0.01 or even p < 0.001).
Thus, p-hacking should introduce a clustering of p-values just below .05 and thereby left skew into the p-curve.
In summary, p-curves come in three principal shapes, each of which (more or less directly and convincingly) supports different conclusions concerning the evidential basis as well as the research and publication processes underlying a given collection of studies: 1. a right-skewed shape indicates evidential basis for a true effect; 2. a uniform shape indicates no evidential basis for a true effect and therefore also indicates (the potential for) publication bias; 3. a left-skewed shape is indicative of p-hacking and the lack of evidential basis for a true effect.
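The first two shapes are easy to reproduce by simulation; a short sketch in Python (two-sided z-tests; the effect size in the 'true effect' scenario is an arbitrary assumption):

```python
import random
from statistics import NormalDist

random.seed(1)
norm = NormalDist()

def significant_pvalues(effect, n_tests=20000):
    """Two-sided p-values of z-tests, keeping only those below .05."""
    pvals = []
    for _ in range(n_tests):
        z = random.gauss(effect, 1)     # test statistic under the given effect
        p = 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value
        if p < 0.05:
            pvals.append(p)
    return pvals

null_p = significant_pvalues(effect=0)  # no true effect: uniform p-curve
true_p = significant_pvalues(effect=3)  # assumed true effect with E[z] = 3

# Under the null, significant p-values are uniform on (0, .05), so about
# 20% fall below .01; under a true effect, far more do (right skew).
share_below_01_null = sum(p < 0.01 for p in null_p) / len(null_p)
share_below_01_true = sum(p < 0.01 for p in true_p) / len(true_p)
print(round(share_below_01_null, 2), round(share_below_01_true, 2))
```

Left skew, by contrast, does not arise from honest testing under any effect size; it requires a process, such as p-hacking, that piles results up just below the .05 threshold.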
Empirical p-curves can combine these fundamental shapes. For example, a (left-skewed) p-curve with a clustering of p-values just below .05 and a near-uniform distribution otherwise would signal that both publication bias and p-hacking are at work. We return to this issue below.

The left-hand panels of Figure E1 show the p-curves for studies that include the random slope term; the right-hand panels focus on studies that omitted it. The top panels show the curves for studies that allowed us to obtain a reasonably precise figure for the p-value, while the bottom panels also include findings for which we had to derive the p-value from a significance indicator, such as *. Fortunately, the shapes of the p-curves are rather robust to the inclusion or exclusion of studies that did not report exact inferential statistics. We will therefore focus on the top panels. The red dotted line indicates the (uniform) p-curve that we would expect to find if the results of the studies were pure artifacts.

In the present context, however, a narrow application of p-curve analysis, one that simply tests each curve for right or left skew, runs into the problem that the p-curves could be both right- and left-skewed, that is, they could be u-shaped. This is for two reasons: first, as we do not review studies on a specific debate, but rather collections of studies that use the same modeling approach, there could be evidential basis among some and p-hacking among others, both at the same time. Second, and more importantly, a narrow interpretation of p-curve analysis has come under attack by Bruns and Ioannidis (2016), who argue that in observational studies omitted variable bias may create right-skewed p-curves even in the absence of an underlying causal effect. We acknowledge that many of the ESR findings are not causal but associational. However, the results presented in the main article raise another serious concern: the right-hand side p-curves in Figure E1 may be right-skewed simply because the omitted random slopes result in deflated p-values.
Our solution to these two problems is to exploit the following two assumptions. First, we assume that there is no systematic difference in power between studies that include and studies that omit the random slope term. Power differences might arise if one type of study investigated systematically stronger effects or worked with systematically larger samples than the other, a possibility that seems rather implausible. Second, we assume that authors potentially try to p-hack cross-level interaction terms but not the main effects of the lower-level variables. Studies that investigate cross-level interactions virtually always put the primary focus on the cross-level interaction term. The main effect of the lower-level variable, by contrast, is usually not of substantive interest; it is a conditional effect that depends on the scaling of the upper-level predictor involved in the interaction. The 'success' of an investigation of a cross-level interaction therefore primarily depends on the significance of the cross-level interaction term. At the same time, p-values for the main effects of the lower-level variable are affected by the omission of the random slope term in exactly the same way as p-values for the cross-level interaction terms. These two assumptions allow us to investigate whether the p-curves of studies that omit the random slope term are spuriously more right-skewed (by focusing on the lower-level main effects, which are not affected by p-hacking but are affected by omitting the slope term in the same way), and whether there is evidence of p-hacking (by comparing the p-curves of cross-level interaction terms against those of lower-level main effects).
Looking back at Figure E1, we can see that nearly 100% of the lower-level main effects estimated from models omitting the random slope term reach the highest level of significance (p < 0.01). By contrast, among studies that correctly estimate random intercept and slope models, the figure is only 70%. To test whether the two p-curves are indeed significantly different from one another, we employ a simple dichotomous test proposed by Simonsohn et al. (2014): we transform the p-curves into a binary variable (p < 0.025 vs. p > 0.025) and use a χ²-test to investigate whether there are significantly more p-values below 0.025 among studies omitting the random slope term than among those that include it. In principle, we could also conduct this comparison for the cross-level interaction effects. However, that comparison would be complicated by the peak of p-values near .05 for the models omitting the random slope (which is evidence of p-hacking, as discussed below).
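The dichotomous test amounts to a standard 2×2 chi-square test of independence; a sketch in Python, where the counts are made up for illustration and not taken from our coding of the ESR studies (scipy is assumed to be available):

```python
from scipy.stats import chi2_contingency

# Rows: model type; columns: p < .025 vs. p >= .025 (hypothetical counts)
table = [
    [48, 2],   # random slope omitted: almost all main effects highly significant
    [35, 15],  # random slope included
]

chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(round(p_value, 4))  # small p-value: the two p-curves differ significantly
```

A significant result in this setup indicates that the binarized p-curves differ between the two groups of studies, which is the comparison reported below.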
The χ²-test comparing the p-curves for the lower-level main effects shows that the curve for models without a random slope term is significantly more right-skewed (upper panels: p = 0.0036; lower panels: p = 0.0218). This could mean that these studies are better powered, but as noted above, this possibility appears quite unrealistic. An alternative, and much more likely, explanation again is that omitting the random slope term significantly deflates p-values, thus misleadingly amplifying the right skew of the p-curve. This second interpretation bolsters our claim from the main article: 'potential publication bias against insignificant findings [...] hits correctly specified cross-level interactions more often because their standard errors are not deflated'.
A final look at Figure E1 reveals another interesting comparison. In the right-hand panels (i.e., among studies that omitted the random slope term), the difference between the black solid and the green dashed p-curves (i.e., cross-level interaction terms and lower-level main effects) shows a distinct left skew and thus an indication of p-hacking. In the left-hand panels (i.e., among studies that include the random slope term), by contrast, the difference between the two p-curves seems negligible. We again use the dichotomous χ²-test to investigate whether this pattern is statistically significant. The results are telling and unaffected by the inclusion or exclusion of studies that did not report exact inferential statistics. Among studies that correctly specified random intercept and slope models to investigate cross-level interactions, there is no significant indication of p-hacking (top panel: p = 0.4028; lower panel: p = 1). By contrast, among studies whose authors specified their models incorrectly by omitting the random slope term, we observe a statistically significant indication of p-hacking (top panel: p = 0.0054; lower panel: p < 0.0001). In other words: for models that omit the random slope term, there is statistically significant evidence for a higher proportion of just-significant p-values and a lower proportion of highly significant results in the cross-level interaction case than in the lower-level main effect case. We consider this rather strong evidence of p-hacking because, as noted above, researchers usually have considerable incentive to hack the p-value for the cross-level interaction but not the one for the main effect.