
Cohort data with dropout: a simulation study comparing five longitudinal analysis methods

Abstract

Background

A simulation study was performed to visually demonstrate the problems with repeated measures ANOVA (RMA) and t-tests (TT) compared to linear mixed effects (LME), covariance pattern (CP) or generalized estimating equations (GEE) models in longitudinal cohort studies with dropout.

Methods

Data were generated for a realistic, observational study on health-related quality of life (HRQoL) in a small, heterogeneous sample of children undergoing anti-reflux surgery. Each generated sample comprised two groups: one with low levels (4–10%) of random dropout (missing completely at random, MCAR); the other with higher levels (10–40%), where the chance of dropout depended on lower baseline HRQoL (missing at random, MAR). Outcome data were simulated for four time points in a one-year period, assuming in both groups small but meaningful increases in HRQoL between baseline and 3 months, and thereafter constant levels to 12 months. Five analysis methods were applied to the simulated datasets: LME; CP; GEE; RMA; and independent TT at all time points (between groups) or paired TT on the difference between 12 and 0 months (within groups). Bias in estimated marginal means was examined, as were the coverage and width of 95% confidence intervals, and the power, for three within- and between-group contrasts.

Results

In the group with MCAR dropout, negligible bias was observed for all methods, coverage was close to 95%, and little difference in power was seen among methods. In the group with MAR dropout, independent and paired TT and RMA analyses displayed increasing bias and decreasing coverage and power as dropout increased. The paired TT also produced the widest confidence intervals on average, with the greatest variability. GEE displayed slightly lower coverage and higher power than LME and CP models, but bias and precision were otherwise comparable to LME and CP. LME and CP models produced unbiased results and close to 95% coverage, even in the case of 40% MAR dropout.

Conclusions

As expected, LME and CP models performed best in terms of bias and coverage even in the case of higher levels of MAR data. Paired TT and RMA produce biased results and poor coverage and precision in the presence of MAR data.

Background

Repeated measures on an outcome are often collected in medical studies where the primary interest lies in change in the outcome over time. When longitudinal data on subjects are collected in a study, special analysis techniques are required to properly account for the correlation of repeated measures within individuals [1,2,3,4]. An additional statistical benefit to repeated measures designs, compared to an outcome measured at a single point in time, is (in most circumstances) a decrease in the standard error (SE) of the treatment estimate, thereby increasing the power to detect a treatment effect [1].

A common problem in longitudinal studies is missing outcome measures due to dropout, missed visits or questionnaires not returned. Missing data have traditionally been classified into three categories: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) [5, 6]. In data that are MCAR, the probability of missingness of the outcome is assumed to be the same for all individuals. The less restrictive MAR assumes that missingness depends on data that have been observed (for instance groups of patients and/or previous outcomes). When the missingness of an outcome depends on variables not observed, data are MNAR [5, 6]. MCAR is often an unrealistic assumption for missing data in longitudinal studies [6]. The data analysis must therefore also take more complex types of missing data into account.

Repeated measures ANOVA (RMA) is still a common way of analyzing repeated measures in some medical fields, despite several well-known problems. Because RMA uses only cases with complete outcomes at all time points, some data are discarded. This type of complete case analysis can introduce bias in estimated means and/or mean differences, since respondents with complete data are rarely representative of the original sample [3, 7, 8], i.e. the data are not MCAR. In addition to introducing bias, discarding data collected on humans is also scientifically unethical for several reasons, including (but not limited to) a potential loss of statistical power due to larger SEs. RMA further assumes equal variances of the outcome measurements over time, and equal correlations between outcomes at all time points, neither of which is a realistic assumption for longitudinal data [3].

The use of paired and unpaired t-tests (TT) is also still relatively common in the analysis of repeated outcomes. Paired TT are used to test change from baseline at multiple time points, and independent samples TT are sometimes used to compare groups across multiple time points, though this practice has long been discouraged [9, 10]. Because they make use of observed data at one time point (independent samples) or complete cases (paired samples), TT will also produce biased results in the presence of MAR data. TT are also expected to have less power than models that use all data from a subject.

More appropriate methods for longitudinal data are (generalized) linear mixed effects models (LME), covariance pattern (CP) models [1], also called “analysis of response profiles” [3], and generalized estimating equations (GEE) [1]. All three models can be used to test both changes from baseline and differences between groups, using all available data on outcome measures. This will increase the power for both types of questions, and provide unbiased estimates in the presence of data that are MCAR [7, 8]; LME and CP models also provide unbiased estimates in the case of MAR data [3, 7, 8].

Because the more technical literature on statistical theory is not generally read by non-statistical audiences, several articles/tutorials aimed at a wider audience have demonstrated the bias of RMA and GEE under MAR in real-life data [8, 11]. However, because the data generating mechanism is not known in such examples, these demonstrations cannot be considered proof of which method(s) actually produced unbiased estimates.

In order to visually demonstrate the bias in RMA and TT (paired and unpaired) to clinical researchers, we performed a simulation study in which we varied type (MCAR and MAR) and amount of monotonic missingness (in the form of dropout) for a continuous, longitudinally measured outcome. This was done for a hypothetical, but realistic, observational study on anti-reflux surgery (ARS) in children.

Methods

Aims

The aim was to demonstrate the bias of methods of analysis that use complete cases (RMA, paired TT), and of methods that use complete data at one moment in time (independent TT between groups), compared to more appropriate longitudinal models such as LME, CPM or GEE, especially in smaller samples. In addition to bias, the precision and power of the methods were examined.

Data-generating mechanisms

Based on results of several pediatric ARS studies [12,13,14,15,16,17], longitudinal health-related quality of life (HRQoL) data were simulated for 25 neurologically normal (NN) and 25 neurologically impaired (NI) children at 0, 3, 6 and 12 months, mimicking patterns that might occur after fundoplication.

Children in group 1 (NN) were assumed to have a mean HRQoL of 75 at time 0, increasing to 80 at 3 months and remaining stable to one year post-intervention. Children in group 2 (NI) were assumed to have a mean HRQoL of 50, increasing to 60 at 3 months and remaining stable to 12 months. In both groups, a multivariate normal distribution was used, with means as mentioned, a standard deviation of 15 points, and a month-to-month autocorrelation of 0.85 (or a correlation between baseline and one-year HRQoL of \({0.85}^{12}\approx 0.14\)). The MASS [18] package (v7.3.58.1) for R was used to simulate the datasets. In some simulations, HRQoL for individuals exceeded 100. Although this is not possible in practice (most QoL scores are restricted between 0 and 100), in order to demonstrate the bias due to missingness (and not due to ceiling effects) we did not truncate HRQoL scores.
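The article's datasets were generated with the MASS package in R (Supplementary Material 2). As a language-neutral illustration, the same structure (group-specific means, SD 15, and correlation 0.85 raised to the gap in months between visits) can be sketched using the Markov property of the AR(1) process; the function and variable names below are our own, not the article's code:

```python
import math
import random

def simulate_group(means, months=(0, 3, 6, 12), sd=15.0, rho=0.85,
                   n=25, rng=None):
    """Draw n subjects whose HRQoL at the visit months is multivariate
    normal with the given means, SD `sd`, and correlation rho**|t_i - t_j|.

    The AR(1) structure is exploited via its Markov property: given the
    standardized score z at the previous visit,
        z_next = r * z + sqrt(1 - r**2) * eps,  r = rho**gap,  eps ~ N(0, 1),
    which avoids building and factoring the full 4x4 covariance matrix.
    """
    rng = rng or random.Random()
    subjects = []
    for _ in range(n):
        z = rng.gauss(0, 1)
        scores = [means[0] + sd * z]
        for k in range(1, len(months)):
            r = rho ** (months[k] - months[k - 1])
            z = r * z + math.sqrt(1 - r ** 2) * rng.gauss(0, 1)
            scores.append(means[k] + sd * z)
        subjects.append(scores)
    return subjects

# Group 1 (NN): means 75, 80, 80, 80; group 2 (NI): means 50, 60, 60, 60
nn = simulate_group([75, 80, 80, 80], rng=random.Random(1))
ni = simulate_group([50, 60, 60, 60], rng=random.Random(2))
```

As in the article, scores are not truncated at 100, so simulated values can exceed the usual QoL range.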

Monotone missing data (in the form of dropout) were generated for the outcome variable in both groups using the exponential distribution.

In group 1 (NN), dropout was assumed to be completely at random, and percentages varied from 4 to 10% within one year.

Rates of dropout for the NI group were assumed to be higher (10–40%), and dependent on the baseline HRQoL (MAR): children with a low score were expected to drop out more often. In the simulations, children with a HRQoL score lower than the median were assumed to be four times more likely to have missing data than children with a score above the median. Four scenarios were considered:

  1. 4% dropout in group 1 (MCAR) and 10% dropout in group 2 (MAR).

  2. 6% dropout in group 1 and 20% dropout in group 2.

  3. 8% dropout in group 1 and 30% dropout in group 2.

  4. 10% dropout in group 1 and 40% dropout in group 2.
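The article specifies exponential dropout times and a fourfold dropout risk for children below the median baseline HRQoL, but not the exact parameterization. The sketch below (our own construction, in Python rather than the article's R) calibrates the base rate by bisection so that the expected proportion of dropout by 12 months matches the scenario target; setting hazard_ratio to 1 gives MCAR dropout:

```python
import math
import random

def apply_dropout(subjects, baselines, target, hazard_ratio=4.0,
                  months=(0, 3, 6, 12), rng=None):
    """Censor scores after an exponentially distributed dropout time.

    target: desired overall proportion of subjects dropped out by 12 months.
    Subjects with baseline HRQoL below the sample median drop out at
    hazard_ratio times the base rate (MAR); hazard_ratio=1 gives MCAR.
    The parameterization is our own, for illustration only.
    """
    rng = rng or random.Random()
    med = sorted(baselines)[len(baselines) // 2]
    ratios = [hazard_ratio if b < med else 1.0 for b in baselines]

    def expected(lam):
        # mean probability of dropout by 12 months under base rate lam
        return sum(1 - math.exp(-12 * lam * r) for r in ratios) / len(ratios)

    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection: expected() is increasing in lam
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if expected(mid) < target else (lo, mid)
    lam = (lo + hi) / 2

    observed = []
    for scores, r in zip(subjects, ratios):
        t_drop = rng.expovariate(lam * r)  # dropout time in months
        observed.append([s if t <= t_drop else None
                         for s, t in zip(scores, months)])
    return observed
```

Because the exponential dropout time censors all visits after it, the resulting missingness is monotone, and the baseline measurement (month 0) is always observed.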

Targets of analysis

The primary target is the estimated mean HRQoL at all time points, for NN and NI separately, on the basis of estimated marginal means (EMM) for LME, CP, GEE and RMA models, and observed means for TT.

The secondary targets are three contrasts that are often of interest in longitudinal studies and randomized, controlled trials: the change from baseline to one year in each group, and the difference between the groups at 12 months follow-up.

Longitudinal data analysis methods

Each generated dataset was analyzed in five ways:

  1. TT: t-tests (paired t-tests on the difference in mean HRQoL at 12 vs. 0 months within each group, to examine within-group change from baseline to the end of follow-up; unpaired: Welch’s t-test for the difference in mean HRQoL between NN and NI at 12 months, to compare the groups at the end of follow-up), both with the t.test() function. Means for both were obtained from the observed data: complete cases at both time points for the within-group contrasts, and complete cases at one time point for the between-group contrast.

  2. RMA, using the aov_car() function in the afex [19] package (v1.2.0).

  3. GEE with a first-order autoregressive working correlation matrix, using the geese() function in the geepack [20] package (v1.3.9).

  4. CPM: a generalized linear least squares model with correlated errors, using a continuous first-order autoregressive correlation structure and homogeneous variances over time, estimated with the gls() function in the nlme [21] package (v3.1.160).

  5. LME: a model with a random intercept and a random slope for time (modelled as continuous), using the lme() function in the nlme package.

More information on each model can be found in Supplementary Material 1. For the TT the usual estimates, SEs, and degrees of freedom (df) were used. The latter four models included an effect for group, an effect for time (modelled as categorical), and an interaction between the two. For each model, the resulting mean HRQoL for the two groups at the different time points (estimated marginal means) and their SEs and df were estimated using the emmeans [22] package (v1.8.3). For RMA and LME models, the default options in emmeans were used (denominator df for RMA and Kenward-Roger df for LME [22, 23]). For CPM, approximate Satterthwaite df were used, and for the GEE the robust variance-covariance option was used. Contrasts corresponding to the paired (changes within groups) and unpaired t-tests (difference between groups) were estimated for each of the models.
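The t-test contrasts were computed with R's t.test(); for readers who want to see exactly what those contrasts compute, a minimal Python sketch of the two statistics (Welch's unpaired t with Welch–Satterthwaite df, and the paired t on complete cases) is:

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's unpaired t statistic and Welch-Satterthwaite df
    (the between-group contrast: complete cases at one time point)."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x), variance(y)      # sample variances
    se2 = vx / nx + vy / ny
    t = (mean(x) - mean(y)) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

def paired_t(baseline, followup):
    """Paired t statistic on within-subject differences
    (the within-group contrast: complete cases at both time points)."""
    d = [a - b for b, a in zip(baseline, followup)
         if b is not None and a is not None]
    n = len(d)
    t = mean(d) / math.sqrt(variance(d) / n)
    return t, n - 1
```

The listwise filtering in paired_t makes explicit why the paired TT discards any subject missing either measurement, which is the source of its bias under MAR dropout.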

The simulations and analyses were performed in R 4.0.3 [24]. R code for the generation of the datasets and the analyses can be found in Supplementary Material 2.

Performance measures

Bias was estimated by comparing the EMMs to the true population means. For precision, the empirical SE (EMM and contrasts), % gain in relative efficiency (EMM and contrasts) for all methods compared to LME, and coverage of 95% CI’s for the contrasts were assessed. The power of the three contrasts was also examined. Means or percentages of these performance measures were plotted with 95% CI’s based on Monte Carlo SEs (MCSE) [25]. Widths of the 95% CI’s for the three contrasts were also examined. Finally, any convergence problems with the LME, CPM, GEE or RMA models were described. Analysis of the simulations was performed using the rsimsum [26] (v0.11.3) and ggplot2 [27] (v3.4.0) packages. The exception was power of the contrasts, which was calculated outside of rsimsum (at the time of writing the package used infinite df instead of model df for power).
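The key performance measures were computed with the rsimsum package; conceptually, per method and contrast they reduce to the following summary over simulation replicates (a hypothetical Python helper for illustration, not the article's R code):

```python
import math

def performance(estimates, cis, truth):
    """Bias, empirical SE, and coverage for one method and contrast,
    given the replicate estimates, their 95% CIs, and the known truth."""
    n = len(estimates)
    m = sum(estimates) / n
    bias = m - truth                       # mean estimate minus true value
    emp_se = math.sqrt(sum((e - m) ** 2 for e in estimates) / (n - 1))
    coverage = sum(lo <= truth <= hi for lo, hi in cis) / n
    # Monte Carlo SE of the coverage estimate, used for the 95% MC intervals
    mcse_cov = math.sqrt(coverage * (1 - coverage) / n)
    return bias, emp_se, coverage, mcse_cov
```

Relative efficiency versus LME (reported in the supplementary material) is then the ratio of the methods' squared empirical SEs.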

Number of simulations

Since no estimate of the variance of the bias was available ahead of time, coverage was used to determine \(n_{sim}\), the number of simulated datasets. Using the formula for the MCSE of coverage [25], 7600 datasets were generated in order to achieve a coverage MCSE of 0.25%.
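The calculation follows directly from the MCSE formula for coverage in [25], \(\sqrt{p(1-p)/n_{sim}}\), solved for \(n_{sim}\) at the nominal coverage:

```python
# MCSE(coverage) = sqrt(p * (1 - p) / n_sim); solve for n_sim at the
# nominal coverage p = 0.95 and a target MCSE of 0.25 percentage points.
p, mcse = 0.95, 0.0025
n_sim = p * (1 - p) / mcse ** 2
print(round(n_sim))  # 7600
```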

Results

Bias

Fig. 1 displays the bias in estimated means for the five methods in the two groups. In group 1 (MCAR), all five methods performed similarly in terms of bias: very close to 0 at all time points and in all scenarios (4–10% MCAR dropout). In group 2 (MAR), GEE, CPM and LME all performed similarly, with bias very close to 0 at all times and in all scenarios. The independent TT displayed some bias at all times beyond baseline, with increasing bias (up to about one point) as dropout increased. The RMA analysis (based on complete cases) displayed bias at all times (highest at time 0) and in all scenarios, with means estimated at 0.7–4.7 points above the true means in the population with 40% dropout.

Fig. 1
figure 1

Bias (with 95% Monte Carlo CI) in estimated marginal means in four scenarios. On the basis of 7600 simulations using 25 patients in each group. (a) Group 1, MCAR dropout scenarios 1: 4%, 2: 6%, 3: 8% and 4: 10%. (b) Group 2, MAR dropout scenarios 1: 10%, 2: 20%, 3: 30% and 4: 40%. RMA: repeated measures ANOVA; TT: independent t-tests; GEE: generalized estimating equations; CPM: covariance pattern model; LME: linear mixed effects model. Monte Carlo CI’s not visible because width within size of point estimate

Bias in the contrasts, which can also be deduced from the bias in the estimated means in Fig. 1, is presented in Fig. 2. There was no bias in the contrasts for any method for the difference between 0 and 12 months in the MCAR group, and the bias of the GEE, CPM and LME analyses was negligible for the within-MAR-group contrast. Paired TT and RMA displayed the same amount of bias (0.7–4.7 points) in the MAR group. Some bias (less than one point) was seen between the groups at 12 months for RMA and independent TT at higher levels of dropout.

Fig. 2
figure 2

Bias (with 95% Monte Carlo CI) in the within- and between-group contrasts in four scenarios. On the basis of 7600 simulations using 25 patients in each group. Column 1: group 1 contrast 0–12 months; column 2: group 2 contrast 0–12 months; column 3: group 1-group 2 contrast at 12 months. Group 1 MCAR dropout scenarios 1: 4%, 2: 6%, 3: 8% and 4: 10%. Group 2, MAR dropout scenarios 1: 10%, 2: 20%, 3: 30% and 4: 40%. RMA: repeated measures ANOVA; TT: paired sample t-tests (within-group contrasts) or independent samples t-tests (between-group contrasts); GEE: generalized estimating equations; CPM: covariance pattern model; LME: linear mixed effects model. Monte Carlo CI’s not visible because width within size of point estimates

Precision

Empirical SE and relative precision

The empirical SE of the estimated means for RMA was slightly higher than for the other methods in the MCAR group (except at 12 months), and that pattern was more pronounced in the MAR group (Fig. 3). At 12 months, the empirical SE for LME was also slightly increased in the scenarios with 30–40% MAR dropout, compared to the other methods and to lower levels of missingness. There was up to 10% loss in relative precision for RMA (compared to LME) in the MCAR group, and up to 40% loss in the MAR group (Supplementary Material 3).

Fig. 3
figure 3

Empirical standard errors (with 95% Monte Carlo CI) in estimated marginal means in four scenarios. On the basis of 7600 simulations using 25 patients in each group. (a) Group 1, MCAR dropout scenarios 1: 4%, 2: 6%, 3: 8% and 4: 10%. (b) Group 2, MAR dropout scenarios 1: 10%, 2: 20%, 3: 30% and 4: 40%. RMA: repeated measures ANOVA; TT: independent t-tests; GEE: generalized estimating equations; CPM: covariance pattern model; LME: linear mixed effects model. Monte Carlo CI’s not visible because width within size of point estimates

Figure 4 displays the empirical SE for the three contrasts of interest. The empirical SE was larger for TT and RMA for all contrasts, though this was most pronounced in the MAR group. In the between-group contrast, the difference in empirical SEs among the five methods was negligible. There was a 5–16% loss in relative precision for RMA and TT compared to LME in the MAR group. GEE and CPM had a 2–3% gain in relative precision compared to LME for the within-MAR-group contrast and the between-group contrast (Supplementary Material 3 and 4).

Fig. 4
figure 4

Empirical standard error (with 95% Monte Carlo CI) for the within- and between-group contrasts. On the basis of 7600 simulations using 25 patients in each group. Column 1: group 1 contrast 0–12 months; column 2: group 2 contrast 0–12 months; column 3: group 1-group 2 contrast at 12 months. Group 1 MCAR dropout scenarios 1: 4%, 2: 6%, 3: 8% and 4: 10%. Group 2, MAR dropout scenarios 1: 10%, 2: 20%, 3: 30% and 4: 40%. RMA: repeated measures ANOVA; TT: paired sample t-tests (within-group contrasts) or independent samples t-tests (between-group contrasts); GEE: generalized estimating equations; CPM: covariance pattern model; LME: linear mixed effects model

Coverage

The coverage of the 95% CI’s for the within- and between-group contrasts is presented in Fig. 5. For all three contrasts and all scenarios, coverage was similar for CPM and LME, and very close to 95%. The coverage for GEE was consistently lower for all contrasts at all time points, though never lower than 92.3%. The coverage of the other methods was similar and very close to 95% for the 0–12 month increase in group 1 (MCAR). For the same contrast in group 2 (MAR), coverage became progressively worse for the paired TT (88.3–94.6%) and RMA (88.5–94.9%) as dropout increased. Coverage of the between-group contrast (difference in HRQoL between the two groups at 12 months) was at or just above 95% for all methods except GEE.

Fig. 5
figure 5

Coverage (with 95% Monte Carlo CI) of the 95% CI’s for the within- and between-group contrasts. On the basis of 7600 simulations using 25 patients in each group. Column 1: group 1 contrast 0–12 months; column 2: group 2 contrast 0–12 months; column 3: group 1-group 2 contrast at 12 months. Group 1 MCAR dropout scenarios 1: 4%, 2: 6%, 3: 8% and 4: 10%. Group 2, MAR dropout scenarios 1: 10%, 2: 20%, 3: 30% and 4: 40%. RMA: repeated measures ANOVA; TT: paired sample t-tests (within-group contrasts) or independent samples t-tests (between-group contrasts); GEE: generalized estimating equations; CPM: covariance pattern model; LME: linear mixed effects model

Width of 95% confidence intervals for contrasts

Widths of the 95% CI’s for the three contrasts are presented in Fig. 6. In all scenarios and for all contrasts, the CPM gave, on average, the narrowest 95% CI’s and the most consistent results (smallest range in CI widths). There was very little difference in the median or range of CI widths for LME and RMA for the first and third contrasts; for the within-group MAR contrast, RMA had wider CI’s than LME, CPM or GEE. The paired TT gave wider CI’s, and the widths displayed more variation (with outliers towards wider CI’s) than the other methods, especially in the group with 40% MAR dropout. GEE had relatively low median widths, but more variation in widths.

Fig. 6
figure 6

Width of 95% CI’s for the within- and between-group contrasts. For two of the four scenarios, on the basis of 7600 simulations using 25 patients in each group. Column 1: group 1 contrast 0–12 months; column 2: group 2 contrast 0–12 months; column 3: group 1-group 2 contrast at 12 months. Group 1 MCAR dropout scenarios 1: 4%, 2: 6%, 3: 8% and 4: 10%. Group 2, MAR dropout scenarios 1: 10%, 2: 20%, 3: 30% and 4: 40%. RMA: repeated measures ANOVA; TT: paired sample t-tests (within-group contrasts) or independent samples t-tests (between-group contrasts); GEE: generalized estimating equations; CPM: covariance pattern model; LME: linear mixed effects model

Power

The power of the contrasts is presented in Fig. 7. Power for the within-group contrasts was low for both groups and lower in group 1 than in group 2, consistent with the size of the increase in HRQoL for the two groups. No large differences in power were observed for group 1 (MCAR), although GEE had the highest power (approximately 0.27) across all scenarios. In group 2 (MAR) the paired TT and RMA had considerably lower power than the other three methods. Power decreased as dropout increased for all methods, though most markedly for TT and RMA. GEE again had the highest power in all scenarios, followed by CPM and LME. The power for the between-group contrast was very close to one for all methods in all scenarios, though power of RMA and independent TT decreased slightly for the higher levels of missingness.

Fig. 7
figure 7

Power (with 95% Monte Carlo CI) of the within- and between-group contrasts. On the basis of 7600 simulations using 25 patients in each group. Column 1: group 1 contrast 0–12 months; column 2: group 2 contrast 0–12 months; column 3: group 1-group 2 contrast at 12 months. Group 1 MCAR dropout scenarios 1: 4%, 2: 6%, 3: 8% and 4: 10%. Group 2, MAR dropout scenarios 1: 10%, 2: 20%, 3: 30% and 4: 40%. RMA: repeated measures ANOVA; TT: paired sample t-tests (within-group contrasts) or independent samples t-tests (between-group contrasts); GEE: generalized estimating equations; CPM: covariance pattern model; LME: linear mixed effects model. Monte Carlo CI’s not visible because width within size of point estimates

Convergence problems

The LME analyses produced warnings about convergence in 0.03%, 0.05%, 0.05% and 0.11% of the simulations in dropout scenarios 1–4, respectively, though it was still possible to estimate the models and the resulting EMMs and contrasts. No convergence problems were encountered with the other four analysis methods. However, in two simulations with higher levels of missingness (one each in scenarios 3 and 4), an error was encountered in the EMMs for CPM. In those simulations, error df had to be used instead of the approximate Satterthwaite df.

Discussion

Previous articles and books have presented theoretical arguments for the use of linear mixed effect models, CPM or GEE for analyzing longitudinal data [1, 3, 8]. A few studies have demonstrated the bias of TT and RMA on existing datasets [8, 11]. To our knowledge, this is the only study comparing more advanced methods (LME, CPM, GEE) to the traditional methods (TT, RMA) using simulated datasets.

While the independent TT for the between-group contrast did not induce much bias and had coverage close to 95% even with high levels of dropout, the precision of the paired TT for the increase in HRQoL tended to be lower than that of the same contrast from an LME, CPM or GEE model: TT had consistently wider CI’s than the CP and GEE analyses for the group with MAR dropout. To a lesser extent this was also the case in the group with lower levels of MCAR dropout, though in that group the highest level of dropout was 10%. Increasing dropout would likely have further decreased the precision of the paired TT in the MCAR group as well.

As expected, RMA and paired TT gave biased estimates and poorer coverage for the group with MAR. This is primarily due to the fact that these methods employ listwise deletion, removing every individual with one or more missing outcomes. RMA also depends on unrealistic assumptions about correlations of repeated measures over time [3, 7]. Surprisingly, then, the 95% CI’s for contrasts after RMA were relatively narrow, and were only slightly wider than the LME CI’s in the within-MCAR and between-group contrasts. This could be due to the lower levels of missingness in the group with MCAR.

While the bias in GEE was very similar to LME and CPM, the coverage for all contrasts and scenarios was slightly lower, and the power slightly higher, for GEE than for LME or CP models. This is likely due to the use of z-testing in the GEE contrasts, which in these small samples resulted in narrower 95% CI’s.

Results for the CP models were nearly identical to those of the LME models for both bias and coverage, with better precision for CP models. The major drawback to LME was the occasional problem with convergence. Also, some small problems were seen in precision of LME, likely because linear random effects were applied to non-linear trends. The use of linear random effects for time is common even when the overall (fixed) time trend is not completely linear. It is interesting to note that this method may produce less precise estimates than a CP model, especially at later time points.

Strengths and limitations

The advantage of using simulations is that the data-generating mechanism is known. It is therefore possible to calculate average bias, precision, and coverage for the different methods. The current study used the ADEMP framework [25] for design and reporting.

Several limitations of the current study must be addressed. Generated HRQoL values were not truncated at 0 and 100, because the purpose of the study was to demonstrate the performance of the five methods on normally distributed outcomes. If HRQoL values had been subject to a ceiling of 100 (as they are in practice), all methods discussed here would have produced biased results for the group with high levels of HRQoL. In the case of ceiling or floor effects, other models are recommended [28, 29].

Another assumption made for reasons of simplicity was that the variance of HRQoL was the same for both groups at all time points. This may not be a realistic assumption for longitudinal data collected in disparate patient groups (e.g. NN vs. NI). LME, CPM and independent TT can easily be adjusted to account for differences in variances, and GEE with a robust estimator should account for heterogeneity. Future simulation studies could vary heterogeneity between groups and/or time points to compare models that are corrected for heterogeneity to those that are not.

To ground the simulation in a realistic setting, low levels of missingness were used in the NN group. Since this was also the group in which we assumed MCAR, it was difficult to assess the effect of large amounts of MCAR data on precision. That the missingness was only in the outcome and depended only on the baseline HRQoL was also a simplification for purposes of demonstration; in reality, missing data will depend on more factors. A study using more complicated missingness structures and comparing similar levels of MCAR and MAR data could help clarify this. Finally, the mechanism for dropout in the MAR group may not have been completely realistic; that children with lower baseline HRQoL would be four times more likely to drop out than children with higher HRQoL is likely extreme, and was chosen for purposes of demonstration. Under less extreme conditions, the bias of TT and RMA is expected to be lower. The interested reader can use the R code provided in the supplementary material to try other values for the probability of dropout and examine their effects on bias, precision, and power.

The samples used in the simulation were relatively small. This choice reflects the reality of pediatric surgery studies, but not of all medical studies. The simulations were therefore repeated (see Supplementary Material 5) with a sample size of 200 (100 per group). While precision and power improved in all analyses with larger samples, power and relative precision were still worse for RMA in both groups for the within-group contrasts, and for TT in the within-group contrasts with non-random dropout. Bias was identical to that found in the small datasets, and coverage of the contrasts was similar or worse for RMA in the larger datasets.

Power for the between-group contrast was quite high and nearly indistinguishable across the methods. This was primarily due to the assumption of a 20-point difference between the groups (equivalent to a difference of one standard deviation). A smaller difference between groups, though less realistic for this particular situation, would have elucidated the differences in power across the methods.

Finally, no multiple imputation method was used. RMA, GEE and TT might all have performed better, certainly in terms of bias, if missing data were first imputed. A previous study found differences in estimates from GEE before and after multiple imputation [8], though that study analyzed an existing dataset for which the true data-generating mechanism was not known and could therefore not estimate bias of either method.

Conclusions

As expected, LME and CP models performed best in terms of bias and coverage, even in the case of higher levels of MAR data. CPM slightly outperformed LME in this study, likely due to the non-linear trends over time. All methods gave fairly comparable results for low levels of MCAR data. Paired TT and RMA produced biased results and poor coverage and precision in the presence of MAR data. GEE produced unbiased results for both MCAR and MAR data, but slightly too narrow 95% CI’s, resulting in slightly poorer coverage and exaggerated power.

Data availability

All data generated or analyzed during this study can be generated using the R code in Supplementary Material 2.

Abbreviations

ARS:

Anti-reflux surgery

CP(M):

Covariance Pattern (model)

df:

degrees of freedom

GEE:

Generalized Estimating Equations

LME:

Linear Mixed Effects

MAR:

Missing At Random

MCAR:

Missing Completely At Random

MNAR:

Missing Not At Random

MCSE:

Monte Carlo Standard Error

NI:

Neurologically Impaired

NN:

Neurologically Normal

HRQoL:

Health-related Quality of Life

RMA:

Repeated Measures ANOVA

SE:

Standard Error

References

  1. Hedeker D, Gibbons RD. Longitudinal Data Analysis. Hoboken, NJ: Wiley-Interscience; 2006.

  2. Laird NM, Ware JH. Random effects models for longitudinal data. Biometrics. 1982;38:963–74.

  3. Fitzmaurice GM, Ravichandran C. A primer in longitudinal data analysis. Circulation. 2008;118(19):2005–10.

  4. Albert PS. Tutorial in biostatistics: longitudinal data analysis (repeated measures) in clinical trials. Stat Med. 1999;18:1707–32.

  5. Little RJA, Rubin DB. Statistical Analysis with Missing Data. 3rd ed. Hoboken, NJ: John Wiley & Sons; 2020.

  6. van Buuren S. Flexible Imputation of Missing Data. 2nd ed. New York: Chapman and Hall/CRC; 2018. p. 444.

  7. Gibbons RD, Hedeker D, DuToit S. Advances in analysis of longitudinal data. Annu Rev Clin Psychol. 2010;6:79–107.

  8. Bell ML, Fairclough DL. Practical and statistical issues in missing data for longitudinal patient-reported outcomes. Stat Methods Med Res. 2014;23(5):440–59.

  9. Matthews JNS, Altman DG, Campbell MJ, Royston P. Analysis of serial measurements in medical research. Br Med J. 1990;300:230–5.

    Article  CAS  Google Scholar 

  10. Omar RZ, Wright EM, Turner RM, Thompson SG. Analyzing repeated measurements data: a practical comparison of methods. Stat Med. 1999;18:1587–603.

    Article  CAS  PubMed  Google Scholar 

  11. Twisk JWR. Longitudinal data analysis. A comparison between generalized estimating equations and random coefficient analysis. Eur J Epi. 2004;19:769–76.

    Article  Google Scholar 

  12. Kawahara H, Okuyama H, Kubota A, Oue T, Tazuke Y, Yagi M, Okada A. Can laparoscopic antireflux surgery improve the quality of life in children with neurologic and neuromuscular handicaps? J Pediatr Surg. 2004;39(12):1761–4.

    Article  PubMed  Google Scholar 

  13. Srivastava R, Downey EC, Feola P, Samore M, Coburn L, Holubkov R, et al. Quality of life of children with neurological impairment who receive a fundoplication for gastroesophageal reflux disease. J Hosp Med. 2007;2(3):165–73.

    Article  PubMed  Google Scholar 

  14. Engelmann C, Gritsa S, Ure BM. Impact of laparoscopic anterior 270 degrees fundoplication on the quality of life and symptoms profile of neurodevelopmentally delayed versus neurologically unimpaired children and their parents. Surg Endosc. 2010;24(6):1287–95.

    Article  PubMed  Google Scholar 

  15. Kubiak R, Andrews J, Grant HW. Long-term outcome of laparoscopic nissen fundoplication compared with laparoscopic thal fundoplication in children: a prospective, randomized study. Ann Surg. 2011;253(1):44–9.

    Article  PubMed  Google Scholar 

  16. Knatten CK, Kvello M, Fyhn TJ, Edwin B, Schistad O, Aabakken L, et al. Nissen fundoplication in children with and without neurological impairment: a prospective cohort study. J Pediatr Surg. 2016;51(7):1115–21.

    Article  PubMed  Google Scholar 

  17. Stellato RK, Mulder FVM, Tytgat SHA, Oudman TS, vdZD C, van de Peppel– Mauritz FA, Lindeboom MYA. Two-year outcome after laparoscopic fundoplication in pediatric patients with gastroesophageal reflux disease. J Laparoendosc Adv Surg Tech A. 2020;in press.

  18. Venables WN, Ripley BD. Modern Applied statistics with S. Fourth edition. New York: Springer; 2002.

    Book  Google Scholar 

  19. Singmann H, Bolker B, Westfall J, Aust F, Ben-Shachar M. afex: Analysis of Factorial Experiments. R package version 1.2-0 ed2022.

  20. Højsgaard S, Halekoh U. The R Package geepack for generalized estimating equations. J Stat Softw. 2006;15(2):1–11.

    Google Scholar 

  21. Pinheiro J, Bates D, R Core Team. nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1–160 ed2022.

  22. Lenth R. emmeans: Estimated Marginal Means, aka Least-Squares Means. R package version 1.8.3 ed2022.

  23. Luke SG. Evaluating significance in linear mixed-effects models in R. Behav Res Methods. 2017;49(4):1494–502.

    Article  PubMed  Google Scholar 

  24. R Core Team. R: a Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2022.

    Google Scholar 

  25. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019;38(11):2074–102.

    Article  PubMed  PubMed Central  Google Scholar 

  26. Gasparini A. rsimsum: Summarise results from Monte Carlo simulation studies. J Open Source Softw. 2018(26).

  27. Wickham H. ggplot2: elegant graphics for data analysis. New York: Springer-; 2016.

    Book  Google Scholar 

  28. Twisk J, Rijmen F. Longitudinal tobit regression: a new approach to analyze outcome variables with floor or ceiling effects. J Clin Epidemiol. 2009;62(9):953–8.

    Article  PubMed  Google Scholar 

  29. Wang L, Zhang Z, McArdle JJ, Salthouse TA. Investigating Ceiling effects in Longitudinal Data Analysis. Multivar Behav Res. 2009;43(3):476–96.

    Article  Google Scholar 

Download references

Acknowledgements

We are grateful to Dr. Peter van de Ven for his assistance in brainstorming about the data generating mechanism.

Funding

No funding was provided for this study.

Author information

Contributions

RKS: Conceptualized and designed the study, performed the simulations and data analysis, interpreted the results, and drafted the manuscript. RvdB: Provided part of the simulation code and checked all code, aided with interpretation of the results, and substantially revised the manuscript. MS: Aided with interpretation of the results and substantially revised the manuscript. MYAL: Substantially revised the manuscript. MJCE: Contributed to the conception and design of the study and to interpretation and revision of the manuscript. All authors* read and reviewed the final manuscript.

*MJCE died before approving the final manuscript, but contributed to several drafts.

Corresponding author

Correspondence to Rebecca K. Stellato.

Ethics declarations

Ethics approval and consent to participate

No human subjects or human tissue were used; all data were simulated.

Consent for publication

Not applicable; no human subjects or human tissue were used.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Stellato, R.K., van den Bor, R.M., Schipper, M. et al. Cohort data with dropout: a simulation study comparing five longitudinal analysis methods. BMC Med Res Methodol 25, 103 (2025). https://doi.org/10.1186/s12874-025-02506-4


Keywords