Drivers of bias in diagnostic test accuracy estimates when using expert panels as a reference standard: a simulation study
BMC Medical Research Methodology volume 25, Article number: 106 (2025)
Abstract
Background
Expert panels are often used as a reference standard when no gold standard is available in diagnostic test accuracy research. It is often unclear what study and expert panel characteristics produce the best estimates of diagnostic test accuracy. We simulated a large range of scenarios to assess the impact of study and expert panel characteristics on index test diagnostic accuracy estimates.
Methods
Simulations were performed in which an expert panel was the reference standard to estimate the sensitivity and specificity of an index diagnostic test. Diagnostic accuracy was determined by combining probability estimates of target condition presence, provided by experts using four component reference tests, through a predefined threshold. Study and panel characteristics were varied in several scenarios: target condition prevalence, accuracy of component reference tests, expert panel size, study population size, and random or systematic differences between experts' probability estimates. The total bias in each scenario was quantified using mean squared error.
Results
When estimating an index test with 80% sensitivity and 70% specificity, bias in estimates was hardly affected by the study population size or the number of experts. Prevalence had a large effect on bias: scenarios with a prevalence of 0.5 estimated sensitivity between 63.3% and 76.7% and specificity between 56.1% and 68.7%, whereas scenarios with a prevalence of 0.2 estimated sensitivity between 48.5% and 73.3% and specificity between 65.5% and 68.7%. More accurate component reference tests also reduced bias: scenarios with four component tests of 80% sensitivity and specificity estimated index test sensitivity between 60.1% and 77.4% and specificity between 62.9% and 69.1%, whereas scenarios with four component tests of 70% sensitivity and specificity estimated index test sensitivity between 48.5% and 73.4% and specificity between 56.1% and 67.0%.
Conclusions
Bias in accuracy estimates when using an expert panel will increase if the component reference tests are less accurate. Prevalence, the true value of the index test accuracy, and random or systematic differences between experts can also affect bias, but the magnitude and even the direction of these effects vary between scenarios.
Introduction
Diagnostic accuracy studies are concerned with evaluating the performance of a diagnostic test. The test being evaluated is known as the index test. Typically, an index test result is compared to a reference standard [1, 2]. This reference standard determines whether a target condition (i.e., disease, condition, or health state of interest) is present or absent, and as such determines whether an index test result or condition classification is considered correct. For many target conditions the reference standard is not perfect, meaning it may misclassify an individual's true target condition. When there is no single perfect reference standard available, it is common to use a combination of several imperfect tests to create a reference standard [3, 4]. This reference standard, which may involve a fixed classification algorithm [5], a statistical (latent class) model [6], or an expert panel/consensus strategy [7], is used to determine the 'true' presence of the target condition. However, the composite reference standard itself is inherently imperfect due to the combination of these individual imperfect tests [8, 9].
Expert panels are groups of content experts that decide on the presence or absence of the target condition in an individual. Typically, expert panels provide a single dichotomous classification for each individual, i.e., target condition present or absent. This is done through methods such as consensus meetings or majority voting [7, 10, 11]. Although dichotomous classification of the target condition by experts within a panel is currently still standard practice in research, simulation studies have shown that it can introduce bias in accuracy estimates (e.g., sensitivity and specificity) of the index test when there is substantial uncertainty regarding the panel classification, i.e., when a set of test results occurs at similar rates in participants with and without the target condition [12].
An alternative to dichotomous classification is to ask experts to provide probabilistic estimates on whether the target condition is present in each study participant. This prevents loss of information on uncertainty through dichotomization but introduces new challenges [13]. While previous studies have shown that dichotomous reference standard classification can cause bias [12], it is still unclear how characteristics of the study (e.g., prevalence) and expert panel (e.g., number of experts) affect the validity of diagnostic accuracy estimates of an index test. Additionally, little is known about how to optimally combine estimates from different experts in an expert panel.
In this simulation study we assessed how diagnostic accuracy estimates of an index test are influenced by various study and expert panel characteristics. Specifically, we focused on scenarios where an expert panel is used as the reference standard and experts provide probability estimates on presence of the target condition.
Methods
This simulation study was designed using the structured ADEMP approach for planning simulation studies [14].
Estimands
Our estimands are the 'true' sensitivity (Se) and specificity (Sp), which reflect the diagnostic accuracy of the index test for assessing the true target condition, and the 'observed' sensitivity and specificity based on an expert panel as the reference standard.
Data-generating mechanism
We performed a full factorial assessment of study and panel characteristics. The following input parameters were specified: number of experts in the panel, number of study participants, target condition classification threshold, prevalence of the target condition, random and systematic differences between experts, and the sensitivity and specificity of the component tests. In all scenarios, the true index test sensitivity and specificity were both 0.8, representing a promising new test. There were 1944 scenarios in total; each was repeated 1000 times.
The parameters and their values are shown in Table 1. Parameters and their values were selected based on situations that are commonly encountered in diagnostic accuracy studies using expert panels as reference standard [15,16,17,18,19,20,21,22,23]. The sensitivity and specificity of simulated tests varied between 0.6 and 0.9, as sensitivity and specificity below 0.6 may be a reason not to use a test [24, 25] and index test sensitivity and specificity up to 0.9 are commonly reported [26,27,28]. Target conditions that already have tests with sensitivity or specificity above 0.9 are unlikely to need to use an expert panel as reference standard.
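For illustration, the full factorial design can be sketched as follows in Python. The parameter levels shown are assumptions assembled from the scenarios mentioned in the Results; the authoritative levels are those in Table 1, and the actual study code is available in the repository cited under Data availability.

```python
# Illustrative sketch of a full factorial scenario grid.
# The levels below are assumptions, not necessarily those of Table 1.
from itertools import product

experts        = [2, 3, 10]               # panel sizes
participants   = [100, 360, 1000]         # study population sizes
thresholds     = [0.2, 0.5, 0.8]          # classification thresholds
prevalences    = [0.2, 0.3, 0.4, 0.5]     # target condition prevalence (assumed levels)
component_sets = ["all_70", "all_80", "mirrored"]  # Se/Sp of the 4 component tests
differences    = ["none", "systematic", "random+systematic"]
overconfident  = [False, True]            # include an overconfident expert?

scenarios = list(product(experts, participants, thresholds, prevalences,
                         component_sets, differences, overconfident))
print(len(scenarios))  # 1944 combinations under these assumed levels
```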
We used a model-based data-generating mechanism consisting of five steps. Table 1 provides a description of the relevant parameters in the data-generating process.
In step 1 we simulated presence of the target condition for \(n_{obs}\) participants by drawing \(n_{obs}\) observations from a binomial distribution with success probability equal to the prevalence of the target condition (formula 1). A success (1) represented a participant with the target condition, whereas a failure (0) represented a participant without the target condition.
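A minimal sketch of step 1 in Python (a reconstruction under our own naming, not the study code):

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed for reproducibility

def simulate_condition(n_obs: int, prevalence: float) -> np.ndarray:
    """Step 1: draw true target condition status (1 = present, 0 = absent)
    for n_obs participants from a binomial distribution (formula 1)."""
    return rng.binomial(n=1, p=prevalence, size=n_obs)

d_true = simulate_condition(n_obs=1000, prevalence=0.4)  # e.g., 40% prevalence
```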
In step 2 the presence or absence of the target condition was used to simulate component test results. We distinguish two types of tests: component tests and an index test. In all scenarios, four component tests were simulated. Component tests are used in constructing the reference standard; since our reference standard is an expert panel, the component tests are the information the panel uses to determine the probability that an individual has the target condition. Component test results were calculated from the simulated true target condition status and the true sensitivity and specificity of the test, which may differ between tests. Results of participants with the target condition were simulated using \(n_{D+}\) draws from a binomial distribution with a success probability equal to the sensitivity of the test (formula 2.1). Results of participants without the target condition were simulated using \(n_{D-}\) draws from a binomial distribution with a success probability equal to one minus the specificity of the test (formula 2.2). Component test results were simulated independently, meaning they are independent conditional on the target condition.
The index test is the test under evaluation in a diagnostic accuracy study. The (binary) index test result was simulated in the same manner as the component tests, by drawing from a binomial distribution whose parameter values depend on true target condition status.
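Continuing the step 1 sketch, step 2 could be implemented as below; the specific test accuracies are example values:

```python
def simulate_test(d_true: np.ndarray, se: float, sp: float) -> np.ndarray:
    """Step 2: simulate a binary test result given true condition status.
    P(T+ | D+) = se (formula 2.1); P(T+ | D-) = 1 - sp (formula 2.2).
    Each test is drawn separately, i.e., conditionally independent given D."""
    p_positive = np.where(d_true == 1, se, 1.0 - sp)
    return rng.binomial(n=1, p=p_positive)

# Four component tests (here all with 70% Se/Sp) and the index test (80%/80%).
components = np.column_stack([simulate_test(d_true, 0.7, 0.7) for _ in range(4)])
index_test = simulate_test(d_true, 0.8, 0.8)
```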
In step 3 the component test results and prevalence of the target condition were used to calculate the probability estimates from the expert panel to use as the reference standard. The index test was never considered in the reference standard, only the component tests. Bayes' Theorem (formula 3) was used to derive perfectly calibrated probabilities of target condition presence given a pattern of component test results. For several numerical examples, see Table 2 in Jenniskens (2019) [12].
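Under conditional independence, the calibrated probability for a given pattern of component test results follows directly from Bayes' theorem; a sketch continuing the example above:

```python
def calibrated_probability(pattern, ses, sps, prevalence):
    """Step 3: perfectly calibrated P(D+ | component test pattern) via
    Bayes' theorem (formula 3), assuming conditionally independent tests."""
    p_pattern_dpos = np.prod([se if t else 1 - se for t, se in zip(pattern, ses)])
    p_pattern_dneg = np.prod([1 - sp if t else sp for t, sp in zip(pattern, sps)])
    numerator = prevalence * p_pattern_dpos
    return numerator / (numerator + (1 - prevalence) * p_pattern_dneg)

# E.g., three positive and one negative component test, all 70% Se/Sp, 40% prevalence.
p_cal = calibrated_probability([1, 1, 1, 0], ses=[0.7] * 4, sps=[0.7] * 4, prevalence=0.4)
```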
In step 4, we simulated random and systematic differences between experts on the expert panel. We simulated differences by drawing from a beta distribution with shape parameters alpha and beta equal to the expected number of positive index tests and the expected number of negative index tests for a particular pattern of component tests (formula 4). The beta distribution is bounded between 0 and 1, which is convenient for simulating probabilities.
We simulated random differences by drawing from a beta distribution for each participant, once per expert, based on the participant's component test pattern. The result is a perturbed estimate that differs for each participant. This process yields a probability estimate from each expert on each participant's probability of having the target condition.
Systematic differences were introduced at the level of the individual expert by calculating the probability of each possible set of component test results and then drawing from the beta distribution once per expert per set of component test results. Experts then assigned probability estimates to participants based on their component test results. This resulted in each expert having a slightly different estimate than other experts for any participant. However, unlike with random differences, participants with the same set of component test results receive the same estimate from a given expert.
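A sketch of one beta draw is shown below. The paper derives the shape parameters from expected counts for a given component test pattern; the `expected_n` argument is our stand-in for that expected count, and scaling the calibrated probability by it is our interpretation:

```python
def expert_estimate(p_cal: float, expected_n: float) -> float:
    """Step 4: draw one expert estimate around the calibrated probability from
    a beta distribution (formula 4). Shape parameters follow the expected
    positive/negative counts for the pattern (our interpretation)."""
    alpha = max(p_cal * expected_n, 1e-9)         # guard against zero shape values
    beta = max((1.0 - p_cal) * expected_n, 1e-9)
    return rng.beta(alpha, beta)

# Random differences: one draw per expert per participant.
# Systematic differences: one draw per expert per distinct test pattern,
# reused for every participant showing that pattern.
```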
In some scenarios step 5 was added to introduce an overconfident expert (i.e., an expert whose probability estimates skew towards 0 when below 0.5 and skew towards 1 when above 0.5). This was simulated by randomly selecting an expert in the panel who had all their probability estimates adjusted by half the distance between the estimate and either 0 or 1 (formula 5). For example, an estimate of 0.6 would be adjusted to 0.8 and an estimate of 0.2 to 0.1 when an expert is overconfident.
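Formula 5 can be reconstructed from the worked examples (0.6 becomes 0.8, 0.2 becomes 0.1): each estimate moves half the distance towards the nearer extreme. A sketch:

```python
def make_overconfident(p: np.ndarray) -> np.ndarray:
    """Step 5: skew estimates by half the distance to 0 or 1 (formula 5).
    Estimates above 0.5 move towards 1, estimates below 0.5 towards 0;
    leaving 0.5 itself unchanged is our assumption."""
    return np.where(p > 0.5, p + (1.0 - p) / 2.0,
                    np.where(p < 0.5, p / 2.0, p))

make_overconfident(np.array([0.6, 0.2]))  # -> array([0.8, 0.1])
```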
Figure 1 is a graphical representation of the data-generating process. For each scenario, this process was repeated 1000 times (nsim = 1000).
Performance measures
We used the minimum, median, mean, and maximum of the individual experts' probability estimates of target condition presence to arrive at a single panel-level probabilistic estimate. We analyzed each simulated dataset by estimating the sensitivity and specificity of the index test for the target condition, as determined by applying a prespecified cut-off to the panel-level probabilistic estimate.
We calculated the mean squared error (MSE) of the index test sensitivity and specificity estimated against the reference standard, relative to the true sensitivity and specificity determined against the underlying target condition. In all cases, lower values indicate estimates closer to the true values.
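The sketch below, continuing the earlier example, shows how a panel-level estimate could be formed, dichotomized, and scored; whether the cut-off is applied with >= or > is our assumption:

```python
def accuracy_vs_panel(index_test, expert_probs, threshold, consensus=np.mean):
    """Combine expert estimates (participants x experts) into a panel-level
    probability, dichotomize at the prespecified cut-off, and estimate index
    test sensitivity and specificity against the resulting reference standard."""
    panel_prob = consensus(expert_probs, axis=1)   # np.min/np.median/np.mean/np.max
    ref_positive = panel_prob >= threshold         # '>= cut-off' is an assumption
    se = index_test[ref_positive].mean()           # P(index+ | reference+)
    sp = 1.0 - index_test[~ref_positive].mean()    # P(index- | reference-)
    return se, sp

def mse(estimates, truth: float) -> float:
    """MSE of the accuracy estimates over the nsim repetitions vs. the true value."""
    return float(np.mean((np.asarray(estimates) - truth) ** 2))
```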
We present our results in nested loop plots designed for clarity in reporting simulation results [29].
Additionally, we extracted a scenario with MSE near the median to illustrate the absolute bias in sensitivity and specificity of an index test across a range of true values (at increments of 0.05).
Results
We present the overall impact of the scenarios on the MSE for expert panels with no random or systematic differences, with only systematic differences, and with both random and systematic differences. We then discuss the effects of the individual study and expert panel characteristics. Additionally, we highlight a scenario with MSE near the median to illustrate the impact of using an expert panel on bias in diagnostic accuracy estimates of the index test under study.
MSE in accuracy estimates without random or systematic differences
Fig. 2 shows the MSE of sensitivity and specificity for 1000 simulations of all scenarios without any random or systematic differences between experts (i.e., identical probabilities for each participant with the same component test pattern). The different consensus mechanisms overlap: because there was no variability between experts, all mechanisms produced identical results.
Fig. 2: MSE in sensitivity and specificity estimates without random or systematic differences. Nested loop plot showing results of 468 expert panel scenarios without random or systematic differences between experts. Solid lines show the MSE in sensitivity and dashed lines show the MSE in specificity. Different consensus mechanisms are shown in color. At the bottom of the graph the study and panel characteristics are pictured. The different levels of the study and panel characteristics describe the scenarios corresponding to the results pictured above. The true index test sensitivity and specificity was 0.8.
MSE in accuracy estimates with systematic differences and without random differences
Fig. 3 shows the MSE in accuracy estimates in scenarios with systematic differences and without random differences between experts. The MSE was larger than in scenarios without random or systematic differences; the highest MSE increased from 0.025 to 0.065.
Fig. 3: MSE in sensitivity and specificity estimates with systematic differences. Nested loop plot showing results of 468 expert panel scenarios with systematic differences between experts, but without random differences. Solid lines show the MSE in sensitivity and dashed lines show the MSE in specificity. Different consensus mechanisms are shown in color. At the bottom of the graph the study and panel characteristics are pictured. The different levels of the study and panel characteristics describe the scenarios corresponding to the results pictured above. The true index test sensitivity and specificity was 0.8.
MSE in accuracy estimates with random and systematic differences
Fig. 4 shows the MSE in accuracy estimates in scenarios with random and systematic differences between experts. The MSE was larger than in scenarios without random or systematic differences; the highest MSE increased from 0.025 to 0.065.
Fig. 4: MSE in sensitivity and specificity estimates with random and systematic differences. Nested loop plot showing results of 468 expert panel scenarios with random and systematic differences between experts. Solid lines show the MSE in sensitivity and dashed lines show the MSE in specificity. Different consensus mechanisms are shown in color. At the bottom of the graph the study and panel characteristics are pictured. The different levels of the study and panel characteristics describe the scenarios corresponding to the results pictured above. The true index test sensitivity and specificity was 0.8. The pink arrow indicates a scenario that is further elaborated on in Fig. 5.
Fig. 5 shows the observed and true sensitivity and specificity in one of the biased scenarios, indicated with a pink arrow in Fig. 4. This scenario had 1000 study participants, 10 experts in the panel, 40% target condition prevalence, 20% classification threshold, and sensitivity and specificity of 70% for all four component tests. In this scenario, the MSE in sensitivity was 0.015 and the MSE in specificity was 0.004. These values are also shown in Fig. 4.
Fig. 5: Observed and true sensitivity and specificity of a scenario with 1000 participants, 10 experts, 40% prevalence, a 20% classification threshold, and 70% sensitivity and specificity in the component tests. This scenario includes random and systematic differences between experts as well as an overconfident expert. Different consensus mechanisms are presented in color with 95% intervals around each estimate pictured in grey. This specific scenario is indicated in Fig. 4 by a pink arrow.
Figure 5 pictures the absolute differences between true and observed sensitivity. For example, given a true sensitivity of 0.75, the observed sensitivity was 0.59, an absolute bias of 0.16. The differences were smaller for specificity; for a true specificity of 0.75 the observed specificity was 0.70, an absolute bias of 0.05.
Sensitivity and specificity of the component tests
The sensitivity and specificity of the component tests had a relatively large effect on the MSE. Scenarios with a component test sensitivity and specificity of 80% showed markedly lower MSE than scenarios where these were 70%. When the component tests were 'mirrored', including two tests with low sensitivity and high specificity and two tests with high sensitivity and low specificity, the MSE fell between that of the 70% and 80% scenarios.
Prevalence
MSE also changed with target condition prevalence. When prevalence increased, average MSE in sensitivity estimates decreased (from an MSE of 0.01 at 20% prevalence to an MSE of 0.004 at 50% prevalence) and average MSE in specificity estimates increased (from an MSE of 0.002 at 20% prevalence to an MSE of 0.006 at 50% prevalence). The effect appeared to be relatively complex, as some scenarios in Fig. 2 showed increased MSE in sensitivity estimates, even though the overall direction was a decrease in MSE.
Number of experts in the expert panel
Increasing the number of experts in the panel did not, on average, change the MSE. The average MSE for sensitivity was 0.007 and the average MSE for specificity was 0.004 at 2, 3, and 10 experts. As seen in Figs. 3 and 4, when the consensus mechanism was the maximum or minimum of the expert estimates, more experts led to more extreme results. That is, increasing the number of experts led to higher MSE in sensitivity when the consensus mechanism was the maximum and lower MSE in sensitivity when the consensus mechanism was the minimum. The reverse was true for the MSE in specificity.
Number of participants
Increasing the number of participants did not substantially change the MSE. The average MSE in sensitivity was 0.007 and the average MSE in specificity was 0.004 at 100, 360, and 1000 participants. There was no apparent change overall and there did not appear to be any interaction with other factors in Figs. 2, 3, or 4.
Classification threshold
The choice of classification threshold had a large effect on MSE. The MSE of sensitivity was markedly higher for scenarios where the classification threshold was 20% (0.014) than for scenarios where the classification threshold was 50% (0.005) or 80% (0.002). Some interaction with the target condition prevalence was present in Figs. 3 and 4. However, in no situation did a classification threshold of 20% result in a lower MSE than a classification threshold of 50% or higher.
Overconfident expert
In some scenarios, including an overconfident expert appeared to lower MSE: the average MSE was 0.0067 without an overconfident expert and 0.0054 with one. However, in other cases an overconfident expert led to an increase in MSE. The highest MSE in sensitivity was 0.034 without an overconfident expert and 0.066 in the corresponding scenario with an overconfident expert. The effect appeared highly complex, as its direction and magnitude differed across combinations of factors.
Discussion
We assessed the impact of study and expert panel characteristics on diagnostic test accuracy estimates. Our results show that diagnostic test accuracy results are often biased when the reference standard is an expert panel that determines the presence of a target condition by making dichotomous classifications from probability estimates based on several component tests. Bias is reduced when the expert panel has access to more accurate component tests, but increasing the number of experts or study participants did not necessarily lead to a decrease in bias.
In most simulation scenarios the diagnostic accuracy of the index test is underestimated. Underestimation may not pose a problem if a desired minimum threshold for sensitivity or specificity is reached, for example if the test outperforms tests used in practice. It does, however, mean that the exact sensitivity and specificity of the test cannot be inferred, which obscures the true value of the test. In some scenarios the simulation intervals also included the true value for sensitivity and/or specificity, i.e., the spread of simulated expert panels in those scenarios included both overestimation and underestimation. So, while test accuracy is generally likely to be underestimated when using an expert panel as reference standard, accurate estimation or overestimation is also possible. When true index test accuracy is high, however, underestimation is more likely.
Our study is in line with previously published literature. Several methods for constructing reference standards in the absence of a gold standard have been shown to lead to potentially biased results, including composite reference standards [5], latent class models [6], and well-calibrated expert panels [12]. Despite their potential bias, expert panels are a commonly used reference standard when no gold standard is available [7].
We expand on this existing literature by exploring the impact of expert panels that are not well-calibrated. Expert panels, in practice, may suffer from random and systematic differences between experts, may include experts with different specializations, and may include experts that are more confident of their target condition assessment than is warranted. Additionally, we have performed a full-factorial analysis of all combinations of the defined input parameters. This, combined with the MSE over a series of true accuracy values of the index test, allows for inspection of bias across all possible scenarios.
When interpreting findings from our study, some limitations have to be taken into account. First, real-world expert panel meetings involve a complex interplay between experts. While we did account for possible differences between experts, expert panels are complex systems with social dimensions that are exceedingly difficult to simulate, so some simplifications and approximations had to be used. A second limitation is the dichotomization of the target condition classifications (i.e., target condition present vs. absent). Expert panels are typically asked to provide target condition classifications without reporting their own confidence in the classification; this reflects how expert panels are used in practice but likely results in loss of information compared to panels providing probability estimates for the target condition. Lastly, we focused on sensitivity and specificity estimates; other accuracy measures may also be of interest, such as positive predictive value, negative predictive value, and likelihood ratios.
It is worthwhile exploring whether asking expert panels to provide probability estimates instead of dichotomized target condition classifications leads to improvements in diagnostic test accuracy estimates. To accommodate this, further research is needed on developing and validating methods to compute accuracy results based on probabilities, and to assess whether computation of accuracy estimates from probabilities, instead of classifications, leads to less biased results.
In conclusion, our study highlights the large impact study and expert panel characteristics can have on bias in diagnostic test accuracy results. Both sensitivity and specificity are typically underestimated. Surprisingly, increasing the number of experts or study participants does not lead to a reduction in this bias. However, we observed that bias can be reduced by providing the expert panel with more accurate component tests. Furthermore, we suggest asking expert panels to provide probability estimates of target condition presence rather than solely a dichotomous classification. This will provide insight into the remaining uncertainty surrounding target condition classification and could in the future enable alternative methods for calculating diagnostic test accuracy estimates.
Data availability
All data and code that support the findings of this study are available at the following URL: https://github.com/BasKellerhuis/Expert-Panel-Reference-Standard-Bias.
Abbreviations
- Se: Sensitivity
- Sp: Specificity
- MSE: Mean squared error
References
1. Knottnerus JA, van Weel C, Muris JW. Evaluation of diagnostic procedures [published correction appears in BMJ 2002;324(7350):1391]. BMJ. 2002;324(7335):477–80. https://doi.org/10.1136/bmj.324.7335.477.
2. Bossuyt PM. Interpreting diagnostic test accuracy studies. Semin Hematol. 2008;45(3):189–95. https://doi.org/10.1053/j.seminhematol.2008.04.001.
3. Rutjes AW, Reitsma JB, Coomarasamy A, Khan KS, Bossuyt PM. Evaluation of diagnostic tests when there is no gold standard. A review of methods. Health Technol Assess. 2007;11(50):iii–51. https://doi.org/10.3310/hta11500.
4. Reitsma JB, Rutjes AW, Khan KS, Coomarasamy A, Bossuyt PM. A review of solutions for diagnostic accuracy studies with an imperfect or missing reference standard. J Clin Epidemiol. 2009;62(8):797–806. https://doi.org/10.1016/j.jclinepi.2009.02.005.
5. Schiller I, van Smeden M, Hadgu A, Libman M, Reitsma JB, Dendukuri N. Bias due to composite reference standards in diagnostic accuracy studies. Stat Med. 2016;35(9):1454–70. https://doi.org/10.1002/sim.6803.
6. van Smeden M, Oberski DL, Reitsma JB, Vermunt JK, Moons KG, de Groot JA. Problems in detecting misfit of latent class models in diagnostic research without a gold standard were shown. J Clin Epidemiol. 2016;74:158–66. https://doi.org/10.1016/j.jclinepi.2015.11.012.
7. Bertens LC, Broekhuizen BD, Naaktgeboren CA, et al. Use of expert panels to define the reference standard in diagnostic research: a systematic review of published methods and reporting. PLoS Med. 2013;10(10):e1001531. https://doi.org/10.1371/journal.pmed.1001531.
8. Hawkins DM, Garrett JA, Stephenson B. Some issues in resolution of diagnostic tests using an imperfect gold standard. Stat Med. 2001;20(13):1987–2001. https://doi.org/10.1002/sim.819.
9. Valenstein PN. Evaluating diagnostic tests with imperfect standards. Am J Clin Pathol. 1990;93(2):252–8. https://doi.org/10.1093/ajcp/93.2.252.
10. Lempesi E, Toulia E, Pandis N. Expert panels as a reference standard in orthodontic research: an assessment of published methods and reporting. Am J Orthod Dentofac Orthop. 2017;151(4):656–68. https://doi.org/10.1016/j.ajodo.2016.09.020.
11. van Houten CB, Naaktgeboren CA, Ashkenazi-Hoffnung L, et al. Expert panel diagnosis demonstrated high reproducibility as reference standard in infectious diseases. J Clin Epidemiol. 2019;112:20–7. https://doi.org/10.1016/j.jclinepi.2019.03.010.
12. Jenniskens K, Naaktgeboren CA, Reitsma JB, Hooft L, Moons KGM, van Smeden M. Forcing dichotomous disease classification from reference standards leads to bias in diagnostic accuracy estimates: a simulation study. J Clin Epidemiol. 2019;111:1–10. https://doi.org/10.1016/j.jclinepi.2019.03.002.
13. Fedorov V, Mannino F, Zhang R. Consequences of dichotomization. Pharm Stat. 2009;8(1):50–61. https://doi.org/10.1002/pst.331.
14. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019;38(11):2074–102. https://doi.org/10.1002/sim.8086.
15. Kellerhuis B, Jenniskens K, Kusters MPT, Schuit E, Hooft L, Moons KG, et al. Expert panel as reference standard procedure in diagnostic accuracy studies: a systematic scoping review and methodological guidance. medRxiv. 2024;11.12.24317219.
16. Salamonsen MR, Lo AKC, Ng ACT, Bashirzadeh F, Wang WYS, Fielding DIK. Novel use of pleural ultrasound can identify malignant entrapped lung prior to effusion drainage. Chest. 2014;146(5):1286–93. https://doi.org/10.1378/chest.13-2876.
17. McWhirter L, Ritchie C, Stone J, Carson A. Identifying functional cognitive disorder: a proposed diagnostic risk model. CNS Spectr. 2022;27(6):754–63. PMID: 34533113.
18. Sanders DS, Grabsch H, Harrison R, Bateman A, Going J, Goldin R, Mapstone N, Novelli M, Walker MM, Jankowski J, AspECT Trial Management Group and Trial Principal Investigators. Comparing virtual with conventional microscopy for the consensus diagnosis of Barrett's neoplasia in the AspECT Barrett's chemoprevention trial pathology audit. Histopathology. 2012;61(5):795–800. https://doi.org/10.1111/j.1365-2559.2012.04288.x.
19. Arita Y, Akita H, Fujiwara H, Hashimoto M, Shigeta K, Kwee TC, Yoshida S, Kosaka T, Okuda S, Oya M, Jinzaki M. Synthetic magnetic resonance imaging for primary prostate cancer evaluation: diagnostic potential of a non-contrast-enhanced bi-parametric approach enhanced with relaxometry measurements. Eur J Radiol Open. 2022;9:100403. https://doi.org/10.1016/j.ejro.2022.100403.
20. Chen CCG, Long A, Rwabizi D, Mbabazi G, Ndizeye N, Dushimiyimana B, Ngoga E. Validation of an obstetric fistula screening questionnaire: a case-control study with clinical examination. Reprod Health. 2022;19(1):12. https://doi.org/10.1186/s12978-021-01317-2.
21. Chiang M, Guth D, Pardeshi AA, Randhawa J, Shen A, Shan M, Dredge J, Nguyen A, Gokoffski K, Wong BJ, Song B, Lin S, Varma R, Xu BY. Glaucoma expert-level detection of angle closure in goniophotographs with convolutional neural networks: the Chinese American Eye Study. Am J Ophthalmol. 2021;226:100–7. PMID: 33577791.
22. Mohr P, Birgersson U, Berking C, Henderson C, Trefzer U, Kemeny L, Sunderkötter C, Dirschka T, Motley R, Frohm-Nilsson M, Reinhold U, Loquai C, Braun R, Nyberg F, Paoli J. Electrical impedance spectroscopy as a potential adjunct diagnostic tool for cutaneous melanoma. Skin Res Technol. 2013;19(2):75–83. https://doi.org/10.1111/srt.12008.
23. Leeuwenburgh MM, Wiarda BM, Jensch S, van Es HW, Stockmann HB, Gratama JW, Cobben LP, Bossuyt PM, Boermeester MA, Stoker J, OPTIMAP Study Group. Accuracy and interobserver agreement between MR-non-expert radiologists and MR-experts in reading MRI for suspected appendicitis. Eur J Radiol. 2014;83(1):103–10. PMID: 24168926.
24. Blake H, McKinney M, Treece K, Lee E, Lincoln NB. An evaluation of screening measures for cognitive impairment after stroke. Age Ageing. 2002;31(6):451–6. https://doi.org/10.1093/ageing/31.6.451.
25. Ambagtsheer RC, Thompson MQ, Archibald MM, Casey MG, Schultz TJ. Diagnostic test accuracy of self-reported screening instruments in identifying frailty in community-dwelling older people: a systematic review. Geriatr Gerontol Int. 2020;20(1):14–24. https://doi.org/10.1111/ggi.13810.
26. Clegg A, Rogers L, Young J. Diagnostic test accuracy of simple instruments for identifying frailty in community-dwelling older people: a systematic review. Age Ageing. 2015;44(1):148–52. https://doi.org/10.1093/ageing/afu157.
27. Cheung HHT, Joynt GM, Lee A. Diagnostic test accuracy of preoperative nutritional screening tools in adults for malnutrition: a systematic review and network meta-analysis. Int J Surg. 2024;110(2):1090–8. https://doi.org/10.1097/JS9.0000000000000845.
28. Antipova D, Eadie L, Macaden AS, Wilson P. Diagnostic value of transcranial ultrasonography for selecting subjects with large vessel occlusion: a systematic review. Ultrasound J. 2019;11(1):29. https://doi.org/10.1186/s13089-019-0143-6.
29. Rücker G, Schwarzer G. Presenting simulation results in a nested loop plot. BMC Med Res Methodol. 2014;14:129. https://doi.org/10.1186/1471-2288-14-129.
Acknowledgements
Not applicable.
Funding
This research was not funded by grants or external parties.
Author information
Contributions
B.K. wrote the main manuscript text, programmed and analysed the simulations, and prepared Figs. 1, 2, 3, 4 and 5. K.J. and J.R. came up with the research question. B.K., K.J., E.S., L.H., K.M. and J.R. contributed to interpretation of the data. K.J., E.S., L.H., K.M. and J.R. provided substantial revisions to the manuscript. All authors reviewed the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kellerhuis, B., Jenniskens, K., Schuit, E. et al. Drivers of bias in diagnostic test accuracy estimates when using expert panels as a reference standard: a simulation study. BMC Med Res Methodol 25, 106 (2025). https://doi.org/10.1186/s12874-025-02557-7
DOI: https://doi.org/10.1186/s12874-025-02557-7