Enthusiasts for critical appraisal sometimes forget this, assuming that more is better. Gøtzsche and Olsen, fearing that the medical community had committed a type I error (promoting an ineffective or harmful screening test), set a high standard for judging validity (eg, dismissing trials with any baseline differences). But when data from 8 trials demonstrate a large effect size (21% reduction in mortality) with narrow confidence intervals (13%-29%),8 the risk of a type II error overwhelms that of a type I error. The implications of the error are stark for a disease that claims 40,000 lives per year.13 Rejecting evidence under these conditions is more likely to cause death than accepting it.
The pivotal question in rejecting evidence should not be whether there is a design flaw but something more precise: the probability that the observed outcomes are due to factors other than the intervention under consideration (eg, chance, confounding). This can be just as likely in the absence of design flaws (eg, a perfectly conducted uncontrolled case series) as when a study is performed poorly, and it can be low even when studies are imperfect. It is a mistake to reject evidence reflexively because of a design flaw without studying these probabilities.
For example, what worried Gøtzsche and Olsen about flawed randomization was that factors other than mammography might account for lower mortality. But how likely was that (compared with the probability that early detection was efficacious)? Suppose a trial with imbalanced randomization carries a 50% probability of producing spurious results due to chance or confounding. The probability that the same phenomenon would occur in all 8 trials (conducted independently in 4 countries in different decades, using different technologies, views, and screening intervals14) would be in the neighborhood of (0.50)^8, or 0.39%. The probability would be higher if one believed that subversion is systematic among researchers, but without data such speculation is an exercise in cynicism rather than science.
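To make the arithmetic explicit (a simplified calculation that, like the passage above, treats the 8 trials as independent and grants each a generous 50% chance of error):

$$P(\text{all 8 trials spurious}) = (0.50)^8 \approx 0.0039 = 0.39\%$$

In other words, even if each flawed trial had an even chance of misleading us, the probability that all 8 would mislead us in the same direction is less than 1 in 250.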
Suppose that the criticisms of Gøtzsche and Olsen justify the termination of mammography. Are we applying the same standard for other screening and clinical practices, or do we move the goal posts? Physicians screen for prostate cancer without one well-designed controlled trial showing that it lowers mortality.15 We say this is not evidence-based,16 but what kind of study would change our minds? The conventional answer is that a randomized controlled trial showing a reduction in prostate cancer mortality is required,17 but The Lancet article finds that even 8 such trials are unconvincing. Indeed, flawless trials lose influence if the end points or setting lack generalizability.18 If the consequence of such high standards is that 30 years of trials involving a total of 482,000 women (and untold cost) cannot establish efficacy, the prospects for making the rest of medicine evidence-based are slim indeed.
The search for perfect data also has epistemologic flaws. No study can provide the absolute certainty that extremism in critical appraisal seeks. The willingness to reject studies based on improbable theories of confounding or research misbehavior may have less to do with good science than with discomfort with uncertainty, that is, unease with any possibility that the inferences of the investigators are wrong. The wait for better evidence is in vain, however, because science can only guess about reality. Even in a flawless trial, a P value of .05 means that chance alone would produce a result at least this extreme 5% of the time if the intervention had no effect. Good studies have better odds of predicting reality, but they do not define it. It is legitimate to reject poorly designed studies because the probability of being wrong is too high. But to reject studies because there is any probability of being wrong is to wait futilely for a class of evidence that does not exist.
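Stated formally (a standard frequentist definition, supplied here for clarity rather than drawn from the original):

$$\alpha = \Pr(\text{result declared significant} \mid \text{intervention has no effect}) = .05$$

The threshold caps the false-positive rate under the null hypothesis; it guarantees nothing about certainty in any individual trial.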
Inadequate critical appraisal
The POEM about The Lancet article highlights the other extreme of critical appraisal, accepting studies at face value. The review mentioned none of the limitations in The Lancet analysis, thus giving readers little reason to doubt the conclusion that mammography lacks scientific support and potentially convincing them to stop screening. Physicians should decide whether this is the right choice only after having heard all the issues. That this study reported a null effect and used meta-analysis does not lessen the need for critical appraisal. Like removing a counterweight from a scale, the omission of critical appraisal unduly elevates study findings (positive or negative), thus fomenting overreaction by not putting the information in context.
Several new resources, POEMs among them, have become available to alert physicians to important evidence. Some features (eg, the “Abstracts” section in the Journal of the American Medical Association) simply reprint abstracts. Others associated with the evidence-based medicine (EBM) movement offer critical appraisals. In family practice, these include POEMs and Evidence-Based Practice.19 In other specialties, they include the American College of Physicians’ ACP Journal Club and the EBM journals from the BMJ Publishing Group (eg, Evidence-Based Medicine, Evidence-Based Nursing). These efforts try to approach critical appraisal systematically. ACP Journal Club and the BMJ journals apply a quality filter (excluding studies failing certain criteria20) and append a commentary that mentions design limitations. POEMs go further, devoting a section to study design and validity and giving the authors explicit criteria for assessing quality.21