Taking Critical Appraisal to Extremes

Commentary

J Fam Pract. 2000 December;49(12):1081-1085

By Steven H. Woolf, MD, MPH

The Need for Balance in the Evaluation of Evidence

References

1. Gøtzsche PC, Olsen O. Is screening for breast cancer with mammography justifiable? Lancet 2000;355:129-33.

2. Health watch: mammography controversy. CBS Evening News January 7, 2000. Vanderbilt University Television News Archive, available at tvnews.vanderbilt.edu.

3. Mammography assessed. Washington Post, January 7, 2000, A14.

4. Reaves J. Here’s why your oncologist is angry. Time January 13, 2000. Available at www.time.com/time/daily/0,2960,37449,00.html.

5. Mammography screening deemed unjustifiable. Reuters Medical News January 7, 2000. Available at www.medscape.com/reuters/prof/test/2000/o1/01.07/pbo1070c.html.

6. Wilkerson BF, Schooff M. Screening mammography may not be effective at any age. J Fam Pract 2000;49:302-371.

7. Screening mammography re-evaluated. Lancet 2000;355:747-52.

8. Kerlikowske K, Grady D, Rubin SM, Sandrock C, Ernster VL. Efficacy of screening mammography: a meta-analysis. JAMA 1995;273:149-54.

9. Schulz KF. Subverting randomization in controlled trials. JAMA 1995;274:1456-58.

10. Moher D, Pham B, Jones A, et al. Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet 1998;352:609-13.

11. Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995;273:408-12.

12. Berry DA. Benefits and risks of screening mammography for women in their forties: a statistical appraisal. J Natl Cancer Inst 1998;90:1431-39.

13. American Cancer Society. Cancer facts & figures 2000. Atlanta, Ga: American Cancer Society; 2000.

14. Fletcher SW, Black W, Harris R, Rimer BK, Shapiro S. Report of the International Workshop on Screening for Breast Cancer. J Natl Cancer Inst 1993;85:1644-56.

15. Collins MM, Stafford RS, Barry MJ. Age-specific patterns of prostate-specific antigen testing among primary care physician visits. J Fam Pract 2000;49:169-72.

16. Lefevre ML. Prostate cancer screening: more harm than good? Am Fam Phys 1998;58:432-38.

17. Woolf SH, Rothemich SF. Screening for prostate cancer: the role of science, policy, and opinion in determining what is best for patients. Ann Rev Med 1999;50:207-21.

18. Bucher HC, Guyatt GH, Cook DJ, Holbrook A, McAlister FA. Users’ guides to the medical literature: XIX. Applying clinical trial results. A. How to use an article measuring the effect of an intervention on surrogate end points. Evidence-Based Medicine Working Group. JAMA 1999;282:771-78.

19. Advertisement. Evidence-Based Practice. Montvale, NJ: Quadrant HealthCom Inc.; 2000.

20. Purpose and procedure. Evidence-Based Medicine. Available at www.bmjpg.com/data/ebmpp.htm.

21. Assessing validity and relevance. Available at www.infopoems.com/EBP_Validity.htm.

22. Gerstein HC. Commentary on “Intensive blood glucose control reduced type 2 diabetes mellitus-related end points.” ACP J Club 1999; 2-3. Comment on: UK Prospective Diabetes Study Group. Intensive blood-glucose control with sulphonylureas or insulin compared with conventional treatment and risk of complications. Lancet 1998;352:837-53.

23. Woolf SH, Davidson MB, Greenfield S, et al. Controlling blood glucose levels in patients with type 2 diabetes mellitus: an evidence-based policy statement by the American Academy of Family Physicians and American Diabetes Association. J Fam Pract 2000;49:453-60.

24. Cook DJ, Mulrow CD, Haynes RB. Systematic reviews: synthesis of best evidence for clinical decisions. Ann Intern Med 1997;126:376-80.

25. Instructions for POEMs authors. Available at www.infopoems.com/authors.htm.

26. Slawson DC, Shaughnessy AF. Becoming an information master: using POEMs to change practice with confidence. J Fam Pract 2000;49:63-67.

27. Hayward RS, Wilson MC, Tunis SR, Bass EB, Guyatt G. Users’ guides to the medical literature. VII. How to use clinical practice guidelines. A. Are the recommendations valid? The Evidence-Based Medicine Working Group. JAMA 1995;274:570-74.

28. Woolf SH, George JN. Evidence-based medicine: interpreting studies and setting policy. Hem Oncol Clin N Amer 2000;14:761-84.

29. Shekelle PG, Woolf SH, Eccles M, Grimshaw J. Developing guidelines. BMJ 1999;318:593-96.

30. Woolf SH. Practice guidelines: what the family physician should know. Am Fam Phys 1995;51:1455-63.

31. Lazar PA. Does oral metronidazole prevent preterm delivery in normal-risk pregnant women with asymptomatic bacterial vaginosis (BV)? J Fam Pract 2000;49:495-96.

32. Cook D, Giacomini M. The trials and tribulations of clinical practice guidelines. JAMA 1999;281:1950-51.

33. Grilli R. Practice guidelines developed by specialty societies: the need for a critical appraisal. Lancet 2000;355:103-06.

34. Geyman JP. POEMs as a paradigm shift in teaching, learning, and clinical practice. J Fam Pract 1999;48:343-44.

35. Dickinson WP, Stange KC, Ebell MH, Ewigman BG, Green LA. Involving all family physicians and family medicine faculty members in the use and generation of new knowledge. Fam Med 2000;32:480-90.

38. JFP online. Available at www.jfponline.com.

39. POEMs for primary care. Available at www.infopoems.com.

In January 2000 an article in The Lancet drew attention when it questioned the supporting evidence for screening mammography.¹ Danish investigators Peter Gøtzsche and Ole Olsen presented a series of apparent flaws in the 8 randomized trials of mammography, ultimately concluding that screening is unjustified. Their cogent arguments and the press coverage they received left many physicians wondering whether they should continue to order mammograms. The story led the CBS Evening News² and was featured in the Washington Post,³ Time,⁴ and Reuters.⁵

A Patient-Oriented Evidence that Matters (POEM) review in the April 2000 issue of The Journal of Family Practice⁶ that addressed The Lancet study lent apparent support to these concerns. The POEM related the arguments in The Lancet article without challenging them and concluded that “mammography screening has never been shown to help women to live longer.” The authors of this POEM suggested that the only reasons for screening to continue are “politics, patients’ preconceptions, and the fear of litigation.” Unlike most POEMs, this one included no critical appraisal of the methods or assumptions of the reviewed study. This lack of comment, combined with the authors’ negative remarks about mammography, may have convinced family physicians that the criticisms of Gøtzsche and Olsen were beyond dispute.

However, controversy does surround their arguments, as the many letters to the editor published in The Lancet attest.⁷ For example, The Lancet critique made much of inconsistent sample sizes and baseline dissimilarities between screened and unscreened women. The authors asserted that such age and socioeconomic differences were “incompatible with adequate randomization.” That premise is contestable. It is normal and predictable that a proportion of population variables will differ between groups for statistical reasons, no matter how perfect the randomization. Also, the observed age difference (1 to 6 months) would not explain the 21% reduction in mortality observed in the trials.⁸
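The point about chance imbalances can be illustrated with a small simulation. The sketch below is not drawn from the mammography trials: it assumes hypothetical, independent, normally distributed baseline variables, a two-sided 5% significance threshold, and an illustrative function name (`fraction_imbalanced`). Both arms are sampled from the same population, so any "significant" difference it flags arises from chance alone.

```python
import math
import random
import statistics


def fraction_imbalanced(n_per_arm=1000, n_covariates=200, z_crit=1.96, seed=0):
    """Share of baseline variables that differ 'significantly' between two
    arms even though allocation is perfectly random (hypothetical data)."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_covariates):
        # Both arms come from the same population: randomization is flawless.
        arm_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        arm_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        # Large-sample z statistic for the difference in means.
        se = math.sqrt(
            statistics.variance(arm_a) / n_per_arm
            + statistics.variance(arm_b) / n_per_arm
        )
        z = (statistics.mean(arm_a) - statistics.mean(arm_b)) / se
        if abs(z) > z_crit:
            flagged += 1
    return flagged / n_covariates


if __name__ == "__main__":
    print(f"Share of variables flagged as imbalanced: {fraction_imbalanced():.3f}")
```

Under these assumptions the flagged share should fall in the neighborhood of the chosen significance level, give or take sampling noise. A handful of baseline differences is therefore weak evidence, on its own, that randomization failed.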

For Gøtzsche and Olsen the discrepant age patterns and sample sizes were less a cause of the results than a warning sign that randomization had been subverted (because of failure to conceal allocation). Since mortality in the screened and unscreened groups differed by only a relatively small number of deaths, they reasoned that very little bias would be necessary to tip the scales in favor of mammography.

Several arguments weaken their case, however. First, they offered no evidence that subversion or unconcealed allocation actually occurred. They equated inexplicit documentation of procedures (and dissimilar group characteristics) with improper randomization. Second, even if unconcealed allocation occurred, it does not in itself thwart randomization. Investigators who know to which group a patient will be assigned can still follow the rules and make the correct assignment. Anecdotal reports of subversion (by deciphering assignment sequences to divert or target patients for allocation) do not offer denominator data to assess how often this occurs.⁹ It would have had to occur in every trial that favored mammography to uphold the authors’ allegations. Third, even if the trials were subverted, there is no indication that case mix differed enough to skew outcomes. Age differences were minor; the authors speculated that sizable imbalances in unmeasured factors could have altered results, but they gave no evidence. They cited reports that poorly concealed allocation is associated with a 37% to 41% exaggeration in odds ratios,¹⁰,¹¹ but these reports concerned other trials and made arguable assumptions. Finally, their confirmatory finding (that only the 6 “flawed” trials reported a benefit for mammography and that the 2 acceptable trials showed no effect) was based on recalculated relative risk rates. The original trial data show no such pattern.⁸

This is not to suggest that weaknesses in the mammography trials do not merit scrutiny. Others have also voiced criticisms.¹² But the alarm raised by Gøtzsche and Olsen goes further, compelling us to rethink the purpose of critical appraisal and the extremes at which it might cause more harm than good.

Excessive critical appraisal

We seek perfection in evidence to safeguard patients. Prematurely adopting (or abandoning) interventions through uncritical acceptance of findings risks overlooking potential harms or more effective alternatives. But critical appraisal can do harm if valid evidence is rejected. Deciding whether to accept evidence means weighing the risks of acceptance against the risks of rejection, which are inversely related. At one extreme of the spectrum, where data are accepted at face value (no appraisal), the risk of a type I error (accepting evidence of efficacy when the intervention does not work or causes harm) is high, and that of a type II error (discarding evidence when the intervention actually works) is low. At the other extreme (excessive scrutiny) the risk of a type II error is great; such errors harm patients because knowledge that can save (or improve) lives is rejected. Obviously, patients are best served somewhere in the middle, where the risks of type I and type II errors are optimally balanced.
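The inverse relation between the two kinds of error can be made concrete with a toy model. The sketch below rests entirely on illustrative assumptions that are not part of the commentary: each body of evidence is reduced to a single "strength" score, scores for ineffective interventions follow a standard normal distribution, scores for effective ones are shifted upward by a hypothetical `effect`, and appraisal accepts evidence only above a `threshold`. The function name `error_rates` is likewise hypothetical.

```python
from statistics import NormalDist


def error_rates(threshold, effect=1.0):
    """Return (type I, type II) error rates for a given appraisal threshold
    in a toy model where evidence strength is a single normal score."""
    ineffective = NormalDist(0.0, 1.0)   # scores when the intervention does not work
    effective = NormalDist(effect, 1.0)  # scores when the intervention works
    # Type I: evidence for an ineffective intervention clears the threshold.
    type_1 = 1.0 - ineffective.cdf(threshold)
    # Type II: evidence for an effective intervention falls below it.
    type_2 = effective.cdf(threshold)
    return type_1, type_2


if __name__ == "__main__":
    for threshold in (-2.0, 0.0, 0.5, 1.0, 3.0):
        t1, t2 = error_rates(threshold)
        print(f"threshold={threshold:+.1f}  type I={t1:.3f}  type II={t2:.3f}")
```

In this model, raising the threshold (stricter appraisal) always lowers the type I rate and raises the type II rate. Where the best balance lies depends on the relative costs of the two errors, which the sketch does not attempt to quantify.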