The issue of scientific reproducibility has come to the fore in the past several years, driven by noteworthy failures to replicate critical findings in several much-publicized reports, coupled with a series of scandals that call into question the role of journals and granting agencies in maintaining quality and oversight.
In a special Nature online collection, the journal assembled articles and perspectives from 2011 to the present dealing with this issue of research reproducibility in science and medicine. These articles were supplemented with current editorial comment.
Seeing these broad-spectrum concerns pulled together in one place makes it difficult not to be pessimistic about the current state of research investigations across the board. The saving grace, however, is that these same reports show that many people realize there is a problem, are trying to make changes, and are in a position to be effective.
According to the reports presented in the collection, the problems in research accountability and reproducibility have grown to an alarming extent. By one estimate, irreproducibility costs biomedical research some $28 billion per year (Nature. 2015 Jun 9. doi: 10.1038/nature.2015.17711).
A litany of concerns
In 2012, scientists at Amgen (Thousand Oaks, Calif.) reported that, even when cooperating closely with the original investigators, they were able to reproduce only 6 of 53 studies considered to be benchmarks of cancer research (Nature. 2016 Feb 4. doi: 10.1038/nature.2016.19269).
Scientists at Bayer HealthCare reported in Nature Reviews Drug Discovery that they could successfully reproduce results in only a quarter of 67 so-called seminal studies (2011 Sep. doi: 10.1038/nrd3439-c1).
According to a 2013 report in The Economist, Dr. John Ioannidis, an expert in the field of scientific reproducibility, argued that in his field, “epidemiology, you might expect one in ten hypotheses to be true. In exploratory disciplines like genomics, which rely on combing through vast troves of data about genes and proteins for interesting relationships, you might expect just one in a thousand to prove correct.”
This growing litany of irreproducible results has raised alarm in the scientific community and prompted a search for answers, since so many preclinical studies provide the precursor data for eventual human trials.
Despite the concerns raised, human clinical trials seem to be less at risk for irreproducibility, according to an editorial by Dr. Francis S. Collins, director, and Dr. Lawrence A. Tabak, principal deputy director, of the U.S. National Institutes of Health, “because they are already governed by various regulations that stipulate rigorous design and independent oversight – including randomization, blinding, power estimates, pre-registration of outcome measures in standardized, public databases such as ClinicalTrials.gov and oversight by institutional review boards and data safety monitoring boards. Furthermore, the clinical trials community has taken important steps toward adopting standard reporting elements” (Nature. 2014 Jan. doi: 10.1038/505612a).
The paucity of P
Today, a P value of .05 or less is all too often considered the sine qua non of scientific proof. “Most statisticians consider this appalling, as the P value was never intended to be used as a strong indicator of certainty as it too often is today. Most scientists would look at [a] P value of .01 and say that there was just a 1% chance of [the] result being a false alarm. But they would be wrong.” The 2014 report goes on to state that, according to one widely used statistical calculation, a P value of .01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect; a P value of .05 raises that chance of a false alarm to at least 29% (Nature. 2014 Feb. doi: 10.1038/506150a).
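The 11% and 29% figures can be recovered with a short calculation. The sketch below assumes one common calibration – the widely cited −e·p·ln(p) lower bound on the Bayes factor favoring the null, combined here with an assumed 50% prior chance that a true effect exists; the original report does not specify its exact method, so this is illustrative only:

```python
import math

def min_false_alarm_prob(p, prior_true=0.5):
    """Lower bound on the chance that a 'significant' result is a false
    alarm, via the -e*p*ln(p) bound on the Bayes factor favoring the
    null (valid for p < 1/e).  prior_true is the assumed prior
    probability that a real effect exists (0.5 here, an assumption)."""
    bf_null = -math.e * p * math.log(p)              # minimum Bayes factor for the null
    posterior_odds_null = bf_null * (1 - prior_true) / prior_true
    return posterior_odds_null / (1 + posterior_odds_null)

print(round(min_false_alarm_prob(0.01), 2))  # 0.11 -> at least 11%
print(round(min_false_alarm_prob(0.05), 2))  # 0.29 -> at least 29%
```

Lowering `prior_true` (an implausible hypothesis to begin with) pushes the false-alarm floor higher still, which is the sense in which the bound "depends on the underlying probability that there is a true effect."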
Beyond this assessment problem, P values may allow for considerable researcher bias, conscious and unconscious, even to the extent of encouraging “P-hacking”: one of the few statistical terms to ever make it into the Urban Dictionary. “P-hacking is trying multiple things until you get the desired result” – even unconsciously, according to one researcher quoted.
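A tiny simulation makes the quoted definition concrete: if a study with no real effect measures many outcomes and reports whichever one happens to cross the .05 threshold, the chance of a spurious “finding” climbs far above 5%. This is an illustrative sketch, not a model of any particular study:

```python
import random

def phack_false_positive_rate(n_outcomes, alpha=0.05, n_sims=20000, seed=1):
    """Simulate 'trying multiple things': each simulated study measures
    n_outcomes independent null outcomes (no real effect anywhere), so
    each P value is uniform on [0, 1].  The study claims a finding if
    ANY outcome crosses the alpha threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        p_values = [rng.random() for _ in range(n_outcomes)]
        if min(p_values) < alpha:      # report only the best-looking result
            hits += 1
    return hits / n_sims

print(phack_false_positive_rate(1))    # ~0.05: one honest, pre-specified test
print(phack_false_positive_rate(10))   # ~0.40: ten looks, i.e. 1 - 0.95**10
```

The inflation needs no bad faith: simply peeking at ten outcomes and writing up the one that “worked” turns a nominal 5% error rate into roughly 40%.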
In addition, “unless statistical power is very high (and much higher than in most experiments), the P value should be interpreted tentatively at best” (Nat Methods. 2015 Feb 26. doi: 10.1038/nmeth.3288).
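The point about power can also be illustrated: repeating the same modestly powered experiment yields P values that swing between “highly significant” and null. The simulation below uses arbitrary hypothetical numbers (a true effect of 0.5 SD, n = 20, roughly 60% power for a known-variance z-test); it is a sketch of the phenomenon, not of any cited study:

```python
import random
from statistics import NormalDist

def replicate_p_values(effect=0.5, n=20, n_reps=10, seed=2):
    """Repeat a modestly powered experiment n_reps times: each run draws
    n observations from N(effect, 1) and computes a two-sided z-test
    P value against mean 0 (sigma known = 1).  All numbers hypothetical."""
    rng = random.Random(seed)
    nd = NormalDist()
    p_values = []
    for _ in range(n_reps):
        xs = [rng.gauss(effect, 1) for _ in range(n)]
        z = (sum(xs) / n) * (n ** 0.5)         # z = mean / (sigma / sqrt(n)), sigma = 1
        p_values.append(2 * (1 - nd.cdf(abs(z))))
    return p_values

# Ten replications of the identical experiment; the P values scatter widely.
print([round(p, 3) for p in replicate_p_values()])
```

At this power, some replications land well below .05 and others nowhere near it, which is why a single P value from an underpowered study is such weak evidence.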
So bad is the problem that “misuse of the P value – a common test for judging the strength of scientific evidence – is contributing to the number of research findings that cannot be reproduced,” the American Statistical Association warned in a statement released in March 2016, adding that the P value cannot be used to determine whether a hypothesis is true or even whether results are important (Nature. 2016 Mar 7. doi: 10.1038/nature.2016.19503).