In a decade when the sciences have largely been castigated for their statistical methods, it seems another axe has just fallen on a popular member. A group out of the UK (including John Ioannidis, noted critic of medicine, translational animal research, and psychology) now brings pressure to bear on the neurosciences.
The authors’ paper in the most recent edition of Nature Reviews Neuroscience discusses numerous statistical problems now being addressed across a variety of fields, including selective publication, data analysis, and data reporting, but in this case they focus mostly on power. No, not that kind of power: statistical power, a key point of concern in the application of modern statistics.
Basically, statistical power is the ability of researchers to detect whatever effect it is they are trying to measure or study; importantly, power increases as the sample size or the effect size gets larger. It’s actually kind of ironic that we have power problems in modern science, when so many resources are being funneled toward collecting data; but the truth is that many research studies chronically fail to collect nearly enough samples to adequately test the effects under study.
A good way to think about how statistical power affects study results is to equate power with a very potent sense of smell. Your pet dog can clearly and unambiguously detect scents whether they are weak or strong, whereas most of us mere humans can barely differentiate between the smell of a tasty dinner and a not-so-savory… well, let’s not go there.
Clearly, dogs have a great amount of power to detect scents that really are present, whereas people, who have much less “scent-power”, will often fail to recognize smells even when in the same room with them (a type II error, or false negative). It should be fairly apparent that when we fail to correctly identify a smell because of low scent-power, we are also more likely to misidentify the smell as something other than what it is (kind of like the “everything-tastes-like-chicken” meme, but for scents). Obviously if I come home to the luxuriant smell of beef and say “mmm, smells like chicken”, then my nose has deceived me. In statistics, this “false positive” is called a type I error.
This second type of mistake (sniffing out a false effect and calling it “true”) is supposed to be a major no-no in statistical applications, and it is why virtually all journals require that reported results have less than a 5% probability (p < 0.05) of occurring by chance alone if no real effect exists. However, as this most recent study notes, even when “statistically significant” findings have a low p-value, if a study has very low power then the odds that those findings turn out to be false are quite a bit higher than 5%. Low power thus causes us to miss real effects and raises the chance that the effects we do report are false. Mathematically, power is the probability that we will detect a true effect (i.e. the complement of the type II error rate), and the generally accepted threshold for power is about 80%, where 100% would mean that every real effect under study passes the threshold of statistical significance.
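To make the dependence of power on sample size and effect size concrete, here is a minimal sketch (my own illustration, not from the paper) of an approximate power calculation for a two-sided, two-sample z-test; the function names and the z-test approximation are assumptions chosen for simplicity.

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_sample(d, n_per_group, z_crit=1.96):
    # Approximate power of a two-sided, two-sample z-test at alpha = 0.05
    # for a standardized effect size d (Cohen's d). The tiny chance of
    # rejecting in the "wrong" tail is ignored.
    ncp = d * math.sqrt(n_per_group / 2)  # expected value of the z-statistic
    return 1 - norm_cdf(z_crit - ncp)

# A "medium" effect (d = 0.5) with 20 subjects per group is badly
# underpowered; around 64 per group are needed to reach 80% power.
print(round(power_two_sample(0.5, 20), 2))
print(round(power_two_sample(0.5, 64), 2))
```

Doubling the effect size or roughly tripling the sample size is what it takes to move a study like this from coin-flip territory to the recommended 80%.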
A second key feature of low-powered studies is that even when researchers do identify a true effect, the estimated size of that effect tends to be inflated. For example, if you took just a few sniffs to compare the strength of odor from two types of flowers in your garden (red and pink daffodils) and reported that the red daffodils produce a smell three times more potent than the pink daffodils, we could probably say that this estimated difference in smell is grossly inflated relative to the real difference in smell.
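This inflation (sometimes called the “winner’s curse”) is easy to demonstrate with a quick simulation. The sketch below is illustrative and not taken from the paper; the function name and every parameter value are assumptions.

```python
import math
import random
import statistics

def mean_significant_effect(true_d=0.3, n_per_group=15,
                            trials=20000, z_crit=1.96, seed=1):
    # Simulate many small "studies" of a modest true effect and average
    # the effect estimates of only those that reach significance: the
    # survivors systematically overstate the truth.
    random.seed(seed)
    se = math.sqrt(2 / n_per_group)  # std. error of a standardized mean difference
    significant = [d_hat
                   for d_hat in (random.gauss(true_d, se) for _ in range(trials))
                   if abs(d_hat) / se > z_crit]  # "statistically significant"
    return statistics.mean(significant)
```

With these numbers, the significant studies report an average effect well over twice the true d = 0.3, purely because only unusually large estimates clear the significance bar in a small sample.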
In the present study, the authors decided to address this issue by producing an estimate of the level of statistical power that studies in neuroscience tend to have. To do this they screened all neuroscience meta-analyses from 2011, identifying 48 studies from diverse fields within the neurosciences that reported sufficient data to be included. They also expanded their search to include under-represented fields, which allowed them to identify 8 structural neuroimaging (e.g. brain volume) meta-analyses from 2006–2009 and a “representative sample” of animal studies investigating sex differences in maze learning performance.
From the first, broader collection of 48 studies the authors found a combined statistical power of about 21%, a figure that dropped even lower when they excluded 8 meta-analyses of “neurological” (e.g. brain damage) studies, which tend to examine large effects and so were well powered despite their small sample sizes. The MRI volume studies had a power of about 8% on average, whereas the animal studies of sex differences had average power between 18% and 31%.
What is so startling about these findings is just how low these power estimates are: almost across the board, the estimated power was about one quarter of the recommended 80%. To put this in context, consider that most neuroscience studies are exploratory (i.e. searching for new and novel findings, which means that a majority of the hypotheses under study will be false). Given an average power of about 20%, the authors estimate that around half of all reported, statistically significant results may in fact be false positives. And this doesn’t even consider other caveats: because the authors estimate power from studies with inflated effect sizes, they may actually be over-estimating the power that many studies have, and issues like the file-drawer problem or researchers running multiple analyses (and reporting only a few) could make the problem worse.
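The “around half may be false” figure can be reproduced with the standard positive predictive value formula used in Ioannidis-style analyses. The 1:4 prior odds below are an illustrative assumption (matching the idea that most exploratory hypotheses are false), not a number reported by the authors.

```python
def ppv(power, alpha=0.05, prior_odds=0.25):
    # Probability that a statistically significant result reflects a
    # true effect: PPV = power * R / (power * R + alpha), where R is the
    # prior odds that the hypothesis under test is true.
    return power * prior_odds / (power * prior_odds + alpha)

# With 20% power and 1:4 prior odds, half of significant findings are
# false; at the recommended 80% power, only a fifth are.
print(round(ppv(0.20), 3))  # 0.5
print(round(ppv(0.80), 3))  # 0.8
```

Note that raising power is the only lever a researcher directly controls here: alpha is fixed by convention, and the prior odds depend on how speculative the hypothesis is.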
So what are the implications? For one, this doesn’t invalidate all of the studies in existence. The meta-analysis only tells us directly about the studies that were meta-analyzed in the first place (which is nowhere near a majority of all research), and while we learn a lot about the average study (really, because this is an analysis of meta-analyses, the “meta-average”), we learn very little about individual studies, among which there are many high-quality pieces of work. Still, given how much it costs to fund the thousands of neuroscience studies conducted every year, and given the potential ethical concerns, it is pretty critical that something change.
The authors point to a number of key fixes:
1) Define methods and analyses and calculate study power before conducting the research
2) Disseminate results, materials, and data widely
3) Collaborate to ensure it is possible to have sufficient power and to replicate results
Will any of this come to pass? Hard to say. There are undoubtedly lots of entrenched interests, and most of these recommendations are much easier to follow for confirmatory research. For the time being, all of the glory comes from identifying new ideas, which means the incentive structure supports very few of these initiatives. And as long as it is possible to publish prominently with under-powered studies (I don’t believe any neuroscience journals require power analyses to be reported), don’t expect any of this to change all that much.