Tuesday, April 3, 2018

Why So Much of Neuroscience News Is Unreliable

After finishing this site's posts, a reader will return to his or her regular web browsing habits, which may include periodic visits to science web sites such as LiveScience.com or RealClearScience.com. Such sites may periodically run stories that seem to contradict this site's assertions. But there are reasons why you should take such stories with a grain of salt.

The first reason has to do with the runaway hype and exaggeration that is currently going on in regard to Internet sites reporting scientific research. Major websites have learned the following fundamental formula:

                                       Clicks = Cash Income

The reason for this is that major websites make money from online ads. So the more people click on a link to some science story, the more money the website makes. This means that science reporting sites have a tremendous financial incentive to hype and exaggerate science stories. If they have a link saying, “Borderline results from new neuron study,” they may make only five dollars from that story. But if they have a story saying, “Astonishing breakthrough unveils the brain secret of memory,” they may make five hundred dollars from that story. With such a situation, it is no wonder that the hyping and exaggeration of scientific research is at epidemic levels.

Part of the problem is university press offices, which nowadays are shameless in exaggerating the importance of research done at their university. They know the more some scientific research is hyped, the more glory and attendees will flow to their university.

A scientific paper reached the following conclusions, indicating a huge hype and exaggeration crisis both among the authors of scientific papers and the media that reports on such papers:

Thirty-four percent of academic studies and 48% of media articles used language that reviewers considered too strong for their strength of causal inference....Fifty-eight percent of media articles were found to have inaccurately reported the question, results, intervention, or population of the academic study.

Another huge problem involves what is called the Replication Crisis. This is the fact that a very large fraction of scientific research results are never successfully replicated. The problem was highlighted in a widely cited 2005 paper by John Ioannidis entitled “Why Most Published Research Findings Are False.”

A scientist named C. Glenn Begley and his colleagues tried to reproduce 53 published studies called “ground-breaking.” He asked the scientists who wrote the papers to help, by providing the exact materials used to produce the published results. Begley and his colleagues were able to reproduce only 6 of the 53 experiments (about 11 percent). In 2011 Bayer reported similar results: its researchers tried to reproduce 67 medical studies, and succeeded only about 25 percent of the time.

So that “breakthrough” story on your science news site has a large chance of being just some fluke that no other researcher will ever reproduce. And if some other researcher tries to reproduce the study, and fails, you will probably never hear about it (because replication failures have a hard time getting published, and get little press).

Then there's the fact that our science journalists typically act as a kind of cheerleader, as “pom-pom journalists,” rather than as people applying critical scrutiny to the claims of scientists.

Let us imagine a country in which the press reported uncritically the assertions of the government. In this country, each time the leader of the country stated something, it would be reported as gospel truth by the press. In this country when some group of government officials such as a Senate committee came to a decision, the journalists would report that decision as if it were something that could scarcely be doubted. And whenever a president wanted to start a war, the press would publish the White House spin without criticism. Now clearly in such a country the press would not be doing its proper job. The proper job of the press is not just to report what authorities in power are saying, but to subject such claims to critical scrutiny.

Thankfully we do not live in a country with a press that is a lap-dog to authorities in government. But we do live in a country where the press is pretty much a lap-dog to authorities in academia. Our science writers typically treat the pronouncements of professors with much the same reverence with which North Korean journalists treat the utterances of government officials.

When a US president or senator makes a dubious claim, the press are very good about quoting people with alternate viewpoints who dispute such claims. When a scientist makes a dubious claim, our science journalists will typically uncritically repeat such a claim, without citing someone else who disputes it.

When quoting people, science journalists virtually always stay strictly within the community of mainstream scientists. If there are 50 knowledgeable online critics who have vigorously disputed some type of claim that a scientist makes, none of them will be quoted the next time a scientist makes such a claim. If political reporters operated in a similar way, the rule would be that you can't quote anyone except a Republican to talk about a Republican's statement, and you can't quote anyone except a Democrat to talk about a Democrat's statement.

Another reason why you may get many poor quality stories on science web sites is that scientists often mix up facts, experimental results, speculations and doubtful assumptions, without being careful to label their speculations as speculations, and without being careful to note their unproven assumptions are unproven.   It's all poured together into a mixture labeled as "science," making it very hard for readers to sort out what part of a study is hard fact and which part may be the author's dubious interpretation or spin on what he or she observed.


Another reason why you may get lots of poor quality stories on science web sites has to do with the rather perverse incentive system that exists for scientists to “toe the party line,” and produce papers that reflect existing ideas rather than challenge them. Consider the process of peer review, whereby each paper submitted to a journal is reviewed by several scientists in the same field. A paper won't get published if the reviewers give it a negative rating. A paper may get a bad rating if it has obvious errors or math errors, but it may also get a bad rating if it challenges or seems to defy prevailing assumptions. What this means is that peer review tends to act like a censorship racket, preventing journals from publishing results that might be deemed controversial.

Two emeritus professors state it this way:

Peer review is self-evidently useful in protecting established paradigms and disadvantaging challenges to entrenched scientific authority. Second, peer review, by controlling access to publication in the most prestigious journals helps to maintain the clearly recognised hierarchies of journals, of researchers, and of universities and research institutes. Peer reviewers should be experts in their field and will therefore have allegiances to leaders in their field and to their shared scientific consensus; conversely, there will be a natural hostility to challenges to the consensus, and peer reviewers have substantial power of influence (extending virtually to censorship) over publication in elite (and even not-so-elite) journals.

So imagine you're a scientist doing research on memory, consciousness, or cognition. You know that your chance of getting your paper published may be low if it challenges prevailing assumptions. So what do you do? You crank out papers and studies that conform with prevailing dogmas, such as the dogma that the brain generates the mind. And you know that you have to pile up lots of such papers, because the more papers you write, the better your chance of becoming a tenured professor. The result is a great number of poorly designed studies with dubious methodology, and a great number of studies drawing dubious conclusions that conform with prevailing dogmas.

A scientific paper analyzed 128 other scientific papers, looking for cases of spin (doubtful or debatable interpretation in the paper). The paper said the following:

Among the 128 assessed articles assessed, 107 (84 %) had at least one example of spin in their abstract. The most prevalent strategy of spin was the use of causal language, identified in 68 (53 %) abstracts.

So in more than half of the scientific papers the authors were making statements suggesting a causal relation that was "spin," and not directly implied by the data collected. This type of thing goes on all the time in neuroscience papers, where authors routinely draw causal conclusions or causal suggestions that are not justified by the observational data found in the paper. 

Many neuroscience studies follow this formula:

(1) A scientist will monitor the brain with something like a brain wave reader or an MRI scanner, while  a subject is doing some activity such as thinking, imagining or remembering.
(2)  The scientist will do some write-up based on the assumption that the brain was producing whatever the person was doing. 
(3) The research will get reported with some headline such as "New Insight as to How the Brain Thinks," or "What Your Brain Does to Produce Ideas," or "Scientists Shed Light on How Brains Remember Things."  

The fact that such studies show very little may be realized if you consider that exactly the same approach could be used by monitoring liver activity during a person's mental activities; and headlines could then be written up such as "How Your Liver Thinks" or "New Clues As to How the Liver Remembers Things." Monitoring the activity of Item X during Activity Y does nothing to prove that Item X is actually producing Activity Y.  

Very often, we will have news reports that do not involve any type of neuroscience research or any type of brain research, but which are represented as findings about the brain.  For example, a study may be done testing the memories or creativity of some people, without involving any type of brain scan. We may then see research presented with a headline such as "Study Shows Your Brain Remembers More If You Listen to Bach" or "Study Shows Your Brain Has More Ideas If You Use Google."  What is going on here? It's simply that the author assumed that memory is stored in the brain,  and that your brain is what is producing ideas. So we have "brain" used in the headline even though no research was done on the brain.  Such an approach to headline writing can be very misleading. 

An example of this ever-so-common sloppy reporting is a story published by the British Psychological Society with the title, "Our brains rapidly and automatically process opinions we agree with as if they are facts." The story was coverage of research that did nothing to study the brain, and which merely studied performance on a special test. The author has simply assumed that processing of opinions is something that is done by the brain -- something which we have no evidence of. We know that minds process opinions, but do not know that brains process them. 

Scientific studies that use small sample sizes are often unreliable, and often present false alarms, suggesting a causal relation when there is none. Such small sample sizes are particularly common in neuroscience studies, which often require expensive brain scans, not the type of thing that can be inexpensively done with many subjects. In 2013 the journal Nature Reviews Neuroscience published a paper entitled "Power failure: why small sample size undermines the reliability of neuroscience." Statistical power is the probability that a study will detect an effect that really exists; when power is low, a larger share of the "significant" results that do get reported are false alarms. The paper found that the statistical power of the average neuroscience study is between 8% and 31%. With such low statistical power, false alarms and false causal suggestions will be very common. The paper said, "It is possible that false positives heavily contaminate the neuroscience literature."
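
To get a rough feel for how strongly statistical power depends on sample size, here is a minimal sketch using a normal approximation for a two-group comparison. The function name approx_power, the effect size of 0.5 and the group sizes are illustrative assumptions, not figures taken from the paper discussed above.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Uses a normal approximation (an assumption made for simplicity);
    an exact t-test calculation gives somewhat lower power for small groups.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Expected size of the test statistic when the true effect is `effect_size`
    expected_z = effect_size * sqrt(n_per_group / 2)
    # Chance that the statistic clears the threshold in the expected direction
    return 1 - NormalDist().cdf(z_crit - expected_z)

# Hypothetical example: a "medium" effect of 0.5 standard deviations
for n in (8, 15, 30, 60):
    print(f"n per group = {n:2d}: power is roughly {approx_power(0.5, n):.2f}")
```

Under these assumptions, power only approaches the conventional 80 percent level when each group has several dozen subjects.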


An article on this important paper states the following:

The group discovered that neuroscience as a field is tremendously underpowered, meaning that most experiments are too small to be likely to find the subtle effects being looked for and the effects that are found are far more likely to be false positives than previously thought. It is likely that many theories that were previously thought to be robust might be far weaker than previously imagined

Scientific American reported on the paper with a headline of "New Study: Neuroscience Gets an 'F' for Reliability."

So, for example, when some neuroscience paper suggests that some part of your brain controls or mediates some mental activity, there is a large chance that this may simply be a false positive. As this paper makes clear, the more comparisons a study makes, the larger the chance of a false positive. The study has an example: if you test whether jelly beans cause acne, you'll probably get a negative result, but if your sample size is small, and you test 30 different colors of jelly bean, you'll probably be able to say something like "there's a possible link between green jelly beans and acne" -- simply because the more comparisons, the larger the chance of a false positive. So when a neuroscientist tries to look for some part of your brain that causes some mental activity, and makes 30 different comparisons using different brain regions, with a small sample, he'll probably come up with some link he can report as "such and such a region of the brain is related to this activity." But there will be a high chance this is simply a false positive.
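
The arithmetic behind the "more comparisons, more false positives" point can be sketched as follows. This is a simplified illustration that assumes every comparison is independent and tested at the usual 0.05 threshold; real brain regions are correlated, so the exact figures would differ. The function name is hypothetical.

```python
def chance_of_false_positive(num_comparisons, alpha=0.05):
    """Probability that at least one of several independent tests of a
    nonexistent effect comes out 'significant' purely by chance."""
    return 1 - (1 - alpha) ** num_comparisons

for k in (1, 10, 30):
    chance = chance_of_false_positive(k)
    print(f"{k:2d} comparisons: {chance:.0%} chance of at least one false positive")
```

With 30 comparisons the chance of at least one spurious "link" is close to 80 percent under these assumptions, even if nothing real is going on.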

The 2013 "Power Failure" paper discussed above was widely discussed in the neuroscience field, but a 2017 paper indicated that little or nothing had been done to fix the problem. Referring to an issue of the Nature Neuroscience journal, the author states, "Here I reproduce the statements regarding sample size from all 15 papers published in the August 2016 issue, and find that all of them except one essentially confess they are probably statistically underpowered," which is what happens when too small a sample size is used. 

A 2017 study entitled "Effect size and statistical power in the rodent fear conditioning literature -- A systematic review" looked at what percentage of 410 experiments used the standard of 15 animals per study group (needed for a moderately compelling statistical power of 80 percent).  The study found that only 12 percent of the experiments met such a standard.  What this basically means is that 88 percent of the experiments had low statistical power, and are not compelling evidence for anything.



The 2017 scientific paper "Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature" contains some analysis and graphs suggesting that neuroscience is less reliable than psychology. Below is a quote from the paper:

With specific respect to functional magnetic resonance imaging (fMRI), a recent analysis of 1,484 resting state fMRI data sets have shown empirically that the most popular statistical analysis methods for group analysis are inadequate and may generate up to 70% false positive results in null data. This result alone questions the published outcomes and interpretations of thousands of fMRI papers. Similar conclusions have been reached by the analysis of the outcome of an open international tractography challenge, which found that diffusion-weighted magnetic resonance imaging reconstructions of white matter pathways are dominated by false positive outcomes  Hence, provided that here we conclude that FRP [false report probability] is very high even when only considering low power and a general bias parameter (i.e., assuming that the statistical procedures used were computationally optimal and correct), FRP is actually likely to be even higher in cognitive neuroscience than our formal analyses suggest.

The paper draws a shocking conclusion that most published neuroscience results are false. The paper states the following: "In all, the combination of low power, selective reporting, and other biases and errors that have been well documented suggest that high FRP [false report probability] can be expected in cognitive neuroscience and psychology. For example, if we consider the recent estimate of 13:1 H0:H1 odds, then FRP [false report probability] exceeds 50% even in the absence of bias." The paper says of the neuroscience literature, "False report probability is likely to exceed 50% for the whole literature." 
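
The quoted claim about 13:1 odds can be checked with a simple Bayes-rule calculation. The sketch below assumes no bias and a 0.05 significance threshold; the function name is hypothetical, and whether this matches the paper's exact formula in every detail is an assumption.

```python
def false_report_probability(h0_h1_odds, power, alpha=0.05):
    """Share of 'significant' results that are false positives, assuming no bias.

    h0_h1_odds: how many tested hypotheses are false for each one that is true.
    """
    false_positives = alpha * h0_h1_odds  # false alarms per true effect tested
    true_positives = power                # detections per true effect tested
    return false_positives / (false_positives + true_positives)

for power in (0.2, 0.5, 0.8):
    frp = false_report_probability(13, power)
    print(f"power {power:.1f}: false report probability is about {frp:.0%}")
```

Under these assumptions, with 13:1 odds the false report probability stays above 50% for any power below 0.65.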

A scientific paper states the following:

In this article, we show that despite the nominal endorsement of a maximum false-positive rate of 5% (i.e., p ≤ .05), current standards for disclosing details of data collection and analyses make false positives vastly more likely. In fact, it is unacceptably easy to publish “statistically significant” evidence consistent with any hypothesis. The culprit is a construct we refer to as researcher degrees of freedom.

We can imagine how such "degrees of freedom" come into play for a neuroscience researcher. If you have brain scan data that you are trying to correlate with some behavior or mental phenomenon, you can pick hundreds of different areas of the brain for a comparison. Since each such area can be compared in several different ways, you have a choice of more than 1000 different things you might check to find some correlation. But the correlation found through such a fishing expedition will probably not be good evidence of anything.  Similarly, if I have the freedom to check any of 1000 different sections of Central Park flowers looking for a correlation between flower wilting and the New York Yankees losing, it won't be too hard to find one section which seems to show such a correlation. 
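
A small simulation can illustrate such a fishing expedition. The sketch below is purely hypothetical: it generates 1000 "measures" of random noise for 20 imaginary subjects, correlates each one with an equally random "behavior" score, and counts how many look statistically significant. All names and numbers are assumptions chosen for illustration.

```python
import random
from math import sqrt

NUM_SUBJECTS = 20
R_CRITICAL = 0.444  # approximate two-sided p < .05 cutoff for |r| with 20 subjects

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def fishing_expedition(num_measures=1000, seed=0):
    """Correlate many pure-noise measures with a pure-noise behavior score
    and return the correlations that pass the significance cutoff."""
    rng = random.Random(seed)
    behavior = [rng.gauss(0, 1) for _ in range(NUM_SUBJECTS)]
    hits = []
    for _ in range(num_measures):
        measure = [rng.gauss(0, 1) for _ in range(NUM_SUBJECTS)]
        r = pearson_r(measure, behavior)
        if abs(r) > R_CRITICAL:
            hits.append(r)
    return hits

hits = fishing_expedition()
print(f"{len(hits)} of 1000 pure-noise measures look 'significant'")
if hits:
    print(f"strongest spurious correlation: r = {max(hits, key=abs):.2f}")
```

Since roughly 5 percent of noise-only comparisons pass the cutoff, about 50 spurious "findings" are expected per 1000 checks, and a researcher who reports only the strongest one can make it look impressive.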

Given all these problems, reading a science news site requires a lot of critical judgment. The statements from such a source should be viewed with as much scrutiny as you would view statements from a politician. Whenever you read a report on a scientific experiment, always ask, “Did they really prove anything?”

But what about peer-reviewed review papers discussing some topic in neuroscience -- are these free from the problems we see in popular news stories about neuroscience? Not at all. Such papers will often contain many references to small-sample studies with inadequate statistical power, citing their findings as if they were reliable, which they are not. Almost never will such papers point out when a study had a too-small sample size.  So many a study that uses a too-small sample size creates a kind of "double radiation" of error.  Such a study may spur a dozen popular press accounts reporting an untrue result, and also a dozen mentions of the study in other scientific papers, reporting the same untrue result. 

Postscript: In my original post I used an assumption that 15 research animals per study group are needed for a moderately persuasive result. It seems that this assumption may have been too generous. In her post “Why Most Published Neuroscience Findings Are False,” Kelly Zalocusky PhD calculates (using Ioannidis’s data) that the median effect size of neuroscience studies is about .51. She then states the following, talking about statistical power: 

To get a power of 0.2, with an effect size of 0.51, the sample size needs to be 12 per group. This fits well with my intuition of sample sizes in (behavioral) neuroscience, and might actually be a little generous. To bump our power up to 0.5, we would need an n of 31 per group. A power of 0.8 would require 60 per group.
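
The sample sizes quoted above can be roughly checked with a textbook normal-approximation formula. This is only a sketch: the function name is hypothetical, and the quoted figures were presumably computed with a different (t-test based) method, so small differences are expected.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, power, alpha=0.05):
    """Approximate subjects needed per group for a two-sided, two-group
    comparison (normal approximation, an assumption made for simplicity)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

for target_power in (0.2, 0.5, 0.8):
    n = n_per_group(0.51, target_power)
    print(f"power {target_power:.1f}: about {n} per group")
```

This approximation gives roughly 10, 30 and 61 per group, in the same range as the 12, 31 and 60 quoted above.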

If we describe a power of .5 as being moderately convincing, it therefore seems that 31 animals per study group are needed for a neuroscience study to be moderately convincing. But most experimental neuroscience studies involving rodents and memory use far fewer than 15 animals per study group.

Zalocusky states the following:


If our intuitions about our research are true, fellow graduate students, then fully 70% of published positive findings are “false positives”. This result furthermore assumes no bias, perfect use of statistics, and a complete lack of “many groups” effect. (The “many groups” effect means that many groups might work on the same question. 19 out of 20 find nothing, and the 1 “lucky” group that finds something actually publishes). Meaning -- this estimate is likely to be hugely optimistic.

See my post "The Building Blocks of Bad Science Literature" in which I discuss 17 ways in which science literature might create misleading impressions and ideas.
