(1) Put a small number of subjects in an MRI brain scanner, and either have them do some mental task or expose them to some kind of mental stimulus.
(2) Use the brain scanner to make images of the brain during such activity.
(3) Then analyze the brain scans, looking for some area of higher activation.
Sleazy and misleading techniques are often used to present the data from such studies. In particular, visualization tricks are very often used that make very small differences in brain signal strength look like very big differences. A discussion of such techniques, which I call "lying with colors," can be read here.
Claims that particular regions of the brain show larger activity during certain mental activities are typically not well-replicated in follow-up studies. A book by a cognitive scientist states this (pages 174-175):
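To see how easily this trick works, consider the minimal Python sketch below (all numbers are invented for illustration, not taken from any actual study). The same simulated image is displayed twice; clipping the color scale to a narrow range around the data makes a mere half-percent signal change glow like a dramatic hot spot:

```python
# Illustrative sketch (invented numbers): how color-scale choices can make
# a tiny signal difference look dramatic.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
baseline = rng.normal(100.0, 1.0, size=(32, 32))   # simulated signal with ~1% noise
active = baseline.copy()
active[12:20, 12:20] += 0.5                        # a mere 0.5% "activation"

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
# Honest scale: the full signal range; the 0.5% bump is barely visible.
axes[0].imshow(active, cmap="hot", vmin=0, vmax=110)
axes[0].set_title("Full scale (0 to 110)")
# Clipped scale: the color range is squeezed around the data, so the same
# 0.5% change lights up like a dramatic "hot spot".
axes[1].imshow(active, cmap="hot", vmin=99.5, vmax=101.5)
axes[1].set_title("Clipped scale (99.5 to 101.5)")
plt.show()
```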
"The empirical literature on brain correlates of emotion is wildly inconsistent, with every part of the brain showing some activity correlated with some aspect of emotional behavior. Those experiments that do report a few limited areas are usually in conflict with each other....There is little consensus about what is the actual role of a particular region. It is likely that the entire brain operates in a coordinated fashion, complexly interconnected, so that much of the research on individual components is misleading and inconclusive."
An article on neurosciencenews.com states the following:
"Small sample sizes in studies using functional MRI to investigate brain connectivity and function are common in neuroscience, despite years of warnings that such studies likely lack sufficient statistical power. A new analysis reveals that task-based fMRI experiments involving typical sample sizes of about 30 participants are only modestly replicable. This means that independent efforts to repeat the experiments are as likely to challenge as to confirm the original results."
There have been statistical critiques of brain imaging studies. One critique found a common statistical error that “inflates correlations.” The paper stated, “The underlying problems described here appear to be common in fMRI research of many kinds—not just in studies of emotion, personality, and social cognition.”
Another critique of neuroimaging found that a "double dipping" statistical error was very common. And New Scientist reported a software problem in an article entitled "Thousands of fMRI brain studies in doubt due to software flaws."
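"Double dipping" is circular analysis: using the same data both to pick out a brain region and to measure the strength of its effect. A simple simulation (purely illustrative, using made-up noise data) shows how this manufactures impressive correlations out of nothing:

```python
# Illustrative sketch of "double dipping": select the voxel that correlates
# best with a behavioral score, then report the correlation computed on the
# very same data. With pure noise, the "selected" correlation still looks big.
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_voxels = 20, 5000
voxels = rng.normal(size=(n_subjects, n_voxels))   # pure noise "brain data"
behavior = rng.normal(size=n_subjects)             # unrelated behavioral score

# Correlate every voxel with behavior and pick the strongest one.
r = np.array([np.corrcoef(voxels[:, v], behavior)[0, 1] for v in range(n_voxels)])
best = np.argmax(np.abs(r))
print(f"Best voxel, same data: r = {r[best]:.2f}")   # typically 0.7 or higher

# An independent replication sample shows the "effect" was illusory.
voxels2 = rng.normal(size=(n_subjects, n_voxels))
behavior2 = rng.normal(size=n_subjects)
r2 = np.corrcoef(voxels2[:, best], behavior2)[0, 1]
print(f"Same voxel, fresh data: r = {r2:.2f}")       # hovers near zero
```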
Flaws in brain imaging studies were highlighted by a study that found "correlates of consciousness" by running an fMRI brain scan on a dead salmon. See here for an image summarizing the study. The dead salmon study highlighted what is called the multiple comparisons problem: the more comparisons you make between some region of the brain and an average, the more likely you are to find a false positive, simply because of chance variations. A typical brain scan study makes many such comparisons, creating a high chance of false positives, as the simulation below illustrates.
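You can verify the multiple comparisons problem for yourself with a few lines of Python (again, an illustrative simulation, not real scan data). Testing thousands of pure-noise "voxels" at the conventional p < 0.05 threshold yields hundreds of false positives:

```python
# Illustrative sketch of the multiple comparisons problem: test thousands of
# voxels at p < 0.05 with no real signal anywhere, and "significant" voxels
# appear by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_voxels, n_scans = 10000, 30
noise = rng.normal(size=(n_voxels, n_scans))   # no real activation anywhere

# One-sample t-test per voxel against a mean of zero.
t, p = stats.ttest_1samp(noise, 0.0, axis=1)
false_positives = np.sum(p < 0.05)
print(f"{false_positives} of {n_voxels} voxels 'activated' by chance")
# Expect about 5% of 10,000, i.e. roughly 500 false positives.
```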
Considering the question of "How Much of the Neuroimaging Literature Should We Discard?" a PhD and lab director states, "Personally I'd say I don't really believe about 95% of what gets published...I think claims of 'selective' activation are almost without exception completely baseless." This link says that a study, "published open-access in the Proceedings of the National Academy of Sciences, suggests that the methods used in fMRI research can create the illusion of brain activity where there is none—up to 70% of the time."
A newer study has raised additional concerns about the use of brain imaging in neuroscience. The study was announced in a Duke University press release entitled "Studies of Brain Activity Aren't as Useful as Scientists Thought." The study includes a meta-analysis which looked at how reliably a region of higher brain activation recurs when the same subject's brain is scanned at two different times.
What neuroscientists would like is for two scans of a person's brain, taken on two different days while the person engages in the same activity or is exposed to the same stimulus, to show the same result. But that doesn't happen. We read the following in the press release, which quotes Ahmad R. Hariri:
"Hariri said the researchers recognized that 'the correlation between one scan and a second is not even fair, it’s poor.'...For six out of seven measures of brain function, the correlation between tests taken about four months apart with the same person was weak....Again, they found poor correlation from one test to the next in an individual. The bottom line is that task-based fMRI in its current form can’t tell you what an individual’s brain activation will look like from one test to the next, Hariri said....'We can’t continue with the same old ‘"hot spot" research,' Hariri said. “We could scan the same 1,300 undergrads again and we wouldn’t see the same patterns for each of them.”
The press release is talking about a scientific study by Hariri and others that can be read here or here. The study is entitled, "What is the test-retest reliability of common task-fMRI measures? New empirical evidence and a meta-analysis." The study says, "We present converging evidence demonstrating poor reliability of task-fMRI measures...A meta-analysis of 90 experiments (N=1,008) revealed poor overall reliability."
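The reliability measure at issue here is the intraclass correlation coefficient (ICC), on which values below 0.4 are conventionally rated "poor." The sketch below (with invented numbers, assuming session-to-session noise three times larger than any stable per-subject signal) shows why such a measure lands deep in the "poor" range:

```python
# Illustrative sketch (assumed noise levels, not the study's data): why a
# measure dominated by session noise yields a "poor" ICC (below 0.4).
import numpy as np

rng = np.random.default_rng(3)
n_subjects = 45
true_signal = rng.normal(0.0, 1.0, size=n_subjects)   # stable trait per subject
# Two scan sessions; session noise is 3x larger than the stable signal.
scan1 = true_signal + rng.normal(0.0, 3.0, size=n_subjects)
scan2 = true_signal + rng.normal(0.0, 3.0, size=n_subjects)

def icc_oneway(a, b):
    """ICC(1) from a one-way ANOVA on two measurements per subject."""
    data = np.stack([a, b], axis=1)
    n, k = data.shape
    subject_means = data.mean(axis=1)
    grand_mean = data.mean()
    ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((data - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

print(f"ICC = {icc_oneway(scan1, scan2):.2f}")  # typically ~0.1, far below 0.4
```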
In a neuroscience study, the sample size is the number of subjects (animal or human) tested. Figure 1 of the Hariri study deserves careful attention. It has three graphs comparing the kind of sample sizes needed to get reliable results in brain imaging studies (ranging between 100 and 1000) to the median sample size of brain imaging studies (listed as only 25). This highlights a problem I have written about many times: the sample sizes used in neuroscience studies are typically far too small to produce reliable results. As it happens, the problem is even worse than Figure 1 depicts, because the median sample size of a neuroscience study is actually much less than 25. According to the paper here, "Highly cited experimental and clinical fMRI studies had similar median sample sizes (medians in single group studies: 12 and 14.5; median group sizes in multiple group studies: 11 and 12.5)."
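A quick simulation makes clear how little statistical power such sample sizes buy. (The assumed "medium" effect size of d = 0.5 is illustrative, not taken from any particular study.)

```python
# Illustrative power simulation: how often does a one-sample design detect a
# real effect of assumed size d = 0.5 at p < 0.05, with the field's median
# sample size versus a larger one?
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def power(n, d=0.5, trials=5000):
    """Fraction of simulated experiments reaching p < 0.05."""
    hits = 0
    for _ in range(trials):
        sample = rng.normal(d, 1.0, size=n)   # true effect of size d exists
        if stats.ttest_1samp(sample, 0.0).pvalue < 0.05:
            hits += 1
    return hits / trials

print(f"n = 12:  power ~ {power(12):.2f}")    # roughly 0.35: mostly misses
print(f"n = 100: power ~ {power(100):.2f}")   # near 1.0: almost always detects
```

At the median sample size, a real medium-sized effect is missed about two times out of three, and any "hits" that do reach significance tend to overestimate the effect.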
Neuroscientists have known about this shortcoming for years. It has been pointed out many times that the sample sizes used in neuroscience studies are typically far too small for reliable results. But our neuroscientists keep grinding out countless studies with too little statistical power. In the prevailing culture of academia, you are rewarded for the number of papers published with your name on them, and not much attention is paid to the reliability of such studies. So if you are a professor with a budget sufficient to fund either one relatively reliable study with fMRI scans of 100 subjects, or 10 little low-reliability studies with only 10 subjects each, the prevailing rewards system of academia makes it a better career move to do the 10 unreliable studies yielding 10 separate papers rather than the single study yielding a single paper.
Figure 5 of the Hariri study is also interesting. It rates reliability in various tests of mental activity performed while subjects had their brains scanned at two different times. There's data for a single task involving memory, which failed to reach a reliability of either "excellent" or "good"; that task involved a retest of only 20 subjects. On the left of the figure, we have results for an Executive Function (EF) test tried twice on 45 subjects, and a "relational" test tried twice on 45 subjects. The relational test is discussed here. In the test you have to look at some visual figures, and mentally discern whether the type of transformation (either shape or texture) that occurred in a first row of figures was also the transformation used in a second row of figures.
So we have here the interesting case of two thinking tasks applied to 45 subjects on two different days, while their brains were scanned. This makes a better-than-average test of whether some brain region is reliably activated more strongly during thinking.
The result was a flop and a failure for the hypothesis that your brain produces thinking. In the Executive Function test (corresponding to the third column of circles shown below), none of the 8 brain regions examined produced greater activation with a reliability rated Excellent, Good, or even Fair. The same was true of the relational test (corresponding to the fifth column of circles shown below). The figure is shown below:
Figure 5 of the Hariri study (link)
The brain regions used in the tests graphed above were not random brain regions, but were typically the regions thought most likely to produce a correlation.
Such results are quite consistent with the claim I have long made on this blog that the brain is not the source of the human mind, and is not the source of human thinking.
Postscript: The study "Reliability and stability challenges in ABCD task fMRI data" (by James T. Kennedy and others) reports results like those of the Hariri study, but even more damaging to claims of reliability in brain scanning studies. The Kennedy study examined how well retest scans of thousands of individuals matched the results of original scans done days or months earlier. Using a numerical measure on which any result less than 0.4 means "poor," with little replication between different scans of the same subject, we read the following ridiculously bad results:
"Reliability and stability (quantified as the ratio of non-scanner related stable variance to all variances) was poor in virtually all brain regions, with an average value of 0.088 and 0.072 for short term (within-session) reliability and long-term (between-session) stability, respectively, in regions of interest (ROIs) historically-recruited by the tasks. Only one reliability or stability value in ROIs exceeded the ‘poor’ cut-off of 0.4, and in fact rarely exceeded 0.2 (only 4.9%)."
What this means is that the brain scans showed virtually no replication between scans done on the same individual doing the same task on different days. The result is consistent with the idea that brain scans merely pick up random fluctuations, without revealing any evidence for claims that brains make minds.