Tuesday, April 21, 2026

The "Research Dystopia" of Dogma-Driven Neuroscience Experimentation

A dystopia is a fictional world in which things have gone horribly wrong. You might use the term "research dystopia" to describe certain fields of scientific research in which researchers are dedicated to proving untrue or implausible dogmas, using poor methods of experimentation or analysis. Such a research dystopia is largely a world of fiction, in which false or implausible claims keep being repeated. It is also a world in which things have gone horribly wrong, because poor techniques of scientific experimentation and scientific analysis predominate.

Sadly, the field of research known as cognitive neuroscience is one you could call a research dystopia without being too far off the mark. It is largely a world of fiction, in which researchers keep making untrue claims about brains being the source of minds and brains being the storage place of memories. And it is a world in which things have gone horribly wrong, because researchers keep churning out miserably designed studies guilty of various types of Questionable Research Practices.

The latest evidence that cognitive neuroscience research is a research dystopia can be found in a press release on the clickbait-heavy site earth.com, and in the scientific paper that press release is promoting. The press release has the very untrue title "Scientists can now 'edit' brain circuits to enhance memory."  We read this very false claim: "New research shows that trimming specific synapses in a mouse brain circuit can strengthen memories and help them last longer." We read about some weird experiment in which scientists fiddled with synapses in the brains of a few mice. 

Making the claim that a standard measure of memory was used (a claim that is untrue, for reasons I will soon explain), the press release says this:

"Mice with edited hippocampal circuits froze more during recall tests, a standard memory measure.  With mild training, that advantage appeared two days after learning and remained 23 days later, strengthening both recent and long-term memory. With more intense training, the treated group held steady while controls faded, so the difference was not just a lucky one-off."

Helping to create the illusion that some reliable research was done, the press release makes no mention of the number of mice used in the experiment. A look at the scientific paper being discussed by the press release gives us the answer to that question. The paper is the very low-quality one here, entitled "Remodeling synaptic connections via engineered neuron-astrocyte interactions." In the paper we read that the number of mice tested was ridiculously low. The study group sizes were way too small: only 3 mice or only 6 mice per group. No study of this type should be taken seriously unless the study group sizes are at least 15 or 20 animals per group. You do not have any decent evidence of a real effect if you merely use 6 animals per group in a study comparing performance between altered mice and unaltered mice. It is way, way too easy to get a false alarm using a study group size that small.

Below is a graph from the paper, found in Figure 8:

[Graph from Figure 8 of the paper, showing each mouse's "freezing behavior" as a dot]
This is what the paper is offering as its main evidence for a change in memory performance produced by the brain fiddling the experimenters did. Each of the dots represents the claimed "freezing behavior" of one mouse in only one trial. By counting the dots, we can see that the study group sizes were only 6 mice.

The paper "Prevalence of Mixed-methods Sampling Designs in Social Science Research" has a Table 2 giving recommendations for minimum study group sizes for different types of research. The minimum number of subjects for an experimental study is 21 subjects per study group. 


We simply cannot take seriously any study of this type using such a way-too-small study group of only six mice per group. Using a study group size that small, it is way, way too easy to get a false alarm result, purely by chance. Similarly, if I test the effectiveness of rubbing a lucky rabbit's foot charm in two groups of six people, and one of the groups reports having better luck on the few days of the test, that is no decent evidence that rubbing a rabbit's foot charm increases luck. It is way, way too easy to get such a result from pure chance.
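To make this concrete, below is a rough simulation sketch written in Python. Every number in it is my own illustrative assumption (two groups drawn from the same distribution, with a gap of half a standard deviation treated as an "impressive-looking" difference); nothing in it comes from the paper being discussed.

# Illustrative simulation (my own toy numbers, not taken from the paper):
# how often do two groups drawn from the SAME distribution end up looking
# like they differ by half a standard deviation or more, purely by chance?
import numpy as np

rng = np.random.default_rng(0)
n_runs = 100_000

def chance_of_big_gap(n_per_group):
    """Fraction of runs where the group means differ by at least 0.5 SD
    even though there is no real effect at all."""
    a = rng.standard_normal((n_runs, n_per_group)).mean(axis=1)
    b = rng.standard_normal((n_runs, n_per_group)).mean(axis=1)
    return np.mean(np.abs(a - b) >= 0.5)

for n in (6, 20):
    print(f"n = {n} per group: chance of a 0.5 SD gap or bigger = {chance_of_big_gap(n):.0%}")

# Roughly 39% of runs show such a gap with 6 animals per group, versus
# roughly 11% with 20 animals per group.

With only six animals per group, a gap that looks like a sizable "effect" appears in a large fraction of experiments in which nothing real is going on at all.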

Another reason why the reported result is worthless as evidence is that it used the utterly unreliable technique of trying to judge memory performance by estimating the "freezing behavior" of mice. That is not a reliable technique for judging fear or recall in rodents, for reasons explained at length in my post here.

The press release promoting this very low-quality paper claims that "a standard memory measure" was used. That is not correct. Although very often used in the dysfunctional world of rodent neuroscience research, the technique of attempting to measure "freezing behavior" in rodents involves no standard measurement protocol at all. The long appendix at the end of this post documents the utter lack of standards in such "freezing behavior" estimations. And when "freezing behavior" estimations occur, it is not even memory that is being measured. What is being measured is what percent of some time interval a rodent is not moving.
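For readers unfamiliar with the measure, here is a minimal sketch of what such a "freezing percentage" amounts to. It is my own illustration, not code from any study discussed here, and the motion trace, threshold, and time window are all made-up assumptions.

# Minimal sketch of a "freezing percentage" calculation (my own illustration,
# not code from any study discussed here). The motion signal, threshold,
# and scoring window are all hypothetical.
import numpy as np

def freezing_percentage(motion, fps, window_s, motion_threshold):
    """Percent of the first window_s seconds in which the frame-to-frame
    motion signal stays below motion_threshold (the animal is "immobile")."""
    frames = motion[: int(window_s * fps)]
    return 100.0 * np.mean(frames < motion_threshold)

# A fake motion trace: 5 minutes of video at 30 frames per second.
rng = np.random.default_rng(1)
motion = rng.exponential(scale=1.0, size=5 * 60 * 30)

print(freezing_percentage(motion, fps=30, window_s=60, motion_threshold=0.5))

Nothing in such a number is a direct measurement of memory; it is simply a tally of immobility, and the result depends entirely on the window and threshold chosen.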

Neuroscientists love the technique of "freezing behavior" estimations, because it is a "see whatever you are hoping to see" type of technique, in which the desired positive result can almost always be claimed, by fiddling around with how the "freezing behavior" estimation occurs. The lack of any real standard in such estimations is only part of the reason why "freezing behavior" estimations are an utterly unreliable technique for measuring fear or recall in rodents.



We have in the very poor-quality paper "Remodeling synaptic connections via engineered neuron-astrocyte interactions" no decent evidence that manipulating synaptic connections has any effect on memory. The experimenters used a study group size so small that the result would not be good evidence of a memory change even if a reliable technique had been used to measure recall. And no such reliable technique was used, only the worthless, unreliable technique of attempting to judge "freezing behavior." The authors might have discovered how way too small their study group sizes were if they had done a sample size calculation. But they make no mention of doing such a calculation.

Below are some quotes mentioning the use of too-small study group sizes and too-low statistical power in neuroscience studies. All references to underpowered studies are references to studies using too-small study group sizes. 

  • "Postmortem studies need n = 26 subjects to detect the same effect 80 % of the time, while MRI studies need n = 84 subjects; thus, most individual MRI studies and both postmortem studies were underpowered." (Link)
  • "The median neuroimaging study sample size is about 25...Reproducible brain-wide association studies require thousands of individuals." (Link)
  • "Critical appraisal indicated that studies were underpowered, did not match cases with controls and failed to account for confounding factors." (Link)
  • "Power calculations suggested that studies were underpowered." (Link)
  • "The small sample sizes of the current literature make it very likely that studies were underpowered, resulting in a host of issues such as imprecise association estimates, imprecise estimated effect sizes, low reproducibility, and reduced chances of detecting a true effect or, conversely, that 'detected' effects are indeed true." (Link)
  • "Most validation studies were underpowered and hence may have given a misleading impression of accuracy."  (Link)
  • "We reviewed 513 behavioral, systems and cognitive neuroscience articles in five top-ranking journals (Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience) and found that 78 used the correct procedure and 79 used the incorrect procedure. An additional analysis suggests that incorrect analyses of interactions are even more common in cellular and molecular neuroscience." (Link)
The study here concludes, "Our results indicate that the median statistical power in neuroscience is 21%." This is an abysmal, appalling figure. It has long been said that in experimental research the goal should be a statistical power of 80%, meaning roughly an 80% chance of detecting an effect that is really there (and hence, under similar conditions, roughly an 80% chance that a true result will replicate). A study with a statistical power of 21% is a low-quality study that is likely to be announcing a false alarm. When a research field has a median statistical power of 21%, that means half of its studies have a statistical power of 21% or less. If such an estimation is correct, it means the great majority of neuroscience studies report results that are unreliable or untrue.
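As a rough back-of-the-envelope illustration of what such power figures mean, here is a small simulation. All of the numbers in it are my own assumptions (a true effect of half a standard deviation, two equal groups, a 0.05 significance threshold); none of them is taken from the power survey or from the mouse paper.

# Rough power illustration (all numbers are my own assumptions: a true effect
# of 0.5 SD, two equal groups, alpha = 0.05; nothing here is from the papers
# discussed above).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_runs = 20_000
effect = 0.5  # assumed true group difference, in standard deviations

def estimated_power(n_per_group):
    """Fraction of simulated experiments in which a t-test detects the effect."""
    hits = 0
    for _ in range(n_runs):
        treated = rng.standard_normal(n_per_group) + effect
        control = rng.standard_normal(n_per_group)
        if ttest_ind(treated, control).pvalue < 0.05:
            hits += 1
    return hits / n_runs

for n in (6, 20, 64):
    print(f"n = {n} per group: estimated power is about {estimated_power(n):.0%}")

# Under these assumptions the power is roughly 12% with 6 animals per group,
# roughly 33% with 20, and only around 64 animals per group reaches the
# traditional 80% target.

On these assumptions, an experiment run with a handful of animals per group is far more likely to miss a real effect than to find it, and any "significant" result it does produce is disproportionately likely to be noise.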

The combination of very bad research practices and the enormous bias of researchers eagerly trying to prove old, untenable dogmas about brains makes the field of neuroscience experimentation something you might call a research dystopia,  a kind of experimental wasteland. 

Appendix: The Lack of Any Standards in "Freezing Behavior" Estimations

 A paper describing variations in how "freezing behavior" is judged reveals that no standard is being followed. The paper is entitled "Systematic Review and Methodological Considerations for the Use of Single Prolonged Stress and Fear Extinction Retention in Rodents." The paper has the section below telling us that statistical techniques to judge "freezing behavior" in rodents are "all over the map," with no standard statistical method being used:

"For example, studies using cued fear extinction retention testing with 10 cue presentations reported a variety of statistical methods to evaluate freezing during extinction retention. Within the studies evaluated, approaches have included the evaluation of freezing in individual trials, blocks of 2–4 trials, and subsets of trials separated across early and late phases of extinction retention. For example, a repeated measures analysis of variance (RMANOVA) of baseline and all 10 individual trials was used in Chen et al. (2018), while a RMANOVA was applied on 10 individual trials, without including baseline freezing, in Harada et al. (2008). Patterns of trial blocking have also been used for cued extinction retention testing across 10 trials, including blocks of 2 and 4 trials (Keller et al., 2015a). Comparisons within and across an early and late phase of testing have also been used, reflecting the secondary extinction process that occurs during extinction retention as animals are repeatedly re-exposed to the conditioned cue across the extinction retention trials. For example, an RMANOVA on trials separated into an early phase (first 5 trials) and late phase (last 5 trials) was used in Chen et al. (2018) and Chaby et al. (2019). Similarly, trials were averaged within an early and late phase and measured with separate ANOVAs (George et al., 2015). Knox et al. (2012a,b) also averaged trials within an early and late phase and compared across phases using a two factors design.

Baseline freezing, prior to the first extinction retention cue presentation, has been analyzed separately and can be increased by SPS (George et al., 2015) or not affected (Knox et al., 2012b; Keller et al., 2015a). To account for potential individual differences in baseline freezing, researchers have calculated extinction indexes by subtracting baseline freezing from the average percent freezing across 10 cued extinction retention trials (Knox et al., 2012b). In humans, extinction retention indexes have been used to account for individual differences in the strength of the fear association acquired during cued fear conditioning (Milad et al., 2007, 2009; Rabinak et al., 2014; McLaughlin et al., 2015) and the strength of cued extinction learning (Rabinak et al., 2014).

In contrast with the cued fear conditioning studies evaluated, some studies using contextual fear conditioning used repeated days of extinction training to assess retention across multiple exposures. In these studies, freezing was averaged within each day and analyzed with a RMANOVA or two-way ANOVA across days (Yamamoto et al., 2008; Matsumoto et al., 2013; Kataoka et al., 2018). Representative values for a trial day are generated using variable methodologies: the percentage of time generated using sampling over time with categorically handscoring of freezing (Kohda et al., 2007), percentage of time yielded by a continuous automated software (Harada et al., 2008), or total seconds spent freezing (Imanaka et al., 2006; Iwamoto et al., 2007). Variability in data processing, trial blocking, and statistical analysis complicate meta-analysis efforts, such that it is challenging to effectively compare results of studies and generate effects size estimates despite similar methodologies."

As for the techniques used to judge so-called "freezing behavior" in rodents, they are "all over the map," with the widest variation between researchers. The paper tells us this:

"Another source of variability is the method for the detection of behavior during the trials (detailed in Table 1). Freezing behavior is quantified as a proxy for fear using manual scoring (36% of studies; 12/33), automated software (48% of studies; 16/33), or not specified in 5 studies (15%). Operational definitions of freezing were variable and provided in only 67% of studies (22/33), but were often explained as complete immobility except for movement necessary for respiration. Variability in freezing measurements, from the same experimental conditions, can derive from differential detection methods. For example, continuous vs. time sampling measurements, variation between scoring software, the operational definition of freezing, and the use of exclusion criteria (considerations detailed in section Recommendations for Freezing Detection and Data Analysis). Overall, 33% of studies did not state whether the freezing analysis was continuous or used a time sampling approach (11/33). Of those that did specify, 55% used continuous analysis and 45% used time sampling (12/33 and 10/33, respectively). Several software packages were used across the 33 studies evaluated: Anymaze (25%), Freezescan (14%), Dr. Rat Rodent's Behavior System (7%), Packwin 2.0 (4%), Freezeframe (4%), and Video Freeze (4%). Software packages vary in the level of validation for the detection of freezing and the number and role of automated vs. user-determined thresholds to define freezing. These features result in differential relationships between software vs. manually coded freezing behavior (Haines and Chuang, 1993Marchand et al., 2003Anagnostaras et al., 2010). Despite the high variability that can derive from software thresholds (Luyten et al., 2014), threshold settings are only occasionally reported (for example in fear conditioning following SPS). There are other software features that can also affect the concordance between freezing measure detected manually or using software, including whether background subtraction is used (Marchand et al., 2003) and the quality of the video recording (frames per second, lighting, background contrast, camera resolution, etc.; Pham et al., 2009), which were also rarely reported. These variables can be disseminated through published protocols, supplementary methods, or recorded in internal laboratory protocol documents to ensure consistency between experiments within a lab. Variability in software settings can determine whether or not group differences are detected (Luyten et al., 2014), and therefore it is difficult to assess the degree to which freezing quantification methods contribute to variability across SPS studies with the current level of detail in reporting. Meuth et al. (2013) tested the differences in freezing measurements across laboratories by providing laboratories with the same fear extinction videos to be evaluated under local conditions. They found that some discrepancies between laboratories in percent freezing detection reached 40% between observers, and discordance was high for both manual and automated freezing detection methods." 

It's very clear from the quotes above that once a neuroscience researcher has decided to use "freezing behavior" to judge the amount of fear or recall in mice, then he pretty much has a nice little "see whatever I want to see" situation. Since no standard protocol is being used in these estimations of so-called "freezing behavior," a neuroscientist can pretty much report exactly whatever he wants to see in regard to "freezing behavior," by just switching around the way in which "freezing behavior" is estimated, until the desired result appears. We should not make here the mistake of assuming that those using automated software for judging "freezing behavior" are getting objective results.  Most software has user-controlled options that a user can change to help him see whatever he wants to see. 

When "freezing behavior" judgments are made, there are no standards in regard to how long a length of time an animal should be observed when recording a "freezing percentage"  (a percentage of time the animal was immobile). An experimenter can choose any length of time between 30 seconds and five minutes or more (even though it is senseless to assume rodents might "freeze in fear" for as long as a minute).  Neuroscience experiments typically fail to pre-register experimental methods, leaving experimenters to make analysis choices "on the fly." So you can imagine how things work. An experimenter might judge how much movement occurred during five minutes or ten minutes after a rodent was exposed to a fear stimulus. If a desired above-average amount of immobility (or a desired below-average amount of immobility) occurred over 30 seconds, then 30 seconds would be chosen as the interval to be used for a "freezing percentage" graph. Otherwise,  if a desired above-average amount of immobility (or a desired below-average amount of immobility) occurred over 60 seconds, then 60 seconds would be chosen as the interval to be used for a "freezing percentage" graph. Otherwise,  if a desired above-average amount of immobility (or a desired below-average amount of immobility) occurred over two minutes, then two minutes would be chosen as the interval to be used for a "freezing percentage" graph. And so on and so forth, up until five minutes or ten minutes. If the researcher still has no "more freezing" effect he can report, the researcher can always do something like report on only the last minute of a larger time length, or the last two minutes, or the last three minutes, or the last four minutes. 

Researchers can also arbitrarily choose how long a stretch of immobility must last before it is counted as "freezing" and added to the "freezing percentage" figure. That minimum can be 1 second or 2 seconds or any number of seconds between 1 and 10.
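To make this analytic flexibility concrete, here is a toy sketch written in Python. It is entirely my own illustration: the motion data is fake, the menus of scoring windows and minimum bout durations are made up, and the code is not taken from any study. It simply scans the available choices and keeps whichever combination makes the group difference look biggest.

# Toy sketch of the analytic flexibility described above. Entirely my own
# illustration: the data is fake and the menus of scoring windows and
# minimum bout durations are made up; this is not code from any study.
import itertools
import numpy as np

rng = np.random.default_rng(2)

def fake_immobility_trace(seconds=600, p_stay=0.8, p_start=0.2):
    """Made-up per-second immobility states with some persistence."""
    state, trace = False, []
    for _ in range(seconds):
        state = rng.random() < (p_stay if state else p_start)
        trace.append(state)
    return np.array(trace)

def freezing_pct(trace, window_s, min_bout_s):
    """Percent of the first window_s seconds spent in immobility bouts
    lasting at least min_bout_s seconds."""
    window = trace[:window_s]
    total, run = 0, 0
    for is_frozen in np.append(window, False):
        if is_frozen:
            run += 1
        else:
            if run >= min_bout_s:
                total += run
            run = 0
    return 100.0 * total / window_s

# Two groups of 6 fake mice drawn from the SAME process -- no real effect.
group_a = [fake_immobility_trace() for _ in range(6)]
group_b = [fake_immobility_trace() for _ in range(6)]

windows = [30, 60, 120, 300, 600]   # candidate scoring windows, in seconds
bouts = [1, 2, 5, 10]               # candidate minimum bout durations, in seconds

def group_gap(window_s, min_bout_s):
    """Absolute difference between the two groups' mean freezing percentages."""
    a = np.mean([freezing_pct(t, window_s, min_bout_s) for t in group_a])
    b = np.mean([freezing_pct(t, window_s, min_bout_s) for t in group_b])
    return abs(a - b)

best = max(itertools.product(windows, bouts), key=lambda p: group_gap(*p))
print("Scoring choice that maximizes the apparent group difference:", best)
print(f"Apparent gap with that choice: {group_gap(*best):.1f} percentage points")

Because the two groups here are generated from exactly the same process, any gap the scan finds is pure noise; yet the scan will always return whichever scoring choice makes the gap look largest.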

Because there are 20 or 30 or 50 different ways in which the data can be analyzed, each with roughly a 50% chance of yielding the desired result, it is almost certain that the researcher will be able to report some "higher freezing level," even if the tested interventions or manipulations had no real effect on memory. Such shenanigans drastically depart from good, honest, reliable experimental methods.
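The arithmetic behind that "almost certain" is simple, if we take the paragraph's rough numbers at face value and (as a simplifying assumption) treat the analysis choices as independent:

# Chance that at least one of k analysis choices "works," assuming each has
# about a 50% chance and (a simplifying assumption) the choices are independent.
for k in (20, 30, 50):
    print(f"{k} analysis choices: chance at least one succeeds = {1 - 0.5 ** k:.8f}")
# Even with only 20 choices the chance is about 0.999999 -- near certainty.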
