A dystopia is a fictional world in which things have gone horribly wrong. You might use the term "research dystopia" to describe a field of scientific research in which researchers are dedicated to propping up untrue or implausible dogmas through poor methods of experimentation or analysis. Such a field is largely a world of fiction, in which false or implausible claims keep being repeated; and in it things have gone horribly wrong, because poor techniques of scientific experimentation and analysis predominate.
Sadly, you could call the field of cognitive neuroscience a research dystopia without being too far off the mark. It is largely a world of fiction, in which researchers keep making untrue claims about brains being the source of minds and the storage place of memories. And it is a world in which things have gone horribly wrong, because researchers keep churning out miserably designed studies guilty of various types of Questionable Research Practices.
The latest evidence that cognitive neuroscience research is a research dystopia can be found in a press release on the clickbait-heavy site earth.com, and in the scientific paper that press release is promoting. The press release has the very untrue title "Scientists can now 'edit' brain circuits to enhance memory." We read this very false claim: "New research shows that trimming specific synapses in a mouse brain circuit can strengthen memories and help them last longer." We read about some weird experiment in which scientists fiddled with synapses in the brains of a few mice.
Making the claim that a standard measure of memory was used (a claim that is untrue, for reasons I will soon explain), the press release says this:
"Mice with edited hippocampal circuits froze more during recall tests, a standard memory measure. With mild training, that advantage appeared two days after learning and remained 23 days later, strengthening both recent and long-term memory. With more intense training, the treated group held steady while controls faded, so the difference was not just a lucky one-off."
Helping to create the illusion that some reliable research was done, the press release never mentions the number of mice used in the experiment. A look at the scientific paper being discussed by the press release gives us the answer to that question. The scientific paper is the very low-quality paper here, one entitled "Remodeling synaptic connections via engineered neuron-astrocyte interactions." In the paper we read that the number of mice being tested was ridiculously low, with study groups as small as only 3 mice or only 6 mice. No study of this type should be taken seriously unless the study groups contained at least 15 or 20 animals each. With only 6 animals per group in a study comparing performance between altered and unaltered mice, you have no decent evidence of a real effect; it is way, way too easy to get a false alarm using a study group size so small.
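The point about tiny study groups can be illustrated with a rough simulation. Nothing below comes from the paper itself: the effect size (Cohen's d = 0.8, a large effect) and the |t| > 2 significance cutoff are my own assumptions, chosen only to show how sharply detection rates drop with small groups.

```python
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / ((va / len(a) + vb / len(b)) ** 0.5)

def simulated_power(n_per_group, effect_size, trials=5000, t_crit=2.0):
    """Fraction of simulated experiments in which a TRUE effect of the
    given size reaches |t| > t_crit (a rough stand-in for p < 0.05)."""
    rng = random.Random(42)
    hits = 0
    for _ in range(trials):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        if abs(welch_t(treated, control)) > t_crit:
            hits += 1
    return hits / trials

# Even for a large assumed effect (d = 0.8), groups of 6 detect it
# only a minority of the time, while groups of 20 do far better.
print(simulated_power(6, 0.8))
print(simulated_power(20, 0.8))
```

An experiment that usually misses a real effect of that size is equally capable of producing fluke "effects" from noise, which is why such small groups are untrustworthy in both directions.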
Below is a graph from the paper, found in Figure 8 of the paper: [graph not reproduced here]. Such too-small sample sizes are the rule rather than the exception in neuroscience, as these quotes from the scientific literature attest:
- "Postmortem studies need n = 26 subjects to detect the same effect 80 % of the time, while MRI studies need n = 84 subjects; thus, most individual MRI studies and both postmortem studies were underpowered." (Link)
- "The median neuroimaging study sample size is about 25...Reproducible brain-wide association studies require thousands of individuals." (Link)
- "Critical appraisal indicated that studies were underpowered, did not match cases with controls and failed to account for confounding factors." (Link)
- "Power calculations suggested that studies were underpowered." (Link)
- "The small sample sizes of the current literature make it very likely that studies were underpowered, resulting in a host of issues such as imprecise association estimates, imprecise estimated effect sizes, low reproducibility, and reduced chances of detecting a true effect or, conversely, that 'detected' effects are indeed true." (Link)
- "Most validation studies were underpowered and hence may have given a misleading impression of accuracy." (Link)
- "We reviewed 513 behavioral, systems and cognitive neuroscience articles in five top-ranking journals (Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience) and found that 78 used the correct procedure and 79 used the incorrect procedure. An additional analysis suggests that incorrect analyses of interactions are even more common in cellular and molecular neuroscience." (Link)
Appendix: The Lack of Any Standards in "Freezing Behavior" Estimations
A paper describing variations in how "freezing behavior" is judged reveals that no standard is being followed. The paper is entitled "Systematic Review and Methodological Considerations for the Use of Single Prolonged Stress and Fear Extinction Retention in Rodents." The paper has the section below telling us that statistical techniques to judge "freezing behavior" in rodents are "all over the map," with no standard statistical method being used:
"For example, studies using cued fear extinction retention testing with 10 cue presentations reported a variety of statistical methods to evaluate freezing during extinction retention. Within the studies evaluated, approaches have included the evaluation of freezing in individual trials, blocks of 2–4 trials, and subsets of trials separated across early and late phases of extinction retention. For example, a repeated measures analysis of variance (RMANOVA) of baseline and all 10 individual trials was used in Chen et al. (2018), while a RMANOVA was applied on 10 individual trials, without including baseline freezing, in Harada et al. (2008). Patterns of trial blocking have also been used for cued extinction retention testing across 10 trials, including blocks of 2 and 4 trials (Keller et al., 2015a). Comparisons within and across an early and late phase of testing have also been used, reflecting the secondary extinction process that occurs during extinction retention as animals are repeatedly re-exposed to the conditioned cue across the extinction retention trials. For example, an RMANOVA on trials separated into an early phase (first 5 trials) and late phase (last 5 trials) was used in Chen et al. (2018) and Chaby et al. (2019). Similarly, trials were averaged within an early and late phase and measured with separate ANOVAs (George et al., 2015). Knox et al. (2012a,b) also averaged trials within an early and late phase and compared across phases using a two factors design.
Baseline freezing, prior to the first extinction retention cue presentation, has been analyzed separately and can be increased by SPS (George et al., 2015) or not affected (Knox et al., 2012b; Keller et al., 2015a). To account for potential individual differences in baseline freezing, researchers have calculated extinction indexes by subtracting baseline freezing from the average percent freezing across 10 cued extinction retention trials (Knox et al., 2012b). In humans, extinction retention indexes have been used to account for individual differences in the strength of the fear association acquired during cued fear conditioning (Milad et al., 2007, 2009; Rabinak et al., 2014; McLaughlin et al., 2015) and the strength of cued extinction learning (Rabinak et al., 2014).
In contrast with the cued fear conditioning studies evaluated, some studies using contextual fear conditioning used repeated days of extinction training to assess retention across multiple exposures. In these studies, freezing was averaged within each day and analyzed with a RMANOVA or two-way ANOVA across days (Yamamoto et al., 2008; Matsumoto et al., 2013; Kataoka et al., 2018). Representative values for a trial day are generated using variable methodologies: the percentage of time generated using sampling over time with categorically handscoring of freezing (Kohda et al., 2007), percentage of time yielded by a continuous automated software (Harada et al., 2008), or total seconds spent freezing (Imanaka et al., 2006; Iwamoto et al., 2007). Variability in data processing, trial blocking, and statistical analysis complicate meta-analysis efforts, such that it is challenging to effectively compare results of studies and generate effects size estimates despite similar methodologies."
As far as the techniques that are used to judge so-called "freezing behavior" in rodents, the techniques are "all over the map," with the widest variation between researchers. The paper tells us this:
"Another source of variability is the method for the detection of behavior during the trials (detailed in Table 1). Freezing behavior is quantified as a proxy for fear using manual scoring (36% of studies; 12/33), automated software (48% of studies; 16/33), or not specified in 5 studies (15%). Operational definitions of freezing were variable and provided in only 67% of studies (22/33), but were often explained as complete immobility except for movement necessary for respiration. Variability in freezing measurements, from the same experimental conditions, can derive from differential detection methods. For example, continuous vs. time sampling measurements, variation between scoring software, the operational definition of freezing, and the use of exclusion criteria (considerations detailed in section Recommendations for Freezing Detection and Data Analysis). Overall, 33% of studies did not state whether the freezing analysis was continuous or used a time sampling approach (11/33). Of those that did specify, 55% used continuous analysis and 45% used time sampling (12/33 and 10/33, respectively). Several software packages were used across the 33 studies evaluated: Anymaze (25%), Freezescan (14%), Dr. Rat Rodent's Behavior System (7%), Packwin 2.0 (4%), Freezeframe (4%), and Video Freeze (4%). Software packages vary in the level of validation for the detection of freezing and the number and role of automated vs. user-determined thresholds to define freezing. These features result in differential relationships between software vs. manually coded freezing behavior (Haines and Chuang, 1993; Marchand et al., 2003; Anagnostaras et al., 2010). Despite the high variability that can derive from software thresholds (Luyten et al., 2014), threshold settings are only occasionally reported (for example in fear conditioning following SPS). 
There are other software features that can also affect the concordance between freezing measure detected manually or using software, including whether background subtraction is used (Marchand et al., 2003) and the quality of the video recording (frames per second, lighting, background contrast, camera resolution, etc.; Pham et al., 2009), which were also rarely reported. These variables can be disseminated through published protocols, supplementary methods, or recorded in internal laboratory protocol documents to ensure consistency between experiments within a lab. Variability in software settings can determine whether or not group differences are detected (Luyten et al., 2014), and therefore it is difficult to assess the degree to which freezing quantification methods contribute to variability across SPS studies with the current level of detail in reporting. Meuth et al. (2013) tested the differences in freezing measurements across laboratories by providing laboratories with the same fear extinction videos to be evaluated under local conditions. They found that some discrepancies between laboratories in percent freezing detection reached 40% between observers, and discordance was high for both manual and automated freezing detection methods."
It's very clear from the quotes above that once a neuroscience researcher has decided to use "freezing behavior" to judge the amount of fear or recall in mice, he has a nice little "see whatever I want to see" situation. Since no standard protocol governs these estimations of so-called "freezing behavior," a neuroscientist can report nearly any result he wants, simply by switching around the way in which "freezing behavior" is estimated until the desired result appears. We should not make the mistake of assuming that those using automated software to judge "freezing behavior" are getting objective results: most such software has user-controlled options that a user can change to help him see whatever he wants to see.
When "freezing behavior" judgments are made, there is no standard for how long an animal should be observed when recording a "freezing percentage" (the percentage of time the animal was immobile). An experimenter can choose any length of time between 30 seconds and five minutes or more (even though it is senseless to assume rodents might "freeze in fear" for as long as a minute). Neuroscience experiments typically fail to pre-register their methods, leaving experimenters free to make analysis choices "on the fly." So you can imagine how things work. An experimenter might record how much movement occurred during the five or ten minutes after a rodent was exposed to a fear stimulus. If a desired above-average (or below-average) amount of immobility occurred over the first 30 seconds, then 30 seconds would be chosen as the interval for the "freezing percentage" graph. If not, the experimenter could try 60 seconds, then two minutes, and so on, all the way up to five or ten minutes. If the researcher still has no "more freezing" effect to report, he can always report on only the last minute of a larger time length, or the last two minutes, or the last three or four minutes.
The researchers can also arbitrarily choose the minimum length of immobility that will be counted as "freezing" toward the "freezing percentage" figure. That length can be 1 second, 2 seconds, or any number of seconds between 1 and 10.
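These free choices multiply. A quick count, using hypothetical but plausible option sets drawn from the ranges described above (the specific option lists are my illustration, not anything reported by the studies), shows how many distinct "freezing percentage" analyses a single data set supports:

```python
from itertools import product

# Hypothetical option sets, based on the ranges described above.
observation_windows_s = [30, 60, 120, 180, 300, 600]  # total scoring window
min_bout_lengths_s = list(range(1, 11))               # immobility counted as "freezing"
segments = ["whole window", "first half", "last half"]

# Every combination is a legitimate-looking way to compute one number.
variants = list(product(observation_windows_s, min_bout_lengths_s, segments))
print(len(variants))  # 6 * 10 * 3 = 180 distinct analyses
```

Each of the 180 variants yields a different "freezing percentage" from the very same video, and nothing in current practice obliges the researcher to disclose which ones were tried.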
Because there are 20 or 30 or 50 different ways in which the data can be analyzed, each with roughly a 50% chance of yielding the desired result, the researcher is almost certain to be able to report some "higher freezing level," even if the tested interventions or manipulations had no real effect on memory. Such shenanigans drastically depart from good, honest, reliable experimental methods.
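The arithmetic behind that "almost certain" claim is simple, treating the analysis variants as independent (in reality they are correlated, which softens the numbers somewhat, though the qualitative point stands):

```python
def p_at_least_one_hit(k, p):
    """Probability that at least one of k independent analysis
    variants yields the desired pattern, if each succeeds with
    probability p."""
    return 1 - (1 - p) ** k

# The scenario above: 20-50 analysis variants, each with ~50%
# chance of showing the "right" direction.
for k in (20, 30, 50):
    print(k, round(p_at_least_one_hit(k, 0.5), 6))

# Even holding each variant to a strict 5% false-positive rate,
# 50 tries make a spurious "significant" finding more likely than not.
print(round(p_at_least_one_hit(50, 0.05), 3))
```

With 20 variants at a 50% success rate apiece, the chance of at least one reportable "effect" already exceeds 99.9999%.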


