Normally we assume that when scientists do experiments, they want to measure things as accurately as possible. But that may not always be the case. There are some reasons why scientists may actually prefer to use a method that poorly measures something. They include the following:
(1) Two very bad problems exist in science today: publication quotas and publication bias. Publication bias is the tendency of science journals to prefer to publish scientific papers that announce positive results showing some effect, rather than null results that fail to show any effect. Publication quotas are prevailing traditions in academia that every professor is supposed to have his name as an author on a certain number of papers by some particular point in his career. Often described under the name of "publish or perish," publication quotas are typically informal, but very real. An assistant professor may not be formally told that he has to have a certain number of papers to his name to become a full professor, but he will tend to know that his chance of advancement in academia will be very low if he does not have enough published papers on his resume.
The combination of publication bias and publication quotas may create a strong preference for inaccurate and subjective measurement techniques. The more inaccurate and subjective a measurement technique, the greater the possibility of "see whatever you want to see," the greater the chance that the fervently desired "positive result" can be reported.
(2) Another very large problem in scientific research is ideological bias: the tendency of science journals to prefer papers that conform to the most popular ideas prevailing in research communities. Whenever an ideology is incorrect, it can be true that the more inaccurate and subjective a measurement technique, the greater the likelihood that the writer of a scientific paper can report some result that conforms to the ideology prevailing in his research community.
Let us look at a case in which scientists for decades have been senselessly using a ridiculously unreliable measurement technique: the case of "freezing behavior" estimations. "Freezing behavior" estimations occur in scientific experiments involving memory. "Freezing behavior" judgments work like this:
(1) A rodent is trained to fear some particular stimulus, such as a red-colored shock plate in his cage.
(2) At some later time (maybe days later) the same rodent is placed in a cage that has the stimulus that previously provoked fear (such as the shock plate).
(3) Someone (or perhaps some software) attempts to judge what percent of a certain length of time (such as 30 seconds or 60 seconds or maybe even four minutes) the rodent is immobile after being placed in the cage. Immobility of the rodent is interpreted as "freezing behavior" in which the rodent is "frozen in fear" because it remembered the fear-causing stimulus such as the shock plate. The percentage of time the rodent is immobile is interpreted as a measurement of how strongly the rodent remembers the fear stimulus.
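To make step (3) concrete, here is a minimal sketch of how a "freezing percentage" might be computed from per-frame movement scores. The function name, the movement units, the threshold values and the sample trace are all hypothetical and made up for illustration; they are not taken from any particular paper or software package.

```python
def freezing_percentage(movement, threshold, min_bout_frames=1):
    """Percent of frames counted as "freezing".

    movement: per-frame movement scores (hypothetical units).
    threshold: a frame counts as immobile when its score is below this value.
    min_bout_frames: immobile runs shorter than this are not counted.
    """
    if not movement:
        raise ValueError("movement must not be empty")
    frozen = 0
    run = 0
    for score in movement:
        if score < threshold:
            run += 1
        else:
            # The immobile run just ended; count it only if long enough.
            if run >= min_bout_frames:
                frozen += run
            run = 0
    # Count a run that lasts until the end of the observation period.
    if run >= min_bout_frames:
        frozen += run
    return 100.0 * frozen / len(movement)


# Made-up movement scores for one observation period (not real data).
trace = [0, 1, 0, 2, 9, 8, 1, 0, 0, 3, 7, 1, 0, 0, 0, 4, 6, 5, 2, 1]

for threshold in (1, 2, 3):
    print(threshold, freezing_percentage(trace, threshold))
# prints 35.0, 55.0 and 65.0 for thresholds 1, 2 and 3
```

In this toy trace the same rodent gets a "freezing percentage" of 35, 55 or 65 depending only on where the immobility threshold is set, and requiring a minimum bout length changes the number again.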
This is a ridiculously subjective and inaccurate way of measuring whether a rodent remembers the fear stimulus. There are numerous problems with this technique:
(1) There are two contradictory ways in which a rodent might physically respond after seeing something associated with fear: a flight response (in which the rodent attempts to escape) and a freezing response (in which the rodent freezes, not moving). It is all but impossible to disentangle which response is displayed when the rodent is presented with a fear stimulus. A rodent who remembers a fear stimulus might move around trying to escape the feared stimulus. But under the "freezing behavior" method, such movement would not be recorded as memory of the feared stimulus, even though the fear stimulus was recalled.
(2) Rodents often have hard-to-judge movement behavior that neither seems like immobility nor fleeing behavior, and it is subjective and unreliable to judge whether such movement is or is not "freezing behavior" or immobility.
(3) Movement of a rodent in a cage may be largely random, and not a good indication of whether the rodent is afraid and whether the rodent is recalling some fear stimulus.
(4) Rodents encountering a fear-provoking stimulus in human homes (such as a mouse hearing a human shriek) almost never display freezing behavior, and much more commonly display fleeing behavior. I lived in a New York City apartment for many years in which I would suddenly encounter mice, maybe about 10 times a year. I never once saw a mouse freeze, but invariably saw them flee.
(5) Freezing behavior in a rodent may last for a mere instant, as in humans. So it may be extremely fallacious to do something such as trying to observe 30 seconds or 60 seconds or several minutes of rodent movement or non-movement, and try to judge whether fear or recall occurred by judging a "freezing percentage" over such an interval. Almost all of that time may be random behavior having nothing to do with fear in the rodent or memory recall in the rodent. Contrary to all sensible methods, what we often see in neuroscience papers is some technique in which someone tries to judge "freezing behavior" by judging non-movement over a length of several minutes. An example is the science paper here, in which the authors senselessly judge fear recall by estimating non-movement in a rodent over the span of four minutes.
(6) Attempts to judge "freezing behavior" typically ignore a fact of key importance: whether the rodent avoided the stimulus the rodent was conditioned to fear. Let's imagine two cases. In case 1 a rodent put in a cage with a stimulus he was conditioned to fear (such as a shock plate) spends most of the measured interval not moving, and then goes directly to the fear stimulus, such as stepping on the shock plate. In case 2 a rodent nervously moves around in the cage, entirely avoiding the fear stimulus such as a shock plate. Clearly the rodent in case 2 acts like an animal who remembers the fear stimulus, and the animal in case 1 acts like an animal that does not remember the stimulus. But under the absurd method of judging fear recall by estimating "freezing behavior," the rodent in case 1 will be counted as better remembering the fear stimulus, because that rodent displayed more "freezing behavior." This example shows how absurd "freezing behavior" estimations are as a measure of whether a rodent recalled something or feared something. Obviously there's something very wrong if a technique can lead you to think that remembering rodents forgot, and that forgetting rodents remembered.
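The two cases in point (6) can be put into a small sketch. The numbers below (45 and 5 immobile seconds out of 60) are invented purely to illustrate the argument, and the `Observation` class is a hypothetical helper, not part of any real scoring tool.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    label: str
    immobile_seconds: int
    observed_seconds: int
    touched_plate: bool

    @property
    def freezing_pct(self) -> float:
        return 100.0 * self.immobile_seconds / self.observed_seconds


# Invented values for the two imagined cases (not real data).
case_1 = Observation("case 1", immobile_seconds=45, observed_seconds=60, touched_plate=True)
case_2 = Observation("case 2", immobile_seconds=5, observed_seconds=60, touched_plate=False)

# Ranking by "freezing percentage" picks the rodent that walked onto the plate.
by_freezing = max((case_1, case_2), key=lambda o: o.freezing_pct)

# Scoring by avoidance picks the rodent that stayed off the plate.
by_avoidance = [o for o in (case_1, case_2) if not o.touched_plate]

print(by_freezing.label)                 # case 1
print([o.label for o in by_avoidance])   # ['case 2']
```

The two scoring rules give opposite answers about which animal "remembered," which is the contradiction described above.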
How can memory and fear recall be reliably measured in rodents? There are at least three techniques. One costs a little bit of money, and the other two can be done without spending much of anything.
Reliable Way of Measuring Rodent Fear Recall #1: Measuring Heart Rate Spikes
It has been shown that when animals such as mice are exposed to fear-inducing stimuli, their heart rate dramatically spikes. According to a scientific paper, a simple air puff will cause a mouse's heart rate to increase from nearly 500 beats per minute to near 700 beats per minute. We read this: "The mean HR [heart rate] responses from the seven mice showed that HR increased significantly from the basal level of 494±27 bpm to 690±24 bpm to the first air puff (P<0.001)." The same paper tells us that similar increases in heart rate occur when mice are dropped or subjected to a simulated earthquake by means of a little cage shaking. So rather than using the very unreliable method of trying to judge "freezing behavior" to determine how well a mouse remembered a fearful stimulus, scientists could use the reliable method of looking for sudden heart rate spikes.
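As a rough sketch of how such a heart rate spike could be scored, the function below compares the peak heart rate after a stimulus with the average baseline rate. The 494 and 690 bpm figures are the means quoted above; the function name and the 20% cutoff are assumptions chosen only for illustration, not values from the paper.

```python
from statistics import mean


def heart_rate_spike(baseline_bpm, response_bpm, min_increase_pct=20.0):
    """Return (percent increase over baseline, whether it counts as a spike).

    min_increase_pct is a hypothetical cutoff, not a published standard.
    """
    base = mean(baseline_bpm)
    peak = max(response_bpm)
    increase_pct = 100.0 * (peak - base) / base
    return increase_pct, increase_pct >= min_increase_pct


# Mean values quoted in the text: 494 bpm at baseline, 690 bpm after the air puff.
pct, spiked = heart_rate_spike([494], [690])
print(round(pct, 1), spiked)  # 39.7 True
```

A rise of 196 beats per minute over a 494 bpm baseline is an increase of roughly 40%, which is far easier to score objectively than a judgment about immobility.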
Reliable Way of Measuring Rodent Fear Recall #2: Tracking Fearful Stimulus Avoidance
The method described above has the slight drawback of requiring the purchase of rodent heart rate monitors. But there's another method that does not have any such drawback: the method of simply recording whether a fearful stimulus was avoided. The method is shown in the diagram below.
Using this technique, a mouse is trained to avoid a fear stimulus -- the red shock plate shown in the center of the diagram. At some later date the mouse (in a hungry state) is put into the cage. If the mouse does not remember that the shock plate will cause pain, the mouse will take the direct route to the cheese, which requires crossing over the shock plate. If the mouse does remember that the shock plate will cause pain, the mouse will take an indirect and harder route, requiring it to jump up and down a set of stairs. This is an easy and foolproof method of testing memory recall in rodents. Here we have a nice binary result -- either the mouse touches the shock plate, or it doesn't. There's no subjective element at all.
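Scoring this test requires nothing more than a yes-or-no record per mouse. A minimal sketch follows; the `Trial` record, the mouse identifiers and the outcomes are hypothetical and invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    mouse_id: str
    touched_plate: bool  # True if the mouse crossed the shock plate


def avoidance_rate(trials):
    """Fraction of trials in which the shock plate was avoided."""
    if not trials:
        raise ValueError("trials must not be empty")
    avoided = sum(1 for t in trials if not t.touched_plate)
    return avoided / len(trials)


# Invented outcomes for four mice (not real data).
trials = [
    Trial("m1", touched_plate=False),
    Trial("m2", touched_plate=False),
    Trial("m3", touched_plate=True),
    Trial("m4", touched_plate=False),
]

print(avoidance_rate(trials))  # 0.75
```

There is no threshold, no scoring window and no observer judgment to adjust: two people scoring the same trials should get the same number.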
Reliable Way of Measuring Rodent Recall #3: The Morris Water Maze Test
The widely used Morris Water Maze test is a fairly reliable way of measuring recall in rodents. The water maze consists of a circular open tank rather like a child's bathing tub, deeper than a rodent's length, with a hidden platform on one side of the tank, about an inch or two below the water surface. A rodent is placed in the tub, and has to tread water to stay alive. Eventually the rodent will discover that by swimming to the hidden platform the rodent can comfortably rest, without having to tread water. You train the rodent by exposing him to the water maze a certain number of times, until you find that the rodent immediately goes to the hidden platform. Then later the rodent's memory can be tested by putting the rodent in the same Morris Water Maze tank, and seeing whether it quickly swims to the platform. The main drawback of the Morris Water Maze is that if something was done to a mouse to inhibit muscular skills but not memory, a mouse may fail the Morris Water Maze test even though there was no change in memory.
Why Do Neuroscientists Continue to Use Unreliable "Freezing Behavior" Estimations for Judging Rodent Recall?
The methods discussed above are obviously superior to the error-prone and subjective "freezing behavior" estimation method. So why do experimental neuroscientists continue to cling to such a "freezing behavior" estimation method, using it so often? It is entirely reasonable to suspect that many neuroscientists cling to their "freezing behavior" method for the very reason that it is unreliable and subjective, allowing neuroscientists to see whatever they want to see. By clinging to unreliable "freezing behavior" estimation, neuroscientists have a better chance of being able to report some result they can call a positive result.
A paper describing variations in how "freezing behavior" is judged reveals that no standard is being followed. The paper is entitled "Systematic Review and Methodological Considerations for the Use of Single Prolonged Stress and Fear Extinction Retention in Rodents." The paper has the section below telling us that statistical techniques to judge "freezing behavior" in rodents are "all over the map," with no standard statistical method being used:
"For example, studies using cued fear extinction retention testing with 10 cue presentations reported a variety of statistical methods to evaluate freezing during extinction retention. Within the studies evaluated, approaches have included the evaluation of freezing in individual trials, blocks of 2–4 trials, and subsets of trials separated across early and late phases of extinction retention. For example, a repeated measures analysis of variance (RMANOVA) of baseline and all 10 individual trials was used in Chen et al. (2018), while a RMANOVA was applied on 10 individual trials, without including baseline freezing, in Harada et al. (2008). Patterns of trial blocking have also been used for cued extinction retention testing across 10 trials, including blocks of 2 and 4 trials (Keller et al., 2015a). Comparisons within and across an early and late phase of testing have also been used, reflecting the secondary extinction process that occurs during extinction retention as animals are repeatedly re-exposed to the conditioned cue across the extinction retention trials. For example, an RMANOVA on trials separated into an early phase (first 5 trials) and late phase (last 5 trials) was used in Chen et al. (2018) and Chaby et al. (2019). Similarly, trials were averaged within an early and late phase and measured with separate ANOVAs (George et al., 2015). Knox et al. (2012a,b) also averaged trials within an early and late phase and compared across phases using a two factors design.
Baseline freezing, prior to the first extinction retention cue presentation, has been analyzed separately and can be increased by SPS (George et al., 2015) or not affected (Knox et al., 2012b; Keller et al., 2015a). To account for potential individual differences in baseline freezing, researchers have calculated extinction indexes by subtracting baseline freezing from the average percent freezing across 10 cued extinction retention trials (Knox et al., 2012b). In humans, extinction retention indexes have been used to account for individual differences in the strength of the fear association acquired during cued fear conditioning (Milad et al., 2007, 2009; Rabinak et al., 2014; McLaughlin et al., 2015) and the strength of cued extinction learning (Rabinak et al., 2014).
In contrast with the cued fear conditioning studies evaluated, some studies using contextual fear conditioning used repeated days of extinction training to assess retention across multiple exposures. In these studies, freezing was averaged within each day and analyzed with a RMANOVA or two-way ANOVA across days (Yamamoto et al., 2008; Matsumoto et al., 2013; Kataoka et al., 2018). Representative values for a trial day are generated using variable methodologies: the percentage of time generated using sampling over time with categorically handscoring of freezing (Kohda et al., 2007), percentage of time yielded by a continuous automated software (Harada et al., 2008), or total seconds spent freezing (Imanaka et al., 2006; Iwamoto et al., 2007). Variability in data processing, trial blocking, and statistical analysis complicate meta-analysis efforts, such that it is challenging to effectively compare results of studies and generate effects size estimates despite similar methodologies."
As for the techniques used to judge so-called "freezing behavior" in rodents, they too are "all over the map," with the widest variation between researchers. The paper tells us this:
"Another source of variability is the method for the detection of behavior during the trials (detailed in Table 1). Freezing behavior is quantified as a proxy for fear using manual scoring (36% of studies; 12/33), automated software (48% of studies; 16/33), or not specified in 5 studies (15%). Operational definitions of freezing were variable and provided in only 67% of studies (22/33), but were often explained as complete immobility except for movement necessary for respiration. Variability in freezing measurements, from the same experimental conditions, can derive from differential detection methods. For example, continuous vs. time sampling measurements, variation between scoring software, the operational definition of freezing, and the use of exclusion criteria (considerations detailed in section Recommendations for Freezing Detection and Data Analysis). Overall, 33% of studies did not state whether the freezing analysis was continuous or used a time sampling approach (11/33). Of those that did specify, 55% used continuous analysis and 45% used time sampling (12/33 and 10/33, respectively). Several software packages were used across the 33 studies evaluated: Anymaze (25%), Freezescan (14%), Dr. Rat Rodent's Behavior System (7%), Packwin 2.0 (4%), Freezeframe (4%), and Video Freeze (4%). Software packages vary in the level of validation for the detection of freezing and the number and role of automated vs. user-determined thresholds to define freezing. These features result in differential relationships between software vs. manually coded freezing behavior (Haines and Chuang, 1993; Marchand et al., 2003; Anagnostaras et al., 2010). Despite the high variability that can derive from software thresholds (Luyten et al., 2014), threshold settings are only occasionally reported (for example in fear conditioning following SPS). 
There are other software features that can also affect the concordance between freezing measure detected manually or using software, including whether background subtraction is used (Marchand et al., 2003) and the quality of the video recording (frames per second, lighting, background contrast, camera resolution, etc.; Pham et al., 2009), which were also rarely reported. These variables can be disseminated through published protocols, supplementary methods, or recorded in internal laboratory protocol documents to ensure consistency between experiments within a lab. Variability in software settings can determine whether or not group differences are detected (Luyten et al., 2014), and therefore it is difficult to assess the degree to which freezing quantification methods contribute to variability across SPS studies with the current level of detail in reporting. Meuth et al. (2013) tested the differences in freezing measurements across laboratories by providing laboratories with the same fear extinction videos to be evaluated under local conditions. They found that some discrepancies between laboratories in percent freezing detection reached 40% between observers, and discordance was high for both manual and automated freezing detection methods."
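The difference between continuous analysis and time sampling mentioned in the quote can be illustrated with a toy example. The per-frame immobility flags below are made up, and the two function names are hypothetical; the point is only that the same recording can yield very different percentages depending on how it is sampled.

```python
def continuous_percent(immobile):
    """Percent of all frames flagged as immobile."""
    return 100.0 * sum(immobile) / len(immobile)


def time_sampled_percent(immobile, every, offset=0):
    """Percent immobile when only every Nth frame is inspected."""
    samples = immobile[offset::every]
    return 100.0 * sum(samples) / len(samples)


# Made-up flags for twelve frames: True means the rodent was immobile.
flags = [True, True, False, False, True, False,
         False, False, True, True, False, False]

print(round(continuous_percent(flags), 1))          # 41.7
print(time_sampled_percent(flags, every=4))         # 100.0
print(time_sampled_percent(flags, every=4, offset=2))  # 0.0
```

In this deliberately extreme toy case, the identical recording scores about 42%, 100% or 0% "freezing" depending on the sampling choice alone.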
It's very clear from the quotes above that once a neuroscience researcher has decided to use "freezing behavior" to judge fear, then he pretty much has a nice little "see whatever I want to see" situation. Since no standard protocol is being used in these estimations of so-called "freezing behavior," a neuroscientist can pretty much report exactly whatever he wants to see in regard to "freezing behavior," by just switching around the way in which "freezing behavior" is estimated, until the desired result appears. We should not make here the mistake of assuming that those using automated software for judging "freezing behavior" are getting objective results. Most software has user-controlled options that a user can change to help him see whatever he wants to see.
To help get reliable and reproducible results, neuroscientists doing experiments involving recall or fear recall in animals should use only a simple and reliable method for measuring fear or recall in rodents: either the measurement of heart rate spikes, or the Fear Stimulus Avoidance technique described above, or the Morris Water Maze test. But alas, experimental neuroscientists seem to prefer to use an unreliable "see whatever you want to see" method, quite possibly because that vastly increases the opportunity for them to report "statistically significant" results or positive results rather than null results.
What we must always remember is that the modern experimental neuroscientist is not primarily interested in producing accurate results, but is instead primarily interested in producing publishable results, defined as any result that will end up getting published in a scientific journal. The modern experimental neuroscientist is also extremely interested in producing "citation magnet" results, defined as any results that will end up getting more paper citations. Alas, today's neuroscientists are not judged by whether they use intelligent and accurate experimental methods. Today's neuroscientists are rather mindlessly judged by their peers on the basis of how many papers they can claim to have co-authored, and how many citations such papers have got. And so we see neuroscience papers like the one below, in which more than 100 scientists appear as the authors of a single paper, as if the main idea was just to up the paper count of as many people as possible.
A simple rule should be followed about this matter: any and all papers writing up experimental research and depending upon claims of freezing behavior by rodents should be regarded as junk science unworthy of serious attention. Trying to measure "freezing behavior" is not a reliable way of measuring memory recall or fear in rodents. Very many of the most widely reported neuroscience studies rely on this junk method, and all such studies are junk studies. A high use of "freezing behavior" estimation is only one of the glaring defects of neuroscience experimental research, where Questionable Research Practices are extremely common. Other glaring procedural defects very common in neuroscience experimental research include the all-too-common use of way-too-small study group sizes, a failure to pre-register a hypothesis and methods to be used for gathering and analyzing data, p-hacking, a failure to follow blinding protocols, and a failure to do sample size calculations to determine how large study group sizes should be.
You should not assume that peer review prevents bad neuroscience research from getting published. The people who peer-review neuroscience research routinely fail to exclude poorly designed experimental research. The peer reviewers of such research are typically neuroscientists who perform the same kind of poorly designed research themselves. Peer reviewers senselessly follow a rule of "allow papers to be published if they resemble recent previously published papers." When some group of scientists is following bad customs (such as we see massively in theoretical physics, theoretical phylogenetics, theoretical cosmology, and experimental neuroscience), such a rule completely fails to block junk research from being published.
Postscript: The paper "To freeze or not to freeze" gives us additional reasons for disbelieving that "freezing behavior" judgments are reliable ways of measuring fear or recall in rodents. We read that "Male and female rats respond to a fearful experience in different ways, but this was not previously taken into account in research." Below are some quotes:
"Gruene, Shansky and their colleagues – Katelyn Flick and Alexis Stefano of Northeastern, and Stephen Shea of Cold Spring Harbor Laboratories – found that instead of freezing, many female rats display a brief, high-velocity movement termed darting...Gruene et al. found that female rats performed more darts per minute than males. However, not all females dart, and not all males freeze: in the experiments approximately 40% of the females engaged in darting behavior, but only about 10% of males did so....The finding that a higher proportion of female rats dart may explain why previous studies have reported less freezing in females (e.g., Maren et al., 1994; Pryce et al., 1999)."
The paper "The Difference between Male and Female Rats in Terms of Freezing and Aversive Ultrasonic Vocalization in an Active Avoidance Test" tells us this: "We found that males were more likely to experience freezing (40%) than females (3.7%)." Evidently male rats perform much differently than female rats in regard to freezing, but our neuroscientists very often fail to even specify which sex was used in some experiment they did.
When "freezing behavior" judgments are made, there are no standards in regard to how long a length of time an animal should be observed when recording a "freezing percentage" (a percentage of time the animal was immobile). An experimenter can choose any length of time between 30 seconds and five minutes or more (even though it is senseless to assume rodents might "freeze in fear" for as long as a minute). Neuroscience experiments typically fail to pre-register experimental methods, leaving experimenters to make analysis choices "on the fly." So you can imagine how things work. An experimenter might judge how much movement occurred during five minutes or ten minutes after a rodent was exposed to a fear stimulus. If a desired above-average amount of immobility (or a desired below-average amount of immobility) occurred over 30 seconds, then 30 seconds would be chosen as the interval to be used for a "freezing percentage" graph. Otherwise, if a desired above-average amount of immobility (or a desired below-average amount of immobility) occurred over 60 seconds, then 60 seconds would be chosen as the interval to be used for a "freezing percentage" graph. Otherwise, if a desired above-average amount of immobility (or a desired below-average amount of immobility) occurred over two minutes, then two minutes would be chosen as the interval to be used for a "freezing percentage" graph. And so on and so forth, up until five minutes or ten minutes. Such shenanigans drastically depart from good, honest, reliable experimental methods, and any researcher engaging in such shenanigans should be ashamed of himself.
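The interval-shopping procedure described above can be sketched as a small simulation. Everything here is an assumption made for illustration: the candidate windows, the group size of 8, the 30% immobility probability, and the helper names. Both simulated groups are generated from the same random process, so any gap between them is pure noise.

```python
import random

WINDOWS = (30, 60, 120, 300)  # candidate observation windows, in seconds


def simulate_trace(rng, seconds=300, p_immobile=0.3):
    """One simulated rodent: a True/False immobility flag for each second."""
    return [rng.random() < p_immobile for _ in range(seconds)]


def freezing_pct(trace, window):
    """Percent of the first `window` seconds spent immobile."""
    return 100.0 * sum(trace[:window]) / window


def group_gap(group_a, group_b, window):
    """Absolute difference in mean freezing percentage between two groups."""
    mean_a = sum(freezing_pct(t, window) for t in group_a) / len(group_a)
    mean_b = sum(freezing_pct(t, window) for t in group_b) / len(group_b)
    return abs(mean_a - mean_b)


rng = random.Random(0)
group_a = [simulate_trace(rng) for _ in range(8)]
group_b = [simulate_trace(rng) for _ in range(8)]

gaps = {w: group_gap(group_a, group_b, w) for w in WINDOWS}
preregistered = gaps[300]           # window fixed before looking at the data
cherry_picked = max(gaps.values())  # window chosen after looking at the data

for window, gap in gaps.items():
    print(f"{window:>3} s window: gap = {gap:.1f} percentage points")
print(f"pre-registered: {preregistered:.1f}, cherry-picked: {cherry_picked:.1f}")
```

Because the cherry-picked gap is the maximum over all candidate windows, it can never be smaller than the gap from a window chosen in advance, even though no real difference between the groups exists.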
It should be crystal-clear by now: no one is reliably measuring fear or recall or memory in a paper relying on "freezing behavior" judgments, and in such a paper we should trust no claims made about fear or recall or memory in rodent subjects.