Head Truth: Y-Maze Memory Tests Are Almost as Unreliable as Freezing Behavior Tests

Sunday, March 23, 2025


It cannot be said that the reliability of an experimental neuroscience paper is directly proportional to the reliability of the measurement techniques it uses. There are various reasons why you might have an utterly unreliable neuroscience experiment that used reliable measurement techniques; for example, the experiment may have used too small a study group size to produce a reliable result. But it can be said (roughly speaking) that the unreliability of an experimental neuroscience paper is directly proportional to the unreliability of any measurement techniques upon which the experiment depends. That is why when examining neuroscience experiments, we should always pay extremely close attention to whether the experiment used reliable measurement techniques.

For decades very many neuroscience researchers have been senselessly using a ridiculously unreliable measurement technique: "freezing behavior" estimation. "Freezing behavior" estimations occur in scientific experiments involving memory. "Freezing behavior" judgments work like this:

(1) A rodent is trained to fear some particular stimulus, such as a red-colored shock plate in his cage.

(2) At some later time (maybe days later) the same rodent is placed in a cage that has the stimulus that previously provoked fear (such as the shock plate).

(3) Someone (or perhaps some software) attempts to judge what percent of a certain length of time (such as 30 seconds or 60 seconds or maybe even four minutes) the rodent is immobile after being placed in the cage. Immobility of the rodent is interpreted as "freezing behavior" in which the rodent is "frozen in fear" because it remembered the fear-causing stimulus such as the shock plate. The percentage of time the rodent is immobile is interpreted as a measurement of how strongly the rodent remembers the fear stimulus.

This is a ridiculously subjective and inaccurate way of measuring whether a rodent remembers the fear stimulus. There are numerous problems with this technique, which I explain in my post "All Papers Relying on Rodent 'Freezing Behavior' Estimations Are Junk Science." The technique is so unreliable that all experimental neuroscience studies relying on such a technique should be dismissed as worthless.

There are other techniques used in neuroscience experiments. There are various types of maze techniques used. A mouse may be trained to find some food that requires traversing a particular maze. It is easy to time exactly how long the mouse takes to find the food, after a series of training trials. Then some modification might be made to the mouse (such as giving it an injection or removing part of its brain). The mouse can be put again in the maze, and a measurement can be made of how long it takes to find the food. If it took much longer to find the food, this might be evidence of a reduction in memory or learned knowledge.

This seems like a pretty reliable technique. But there's another much less reliable technique called the "free exploratory paradigm." When this technique is used, a mouse is given some compartments to explore. The mouse is first only allowed to explore half or two-thirds of the compartments. Then later the mouse is given the freedom to explore all of the compartments, including previously unexplored compartments. Some attempt is made to measure what percent of the time the mouse spends in the never-previously-explored compartments compared to the previously explored compartments.

A figure in the paper "The free-exploratory paradigm as a model of trait anxiety in rats: Test–retest reliability" shows how this method might be used. First the mouse is allowed to explore only the three compartments on the right, with access to the left compartments blocked. Then the mouse is allowed to access all of the compartments, and some attempt is made to judge whether the mouse spent more time in the left compartments than the right.

The assumption is made that this can be some kind of test of memory. The experiment designers seem to have assumed that when a mouse goes to compartments already visited, the mouse will kind of recognize those compartments, and be less likely to explore them, perhaps having some kind of "I need not explore something I've already explored" experience. This is a very dubious assumption.

It's as if the designers of this apparatus were assuming that a mouse is thinking something like this:

"My, my, these experimenter guys have given me six compartments to explore! Well, there's no point in exploring any of the three compartments I already explored. Been there, done that. So I guess I'll spend more time exploring the compartments I have not been to. I'm sure there will just be exactly the same stuff in the three compartments I've already explored, and that I need not spend any time re-exploring them to check whether there's something new in them."

The assumptions behind this experimental design seem very dubious. It is not at all clear that a mouse would have any such tendency to recognize previous compartments the mouse had been in, and to think that such previously visited compartments were less worthy of exploration.

The best way to test whether such assumptions are correct is by experimentation. Without doing anything to modify a mouse's memory, you can simply test normal mice, and see whether they are less likely to spend time in compartments they previously visited. Figure 2 of the paper "The free-exploratory paradigm as a model of trait anxiety in rats: Test–retest reliability" gives us a good graph testing how reliable this "free-exploratory paradigm" is, using a 10-minute observation period. The test involved 30 mice:

The figure suggests that this "free-exploratory paradigm" is not a very reliable technique for judging whether mice remembered something. In the first test, there was no tendency of the mice to spend more time exploring the unexplored compartments. In the second test there was only a slightly greater tendency of the mice to explore the previously unexplored compartments. Overall the mice spent only 55 percent of their time in the previously unexplored compartments, versus 45 percent of their time in the previously explored compartments.

What is the relevance of this? It means that any neuroscience experiment that is based on this "free-exploratory paradigm" and fails to use a very large study group size is worthless. An example of a worthless study based on such a technique is the study hailed by a press release this year, one with a headline of "Boosting brain’s waste removal system improves memory in old mice." No good evidence for any such thing was produced.

The press release is promoting a study called "Meningeal lymphatics-microglia axis regulates synaptic physiology" which you can read here. That study all hinges upon an attempt to measure recall or recognition by mice, using something called a Y-maze, which consists of 3 compartments, the overall structure being shaped like the letter Y. The Y-maze (not actually a maze) is an implementation of the unreliable "free-exploratory paradigm" measurement technique described above. The study used a study group size of only 17 mice. But since the "free-exploratory paradigm" requires study group sizes much larger than 17 to provide any compelling evidence for anything, the study utterly fails as reliable evidence.

Using a binomial probability calculator, we can compute the chance of getting a false alarm, using a measurement technique like the "free-exploratory paradigm." Figure 1C of the paper "Meningeal lymphatics-microglia axis regulates synaptic physiology" shows only a very slight difference between the "free-exploratory paradigm" performance for the modified mice and the unmodified mice:

Given this "free-exploratory paradigm" that is something like only 55% effective in measuring recognition memory, the probability of getting results like this by chance (even if the experimental intervention has no real effect) is roughly the same as what we see in the calculation below:

Produced using the calculator here

The chance of getting purely by chance a result like the result reported in the paper is roughly the 1 in 3 shown in the bottom line above. When we consider publication bias and the "file drawer" effect, getting a result like the reported result means nothing. Why? Because it would be merely necessary to try the experiment a few times before you could report a success, even if the experimental intervention had no effectiveness whatsoever.
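For readers who want to check this kind of figure without an online calculator, here is a minimal Python sketch of a binomial tail probability. The inputs are illustrative assumptions, not values taken from the paper or from the calculator screenshot: 17 animals (the study group size mentioned above), a hypothetical count of 10 animals favoring the novel arm, and a 50% chance of favoring it when there is no real effect. The function name `binomial_tail` is also just a label chosen for this sketch.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Illustrative inputs (assumptions, not figures reported in the paper):
n_animals = 17    # study group size mentioned in the post
k_successes = 10  # hypothetical number of animals favoring the novel arm
p_chance = 0.5    # probability of favoring the novel arm with no real effect

tail = binomial_tail(n_animals, k_successes, p_chance)
print(f"P(X >= {k_successes} of {n_animals}) = {tail:.4f}")  # prints 0.3145
```

Under these assumed inputs the tail probability is about 0.31, which is in the same "roughly 1 in 3" range discussed above. Different assumed counts would give different numbers, so this should be read as a way of checking the arithmetic rather than as a reconstruction of the original calculation.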

We should never be persuaded by results like this, because what could easily be happening is something like this:

Team 1 at some college tries this intervention, seeing no effect. Realizing null results are hard to get published, Team 1 files its results in its file drawer.
Team 2 at some other college tries this intervention, seeing no effect. Realizing null results are hard to get published, Team 2 files its results in its file drawer.
Team 3 tries this intervention, seeing a "statistically significant" effect of a type you would get in maybe 1 time in three tries. Team 3 submits its positive result for publication, and gets a paper published.

In a scenario like the one above, there is no real evidence for the effect. All that is happening is a result like what we would expect to get by chance, even if the effect does not exist.
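The arithmetic behind this file-drawer scenario can be sketched in a few lines of Python. The sketch assumes that each team's attempt is independent and that each has the same 1-in-3 chance of producing a reportable result when there is no real effect; both are simplifying assumptions, and the function name is hypothetical.

```python
def prob_at_least_one_hit(p_single: float, attempts: int) -> float:
    """Chance that at least one of several independent attempts 'succeeds' by luck."""
    return 1 - (1 - p_single) ** attempts


# Assumed per-attempt chance of a reportable result with no real effect.
p_single = 1 / 3

for attempts in (1, 2, 3, 5):
    chance = prob_at_least_one_hit(p_single, attempts)
    print(f"{attempts} attempt(s): {chance:.3f}")
```

With three teams, as in the scenario above, the chance that at least one of them gets a reportable result by luck alone is about 0.70 under these assumptions.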

What we must also consider is that any researcher wanting to tilt the scales a bit can do so when using this free-exploratory paradigm. When these types of experiments are done, the compartments are not empty. Instead some items are put in the compartments. There is no standard protocol about what is put in the compartments. A researcher can put in the compartments anything he wants. Each compartment is supposed to have a few items, but there is no standard number or size of items to use. So imagine you are trying to show what looks like a loss of recognition memory in some experiment using this free-exploratory paradigm. All you need to do is put some less interesting items or fewer items in the unexplored compartments. And if you want to show what looks like an improvement in memory, you need merely put some more interesting items or more items in the unexplored compartments. Since there is no standard protocol for this free-exploratory paradigm, an experimenter can get whatever result he wants by varying conditions in the compartments.

At the top I give a graph from the paper "The free-exploratory paradigm as a model of trait anxiety in rats: Test–retest reliability," which showed a mere 55% reliability using this free-exploratory paradigm in ten-minute tests, but a greater reliability with 15-minute tests. How long a time length does the paper "Meningeal lymphatics-microglia axis regulates synaptic physiology" use? Only 2 or 3 minutes. I doubt very much that there is any evidence that such tests have much more than 50% reliability with such a short time span. This is a common defect of both the free-exploratory paradigm and the "freezing behavior" approach: they can produce wildly different results depending on the time interval used. And since there is no standard for a time interval used, an experimenter can use any time interval, including some interval that has not been verified as having any decent reliability. This is all the more reason to think that such methods are "see whatever you are hoping to see" affairs that have no validity as solid measurement techniques for measuring recall or recognition in rodents. I can imagine how things might work: an animal may be tested for 10 minutes using either technique; and if the experimenter doesn't like the result in the full ten minutes, he can simply report in his paper on the first 5 minutes; and if he does not like that result, he can report in his paper on only the first three minutes; and so on and so forth. If the paper is not a pre-registered paper committing itself to an exact detailed observational protocol, an experimenter can get away with that; and few neuroscience experiments these days follow such a pre-registered approach.
Today's experimental neuroscience is such a standard-weak freewheeling farce of loose and bad methods that it is probably considered permissible to gather a particular type of data for ten minutes, and then report on only the results gathered in any arbitrary fraction of those minutes, as long as you start from the beginning.

In the paper here, it says, "In the Y-maze continuous procedure, the rat or mouse is placed in the maze for a defined period (typically 5 min) and the sequence of arm choices is recorded." But in the paper "Meningeal lymphatics-microglia axis regulates synaptic physiology" discussed above, the Y-maze test time was only 3 minutes; so we have a deviation from the typical procedure with this device. The same paper tells us "Hippocampectomized animals notoriously adopt side preferences, e.g., always turning right on a T-maze," something we can suspect may also be true in a Y-maze, giving another reason for doubting the suitability of such tests (both examples of the free exploratory paradigm) for testing memory modifications such as hippocampus lesions.
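To make concrete how a recorded sequence of arm choices might be turned into a score, here is a minimal Python sketch. It assumes one commonly used scoring convention, which is not spelled out in the papers quoted above: an "alternation" is any run of three consecutive arm entries that visits three different arms, and the percentage is the number of alternations divided by the number of such overlapping triplets. The arm labels and the example sequence are hypothetical.

```python
def spontaneous_alternation_percentage(arm_entries):
    """Percent of overlapping entry triplets that visit three different arms.

    Assumed convention: alternations / (total entries - 2) * 100.
    """
    triplets = len(arm_entries) - 2
    if triplets < 1:
        raise ValueError("need at least three arm entries")
    alternations = sum(
        1 for i in range(triplets) if len(set(arm_entries[i:i + 3])) == 3
    )
    return 100.0 * alternations / triplets


# Hypothetical sequence of arm entries in a three-arm maze (arms A, B, C).
entries = list("ABCACBABCB")
score = spontaneous_alternation_percentage(entries)
print(f"{score:.1f}%")  # prints 62.5%
```

Note that under this convention a short test period yields few arm entries and therefore few triplets, so the percentage for any single animal can swing widely on one or two arm choices.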

The sad truth is that experiments done with this free-exploratory paradigm (such as a Y-maze experiment or a T-maze experiment) are worthless unless they use large study group sizes of at least 30 subjects per study group, and also an exact protocol that has been proven to be a reliable method of measuring recall or recognition in rodents. So we can have no confidence in the results reported by the study referred to above, the one called "Meningeal lymphatics-microglia axis regulates synaptic physiology" which you can read here. That study all hinges upon an attempt to measure recall or recognition using the free exploratory paradigm, but does not use a large enough study group size to produce a reliable result using that paradigm. And we have no evidence of exactly following a precise protocol proven to be a reliable measure of rodent recall.

Neither the free-exploratory paradigm (such as Y-maze experiments) nor "freezing behavior" experiments produce reliable results when anyone uses study group sizes smaller than 30. Both are poor, unreliable ways of measuring recall or recognition in rodents, allowing so much flexibility and opportunity for bias that it's just a "see whatever you want to see" type of affair. But what kind of methods tend to produce good, reliable results in measuring recall in rodents? I can think of four:

(1) A "find the food reward" maze technique like the one described above, in which you measure how many seconds a rodent takes to find a food reward, using a maze the rodent had been previously trained on to find a food reward.

(2) The Morris water maze test, a widely used test that is not really a maze test, but a test of how well a rodent will remember to find a submerged platform after previously being trained to find that platform in a water tank. However, a scientific paper cautions that the Morris water maze test may not work well with many strains of mice, saying this: "Neuroscientists have been warned that many strains [of mice] perform poorly on the submerged-platform water escape test task, which is better suited to rats than to mice, yet it is used widely for the study of memory in mice." Another paper gives a similar reason for thinking that the Morris water maze test (MWM) may only be suitable for rats, stating this: "Interestingly, when MWM data were analyzed in a large dataset of 1500 mice by factor analysis, the principle factors affecting MWM performance in mice were noncognitive (Lipp and Wolfer 1998).... It is important to note that this is not the case in rats, but the fact that performance factors are salient in mice provides an important cautionary note when interpreting mouse MWM data."

(3) A fear recall technique, measuring spikes in heart rate. The heart rate of a mouse will very dramatically spike when the mouse is afraid. So a mouse can be trained to fear some painful stimulus such as a shock plate. Then the mouse can be placed in a cage that has the fear-inducing stimulus. If the mouse's heart rate speeds up very much, that is good evidence that the mouse has remembered the fear-inducing stimulus such as the shock plate.

(4) The Fear Stimulus Avoidance technique depicted below, which does not require heart-rate measurement. After being trained to fear some fearful stimulus such as a shock plate, the mouse can be placed in a cage that offers two paths to a food reward: one path that requires going through the fearful stimulus such as a shock plate, and another path to the food reward that is physically much harder to traverse, such as a path requiring climbing steep stairs. If a rodent takes the much harder path to get to the food reward, that is good evidence that it remembered the pain caused by the fearful stimulus such as the shock plate.

The mere use of a more reliable measurement technique does not guarantee a reliable result. While the Morris water maze test seems to be a reliable test when used with rats, it must be used with a big enough study group size, and very many neuroscience experimenters fail to do that. A paper notes the problem, stating this about the Morris Water Maze test (MWM):

"Many MWM experiments are reported with small group sizes. In our experience with the MWM and other water mazes, group sizes less than 10 can be unreliable and we use 15 to 20 animals per group, especially for mice, whose performance in learning and memory tests tends to be more variable than for rats. It is noteworthy that regulatory authorities require that safety studies have 20 or 25 animals per group. This number is for each of at least four groups (control and three dose levels) (Food and Drug Administration 2007; Gad 2009; Tyl and Marr 2012). Such group sizes are used by the US Environmental Protection Agency, the US Food and Drug Administration, the Organization for Economic Cooperation and Development, and Japanese and European Union regulatory agencies. Although the 3 Rs (reduce, refine, and replace) are worthwhile goals in the use of animals in research, it is not a justification to underpower experiments and run the risk of false positives, which, in the long run, cost more time, more animals, and more money to prove or disprove."

Postscript: The term "spontaneous alternation behavior" is used to describe a case in which a rodent that has explored one arm of a T-maze or Y-maze is exposed to the maze again, and switches to a different arm. The higher the average "spontaneous alternation percentage" is (the higher above 50%) in control rodents, the more reliable such a T-maze or Y-maze is as a test of memory; and a well-established average "spontaneous alternation percentage" of maybe 75% would indicate a pretty good test. The graph here shows female controls showing such behavior only 55% of the time, and male controls showing such behavior 60% of the time; but the sample size is only 5. Figure 4 of the paper here shows control rats using such "spontaneous alternation behavior" only 50% of the time in a Y-maze. The sample size is only 6. The graph here shows only about 60% "spontaneous alternation behavior" for 9 control rodents tested with a Y-maze. Figure 3 of the paper here shows male control rodents showing such "spontaneous alternation behavior" only about 50% of the time. Figure 1 of the paper here shows a "spontaneous alternation percentage" of only about 57% for 6 control mice. In Figure 1 of the paper here, the "spontaneous alternation percentage" is only 35% in control rodents. These results are consistent with my claim above that such tests are not-very-reliable tests requiring large study group sizes to produce even borderline, modest evidence of a memory effect.

The problem with a measurement technique that only gives you the right answer about 60% of the time is that when using such a technique it is really easy to get false alarms, particularly with small study group sizes. So you have no basis for strong confidence in some study testing only about 15 rats using a T-maze or a Y-maze.
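The small-sample problem can be illustrated with a rough power calculation. The Python sketch below makes a strong simplifying assumption that real studies do not follow exactly: each animal is treated as a single yes/no outcome, with a 60% chance of the "right" answer when memory is intact versus 50% under pure chance, and a one-sided exact binomial test at the 0.05 level is used. The function names and group sizes are hypothetical choices for illustration.

```python
from math import comb


def binomial_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_count(n, p_null=0.5, alpha=0.05):
    """Smallest count k such that P(X >= k) under the null is at most alpha."""
    for k in range(n + 2):  # k = n + 1 gives an empty tail of probability 0
        if binomial_tail(n, k, p_null) <= alpha:
            return k


def power(n, p_true, p_null=0.5, alpha=0.05):
    """Chance of reaching the critical count when the true rate is p_true."""
    return binomial_tail(n, critical_count(n, p_null, alpha), p_true)


# Assumed true rate of 60%, taken from the rough figure discussed above.
for n in (15, 30, 100, 200):
    print(f"n = {n:3d}: power = {power(n, 0.6):.2f}")
```

Under these assumptions a group of 15 animals has only about a 9 percent chance of detecting a real 60% effect, and the chance does not become comfortable until the group size reaches the hundreds.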
