To explain why this story and similar stories do not tell us anything reliable about memory, we should consider the issue of small sample sizes in neuroscience studies. The issue was discussed in a paper in the journal Nature Reviews Neuroscience entitled “Power failure: why small sample size undermines the reliability of neuroscience.”
The article tells us that neuroscience studies tend to be unreliable because they use sample sizes that are too small. When a sample size is too small, there is too high a chance that the effect reported by a study is just a false alarm. An article discussing this paper states the following:
The
group discovered that neuroscience as a field is tremendously
underpowered, meaning that most experiments are too small to be
likely to find the subtle effects being looked for and the effects
that are found are far more likely to be false positives than
previously thought. It is likely that many theories that were
previously thought to be robust might be far weaker than previously
imagined.
I
can give a simple example illustrating the problem. Imagine you try
to test extrasensory perception (ESP) using a few trials with your
friends. You ask them to guess whether you are thinking of a man or a
woman. Suppose you try only 10 trials with each friend, and the best
result is that one friend guessed correctly 70% of the time. This
would be very unconvincing as evidence of anything. There is roughly a 17 percent chance of guessing 7 or more out of 10 correctly on any such test, purely by chance; and if you test five people, the odds are better than even that at least one of them will guess that well purely by chance. So having one friend get 7 out of 10 guesses
correctly is no real evidence of anything. But if you used a much
larger sample size it would be a different situation. For example, if
you tried 1000 trials with a friend, and your friend guessed
correctly 700 times, the chance of that happening by luck alone would be far less than 1 in a million. That would be much better evidence.
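To make these numbers easy to check, here is a minimal Python sketch of the arithmetic, assuming each guess is a fair 50/50 call:

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of getting k or more correct guesses out of n,
    when each guess is correct with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_one_friend = prob_at_least(7, 10)          # about 0.17
p_any_of_five = 1 - (1 - p_one_friend) ** 5  # about 0.61
p_large_trial = prob_at_least(700, 1000)     # far below one in a million
print(p_one_friend, p_any_of_five, p_large_trial)
```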
Now,
the problem with many neuroscience studies is that very small sample sizes are used. Such studies fail to provide convincing evidence for anything. The snail memory experiment is an example.
The
study involved giving shocks to some snails, extracting RNA from
their tiny brains, and then injecting that into other snails that had
not been shocked. It was reported that such snails had a higher
chance of withdrawing into their shells, as if they remembered being shocked even though they never had been. But
it might have been that such snails were merely acting randomly, not
experiencing any fear memory transferred from the first set of
snails. How can you be confident that mere chance was not responsible? You would have to run many trials, or use a sample size large enough to rule chance out. This paper states that in order to have moderate confidence in results, reaching what is called a statistical power of .8 (a common benchmark for good science), there should be at least 15 animals in each group.
But judging from the snail paper, the scientists did not run a large number of trials. The effect described involved only 7 snails (the number listed on lines 571-572 of the paper), and there is no mention of testing those snails more than once. Such a result is completely unimpressive, and could easily have been achieved by pure chance, without any real “memory transfer” going on. Whether or not a snail withdraws into its shell is rather like a coin flip, and by pure chance alone you might see a pattern of into-the-shell withdrawals that you then interpret as “memory transfer.”
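As a rough illustration of how easily chance alone can produce an apparent group difference with only 7 animals, here is a toy calculation. It assumes, purely for illustration, that an untrained snail withdraws on any given test with 50 percent probability; the actual paper measured withdrawal differently, so this is a sketch of the sample-size problem, not a re-analysis of the study:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k withdrawals among n snails."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 7    # snails per group (toy model)
p = 0.5  # assumed chance that any snail withdraws on a given test

# Probability, under pure chance, that the injected group shows at least
# two more withdrawals than the control group.
prob = sum(binom_pmf(i, n, p) * binom_pmf(j, n, p)
           for i in range(n + 1) for j in range(n + 1) if i - j >= 2)
print(round(prob, 3))  # about 0.21, roughly a 1 in 5 chance from luck alone
```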
Whether a snail has withdrawn into its shell also requires a subjective judgment, and scientists eager to see a particular result might let their bias influence that judgment. Also, a snail might withdraw into its shell simply because it has been injected with something, not because it is remembering anything. Given such factors and the large chance of a false alarm when dealing with such a small sample size, this “snail memory transfer” experiment offers no compelling evidence for anything like memory transfer. We may also note that the idea of RNA storing long-term memories in animals is entirely implausible, because of RNA's very short lifetime. According to this source, RNA molecules typically last only about two minutes, with 10 to 20 percent lasting between 5 and 10 minutes. And according to this source, RNA molecules injected into a bloodstream would be too large to pass through cell membranes.
The Tonegawa memory research lab
at MIT periodically puts out sensational-sounding press releases on
its animal experiments with memory. Among the headlines on its site
are the following:
- “Neuroscientists identify two neuron populations that encode happy or fearful memories.”
- “Scientists identify neurons devoted to social memory.”
- “Lost memories can be found.”
- “Researchers find 'lost' memories”
- “Neuroscientists reverse memories' emotional associations.”
- “How we recall the past.”
- “Neuroscientists identify brain circuit necessary for memory formation.”
- “Neuroscientists plant false memories in the brain.”
- “Researchers show that memories reside in specific brain cells.”
But when we take a close look at
the issue of sample size and statistical power, and the actual
experiments that underlie these claims, it seems that few or none of
these claims are based on solid, convincing experimental evidence. Although the experiments underlying these claims are very fancy and high-tech, they rely on sample sizes so small that very little of the work qualifies as convincing evidence.
A
typical experiment goes like this: (1) Some rodents are
given electrical shocks; (2) the scientists try to figure out where
in the rodent's brain the memory was; (3) the scientists then use an
optogenetic switch to “light up” neurons in a similar part of
another rodent's brain, one that was not fear trained; (4) a judgment
is made as to whether the rodent froze when this was done.
Such
experiments have the same problems I mentioned above with the snail
experiment: the problem of subjective interpretations and alternate
explanations. The MIT memory experiments typically involve a judgment
of whether a mouse froze. But that may often be a hard judgment to
make, particularly in borderline cases. Also, we have no way of
telling whether a mouse is freezing because it is remembering
something. It could be that the optogenetic zap that the mouse gets
is itself sufficient to cause the mouse to freeze, regardless of
whether it remembers something. If you're walking along, and someone
shoots light or energy into your brain, you might stop merely because
of the novel stimulus. A science paper notes that freezing can be induced by stimulating a wide variety of brain regions: "It is possible to induce freezing by activating a variety of brain areas and projections, including the hippocampus (Liu et al., 2012), lateral, basal and central amygdala (Ciocchi et al., 2010; Johansen et al., 2010; Gore et al., 2015a), periaqueductal gray (Tovote et al., 2016), motor and primary sensory cortices (Kass et al., 2013), prefrontal projections (Rajasethupathy et al., 2015) and retrosplenial cortex (Cowansage et al., 2014)."
But
the main problem with such MIT memory experiments is that they
involve very small sample sizes, so small that all of the results
could easily have happened purely because of chance. Let's look at
some sample sizes, remembering that according to this scientific
paper, there should be at least 15 animals in each group to have
moderate confidence in your results, sufficient to reach the standard
of a “statistical power of .8.”
Let's
start with their paper, “Memory retrieval by activating engram
cells in mouse models of early Alzheimer’s disease,” which can be accessed from the link above after clicking underneath "Lost memories can be found." The paper states that “No statistical
methods were used to predetermine sample size.” That means the
authors did not do what they were supposed to have done to make sure their
sample size was large enough. When we look at page 8 of the paper, we
find that the sample sizes used were merely 8 mice in one group and 9
mice in another group. On page 2 we hear about a group with only 4 mice, and on page 4 we hear about another group with only 4 mice. Such a paltry sample size does not result in any
decent statistical power, and the results cannot be trusted, since
they very easily could be false alarms. The study therefore provides
no convincing evidence of engram cells.
Another
example is this paper by the MIT memory lab, with the grandiose title
“Creating a False Memory in the Hippocampus.” When we look at
Figure 2 and Figure 3, we see that the sample sizes used were paltry:
the different groups of mice had only about 8 or 9 mice per group.
Such a paltry sample size does not result in any decent statistical
power, and the results cannot be trusted, since they very easily
could be false alarms. No convincing evidence has been provided of
creating a false memory.
A
third example is this paper with the grandiose title “Optogenetic
stimulation of a hippocampal engram activates fear memory recall.” Figure
2 tells us that in one of the groups of mice there were only 5 mice, and that in another group there were only 3 mice.
Figure 3 tells us that in two other groups of mice there were only 12
mice. Figure 4 tells us that in another group there were only 5
mice. Such a paltry sample size does not result in any decent
statistical power, and the results cannot be trusted, since they very
easily could be false alarms. No convincing evidence has been
provided of artificially activating a fear memory by the use of
optogenetics.
Another
example is this paper entitled “Silent memory engrams as the basis
for retrograde amnesia.” Figure 1 tells us that the number of mice
in particular groups used for the study ranged between 4 and 12.
Figures 2 and 3 tell us that the number of mice in particular groups
used for the study ranged between 3 and 12. Such a paltry sample size
does not result in any decent statistical power, and the results
cannot be trusted, since they very easily could be false alarms. Another unsound paper is the 2015 paper "Engram Cells Retain Memory Under Retrograde Amnesia," co-authored by Tonegawa. When we look at Figure S13 at the end of the supplemental material, we find that the experimenters used only 8 mice in one study group and 7 in another. Such a paltry sample size does not result in any decent statistical power, and the results cannot be trusted, since they very easily could be false alarms.
We see the same "low statistical power" problem in this paper claiming an important experimental result regarding memory. The paper states in its Figure 2 that only 6 mice were used for a study group, and 6 mice for the control group. The same problem is shown in Figure 3 and Figure 4 of the paper. We see the same "low statistical power" problem in this paper entitled "Selective Erasure of a Fear Memory." The paper states in its Figure 3 that only 6 to 9 mice were used per study group. That's only about half of the "15 animals per study group" needed for a modestly reliable result. The same defect is found in this memory research paper and in this memory research paper.
The
term “engram” refers to a cell or group of cells in which a memory is supposedly stored. Decades
after the term was created, we still have no convincing evidence for
the existence of engram cells. But memory researchers are shameless
in using the term “engram” matter-of-factly even though no
convincing evidence of an engram has been produced. So, for example,
one of the MIT Lab papers may again and again refer to some cells
they are studying as “engram cells,” as if they could convince us that such cells are actually engram cells simply by telling us again and again that they are engram cells. Doing this is rather
like some ghost researcher matter-of-factly using the term “ghost
blob” to refer to particular patches of infrared light that he is
studying after using an infrared camera. Just as a blob of infrared
light tells us only that some patch of air was slightly colder
(not that such a blob is a ghost), a scientist observing a mouse
freezing is merely entitled to say he saw a mouse freezing (not that
the mouse is recalling a fear memory); and a scientist seeing a snail
withdrawing into its shell is merely entitled to tell us that he saw
a snail withdrawing into its shell (not that the snail was recalling
some fear memory).
The
relation between the chance of a false alarm and the statistical
power of a study is clarified in this paper by R. M. Christley. The
paper has an illuminating graph which I present below with some new
captions that are a little clearer than the original captions. We
see from this graph that if a study has a statistical power of only
about .2, then the chance of the study giving a false alarm is
something like 1 in 3 if there is a 50% chance of the effect existing, and
much higher (such as 50% or greater) if there is less than a 50%
chance of the effect existing. But if a study has a statistical power of .8, then the chance of the study giving a false
alarm is only about 1 in 20 if there is a 50% chance of the effect
existing, and much higher if there is less than a 50% chance of the
effect existing. Animal
studies using far fewer than 15 animals per study group (such as those I have discussed) will suffer the relatively high chance of false alarms shown in the green line.
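The relationship the graph depicts can be written down directly: the chance that a statistically significant result is a false alarm depends on the study's power, the prior probability that the effect is real, and the significance threshold. Here is a minimal sketch, assuming the conventional threshold of alpha = 0.05 (the exact numbers in Christley's figure depend on the threshold it assumes):

```python
def false_alarm_chance(power, prior, alpha=0.05):
    """Probability that a statistically significant result is a false alarm,
    given the study's power, the prior probability the effect is real,
    and the significance threshold alpha."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return false_positives / (true_positives + false_positives)

for power in (0.2, 0.5, 0.8):
    print(power,
          round(false_alarm_chance(power, prior=0.5), 2),  # effect 50% likely to be real
          round(false_alarm_chance(power, prior=0.1), 2))  # effect only 10% likely to be real
```

The qualitative pattern matches the graph: the lower the power and the less likely the effect was to begin with, the more likely a "positive" finding is to be a false alarm.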
The PLOS paper here
analyzed 410 experiments involving fear conditioning with rodents, a
large fraction of them memory experiments. The paper found that such
experiments had a “mean normalized effect size” of only .29. An effect size of .29 is small, and at the sample sizes these experiments typically use, a study chasing so small an effect is very weak, with a high chance that any positive result is a false alarm. Effect size is discussed in detail here, where we learn that with an effect size of only .3, there is typically something like a 40 percent chance of a false alarm.
To determine whether a
sample size is large enough, a scientific paper is supposed to do
something called a sample size calculation. The PLOS paper here reported
that only one of the 410 memory-related experiments it studied included such a calculation. The PLOS paper also reported that in order to achieve a moderately convincing statistical power of .80, an experiment typically needs to have 15 animals per group; but only 12% of the experiments had that many animals per group. Referring to statistical power (a measure of how likely a study is to detect a real effect rather than produce a false alarm), the PLOS paper states, “no correlation was
observed between textual descriptions of results and power.” In
plain English, that means that there's a whole lot of BS flying
around when scientists describe their memory experiments, and that
countless cases of very weak evidence have been described by
scientists as if they were strong evidence.
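For readers who want to see what such a sample size calculation looks like, here is a minimal sketch using the statsmodels library. It assumes a simple two-group comparison at the conventional 0.05 significance level and uses the .29 mean effect size reported in the PLOS paper; the exact numbers for any real experiment depend on its design:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Animals per group needed to detect an effect size of 0.29
# with the standard statistical power of 0.8:
n_needed = analysis.solve_power(effect_size=0.29, alpha=0.05, power=0.8)
print(round(n_needed))  # roughly 188 animals per group, far more than 15

# Power actually achieved by a typical group of 8 animals at that effect size:
achieved_power = analysis.solve_power(effect_size=0.29, alpha=0.05, nobs1=8)
print(round(achieved_power, 2))  # far below the 0.8 standard
```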
Our
science media shows very little sign of paying any attention to the
statistical power of neuroscience research, partially because rigor
is unprofitable. A site can make more money by trumpeting weakly suggestive research as if it were a demonstration of truth, because the more users click on a sensational-sounding headline, the more money the site makes from ads. Our neuroscientists show little sign of paying much attention to whether their studies have decent
statistical power. For the neuroscientist, it's all about publishing as
many papers as possible, so it's a better career move to do 5
underpowered small-sample studies (each with a high chance of a false
alarm) than a single study with an adequate sample size and high
statistical power.
In this post I used an assumption (which I got from one estimate) that 15 research animals per study group are needed for a moderately persuasive result. It seems that this assumption may have been too generous. In her post “Why Most Published Neuroscience Findings Are False,” Kelly Zalocusky PhD calculates (using Ioannidis’s data) that the median effect size of neuroscience studies is about .51. She then states the following, talking about statistical power:
To
get a power of 0.2, with an effect size of 0.51, the sample size
needs to be 12 per group. This fits well with my intuition of sample
sizes in (behavioral) neuroscience, and might actually be a little
generous. To bump our power up to 0.5, we would need an n of 31 per
group. A power of 0.8 would require 60 per group.
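Her figures can be roughly reproduced with a standard power calculation. The sketch below assumes an independent two-group t-test at alpha = 0.05, so its outputs may differ slightly from Zalocusky's numbers depending on the exact test design she assumed:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for target_power in (0.2, 0.5, 0.8):
    n = analysis.solve_power(effect_size=0.51, alpha=0.05, power=target_power)
    # Should land close to Zalocusky's figures of 12, 31, and 60 per group.
    print(target_power, round(n))
```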
If
we describe a power of .5 as being moderately convincing, it
therefore seems that 31 animals per study group are needed for a
neuroscience study to be moderately convincing. But most experimental
neuroscience studies involving rodents and memory use fewer than
15 animals per study group.
Zalocusky
states the following:
If
our intuitions about our research are true, fellow graduate students,
then fully 70% of published positive findings are “false
positives.” This result furthermore assumes no bias, perfect use of
statistics, and a complete lack of “many groups” effect. (The
“many groups” effect means that many groups might work on the
same question. 19 out of 20 find nothing, and the 1 “lucky” group
that finds something actually publishes). Meaning—this estimate is
likely to be hugely optimistic.
All of these things make it rather clear that a large fraction of animal memory experiments, perhaps most of them, are dubious. There is another reason why the great majority of these experiments tell us nothing about human memory: most of them involve rodents, and given the vast differences between humans and rodents, nothing reliable about human memory can be determined by studying rodent memory.
Postscript: The paper here is another example of a memory experiment that fails to prove anything because of its too-small sample size. Widely reported in the press with headlines suggesting scientists had added memories to mice while the mice slept, the study says, "We induced an explicit memory trace, leading to a goal-directed behavior toward the place field." Typically this type of study will be behind a paywall, allowing the scientists to hide their too-small sample sizes where the public won't see them without paying. But luckily www.researchgate.net often publishes the graphs from such studies, where anyone can see them. In this case the graph explanation lets us see that the scientists were using only 5 to 7 animals per study group, which means the reported result is not strong evidence for anything, being the type of result we might easily get from chance effects.
Post-Postscript: The latest example of a memory experiment that fails to prove anything because of its too-small sample size is a study in Nature that has been hyped with headlines such as "Artificial memory created." The study has the inaccurate title "Memory formation in the absence of experience." The study fails to prove that any such thing occurred. When we look at the number of animals involved, we repeatedly find that the study fails to meet the minimum standard of 15 animals per study group. In Figure 1 we learn that two of the study groups consisted of only 8 mice. In Figure 2 we learn that two of the study groups consisted of only 10 mice. In Figure 3 we learn that one of the study groups consisted of only 7 mice. Moreover, the methodology used in the study is so convoluted that it fails to provide clear and convincing evidence for anything interesting. The only evidence of memory recall is that the mice supposedly avoided some area, something that might have occurred for any number of reasons other than the recall of a memory. A robust test of an artificial memory would test an actual acquired skill, such as the ability to navigate a maze in a certain time.