It is a huge mistake to rely on AI tools such as ChatGPT or Gemini when dealing with any controversial topic. Such tools make use of computer systems that have no real understanding of anything. The answers they give are produced through a combination of various complicated methods. The main way in which such AI tools get their "smarts" is by stealing text written by human authors. Imagine, for example, a corporation that crawls enormous amounts of human-written text, storing as a question-and-answer pair anything matching patterns such as these:
- A phrase or sentence ending with a question mark, followed by some lines of text.
- A header beginning with the words "How" or "Why" and followed by some lines of text (for example, a header of "How the Allies Expelled the Nazis from France" followed by an explanation).
- A header not beginning with the words "How" or "Why" and not ending with a question mark, but followed by some lines that can be combined with the header to make a question and answer (for example, a header of "The Death of Abraham Lincoln," along with a description, which could be stored as a question "How did Abraham Lincoln die?" and an answer).
- A header written in the form of a request or an imperative, and some lines following such a header (for example, a header of "write a program that parses a text line and says 'you mentioned a fruit' whenever the person mentioned a fruit" would be stored with the header converted into a question such as "how do you write a program that parses a text line and says 'you mentioned a fruit' whenever the person mentioned a fruit?" and the lines that followed stored as the answer).
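This kind of harvesting would not require any real understanding; a few lines of pattern matching would do. Here is a minimal sketch in Python. It is purely hypothetical: the function name, the imperative-verb list, and the crude question templates are illustrative assumptions of mine, not a description of how any real AI product is built.

```python
import re

def harvest_qa_pair(header, body):
    """Turn one header plus its following lines into a (question, answer) pair."""
    header, body = header.strip(), body.strip()
    if not header or not body:
        return None

    # Pattern 1: a phrase or sentence ending with a question mark.
    if header.endswith("?"):
        return (header, body)

    # Pattern 2: a header beginning with "How" or "Why".
    if re.match(r"(How|Why)\b", header, re.IGNORECASE):
        return (header.rstrip(".") + "?", body)

    # Pattern 4: a header phrased as a request or an imperative.
    if header.lower().startswith(("write", "explain", "describe", "list", "create")):
        return ("How do you " + header[0].lower() + header[1:] + "?", body)

    # Pattern 3: any other header, converted into a question by a crude template
    # (a real system would need something smarter to get from "The Death of
    # Abraham Lincoln" to "How did Abraham Lincoln die?").
    return ("What about " + header[0].lower() + header[1:] + "?", body)

print(harvest_qa_pair("The Death of Abraham Lincoln",
                      "Lincoln was shot by John Wilkes Booth in April 1865."))
```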
Crawling the entire Internet and vast online libraries of books such as www.archive.org and Google Books, the corporation could create a database of hundreds of millions or possibly even billions of questions and answers. In many cases the database would have multiple answers to the same question, but some algorithm could handle such diversity. The system might give whichever answer was the most popular. Or it might choose one answer at random. Or it might stitch together multiple answers, adding text such as "Some people say..." or "It is generally believed..." Included in this question-and-answer database would be the answer to almost every riddle ever posed. So suppose someone asked the system a tricky riddle such as "which timepiece has the most moving parts?" The system might instantly answer "an hourglass." This would not occur by the system doing anything like thinking. The system would simply be retrieving an answer to that question that it had already stored. And when you asked the system to write a program in Python that lists all prime numbers between 20,000 and 30,000, the system might simply find the closest match stored in its vast database of questions and answers, and massage that answer by doing some search-and-replace.
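Continuing the hypothetical, the retrieval side of such a scheme could be sketched just as simply. In the toy code below, the dictionary contents, the similarity cutoff, and the three answer-picking strategies are all invented for illustration; the point is only that lookup plus crude text substitution can produce impressive-looking answers without anything resembling understanding.

```python
import difflib
import random
import re
from collections import Counter

# Toy question-and-answer database: each question maps to the answers harvested for it.
QA_DB = {
    "which timepiece has the most moving parts?": [
        "An hourglass.", "An hourglass.", "A mechanical chronograph."
    ],
    "write a python program that lists all prime numbers between 1000 and 2000": [
        "for n in range(1000, 2000 + 1):\n"
        "    if all(n % d for d in range(2, int(n ** 0.5) + 1)):\n"
        "        print(n)"
    ],
}

def answer(question, strategy="most_popular"):
    """Find the closest stored question, pick a stored answer, and lightly massage it."""
    matches = difflib.get_close_matches(question.lower(), list(QA_DB), n=1, cutoff=0.4)
    if not matches:
        return "I don't know."
    stored_question = matches[0]
    answers = QA_DB[stored_question]

    if strategy == "most_popular":
        chosen = Counter(answers).most_common(1)[0][0]   # most frequent stored answer
    elif strategy == "random":
        chosen = random.choice(answers)
    else:  # "hedged": stitch the distinct stored answers together
        chosen = "Some people say: " + " Others say: ".join(dict.fromkeys(answers))

    # Crude search-and-replace massaging: if the asked question differs from the
    # stored question only in its numbers, swap the stored numbers for the new ones.
    old_nums = re.findall(r"[\d,]+", stored_question)
    new_nums = re.findall(r"[\d,]+", question)
    if old_nums and len(old_nums) == len(new_nums):
        for old, new in zip(old_nums, new_nums):
            chosen = re.sub(rf"\b{old.replace(',', '')}\b", new.replace(",", ""), chosen)
    return chosen

print(answer("Which timepiece has the most moving parts?"))
print(answer("Write a Python program that lists all prime numbers between 20,000 and 30,000"))
```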
A recent science news article at the Phys.Org site is entitled "Massive study detects AI fingerprints in millions of scientific papers." Referring to the LLMs (Large Language Models) that are the basis of AI tools such as ChatGPT and Gemini, we read this:
"This spike in questionable authorship has raised concerns in the academic community that AI-generated content has been quietly creeping into peer-reviewed publications. To shed light on just how widespread LLM content is in academic writing, a team of U.S. and German researchers analyzed more than 15 million biomedical abstracts on PubMed to determine if LLMs have had a detectable impact on specific word choices in journal articles. Their investigation revealed that since the emergence of LLMs there has been a corresponding increase in the frequency of certain stylist word choices within the academic literature. These data suggest that at least 13.5% of the papers published in 2024 were written with some amount of LLM processing."
Here is a quote from the scientific paper the article is referring to. The LLM acronym again refers to Large Language Models.
"Our analysis of the excess frequency of such LLM-preferred style words suggests that at least 13.5% of 2024 PubMed abstracts were processed with LLMs. With ~1.5 million papers being currently indexed in PubMed per year, this means that LLMs assist in writing at least 200,000 papers per year. This estimate is based on LLM marker words that showed large excess usage in 2024, which strongly suggests that these words are preferred by LLMs like ChatGPT that became popular by that time. This is only a lower bound: Abstracts not using any of the LLM marker words are not contributing to our estimates, so the true fraction of LLM-processed abstracts is likely higher."
What is the problem if those writing biology papers are making heavy use of AI tools such as ChatGPT to help write their papers? There are two main problems.
(1) The false-statements-in-abstracts problem. There is a massive problem in biology papers these days: paper abstracts very commonly make claims that are not justified by any research done by the authors of the paper. If a scientist uses some AI system to write a paper's abstract after submitting the main text of the paper to that AI system, this problem will tend to become worse. When I ask Google about the topic of "exaggeration when AI is used to summarize a scientific paper," I get this answer:
- AI summaries are more prone to overgeneralization than human summaries: Studies have shown that AI summaries are significantly more likely to overstate the scope of research findings compared to summaries written by the original authors or expert reviewers.
- Newer AI models may be worse: Some studies suggest that newer AI models, such as ChatGPT-4o and DeepSeek, may be even more likely to produce broad generalizations than older ones.
- Ignoring nuances and limitations: AI summaries tend to ignore or downplay uncertainties, limitations, and specific conditions mentioned in the original paper, leading to a potentially misleading presentation of the research. This can have dangerous consequences, especially in fields like medicine, where overgeneralized findings could lead to incorrect medical decisions.
- 'Unwarranted confidence': AI models might prioritize generating fluent and confident-sounding responses, even if the underlying evidence does not fully support the strong claims they make in their summaries.
(2) The bad-citation problem and legend-recitation problem. Scientific papers very frequently reiterate false or groundless claims about previous scientific research. For example, in the world of neuroscience many thousands of very low-quality papers have been published, describing poorly designed experiments guilty of multiple Questionable Research Practices such as way-too-small study group sizes. What happens is that these junk-science papers end up getting cited over and over again by other papers. You might call this "the afterlife of junk science."
Very often when this happens, the citing authors will never even have read the body of the shoddy paper they are citing. Again and again we see papers claiming that some grand result was established by neuroscience researchers, followed by a list of cited papers. But a careful examination of the cited papers will show that none of them provided any good evidence for the grand result claimed. The citation of low-quality research is extremely abundant in neuroscience papers. When such citation becomes common, the neuroscience literature serves to propel and propagate myths and legends: groundless boasts of achievements.
The research described above gives us yet another giant reason why all statements in neuroscience papers should by default be distrusted. We cannot trust neuroscientists to write abstracts and paper titles that accurately summarize what was accomplished by the research described beneath those titles and abstracts. And we cannot trust neuroscientists to accurately describe what was demonstrated by research done by other neuroscientists.