Saturday, July 19, 2025

Science is Dead, Long Live Science

AI Scientist
From Dave Strom at Hat Hair, Medical Journals Flooded With AI Generated 'Research', via Owen Gregorian @OwenGregorian

Low-quality papers based on public health data are flooding the scientific literature | Miryam Naddaf, Nature

The appearance of thousands of formulaic biomedical studies has been linked to the rise of text-generating AI tools.

Data from five large open-access health databases are being used to generate thousands of poor-quality, formulaic papers, an analysis has found. Its authors say that the surge in publications could indicate the exploitation of these databases by people using large language models (LLMs) to mass-produce scholarly articles, or even by paper mills — companies that churn out papers to order.

The findings, posted as a preprint on medRxiv on 9 July, follow an earlier study that highlighted an explosion of such papers that used data from the US National Health and Nutrition Examination Survey (NHANES). The latest analysis flags a rising number of studies featuring data from other large health databases, including the UK Biobank and the US Food and Drug Administration’s Adverse Event Reporting System (FAERS), which documents the side effects of drugs.

Between 2021 and 2024, the number of papers using data from these databases rose from around 4,000 to 11,500 — around 5,000 more papers than expected on the basis of previous publication trends.

The study’s authors warn that a large number of these papers — many of which have repetitive, template-like titles — are likely to be of low quality and could flood the scientific literature. Their analysis is intended as “an early warning system … so that peer reviewers, editors and researchers can understand where the vulnerabilities in the system lie”, says co-author Matt Spick, a biomedical scientist at the University of Surrey in Guildford, UK.

Unexpected growth

Spick and his colleagues analysed changes in publication counts, title wording and author affiliations for papers that were based on data from 34 open-access health databases. The team used an algorithm to predict the growth in the numbers of papers expected for each data set from 2014 to 2024 — a period during which text-generating LLM tools such as ChatGPT and Gemini became mainstream.


When they compared their predictions with actual publication rates, the researchers identified six data sets that had significantly exceeded the growth rates predicted by the algorithm. All but one also showed a rise in the number of papers with ‘template-like’ titles. These data sets were NHANES, UK Biobank, FAERS, the Global Burden of Disease (GBD) study and the Finnish genetic database FinnGen. By 2024, the number of papers using FinnGen data grew by nearly 15 times from 2021, for example, while those using FAERS increased by nearly 4 times and UK Biobank by 2.4 times over the same period.

The researchers also uncovered some dubious papers, which often linked complex health conditions to a single variable. One paper used Mendelian randomization — a technique that helps to determine whether a particular health risk factor causes a disease — to study whether drinking semi-skimmed milk could protect against depression, whereas another looked into how education levels affect someone’s chances of developing a hernia after surgery.

“A lot of those findings might be unsafe, and yet they’re also accessible to the public, and that really worries me,” says Spick.

“This whole thing undermines the trust in open science, which used to be a really non-controversial thing,” adds Csaba Szabó, a pharmacologist at the University of Fribourg in Switzerland.


Broad perspective

Igor Rudan, a global-health researcher at the University of Edinburgh, UK, and a co-editor-in-chief of the Journal of Global Health, praises the study for having “systematically addressed this problem in the entirety of the scientific literature”. “We need to understand this issue better. From the perspective of a single journal, you cannot do that,” he adds.

Rudan says that, in 2022, Journal of Global Health editors noticed an unusual rise in submissions for papers that used open-access data sets, including the UK Biobank, GBD and NHANES. In 2023 and 2024, these manuscripts constituted 10% and 15% respectively of all submissions to the journal. That has now risen to nearly 20%, and the journal is receiving manuscripts on these databases almost daily, he adds.

In response, the journal introduced guidelines earlier this month for researchers submitting research on open-access data sets. These require authors to declare how many papers they published in the last three years that analysed such data sets, disclose the use of artificial intelligence in preparing manuscripts and explain how they rule out false positives in their results.

Read more: https://archive.is/1qURI
The scientific literature system relies heavily on trust. Editors and peer reviewers are, by and large, working scientists; article editing and review is an unpaid add-on to their day jobs, and they don't have the time or motivation to look deeply for signs of fabrication.

The motivation for fabrication is obvious: in a "publish or perish" culture there is ample reason to skip the hard part of actually doing the work and go straight to the write-up. Now that we have LLMs to do the writing too, it was only a matter of time until those scientists willing to cheat adopted them. Using them has become a chronic problem in college; well, these scientists were all students once.

The Wombat has Rule Five Sunday: The Ghost of Safeguard-chan Watches Over Us up and at 'em at The Other McCain.
