
AI-written code can beat humans at biomedical analysis, some studies find. What does that mean for the field?


By Patrick Sullivan, published 6 April 2026

LLMs can accelerate medical research, scientists say, but they come with risks.


Large language models can be a force multiplier for medical researchers, but not without well-defined guardrails or humans in the loop. (Image credit: Krongkaew via Getty Images)

As the general public has embraced large language models (LLMs) such as ChatGPT, Claude and Gemini, scientists have been exploring how these artificial intelligence (AI) tools could enhance medical research.

Some argue that LLMs could dramatically boost researchers' efficiency in completing certain types of medical studies, and research published in February in the journal Cell Reports Medicine exemplifies that vision for the technology.

The study used massive datasets of patient biomedical information to predict the risk of preterm birth in a given pregnancy. These kinds of predictions have been a powerful AI use case for years and were possible with more traditional machine learning methods long before LLMs. But this study was notable in that LLMs enabled junior researchers, a graduate student and a high school student, to efficiently generate highly accurate code.


That code predicted a baby's gestational age at birth and the likelihood of preterm birth. The AI's output matched and, in one case, even beat analyses from expert teams who had used human-generated code to crunch the same data.

"What I saw with junior scientists here and how effective they could be truly inspired and amazed me," said study co-author Marina Sirota, interim director of the Baker Computational Health Sciences Institute at the University of California, San Francisco.

One big promise of LLMs is to lower the barrier for researchers to produce code and conduct complex analyses — but it comes with risks. As AI quickly improves, researchers must grapple with myriad questions. What guardrails need to be established to ensure AI's accuracy? How do we measure its output? And how will the role of human researchers evolve as these systems gain prominence?

How AI prediction works

Sirota's team drew on data used in the Dialogue for Reverse Engineering Assessments and Methods (DREAM) Challenges, international competitions in which teams of scientists tackle complex biomedical problems using shared datasets.


The open-source datasets included blood transcriptomics, which looks at RNA, a molecule that reflects which genes are active in the body. They also included epigenetic information from placental cells, which describes chemical tags that sit "on top of" DNA and control which genes can be switched on, as well as microbiome data describing the bacteria present in vaginal fluid samples.

These data points were flagged with the type of sample they came from (blood, placental tissue or vaginal fluid) and labeled with the outcomes of interest, namely gestational age and preterm birth. Machine learning algorithms can then be trained to spot links between the measurements in a sample and its label. For example, they may reveal that microbiome samples with certain mixes of bacteria often come from people who have given birth early.

Once trained on a subset of data, the algorithm can be tested on samples that lack labels, to see if it can predict the label that should be there. For instance, it should flag samples with bacterial mixes similar to those in the training data linked to a higher risk of preterm birth.
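
To make the train-and-test idea concrete, here is a minimal sketch in Python. It is not the study's actual pipeline; the random feature matrix, the labels and the random-forest model are illustrative stand-ins for the real biomedical data and algorithms.

```python
# Minimal sketch, not the study's pipeline: train a classifier on labeled samples
# and check its predictions on held-out samples. All data here are random placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 50))          # 200 samples x 50 bacterial-abundance features (placeholder)
y = rng.integers(0, 2, size=200)   # 1 = preterm birth, 0 = full term (placeholder labels)

# Train on one subset of the data, hold the rest out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Predict labels for the held-out samples and compare them with the true labels
predictions = model.predict(X_test)
print("Held-out accuracy:", accuracy_score(y_test, predictions))
```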


The final step is to evaluate the models' accuracy and compare them. "Accuracy" in the context of machine learning has a specific definition: the number of correct predictions divided by the total number of predictions.
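
In code, that definition is simply a ratio of correct predictions to total predictions. A toy illustration (not from the study):

```python
# Accuracy as defined above: correct predictions divided by total predictions.
def accuracy(predictions, true_labels):
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 3 of 4 predictions correct -> 0.75
```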

Human- vs. AI-generated code

The DREAM Challenge was aimed at uncovering links between these biomedical measurements and the risk of preterm birth. Some risk factors, including infections during pregnancy, are already well known, but the challenge organizers wanted to see what additional signals might be gleaned from clinical samples, like blood.

It's the kind of work that normally demands months of effort from trained bioinformaticians. But instead of writing the analysis code themselves, the junior researchers in the recent study gave each of eight LLMs a single prompt describing the data available and the labeling task at hand: predicting gestational age or preterm birth.

LLMs tested

  • ChatGPT o3-mini-high
  • ChatGPT 4o
  • DeepSeek R1
  • Gemini 2.0 FlashExpThink
  • Qwen 2.5 Coder
  • Llama 3.2
  • Phi-4
  • DeepSeek-R1-Distill-Qwen

With this simple prompting, four of the eight models — DeepSeekR1, Gemini, and ChatGPT's o3-mini-high and 4o — produced code that ran successfully. The best performer, OpenAI's o3-mini, was as accurate as the original human DREAM Challenge teams. For one task, which involved estimating gestational age from epigenetic data, it was more accurate than humans had been.

What's more, the junior researchers generated results in about three months and submitted a manuscript describing their results within six months, whereas the same process took the original DREAM Challenge teams years.

"We got lucky with the review process here, but six months to generate the results and write the paper is pretty incredible, especially for a junior scientist," Sirota told Live Science.

Preterm birth, defined as birth before 37 completed weeks of pregnancy, affects roughly 11% of infants worldwide. Babies born too early are at higher risk than full-term babies for a host of health problems, including those affecting their brains, eyes and digestive systems. Being able to predict which pregnant patients are more likely to give birth early could mean closer monitoring and treatments to protect the baby and make full-term birth more likely, experts say.

Beyond writing code

The data used in the Cell Reports Medicine paper started "in good shape," Sirota noted, in tables that AI could easily read. "But we can speed that up as well — the cleaning part and normalization of data — with generative AI," she said.

Sirota's team is now exploring other LLM applications, including a new tool they've developed called Chat PTB (short for "preterm birth"). The ChatGPT-based tool is embedded in papers published by the March of Dimes research network, part of a nonprofit aimed at improving maternal and infant health. Instead of manually combing through this literature, researchers can now query Chat PTB and get synthesized answers with references; a task that used to take hours is compressed into seconds.
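
Chat PTB's internals aren't described here, but tools of this kind typically pair a search step over a paper corpus with a model call that synthesizes an answer and cites its sources. The sketch below is purely hypothetical; `search_papers` and `synthesize_answer` are made-up placeholders, not Chat PTB's API.

```python
# Purely hypothetical sketch of a literature-querying tool in the spirit of Chat PTB.
# The actual implementation is not described; `search_papers` and `synthesize_answer`
# are made-up placeholders, not a real API.
def search_papers(query: str, corpus: list[dict], top_k: int = 5) -> list[dict]:
    """Naive keyword scoring over {'title', 'abstract'} records from the paper corpus."""
    words = query.lower().split()
    scored = [(sum(w in paper["abstract"].lower() for w in words), paper) for paper in corpus]
    return [paper for score, paper in sorted(scored, key=lambda pair: -pair[0]) if score > 0][:top_k]

def synthesize_answer(query: str, papers: list[dict]) -> str:
    """Stand-in for an LLM call that would summarize the papers and cite each one."""
    references = "; ".join(paper["title"] for paper in papers)
    return f"Synthesized answer to '{query}', with references: {references}"

corpus = [{"title": "Vaginal microbiome signatures of preterm birth",
           "abstract": "We link microbiome composition to preterm birth risk."}]
print(synthesize_answer("microbiome markers of preterm birth",
                        search_papers("microbiome preterm birth", corpus)))
```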

But tools like Chat PTB and the code-writing approach in Sirota's study represent only the first wave. AI-enhanced medical research is moving toward "agentic" AI: systems that don't just respond to a single prompt but instead carry out multistep research workflows with increasing autonomy.

How might AI affect the workflow of biomedical research? (Image credit: Getty Images/Moor Studio)

Instead of responding only with text, an agentic system can check and iterate on its own work until it reaches its objective. It can also take action on a user's behalf, such as searching the internet and running code rather than just writing it.
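
In practice, that iteration loop can be sketched as: generate code, run it, and feed any error messages back to the model. The outline below is a simplified illustration, with `ask_llm` standing in for whatever model API a real agent framework would call.

```python
# Simplified sketch of the agentic loop described above: the model writes code, the agent
# runs it, and any error output is fed back to the model until the code succeeds.
# `ask_llm` is a placeholder for a real model call.
import subprocess
import tempfile

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for a real model call")

def run_until_success(task: str, max_attempts: int = 3) -> str:
    feedback = ""
    for _ in range(max_attempts):
        code = ask_llm(task + feedback)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as script:
            script.write(code)
        result = subprocess.run(["python", script.name], capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout  # the generated analysis ran; hand back its output
        feedback = f"\nYour previous code failed with:\n{result.stderr}\nPlease fix it."
    raise RuntimeError("agent could not produce working code")
```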

That shift toward greater AI autonomy and less human oversight brings both enormous potential and serious risk. In a January study published in the journal Nature Biomedical Engineering, researchers evaluated LLMs on 293 coding tasks drawn from 39 published biomedical studies, initially allowing the LLMs to come up with workflows on their own. They found that the overall accuracy came in below 40%.

Their solution was to separate planning from execution: They had the AI produce a step-by-step analysis plan that a human researcher reviewed before any code got written. The approach boosted the accuracy to 74%.
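
A rough sketch of that separation, reusing the same placeholder `ask_llm` call: the model drafts a numbered plan, a human approves it, and only then does code generation proceed. This illustrates the pattern the researchers describe, not their actual framework.

```python
# Rough sketch of the plan-first workflow: the model drafts a step-by-step plan, a human
# reviews it, and code is generated only for the approved steps. `ask_llm` is a placeholder
# for a real model call; this is not the researchers' actual framework.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for a real model call")

def plan_then_execute(task: str) -> list[str]:
    plan = ask_llm(f"Write a numbered, step-by-step analysis plan for: {task}")
    print("Proposed plan:\n" + plan)

    # Human checkpoint: the researcher reviews the plan before any code is written
    if input("Approve this plan? [y/n] ").strip().lower() != "y":
        raise SystemExit("Plan rejected; revise the task description and try again.")

    # Generate code step by step only after the plan is approved
    return [ask_llm(f"Write Python code for this step of the approved plan:\n{step}")
            for step in plan.splitlines() if step.strip()]
```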


"The goal is not to ask researchers to blindly trust an AI system," study co-author Zifeng Wang, who was a doctoral student at the University of Illinois Urbana-Champaign at the time of the study, told Live Science in an email.

Instead, the goal is to "design frameworks where the reasoning, planning, and intermediate steps are visible enough that researchers can supervise and validate the process," said Wang, who is a co-founder of Keiji AI.

Why safeguards matter

These risks don't mean researchers should shy away from AI, but they do need to apply the same rigor to AI-generated work that they would to any other collaborator's output, scientists caution.

"The question is not whether LLMs accelerate science or create 'AI slop,'" Ian McCulloh, a professor of computer science at Johns Hopkins University's Whiting School of Engineering, told Live Science in an email. "The question is how we leverage this powerful technology within the scientific method."

But McCulloh also cautioned against holding AI to an impossible standard. People tend to assume AI is error-prone and downplay human error, he said, when, in reality, both humans and machines make mistakes. He anecdotally described a consulting client who lamented AI's 15% miss rate on a certain task, not realizing his human employees' miss rate was 25%.

"The goal of AI is not perfection," McCulloh said, "but to do better than people."

That effort will involve agreeing on how to measure AI's success. Dr. Ethan Goh, a physician-researcher at Stanford University, pointed out that health care still lacks standardized benchmarks for evaluating AI's performance. Goh recently published a randomized trial in JAMA Network Open that studied how LLMs influence doctors' reasoning in determining diagnoses.

Because LLMs are trained on such a vast amount of data, "benchmarks are so expensive to produce," Goh told Live Science. What's more, he said, AI improves so quickly that most commercial models start beating the few benchmarks that exist and rapidly render them useless. Amid these challenges, Goh's team at Stanford's AI Research and Science Evaluation (ARISE) Healthcare Network is working to develop such standards by the end of this year.

For all the uncertainty around standards and safeguards, the researchers who spoke with Live Science shared a common conviction: AI belongs in the lab, but not unsupervised.

"We have to be careful not to forget what we know in terms of the scientific process," Sirota said. "But I think the opportunity is tremendous."

Patrick Sullivan, Live Science contributor

Patrick Sullivan has been a professional writer and editor since 2009 and producing health care content since 2015. Based in New Jersey, he is a father of two children and servant to an ever-changing number of pet rabbits. When he's not at his writing desk, you can usually find him on a yoga mat, a Brazilian jiu jitsu mat, or wandering through the woods.
