OpenAI o3 Helps Doctors Revisit Rare Disease Cases in NEJM AI Study

Researchers at Boston Children’s, Harvard, and OpenAI used o3 Deep Research to reanalyze 376 previously unsolved rare disease cases. The model surfaced evidence-linked leads that helped specialists confirm 18 diagnoses, but the study is careful about what AI did and did not decide.
A National Cancer Institute technician validates genetic variants identified through whole-exome sequencing. Photo: National Cancer Institute / Unsplash.

Researchers at Boston Children’s Hospital, Harvard University, and OpenAI used an OpenAI reasoning model to revisit 376 previously unsolved rare disease cases. The model surfaced candidate explanations that ultimately helped specialists confirm 18 diagnoses, according to a study published June 18 in NEJM AI and an accompanying OpenAI summary.

The result is modest in percentage terms: 18 diagnoses from 376 cases, an added diagnostic yield of 4.8% after earlier expert review. For rare disease genomics, though, that can be meaningful, because these were not first-pass cases waiting for routine analysis. Many had already been through sequencing, commercial or institutional pipelines, and multidisciplinary specialist review. The AI system did not diagnose patients; it generated evidence-linked hypotheses that clinicians and researchers then checked through established genetic review, additional testing, and clinical confirmation.

That distinction is the center of the story. The study does not show a chatbot replacing a geneticist. It shows a model acting as a reasoning and literature-synthesis layer for old cases whose answers may have become easier to find as science moved on.

Why old rare disease cases can change

Rare disease diagnosis is unusually hard because a child’s clinical presentation, family history, genome data, and published literature do not sit in one tidy place. A patient may have a variant that looked uncertain when sequencing was first done, only for later case reports, database updates, or gene-disease discoveries to make that same variant more interpretable years later.

The OpenAI-Boston Children’s workflow treated reanalysis as an ongoing knowledge problem. For each case, researchers assembled de-identified clinical and genomic information, including Human Phenotype Ontology terms, clinician notes in some cases, age and gender metadata, and filtered variant tables. Those tables included information such as rarity, predicted protein impact, ClinVar classification, and signal quality across family members. Most cases included data from the child and both biological parents.

The team then asked OpenAI o3 Deep Research to propose a plausible molecular explanation and explain the evidence behind it. Human reviewers evaluated candidate outputs using ACMG/AMP variant-classification standards, with at least two team members reviewing each candidate and disagreements resolved by consensus. A model output only counted as a diagnosis after qualified experts reviewed the evidence, the variant was classified as pathogenic or likely pathogenic, a CLIA-certified lab confirmed it, and the clinical team returned the result to the family.
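The confirmation criteria described above amount to a gate in which every condition must hold before a model suggestion counts. A minimal sketch of that logic follows; the class, field, and function names are hypothetical illustrations and are not taken from the study or from any OpenAI or hospital software.

```python
from dataclasses import dataclass

# ACMG/AMP classes that the article says qualify as a diagnosis.
ACCEPTED_CLASSES = {"pathogenic", "likely pathogenic"}


@dataclass
class CandidateReview:
    """Hypothetical record of how one model-proposed candidate was reviewed."""
    reviewer_count: int        # team members who reviewed the candidate
    consensus_reached: bool    # disagreements resolved by consensus
    acmg_classification: str   # e.g. "likely pathogenic", "uncertain significance"
    clia_confirmed: bool       # confirmed by a CLIA-certified lab
    returned_to_family: bool   # clinical team returned the result


def counts_as_diagnosis(review: CandidateReview) -> bool:
    """Return True only if every step described in the article is satisfied."""
    return (
        review.reviewer_count >= 2
        and review.consensus_reached
        and review.acmg_classification.strip().lower() in ACCEPTED_CLASSES
        and review.clia_confirmed
        and review.returned_to_family
    )


if __name__ == "__main__":
    lead = CandidateReview(2, True, "Likely pathogenic", True, True)
    unconfirmed = CandidateReview(2, True, "Likely pathogenic", False, False)
    print(counts_as_diagnosis(lead))         # True
    print(counts_as_diagnosis(unconfirmed))  # False
```

The point of the sketch is that the model's output appears nowhere in the decision: a candidate that lacks lab confirmation or a qualifying classification stays a lead, however plausible the explanation sounds.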

What the study found

The 376 unsolved cases came from four cohorts: 100 children with neurodevelopmental conditions, 61 people with rare neuromuscular disease, 200 sudden unexpected death in pediatrics cases, and 15 children or adolescents with early psychosis. The highest yield as a share of cases came from the small early psychosis cohort, where two of 15 cases (about 13%) were resolved, though the study notes that the percentage has a wide confidence interval because the cohort was so small.

The neurodevelopmental cohort produced 10 diagnoses from 100 cases. The neuromuscular cohort produced four diagnoses from 61 cases. The sudden unexpected death cohort produced two diagnoses from 200 cases. Across all groups, physicians established 18 diagnoses, including seven rediscoveries in which a diagnosis had been made outside the local research workflow but was missing from the record the team reviewed.
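The cohort figures above can be checked with a few lines of arithmetic. The sketch below recomputes each yield from the counts reported in this article and attaches a Wilson score interval to show why a result from 15 cases is so uncertain. The choice of the Wilson method is an assumption made for illustration; the article does not say which interval the study used.

```python
from math import sqrt

# (confirmed diagnoses, cases) per cohort, as reported in the article.
COHORTS = {
    "neurodevelopmental": (10, 100),
    "neuromuscular": (4, 61),
    "sudden unexpected death in pediatrics": (2, 200),
    "early psychosis": (2, 15),
}


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def summarize(cohorts):
    """Return (name, diagnoses, cases, rate, ci_low, ci_high) rows plus a total row."""
    rows = []
    for name, (dx, n) in cohorts.items():
        low, high = wilson_interval(dx, n)
        rows.append((name, dx, n, dx / n, low, high))
    total_dx = sum(dx for dx, _ in cohorts.values())
    total_n = sum(n for _, n in cohorts.values())
    low, high = wilson_interval(total_dx, total_n)
    rows.append(("all cohorts", total_dx, total_n, total_dx / total_n, low, high))
    return rows


if __name__ == "__main__":
    for name, dx, n, rate, low, high in summarize(COHORTS):
        print(f"{name}: {dx}/{n} = {rate:.1%} (95% CI {low:.1%} to {high:.1%})")
```

Under this assumption, the early psychosis interval spans roughly 4% to 38%, while the pooled 18 of 376 result is far narrower, which is why the headline 4.8% figure is the more dependable number.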

Some examples matter because they show the type of reasoning the model was being used to support. In one early-psychosis case, the system connected low-quality calls on chromosome 22 with cardiac, immune, neurodevelopmental, and psychiatric features, then proposed a 22q11.2 deletion associated with DiGeorge syndrome. Follow-up genome sequencing confirmed the structural event. In another set of cases, the model suggested that two genes, rather than one, better explained complex presentations involving muscle and neurodevelopmental features.

The model also proposed a possible new mechanistic link involving S1PR1 and vitiligo in one neurodevelopmental case. That hypothesis still needs experimental validation, but it illustrates a different use case: not merely ranking known variants, but connecting clinical features, structural biology, immunology, and genetic evidence into a testable biological explanation.

The limits are as important as the result

The study is careful about what it does not prove. It was retrospective, the cohorts were heterogeneous, and reviewers were not blinded to model confidence. The researchers did not measure time saved, cost, clinician workload, false-positive burden, or whether care changed after diagnosis. They also did not systematically evaluate every form of genetic variation, such as repeat expansions, mosaicism, deep-intronic variants, or broader structural variation.

Those limits matter because medical AI can sound more decisive than it is. A model that generates a biologically coherent explanation can still be wrong, incomplete, or overconfident. In this workflow, the useful output was not a final answer. It was a lead that a specialist could interrogate, test, reject, or confirm.

OpenAI’s broader health push gives the study extra context. The company separately said this week that more than 230 million people use ChatGPT each week for health and wellness questions, and that GPT-5.5 Instant has improved on health-specific evaluations involving physician-written rubrics. That consumer scale is one reason medical AI claims need careful handling. Better model performance can make health information easier to understand, but clinical use requires privacy controls, audit trails, local regulation, qualified clinicians, genetic counseling, and confirmatory testing.

What comes next for medical AI workflows

The next meaningful test is not whether a model can look impressive on a retrospective study. It is whether expert-led AI reanalysis can be run prospectively across institutions while improving diagnostic yield, reducing reviewer burden, managing false positives, and protecting patient data. Versioned prompts, reference checking, audit logs, calibrated uncertainty, and platform-independent workflows will matter if these systems move from research projects into clinical operations.

The Manton Center will lead the next stage with support from an OpenAI Foundation grant, aimed at building a platform-agnostic, lower-cost genetics AI copilot for clinical teams. If that work succeeds, the most important impact may be less dramatic than the phrase “AI diagnosis” suggests. Hospitals could become better at revisiting unresolved cases as medical knowledge changes, giving some families a second chance at answers without pretending the model itself is the doctor.
