OpenAI introduced GeneBench-Pro on June 30 as a harder test for AI agents in computational biology, genomics, and translational medicine. The benchmark is not meant to ask whether a model can recite biology facts or run a tidy notebook. It asks whether an agent can handle the part of research that is harder to automate: deciding what the data can support, choosing the right analysis path, revising assumptions when diagnostics look wrong, and knowing when a result is ready to influence a downstream scientific decision.
That framing matters because AI labs are moving quickly into science. OpenAI has been building GPT-Rosalind and life-science evaluations, Anthropic launched Claude Science as a research workbench this week, and Google DeepMind has pushed genomics tools such as AlphaGenome into the broader discussion about AI-assisted discovery. GeneBench-Pro gives that race a useful reality check: the best current systems are improving fast, but they remain unreliable on the sort of messy, multi-stage work that experienced computational biologists do every day.
What GeneBench-Pro Actually Tests
The benchmark contains 129 problems across 10 primary domains and 21 subdomains, with a strong genomics core. The areas include statistical genetics, population genetics, quantitative genetics, regulatory omics, functional genomics, proteomics, clinical diagnostics, pharmacogenomics, cancer genomics, microbial genomics, and forensic genetics.
Each task gives an AI agent a realistic dataset, a short experimental context, and a target estimand tied to a decision. The agent must explore the data, perform quality checks, choose an analytical method, run the analysis, interpret diagnostics, and return a final answer. In one example described by OpenAI, the agent must estimate the benefit-risk tradeoff for a targeted cancer therapy from a tumor-board-style dataset, including clinical benefit, toxicity risk, and a final treatment-class decision.
That structure is deliberately different from many biology benchmarks, where a model may start with a clean dataset and a well-specified task. Real scientific data often arrive with batch effects, missingness, confounding, outliers, weak signals, or ambiguous relationships between an experiment and the question researchers actually want to answer. GeneBench-Pro is designed to test whether an agent can make those judgment calls rather than simply execute a prescribed workflow.
Why Synthetic Data Is the Point
OpenAI built the benchmark with synthetic data, but not because it wants easy or artificial problems. The synthetic setup lets the benchmark designers know the underlying causal structure and data-generating process, which makes grading more precise. A benchmark built around historical datasets can accidentally reward one defensible analyst preference over another, or let models pass despite using a flawed method if the final number is insensitive to the mistake.
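A minimal sketch of why a known data-generating process helps grading, not OpenAI's actual generator: the simulation below plants a true treatment effect and a confounder, then compares an unadjusted estimate with a stratified one. The variable names, effect sizes, and sample size are all illustrative assumptions.

```python
import random
import statistics

TRUE_EFFECT = 1.0  # known only because the data are simulated (illustrative value)


def simulate(n, seed=0):
    """Return rows of (z, t, y) where z confounds both treatment t and outcome y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        p_treat = 0.8 if z else 0.2          # z makes treatment more likely
        t = 1 if rng.random() < p_treat else 0
        y = TRUE_EFFECT * t + 2.0 * z + rng.gauss(0.0, 1.0)  # z also raises y
        rows.append((z, t, y))
    return rows


def mean_difference(rows):
    """Unadjusted treated-minus-control difference in mean outcome."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


def stratified_difference(rows):
    """Difference in means within each level of z, weighted by stratum size."""
    total = 0.0
    for level in (0, 1):
        stratum = [r for r in rows if r[0] == level]
        total += mean_difference(stratum) * len(stratum) / len(rows)
    return total


rows = simulate(20000)
naive = mean_difference(rows)
adjusted = stratified_difference(rows)
print(f"true={TRUE_EFFECT:.2f} naive={naive:.2f} adjusted={adjusted:.2f}")
```

Because the planted effect is known, a grader can tell that the unadjusted estimate is biased rather than merely different, which is much harder to establish with a historical dataset.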
GeneBench-Pro tries to avoid both problems. The tasks are tuned so reasonable analytical choices can still land within accepted numerical ranges, while important missed steps should cause failure. OpenAI also says problem drafts went through trace analysis to check for leaks and shortcut paths. Of the 129 problems, 82 were sent to external domain experts, including graduate students, postdoctoral researchers, industry scientists, and professors, for review of realism and identifiability.
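The idea of grading against an accepted numerical range can be sketched as follows. This is a hypothetical grader, not the benchmark's real scoring code; the class name, the range, and the submitted values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptedRange:
    """Hypothetical tolerance band around a known true value."""
    low: float
    high: float

    def passes(self, answer: float) -> bool:
        return self.low <= answer <= self.high


# Illustrative values only: a band around a true effect of 1.0.
accepted = AcceptedRange(low=0.85, high=1.15)

submissions = {
    "covariate-adjusted regression": 0.97,                      # defensible choice
    "stratified estimate": 1.04,                                # another defensible choice
    "unadjusted comparison (skipped confounder check)": 2.20,   # missed a key step
}

results = {name: accepted.passes(value) for name, value in submissions.items()}
for name, ok in results.items():
    print(f"{name}: {'pass' if ok else 'fail'}")
```

Two different but reasonable methods land inside the band, while the analysis that skipped a key step falls outside it. That is the behavior the article describes the tasks as being tuned for.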
That design choice is important for readers outside biology, too. If AI agents are going to be evaluated on research work, the test needs to measure the decisions that make the work hard. A model that can produce polished prose around a dataset is not the same as a system that notices the estimator is wrong because the treatment assignment changed over time, or that an apparent signal is probably an artifact.
The Scores Show Progress and a Ceiling
OpenAI’s strongest reported system, GPT-5.6 Sol, reached a 28.7% pass rate at the highest reasoning level. With Pro mode enabled, the score rose to 31.5%. The GeneBench-Pro paper reports lower scores for earlier OpenAI models, including 12.0% for GPT-5.5 and 8.9% for GPT-5.4, while the strongest non-GPT baseline listed in the paper, Claude Opus 4.8, reached 16.0%.
Those numbers cut in two directions. They show real progress from earlier systems, especially because OpenAI says its best frontier model scored below 5% when the original GeneBench work began. They also show that today’s leading agents fail most of the time on these tasks. A 31.5% top score is impressive only if readers understand the difficulty of the work; it is not a green light to hand off consequential biological decisions to an unsupervised system.
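The arithmetic behind "fail most of the time" can be laid out in a few lines. The pass rates and the 129-problem total come from the figures above; treating a pass rate as a simple fraction of 129 problems is an assumption, since the paper may average over repeated runs.

```python
N_PROBLEMS = 129  # benchmark size reported above

# Pass rates in percent, as reported above.
pass_rates = {
    "GPT-5.6 Sol (Pro mode)": 31.5,
    "GPT-5.6 Sol (highest reasoning)": 28.7,
    "Claude Opus 4.8": 16.0,
    "GPT-5.5": 12.0,
    "GPT-5.4": 8.9,
}


def summarize(rate_pct, n=N_PROBLEMS):
    """Failure rate and the rough number of problems a pass rate corresponds to."""
    return {
        "fail_pct": round(100 - rate_pct, 1),
        "approx_solved": round(n * rate_pct / 100, 1),  # assumes a simple fraction
    }


summary = {model: summarize(rate) for model, rate in pass_rates.items()}
gain_vs_gpt55 = pass_rates["GPT-5.6 Sol (Pro mode)"] / pass_rates["GPT-5.5"]

for model, row in summary.items():
    print(f"{model}: fails {row['fail_pct']}%, ~{row['approx_solved']} of {N_PROBLEMS} solved")
print(f"Top score is {gain_vs_gpt55:.2f}x the GPT-5.5 pass rate")
```

Under that assumption, the top configuration solves roughly 40 of 129 problems and fails more than two-thirds of them, even though its pass rate is more than double that of GPT-5.5.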
The paper’s error analysis is more useful than the leaderboard. Models often identify a relevant diagnostic signal but fail to carry its implications into the next analytical choice. In practice, that means an agent may notice a quality-control issue, confounder, or modeling problem, then continue with an initially plausible but wrong analysis path. That failure mode is exactly where scientific supervision matters: the model sees part of the problem but does not reliably change course.
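One way to picture the fix for that failure mode is to make the diagnostic gate the method choice instead of merely logging it. The sketch below is an illustration, not a method from the paper: it uses a standardized mean difference as the diagnostic, and the 0.1 threshold is a commonly used rule of thumb that is assumed here. The covariate values are hypothetical.

```python
import statistics

SMD_THRESHOLD = 0.1  # rule-of-thumb imbalance cutoff; an assumption in this sketch


def standardized_mean_difference(treated, control):
    """Absolute difference in means divided by the pooled standard deviation."""
    pooled_sd = ((statistics.variance(treated) + statistics.variance(control)) / 2) ** 0.5
    return abs(statistics.fmean(treated) - statistics.fmean(control)) / pooled_sd


def choose_method(treated_covariate, control_covariate):
    """Let the diagnostic decide the analysis path rather than only reporting it."""
    smd = standardized_mean_difference(treated_covariate, control_covariate)
    if smd > SMD_THRESHOLD:
        return "adjusted", smd      # imbalance found: switch to an adjusted analysis
    return "unadjusted", smd        # arms look comparable on this covariate


# Hypothetical covariate values (for example, patient age) in each arm.
treated_age = [61, 64, 58, 70, 66, 63]
control_age = [52, 49, 57, 55, 50, 54]

method, smd = choose_method(treated_age, control_age)
print(f"SMD={smd:.2f} -> use the {method} analysis")
```

The point of the design is that noticing the imbalance and changing the estimator happen in the same function, so the signal cannot be observed and then ignored.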
What This Means for Labs
For research teams, GeneBench-Pro is a reminder to treat scientific AI agents as accelerators for supervised work, not as replacement analysts. The useful near-term role is likely in workflow exploration, code drafting, data checks, sensitivity analyses, literature-to-analysis handoffs, and candidate-method comparison. The risky use case is letting an agent produce a decision-ready biological conclusion without an expert reviewing the assumptions and intermediate choices.
That does not make the benchmark pessimistic. OpenAI notes that reviewers estimated a typical GeneBench-Pro problem could take a human expert 20 to 40 hours to complete. If AI agents can solve parts of that work for only a few dollars of inference, even partial reliability can have practical value. A lab does not need a model to replace a principal investigator for the model to save time on exploratory analysis or make a first-pass workflow more reproducible.
The procurement lesson is to ask different questions. Instead of focusing only on whether a model tops a biology leaderboard, teams should ask whether it exposes its assumptions, preserves an auditable analysis trail, supports reruns and sensitivity checks, flags weak evidence, and makes it easy for a domain expert to inspect why it chose one method over another. Those are workflow and governance requirements, not just model-capability requirements.
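As a rough illustration of what an auditable analysis trail could look like in practice, the sketch below records each step's method, assumptions, diagnostics, and an evidence flag. Every class, field, and example value is hypothetical and does not describe any vendor's product.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AnalysisStep:
    """One recorded decision in a hypothetical agent's analysis."""
    name: str
    method: str
    assumptions: list
    diagnostics: dict = field(default_factory=dict)
    evidence_flag: str = "unreviewed"  # for example "weak" or "adequate"


@dataclass
class AuditTrail:
    steps: list = field(default_factory=list)

    def record(self, step):
        self.steps.append(step)

    def weak_steps(self):
        """Names of steps an expert should inspect first."""
        return [s.name for s in self.steps if s.evidence_flag == "weak"]

    def to_json(self):
        """Serialize the trail so it can be stored and reviewed or rerun later."""
        return json.dumps([asdict(s) for s in self.steps], indent=2)


trail = AuditTrail()
trail.record(AnalysisStep(
    name="quality control",
    method="drop samples with high missingness",
    assumptions=["missingness is unrelated to outcome"],
    diagnostics={"samples_dropped": 12},
    evidence_flag="adequate",
))
trail.record(AnalysisStep(
    name="effect estimate",
    method="covariate-adjusted regression",
    assumptions=["no unmeasured confounding"],
    diagnostics={"interval_crosses_zero": True},
    evidence_flag="weak",
))

print(trail.weak_steps())
```

A structure like this gives a domain expert something concrete to review: which method was chosen, under which assumptions, and where the agent itself judged the evidence to be weak.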
The Science-AI Race Is Moving From Answers to Judgment
GeneBench-Pro lands in the same week as Claude Science, which packages AI around literature review, coding, compute access, scientific figures, and lab-specific workflows. The two releases point to the same market shift from different directions. One side is building workbenches that put AI closer to daily research operations. The other is building evaluations that ask whether those systems can reason through ambiguous analysis instead of merely helping with isolated tasks.
That is the right direction for the field. Scientific AI will not be judged only by how much biology it has memorized or how quickly it can write Python. The harder test is whether it can make sound choices when data are noisy, the right method is not obvious, and the answer will shape what a lab does next. GeneBench-Pro suggests AI agents are getting better at that work, but not yet good enough to be left alone with it.