Friday, July 3, 2026
BCN.
Technology

OpenAI's best model solves under a third of the biology-analysis problems in its GeneBench-Pro benchmark

The score is not the interesting part. OpenAI simulated every dataset from a known causal structure, so the answers can be graded against ground truth.

Janet Torvalds

July 3, 2026

On June 30 OpenAI put out GeneBench-Pro, a benchmark that hands an AI agent a messy genomics dataset and asks it to do the kind of analysis a working computational biologist does: figure out what question the data can actually answer, pick a method, notice when something is off, and commit to a number. OpenAI ran its strongest model against it. GPT-5.6 Sol solved 28.7 percent of the problems at the highest reasoning setting, 31.5 percent with Pro mode on. On a benchmark you built to be hard, a sub-third pass rate is the honest result, and OpenAI reported it as the headline rather than burying it.

That number is the part everyone will quote. The more interesting part is how the benchmark is graded.

What it actually measures

GeneBench-Pro is 129 problems across 10 domains and 21 sub-domains, from population genetics and heritability to cancer somatic genomics and pharmacogenomics. Each one drops the model into an isolated workspace with data files, a short experimental brief, a target to estimate, and a standard toolchain (Python, the usual scientific libraries, PLINK 2.0). There is no multiple choice and no fact to recall. The model has to explore the data, choose an analysis, revise when the diagnostics say to, and return a final answer.

OpenAI calls the thing being tested "research taste," which is marketing language for a real problem: knowing which analysis a dataset can support, and knowing when a plausible-looking result is actually wrong. That skill is hard to grade, which is exactly why most science benchmarks avoid it.

The grading is the news

Here is the design choice worth paying attention to. Every GeneBench-Pro problem is synthetic. OpenAI simulates the data from a known causal structure, so it knows the true answer before the model ever sees the files. That solves two failure modes that quietly wreck a lot of "AI does science" benchmarks.

The first is the arbitrary-cutoff problem. Build a question around a real historical dataset and there is often no single correct path through it. One defensible choice passes, another equally defensible choice fails, and you end up measuring whether the model guessed the benchmark author's preferences rather than whether it did good science. The second is the opposite: a problem so numerically forgiving that a model can make a real analytical error and still land on a passing number.

Because OpenAI controls the data-generating process, it can tune each problem so that reasonable differences in analytical choices still produce accepted answers, and it can run ablations to confirm that plausible-but-wrong analyses actually fail. It says it audited drafts for information leakage and shortcut solutions, and sent 82 of the 129 questions to outside domain experts to check that the problems were realistic and the targets identifiable. Deterministic grading against a known ground truth is a credible answer to benchmark gaming, and it is a design other evaluation teams can copy. That is the good engineering here, more than any single model's score.
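To make the idea concrete, here is a minimal sketch of grading against a simulated ground truth. It is not OpenAI's generator or grader: the confounded linear model, the function names, the true effect of 0.5 and the 0.1 tolerance are all assumptions chosen for illustration. The point is only that when you write the data-generating process yourself, you can check that a plausible-but-wrong analysis fails and a sound one passes.

```python
import random

def simulate(n, beta, gamma, seed):
    """Simulate (x, y, z) where z confounds the effect of x on y.

    beta is the true causal effect of x on y. It is known here only
    because we wrote the generator, which is the whole point.
    """
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                         # hidden confounder
        x = z + rng.gauss(0, 1)                     # exposure depends on z
        y = beta * x + gamma * z + rng.gauss(0, 1)  # outcome depends on both
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs

def cov(a, b):
    """Population covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def naive_slope(xs, ys):
    """Regress y on x alone, ignoring the confounder (plausible but wrong)."""
    return cov(xs, ys) / cov(xs, xs)

def adjusted_slope(xs, ys, zs):
    """Coefficient on x from regressing y on x and z (2x2 normal equations)."""
    sxx, szz, sxz = cov(xs, xs), cov(zs, zs), cov(xs, zs)
    sxy, szy = cov(xs, ys), cov(zs, ys)
    return (sxy * szz - sxz * szy) / (sxx * szz - sxz ** 2)

def grade(estimate, truth, tol):
    """Deterministic pass/fail: is the estimate within tol of the truth?"""
    return abs(estimate - truth) <= tol

TRUE_BETA = 0.5   # hypothetical true effect
TOLERANCE = 0.1   # hypothetical: wide enough for sampling noise, too narrow for bias

xs, ys, zs = simulate(n=5000, beta=TRUE_BETA, gamma=1.0, seed=0)
naive = naive_slope(xs, ys)
adjusted = adjusted_slope(xs, ys, zs)

print("naive passes:   ", grade(naive, TRUE_BETA, TOLERANCE))
print("adjusted passes:", grade(adjusted, TRUE_BETA, TOLERANCE))
```

In this toy setup the unadjusted regression absorbs the confounder and lands near 1.0 rather than 0.5, so it fails, while the adjusted estimate falls inside the tolerance. Running both analyses against the same generator is a small-scale version of the ablation check described above.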

Read the number carefully

A few caveats belong next to the 28.7 percent.

It is 129 problems. That is enough to be interesting and not enough to be the last word on anything. OpenAI built the benchmark, hardened the problems using its own frontier models during development, and then ran those same models against it. The company acknowledges the setup could bias the benchmark toward GPT models, and says competitors at best matched the corresponding GPT model and usually fell short. Maybe. A lab grading its own homework and reporting that it did well is still a lab grading its own homework, and the useful check will be the independent one.
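How much uncertainty does 129 problems leave? A rough sketch, assuming the 28.7 percent can be treated as a simple proportion over 129 independent problems (OpenAI's actual scoring may average over repeated runs, which this ignores), is a Wilson score interval. The function name and the 95 percent level are choices made for this illustration.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

# 28.7 percent pass rate on 129 problems, as reported in the article.
low, high = wilson_interval(0.287, 129)
print(f"95% interval: {low:.1%} to {high:.1%}")
```

Under those assumptions the interval runs from roughly 22 to 37 percent, which is wide enough that small differences between models on this benchmark should be read with care.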

OpenAI also frames the gap to open-weight models such as Z.ai's GLM 5.2 as evidence that open systems are "more specialized for coding than for broader reasoning." That is a real and measurable gap in their results, but the framing is doing work: it recasts a benchmark OpenAI designed as a general statement about open models. Flag it, don't swallow it whole.

The economic argument is cleaner. Reviewers estimated a single problem would take a human expert 20 to 40 hours, which at a conservative $200 an hour is $4,000 to $8,000 of labor. Model inference runs a few dollars per problem. Even at a 28.7 percent pass rate, and even with the reliability nowhere near good enough to replace anyone, that cost gap is why partial automation of this kind of analysis is worth something now. The reviewers were also specific about where models break: they are not cautious enough about data irregularities, things like ancestry swaps or ancient-DNA artifacts. That is the same failure that breaks production data pipelines, not some exotic frontier of biology.
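The arithmetic behind that cost gap is short enough to write out. The hours and the $200 rate come from the figures above; the $5 model cost is a placeholder for "a few dollars" and is an assumption, not a reported number.

```python
HOURS_LOW, HOURS_HIGH = 20, 40   # reviewer estimate per problem
HOURLY_RATE = 200                # dollars, the article's conservative figure
MODEL_COST = 5.0                 # dollars per problem; ASSUMED stand-in for "a few dollars"

human_low = HOURS_LOW * HOURLY_RATE    # cheapest human estimate
human_high = HOURS_HIGH * HOURLY_RATE  # most expensive human estimate

ratio_low = human_low / MODEL_COST
ratio_high = human_high / MODEL_COST

print(f"Human cost per problem: ${human_low:,} to ${human_high:,}")
print(f"Human-to-model cost ratio: {ratio_low:.0f}x to {ratio_high:.0f}x")
```

The ratio moves with whatever per-problem inference cost you plug in, and it says nothing about reliability: a cheap answer that is wrong more than two times in three still needs an expert to check it.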

What is actually open

OpenAI open-sourced 10 representative questions on Hugging Face with an interactive browser, and says it will hand a 50-question subset to Artificial Analysis for independent, third-party benchmarking "in the near future." The full paper is on bioRxiv. Until the outside numbers land, treat the 28.7 percent as OpenAI's own measurement of a benchmark OpenAI built, which is interesting, well-constructed, and not yet independently confirmed. The company expects the benchmark could be saturated by the end of the year. If that holds, the more telling result will be which model tops it next, and whether anyone outside OpenAI can reproduce the score.
