OpenAI’s GeneBench-Pro benchmark tests whether AI agents can make messy judgment calls in genomics and translational biology. GPT-5.6 Sol leads the field, but its top score of 31.5% shows that scientific AI still needs expert supervision before it can be trusted with consequential research decisions.