OpenAI LifeSciBench: AI's Real-World Life Science Test
Summary
OpenAI has released LifeSciBench, a new benchmark designed to evaluate AI models on real-world life science research tasks. This benchmark directly addresses the gap left by traditional biology benchmarks, which often focus on narrow, fact-based questions. LifeSciBench contains 750 expert-authored tasks across seven workflows and seven biological domains. Each task includes a prompt, supporting artifacts, and a detailed grading rubric. Tasks are free-response and written as a scientist would brief a colleague. About 79% of these tasks require multiple reasoning or decision-making steps, averaging four steps each.
A cohort of 173 expert scientists, all with Ph.D.s and experience in biotechnology or pharmaceuticals, wrote these tasks. The benchmark also includes 1,062 attached artifacts, such as sequences, figures, tables, PDFs, and chemical structures. The core of LifeSciBench is its rubric system, containing 19,020 criteria, roughly 25 criteria per task. These rubrics reward specific facts, reasoning steps, or numeric answers within tolerance. Performance is summarized by a normalized rubric score and a task pass rate, with a pass threshold of 70%.
OpenAI evaluated five models, with GPT-Rosalind, a domain-specialized model, showing the highest performance. However, even the strongest model passed only about one in three tasks, indicating the benchmark is far from saturated. This benchmark offers a realistic measure of AI capabilities in complex scientific problem-solving.
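The scoring described above (rubric criteria, a normalized rubric score, and a task pass rate with a 70% threshold) can be illustrated with a minimal Python sketch. This is not OpenAI's grading code: the summary does not say how criteria are weighted or how the score is normalized, so the `Criterion` class, the per-criterion `points`, the earned-over-possible normalization, and the 5% relative tolerance are all assumptions made for illustration. Only the 0.70 threshold comes from the summary.

```python
import math
from dataclasses import dataclass

PASS_THRESHOLD = 0.70  # pass threshold stated in the summary


@dataclass
class Criterion:
    """One rubric line. The structure and weights are hypothetical."""
    description: str
    points: float  # assumed weight; the real weighting is not described
    met: bool      # whether a grader judged the response to satisfy it


def within_tolerance(answer: float, expected: float, rel_tol: float = 0.05) -> bool:
    """Check a numeric answer against an expected value (assumed 5% tolerance)."""
    return math.isclose(answer, expected, rel_tol=rel_tol)


def normalized_score(criteria: list[Criterion]) -> float:
    """Earned points divided by possible points (assumed normalization)."""
    total = sum(c.points for c in criteria)
    if total <= 0:
        raise ValueError("rubric must carry positive total points")
    earned = sum(c.points for c in criteria if c.met)
    return earned / total


def task_passes(criteria: list[Criterion]) -> bool:
    """A task passes when its normalized score reaches the threshold."""
    return normalized_score(criteria) >= PASS_THRESHOLD


def pass_rate(tasks: list[list[Criterion]]) -> float:
    """Fraction of tasks whose rubric score reaches the threshold."""
    if not tasks:
        raise ValueError("need at least one task")
    return sum(task_passes(t) for t in tasks) / len(tasks)


# Two made-up tasks, purely for illustration.
task_a = [
    Criterion("identifies the correct reading frame", 2.0, True),
    Criterion("reports a value within tolerance", 1.0, within_tolerance(10.2, 10.0)),
    Criterion("flags the missing control", 1.0, False),
]
task_b = [
    Criterion("names the relevant pathway", 1.0, True),
    Criterion("justifies the assay choice", 1.0, False),
    Criterion("proposes a follow-up experiment", 1.0, False),
]

score_a = normalized_score(task_a)
score_b = normalized_score(task_b)
overall = pass_rate([task_a, task_b])
```

Under these assumptions, the first task earns 3 of 4 points and clears the threshold, while the second earns 1 of 3 and does not, so the pass rate over the two tasks is one half. The same arithmetic shows why a model can post a moderate average rubric score while still passing only a minority of tasks: partial credit counts toward the score but not toward the pass rate.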
This is an AI-generated audio summary. Always check the original source for complete reporting.