OpenAI LifeSciBench: AI Fails Nearly 2/3 of Scientific Tasks
Summary
OpenAI's new LifeSciBench evaluation shows that AI models are still far from mastering scientific research. The best-performing model, GPT-Rosalind, completed only 36.1% of the benchmark's 750 tasks, so nearly two out of three research-level tasks defeat even the most advanced AI.
The benchmark's design sets it apart from standard tests. Its 750 tasks were developed with 173 PhD-level scientists, and none of them are multiple-choice. Instead, each is a free-response prompt written the way a scientist would brief a colleague. Together the tasks cover seven scientific workflows and seven biological domains, and more than half require interpreting complex scientific artifacts.
Scoring is rubric-based. Each task has a rubric with an average of about 25 grading criteria, for more than 19,000 criteria across the benchmark. Models are judged not just on whether the answer is correct, but also on scientific validity, justification, and appropriate level of detail. The bottom line is that current AI systems are not yet equipped for the full complexity of real-world scientific research.
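The headline figures above can be checked with a few lines of arithmetic. The sketch below is only a sanity check on the numbers quoted in this summary. It assumes the 36.1% figure can be read as a share of the 750 tasks, which the summary does not confirm, and the function and constant names are illustrative rather than part of any LifeSciBench tooling.

```python
# Figures quoted in the summary; nothing here comes from the benchmark itself.
TOTAL_TASKS = 750
BEST_SCORE_PCT = 36.1          # GPT-Rosalind's reported result
AVG_CRITERIA_PER_TASK = 25     # "an average of 25 grading criteria"
REPORTED_MIN_CRITERIA = 19_000  # "over 19,000 criteria"


def check_headline_figures():
    """Return derived values implied by the quoted numbers."""
    # Share of tasks not completed by the best model.
    failed_pct = round(100 - BEST_SCORE_PCT, 1)
    # Approximate task count, ASSUMING the score is a share of tasks.
    approx_tasks_completed = round(TOTAL_TASKS * BEST_SCORE_PCT / 100)
    # Criteria total if the average were exactly 25 per task.
    criteria_at_exact_avg = TOTAL_TASKS * AVG_CRITERIA_PER_TASK
    # Smallest per-task average that would reach the reported total.
    min_avg_for_reported_total = round(REPORTED_MIN_CRITERIA / TOTAL_TASKS, 2)
    return {
        "failed_pct": failed_pct,
        "approx_tasks_completed": approx_tasks_completed,
        "criteria_at_exact_avg": criteria_at_exact_avg,
        "min_avg_for_reported_total": min_avg_for_reported_total,
    }


if __name__ == "__main__":
    for name, value in check_headline_figures().items():
        print(f"{name}: {value}")
```

Two things follow from the numbers. A failure share of 63.9% is slightly under two thirds, which matches the "nearly two out of three" wording. And an average of exactly 25 criteria would give 18,750 in total, so the "over 19,000" figure implies the true average is a little above 25 and has been rounded down.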
This is an AI-generated audio summary. Always check the original source for complete reporting.