ORCA V3 Report: AI Math Performance & Stability Tested

May 22 · Source: FinancialContent

Summary

Omni Calculator has published its ORCA V3 report, an independent benchmark that evaluates how accurately and consistently large language models solve real-world quantitative problems. The report presents updated findings on accuracy, consistency, and calculation stability. The benchmark comprises over 500 quantitative problems across seven categories, from Finance and Economics to Physics, and uses verified answer keys from Omni Calculator's library of over 3,800 calculators. Models are tested with a zero-shot methodology, meaning they get only one attempt at a response, which reflects a general user's experience. A key part of the project is the "Instability Metric," which measures how often a model gives different answers to the same question. The ORCA V3 report includes findings for ChatGPT 5.3, Claude Sonnet 4.6, and Grok 4.20. Grok 4.20 achieved a 70.4% math accuracy score and a 33.1% instability score. Claude Sonnet 4.6 scored 53.2% in math accuracy, while ChatGPT 5.3 recorded 48.4%. The report also discusses "Regression Risk," where newer versions of a model may perform worse on some tasks than older ones, a variability that could affect automated workflows. The initiative aims to bring more transparency to AI model performance in critical mathematical and logical tasks.
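The summary does not give the exact formula behind the Instability Metric, so the following is only an illustrative sketch of one plausible reading: the share of questions for which repeated runs produce more than one distinct answer. The function name `instability_rate`, the sample data, and the use of exact string comparison are all assumptions made for illustration, not details from the ORCA V3 report.

```python
from typing import Dict, List


def instability_rate(runs: Dict[str, List[str]]) -> float:
    """Fraction of questions whose repeated runs gave more than one distinct answer.

    Assumption: answers are compared as exact strings. The actual report
    may normalize or round answers before comparing them.
    """
    if not runs:
        raise ValueError("no questions provided")
    unstable = sum(1 for answers in runs.values() if len(set(answers)) > 1)
    return unstable / len(runs)


# Hypothetical data: four questions, each asked three times.
sample_runs = {
    "loan_payment": ["1073.64", "1073.64", "1073.64"],
    "compound_interest": ["1628.89", "1647.01", "1628.89"],  # answers disagree
    "projectile_range": ["40.8", "40.8", "40.8"],
    "unit_conversion": ["2.54", "2.54", "2.54"],
}

rate = instability_rate(sample_runs)
print(f"Instability: {rate:.1%}")
```

Under this reading, a lower rate means the model more reliably returns the same answer when asked the same question again.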


This is an AI-generated audio summary. Always check the original source for complete reporting.
