Claw-Anything: GPT-5.5 Fails New AI Assistant Test

May 27 · Source: Decrypt

Summary

Researchers have developed a new benchmark, Claw-Anything, to evaluate AI agents on personal assistant tasks. OpenAI's GPT-5.5 scored only 34.5% on it, far below its scores on existing benchmarks. The benchmark simulates more than three months of user activity, spans multiple backend services, and requires interaction across different device environments. It also features a much larger context window per task, reflecting real-world complexity. For example, a task might require cross-referencing a price alert from weeks earlier with a calendar event, then acting on it from a phone. The benchmark also evaluates proactive assistance, where the AI acts without being asked. Agents scored only 6.7% on proactive tasks, compared with 25.9% on reactive ones. The results indicate that current AI models remain unreliable in complex, long-term personal assistant roles, even when given broad digital access, and suggest that existing AI tests may be measuring the wrong things.


This is an AI-generated audio summary. Always check the original source for complete reporting.
