Claw-Anything: GPT-5.5 Fails New AI Assistant Test

2h ago · Source: Decrypt

Summary

Researchers have developed a new benchmark, Claw-Anything, to evaluate AI agents on personal assistant tasks. OpenAI's GPT-5.5 scored only 34.5% on it, far below its results on existing benchmarks. The benchmark simulates more than three months of user activity, spans multiple backend services, and requires interaction across different device environments. Each task also carries much more context than tasks in existing benchmarks, reflecting real-world complexity. For example, a task might require cross-referencing a price alert from weeks earlier with a calendar event and then acting on it from a phone. The benchmark also evaluates proactive assistance, where the AI acts without being asked: agents scored only 6.7% on proactive tasks, compared with 25.9% on reactive ones. The results indicate that current AI models remain unreliable for complex, long-term personal assistant roles, even when given broad digital access, and suggest that existing AI tests may be measuring the wrong things.


This is an AI-generated audio summary. Always check the original source for complete reporting.
