ComputerRL Achieves 48.1% Success Rate on OSWorld Benchmark, Surpassing Major AI Models

ComputerRL, an end-to-end reinforcement learning method, has achieved a 48.1% success rate on the challenging OSWorld benchmark. The result, announced by Xiao Liu (Shaw), a PhD student at Tsinghua University, marks significant progress in developing autonomous computer use agents. The score was obtained with the 9-billion-parameter open model AutoGLM-OS-9B (based on GLM-4-9B-0414) and, according to the announcement, surpasses previous state-of-the-art systems including OpenAI Operator and Claude Sonnet 4.0.

The OSWorld benchmark is a widely used evaluation platform for multimodal agents, designed to assess their ability to perform complex, open-ended tasks in real computer environments across operating systems such as Ubuntu, Windows, and macOS. It features 369 real-world tasks involving web and desktop applications, making it a robust test of an agent's practical utility. When OSWorld was first released, the best-performing model achieved a success rate of only 12.24%; more recently, the highest-scoring agents on OSWorld-Human, a related benchmark, reached 42.5%.

Xiao Liu, affiliated with Zai_org, stated in a tweet, "Thrilled to share our latest progress on Computer Use Agent, ComputerRL, an end-to-end RL method which achieves 48.1% success rate on OSWorld Benchmark with only 9B open model, beating OpenAI Operator, Claude Sonnet 4.0, and other previous models, state-of-the-art performance." The result is a substantial step toward closing the gap with human performance, which stands at 72.36% on the benchmark.
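To put the reported figures side by side, the short Python sketch below works out the remaining gap to human performance and roughly how many of the 369 tasks each success rate corresponds to. The function and constant names are illustrative only. The task counts are approximations, since they assume every task contributes equally to the reported success rate, which the announcement does not confirm.

```python
# Figures quoted in this article (all in percent, except TOTAL_TASKS).
TOTAL_TASKS = 369        # tasks in OSWorld
COMPUTERRL_RATE = 48.1   # ComputerRL with AutoGLM-OS-9B
HUMAN_RATE = 72.36       # human performance on the benchmark
EARLY_BEST_RATE = 12.24  # earlier best-performing model


def percentage_point_gap(higher: float, lower: float) -> float:
    """Difference between two success rates, in percentage points."""
    return round(higher - lower, 2)


def approx_tasks_solved(rate_percent: float, total_tasks: int) -> int:
    """Approximate task count implied by a success rate.

    Assumption: each task counts equally toward the rate.
    """
    return round(rate_percent / 100 * total_tasks)


gap_to_human = percentage_point_gap(HUMAN_RATE, COMPUTERRL_RATE)
gain_over_early_best = percentage_point_gap(COMPUTERRL_RATE, EARLY_BEST_RATE)
computerrl_tasks = approx_tasks_solved(COMPUTERRL_RATE, TOTAL_TASKS)
human_tasks = approx_tasks_solved(HUMAN_RATE, TOTAL_TASKS)

if __name__ == "__main__":
    print(f"Gap to human performance: {gap_to_human:.2f} points")
    print(f"Gain over earlier best:   {gain_over_early_best:.2f} points")
    print(f"Approx. tasks solved:     {computerrl_tasks} (ComputerRL) "
          f"vs {human_tasks} (human) of {TOTAL_TASKS}")
```

Under these assumptions, ComputerRL's score corresponds to roughly 177 of the 369 tasks, leaving a gap of 24.26 percentage points to the human baseline.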

ComputerRL's success underscores the growing efficacy of reinforcement learning in training agents for complex GUI interactions. That the method works with a relatively small 9-billion-parameter model suggests a promising path toward more efficient and capable AI systems for desktop automation. The advance could accelerate the development of more versatile and robust AI assistants capable of navigating diverse digital workflows with minimal human intervention.