Kaggle, in partnership with Google DeepMind, has announced the results of its first "Kaggle Game Arena — Chess Text Input" competition, where AI models battled using only text-based inputs. The tournament, which ran from August 5-7, 2025, saw OpenAI's o3 emerge victorious, delivering a decisive 4-0 sweep against xAI's Grok 4 in the final match. The competition aimed to rigorously evaluate the reasoning capabilities of large language models (LLMs) in a strategic game environment.
The tournament's distinctive format barred participating AI models, including prominent LLMs such as Google's Gemini 2.5 Pro, Anthropic's Claude Opus 4, and DeepSeek-R1, from using external tools or move validation. This rule forced the models to rely solely on their internal reasoning to generate chess moves from text prompts. As Kaggle's official tweet put it, "AI models faced off using only text inputs (no tools, no move validation) in 40+ matches per pairing to build a robust Elo-like ranking."
The final showdown between o3 and Grok 4 garnered significant attention, with chess grandmaster Magnus Carlsen providing commentary. Carlsen notably critiqued Grok 4's performance, likening its play to "kids' games" and estimating its chess strength at approximately "800 Elo," highlighting its strategic shortcomings compared to o3, which he assessed at "1200 Elo." This outcome also underscored the ongoing rivalry between OpenAI co-founder Sam Altman and xAI founder Elon Musk.
The Kaggle Game Arena is positioned as a new, open-source benchmarking platform designed to push the boundaries of AI evaluation beyond traditional static tests. Google DeepMind and Kaggle emphasize that strategic games like chess provide a dynamic and verifiable measure of AI capabilities, testing skills such as long-term planning and adaptation against intelligent opponents. The platform also generates a comprehensive, Elo-like leaderboard based on hundreds of "behind the scenes" games to provide a robust benchmark.
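Kaggle has not published the exact formula behind its "Elo-like" leaderboard, but the underlying idea is the same as standard Elo: after each game, both players' ratings shift toward the observed result, with the size of the shift determined by how surprising that result was. As a rough, hypothetical illustration (the function names and K-factor here are illustrative choices, not Kaggle's implementation), a minimal standard Elo update looks like this:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score for player A against player B under the Elo model.

    A 400-point rating gap corresponds to 10:1 expected odds.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float,
               score_a: float, k: float = 32) -> tuple[float, float]:
    """Return both players' updated ratings after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
    k (the K-factor) controls how fast ratings move; 32 is a common default.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Illustrative example: a 1200-rated model beats an 800-rated one.
# Because the favorite was expected to win, the rating change is small.
print(update_elo(1200, 800, 1.0))
```

Running many games per pairing, as the Arena does with its 40+ matches, reduces the noise in such a rating, which is why the platform plays "hundreds" of games behind the scenes before ranking models.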
Looking ahead, Kaggle plans to expand the Game Arena to include other complex strategic games, such as Go and Werewolf, further challenging AI models in diverse scenarios. This initiative aims to provide deeper insights into how AI systems "think" and adapt under pressure, contributing to the development of more generally intelligent AI.