Butler Predicts Near 100% Test Scores for Gemini 2.5 Pro and GPT-5 Thinking, Advocates Open-Source AI Evals

Prominent AI commentator Adam Butler, Chief Investment Officer at Resolve Asset Management, recently underscored the critical need for open-source evaluation sets in artificial intelligence development. In a social media post, Butler speculated that advanced models like Google's Gemini 2.5 Pro and OpenAI's gpt-5-thinking would achieve "near ~100%" on existing AI tests, highlighting their rapid advancement.

Butler's tweet acknowledged the significant value of contemporary AI research but emphasized the responsibility of developers to "open-source their eval set so the analysis can be updated as new, more capable models are released." The statement aligns with a growing consensus in the AI community in favor of greater transparency and collaborative oversight of AI system performance.

Google's Gemini 2.5 Pro, a highly capable multimodal model, became generally available in June 2025. It is recognized for its robust reasoning, advanced coding abilities, and capacity to process diverse inputs, including text, images, and video. The model features a substantial 1-million token context window, enabling complex data analysis and problem-solving.

OpenAI officially launched its next-generation AI system, GPT-5, on August 7, 2025. The gpt-5-thinking variant, specifically referenced by Butler, is built for extended, deliberate reasoning and sits within a unified system that routes each query to the most suitable underlying model. The system aims to deliver expert-level responses across a broad spectrum of applications.
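
To make the routing idea concrete, the sketch below shows how a unified system might dispatch a prompt to either a fast default tier or a deeper-reasoning tier. This is a conceptual toy, not OpenAI's implementation: the keyword heuristic, the function names, and the model identifier "gpt-5-main" are all assumptions made purely for illustration.

```python
# Toy sketch of query routing between a fast tier and a "thinking" tier.
# All names here are hypothetical; production routers are typically
# learned models, not keyword heuristics like this one.

FAST_MODEL = "gpt-5-main"          # assumed name for a fast default tier
THINKING_MODEL = "gpt-5-thinking"  # deep-reasoning tier referenced by Butler

def looks_complex(prompt: str) -> bool:
    """Crude stand-in for a learned router: flag prompts that hint
    at multi-step reasoning, or that are simply long."""
    cues = ("prove", "derive", "step by step", "debug", "plan")
    return len(prompt) > 500 or any(c in prompt.lower() for c in cues)

def route_query(prompt: str) -> str:
    """Return the model tier this toy router would dispatch to."""
    return THINKING_MODEL if looks_complex(prompt) else FAST_MODEL

if __name__ == "__main__":
    print(route_query("What is the capital of France?"))             # gpt-5-main
    print(route_query("Prove this loop terminates, step by step."))  # gpt-5-thinking
```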

The call for open-source evaluation sets is rooted in the role such benchmarks play in AI safety, transparency, and independent auditability. Public access to these tools allows a diverse group of researchers and ethicists to identify potential biases, mitigate risks, and accelerate the development of robust safety protocols. The approach also helps democratize AI development, encouraging collective problem-solving and ensuring that advancements benefit a wider community.
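
What publishing an eval set enables in practice can be shown with a minimal harness: if the prompts and expected answers are public, anyone can recompute a model's score whenever a new system ships. The sketch below is a simplified illustration under assumptions of my own; the two-item EVAL_SET, the exact-match scoring rule, and the toy model callable are all hypothetical, and a real harness would plug in an actual API client and a more forgiving grader.

```python
from typing import Callable, Dict, List

# Hypothetical open-sourced eval set: in practice this would be a
# published JSON/CSV file that anyone can download and re-run.
EVAL_SET: List[Dict[str, str]] = [
    {"prompt": "What is 17 * 24?", "answer": "408"},
    {"prompt": "What is the chemical symbol for gold?", "answer": "Au"},
]

def score(model: Callable[[str], str], eval_set: List[Dict[str, str]]) -> float:
    """Exact-match accuracy. Because the eval set is public, this
    number can be recomputed by anyone, against any future model."""
    correct = sum(
        model(item["prompt"]).strip() == item["answer"]
        for item in eval_set
    )
    return correct / len(eval_set)

if __name__ == "__main__":
    # Toy "model" that answers every prompt the same way; replace with
    # a real client call (e.g. to Gemini 2.5 Pro or gpt-5-thinking).
    toy_model = lambda prompt: "408"
    print(f"toy model accuracy: {score(toy_model, EVAL_SET):.0%}")  # 50%
```

Butler's point maps directly onto this sketch: if the EVAL_SET file is never published, the score cannot be independently re-run as models improve, and a benchmark that frontier systems already saturate at "near ~100%" quietly loses its signal.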