AI Expert Claims LLM Progress Could Accelerate by 6 Months with Shared Evaluations


An AI expert, Lisan al Gaib, recently sparked discussion on social media by asserting that major artificial intelligence developers are "gatekeeping" valuable large language model (LLM) evaluations. In a recent tweet, al Gaib stated:

> "the amount of good evals that Anthropic, OpenAI and Google are gatekeeping is not good for my mental health they could accelerate LLM progress by 6 months simply by sharing them."

The remark highlights a growing sentiment within the AI community regarding the transparency and accessibility of advanced LLM testing.

Leading AI developers, including Anthropic, OpenAI, and Google, are known to conduct rigorous internal evaluations, often involving extensive stress-testing, red-teaming, and behavior-centric scoring. These proprietary evaluations are crucial for identifying vulnerabilities, ensuring safety, and enhancing model performance before deployment. Companies like Anthropic have even published research on "Agentic Misalignment," detailing how models from various developers can exhibit harmful behaviors in simulated environments.

The debate between open-source and proprietary LLM evaluations centers on balancing innovation, safety, and accessibility. While proprietary models often lead in state-of-the-art performance, open-source alternatives offer greater transparency, customization, and community-driven improvements. Advocates for increased sharing argue that collaborative evaluation frameworks could foster faster development and more robust safety protocols across the industry.

Despite concerns about "gatekeeping," there are instances of shared evaluation methodologies and benchmarks. For example, a recent medRxiv preprint systematically evaluates both proprietary models (e.g., GPT-4o, Claude Sonnet 4, Gemini 2.5 Flash) and open-source models (e.g., Llama 4, Qwen2.5-VL), publishing its methodology and results to advance understanding. This indicates a complex landscape in which some evaluation efforts are indeed made public.
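To make concrete what "sharing an eval" can mean in practice, the sketch below shows a minimal, model-agnostic evaluation harness in Python. It is purely illustrative: the task format, the `query_model` callable, and the exact-match scoring rule are assumptions for this example, not any vendor's actual internal tooling or a published benchmark's real scoring code.

```python
# Hypothetical sketch of a shared, model-agnostic eval harness.
# The task schema, `query_model` callable, and exact-match scoring are
# illustrative assumptions, not any lab's actual internal evaluation code.
import json
from typing import Callable, Iterable


def run_eval(tasks: Iterable[dict], query_model: Callable[[str], str]) -> dict:
    """Score a model on tasks shaped like {"prompt": ..., "expected": ...}."""
    results = []
    for task in tasks:
        answer = query_model(task["prompt"]).strip().lower()
        results.append({
            "prompt": task["prompt"],
            "answer": answer,
            "correct": answer == task["expected"].strip().lower(),
        })
    accuracy = sum(r["correct"] for r in results) / max(len(results), 1)
    return {"accuracy": accuracy, "results": results}


if __name__ == "__main__":
    # Toy task set; a shared benchmark would publish files like this.
    tasks = [
        {"prompt": "What is 2 + 2? Answer with a number only.", "expected": "4"},
        {"prompt": "Name the capital of France in one word.", "expected": "paris"},
    ]

    # Stand-in for a real model call (a vendor SDK or a local open-source model
    # would be plugged in here behind the same one-string-in, one-string-out interface).
    def stub_model(prompt: str) -> str:
        return "4" if "2 + 2" in prompt else "Paris"

    report = run_eval(tasks, stub_model)
    print(json.dumps({"accuracy": report["accuracy"]}, indent=2))
```

The point of the sketch is that once the task files and scoring rules are public, any developer can plug in their own model behind the same interface and report comparable numbers, which is what advocates of shared evaluations are asking for.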

The call for greater transparency in LLM evaluations underscores a desire to collectively address the rapid evolution of AI capabilities and potential risks. Increased sharing of "good evals" could lead to standardized benchmarks, accelerate research into model safety and alignment, and potentially democratize access to critical insights for the broader AI ecosystem. The industry continues to navigate the balance between competitive advantage and collaborative progress in this rapidly advancing field.