
Stanford University researchers, in collaboration with Cognition AI, have introduced "Kevin," an AI model that advances coding agent research by achieving an 82% correctness rate when generating CUDA kernels. The project, detailed in a recent paper titled "Kevin: Multi-Turn RL for Generating CUDA Kernels," also demonstrated a mean speedup of 1.10x over baseline PyTorch Eager implementations. Justus Mattern, a Research Engineer at Prime Intellect, highlighted the development on social media:

> "my turn: if you are interested in working on coding agent research and being a core contributor to what will be an impactful paper, some Stanford friends are working on an industry collab and are looking for motivated researchers to join!"
The "Kevin" model utilizes a multi-turn reinforcement learning (RL) approach, a key innovation designed to mimic the iterative nature of human coding and optimization. This method allows the AI to learn from successive rounds of code generation, execution, and feedback, addressing challenges such as long trajectories and effective reward attribution in real-world engineering tasks. This iterative refinement capability enables the model to continuously improve its generated GPU kernels for both functional correctness and performance.
The research shows "Kevin" surpassing existing frontier models, including o4-mini (which achieved a 0.78x speedup), by leveraging a flexible multi-turn RL recipe. The model was trained on a modified KernelBench dataset, a benchmark for evaluating LLMs' ability to write optimized CUDA kernels for deep learning workloads. This collaboration between academia and industry aims to make AI systems more efficient by automating the complex, time-consuming process of writing high-performance GPU code.
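The 1.10x and 0.78x figures are speedups relative to PyTorch Eager, the baseline KernelBench measures against. A hedged sketch of how such a measurement is typically taken with CUDA events follows; `candidate_op` is a placeholder for a generated kernel and here simply reuses `torch.relu` so the script runs on any machine with a GPU:

```python
import torch

def time_cuda(fn, x, iters=100):
    """Average device-side milliseconds per call, measured with CUDA events."""
    for _ in range(10):          # warm-up to exclude one-time setup costs
        fn(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()     # wait for all queued kernels to finish
    return start.elapsed_time(end) / iters

if __name__ == "__main__":
    x = torch.randn(4096, 4096, device="cuda")
    candidate_op = torch.relu                 # placeholder for a generated kernel
    baseline_ms = time_cuda(torch.relu, x)    # PyTorch eager baseline
    candidate_ms = time_cuda(candidate_op, x)
    print(f"speedup: {baseline_ms / candidate_ms:.2f}x")  # >1.0x means faster
```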
The project emphasizes overcoming common hurdles in AI code generation, such as "reward hacking," where models exploit evaluation metrics without genuinely improving code quality. The researchers implemented stricter format checks and used a more capable base model, QwQ-32B, to mitigate these issues. The team is actively seeking motivated researchers to contribute to the work, with paid internship and full-time positions available.
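As an illustration of what such guards might look like, the sketch below gates reward behind format and correctness checks; the specific rules are assumptions made for this example, not the paper's exact criteria:

```python
def passes_format_check(kernel_src: str) -> bool:
    """Reject outputs that could game the metric without real kernel code."""
    if "__global__" not in kernel_src:
        return False   # no actual CUDA kernel definition in the response
    if "torch." in kernel_src:
        return False   # delegating back to PyTorch ops would trivially
                       # match the baseline instead of replacing it
    return True

def reward(kernel_src: str, correct: bool, speedup: float) -> float:
    """Speedup counts only for well-formed, numerically correct kernels."""
    if not passes_format_check(kernel_src) or not correct:
        return 0.0     # hacked or broken outputs earn nothing
    return speedup
```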