FreshStack Paper on Realistic RAG Benchmarks Accepted for NeurIPS 2025 D&B Track


San Diego, CA – Research on "FreshStack," a novel framework designed to create realistic evaluation benchmarks for Retrieval-Augmented Generation (RAG) systems, has been accepted for presentation at the prestigious NeurIPS 2025 Datasets & Benchmarks (D&B) Track. The paper, co-authored by Nandan Thakur of the University of Waterloo and researchers from Databricks, marks a significant step forward in addressing the challenges of evaluating AI models on complex technical documentation. The acceptance was announced on September 18, 2025, with a poster presentation confirmed for the conference in San Diego.

Nandan Thakur, a PhD student at the University of Waterloo, shared his excitement via social media, stating, "Really excited to share that FreshStack has been accepted at #neurips25 D&B Track (poster)! 🥁🥁 Huge congratulations to all my @DbrxMosaicAI co-authors! Time to see you in San Diego! 🍻" His co-authors from Databricks include Sam Havens, Michael Carbin, Omar Khattab, and Andrew Drozdov, with the work developed during Thakur's internship at the company. Databricks, having acquired MosaicML in July 2023, is a key player in generative AI research.

FreshStack introduces a holistic approach to building challenging information retrieval (IR) and RAG evaluation datasets. It leverages real-world, community-asked questions from platforms like Stack Overflow and pairs them with technical documentation sourced from public GitHub repositories. This methodology aims to overcome the limitations of existing benchmarks, which are often criticized for being outdated, artificially easy, or susceptible to data contamination by large language models.

The framework's core innovation lies in its ability to automatically generate "nuggets" (atomic facts) from question-answer pairs and assess document relevance at the nugget level using advanced AI. This ensures that the benchmarks are dynamic, scalable, and reflective of the complexity of real-world technical queries.
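The nugget-level evaluation idea can be illustrated with a minimal sketch. The function names and the substring-matching judge below are illustrative assumptions, not FreshStack's actual implementation; in the real framework an LLM performs the support judgment.

```python
# Illustrative sketch of nugget-level relevance scoring: an answer is
# decomposed into atomic "nuggets", and retrieved documents are judged on
# whether they support each nugget. A simple substring check stands in
# for the LLM-based judge used in the actual framework (assumption).

def judge_supports(nugget: str, document: str) -> bool:
    # Stand-in for an LLM judgment of whether `document` supports `nugget`.
    return nugget.lower() in document.lower()

def nugget_coverage(nuggets: list[str], retrieved_docs: list[str]) -> float:
    """Fraction of nuggets supported by at least one retrieved document."""
    if not nuggets:
        return 0.0
    supported = sum(
        any(judge_supports(n, d) for d in retrieved_docs) for n in nuggets
    )
    return supported / len(nuggets)

# Hypothetical example: two nuggets extracted from a technical answer.
nuggets = [
    "set stream=True in the client call",
    "iterate over the response chunks",
]
docs = [
    "To enable streaming, set stream=True in the client call.",
    "An unrelated documentation page.",
]
print(nugget_coverage(nuggets, docs))  # 0.5
```

Scoring at the level of atomic facts, rather than whole documents, lets a benchmark credit partial answers and pinpoint exactly which pieces of information a retrieval system failed to surface.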
The research indicates that current retrieval models significantly underperform on FreshStack, highlighting substantial room for improvement in the field.

The NeurIPS 2025 conference is scheduled to take place from November 30 to December 7, 2025, with workshops extending until December 21, in San Diego, California, and Mexico City. The acceptance of FreshStack underscores the growing importance of robust evaluation methodologies in the rapidly evolving landscape of AI and machine learning, particularly for applications involving complex technical information and code.