A new preprint from Nadav Brandes and his lab at NYU School of Medicine argues that the near-perfect scores reported for advanced genomic AI models, measured as area under the receiver operating characteristic curve (AUROC), give a distorted picture of how well those models perform. The preprint, titled "Genomic heterogeneity inflates the performance of variant pathogenicity predictions," suggests that headline figures such as AUROC > 0.97 for models like Evo2 are "very misleading."
Brandes, an assistant professor at NYU, announced the findings on social media, stating, "Latest genomic AI models report near-perfect prediction of pathogenic variants (e.g. AUROC>0.97 for Evo2). We ran extensive independent evals and found these figures are true, but very misleading." He added that a full breakdown of the preprint was available.
The core of the issue lies in what the researchers term "genomic heterogeneity." The study reports that the measured performance of these large AI models, trained on vast genomic datasets, is significantly inflated because the prevalence of pathogenic variants differs across genomic contexts. A model can therefore appear highly accurate overall while much of its score comes from "easier" or more common variants, so the aggregate figure can overstate how well it separates pathogenic from benign variants in any given context.
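The inflation mechanism can be illustrated with a toy simulation. The sketch below is not the preprint's method or data: the two contexts, their prevalences, and the score distributions are all hypothetical values chosen for illustration. Each simulated variant receives a score that depends only on which context it sits in, never on whether it is actually pathogenic, so the scores carry no information within a context.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def simulate(n_per_context=1000, seed=0):
    """Build two hypothetical genomic contexts with very different prevalence."""
    rng = random.Random(seed)
    # (name, fraction of pathogenic variants, mean model score): assumed values.
    contexts = [("high_prevalence", 0.9, 2.0), ("low_prevalence", 0.1, -2.0)]
    data = {}
    for name, prevalence, mean_score in contexts:
        labels = [1 if rng.random() < prevalence else 0 for _ in range(n_per_context)]
        # The score is drawn per context and ignores the variant's own label.
        scores = [rng.gauss(mean_score, 1.0) for _ in labels]
        data[name] = (labels, scores)
    return data


data = simulate()

# AUROC computed separately inside each context.
per_context = {name: auroc(labels, scores) for name, (labels, scores) in data.items()}

# AUROC computed after pooling both contexts into one evaluation set.
pooled_labels = [y for labels, _ in data.values() for y in labels]
pooled_scores = [s for _, scores in data.values() for s in scores]
pooled = auroc(pooled_labels, pooled_scores)

print("pooled AUROC:", round(pooled, 3))
for name, value in per_context.items():
    print(name, "AUROC:", round(value, 3))
```

Under these assumed parameters the pooled AUROC comes out high (about 0.9 in expectation), while the AUROC inside each context stays near 0.5, the level of random guessing. The pooled figure is "true" in the sense that it is computed correctly, yet it reflects only the difference in prevalence between contexts.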
This inflated performance carries substantial implications for clinical genomics and genetic research. Accurate prediction of pathogenic variants is fundamental for diagnosing genetic diseases, understanding disease mechanisms, and guiding personalized medicine. An overestimation of a model's reliability could lead to misinterpretations in diagnostic settings or flawed conclusions in research.
The Brandes Lab's evaluation not only exposes this inflation but also aims to provide a path forward. The work identifies the best-performing models for specific variant types and establishes a new benchmark designed to guide future development and evaluation of genomic AI tools. The benchmark is intended to make assessments of model accuracy more reliable and context-aware, and so to support greater trust in AI-driven genomic predictions.