Glass Health's Glass 4.0 Outperforms Leading AI Models and Human Physicians on New NOHARM Clinical Safety Benchmark

Boston, MA – Glass Health's Glass 4.0 artificial intelligence model ranked among the top performers on the newly established NOHARM benchmark. The model scored ahead of leading AI systems such as GPT-5 and Claude Sonnet 4.5, and outperformed human generalist physicians by approximately 13 percentage points on the benchmark's composite metric, which is designed to assess clinical safety and accuracy in AI-generated medical recommendations.

The NOHARM (Numerous Options Harm Assessment for Risk in Medicine) benchmark, developed by Dr. David Wu and the Harvard-Stanford ARISE Research Network, is a new evaluation tool for medical AI. It uses 100 real primary-care-to-specialist consultation cases across 10 specialties, with 12,747 expert annotations, to measure the frequency and severity of harm from LLM-generated medical advice. According to the research, across the 31 LLMs tested, severe harm occurred in up to 22.2% of cases, and errors of omission accounted for 76.6% of those harms.

Dereck Paul, MD, announced the achievement on social media, stating, "Congratulations to @DavidWuMDPhD and the Harvard-Stanford ARISE Research Network on establishing the new NOHARM benchmark." He further highlighted, "@GlassHealthHQ's Glass 4.0 ranked among the very top models — ahead of GPT-5, Claude Sonnet 4.5 — and outperformed human generalist physicians by ~13 percentage points on the composite metric." This performance underscores the model's potential for safe and effective clinical application.

Glass 4.0 is designed to serve as a supervised assistant for clinicians, offering evidence-based support for tasks such as Q&A, differential diagnosis, treatment planning, and documentation drafting. The company also provides a Glass Developer API, which allows other organizations to embed this clinical AI into their own products and platforms and is intended to broaden the reach of Glass Health's AI.

The ARISE Research Network, a collaboration between Stanford and Harvard, emphasizes that NOHARM is the foundational benchmark of its broader Medical AI Superintelligence Test (MAST) suite. This initiative aims to establish realistic clinical benchmarks to evaluate the real-world performance of medical AI systems. Glass Health's strong showing on NOHARM signals a promising step forward in the development of safer and more reliable AI tools for healthcare.