Nous Research's recently released Hermes 4 large language models are showing intriguing and unexpected behavior, particularly around their "thinking" versus "non-thinking" modes. A recent observation shared by xlr8harder on social media highlights that while both the 405-billion- and 70-billion-parameter models score highly in thinking mode, the larger 405B model exhibits a significant drop when operating in non-thinking mode.
The Hermes 4 series, developed by the open-source AI startup Nous Research, introduces a "hybrid reasoning" capability that lets users toggle between quick, direct responses and a deeper, step-by-step thought process delimited by explicit <think> tags. This design aligns with Nous Research's philosophy of building user-controlled models with minimal content restrictions, aiming for neutral alignment and strong performance on complex tasks such as mathematics, coding, and reasoning. The models were trained with the company's DataForge and Atropos systems, with an emphasis on verified reasoning data.
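To make the mode toggle concrete, here is a minimal Python sketch of how one might switch between the two modes against an OpenAI-compatible inference server. The endpoint URL, model identifier, system-prompt wording, and the helper function below are illustrative assumptions rather than Nous Research's documented interface; the official Hermes 4 model card specifies the exact prompt format.

```python
# Minimal sketch: toggling hybrid reasoning on a Hermes 4 model served
# behind an OpenAI-compatible chat endpoint. All names here (URL, model
# identifier, system prompt) are assumptions for illustration.
import re
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Hypothetical system prompt asking the model to reason inside <think> tags.
THINKING_SYSTEM_PROMPT = (
    "You are a deliberate assistant. Reason step by step inside <think> "
    "and </think> tags before giving your final answer."
)

def ask(question: str, thinking: bool = True) -> str:
    messages = []
    if thinking:
        messages.append({"role": "system", "content": THINKING_SYSTEM_PROMPT})
    messages.append({"role": "user", "content": question})

    response = client.chat.completions.create(
        model="Hermes-4-405B",  # assumed model identifier
        messages=messages,
    )
    text = response.choices[0].message.content

    # Strip any <think>...</think> block so only the final answer remains.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print(ask("What is 17 * 24?", thinking=True))   # slower, step-by-step reasoning
print(ask("What is 17 * 24?", thinking=False))  # quick, direct response
```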
According to the tweet, evaluation with SpeechMap, a benchmark that tracks how willingly models answer contentious or sensitive prompts, reveals a notable disparity: the Hermes 405B model achieved a 90.1% score in thinking mode but only 55.5% in non-thinking mode. In contrast, the smaller Hermes 70B model scored 89.3% in thinking mode and maintained a robust 81.9% in non-thinking mode. As xlr8harder stated in the tweet, "This reverses the usual trend of thinking models being more restrictive."
This pattern in the 405B model suggests a potential trade-off in how the two modes were tuned: when its reasoning process is not actively engaged, the larger model appears markedly more restrictive, declining or hedging on prompts it would answer in thinking mode. While high compliance in thinking mode is valuable for complex problem-solving, a significant dip in non-thinking mode could limit the model's versatility for quick, everyday queries where users skip the slower reasoning path. The 70B model, by contrast, appears to offer a more balanced profile across both modes.
Nous Research continues to advance open-source AI, with the Hermes 4 models demonstrating frontier-level capabilities and a strong commitment to user steerability. The company's technical report cites high scores on various benchmarks, including 96.3% on MATH-500 in reasoning mode for the 405B model and a leading 57.1% on RefusalBench, reflecting a willingness to engage with diverse topics. Continued analysis of these performance nuances will be important for optimizing future hybrid reasoning models.