Recent findings shared by Rohan Paul, Co-founder and CEO of AI research company Imbue, indicate that OpenAI's GPT-5 model significantly improves accuracy on spatial reasoning tasks when its "thinking effort" is increased. The advanced AI system, which features a reasoning_effort
parameter, demonstrated a notable jump in performance on a SpatialViz-Tiny sample. This improvement, however, comes with a substantial increase in computational resources, particularly token consumption.
According to Paul's tweet, GPT-5's accuracy on the SpatialViz-Tiny benchmark rose from 48.31% in its minimal reasoning mode to 56.78% in a medium reasoning mode. While a high reasoning mode was also tested, it frequently resulted in timeouts, achieving a raw accuracy of 52.54% or 68.89% when timeouts were disregarded. This highlights a trade-off between deeper reasoning capabilities and the practical efficiency of the model.
The gains in accuracy are directly correlated with increased token usage. The medium reasoning mode, which yielded the most significant accuracy improvement, averaged approximately 5,860 reasoning tokens per question, compared to zero in the minimal mode. This suggests that while GPT-5 can achieve more precise results by "thinking longer," the process demands considerably more processing power and associated costs.
Paul also noted the use of "hard-circular testing," a rigorous evaluation method designed to prevent models from relying on positional biases or lucky guesses. This technique involves rotating answer choices to ensure the model's understanding is robust and not merely superficial pattern recognition. The consistency of model ranking under this stringent testing indicates a genuine improvement in reasoning rather than an artifact of the testing environment.
OpenAI has positioned GPT-5 as its most advanced AI system, capable of unified reasoning and multimodal capabilities. The model, which includes variants like GPT-5-mini, GPT-5-nano, and GPT-5-pro, is designed to adapt its thinking process based on task complexity. These latest findings underscore the potential of advanced reasoning modes to enhance AI performance, while also bringing to light the critical considerations of resource allocation and efficiency in deploying such powerful models.