OpenAI's recently launched Responses API is delivering better results with reasoning models than the traditional Chat Completions API. The key advantage is that it preserves the model's chain-of-thought across the turns of an ongoing conversation, a benefit highlighted by developer Simon Willison:

> "TIL that the OpenAI Responses API gives better performance for reasoning models over Chat Completions because it better preserves their chain-of-thought throughout the ongoing conversation."

The API, introduced in March 2025, is designed to streamline complex agentic workflows and improve conversational state management.
The Responses API represents a strategic advance over OpenAI's previous offerings, combining the simplicity of the Chat Completions API with the agentic features first explored in the now-deprecated Assistants API. Unlike the stateless Chat Completions API, the Responses API can optionally store conversation state server-side, letting developers chain turns by referencing a previous response instead of resending the full message history. This design supports more robust, long-running AI applications and more capable agent behaviors.
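A minimal sketch of that stateful chaining, assuming the official `openai` Python SDK with the Responses API available and an `OPENAI_API_KEY` in the environment; the model name and prompts are purely illustrative:

```python
from openai import OpenAI

client = OpenAI()

# First turn: the response is stored server-side, so later turns can
# reference it by ID instead of resending the full message history.
first = client.responses.create(
    model="gpt-4.1",
    input="Summarize the main trade-offs between SQL and NoSQL databases.",
)
print(first.output_text)

# Second turn: pass previous_response_id plus only the new user input;
# the API reconstructs the conversation context on the server.
follow_up = client.responses.create(
    model="gpt-4.1",
    previous_response_id=first.id,
    input="Which of those trade-offs matters most for a small startup?",
)
print(follow_up.output_text)
```

Contrast this with Chat Completions, where the client must reassemble and resend every prior message on each request.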
A core benefit of the Responses API lies in its ability to preserve "reasoning tokens" across requests and tool calls, which directly enhances model intelligence and reduces operational costs and latency for developers. Prashant Mital, Head of Applied AI at OpenAI, clarified that this preservation leads to "much higher cache utilization," with reported cache rates jumping from 40% to 80% on some workloads. This mechanism allows reasoning models like OpenAI's o3 and o4-mini to maintain their internal thinking process, leading to more coherent and contextually rich responses.
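The effect is easiest to see in a tool-call round trip, where the model's output items (including its reasoning items) are fed straight back into the next request rather than being reconstructed. The sketch below assumes the Responses API's function-calling pattern; the `get_weather` tool, its schema, and the stubbed result are hypothetical placeholders:

```python
import json

from openai import OpenAI

client = OpenAI()

# Responses API function tools use a flat definition (no nested "function" key).
tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Return the current temperature for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

input_items = [{"role": "user", "content": "What's the weather in Oslo right now?"}]

response = client.responses.create(model="o4-mini", input=input_items, tools=tools)

# Feed the model's entire output back in, including its reasoning items,
# so the next request reuses that chain-of-thought instead of rebuilding it.
input_items += response.output

for item in response.output:
    if item.type == "function_call":
        input_items.append({
            "type": "function_call_output",
            "call_id": item.call_id,
            "output": json.dumps({"temperature_c": 4}),  # stubbed tool result
        })

final = client.responses.create(model="o4-mini", input=input_items, tools=tools)
print(final.output_text)
```

Because the reasoning produced before the tool call is carried into the follow-up request, the model does not have to re-derive it, which is also what drives the higher prompt-cache hit rates mentioned above.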
For developers, the Responses API simplifies the creation of sophisticated AI agents by offering built-in tools such as web search, file search, and image generation. While OpenAI recommends its adoption for new projects due to these performance gains and integrated capabilities, some developers express concerns about potential vendor lock-in and new pricing for certain tool calls, such as file search. Despite these considerations, the API aims to reduce boilerplate and improve the developer experience for agentic applications.
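A short sketch of invoking a built-in tool, assuming the hosted web search tool is enabled for the account under the `web_search_preview` tool type; the query is illustrative:

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1",
    tools=[{"type": "web_search_preview"}],
    input="What did OpenAI announce about the Responses API in March 2025?",
)

# output_text collects the assistant's final text; any citation annotations
# are attached to the message items inside response.output.
print(response.output_text)
```

The tool runs on OpenAI's side, so there is no function-calling loop to manage in client code, which is much of the boilerplate reduction the API promises.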
The company positions the Responses API as crucial for optimizing performance with future models like GPT-5, urging developers to consider migrating from older APIs to fully leverage these advancements. By providing a unified interface for agentic workflows, OpenAI is shaping the future of AI development towards more structured, capable, and agent-driven systems. This evolution underscores OpenAI's commitment to enabling more complex and intelligent AI interactions.