MAI-Voice-1 Powers New Scripted Audio Mode in Copilot Labs, Generating Audio in Under a Second

Mustafa Suleyman, CEO of Microsoft AI, has announced a new "scripted mode" for audio generation in Copilot Labs, powered by the MAI-Voice-1 model. The mode gives users tighter control over AI-generated speech by producing verbatim output. MAI-Voice-1 is also notable for its efficiency: it can generate a minute of audio in less than one second on a single GPU.
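
To put the speed figure in perspective, one minute of audio in under one second works out to more than 60 times faster than real time. The short Python sketch below shows that arithmetic; the function name `real_time_factor` is a hypothetical helper written for illustration and is not part of any Microsoft API.

```python
def real_time_factor(audio_seconds: float, wall_clock_seconds: float) -> float:
    """Return seconds of audio produced per second of generation time."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return audio_seconds / wall_clock_seconds


# The reported figure is 60 s of audio in under 1 s, so a generation time
# of 1.0 s is the slowest case and gives a lower bound on the speedup.
lower_bound = real_time_factor(60.0, 1.0)
print(f"More than {lower_bound:.0f}x faster than real time")
```

Any generation time below one second pushes the ratio above this 60x floor.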

In scripted mode, Copilot reads user input exactly as provided. This contrasts with the two other available modes: "Emotive," which adds dramatic flair, and "Story," which performs with multiple voices and characters for narrative purposes. All three are accessible through Copilot Labs, which serves as a testing ground for new AI functionality.

"You asked, we shipped! Scripted mode just dropped for audio generation in Copilot Labs (c/o our new MAI-Voice-1 model)," Suleyman wrote in a tweet, adding, "Scripted mode: reads your input verbatim. Emotive: riffs a bit for max drama. Story: performs multiple voices/characters. Try out all 3."

MAI-Voice-1 and its integration into Copilot Labs are part of Microsoft's broader strategy to develop in-house AI models. Under Suleyman's leadership, Microsoft AI is focusing on consumer-centric AI products that improve the user experience and reduce reliance on external partners. MAI-Voice-1 is already used in other Copilot features, including Copilot Daily and Podcasts, where it provides expressive, natural speech for news updates and discussions.

The move underscores Microsoft's commitment to advancing its AI capabilities independently while offering users more varied and efficient tools for audio content creation. With its rapid generation speed and range of modes, MAI-Voice-1 marks a significant step for AI-powered voice technology within Microsoft's ecosystem.