Improved Gemini audio models for powerful voice experiences

Quick Overview

Google has significantly enhanced its Gemini 2.5 Flash Native Audio model, enabling more natural and complex live voice interactions across its products and introducing real-time speech translation.

Improved Function Calling: Achieves 71.5% on ComplexFuncBench Audio, reliably triggering external functions and integrating real-time data into conversations.
Robust Instruction Following: Demonstrates a 90% adherence rate to developer instructions, leading to higher user satisfaction and content completeness.
Smoother Conversations: Enhances multi-turn conversation quality by more effectively retrieving context from previous interactions.
Live Speech Translation: Introduces real-time streaming speech-to-speech translation in the Google Translate app, preserving speaker intonation, pacing, and pitch.
Broad Availability: Available in Google AI Studio, Vertex AI, Gemini Live, and Search Live, powering diverse conversational experiences.

Key Points

Introduction to Gemini Audio Models

Google has introduced an upgrade to its Gemini 2.5 Pro and Flash Text-to-Speech models for greater control over audio generation.
This release focuses on Gemini 2.5 Flash Native Audio for live voice agents, improving its ability to handle complex workflows, user instructions, and natural conversations.
The updated model is available across Google AI Studio, Vertex AI, Gemini Live, and Search Live, bringing native audio to Search Live for the first time.

Key Enhancements in Gemini 2.5 Flash Native Audio

Sharper function calling: Improved reliability in triggering external functions, accurately identifying when to fetch real-time information and seamlessly integrating it into audio responses. Achieved 71.5% on ComplexFuncBench Audio.
Robust instruction following: Better at handling complex instructions, leading to higher user satisfaction and a 90% adherence rate to developer instructions (up from 84%).
Smoother conversations: Significant gains in multi-turn conversation quality by more effectively retrieving context from previous turns, creating more cohesive interactions.

New Possibilities: Live Speech Translation

Beyond powering helpful agents, native audio unlocks live speech translation capabilities.
This feature enables streaming speech-to-speech translation for headphones, preserving the speaker's intonation, pacing, and pitch.
A beta experience of this live speech translation is currently rolling out in the Google Translate app.

Customer Testimonials

Shopify (David Wurtz): 'Users often forget they’re talking to AI within a minute of using Sidekick...New Live API AI capabilities offered through Gemini [2.5 Flash Native Audio] empower our merchants to win.'
United Wholesale Mortgage (Jason Bressler): 'By integrating the Gemini 2.5 Flash Native Audio model...we've significantly enhanced Mia's capabilities...This powerful combination has enabled us to generate over 14,000 loans for our broker partners.'
Newo.ai (David Yang): 'Working with the Gemini 2.5 Flash Native Audio model through Vertex AI allows Newo.ai AI Receptionists to achieve unmatched conversational intelligence...They can identify the main speaker even in noisy settings, switch languages mid-conversation, and sound remarkably natural and emotionally expressive.'

Outline

No outline available

AI saves you up to 7 minutes

Improved Gemini audio models for powerful voice experiences

Quick Overview

Key Points

Introduction to Gemini Audio Models

Key Enhancements in Gemini 2.5 Flash Native Audio

New Possibilities: Live Speech Translation

Customer Testimonials

Outline

Similar Articles

Build a 100% Local AI Voice Assistant (LangChain + Ollama + Streamlit)

Deepfake / Voice Clone for Tabletop Exercise

Realtime Voice to Voice Agent for recruitment agency using livekit and gemini 2.5 flash native audio