# @mastra/voice-google-gemini-live
Google Gemini Live API integration for Mastra, providing real-time multimodal voice interactions with advanced capabilities including video input, tool calling, and session management.
## Installation

```bash
npm install @mastra/voice-google-gemini-live
```
## Authentication

The module supports two authentication methods:
### Gemini API (API key)

Use an API key from Google AI Studio:

```bash
# Set environment variable
GOOGLE_API_KEY=your_api_key
```
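With the environment variable set, the key can be omitted from the constructor entirely (a minimal sketch; the model and speaker values are just examples):

```typescript
import { GeminiLiveVoice } from '@mastra/voice-google-gemini-live';

// apiKey is resolved from GOOGLE_API_KEY when not passed explicitly
const voice = new GeminiLiveVoice({
  model: 'gemini-2.0-flash-live-001',
  speaker: 'Puck',
});
```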
### Vertex AI (OAuth)

Use OAuth authentication with Google Cloud Platform. There are multiple ways to authenticate:

**Option 1: Application Default Credentials**

```bash
# Install gcloud CLI and authenticate
gcloud auth application-default login

# Set project ID
GOOGLE_CLOUD_PROJECT=your_project_id
```

**Option 2: Service account key file**

```bash
# Set path to service account JSON
GOOGLE_APPLICATION_CREDENTIALS=path/to/service-account.json
GOOGLE_CLOUD_PROJECT=your_project_id
```
**Option 3: Programmatic configuration**

```typescript
const voice = new GeminiLiveVoice({
  vertexAI: true,
  project: 'your-gcp-project',
  location: 'us-central1',
  serviceAccountKeyFile: '/path/to/service-account.json',
  // OR use service account email for impersonation
  serviceAccountEmail: '[email protected]',
});
```
### Required IAM permissions

When using Vertex AI, ensure your service account or user has the `aiplatform.user` role, or these specific permissions:

- `aiplatform.endpoints.predict`
- `aiplatform.models.predict`

## Usage

```typescript
import { GeminiLiveVoice } from '@mastra/voice-google-gemini-live';
// Initialize with Gemini API
const voice = new GeminiLiveVoice({
  apiKey: 'your-api-key', // Optional, can use GOOGLE_API_KEY env var
  model: 'gemini-2.0-flash-live-001',
  speaker: 'Puck', // Default voice
});

// OR initialize with Vertex AI (recommended for production)
const voice = new GeminiLiveVoice({
  vertexAI: true,
  project: 'your-project-id',
  model: 'gemini-2.0-flash-live-001',
  speaker: 'Puck',
});

// Connect to the Live API
await voice.connect();

// Listen for responses
voice.on('speaking', ({ audioData }) => {
  // Handle audio response as Int16Array
  playAudio(audioData);
});

// Or subscribe to a concatenated audio stream per response
voice.on('speaker', audioStream => {
  audioStream.pipe(playbackDevice);
});

voice.on('writing', ({ text, role }) => {
  // Handle transcribed text
  console.log(`${role}: ${text}`);
});

// Send text to speech
await voice.speak('Hello from Mastra!');

// Send audio stream
const microphoneStream = getMicrophoneStream();
await voice.send(microphoneStream);

// When done, disconnect
await voice.disconnect();
```
## API Reference

### Constructor

```typescript
new GeminiLiveVoice(options?: GeminiLiveVoiceConfig)
```

Creates a new GeminiLiveVoice instance.

**Parameters:**

- `options` (optional): Configuration object
  - `apiKey?: string` - Google API key (falls back to `GOOGLE_API_KEY` env var)
  - `model?: GeminiVoiceModel` - Model to use (default: `'gemini-2.0-flash-exp'`)
  - `speaker?: GeminiVoiceName` - Voice to use (default: `'Puck'`)
  - `vertexAI?: boolean` - Use Vertex AI instead of the Gemini API
  - `project?: string` - Google Cloud project ID (required for Vertex AI)
  - `location?: string` - Google Cloud region (default: `'us-central1'`)
  - `serviceAccountKeyFile?: string` - Path to service account JSON key file
  - `serviceAccountEmail?: string` - Service account email for impersonation
  - `instructions?: string` - System instructions for the model
  - `tools?: GeminiToolConfig[]` - Tools available to the model
  - `sessionConfig?: GeminiSessionConfig` - Session configuration
  - `audioConfig?: Partial<AudioConfig>` - Audio configuration
  - `debug?: boolean` - Enable debug logging

### connect()

```typescript
async connect(): Promise<void>
```
Establishes the connection to the Gemini Live API. Must be called before using other methods.

**Returns:** Promise that resolves when the connection is established

**Throws:** Error if the connection fails or authentication is invalid
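Since `connect()` can reject on authentication or network failures, wrapping it in a `try/catch` is a reasonable pattern (a sketch, not prescribed by the library):

```typescript
try {
  await voice.connect();
} catch (err) {
  // Surfaces invalid credentials, network errors, etc.
  console.error('Failed to connect to Gemini Live:', err);
}
```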
### disconnect()

```typescript
async disconnect(): Promise<void>
```

Disconnects from the Gemini Live API and cleans up resources.

**Returns:** Promise that resolves when disconnection is complete
### getConnectionState()

```typescript
getConnectionState(): 'disconnected' | 'connected'
```

Gets the current connection state.

**Returns:** Current connection state
### isConnected()

```typescript
isConnected(): boolean
```

Checks if currently connected to the API.

**Returns:** `true` if connected, `false` otherwise
Connection lifecycle transitions such as "connecting", "disconnecting", and "updated" are emitted via the `session` event:

```typescript
voice.on('session', data => {
  // data.state is one of: 'connecting' | 'connected' | 'disconnected' | 'disconnecting' | 'updated'
});
```
### speak()

```typescript
async speak(input: string | NodeJS.ReadableStream, options?: GeminiLiveVoiceOptions): Promise<void>
```

Converts text to speech and sends it to the model.

**Parameters:**

- `input: string | NodeJS.ReadableStream` - Text to convert to speech
- `options?: GeminiLiveVoiceOptions` - Optional speech options
  - `speaker?: GeminiVoiceName` - Override the default speaker
  - `languageCode?: string` - Language code for the response
  - `responseModalities?: ('AUDIO' | 'TEXT')[]` - Response modalities

**Returns:** `Promise<void>` (responses are emitted via `speaker` and `writing` events)

**Throws:** Error if not connected or input is empty
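For example, a call that overrides the default speaker and requests both audio and text (the voice name here is illustrative; see `getSpeakers()` for the actual list):

```typescript
await voice.speak('Tell me a short story.', {
  speaker: 'Charon',
  responseModalities: ['AUDIO', 'TEXT'],
});
```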
### send()

```typescript
async send(audioData: NodeJS.ReadableStream | Int16Array): Promise<void>
```

Sends audio data for real-time processing.

**Parameters:**

- `audioData: NodeJS.ReadableStream | Int16Array` - Audio data to send

**Returns:** Promise that resolves when the audio is sent

**Throws:** Error if not connected or the audio format is invalid
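As a sketch of the `Int16Array` path, assuming 16-bit PCM input (the sample rate here is an assumption, not something this README specifies):

```typescript
// One second of silence as 16-bit PCM samples, assuming 16 kHz input audio
const samples = new Int16Array(16000);
await voice.send(samples);
```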
### listen()

```typescript
async listen(audioStream: NodeJS.ReadableStream, options?: GeminiLiveVoiceOptions): Promise<string>
```

Processes an audio stream for speech-to-text transcription.

**Parameters:**

- `audioStream: NodeJS.ReadableStream` - Audio stream to transcribe
- `options?: GeminiLiveVoiceOptions` - Optional transcription options

**Returns:** Promise that resolves to the transcribed text

**Throws:** Error if not connected, the audio format is invalid, or transcription fails
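A sketch of transcribing a prerecorded file, assuming the file (a hypothetical path here) already contains audio in a format the API accepts:

```typescript
import { createReadStream } from 'fs';

const audioStream = createReadStream('./recording.pcm');
const text = await voice.listen(audioStream);
console.log('Transcription:', text);
```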
### getCurrentSpeakerStream()

```typescript
getCurrentSpeakerStream(): NodeJS.ReadableStream | null
```

Gets the current concatenated audio stream for the active response.

**Returns:** ReadableStream of concatenated audio chunks, or `null` if there is no active stream
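For example, the active response could be captured to disk (a sketch; the output format mirrors whatever the stream emits):

```typescript
import { createWriteStream } from 'fs';

const stream = voice.getCurrentSpeakerStream();
if (stream) {
  stream.pipe(createWriteStream('./response-audio.pcm'));
}
```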
### updateSessionConfig()

```typescript
async updateSessionConfig(config: Partial<GeminiLiveVoiceConfig>): Promise<void>
```

Updates the session configuration during an active session.

**Parameters:**

- `config: Partial<GeminiLiveVoiceConfig>` - Configuration to update
  - `speaker?: GeminiVoiceName` - Change voice/speaker
  - `instructions?: string` - Update system instructions
  - `tools?: GeminiToolConfig[]` - Update available tools
  - `sessionConfig?: GeminiSessionConfig` - Update session settings (e.g. `vad`, `interrupts`, `contextCompression`)

**Returns:** Promise that resolves when the configuration is updated

**Throws:** Error if not connected or the update fails
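For example, switching the voice and instructions mid-session (the values are illustrative):

```typescript
await voice.updateSessionConfig({
  speaker: 'Kore',
  instructions: 'Answer concisely and in a friendly tone.',
});
```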
### resumeSession()

```typescript
async resumeSession(handle: string): Promise<void>
```

Resumes a previous session using a session handle.

**Parameters:**

- `handle: string` - Session handle from a previous session

**Returns:** Promise that resolves when the session is resumed

**Note:** Session resumption is not yet fully implemented for the Gemini Live API
### getSessionHandle()

```typescript
getSessionHandle(): string | undefined
```

Gets the current session handle for resumption.

**Returns:** Session handle string, or `undefined` if not available

**Note:** Session handles are not yet fully supported by the Gemini Live API
### getSpeakers()

```typescript
async getSpeakers(): Promise<Array<{ voiceId: string; description?: string }>>
```

Gets the available speakers/voices.

**Returns:** Promise that resolves to an array of available voices with descriptions
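For example, listing the available voices:

```typescript
const speakers = await voice.getSpeakers();
for (const { voiceId, description } of speakers) {
  console.log(voiceId, description ?? '');
}
```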
### getListener()

```typescript
async getListener(): Promise<{ enabled: boolean }>
```

Checks if listening capabilities are enabled.

**Returns:** Promise that resolves to the listening status

**Note:** Inherits the default implementation from the MastraVoice base class
### on()

```typescript
on<E extends VoiceEventType>(event: E, callback: (data: E extends keyof GeminiLiveEventMap ? GeminiLiveEventMap[E] : unknown) => void): void
```

Registers an event listener.

**Parameters:**

- `event: E` - Event name to listen for
- `callback: (data) => void` - Function to call when the event occurs

**Available Events:**

- `'speaking'` - Audio response from the model
- `'speaker'` - Readable stream of concatenated audio for the active response
- `'writing'` - Text response or transcription
- `'error'` - Error events
- `'session'` - Session state changes
- `'toolCall'` - Tool calls from the model
- `'vad'` - Voice activity detection events
- `'interrupt'` - Interrupt events
- `'usage'` - Token usage information
- `'sessionHandle'` - Session resumption handle
- `'turnComplete'` - Turn completion for the current model response

## Tool Calling

Add tools with `addTools()` using either `@mastra/core/tools` or a plain object matching `ToolsInput`.
Using `createTool`:

```typescript
import { createTool } from '@mastra/core/tools';
import { z } from 'zod';

const searchTool = createTool({
  id: 'search',
  description: 'Search the web',
  inputSchema: z.object({ query: z.string() }),
  execute: async inputData => {
    const { query } = inputData;
    // ... perform search
    return { results: [] };
  },
});

voice.addTools({ search: searchTool });
```
Using a plain object (ensure each tool has an `id`):

```typescript
voice.addTools({
  search: {
    id: 'search',
    description: 'Search the web',
    inputSchema: { type: 'object', properties: { query: { type: 'string' } } },
    execute: async (inputData, context) => ({ results: [] }),
  },
});
```
Tool call events from the model are emitted as:

```typescript
voice.on('toolCall', ({ name, args, id }) => {
  // name: string, args: Record<string, any>, id: string
});
```
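Other events are subscribed to the same way; for instance `error` and `usage` (the payload shapes hinted at in the comments are assumptions, not guaranteed by the event map):

```typescript
voice.on('error', err => {
  // Connection, authentication, or protocol errors
  console.error('Gemini Live error:', err);
});

voice.on('usage', usage => {
  // Token usage information for the session
  console.log('usage:', usage);
});
```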
### off()

```typescript
off<E extends VoiceEventType>(event: E, callback: (data: E extends keyof GeminiLiveEventMap ? GeminiLiveEventMap[E] : unknown) => void): void
```

Removes an event listener.

**Parameters:**

- `event: E` - Event name to stop listening to
- `callback: (data) => void` - Specific callback function to remove

## Types

### GeminiLiveVoiceConfig
```typescript
interface GeminiLiveVoiceConfig {
  apiKey?: string;
  model?: GeminiVoiceModel;
  speaker?: GeminiVoiceName;
  vertexAI?: boolean;
  project?: string;
  location?: string;
  serviceAccountKeyFile?: string;
  serviceAccountEmail?: string;
  instructions?: string;
  tools?: GeminiToolConfig[];
  sessionConfig?: GeminiSessionConfig;
  audioConfig?: Partial<AudioConfig>;
  debug?: boolean;
}
```
### GeminiLiveVoiceOptions

```typescript
interface GeminiLiveVoiceOptions {
  speaker?: GeminiVoiceName;
  languageCode?: string;
  responseModalities?: ('AUDIO' | 'TEXT')[];
}
```
### GeminiSessionConfig

```typescript
interface GeminiSessionConfig {
  enableResumption?: boolean;
  maxDuration?: string;
  contextCompression?: boolean;
  vad?: {
    enabled?: boolean;
    sensitivity?: number;
    silenceDurationMs?: number;
  };
  interrupts?: {
    enabled?: boolean;
    allowUserInterruption?: boolean;
  };
}
```
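A configuration sketch combining these options (all values are illustrative):

```typescript
const voice = new GeminiLiveVoice({
  model: 'gemini-2.0-flash-live-001',
  sessionConfig: {
    contextCompression: true,
    vad: {
      enabled: true,
      silenceDurationMs: 800,
    },
    interrupts: {
      enabled: true,
      allowUserInterruption: true,
    },
  },
});
```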
## Supported Models

- `gemini-2.0-flash-exp` - Default model
- `gemini-2.0-flash-live-001` - Latest production model
- `gemini-2.5-flash-preview-native-audio-dialog` - Preview with native audio
- `gemini-live-2.5-flash-preview` - Half-cascade architecture

For detailed API documentation, visit Google's Gemini Live API docs.