Text-to-Speech (TTS) Integration Guide
NeuroLink provides integrated Text-to-Speech (TTS) capabilities, allowing you to generate high-quality audio from text prompts or AI-generated responses. This feature is perfect for voice assistants, accessibility features, narration, podcasts, and more.
Overview
Key Features:
- Multiple providers - Google Cloud TTS, OpenAI TTS, ElevenLabs, Azure TTS, Fish Audio, and Cartesia
- High-quality voices - Neural, Wavenet, Standard, and multilingual voice types
- Multiple languages - 50+ voices across 10+ languages
- Flexible audio formats - MP3, WAV, OGG/Opus
- Voice customization - Adjust speed, pitch, and volume
- Two synthesis modes - Direct text-to-speech OR AI response synthesis
- Production-ready - Works with Google Cloud, OpenAI, ElevenLabs, Azure, Fish Audio, and Cartesia
Quick Start
Installation
TTS support is built into NeuroLink. No additional installation required.
Environment Setup
Set the appropriate environment variables for your chosen TTS provider:
# Google AI Studio (google-ai TTS uses the Google Cloud TTS SDK; ADC / service
# account is required — there is no API-key auth for the TTS handler itself)
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
# (GOOGLE_AI_API_KEY is used by the LLM/STT side of `google-ai`, not TTS.)
# Google Vertex AI (vertex) — service account recommended for production
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
# OpenAI TTS (openai-tts)
export OPENAI_API_KEY="your-openai-api-key"
# ElevenLabs (elevenlabs)
export ELEVENLABS_API_KEY="your-elevenlabs-api-key"
# Azure TTS (azure-tts)
export AZURE_SPEECH_KEY="your-azure-speech-key"
export AZURE_SPEECH_REGION="eastus" # or your Azure region
# Fish Audio TTS (fish-audio) — low-cost, voice cloning, 14 languages
export FISH_AUDIO_API_KEY="your-fish-audio-api-key"
# Cartesia TTS (cartesia) — low-latency Sonic models, voice cloning
export CARTESIA_API_KEY="sk_car_..."
Google API Key Configuration:
If using API key authentication for Google, enable both APIs in Google Cloud Console:
- Navigate to "APIs & Services" > "Credentials"
- Create or select your API key
- Under "API restrictions", enable:
- Generative Language API (for Gemini)
- Cloud Text-to-Speech API (for TTS)
Basic Usage
CLI:
# Generate and play audio automatically
neurolink generate "Hello, world!" \
--provider google-ai \
--tts-voice en-US-Neural2-C
# Save to file
neurolink generate "Welcome to our application" \
--provider google-ai \
--tts-voice en-US-Neural2-C \
--tts-output welcome.mp3
SDK:
import { NeuroLink } from "@juspay/neurolink";
const neurolink = new NeuroLink();
const result = await neurolink.generate({
input: { text: "Hello, world!" },
provider: "google-ai",
tts: {
enabled: true,
voice: "en-US-Neural2-C",
format: "mp3",
play: true, // Auto-play in CLI, manual in SDK
},
});
// Access generated audio
console.log("Audio size:", result.audio?.size, "bytes");
console.log("Audio format:", result.audio?.format);
Supported Providers
TTS is available through the following providers:
| Provider | Authentication | Voices / Models | Notes |
|---|---|---|---|
| google-ai | Service Account (GOOGLE_APPLICATION_CREDENTIALS) | 50+ voices (Neural2, Wavenet, Standard) | Same auth as vertex (TTS uses Google Cloud Text-to-Speech client) |
| vertex | Service Account (GOOGLE_APPLICATION_CREDENTIALS) | 50+ voices (Neural2, Wavenet, Standard) | Recommended for production |
| openai-tts | API Key (OPENAI_API_KEY) | 6 voices: alloy, echo, fable, onyx, nova, shimmer; models: tts-1, tts-1-hd | Good default quality |
| elevenlabs | API Key (ELEVENLABS_API_KEY) | Multilingual voices; model: eleven_multilingual_v2 | High-quality multilingual synthesis |
| azure-tts | API Key (AZURE_SPEECH_KEY + region AZURE_SPEECH_REGION) | Neural voices with SSML support | Enterprise-grade Azure Speech |
| fish-audio | API Key (FISH_AUDIO_API_KEY) | 14 languages, voice cloning (15 s reference); models: s1 (default), speech-1.6, speech-1.5 | Low-cost, ~80% cheaper than ElevenLabs — see provider guide |
| cartesia | API Key (CARTESIA_API_KEY) | Cartesia voice library, English-first; models: sonic-2 (default), sonic | Low-latency Sonic models — see provider guide. Synchronous /tts/bytes; the WebSocket streaming flow is exposed separately via CartesiaStream in the voice server. |
Planned for future releases:
- AWS Polly
Voice Selection
Available Voice Types
Google Cloud TTS offers three voice quality tiers:
| Voice Type | Quality | Cost | Use Case | Example Voice |
|---|---|---|---|---|
| Neural2 | Highest | High | Natural conversations, voice assistants | en-US-Neural2-C |
| Wavenet | High | Medium | Professional narration, podcasts | en-US-Wavenet-D |
| Standard | Good | Low | Cost optimization, bulk generation | en-US-Standard-B |
Voice Discovery
Voice identifiers follow Google Cloud TTS naming conventions: <language>-<variant>-<type>-<name> (e.g., en-US-Neural2-C, en-GB-Wavenet-D).
Refer to the Google Cloud TTS voice list for all available voices.
Supported Languages
English Variants:
en-US- United States Englishen-GB- British Englishen-AU- Australian Englishen-IN- Indian English
Other Languages:
es-ES,es-US- Spanish (Spain, Latin America)fr-FR,fr-CA- French (France, Canada)de-DE- Germanja-JP- Japanesehi-IN- Hindizh-CN,zh-TW- Chinese (Simplified, Traditional)pt-BR,pt-PT- Portuguese (Brazil, Portugal)it-IT- Italianko-KR- Koreanru-RU- Russian
Voice Selection Guidelines
For Natural Conversations:
tts: {
voice: "en-US-Neural2-C", // Female, natural
// OR
voice: "en-US-Neural2-A", // Male, natural
}
For Professional Narration:
tts: {
voice: "en-US-Wavenet-D", // Male, professional
// OR
voice: "en-GB-Wavenet-A", // British, professional
}
For Cost Optimization:
tts: {
voice: "en-US-Standard-B", // Lower cost
}
TTS Synthesis Modes
NeuroLink supports two TTS synthesis modes:
Mode 1: Direct Text-to-Speech (Default)
Converts input text directly to speech without AI generation.
const result = await neurolink.generate({
input: { text: "Welcome to our service!" },
provider: "google-ai",
tts: {
enabled: true,
useAiResponse: false, // Default: synthesize input text
voice: "en-US-Neural2-C",
},
});
// Audio contains: "Welcome to our service!"
// No AI generation occurs
Use cases:
- Pre-written scripts
- System notifications
- Fixed announcements
- Voice confirmations
Mode 2: AI Response Synthesis
Generates AI response first, then converts the response to speech.
const result = await neurolink.generate({
input: { text: "Tell me a joke" },
provider: "google-ai",
tts: {
enabled: true,
useAiResponse: true, // Synthesize AI's response
voice: "en-US-Neural2-C",
},
});
// AI generates joke text
// TTS synthesizes the joke audio
// Both text and audio available in result
Note: when
useAiResponse: true, NeuroLink synthesizes the chat provider's text response. If your chat provider has no TTS counterpart (e.g.anthropic,bedrock), settts.providerexplicitly — otherwise stream Mode 2 will throwNo TTS provider resolved for stream Mode 2after text generation completes. Chat providers that double as TTS handlers (e.g.google-ai,vertex) auto-resolve whentts.provideris omitted.
Use cases:
- Voice assistants
- Interactive AI conversations
- Dynamic content narration
- AI-powered podcasts
Audio Format Options
Supported Formats
| Format | Quality | File Size | Platform Support | Use Case |
|---|---|---|---|---|
| MP3 | Good | Small (~100 KB/min) | All platforms | Default, balanced quality/size |
| WAV | Best | Large (~1 MB/min) | All platforms | Highest quality, editing |
| OGG/Opus | Good | Medium (~150 KB/min) | macOS, Linux | Web streaming |
Format Selection
// Default: MP3 (balanced quality and size)
tts: {
voice: "en-US-Neural2-C",
format: "mp3" // Default
}
// Best quality: WAV
tts: {
voice: "en-US-Neural2-C",
format: "wav"
}
// Web streaming: OGG
tts: {
voice: "en-US-Neural2-C",
format: "ogg"
}
Platform-Specific Considerations
Windows:
- Built-in playback only supports WAV format
- Auto-converts to WAV when
play: trueon Windows - Use MP3 for file output, WAV for immediate playback
macOS/Linux:
- All formats supported
afplay(macOS) andffplay(Linux) handle all formats- Use MP3 for general purpose
Voice Customization
Speaking Rate
Control speech speed (0.25 to 4.0):
// Slower (half speed)
tts: {
voice: "en-US-Neural2-C",
speed: 0.5
}
// Normal speed (default)
tts: {
voice: "en-US-Neural2-C",
speed: 1.0 // Default
}
// Faster (double speed)
tts: {
voice: "en-US-Neural2-C",
speed: 2.0
}
CLI:
neurolink generate "This is faster speech" \
--provider google-ai \
--tts-voice en-US-Neural2-C \
--tts-speed 1.5
Pitch Adjustment
Adjust voice pitch (-20.0 to 20.0 semitones):
// Lower pitch (deeper voice)
tts: {
voice: "en-US-Neural2-C",
pitch: -5.0
}
// Normal pitch (default)
tts: {
voice: "en-US-Neural2-C",
pitch: 0.0 // Default
}
// Higher pitch
tts: {
voice: "en-US-Neural2-C",
pitch: 5.0
}
CLI:
neurolink generate "Higher pitch test" \
--provider google-ai \
--tts-voice en-US-Neural2-C \
--tts-pitch 3.0
Volume Adjustment
Control output volume (-96.0 to 16.0 dB):
tts: {
voice: "en-US-Neural2-C",
volumeGainDb: 0.0 // Default (no change)
}
Complete Configuration Reference
SDK Configuration
import { NeuroLink } from "@juspay/neurolink";
const neurolink = new NeuroLink();
const result = await neurolink.generate({
input: { text: "Your text here" },
provider: "google-ai", // or "vertex"
tts: {
enabled: true, // Enable TTS output
useAiResponse: false, // false = input text, true = AI response
voice: "en-US-Neural2-C", // Voice identifier
format: "mp3", // Audio format: "mp3" | "wav" | "ogg"
speed: 1.0, // Speaking rate: 0.25-4.0
pitch: 0.0, // Pitch adjustment: -20.0 to 20.0
volumeGainDb: 0.0, // Volume: -96.0 to 16.0
quality: "standard", // Quality: "standard" | "hd"
output: "./audio.mp3", // Optional file path
play: false, // Auto-play (CLI only)
},
});
// Access results
console.log("Text:", result.content);
console.log("Audio size:", result.audio?.size, "bytes");
console.log("Audio format:", result.audio?.format);
console.log("Voice used:", result.audio?.voice);
// Save audio to file
if (result.audio?.buffer) {
import { writeFileSync } from "fs";
writeFileSync("output.mp3", result.audio.buffer);
}
CLI Flags
neurolink generate "Your text" \
--provider google-ai \
--tts \
--tts-provider <provider> \
--tts-voice <voice-id> \
--tts-format <format> \
--tts-speed <rate> \
--tts-pitch <pitch> \
--tts-output <file> \
--tts-use-ai-response
# --tts : enable TTS (boolean flag, required)
# --tts-provider : google-ai|vertex|openai-tts|elevenlabs|azure-tts|fish-audio|cartesia
# --tts-voice : voice id (optional — provider default applies)
# --tts-format : mp3|wav|ogg (default: mp3)
# --tts-speed : 0.25-4.0 (default: 1.0)
# --tts-pitch : -20.0 to 20.0 (default: 0.0)
# --tts-output : save to file
# --tts-use-ai-response : synthesize AI response instead of input text
Selecting a specific TTS provider:
# Use OpenAI TTS
neurolink generate "Hello" --tts --tts-provider openai-tts
# Use ElevenLabs
neurolink generate "Hello" --tts --tts-provider elevenlabs
# Use Azure TTS
neurolink generate "Hello" --tts --tts-provider azure-tts
# Use Fish Audio
neurolink generate "Hello" --tts --tts-provider fish-audio
# Use Cartesia
neurolink generate "Hello" --tts --tts-provider cartesia
Use Cases & Examples
1. Voice Assistant
Create a voice assistant that speaks responses:
const assistant = new NeuroLink();
const response = await assistant.generate({
input: { text: "What's the weather like today?" },
provider: "google-ai",
tts: {
enabled: true,
useAiResponse: true, // Speak AI's weather response
voice: "en-US-Neural2-C",
play: true,
},
});
// AI generates weather info and speaks it
2. Accessibility Features
Screen reader-style narration for visually impaired users:
const narration = await neurolink.generate({
input: { text: "Button clicked. Navigation menu opened." },
provider: "google-ai",
tts: {
enabled: true,
voice: "en-US-Neural2-C",
speed: 1.2, // Slightly faster for efficiency
play: true,
},
});
3. Podcast Generation
Generate professional podcast intros:
neurolink generate "Welcome to Tech Insights Podcast, episode 42. Today we're discussing the future of AI development." \
--provider google-ai \
--tts-voice en-US-Wavenet-D \
--tts-speed 0.95 \
--tts-format mp3 \
--tts-output podcast-intro.mp3
4. Language Learning
Slow pronunciation for language learners:
# Slow French pronunciation
neurolink generate "Je m'appelle Claude. Comment allez-vous?" \
--provider google-ai \
--tts-voice fr-FR-Neural2-A \
--tts-speed 0.7 \
--tts-output french-slow.mp3
# Normal speed for comparison
neurolink generate "Je m'appelle Claude. Comment allez-vous?" \
--provider google-ai \
--tts-voice fr-FR-Neural2-A \
--tts-speed 1.0 \
--tts-output french-normal.mp3
5. Multilingual Support
Generate audio in multiple languages:
const translations = {
english: {
text: "Hello, welcome to our application.",
voice: "en-US-Neural2-C",
},
french: {
text: "Bonjour, bienvenue dans notre application.",
voice: "fr-FR-Wavenet-A",
},
spanish: {
text: "Hola, bienvenido a nuestra aplicación.",
voice: "es-ES-Neural2-A",
},
hindi: {
text: "नमस्ते, हमारे एप्लिकेशन में आपका स्वागत है।",
voice: "hi-IN-Wavenet-A",
},
};
for (const [lang, config] of Object.entries(translations)) {
const result = await neurolink.generate({
input: { text: config.text },
provider: "google-ai",
tts: {
enabled: true,
voice: config.voice,
format: "mp3",
output: `welcome-${lang}.mp3`,
},
});
console.log(`Generated ${lang} audio (${result.audio?.size} bytes)`);
}
6. Batch Audio Generation
Generate multiple audio files efficiently:
async function generateBatchAudio(
texts: string[],
voice: string = "en-US-Neural2-C",
) {
const results = [];
for (const text of texts) {
const result = await neurolink.generate({
input: { text },
provider: "google-ai",
tts: {
enabled: true,
voice,
format: "mp3",
},
});
results.push({
text,
audioBuffer: result.audio?.buffer,
audioSize: result.audio?.size,
});
}
return results;
}
// Usage
const audioFiles = await generateBatchAudio([
"Welcome to our application.",
"Please enter your username and password.",
"Login successful. Redirecting to dashboard.",
]);
// Save all files
audioFiles.forEach((item, index) => {
if (item.audioBuffer) {
writeFileSync(`audio-${index}.mp3`, item.audioBuffer);
}
});
7. Streaming Text + Audio
Stream AI-generated text and convert to audio:
async function streamAndSpeak(prompt: string, voice: string) {
// Step 1: Stream AI response
const streamResult = await neurolink.stream({
input: { text: prompt },
provider: "google-ai",
model: "gemini-2.0-flash-exp",
});
let fullText = "";
for await (const chunk of streamResult.stream) {
fullText += chunk.content;
process.stdout.write(chunk.content);
}
console.log("\n\nConverting to audio...");
// Step 2: Convert complete text to audio
const ttsResult = await neurolink.generate({
input: { text: fullText },
provider: "google-ai",
tts: {
enabled: true,
voice,
play: true,
},
});
return {
text: fullText,
audio: ttsResult.audio,
};
}
// Usage
const result = await streamAndSpeak(
"Explain quantum computing in simple terms",
"en-US-Neural2-C",
);
Error Handling
Common Error Patterns
async function generateTTSWithRetry(
text: string,
voice: string,
maxRetries: number = 3,
) {
let lastError: Error | undefined;
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
const result = await neurolink.generate({
input: { text },
provider: "google-ai",
tts: {
enabled: true,
voice,
format: "mp3",
},
});
// Validate audio buffer
if (!result.audio || result.audio.size === 0) {
throw new Error("Empty audio buffer received");
}
return {
success: true,
audio: result.audio,
attempt,
};
} catch (error) {
lastError = error as Error;
console.error(`Attempt ${attempt} failed:`, error.message);
// Don't retry on invalid input
if (
error.message.includes("invalid voice") ||
error.message.includes("text too long")
) {
break;
}
// Exponential backoff
if (attempt < maxRetries) {
await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
}
}
}
return {
success: false,
error: lastError?.message || "Unknown error occurred",
attempts: maxRetries,
};
}
// Usage
const result = await generateTTSWithRetry(
"Generate this with retry logic",
"en-US-Neural2-C",
);
if (result.success && result.audio) {
console.log("Success!");
writeFileSync("output.mp3", result.audio.buffer);
} else {
console.error("Failed:", result.error);
}
Troubleshooting
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| "TTS client not initialized" | Missing credentials | Set GOOGLE_APPLICATION_CREDENTIALS or GOOGLE_AI_API_KEY |
| "Invalid voice name" | Voice ID not found | Check the Google Cloud TTS voice list |
| "Text too long" | Input exceeds 5000 bytes | Split text into smaller chunks |
| "Synthesis failed" | Network/API error | Check network connection and credentials |
| Audio doesn't play | Missing audio player | Install afplay (macOS), ffplay (Linux), or use WAV on Windows |
| Empty audio buffer | API returned no content | Check API quota and retry |
Authentication Issues
Service Account:
# Verify credentials file exists
ls -la $GOOGLE_APPLICATION_CREDENTIALS
# Test authentication
gcloud auth application-default login
API Key:
# Verify API key is set
echo $GOOGLE_AI_API_KEY
Audio Playback Issues
macOS:
afplayis pre-installed, supports all formats- If playback fails, check system volume settings
Linux:
- Install
ffmpegfor full format support:sudo apt install ffmpeg - Alternative: Use
aplayfor WAV files only
Windows:
- Built-in playback only supports WAV
- Install VLC or Windows Media Player for other formats
- SDK auto-converts to WAV when
play: trueon Windows
Best Practices
Performance Optimization
- Cache voices - Voice list is cached for 5 minutes
- Batch processing - Group multiple TTS requests when possible
- Use appropriate quality - Standard voices are faster and cheaper
- Optimize text length - Keep under 5000 bytes per request
Production Deployment
- Use service accounts - More secure than API keys
- Implement retry logic - Handle transient network failures
- Monitor quota usage - Track Google Cloud TTS API usage
- Set appropriate timeouts - Default is 30 seconds
- Handle errors gracefully - Provide fallback behavior
Voice Selection
- Test before deploying - Different voices suit different use cases
- Match gender to persona - Choose appropriate gender for your application
- Consider language variants -
en-USvsen-GBvsen-IN - Use Neural2 for quality - Best natural-sounding voices
Cost Management
- Use Standard voices - For high-volume, non-critical use cases
- Cache generated audio - Avoid regenerating the same content
- Monitor API usage - Set budget alerts in Google Cloud Console
Pricing
Google Cloud TTS pricing (as of 2026):
| Voice Type | Price per 1M characters |
|---|---|
| Neural2 | $16.00 |
| Wavenet | $16.00 |
| Standard | $4.00 |
Monthly free tier: 1 million characters (Standard voices) or 1 million characters (Wavenet/Neural2 voices)
For detailed pricing, see Google Cloud TTS Pricing.
Related Features
Multimodal Capabilities:
- Multimodal Guide - Images, PDFs, CSV inputs
- PDF Support - Document processing
- Video Generation - AI-powered video creation
- PPT Generation - AI-powered PowerPoint presentations
Advanced Features:
- Streaming - Stream AI responses in real-time
- Provider Orchestration - Multi-provider failover
Documentation:
- CLI Commands - Complete CLI reference
- SDK API Reference - Full API documentation
- Troubleshooting - Extended error catalog
Summary
NeuroLink's TTS integration provides:
- Multiple TTS providers - Google Cloud TTS, OpenAI TTS, ElevenLabs, Azure TTS, Fish Audio, Cartesia
- High-quality voices - Neural2, Wavenet, Standard, and multilingual options
- Multiple languages - 50+ voices across 10+ languages
- Flexible synthesis modes - Direct text or AI response
- Voice customization - Speed, pitch, volume control
- Easy integration - Works seamlessly with CLI and SDK via
--tts-providerflag
Next Steps:
- Set up Google Cloud credentials
- Discover available voices
- Try the quick start examples
- Explore use cases for your application
- Check troubleshooting if needed