14 · Voice / Speech Integration — Implementation Journal
Commit: 27a31c32 — feat(voice): add multi-provider TTS, STT, and realtime voice integration
Architecture
How voice plugs into Factory + Registry
The voice integration does not add AI providers (it adds no entries to `AIProviderName`). Instead it introduces three parallel static registries that mirror the `ProviderFactory` / `ProviderRegistry` pattern for non-LLM capabilities:

- `ProviderFactory` → creates LLM provider instances
- `ProviderRegistry` → holds LLM factory functions (dynamic imports)
- `TTSProcessor` → static `Map<string, TTSHandler>` (text-to-speech)
- `STTProcessor` → static `Map<string, STTHandler>` (speech-to-text)
- `RealtimeProcessor` → static `Map<string, RealtimeHandler>` (bidirectional voice)

Each processor exposes `registerHandler(name, handler)` and the appropriate operation (`synthesize`, `transcribe`, `connect`). The same O(1) Map lookup and lazy-instantiation pattern used by `ProviderRegistry` applies here.
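As a sketch, the shared registry shape looks like this (the class body is illustrative; the real `TTSProcessor` lives in `src/lib/utils/ttsProcessor.ts` and carries span instrumentation this sketch omits):

```typescript
// Minimal sketch of the static-Map registry pattern shared by
// TTSProcessor, STTProcessor, and RealtimeProcessor.
interface TTSHandler {
  synthesize(text: string, options?: Record<string, unknown>): Promise<Buffer>;
}

class TTSProcessorSketch {
  private static handlers = new Map<string, TTSHandler>();

  static registerHandler(name: string, handler: TTSHandler): void {
    this.handlers.set(name, handler);
  }

  static getHandler(name: string): TTSHandler | undefined {
    return this.handlers.get(name); // O(1) lookup, same as ProviderRegistry
  }
}
```

STT and Realtime follow the same shape with `transcribe` and `connect` in place of `synthesize`.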
Registration location
All handler registration happens at the bottom of ProviderRegistry.registerAllProviders() in src/lib/factories/providerRegistry.ts, after all LLM providers are registered. The order is:
- LLM providers (existing)
- TTS handler registration block
- STT handler registration block
- Realtime handler registration block
Each block uses a separate try/catch so a missing API key or a broken import cannot prevent the LLM providers from registering. Registration is fire-and-forget: failures log a warn and continue.
All imports inside the registration blocks are dynamic (await import(...)), matching CLAUDE.md rule #1 and preventing circular dependencies.
```typescript
// Pattern used for every voice handler:
try {
  const { TTSProcessor } = await import("../utils/ttsProcessor.js");
  const { OpenAITTS } = await import("../voice/providers/OpenAITTS.js");
  TTSProcessor.registerHandler("openai-tts", new OpenAITTS());
} catch {
  /* Optional provider — skip if unavailable */
}
```
STT preprocessing in neurolink.ts runStandardGenerateRequest()
When a caller passes `{ stt: { enabled: true, audio: buffer } }` to `generate()`, the following happens inside `runStandardGenerateRequest()` before the LLM call:

- `ProviderRegistry.isRegistered()` is checked; if false, `registerAllProviders()` is awaited.
- `STTProcessor` is dynamically imported and `transcribe(audio, providerName, sttOptions)` is called.
- The transcription text is injected into the LLM prompt:
  - If no user text exists, the transcription becomes the prompt directly.
  - If user text exists, the transcription is prepended as `[Transcribed audio]: <text>\n\n<user text>`.
- `generateResult.transcription` is set to the `STTResult` object (available to callers).
- Failure handling is split by whether the caller provided text:
  - Audio-only requests (`stt.audio` present, no user text) — transcription failures fail fast: `STTError` propagates and `generate()` rejects, since there is no fallback prompt.
  - Text + audio requests — transcription failures are logged via `logger.error` and `generate()` continues with the un-augmented user text (preserves the optional-augmentation contract).
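The prompt-injection rule above can be sketched as a pure function (the helper name `buildAugmentedPrompt` is illustrative, not the real internal name):

```typescript
// Sketch of how the transcription is merged into the LLM prompt.
function buildAugmentedPrompt(transcription: string, userText?: string): string {
  if (!userText) {
    // Audio-only request: the transcription becomes the prompt directly.
    return transcription;
  }
  // Text + audio: the transcription is prepended to the user text.
  return `[Transcribed audio]: ${transcription}\n\n${userText}`;
}
```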
Type organisation
Four canonical type files in `src/lib/types/` are touched (CLAUDE.md rule #8 compliant — no "Types" suffix); `stt.ts` and `realtime.ts` are new:

| File | Contents |
|---|---|
| `src/lib/types/tts.ts` | Extended `AudioFormat` (added `m4a`, `flac`, `webm`, `mp4`, `mpeg`, `mpga`); added `TTSOptions.provider` field |
| `src/lib/types/stt.ts` | `STTHandler`, `STTOptions`, `STTResult`, `STTLanguage`, `WordTiming`, `TranscriptionSegment`, `STT_ERROR_CODES`, `DEFAULT_STT_OPTIONS`, guards |
| `src/lib/types/realtime.ts` | `RealtimeHandler`, `RealtimeConfig`, `RealtimeSession`, `RealtimeAudioChunk`, `RealtimeSessionState`, `REALTIME_ERROR_CODES`, `DEFAULT_REALTIME_CONFIG` |
| `src/lib/types/voice.ts` | Aggregator: re-exports all of `tts.ts`, `stt.ts`, `realtime.ts`; adds `VoiceCapability`, `VoiceProviderName`, `VoiceProviderConfig`, `VoiceErrorOptions` |

`src/lib/types/index.ts` gains two new `export *` lines (for `stt.ts` and `realtime.ts`; `voice.ts` is already present). CLAUDE.md rules #9 and #10 apply: type names are globally unique, and the barrel uses `export *` only.
TTS Providers Added
openai-tts
- File: `src/lib/voice/providers/OpenAITTS.ts` (253 lines, NEW)
- Class: `OpenAITTS implements TTSHandler`
- API: `POST https://api.openai.com/v1/audio/speech`
- Auth: `Authorization: Bearer $OPENAI_API_KEY`
- Models: `tts-1` (standard, default) and `tts-1-hd` (high quality; selected when `options.quality === "hd"`)
- Voices (6): `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
- Output formats: `mp3` (default), `wav`, `opus`/`ogg` (mapped to OpenAI's `opus`)
- Max text: 4,096 characters
- Registered as: `"openai-tts"` in `TTSProcessor`
- Timeout: 30-second `AbortController` on every `fetch` call; throws `TTSError` with `TTS_ERROR_CODES.SYNTHESIS_FAILED` on abort
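A sketch of the request body this handler assembles. The field names (`model`, `voice`, `input`, `response_format`) follow OpenAI's public `/v1/audio/speech` API; the helper itself and its option shape are illustrative:

```typescript
// Sketch: map the doc's option set onto OpenAI's speech-request payload.
function buildOpenAITTSRequest(
  text: string,
  opts: { voice?: string; quality?: string; format?: string } = {},
) {
  if (text.length > 4096) {
    // The handler enforces OpenAI's 4,096-character input limit up front.
    throw new Error("OpenAI TTS: text exceeds 4096-character limit");
  }
  return {
    model: opts.quality === "hd" ? "tts-1-hd" : "tts-1", // hd selects tts-1-hd
    voice: opts.voice ?? "alloy", // default voice
    input: text,
    response_format: opts.format ?? "mp3", // default output format
  };
}
```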
elevenlabs
- File: `src/lib/voice/providers/ElevenLabsTTS.ts` (326 lines, NEW)
- Class: `ElevenLabsTTS implements TTSHandler`
- API: `POST https://api.elevenlabs.io/v1/text-to-speech/{voice_id}?output_format=...`
- Auth: `xi-api-key: $ELEVENLABS_API_KEY`
- Model: `eleven_multilingual_v2` (default)
- Voices: Dynamic — fetched from `/v1/voices` and cached for 5 minutes. Default voice: `21m00Tcm4TlvDq8ikWAM` (Rachel).
- Output formats: `mp3_44100_128` (mp3), `pcm_44100` (wav), `ogg_22050` (ogg/opus)
- Voice settings: `stability` (default 0.5), `similarity_boost` (0.75), `style` (0.0), `use_speaker_boost` (true)
- Max text: 5,000 characters
- Registered as: `"elevenlabs"` and `"elevenlabs-tts"` in `TTSProcessor`
- Timeout: 30-second `AbortController` on `synthesize` and `getVoices` calls
azure-tts
- File: `src/lib/voice/providers/AzureTTS.ts` (357 lines, NEW)
- Class: `AzureTTS implements TTSHandler`
- API: `POST https://{region}.tts.speech.microsoft.com/cognitiveservices/v1`
- Auth: `Ocp-Apim-Subscription-Key: $AZURE_SPEECH_KEY`
- Region: `$AZURE_SPEECH_REGION` (default `"eastus"`)
- Default voice: `en-US-JennyNeural`
- Output format (default): `audio-24khz-96kbitrate-mono-mp3`
- SSML: The handler builds SSML automatically from `text`, `voice`, `speed`, and `pitch` options. Callers can pass raw SSML by setting `text` to a string starting with `<speak`, or by providing `options.ssmlTemplate`.
- Voices: Fetched from `/cognitiveservices/voices/list` and cached for 30 minutes.
- Max text: 10,000 characters
- Registered as: `"azure-tts"` in `TTSProcessor`
- Timeout: 30-second `AbortController` on all fetch calls
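The SSML auto-build can be sketched as follows. The `<speak>`/`<voice>`/`<prosody>` envelope matches Azure's SSML format, but the exact mapping of `speed`/`pitch` onto prosody attributes is an assumption here, as is the helper name:

```typescript
// Sketch of SSML assembly from plain options (assumed prosody mapping).
function buildSsml(text: string, voice = "en-US-JennyNeural", speed = 1.0, pitch = 0): string {
  if (text.trimStart().startsWith("<speak")) {
    return text; // caller supplied raw SSML: pass through untouched
  }
  // Assumed mapping: speed multiplier -> relative rate delta in percent.
  const rate = `${speed >= 1 ? "+" : ""}${Math.round((speed - 1) * 100)}%`;
  const pitchStr = `${pitch >= 0 ? "+" : ""}${pitch}%`;
  return (
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">` +
    `<voice name="${voice}"><prosody rate="${rate}" pitch="${pitchStr}">${text}</prosody></voice>` +
    `</speak>`
  );
}
```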
STT Providers Added
whisper / openai-stt
- File: `src/lib/voice/providers/OpenAISTT.ts` (317 lines, NEW)
- Class: `OpenAISTT implements STTHandler` (also exported as `WhisperSTT`, `WhisperSTTHandler`, `OpenAISTTHandler`)
- API: `POST https://api.openai.com/v1/audio/transcriptions` (or `/translations` when `translate: true`)
- Auth: `Authorization: Bearer $OPENAI_API_KEY`
- Model: `whisper-1` (default)
- Response format: `verbose_json` (default) — returns `text`, `language`, `duration`, `words`, `segments`
- Word timestamps: Enabled when `options.wordTimestamps === true` (sends `timestamp_granularities[]=word&segment`)
- Confidence: Fixed at `0.95` (Whisper does not return per-result confidence); segment confidence derived from `Math.exp(segment.avg_logprob)`
- Max audio: 25 minutes
- Supported formats: `mp3`, `wav`, `ogg`, `opus`
- Streaming: Not supported (`supportsStreaming = false`)
- Registered as: `"whisper"` and `"openai-stt"` in `STTProcessor`
- Timeout: 30-second `AbortController` on the `fetch` multipart form POST
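The segment-confidence derivation noted above can be shown concretely. Whisper's `verbose_json` segments carry `avg_logprob` (average token log-probability); exponentiating it yields a 0-to-1 score. The interface here is trimmed to the one field involved:

```typescript
// Sketch: map Whisper's avg_logprob onto a segment confidence score.
interface WhisperSegment {
  text: string;
  avg_logprob: number; // average per-token log-probability, <= 0
}

function segmentConfidence(segment: WhisperSegment): number {
  // exp(avg log-prob) is roughly the geometric-mean token probability.
  return Math.exp(segment.avg_logprob);
}
```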
google-stt
- File: `src/lib/voice/providers/GoogleSTT.ts` (481 lines, NEW)
- Class: `GoogleSTT implements STTHandler`
- API: `POST https://speech.googleapis.com/v1/speech:recognize`
- Auth: `$GOOGLE_API_KEY` (query param) or `$GOOGLE_APPLICATION_CREDENTIALS` (service account path)
- Streaming: Supported (`supportsStreaming = true`)
- Max audio: 480 minutes (8 hours, async path)
- Diarization: Supported
- Registered as: `"google-stt"` in `STTProcessor`
- Timeout: 30-second `AbortController`
deepgram
- File: `src/lib/voice/providers/DeepgramSTT.ts` (547 lines, NEW)
- Class: `DeepgramSTT implements STTHandler`
- API: `POST https://api.deepgram.com/v1/listen`
- Auth: `Authorization: Token $DEEPGRAM_API_KEY`
- Models: Nova-2 (default), Nova-3
- Streaming: Supported via WebSocket (`supportsStreaming = true`)
- Speaker diarization: Supported
- Max audio: 2 hours (`maxAudioDuration = 7200`)
- Supported formats: `mp3`, `wav`, `ogg`, `opus`
- Registered as: `"deepgram"` in `STTProcessor`
- Timeout: 30-second `AbortController` on REST calls
azure-stt
- File: `src/lib/voice/providers/AzureSTT.ts` (374 lines, NEW)
- Class: `AzureSTT implements STTHandler`
- API: Azure Cognitive Services Speech SDK REST endpoint
- Auth: `$AZURE_SPEECH_KEY` + `$AZURE_SPEECH_REGION`
- Streaming: Supported
- Registered as: `"azure-stt"` in `STTProcessor`
- Timeout: 30-second `AbortController`
Realtime Providers Added (registered, not yet SDK-exposed)
Both realtime providers are registered in ProviderRegistry.registerAllProviders() but are not yet accessible via public NeuroLink SDK methods. They exist as handler registrations ready for future surfacing.
openai-realtime
- File: `src/lib/voice/providers/OpenAIRealtime.ts` (475 lines, NEW)
- Class: `OpenAIRealtime extends BaseRealtimeHandler`
- Transport: WebSocket (`wss://api.openai.com/v1/realtime`)
- Auth: `Authorization: Bearer $OPENAI_API_KEY` + `OpenAI-Beta: realtime=v1` headers
- Supported formats: `wav`, `opus`
- Registered as: `"openai-realtime"` in `RealtimeProcessor`
gemini-live
- File: `src/lib/voice/providers/GeminiLive.ts` (413 lines, NEW)
- Class: `GeminiLive extends BaseRealtimeHandler`
- Transport: WebSocket (Gemini Live API)
- Auth: `$GOOGLE_API_KEY`
- Supported formats: `opus`, `wav`
- Registered as: `"gemini-live"` in `RealtimeProcessor`
Both extend `BaseRealtimeHandler` (in `src/lib/voice/RealtimeVoiceAPI.ts`), which manages connection state, session lifecycle, and event emission via `EventEmitter`.
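A minimal sketch of that lifecycle management, assuming simple state names informed by `RealtimeSessionState` (the class body and state set here are illustrative, not the real `BaseRealtimeHandler`):

```typescript
import { EventEmitter } from "node:events";

type SessionState = "disconnected" | "connecting" | "connected" | "closing";

// Sketch: EventEmitter-based state tracking, as described for BaseRealtimeHandler.
class RealtimeHandlerSketch extends EventEmitter {
  private state: SessionState = "disconnected";

  getState(): SessionState {
    return this.state;
  }

  protected setState(next: SessionState): void {
    this.state = next;
    this.emit("stateChange", next); // one event per state transition
  }

  connect(): void {
    this.setState("connecting");
    // ...open the WebSocket here, then setState("connected") on open...
  }
}
```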
Key Design Decisions
Everything through generate() / stream()
No new top-level `NeuroLink` methods were added (`synthesize`, `transcribe`, and `startRealtimeVoice` are intentionally absent). All voice capability is driven through the existing option objects:

```typescript
// TTS — same as before
await neurolink.generate({
  prompt: "...",
  tts: { enabled: true, voice: "alloy", provider: "openai-tts" },
});

// STT — new
await neurolink.generate({
  stt: { enabled: true, audio: audioBuffer, provider: "whisper" },
  // prompt is optional; transcription becomes the prompt if omitted
});

// Round-trip — STT input, LLM, TTS output
await neurolink.generate({
  stt: { enabled: true, audio: audioBuffer, provider: "whisper" },
  tts: { enabled: true, provider: "elevenlabs", voice: "21m00Tcm4TlvDq8ikWAM" },
});
```
This preserves backward compatibility (CLAUDE.md rule #5) — existing callers are unaffected.
STT preprocessing logic
The preprocessing runs in `runStandardGenerateRequest()` after options validation and before `generateTextInternal()`. Key properties:

- `options.stt.provider` defaults to `options.provider` (the LLM provider name), then falls back to `"whisper"`.
- Failure handling depends on whether user text is present:
  - With user text — failure is non-fatal: logged via `logger.error`, and `generate()` continues with the un-augmented prompt.
  - Audio-only (no user text) — failure is fatal: `STTError` is rethrown and `generate()` rejects, since the request has no prompt fallback.
- `result.transcription` (type `STTResult`) is attached to the `GenerateResult` when transcription succeeds.
Fetch timeouts
Every provider API call wraps its fetch in a 30-second `AbortController`:

```typescript
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 30000);
try {
  response = await fetch(url, { /* ...request options... */ signal: controller.signal });
} finally {
  clearTimeout(timeoutId);
}
```

`AbortError` is caught and re-thrown as a typed `TTSError` / `STTError` with a human-readable message. This pattern is consistent across all 7 new providers.
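Since the pattern repeats across all seven providers, it could also be factored into one shared helper. This is a sketch of that refactor, not what shipped (the providers inline the pattern):

```typescript
// Sketch: reusable timeout wrapper around global fetch (Node 18+).
async function fetchWithTimeout(
  url: string,
  init: RequestInit = {},
  timeoutMs = 30_000,
): Promise<Response> {
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // The abort signal cancels the request if the timeout fires first.
    return await fetch(url, { ...init, signal: controller.signal });
  } finally {
    clearTimeout(timeoutId); // always clear, even when fetch throws
  }
}
```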
Audio utilities (src/lib/voice/audio-utils.ts)
552-line utility module with no external dependencies beyond Node.js built-ins:
| Export | Purpose |
|---|---|
| `detectAudioFormat(buffer)` | Identifies `wav`, `mp3`, `ogg`, `opus` from magic bytes |
| `createWavHeader(...)` / `createWavFile(...)` | Builds a 44-byte RIFF/WAV header / header + PCM data |
| `extractPcmSamples(buffer)` | Reads 16-bit LE PCM samples from a WAV |
| `normalizeAudio(samples)` | Scales to peak 0.9 |
| `resamplePcm(samples, fromRate, toRate)` | Linear-interpolation resampling |
| `calculateDuration(buffer, format, sampleRate, channels, bitsPerSample)` | Duration in seconds (parses WAV header / estimates MP3) |
| `convertAudioFormat(buffer, from, to)` | Throws when `from` ≠ `to` — cross-format conversion is not implemented (use ffmpeg) |
| `getMimeType(format)` / `getFileExtension(format)` | Format → MIME / extension |
| `AUDIO_SIGNATURES` | Magic-byte constants per format |
| `MIME_TYPES` | Format → MIME map constant |
Note: earlier drafts of this doc referenced `createPcmBuffer(durationMs, …)` and `splitIntoChunks(buffer, chunkSize)`. Those helpers were dropped before the PR shipped in favour of caller-side composition. `convertAudioFormat` is not best-effort — it throws when source ≠ target format.
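For illustration, the kind of magic-byte check `detectAudioFormat` performs can be sketched like this. The signature values are the standard container/frame markers; the real module may check more formats and offsets:

```typescript
// Sketch of magic-byte format detection (illustrative reimplementation).
function detectAudioFormatSketch(buffer: Buffer): "wav" | "mp3" | "ogg" | "unknown" {
  if (buffer.length >= 4 && buffer.toString("ascii", 0, 4) === "RIFF") {
    return "wav"; // RIFF container (WAV)
  }
  if (buffer.length >= 3 && buffer.toString("ascii", 0, 3) === "ID3") {
    return "mp3"; // MP3 with a leading ID3v2 tag
  }
  if (buffer.length >= 2 && buffer[0] === 0xff && (buffer[1] & 0xe0) === 0xe0) {
    return "mp3"; // bare MPEG audio frame sync (e.g. 0xFF 0xFB)
  }
  if (buffer.length >= 4 && buffer.toString("ascii", 0, 4) === "OggS") {
    // Ogg container; telling Opus from Vorbis needs first-page parsing,
    // so this sketch stops at "ogg".
    return "ogg";
  }
  return "unknown";
}
```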
Stream infrastructure (src/lib/voice/stream-handler.ts)
546-line module providing:
| Export | Purpose |
|---|---|
| `ChunkedAudioStream extends EventEmitter` | Slices incoming audio into fixed-duration chunks (default 100 ms) with backpressure management |
| `StreamHandler extends EventEmitter` | Generic event-driven handler with start/stop and error propagation |
| `StreamSplitter` | Fan-out: one input → multiple output streams |
| `StreamMerger` | Fan-in: multiple input streams → one output |
| `asyncIterableToStream(iterable)` | Converts `AsyncIterable<Buffer>` → Node `Readable` |
| `streamToAsyncIterable(stream)` | Converts Node `Readable` → `AsyncIterable<Buffer>` |

`ChunkedAudioStream` defaults: `chunkDurationMs=100`, `sampleRate=16000`, `bytesPerSample=2`, `highWaterMark=64KB`.
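With those defaults, one chunk is 16,000 samples/s × 2 bytes/sample × 0.1 s = 3,200 bytes of PCM. The arithmetic as a sketch (the helper name is illustrative):

```typescript
// Sketch: bytes per fixed-duration PCM chunk, given stream parameters.
function chunkSizeBytes(sampleRate = 16_000, bytesPerSample = 2, chunkDurationMs = 100): number {
  return Math.floor((sampleRate * bytesPerSample * chunkDurationMs) / 1000);
}
```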
Error Handling
Three new error classes in src/lib/voice/errors.ts (all extend NeuroLinkError):
| Class | Default category | Default severity |
|---|---|---|
| `VoiceError` | EXECUTION | MEDIUM |
| `STTError` | VALIDATION | MEDIUM |
| `RealtimeError` | EXECUTION | HIGH |
TTSError lives in src/lib/utils/ttsProcessor.ts (pre-existing; not in errors.ts).
STTError includes static factory methods: audioEmpty, audioTooLong, invalidFormat, languageNotSupported, transcriptionFailed, providerNotConfigured, providerNotSupported, streamError.
RealtimeError includes: connectionFailed, sessionTimeout, protocolError, audioStreamError, providerNotConfigured, sessionAlreadyActive, sessionNotActive, invalidMessage.
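The static-factory shape described above can be sketched as follows. The base-class fields are assumptions (the real classes extend `NeuroLinkError` with category/severity); only `audioEmpty` is shown:

```typescript
// Sketch of the static-factory pattern on the voice error classes.
class STTErrorSketch extends Error {
  constructor(
    message: string,
    public readonly code: string, // e.g. a value from STT_ERROR_CODES
  ) {
    super(message);
    this.name = "STTError";
  }

  // Factories give call sites a one-liner per failure mode.
  static audioEmpty(): STTErrorSketch {
    return new STTErrorSketch("Audio buffer is empty", "AUDIO_EMPTY");
  }
}
```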
CLI Changes
New flags added to src/cli/commands/voice.ts and propagated via src/cli/factories/commandFactory.ts:
| Flag | Purpose |
|---|---|
| `--stt` | Enable STT preprocessing |
| `--stt-provider` | Which STT provider to use (default: `whisper`) |
| `--input-audio` | Path to audio file for STT |
| `--stt-language` | BCP-47 language code for transcription |
| `--tts-provider` | Override TTS provider (e.g., `openai-tts`, `elevenlabs`, `azure-tts`) |

The `--tts` and `--tts-voice` flags are pre-existing.
Testing
Test suite: test/continuous-test-suite-voice.ts (1 822 lines, NEW)
The suite is invoked as:
```bash
npx tsx test/continuous-test-suite-voice.ts --provider=vertex
```
It covers 15 test items via the consumer API only — no direct provider class calls:
| # | Test | Notes |
|---|---|---|
| 1 | generate() + TTS MP3 | Validates MP3 magic bytes (0xFF 0xFB or 0x49 0x44 0x33) |
| 2 | generate() + TTS WAV | Validates RIFF header (0x52 0x49 0x46 0x46) |
| 3 | Unconfigured TTS provider | Verifies azure-tts without keys errors gracefully |
| 4 | generate() + STT | Validates result.transcription.confidence is numeric |
| 5 | STT + TTS round-trip | Audio in → LLM → audio out; validates both transcription and MP3 output |
| 6–8 | stream() + TTS | Validates StreamResult with audio chunks |
| 9–10 | CLI --tts / --stt flags | Spawns CLI subprocess, validates exit code and JSON output |
| 11 | Handler registration check | Verifies TTSProcessor, STTProcessor, RealtimeProcessor have expected provider keys |
| 12 | Audio utility validation | detectAudioFormat, createWavHeader, resamplePcm |
| 13 | ChunkedAudioStream | Validates chunking and event emission |
| 14 | Barrel exports | VOICE_ERROR_CODES, STT_ERROR_CODES, REALTIME_ERROR_CODES, DEFAULT_STT_OPTIONS |
| 15 | Removed method guard | Asserts synthesize, transcribe, startRealtimeVoice do NOT exist on NeuroLink |
Real API results logged in commit message:
| Provider | Phrase | Confidence |
|---|---|---|
| Whisper (openai-stt) | "The quick brown fox..." | 0.95 |
| Deepgram | same | 1.0 |
| Google STT | same | 0.98 |
| Azure STT | same | 0.9 |
| Full round-trip | Whisper → Vertex LLM → ElevenLabs | 126 KB MP3 output |
Files Changed
New files (11)
| File | Lines | Purpose |
|---|---|---|
| `src/lib/voice/providers/OpenAITTS.ts` | 253 | OpenAI TTS handler |
| `src/lib/voice/providers/ElevenLabsTTS.ts` | 326 | ElevenLabs TTS handler |
| `src/lib/voice/providers/AzureTTS.ts` | 357 | Azure Cognitive Services TTS handler |
| `src/lib/voice/providers/OpenAISTT.ts` | 317 | Whisper / OpenAI STT handler |
| `src/lib/voice/providers/DeepgramSTT.ts` | 547 | Deepgram STT handler |
| `src/lib/voice/providers/GoogleSTT.ts` | 481 | Google Cloud STT handler |
| `src/lib/voice/providers/AzureSTT.ts` | 374 | Azure Cognitive Services STT handler |
| `src/lib/voice/providers/OpenAIRealtime.ts` | 475 | OpenAI Realtime (WebSocket) handler |
| `src/lib/voice/providers/GeminiLive.ts` | 413 | Gemini Live (WebSocket) handler |
| `src/lib/voice/audio-utils.ts` | 552 | Audio format detection, WAV/PCM utilities |
| `src/lib/voice/stream-handler.ts` | 546 | Chunked streaming, fan-out/fan-in |
Substantially extended files (4)
| File | Change |
|---|---|
| `src/lib/voice/RealtimeVoiceAPI.ts` | 516 lines added — `BaseRealtimeHandler` (abstract) and `RealtimeProcessor` (static handler registry with connect/send/disconnect) |
| `src/lib/voice/errors.ts` | 464 lines added — `VoiceError`, `STTError`, `RealtimeError` with full static factory methods |
| `src/lib/voice/index.ts` | 125 lines added — barrel for all voice exports |
| `src/lib/utils/sttProcessor.ts` | 319 lines added — `STTProcessor` static registry with `transcribe`, `getHandler`, `supports`, `registerHandler`, span instrumentation matching `TTSProcessor` |
New type files (2)
| File | Lines | Purpose |
|---|---|---|
| `src/lib/types/stt.ts` | 772 | All STT types, error codes, constants, type guards |
| `src/lib/types/realtime.ts` | 322 | All Realtime types, error codes, constants, guards |
Modified files
| File | Change |
|---|---|
| `src/lib/types/tts.ts` | Extended `AudioFormat` union with 6 additional formats; added `TTSOptions.provider` |
| `src/lib/types/voice.ts` | Now re-exports `stt.ts` and `realtime.ts`; adds voice-level union types |
| `src/lib/types/index.ts` | New `export *` for `stt.ts` and `realtime.ts` |
| `src/lib/types/generate.ts` | Added `stt` option block to `GenerateOptions`; added `transcription: STTResult` to `GenerateResult` |
| `src/lib/types/stream.ts` | Minor additions for audio stream result types |
| `src/lib/types/span.ts` | Added `SpanType.STT` enum value |
| `src/lib/factories/providerRegistry.ts` | TTS, STT, and Realtime handler registration blocks at the end of `registerAllProviders()` |
| `src/lib/neurolink.ts` | STT preprocessing in `runStandardGenerateRequest()`; TTS option threading to stream/generate |
| `src/cli/commands/voice.ts` | New `--stt`, `--stt-provider`, `--input-audio`, `--stt-language`, `--tts-provider` flags |
| `src/lib/server/voice/voiceWebSocketHandler.ts` | Refactored to use `STTProcessor` / `TTSProcessor` / `RealtimeProcessor` instead of direct provider classes |
| `.env.example` | Added `DEEPGRAM_API_KEY`, `ELEVENLABS_API_KEY`, `AZURE_SPEECH_KEY`, `AZURE_SPEECH_REGION` |
| `test/continuous-test-suite-voice.ts` | 1,822-line new test suite |
Smoke Tests
```bash
# Build first
pnpm run build:cli

# TTS: OpenAI
export OPENAI_API_KEY="sk-..."
pnpm run cli generate "Hello world" --tts --tts-provider openai-tts --tts-voice nova

# TTS: ElevenLabs
export ELEVENLABS_API_KEY="..."
pnpm run cli generate "Hello world" --tts --tts-provider elevenlabs

# STT: Whisper
export OPENAI_API_KEY="sk-..."
pnpm run cli generate --stt --stt-provider whisper --input-audio recording.wav

# STT + TTS round-trip
pnpm run cli generate --stt --stt-provider whisper --input-audio recording.wav \
  --tts --tts-provider openai-tts --provider openai

# Full test suite (requires Vertex credentials)
npx tsx test/continuous-test-suite-voice.ts --provider=vertex
```
Backward Compatibility
- No changes to the `AIProviderName` enum — existing provider callers unaffected.
- No new public `NeuroLink` methods — the interface extends only through option fields.
- `AudioFormat` type extended additively — existing `"mp3" | "wav" | "ogg" | "opus"` values unchanged.
- `GenerateOptions.stt` and `GenerateResult.transcription` are optional — callers not passing `stt` see no change in behaviour.
- `TTSProcessor`'s pre-existing registration for `google-ai` and `vertex` (via `GoogleTTSHandler`) is unmodified.
NeuroLinkmethods — interface extends only through option fields. AudioFormattype extended additively — existing"mp3" | "wav" | "ogg" | "opus"values unchanged.GenerateOptions.sttandGenerateResult.transcriptionare optional — callers not passingsttsee no change in behaviour.TTSProcessorpre-existing registration forgoogle-aiandvertex(viaGoogleTTSHandler) is unmodified.