
14 · Voice / Speech Integration — Implementation Journal

Commit: 27a31c32 — feat(voice): add multi-provider TTS, STT, and realtime voice integration


Architecture

How voice plugs into Factory + Registry

The voice integration does not add AI providers (it adds no entries to AIProviderName). Instead it introduces three parallel static registries that mirror the ProviderFactory / ProviderRegistry pattern for non-LLM capabilities:

ProviderFactory    →  creates LLM provider instances
ProviderRegistry   →  holds LLM factory functions (dynamic imports)

TTSProcessor       →  static Map<string, TTSHandler> (text-to-speech)
STTProcessor       →  static Map<string, STTHandler> (speech-to-text)
RealtimeProcessor  →  static Map<string, RealtimeHandler> (bidirectional voice)

Each processor exposes registerHandler(name, handler) and the appropriate operation (synthesize, transcribe, connect). The same O(1) Map lookup and lazy-instantiation pattern used by ProviderRegistry applies here.
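The shared registry pattern can be sketched as follows. This is a simplified stand-in: the real TTSHandler interface and TTSProcessor signatures have more fields (options, error codes, span instrumentation), so treat the names and shapes here as illustrative only.

```typescript
// Minimal sketch of the static-registry pattern shared by the three processors.
// TTSHandler here is a stand-in, not the shipped interface.
interface TTSHandler {
  synthesize(text: string): Promise<Uint8Array>;
}

class TTSProcessor {
  private static handlers = new Map<string, TTSHandler>();

  static registerHandler(name: string, handler: TTSHandler): void {
    TTSProcessor.handlers.set(name, handler);
  }

  static async synthesize(name: string, text: string): Promise<Uint8Array> {
    const handler = TTSProcessor.handlers.get(name); // O(1) Map lookup
    if (!handler) {
      throw new Error(`No TTS handler registered for "${name}"`);
    }
    return handler.synthesize(text);
  }
}
```

STTProcessor and RealtimeProcessor follow the same shape, swapping synthesize for transcribe and connect respectively.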

Registration location

All handler registration happens at the bottom of ProviderRegistry.registerAllProviders() in src/lib/factories/providerRegistry.ts, after all LLM providers are registered. The order is:

  1. LLM providers (existing)
  2. TTS handler registration block
  3. STT handler registration block
  4. Realtime handler registration block

Each block uses a separate try/catch so a missing API key or a broken import cannot prevent the LLM providers from registering. Registration is fire-and-forget: failures log a warning and continue.

All imports inside the registration blocks are dynamic (await import(...)), matching CLAUDE.md rule #1 and preventing circular dependencies.

// Pattern used for every voice handler:
try {
  const { TTSProcessor } = await import("../utils/ttsProcessor.js");
  const { OpenAITTS } = await import("../voice/providers/OpenAITTS.js");
  TTSProcessor.registerHandler("openai-tts", new OpenAITTS());
} catch {
  /* Optional provider — skip if unavailable */
}

STT preprocessing in runStandardGenerateRequest() (neurolink.ts)

When a caller passes { stt: { enabled: true, audio: buffer } } to generate(), the following happens inside runStandardGenerateRequest() before the LLM call:

  1. ProviderRegistry.isRegistered() is checked; if false, registerAllProviders() is awaited.
  2. STTProcessor is dynamically imported and transcribe(audio, providerName, sttOptions) is called.
  3. The transcription text is injected into the LLM prompt:
    • If no user text exists, the transcription becomes the prompt directly.
    • If user text exists, the transcription is prepended as [Transcribed audio]: <text>\n\n<user text>.
  4. generateResult.transcription is set to the STTResult object (available to callers).
  5. Failure-handling — split by whether the caller provided text:
    • Audio-only requests (stt.audio present, no user text) — transcription failures fail fast: STTError propagates and generate() rejects, since there is no fallback prompt.
    • Text + audio requests — transcription failures are logged via logger.error and generate() continues with the un-augmented user text (preserves the optional-augmentation contract).
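The prompt-injection rule in step 3 can be sketched as a pure function; buildPrompt is an illustrative name, not the real internal from neurolink.ts:

```typescript
// Sketch of the transcription-injection rule: audio-only requests use the
// transcription as the prompt; text + audio requests prepend it as context.
function buildPrompt(transcription: string, userText?: string): string {
  if (!userText || userText.trim() === "") {
    // Audio-only request: the transcription IS the prompt.
    return transcription;
  }
  // Text + audio: prepend the transcription in the documented format.
  return `[Transcribed audio]: ${transcription}\n\n${userText}`;
}
```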

Type organisation

Two new canonical type files (stt.ts, realtime.ts) were added to src/lib/types/, and two existing ones (tts.ts, voice.ts) were extended, all CLAUDE.md rule #8 compliant (no "Types" suffix):

  • src/lib/types/tts.ts — Extended AudioFormat (added m4a, flac, webm, mp4, mpeg, mpga); added TTSOptions.provider field
  • src/lib/types/stt.ts — STTHandler, STTOptions, STTResult, STTLanguage, WordTiming, TranscriptionSegment, STT_ERROR_CODES, DEFAULT_STT_OPTIONS, guards
  • src/lib/types/realtime.ts — RealtimeHandler, RealtimeConfig, RealtimeSession, RealtimeAudioChunk, RealtimeSessionState, REALTIME_ERROR_CODES, DEFAULT_REALTIME_CONFIG
  • src/lib/types/voice.ts — Aggregator: re-exports all of tts.ts, stt.ts, realtime.ts; adds VoiceCapability, VoiceProviderName, VoiceProviderConfig, VoiceErrorOptions

src/lib/types/index.ts gains two new export * lines (for stt.ts and realtime.ts; voice.ts is already present). Rules 9 and 10 apply: type names are globally unique, and the barrel uses export * only.


TTS Providers Added

openai-tts

  • File: src/lib/voice/providers/OpenAITTS.ts (253 lines, NEW)
  • Class: OpenAITTS implements TTSHandler
  • API: POST https://api.openai.com/v1/audio/speech
  • Auth: Authorization: Bearer $OPENAI_API_KEY
  • Models: tts-1 (standard, default) and tts-1-hd (high quality; selected when options.quality === "hd")
  • Voices (6): alloy, echo, fable, onyx, nova, shimmer
  • Output formats: mp3 (default), wav, opus/ogg (mapped to OpenAI's opus)
  • Max text: 4 096 characters
  • Registered as: "openai-tts" in TTSProcessor
  • Timeout: 30-second AbortController on every fetch call; throws TTSError with TTS_ERROR_CODES.SYNTHESIS_FAILED on abort

elevenlabs

  • File: src/lib/voice/providers/ElevenLabsTTS.ts (326 lines, NEW)
  • Class: ElevenLabsTTS implements TTSHandler
  • API: POST https://api.elevenlabs.io/v1/text-to-speech/{voice_id}?output_format=...
  • Auth: xi-api-key: $ELEVENLABS_API_KEY
  • Model: eleven_multilingual_v2 (default)
  • Voices: Dynamic — fetched from /v1/voices and cached for 5 minutes. Default voice: 21m00Tcm4TlvDq8ikWAM (Rachel).
  • Output formats: mp3_44100_128 (mp3), pcm_44100 (wav), ogg_22050 (ogg/opus)
  • Voice settings: stability (default 0.5), similarity_boost (0.75), style (0.0), use_speaker_boost (true)
  • Max text: 5 000 characters
  • Registered as: "elevenlabs" and "elevenlabs-tts" in TTSProcessor
  • Timeout: 30-second AbortController on synthesize and getVoices calls

azure-tts

  • File: src/lib/voice/providers/AzureTTS.ts (357 lines, NEW)
  • Class: AzureTTS implements TTSHandler
  • API: POST https://{region}.tts.speech.microsoft.com/cognitiveservices/v1
  • Auth: Ocp-Apim-Subscription-Key: $AZURE_SPEECH_KEY
  • Region: $AZURE_SPEECH_REGION (default "eastus")
  • Default voice: en-US-JennyNeural
  • Output format (default): audio-24khz-96kbitrate-mono-mp3
  • SSML: The handler builds SSML automatically from text, voice, speed, and pitch options. Callers can pass raw SSML by setting text to a string starting with <speak or by providing options.ssmlTemplate.
  • Voices: Fetched from /cognitiveservices/voices/list and cached for 30 minutes.
  • Max text: 10 000 characters
  • Registered as: "azure-tts" in TTSProcessor
  • Timeout: 30-second AbortController on all fetch calls
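The automatic SSML wrapping described above can be sketched roughly as follows. The mapping of speed to a prosody rate percentage and pitch to a signed percentage is an assumption; the shipped AzureTTS handler's exact option names and defaults may differ.

```typescript
// Sketch of SSML construction from plain text + voice/speed/pitch options.
// Raw SSML (text starting with <speak) passes through untouched.
function buildSsml(
  text: string,
  voice = "en-US-JennyNeural",
  speed = 1.0,
  pitch = 0,
): string {
  if (text.trimStart().startsWith("<speak")) {
    return text; // caller supplied raw SSML
  }
  const rate = `${Math.round((speed - 1) * 100)}%`;        // 1.2 → "20%"
  const pitchStr = `${pitch >= 0 ? "+" : ""}${pitch}%`;    // -2 → "-2%"
  return (
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">` +
    `<voice name="${voice}"><prosody rate="${rate}" pitch="${pitchStr}">` +
    text +
    `</prosody></voice></speak>`
  );
}
```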

STT Providers Added

whisper / openai-stt

  • File: src/lib/voice/providers/OpenAISTT.ts (317 lines, NEW)
  • Class: OpenAISTT implements STTHandler (exported also as WhisperSTT, WhisperSTTHandler, OpenAISTTHandler)
  • API: POST https://api.openai.com/v1/audio/transcriptions (or /translations when translate: true)
  • Auth: Authorization: Bearer $OPENAI_API_KEY
  • Model: whisper-1 (default)
  • Response format: verbose_json (default) — returns text, language, duration, words, segments
  • Word timestamps: Enabled when options.wordTimestamps === true (sends timestamp_granularities[]=word&segment)
  • Confidence: Fixed at 0.95 (Whisper does not return per-result confidence); segment confidence derived from Math.exp(segment.avg_logprob)
  • Max audio: 25 minutes
  • Supported formats: mp3, wav, ogg, opus
  • Streaming: Not supported (supportsStreaming = false)
  • Registered as: "whisper" and "openai-stt" in STTProcessor
  • Timeout: 30-second AbortController on the fetch multipart form POST
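The segment-confidence derivation mentioned above can be sketched as a one-liner. Whisper's avg_logprob is a mean log-probability (at most 0), so exp() maps it into (0, 1]; the clamp is a defensive assumption, not confirmed behaviour of the shipped handler.

```typescript
// Sketch: derive a 0–1 confidence from a Whisper segment's avg_logprob.
function segmentConfidence(avgLogprob: number): number {
  return Math.min(1, Math.exp(avgLogprob));
}
```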

google-stt

  • File: src/lib/voice/providers/GoogleSTT.ts (481 lines, NEW)
  • Class: GoogleSTT implements STTHandler
  • API: POST https://speech.googleapis.com/v1/speech:recognize
  • Auth: $GOOGLE_API_KEY (query param) or $GOOGLE_APPLICATION_CREDENTIALS (service account path)
  • Streaming: Supported (supportsStreaming = true)
  • Max audio: 480 minutes (8 hours, async path)
  • Diarization: Supported
  • Registered as: "google-stt" in STTProcessor
  • Timeout: 30-second AbortController

deepgram

  • File: src/lib/voice/providers/DeepgramSTT.ts (547 lines, NEW)
  • Class: DeepgramSTT implements STTHandler
  • API: POST https://api.deepgram.com/v1/listen
  • Auth: Authorization: Token $DEEPGRAM_API_KEY
  • Models: Nova-2 (default), Nova-3
  • Streaming: Supported via WebSocket (supportsStreaming = true)
  • Speaker diarization: Supported
  • Max audio: 2 hours (maxAudioDuration = 7200)
  • Supported formats: mp3, wav, ogg, opus
  • Registered as: "deepgram" in STTProcessor
  • Timeout: 30-second AbortController on REST calls

azure-stt

  • File: src/lib/voice/providers/AzureSTT.ts (374 lines, NEW)
  • Class: AzureSTT implements STTHandler
  • API: Azure Cognitive Services Speech SDK REST endpoint
  • Auth: $AZURE_SPEECH_KEY + $AZURE_SPEECH_REGION
  • Streaming: Supported
  • Registered as: "azure-stt" in STTProcessor
  • Timeout: 30-second AbortController

Realtime Providers Added (registered, not yet SDK-exposed)

Both realtime providers are registered in ProviderRegistry.registerAllProviders() but are not yet accessible via public NeuroLink SDK methods. They exist as handler registrations ready for future surfacing.

openai-realtime

  • File: src/lib/voice/providers/OpenAIRealtime.ts (475 lines, NEW)
  • Class: OpenAIRealtime extends BaseRealtimeHandler
  • Transport: WebSocket (wss://api.openai.com/v1/realtime)
  • Auth: Authorization: Bearer $OPENAI_API_KEY + OpenAI-Beta: realtime=v1 headers
  • Supported formats: wav, opus
  • Registered as: "openai-realtime" in RealtimeProcessor

gemini-live

  • File: src/lib/voice/providers/GeminiLive.ts (413 lines, NEW)
  • Class: GeminiLive extends BaseRealtimeHandler
  • Transport: WebSocket (Gemini Live API)
  • Auth: $GOOGLE_API_KEY
  • Supported formats: opus, wav
  • Registered as: "gemini-live" in RealtimeProcessor

Both extend BaseRealtimeHandler (in src/lib/voice/RealtimeVoiceAPI.ts), which manages connection state, session lifecycle, and event emission via EventEmitter.


Key Design Decisions

Everything through generate() / stream()

No new top-level NeuroLink methods were added (synthesize, transcribe, startRealtimeVoice are intentionally absent). All voice capability is driven through the existing option objects:

// TTS — same as before
await neurolink.generate({
  prompt: "...",
  tts: { enabled: true, voice: "alloy", provider: "openai-tts" },
});

// STT — new
await neurolink.generate({
  stt: { enabled: true, audio: audioBuffer, provider: "whisper" },
  // prompt is optional; transcription becomes the prompt if omitted
});

// Round-trip — STT input, LLM, TTS output
await neurolink.generate({
  stt: { enabled: true, audio: audioBuffer, provider: "whisper" },
  tts: { enabled: true, provider: "elevenlabs", voice: "21m00Tcm4TlvDq8ikWAM" },
});

This preserves backward compatibility (CLAUDE.md rule #5) — existing callers are unaffected.

STT preprocessing logic

The preprocessing runs in runStandardGenerateRequest() after options validation and before generateTextInternal(). Key properties:

  • options.stt.provider defaults to options.provider (the LLM provider name) then falls back to "whisper".
  • Failure handling depends on whether user text is present:
    • With user text — failure is non-fatal: logged via logger.error and generate() continues with the un-augmented prompt.
    • Audio-only (no user text) — failure is fatal: STTError is rethrown and generate() rejects, since the request has no prompt fallback.
  • result.transcription (type STTResult) is attached to the GenerateResult when transcription succeeds.

Fetch timeouts

Every provider API call wraps its fetch in a 30-second AbortController:

let response: Response;
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 30000);
try {
  response = await fetch(url, { /* ...request options */ signal: controller.signal });
} finally {
  clearTimeout(timeoutId);
}

AbortError is caught and re-thrown as a typed TTSError / STTError with a human-readable message. This pattern is consistent across all 7 new providers.
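The AbortError-to-typed-error translation can be sketched as below. Both fetchWithTimeout and this local TTSError are illustrative stand-ins, not the shipped implementations (the real TTSError lives in src/lib/utils/ttsProcessor.ts and carries more metadata).

```typescript
// Stand-in for the real TTSError, for illustration only.
class TTSError extends Error {
  constructor(message: string, public readonly code: string) {
    super(message);
    this.name = "TTSError";
  }
}

// Sketch: wrap fetch in a 30 s abort and re-throw timeouts as a typed error.
async function fetchWithTimeout(
  url: string,
  init: RequestInit,
  timeoutMs = 30_000,
): Promise<Response> {
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { ...init, signal: controller.signal });
  } catch (err) {
    if (err instanceof Error && err.name === "AbortError") {
      // Translate the opaque AbortError into a typed, human-readable error.
      throw new TTSError(`TTS request timed out after ${timeoutMs} ms`, "SYNTHESIS_FAILED");
    }
    throw err;
  } finally {
    clearTimeout(timeoutId);
  }
}
```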

Audio utilities (src/lib/voice/audio-utils.ts)

552-line utility module with no external dependencies beyond Node.js built-ins:

  • detectAudioFormat(buffer) — Identifies wav, mp3, ogg, opus from magic bytes
  • createWavHeader(...) / createWavFile(...) — Builds a 44-byte RIFF/WAV header / header + PCM data
  • extractPcmSamples(buffer) — Reads 16-bit LE PCM samples from a WAV
  • normalizeAudio(samples) — Scales to peak 0.9
  • resamplePcm(samples, fromRate, toRate) — Linear-interpolation resampling
  • calculateDuration(buffer, format, sampleRate, channels, bitsPerSample) — Duration in seconds (parses WAV header / estimates MP3)
  • convertAudioFormat(buffer, from, to) — Throws when from ≠ to — cross-format conversion is not implemented (use ffmpeg)
  • getMimeType(format) / getFileExtension(format) — Format → MIME / extension
  • AUDIO_SIGNATURES — Magic-byte constants per format
  • MIME_TYPES — Format → MIME map constant

Note: earlier drafts of this doc referenced createPcmBuffer(durationMs, …) and splitIntoChunks(buffer, chunkSize). Those helpers were dropped before the PR shipped in favour of caller-side composition. convertAudioFormat is not best-effort — it throws when source ≠ target format.
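For illustration, the magic-byte detection can be re-implemented roughly like this. It is a sketch only: the shipped detectAudioFormat also distinguishes opus and checks more signatures than shown here.

```typescript
// Rough sketch of format detection from magic bytes.
type SketchAudioFormat = "wav" | "mp3" | "ogg" | "unknown";

function detectAudioFormat(buffer: Buffer): SketchAudioFormat {
  // WAV: RIFF container header
  if (buffer.length >= 4 && buffer.toString("ascii", 0, 4) === "RIFF") return "wav";
  // MP3: ID3 tag, or a raw MPEG frame sync (11 set bits)
  if (buffer.length >= 3 && buffer.toString("ascii", 0, 3) === "ID3") return "mp3";
  if (buffer.length >= 2 && buffer[0] === 0xff && (buffer[1] & 0xe0) === 0xe0) return "mp3";
  // Ogg container (typically carrying Vorbis or Opus)
  if (buffer.length >= 4 && buffer.toString("ascii", 0, 4) === "OggS") return "ogg";
  return "unknown";
}
```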

Stream infrastructure (src/lib/voice/stream-handler.ts)

546-line module providing:

  • ChunkedAudioStream extends EventEmitter — Slices incoming audio into fixed-duration chunks (default 100 ms) with backpressure management
  • StreamHandler extends EventEmitter — Generic event-driven handler with start/stop and error propagation
  • StreamSplitter — Fan-out: one input → multiple output streams
  • StreamMerger — Fan-in: multiple input streams → one output
  • asyncIterableToStream(iterable) — Converts AsyncIterable<Buffer> → Node Readable
  • streamToAsyncIterable(stream) — Converts Node Readable → AsyncIterable<Buffer>

ChunkedAudioStream defaults: chunkDurationMs=100, sampleRate=16000, bytesPerSample=2, highWaterMark=64KB.
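Those defaults imply a fixed byte size per chunk: 100 ms at 16 kHz with 2 bytes per sample is 3 200 bytes. The channels parameter below is an assumption (the documented defaults are mono, 16-bit):

```typescript
// Chunk-size arithmetic implied by the ChunkedAudioStream defaults.
function chunkSizeBytes(
  chunkDurationMs: number,
  sampleRate: number,
  bytesPerSample: number,
  channels = 1,
): number {
  const samplesPerChunk = Math.floor((chunkDurationMs / 1000) * sampleRate);
  return samplesPerChunk * bytesPerSample * channels;
}
// Defaults: 100 ms × 16 000 Hz × 2 bytes, mono → 3 200 bytes per chunk.
```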


Error Handling

Three new error classes in src/lib/voice/errors.ts (all extend NeuroLinkError):

  • VoiceError — default category EXECUTION, default severity MEDIUM
  • STTError — default category VALIDATION, default severity MEDIUM
  • RealtimeError — default category EXECUTION, default severity HIGH

TTSError lives in src/lib/utils/ttsProcessor.ts (pre-existing; not in errors.ts).

STTError includes static factory methods: audioEmpty, audioTooLong, invalidFormat, languageNotSupported, transcriptionFailed, providerNotConfigured, providerNotSupported, streamError.

RealtimeError includes: connectionFailed, sessionTimeout, protocolError, audioStreamError, providerNotConfigured, sessionAlreadyActive, sessionNotActive, invalidMessage.
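The static-factory style can be sketched as follows. This is a simplified stand-in: the real STTError extends NeuroLinkError and carries category and severity fields, and only two of the documented factories are shown.

```typescript
// Simplified stand-in for STTError's static-factory pattern.
class STTError extends Error {
  constructor(message: string, public readonly code: string) {
    super(message);
    this.name = "STTError";
  }

  static audioEmpty(): STTError {
    return new STTError("Audio buffer is empty", "AUDIO_EMPTY");
  }

  static audioTooLong(maxSeconds: number): STTError {
    return new STTError(`Audio exceeds maximum duration of ${maxSeconds}s`, "AUDIO_TOO_LONG");
  }
}
```

Factories keep messages and codes consistent at every throw site instead of scattering string literals through the providers.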


CLI Changes

New flags added to src/cli/commands/voice.ts and propagated via src/cli/factories/commandFactory.ts:

  • --stt — Enable STT preprocessing
  • --stt-provider — Which STT provider to use (default: whisper)
  • --input-audio — Path to audio file for STT
  • --stt-language — BCP-47 language code for transcription
  • --tts-provider — Override TTS provider (e.g., openai-tts, elevenlabs, azure-tts)

The --tts and --tts-voice flags are pre-existing.


Testing

Test suite: test/continuous-test-suite-voice.ts (1 822 lines, NEW)

The suite is invoked as:

npx tsx test/continuous-test-suite-voice.ts --provider=vertex

It covers 15 test items via the consumer API only — no direct provider class calls:

  1. generate() + TTS MP3 — Validates MP3 magic bytes (0xFF 0xFB or 0x49 0x44 0x33)
  2. generate() + TTS WAV — Validates RIFF header (0x52 0x49 0x46 0x46)
  3. Unconfigured TTS provider — Verifies azure-tts without keys errors gracefully
  4. generate() + STT — Validates result.transcription.confidence is numeric
  5. STT + TTS round-trip — Audio in → LLM → audio out; validates both transcription and MP3 output
  6–8. stream() + TTS — Validates StreamResult with audio chunks
  9–10. CLI --tts / --stt flags — Spawns a CLI subprocess, validates exit code and JSON output
  11. Handler registration check — Verifies TTSProcessor, STTProcessor, RealtimeProcessor have expected provider keys
  12. Audio utility validation — detectAudioFormat, createWavHeader, resamplePcm
  13. ChunkedAudioStream — Validates chunking and event emission
  14. Barrel exports — VOICE_ERROR_CODES, STT_ERROR_CODES, REALTIME_ERROR_CODES, DEFAULT_STT_OPTIONS
  15. Removed-method guard — Asserts synthesize, transcribe, startRealtimeVoice do NOT exist on NeuroLink

Real API results logged in commit message:

  • Whisper (openai-stt) — "The quick brown fox..." — confidence 0.95
  • Deepgram — same phrase — confidence 1.0
  • Google STT — same phrase — confidence 0.98
  • Azure STT — same phrase — confidence 0.9
  • Full round-trip (Whisper → Vertex LLM → ElevenLabs) — 126 KB MP3 output

Files Changed

New files (11)

  • src/lib/voice/providers/OpenAITTS.ts (253 lines) — OpenAI TTS handler
  • src/lib/voice/providers/ElevenLabsTTS.ts (326 lines) — ElevenLabs TTS handler
  • src/lib/voice/providers/AzureTTS.ts (357 lines) — Azure Cognitive Services TTS handler
  • src/lib/voice/providers/OpenAISTT.ts (317 lines) — Whisper / OpenAI STT handler
  • src/lib/voice/providers/DeepgramSTT.ts (547 lines) — Deepgram STT handler
  • src/lib/voice/providers/GoogleSTT.ts (481 lines) — Google Cloud STT handler
  • src/lib/voice/providers/AzureSTT.ts (374 lines) — Azure Cognitive Services STT handler
  • src/lib/voice/providers/OpenAIRealtime.ts (475 lines) — OpenAI Realtime (WebSocket) handler
  • src/lib/voice/providers/GeminiLive.ts (413 lines) — Gemini Live (WebSocket) handler
  • src/lib/voice/audio-utils.ts (552 lines) — Audio format detection, WAV/PCM utilities
  • src/lib/voice/stream-handler.ts (546 lines) — Chunked streaming, fan-out/fan-in

Substantially extended files (4)

  • src/lib/voice/RealtimeVoiceAPI.ts — 516 lines added: BaseRealtimeHandler (abstract) and RealtimeProcessor (static handler registry with connect/send/disconnect)
  • src/lib/voice/errors.ts — 464 lines added: VoiceError, STTError, RealtimeError with full static factory methods
  • src/lib/voice/index.ts — 125 lines added: barrel for all voice exports
  • src/lib/utils/sttProcessor.ts — 319 lines added: STTProcessor static registry with transcribe, getHandler, supports, registerHandler, span instrumentation matching TTSProcessor

New type files (2)

  • src/lib/types/stt.ts (772 lines) — All STT types, error codes, constants, type guards
  • src/lib/types/realtime.ts (322 lines) — All Realtime types, error codes, constants, guards

Modified files

  • src/lib/types/tts.ts — Extended AudioFormat union with 6 additional formats; added TTSOptions.provider
  • src/lib/types/voice.ts — Now re-exports stt.ts and realtime.ts; adds voice-level union types
  • src/lib/types/index.ts — New export * for stt.ts and realtime.ts
  • src/lib/types/generate.ts — Added stt option block to GenerateOptions; added transcription: STTResult to GenerateResult
  • src/lib/types/stream.ts — Minor additions for audio stream result types
  • src/lib/types/span.ts — Added SpanType.STT enum value
  • src/lib/factories/providerRegistry.ts — TTS, STT, and Realtime handler registration blocks at end of registerAllProviders()
  • src/lib/neurolink.ts — STT preprocessing in runStandardGenerateRequest(); TTS option threading to stream/generate
  • src/cli/commands/voice.ts — New --stt, --stt-provider, --input-audio, --stt-language, --tts-provider flags
  • src/lib/server/voice/voiceWebSocketHandler.ts — Refactored to use STTProcessor / TTSProcessor / RealtimeProcessor instead of direct provider classes
  • .env.example — Added DEEPGRAM_API_KEY, ELEVENLABS_API_KEY, AZURE_SPEECH_KEY, AZURE_SPEECH_REGION
  • test/continuous-test-suite-voice.ts — 1 822-line new test suite

Smoke Tests

# Build first
pnpm run build:cli

# TTS: OpenAI
export OPENAI_API_KEY="sk-..."
pnpm run cli generate "Hello world" --tts --tts-provider openai-tts --tts-voice nova

# TTS: ElevenLabs
export ELEVENLABS_API_KEY="..."
pnpm run cli generate "Hello world" --tts --tts-provider elevenlabs

# STT: Whisper
export OPENAI_API_KEY="sk-..."
pnpm run cli generate --stt --stt-provider whisper --input-audio recording.wav

# STT + TTS round-trip
pnpm run cli generate --stt --stt-provider whisper --input-audio recording.wav \
  --tts --tts-provider openai-tts --provider openai

# Full test suite (requires Vertex credentials)
npx tsx test/continuous-test-suite-voice.ts --provider=vertex

Backward Compatibility

  • No changes to AIProviderName enum — existing provider callers unaffected.
  • No new public NeuroLink methods — interface extends only through option fields.
  • AudioFormat type extended additively — existing "mp3" | "wav" | "ogg" | "opus" values unchanged.
  • GenerateOptions.stt and GenerateResult.transcription are optional — callers not passing stt see no change in behaviour.
  • TTSProcessor pre-existing registration for google-ai and vertex (via GoogleTTSHandler) is unmodified.