Real-Time Voice Agent - Streaming Voice Loop Design

Automatic, low-latency voice conversations with streaming STT, LLM, TTS, and barge-in support


Table of Contents

  1. Problem Statement & Solution
  2. Architecture Overview
  3. Core Components
  4. Runtime Flow
  5. CLI Integration
  6. Source Layout
  7. Configuration
  8. Operational Behavior
  9. Error Handling & Troubleshooting
  10. Performance Characteristics
  11. Extensibility Roadmap

Problem Statement & Solution

The Challenge

Real-time voice assistants are harder than ordinary request/response chat because they must coordinate:

  • continuous microphone audio input
  • speech detection
  • real-time transcription
  • streaming LLM generation
  • streaming TTS playback
  • interruption while the assistant is still speaking

Without careful coordination, common failures appear quickly:

  • user speech gets cut off too early
  • assistant speech is echoed back into the mic
  • interruptions trigger too often or too late
  • TTS providers fail under token-by-token flooding
  • local MCP/tool initialization adds large latency spikes

Our Solution

NeuroLink exposes a dedicated voice-server mode that runs a full browser-to-server voice loop:

  1. Browser captures microphone audio
  2. Cobra detects speaking/silence boundaries
  3. Soniox performs streaming STT
  4. NeuroLink streams the LLM response
  5. Cartesia converts the response into streaming PCM audio
  6. Browser plays audio immediately and supports mid-response interruption

Key Benefits

  • Low-latency speech loop for natural conversations
  • Automatic barge-in while assistant audio is playing
  • Buffered TTS chunking to avoid provider overload on long replies
  • Warmup path to reduce first-turn cold start cost
  • Environment-driven configuration for Cartesia endpoint/version overrides
  • Voice-mode tool isolation by disabling MCP tools during real-time turns

Architecture Overview

System Flow Diagram

┌─────────────────────────────────────────────────────────────┐
│ Browser │
│ • Captures 16kHz PCM mic audio │
│ • Sends audio over WebSocket │
│ • Plays assistant PCM audio │
└────────────────────────┬────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────┐
│ Voice WebSocket Handler │
│ • Maintains per-session state │
│ • Routes audio frames to VAD + STT │
│ • Sends assistant audio back to browser │
└────────────────────────┬────────────────────────────────────┘

              ┌─────────────┼─────────────┐
              ▼             ▼             ▼
         Cobra VAD     Soniox STT    TurnManager
       • speech start  • partials    • IDLE
       • speech stop   • finals      • USER_SPEAKING
                                     • PROCESSING
                                     • ASSISTANT_SPEAKING
              └─────────────┬─────────────┘

┌─────────────────────────────────────────────────────────────┐
│ NeuroLink LLM Streaming │
│ • Default: Azure gpt-4o-automatic (configurable via env) │
│ • tools disabled for lower latency │
│ • short spoken response style │
└────────────────────────┬────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ Cartesia TTS │
│ • transcript chunks in │
│ • PCM S16LE 24kHz out │
└────────────────────────┬────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ Browser Playback │
│ • buffered playback │
│ • sends playback_done when queue drains │
└─────────────────────────────────────────────────────────────┘

Core Components

1. Voice Activity Detection

Provider: Picovoice Cobra

Purpose:

  • identify when the user starts speaking
  • identify when the user stops speaking
  • move session state between IDLE, USER_SPEAKING, and PROCESSING

Implementation details:

  • 512-sample frames
  • threshold-based speech probability
  • explicit start and stop hysteresis using consecutive frames
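
The hysteresis above can be sketched as a small gate over per-frame speech probabilities. This is an illustrative sketch, not the actual implementation: the class name and method shape are assumptions, while the defaults mirror the tuned constants listed later in this document (512-sample frames at 16kHz, so one frame is ~32ms).

```typescript
// Hypothetical sketch of start/stop hysteresis over Cobra frame probabilities.
type VadEvent = "vad_start" | "vad_stop" | null;

class SpeechGate {
  private speaking = false;
  private speechFrames = 0;
  private silenceFrames = 0;

  constructor(
    private threshold = 0.7, // Cobra speech probability cutoff
    private framesToStart = 5, // ~160ms of consecutive speech
    private framesToStop = 30, // ~960ms of consecutive silence
  ) {}

  // Feed one speech probability per 512-sample frame; returns an event
  // only when the speaking/silent state actually flips.
  push(probability: number): VadEvent {
    if (probability >= this.threshold) {
      this.speechFrames++;
      this.silenceFrames = 0;
      if (!this.speaking && this.speechFrames >= this.framesToStart) {
        this.speaking = true;
        return "vad_start";
      }
    } else {
      this.silenceFrames++;
      this.speechFrames = 0;
      if (this.speaking && this.silenceFrames >= this.framesToStop) {
        this.speaking = false;
        return "vad_stop";
      }
    }
    return null;
  }
}
```

Requiring several consecutive frames in each direction is what filters short noise bursts on start and avoids cutting natural pauses on stop.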

2. Streaming Speech-to-Text

Provider: Soniox

Purpose:

  • transcribe incoming speech continuously
  • use non-final tokens for reliable barge-in detection
  • use final tokens plus <end> to trigger LLM processing
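
The final/non-final split can be illustrated with a small reducer. The token shape below is an assumption for illustration, not the actual Soniox wire schema:

```typescript
// Hypothetical reducer over Soniox-style token events.
interface SttToken {
  text: string;
  isFinal: boolean;
}

interface TurnDecision {
  finalTranscript: string; // accumulated final text for the LLM
  bargeInHint: boolean; // non-final speech seen (barge-in signal)
  endOfTurn: boolean; // saw the <end> marker → start LLM processing
}

function reduceTokens(tokens: SttToken[]): TurnDecision {
  let finalTranscript = "";
  let bargeInHint = false;
  let endOfTurn = false;
  for (const t of tokens) {
    if (!t.isFinal) {
      bargeInHint = true; // non-final tokens drive barge-in detection
    } else if (t.text === "<end>") {
      endOfTurn = true; // final <end> triggers LLM processing
    } else {
      finalTranscript += t.text;
    }
  }
  return { finalTranscript: finalTranscript.trim(), bargeInHint, endOfTurn };
}
```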

3. Turn State Management

Component: TurnManager

State machine:

IDLE -> USER_SPEAKING -> PROCESSING -> ASSISTANT_SPEAKING

Purpose:

  • prevent overlapping turns
  • distinguish user speech from assistant playback state
  • ensure barge-in only fires when the assistant is actually speaking
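
A minimal sketch of this state machine, assuming the class shape (the state names come from this document, everything else is illustrative):

```typescript
// Sketch of the TurnManager state machine with guarded transitions.
type TurnState = "IDLE" | "USER_SPEAKING" | "PROCESSING" | "ASSISTANT_SPEAKING";

const ALLOWED: Record<TurnState, TurnState[]> = {
  IDLE: ["USER_SPEAKING"],
  USER_SPEAKING: ["PROCESSING", "IDLE"],
  PROCESSING: ["ASSISTANT_SPEAKING", "IDLE"],
  ASSISTANT_SPEAKING: ["IDLE", "USER_SPEAKING"], // barge-in path
};

class TurnManager {
  private state: TurnState = "IDLE";

  get current(): TurnState {
    return this.state;
  }

  // Reject transitions that would overlap turns.
  transition(next: TurnState): boolean {
    if (!ALLOWED[this.state].includes(next)) return false;
    this.state = next;
    return true;
  }

  // Barge-in only fires while the assistant is actually speaking.
  canBargeIn(): boolean {
    return this.state === "ASSISTANT_SPEAKING";
  }
}
```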

4. Streaming TTS Adapter

Provider: Cartesia

Purpose:

  • accept streaming transcript chunks
  • return PCM S16LE 24kHz audio for immediate playback

Important implementation detail:

  • transcript is buffered into phrase/sentence chunks before being sent
  • this avoids sending one tiny WS message per token
  • reduces "Service unavailable" failures on long responses

5. Browser Client

Files in src/lib/server/voice/public:

  • src/lib/server/voice/public/index.html
  • src/lib/server/voice/public/app.js
  • src/lib/server/voice/public/pcm-worklet.js
  • src/lib/server/voice/public/styles.css

Responsibilities:

  • microphone capture
  • audio frame encoding and streaming
  • assistant playback queueing
  • playback completion signaling
  • simple voice UI state updates
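
The frame-encoding step amounts to converting Web Audio float samples to signed 16-bit PCM. The function below is a hypothetical sketch of that conversion, not the actual pcm-worklet.js API:

```typescript
// Illustrative float32 → PCM S16 conversion for mic frames.
function float32ToPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale into the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}
```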

Runtime Flow

Normal Turn

  1. Browser sends microphone PCM frames to server
  2. Cobra detects speech start and publishes vad_start
  3. Soniox streams transcription in parallel
  4. On final transcript + <end>, server calls NeuroLink streaming
  5. LLM response is buffered into TTS-friendly chunks
  6. Cartesia returns audio chunks
  7. Browser plays audio immediately
  8. Browser sends playback_done after the queue drains
  9. Session returns to IDLE

Barge-In Flow

  1. Assistant is already speaking
  2. Soniox emits new non-final user speech tokens
  3. Server verifies current state is ASSISTANT_SPEAKING
  4. Server interrupts active TTS
  5. Browser receives { type: "interrupt" }
  6. Current turn is canceled and user takes over
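
Steps 2–5 reduce to one guarded decision; the function below is an illustrative sketch (the names are assumptions), showing why both conditions must hold before the interrupt message is sent:

```typescript
// Sketch of the barge-in decision: only interrupt when the assistant is
// actually speaking AND Soniox reported fresh (non-final) user speech.
interface InterruptMsg {
  type: "interrupt";
}

function maybeInterrupt(
  state: "IDLE" | "USER_SPEAKING" | "PROCESSING" | "ASSISTANT_SPEAKING",
  sawNonFinalTokens: boolean,
): InterruptMsg | null {
  if (state === "ASSISTANT_SPEAKING" && sawNonFinalTokens) {
    return { type: "interrupt" }; // browser cancels playback on receipt
  }
  return null;
}
```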

Warmup Flow

On server startup:

  1. NeuroLink performs a tiny LLM stream request using the configured voice provider
  2. Cartesia WebSocket connection is opened and closed once
  3. Subsequent first-user-turn latency is reduced

CLI Integration

Command

neurolink voice-server --port 3000

Implementation Entry Point

  • src/cli/commands/voiceServer.ts

What the command does

  • starts an Express server
  • serves the browser UI
  • exposes a /health endpoint
  • attaches a WebSocket voice session handler
  • performs LLM + TTS warmup in the background

Source Layout

CLI

  • src/cli/commands/voiceServer.ts

Voice Server Module

src/lib/server/voice/
├── voiceServerApp.ts
├── voiceWebSocketHandler.ts
├── frameBus.ts
├── turnManager.ts
├── types.ts
└── public/
├── index.html
├── app.js
├── pcm-worklet.js
└── styles.css

TTS Adapter

  • src/lib/adapters/tts/cartesiaHandler.ts

Configuration

Required Environment Variables

# Cartesia
CARTESIA_API_KEY=

# Soniox
SONIOX_API_KEY=

# Picovoice Cobra
PICOVOICE_ACCESS_KEY=

# Azure OpenAI (for LLM — default provider)
AZURE_OPENAI_API_KEY=
AZURE_OPENAI_ENDPOINT=

Optional Voice LLM Overrides

# Override the LLM provider/model used for voice turns (defaults: azure / gpt-4o-automatic)
VOICE_LLM_PROVIDER=azure
VOICE_LLM_MODEL=gpt-4o-automatic

Optional Cartesia Overrides

CARTESIA_WS_BASE_URL=wss://api.cartesia.ai/tts/websocket
CARTESIA_API_VERSION=2025-04-16

Optional Soniox Overrides

SONIOX_WS_URL=wss://stt-rt.soniox.com/transcribe-websocket

These overrides exist because the endpoint base URLs are usually shared across environments, while API keys and versions may vary by environment or future provider rollout.


Operational Behavior

Tuned Constants

Constant                            Value         Purpose
VOICE_THRESHOLD                     0.7           Cobra speech probability cutoff
VOICE_FRAMES_TO_START               5 (~160ms)    Filter short noise bursts
SILENCE_FRAMES_TO_STOP              30 (~960ms)   Avoid cutting natural pauses
Pre-lock before assistant speech    1000ms        Protect initial TTS connection window
Lock refresh on first audio         +400ms        Cover browser AEC lock-on window
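
The millisecond approximations in the table follow from the frame geometry: Cobra frames are 512 samples at a 16kHz capture rate, i.e. 32ms per frame. A small helper makes the arithmetic explicit (the function name is illustrative):

```typescript
// Frame → millisecond conversion behind the tuned VAD constants.
const SAMPLE_RATE_HZ = 16000;
const FRAME_SAMPLES = 512;

function framesToMs(frames: number): number {
  return (frames * FRAME_SAMPLES * 1000) / SAMPLE_RATE_HZ;
}
```

So 5 start frames ≈ 160ms and 30 stop frames ≈ 960ms, matching the table.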

Why MCP Tools Are Disabled

Voice mode sets:

process.env.NEUROLINK_DISABLE_MCP_TOOLS = "true";

Reason:

  • tool/MCP initialization adds several seconds of latency
  • real-time voice turns need predictable low overhead
  • voice mode is optimized for direct conversation, not tool orchestration

Why TTS Buffering Matters

Sending every token directly to Cartesia can overload the provider on long responses.

Current strategy:

  • accumulate text in a local buffer
  • flush at sentence/phrase boundaries or after a minimum chunk length
  • keep speech natural while reducing provider stress
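
The strategy above can be sketched as a small buffer that flushes at punctuation boundaries once a minimum length is reached. The boundary regex and minimum chunk length below are assumptions for illustration, not the tuned production values:

```typescript
// Sketch of token → phrase buffering before sending text to TTS.
class TtsChunkBuffer {
  private buffer = "";

  constructor(private minChunkLength = 40) {}

  // Feed one LLM token; returns a chunk when it is ready to send.
  push(token: string): string | null {
    this.buffer += token;
    const endsAtBoundary = /[.!?,;:]\s*$/.test(this.buffer);
    if (endsAtBoundary && this.buffer.length >= this.minChunkLength) {
      const chunk = this.buffer;
      this.buffer = "";
      return chunk;
    }
    return null;
  }

  // Flush whatever remains when the LLM stream ends.
  flush(): string | null {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? rest : null;
  }
}
```

This turns hundreds of per-token WebSocket messages into a handful of phrase-sized ones.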

Error Handling & Troubleshooting

Common Runtime Failure Modes

1. Invalid MCP HTTP Auth

Symptom:

Authorization header is badly formatted

Impact:

  • degraded latency
  • failed or partial turns
  • noisy logs during voice testing

Fix:

  1. correct the local token/env value, or
  2. disable that MCP server locally while testing voice mode

Important:

  • this is a local environment issue
  • do not commit personal .mcp-config.json changes unless they are intended for everyone

2. Cartesia Temporary Unavailability

Symptom:

Service unavailable: TTS generation services are temporarily unavailable

Impact:

  • no assistant audio for that turn
  • turn resets so the user can retry

Mitigation already implemented:

  • mid-stream TTS errors abort the turn cleanly
  • failed turns are not committed to conversation history
  • long-response flooding was reduced via chunked TTS buffering

3. Missing Environment Variables

Examples:

  • CARTESIA_API_KEY is not set in environment
  • PICOVOICE_ACCESS_KEY is not set in environment

Fix:

  • populate values from .env.example
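
A startup check producing the error messages shown above could look like the following hypothetical sketch (the function name is an illustration):

```typescript
// Illustrative env validation emitting the "... is not set" messages.
function checkRequiredEnv(
  env: Record<string, string | undefined>,
  required: string[],
): string[] {
  return required
    .filter((name) => !env[name] || env[name]!.trim() === "")
    .map((name) => `${name} is not set in environment`);
}
```

Running this against process.env for CARTESIA_API_KEY, SONIOX_API_KEY, and PICOVOICE_ACCESS_KEY at startup would surface all missing keys at once instead of failing mid-turn.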

Health Check

GET http://localhost:3000/health

Returns:

{ "status": "ok" }

Performance Characteristics

Expected Latency

Condition          STT -> First Audio
Warm turn          ~700–1400ms
Cold first turn    ~7000ms

Why cold start is slower

  • initial LLM provider request setup (Azure by default)
  • initial Cartesia TLS/WebSocket setup
  • one-time runtime warmup overhead

Warmup in src/lib/server/voice/voiceServerApp.ts helps reduce this for the first real user turn.


Extensibility Roadmap

Possible next steps:

  1. Provider abstraction for STT/TTS

    • support alternative STT providers
    • support alternative TTS providers
  2. Richer browser client

    • waveform UI
    • transcripts in real time
    • reconnect UX
  3. Session persistence

    • resumable voice sessions
    • persisted conversation history
  4. Voice personalization

    • user-selectable voices
    • language presets
    • speaking style controls
  5. Operational hardening

    • retries/backoff for TTS transport
    • structured metrics for per-turn latency
    • better provider fallback strategies