Real-time Voice Services
NeuroLink integrates the two major realtime voice APIs behind a single, provider-agnostic interface: OpenAI Realtime (openai-realtime) and Google Gemini Live (gemini-live). These let you build full-duplex voice agents where audio streams in and out simultaneously, with the model responding mid-utterance and calling tools in-flight.
Realtime voice is exposed through the RealtimeProcessor static class (not a method on the NeuroLink instance). For non-realtime synthesis and transcription, see the TTS Guide and STT Guide.
Overview
| Capability | openai-realtime | gemini-live |
|---|---|---|
| Provider value | "openai-realtime" | "gemini-live" |
| Transport | WebSocket | WebSocket / WebRTC |
| Modalities | audio in/out, text in/out | audio in/out, text, video |
| Tool calls | Yes (via onFunctionCall) | Yes (via onFunctionCall) |
| Interruption | Server-side VAD + manual cancel | Native barge-in + manual cancel |
Both APIs support concurrent audio input and output streams, so the user can interrupt the model mid-response and the model can stream audio while still listening for new input.
Quick Start (SDK)
The RealtimeProcessor is a static class — there is no new RealtimeProcessor() and no neurolink.openRealtimeSession(...) method. Connect with RealtimeProcessor.connect(provider, config, handlers):
import { RealtimeProcessor } from "@juspay/neurolink";
// OpenAI Realtime
const session = await RealtimeProcessor.connect(
"openai-realtime",
{
provider: "openai-realtime",
model: "gpt-4o-realtime-preview",
voice: "alloy",
instructions: "You are a helpful voice assistant.",
},
{
onAudio: (chunk) => speaker.write(chunk.audio),
onTranscript: (text, isFinal) => {
if (isFinal) console.log("User said:", text);
},
onError: (err) => console.error(err),
},
);
// Send audio chunks (PCM16 mono 24kHz, raw Buffer or RealtimeAudioChunk)
await RealtimeProcessor.sendAudio("openai-realtime", audioChunk);
// Send text input alongside audio
await RealtimeProcessor.sendText("openai-realtime", "What's the weather?");
// Manually request a model response (for manual turn detection)
await RealtimeProcessor.triggerResponse("openai-realtime");
// Cancel an in-progress response (barge-in)
await RealtimeProcessor.cancelResponse("openai-realtime");
// Close the session
await RealtimeProcessor.disconnect("openai-realtime");
// Gemini Live — same handler shape, just a different provider value
const geminiSession = await RealtimeProcessor.connect(
"gemini-live",
{
provider: "gemini-live",
model: "gemini-2.0-flash-live",
instructions: "Speak naturally and ask follow-up questions.",
},
{ onAudio, onTranscript, onError },
);
The handler shape is provider-agnostic: the same RealtimeEventHandlers object works across both providers, so you can switch with a single string change.
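One way to exploit that is to keep the provider choice in a single helper. The buildConfig function below is a hypothetical convenience, not an SDK export; the provider values and model names come from the examples above:

```typescript
type RealtimeProvider = "openai-realtime" | "gemini-live";

// Maps a provider value to a session config matching the Quick Start examples.
// Provider-specific fields (e.g. voice for OpenAI) can be layered on top.
function buildConfig(provider: RealtimeProvider, instructions: string) {
  const model =
    provider === "openai-realtime"
      ? "gpt-4o-realtime-preview"
      : "gemini-2.0-flash-live";
  return { provider, model, instructions };
}
```

Then connecting becomes RealtimeProcessor.connect(provider, buildConfig(provider, instructions), handlers), and switching providers is a one-string change in one place.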
Event handler reference
type RealtimeEventHandlers = {
onAudio?: (chunk: RealtimeAudioChunk) => void;
onTranscript?: (text: string, isFinal: boolean) => void;
onText?: (text: string, isFinal: boolean) => void;
onFunctionCall?: (
name: string,
args: Record<string, unknown>,
) => Promise<unknown>;
onStateChange?: (state: RealtimeSessionState) => void;
onError?: (error: Error) => void;
onTurnStart?: () => void;
onTurnEnd?: () => void;
};
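A common pattern with onTranscript is to buffer partial transcripts and commit a turn on the final event. A self-contained sketch (it assumes, as is typical, that the final event carries the full utterance rather than a delta — verify against your provider's behavior):

```typescript
// Collects finalized user turns while exposing the in-progress partial.
function makeTranscriptCollector() {
  const turns: string[] = [];
  let partial = "";
  const onTranscript = (text: string, isFinal: boolean) => {
    if (isFinal) {
      turns.push(text); // commit the completed utterance
      partial = "";
    } else {
      partial = text; // keep the latest partial for live captions
    }
  };
  return { onTranscript, turns, getPartial: () => partial };
}
```

Pass the returned onTranscript straight into the RealtimeEventHandlers object.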
Quick Start (CLI)
NeuroLink does not ship an interactive neurolink voice realtime CLI command. Instead, the realtime voice server is exposed via:
# Canonical: start the realtime voice WebSocket server
npx @juspay/neurolink serve voice --port 8081
# Deprecated alias (still works, prints a deprecation notice)
npx @juspay/neurolink voice-server --port 8081
Connect a browser/mobile client to ws://localhost:8081/voice to drive the session. The server bridges the client to the chosen provider (configured via env vars and per-session messages) and forwards events bidirectionally.
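Before forwarding mic audio over the socket, clients typically slice the PCM stream into fixed-size frames. A pure sketch (the 4800-byte frame is an assumption: at PCM16 mono 24kHz, 4800 bytes is 100 ms of audio):

```typescript
// Splits a PCM buffer into frames of at most frameBytes; the last frame may be
// shorter. subarray avoids copying the underlying buffer.
function chunkPcm(pcm: Uint8Array, frameBytes = 4800): Uint8Array[] {
  const frames: Uint8Array[] = [];
  for (let off = 0; off < pcm.length; off += frameBytes) {
    frames.push(pcm.subarray(off, Math.min(off + frameBytes, pcm.length)));
  }
  return frames;
}
```

Each frame can then be sent as a binary WebSocket message; consult the server's own protocol for any JSON envelope it expects around session setup.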
The TTS and STT flags on generate / stream (e.g. --tts, --stt, --input-audio) are for non-realtime synthesis and transcription — see TTS and STT.
Self-hosted Realtime Voice Server
For multi-tenant deployments — voice bots, IVR-style applications, in-app voice features — NeuroLink ships a real-time voice agent server. It bridges browser/mobile clients to provider realtime APIs with session management, observability, and tool routing.
// startVoiceServer is the canonical export
import { startVoiceServer } from "@juspay/neurolink/dist/lib/server/voice/voiceServerApp.js";
await startVoiceServer(8081);
Note: the server is a function export (startVoiceServer), not a NeuroLinkVoiceServer class. To run it from the CLI, prefer npx @juspay/neurolink serve voice --port 8081.
The server emits OTEL spans + Langfuse traces per session, supports HITL approvals on tool calls, and can be deployed standalone or behind your own gateway.
Provider Selection
| Use case | Recommended provider |
|---|---|
| English-first, broad voice catalog, GPT-4o reasoning | openai-realtime |
| Multilingual, video input, lowest latency in many regions | gemini-live |
| Customer support voice bots with structured tool calls | openai-realtime (more deterministic function calls) |
| In-app voice search / multimodal queries | gemini-live |
Either can be wrapped behind providerFallback so a model-access denial automatically falls through to the alternate model. See Provider Fallback — note that the orchestrator only triggers on access-denied errors, not on rate limits or generic failures.
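The access-denied-only behavior can be mirrored manually when wiring up a session. A sketch with the classifier injected, since how providerFallback detects access denial internally is not part of the public surface (the isAccessDenied predicate here is an assumption for illustration):

```typescript
// Tries the primary provider; on an access-denied error only, retries once
// with the fallback. All other errors propagate unchanged.
async function connectWithFallback<S>(
  connect: (provider: string) => Promise<S>,
  primary: string,
  fallback: string,
  isAccessDenied: (err: unknown) => boolean,
): Promise<S> {
  try {
    return await connect(primary);
  } catch (err) {
    if (isAccessDenied(err)) return connect(fallback); // fall through once
    throw err; // rate limits / generic failures are not retried
  }
}
```

Here connect would wrap RealtimeProcessor.connect with the matching per-provider config.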
Tool Calls Inside Realtime Sessions
Both providers can call functions registered with the realtime session. Use the onFunctionCall handler (not onToolCall — that name is reserved for the streaming-text API):
const session = await RealtimeProcessor.connect(
"openai-realtime",
{
provider: "openai-realtime",
model: "gpt-4o-realtime-preview",
tools: [
{
name: "lookupOrderStatus",
description: "Look up the status of an order",
parameters: {
type: "object",
properties: { id: { type: "string" } },
required: ["id"],
},
},
],
},
{
onFunctionCall: async (name, args) => {
if (name === "lookupOrderStatus") {
const order = await db.findOrder(args.id as string);
return { status: order.status, eta: order.eta };
}
// Return an explicit error payload for unknown tools instead of undefined,
// so the model can recover gracefully.
return { error: "unknown tool: " + name };
},
},
);
When HITL middleware is wired in front of the function-call handler, sensitive operations (e.g. cancelOrder, chargeCard) pause for human approval before responding back into the realtime stream.
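The gating idea can be sketched as a wrapper around the onFunctionCall handler. The requestApproval transport below is hypothetical — in practice it would be NeuroLink's HITL middleware or your own review queue:

```typescript
type FunctionHandler = (
  name: string,
  args: Record<string, unknown>,
) => Promise<unknown>;

// Wraps a function-call handler so that tools in the sensitive set must be
// approved by a human before executing; rejections are surfaced to the model
// as an error payload rather than an exception.
function withApproval(
  sensitive: Set<string>,
  requestApproval: (name: string, args: Record<string, unknown>) => Promise<boolean>,
  inner: FunctionHandler,
): FunctionHandler {
  return async (name, args) => {
    if (sensitive.has(name) && !(await requestApproval(name, args))) {
      return { error: "rejected by human reviewer" };
    }
    return inner(name, args);
  };
}
```

Non-sensitive tools pass through with no added latency, which matters in a realtime stream where the model is waiting on the result mid-conversation.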
Observability
Realtime sessions emit:
- session:start, session:end events with duration + token usage
- Per-utterance transcript:user, transcript:assistant events
- tool:call, tool:result events
- audio:in:bytes, audio:out:bytes for bandwidth tracking
These flow into the same OTEL/Langfuse pipeline as text generation. See the Observability Guide.
Status & Inspection
RealtimeProcessor.isConnected("openai-realtime"); // boolean
RealtimeProcessor.getProviders(); // string[] of registered providers
RealtimeProcessor.supports("openai-realtime"); // boolean
RealtimeProcessor.getSession("openai-realtime"); // RealtimeSession | null
RealtimeProcessor.getSupportedFormats("openai-realtime"); // TTSAudioFormat[]
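These inspection helpers are useful for defensive sends — dropping audio when the session has gone away instead of throwing mid-stream. A sketch typed structurally so it stands alone (RealtimeProcessor satisfies this shape per the calls above, assuming sendAudio accepts a raw Buffer as the Quick Start notes):

```typescript
interface ProcessorLike {
  isConnected(provider: string): boolean;
  sendAudio(provider: string, chunk: Buffer): Promise<void>;
}

// Sends only when the session is live; returns whether the chunk was sent.
async function sendIfConnected(
  p: ProcessorLike,
  provider: string,
  chunk: Buffer,
): Promise<boolean> {
  if (!p.isConnected(provider)) return false; // drop rather than throw
  await p.sendAudio(provider, chunk);
  return true;
}
```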
Related
- TTS Guide — non-realtime text-to-speech (5 providers)
- STT Guide — transcription (4 providers)
- Voice Agent Guide — building voice agents end-to-end
- Provider Fallback — failover between models on access denial
- Observability — wiring fallback events into your monitoring stack