Audio Input & Voice Conversations Guide
NeuroLink provides comprehensive audio input capabilities, enabling real-time voice conversations with AI models. This guide covers currently available features, audio specifications, and upcoming enhancements.
Overview
Currently Available
NeuroLink supports the following audio capabilities today:
- Real-time voice conversations via Gemini Live (Google AI Studio)
- Text-to-Speech (TTS) output via Google Cloud TTS integration
- WebSocket-based voice streaming for web applications
- Bidirectional audio - speak and hear AI responses in real-time
Planned
The following features are planned for future releases:
- CLI commands: neurolink audio transcribe, neurolink audio analyze, neurolink audio summarize
- CLI commands: neurolink voice chat, neurolink voice demo
- OpenAI Whisper transcription integration
- Cross-provider audio support (Anthropic, Azure, AWS)
- File-based audio input processing
Provider Support Matrix
| Provider | Real-time Voice | TTS Output | Audio Transcription | Status |
|---|---|---|---|---|
| Google AI Studio | Yes | Yes | Planned | Production Ready |
| Google Vertex AI | Planned | Yes | Planned | TTS Available |
| OpenAI | Planned | Planned | Planned | Planned |
| Anthropic | Planned | Planned | Planned | Planned |
| Azure OpenAI | Planned | Planned | Planned | Planned |
| AWS Bedrock | Planned | Planned | Planned | Planned |
Supported Model for Real-time Voice:
| Model | Provider | Capabilities |
|---|---|---|
| gemini-2.5-flash-preview-native-audio-dialog | Google AI | Bidirectional audio, low latency |
Quick Start: Real-Time Voice (SDK)
Real-time voice conversations are available through the SDK using Gemini Live's native audio dialog model.
Prerequisites
# Set your Google AI API key
export GOOGLE_AI_API_KEY="your-api-key"
# OR
export GEMINI_API_KEY="your-api-key"
Basic Real-time Voice Streaming
import { NeuroLink } from "@juspay/neurolink";
const neurolink = new NeuroLink();
// Create an async iterator for audio frames
// This example uses a hypothetical audio source
async function* getAudioFrames(): AsyncIterable<Buffer> {
// Your audio capture logic here
// Each frame should be PCM16LE mono at 16kHz
// Recommended frame size: 20-60ms of audio
while (capturing) {
const frame = await captureAudioFrame();
yield frame;
}
}
// Stream with real-time audio input
const result = await neurolink.stream({
provider: "google-ai",
model: "gemini-2.5-flash-preview-native-audio-dialog",
input: {
audio: {
frames: getAudioFrames(),
sampleRateHz: 16000, // Input sample rate (default: 16000)
encoding: "PCM16LE", // Encoding format (default: PCM16LE)
},
},
disableTools: true, // Required for Phase 1 audio streaming
});
// Process audio responses
for await (const event of result.stream) {
if (event.type === "audio") {
// Handle audio output chunk
// Output is PCM16LE mono at 24kHz
const audioData = event.audio.data;
playAudio(audioData);
}
}
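For testing without a microphone, the frame iterator can read from a raw PCM file instead. A minimal sketch, assuming input.pcm contains headerless PCM16LE mono audio at 16 kHz (the file path and pacing are illustrative, not SDK requirements):

import { readFileSync } from "fs";

// 20 ms of PCM16LE mono at 16 kHz: 16000 samples/s * 0.02 s * 2 bytes = 640 bytes
const FRAME_BYTES = 640;

async function* framesFromFile(path: string): AsyncIterable<Buffer> {
  const pcm = readFileSync(path); // raw PCM16LE, 16 kHz, mono (no WAV header)
  for (let offset = 0; offset < pcm.length; offset += FRAME_BYTES) {
    yield pcm.subarray(offset, offset + FRAME_BYTES);
    // Pace frames roughly in real time so the live session is not flooded
    await new Promise((resolve) => setTimeout(resolve, 20));
  }
}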
Complete Voice Session Example
import { NeuroLink } from "@juspay/neurolink";
const neurolink = new NeuroLink();
async function startVoiceSession() {
// Audio frame queue management
const frameQueue: Buffer[] = [];
let isSessionActive = true;
// Create async iterator from queue
const audioFramesIterator: AsyncIterable<Buffer> = {
[Symbol.asyncIterator]() {
return {
async next() {
if (!isSessionActive) {
return { value: undefined, done: true };
}
// Wait for frames to be available
while (frameQueue.length === 0 && isSessionActive) {
await new Promise((resolve) => setTimeout(resolve, 10));
}
if (frameQueue.length > 0) {
return { value: frameQueue.shift()!, done: false };
}
return { value: undefined, done: true };
},
};
},
};
// Start the streaming session
const streamResult = await neurolink.stream({
provider: "google-ai",
model: "gemini-2.5-flash-preview-native-audio-dialog",
input: {
audio: {
frames: audioFramesIterator,
sampleRateHz: 16000,
encoding: "PCM16LE",
},
},
disableTools: true,
});
// Function to add captured audio to queue
function onAudioCaptured(pcmBuffer: Buffer) {
frameQueue.push(pcmBuffer);
}
// Function to signal end of input (flush)
function flushAudio() {
// Push a zero-length buffer as flush signal
frameQueue.push(Buffer.alloc(0));
}
// Process responses
for await (const event of streamResult.stream) {
if (event.type === "audio") {
// Output audio data: PCM16LE, 24kHz, mono
handleAudioOutput(event.audio.data);
}
}
isSessionActive = false;
}
function handleAudioOutput(audioBuffer: Buffer) {
// Play or process the audio response
// Sample rate: 24000 Hz
// Format: PCM16LE mono
playAudioBuffer(audioBuffer);
}
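The 10 ms polling loop above keeps the example short, but it wastes cycles and adds up to 10 ms of input latency. One alternative is a promise-based queue in which the consumer awaits the next frame instead of polling. A minimal sketch (FrameQueue is a hypothetical helper, not part of the SDK):

// A small async frame queue: push() wakes any waiting consumer immediately,
// close() ends the iteration.
class FrameQueue {
  private frames: Buffer[] = [];
  private waiters: Array<(frame: Buffer | null) => void> = [];
  private closed = false;

  push(frame: Buffer) {
    const waiter = this.waiters.shift();
    if (waiter) waiter(frame);
    else this.frames.push(frame);
  }

  close() {
    this.closed = true;
    for (const waiter of this.waiters.splice(0)) waiter(null);
  }

  async *[Symbol.asyncIterator](): AsyncGenerator<Buffer> {
    while (true) {
      if (this.frames.length > 0) {
        yield this.frames.shift()!;
      } else if (this.closed) {
        return;
      } else {
        // Suspend until push() or close() resolves this promise
        const frame = await new Promise<Buffer | null>((resolve) =>
          this.waiters.push(resolve),
        );
        if (frame === null) return;
        yield frame;
      }
    }
  }
}

An instance can be passed directly as frames (it satisfies AsyncIterable<Buffer>); onAudioCaptured becomes queue.push(pcmBuffer) and session teardown calls queue.close().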
Quick Start: TTS Integration
NeuroLink provides Text-to-Speech output via Google Cloud TTS. TTS can be combined with any text generation.
CLI Usage
# Generate text and convert to speech
neurolink generate "Hello, world!" \
--provider google-ai \
--tts-voice en-US-Neural2-C
# Save audio to file
neurolink generate "Welcome to NeuroLink" \
--provider google-ai \
--tts-voice en-US-Neural2-C \
--tts-output welcome.mp3
# Customize voice parameters
neurolink generate "This is a test" \
--provider google-ai \
--tts-voice en-US-Wavenet-D \
--tts-speed 1.2 \
--tts-pitch 2.0 \
--tts-format mp3 \
--tts-output test.mp3
# Synthesize AI response (not input text)
neurolink generate "Tell me a joke" \
--provider google-ai \
--tts-voice en-US-Neural2-C \
--tts-use-ai-response \
--tts-output joke.mp3
SDK Usage
import { NeuroLink } from "@juspay/neurolink";
import { writeFileSync } from "fs";
const neurolink = new NeuroLink();
// Basic TTS
const result = await neurolink.generate({
input: { text: "Hello, world!" },
provider: "google-ai",
tts: {
enabled: true,
voice: "en-US-Neural2-C",
format: "mp3",
play: true, // Auto-play in CLI
},
});
// Save TTS audio
if (result.tts?.buffer) {
writeFileSync("output.mp3", result.tts.buffer);
console.log(`Audio saved: ${result.tts.size} bytes`);
}
// Advanced TTS with AI response synthesis
const aiResponse = await neurolink.generate({
input: { text: "Explain quantum computing briefly" },
provider: "google-ai",
tts: {
enabled: true,
useAiResponse: true, // Synthesize AI's response
voice: "en-US-Wavenet-D",
format: "mp3",
speed: 0.9,
pitch: -2.0,
},
});
console.log("Text:", aiResponse.content);
console.log("Audio size:", aiResponse.tts?.size, "bytes");
For comprehensive TTS documentation, see the TTS Integration Guide.
Voice Demo Example
NeuroLink includes a complete voice demo application demonstrating real-time bidirectional audio conversations.
Location
examples/voice-demo/
  server.mjs     # WebSocket server with NeuroLink integration
  public/
    index.html   # Web interface
    client.js    # Browser audio capture and playback
Running the Demo
# Navigate to the project root
cd /path/to/neurolink
# Build the SDK first
pnpm run build
# Set your API key
export GOOGLE_AI_API_KEY="your-api-key"
# Run the demo server
node examples/voice-demo/server.mjs
The demo will:
- Start a WebSocket server on port 5175 (or next available port)
- Open your browser automatically to the demo interface
- Allow you to speak and receive real-time AI audio responses
Demo Architecture
Browser (client.js)
|
| WebSocket (ws://localhost:5175/ws)
|
v
Server (server.mjs)
|
| neurolink.stream()
|
v
Gemini Live API
|
| PCM16LE audio chunks
|
v
Server -> Browser -> Audio playback
Key Code from Voice Demo Server
// From examples/voice-demo/server.mjs
const streamResult = await neurolink.stream({
provider: "google-ai",
model:
process.env.GEMINI_MODEL || "gemini-2.5-flash-preview-native-audio-dialog",
input: {
audio: {
frames: framesFromClient,
// sampleRateHz defaults to 16000
// encoding defaults to 'PCM16LE'
},
},
disableTools: true, // Required for audio streaming
});
// Stream audio responses back to client
for await (const ev of streamResult.stream) {
if (ev.type === "audio") {
// Send raw PCM16LE bytes back to the client
ws.send(ev.audio.data, { binary: true });
}
}
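The browser side mirrors this loop. The actual client.js handles more edge cases, but the core pattern looks roughly like the sketch below (sendFrame and the scheduling logic are illustrative; floatTo16BitPCM and pcm16ToFloat32 are defined under Converting Audio Formats):

const ws = new WebSocket("ws://localhost:5175/ws");
ws.binaryType = "arraybuffer";

// Upstream: convert captured Float32 frames to PCM16LE and send them
function sendFrame(float32Frame: Float32Array) {
  ws.send(floatTo16BitPCM(float32Frame));
}

// Downstream: schedule incoming 24 kHz PCM16LE chunks back-to-back
const outputCtx = new AudioContext({ sampleRate: 24000 });
let playhead = 0;

ws.onmessage = (msg) => {
  const float32 = pcm16ToFloat32(msg.data as ArrayBuffer);
  const buffer = outputCtx.createBuffer(1, float32.length, 24000);
  buffer.copyToChannel(float32, 0);
  const source = outputCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(outputCtx.destination);
  playhead = Math.max(playhead, outputCtx.currentTime);
  source.start(playhead); // queue after the previous chunk to avoid gaps
  playhead += buffer.duration;
};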
Audio Specifications
Input Audio Format
| Parameter | Value | Notes |
|---|---|---|
| Encoding | PCM16LE | 16-bit signed integer, little-endian |
| Sample Rate | 16,000 Hz | 16 kHz mono |
| Channels | 1 (mono) | Stereo not supported in Phase 1 |
| Frame Size | 20-60ms recommended | ~320-960 samples per frame |
| Byte Order | Little-endian | Intel/ARM standard |
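To size frames, multiply the sample rate by the frame duration and by 2 bytes per 16-bit sample. A small illustrative helper (frameBytes is not an SDK function):

// Bytes per frame for PCM16LE mono audio
function frameBytes(sampleRateHz: number, frameMs: number): number {
  return Math.round((sampleRateHz * frameMs) / 1000) * 2;
}

frameBytes(16000, 20); // 640 bytes  (20 ms, the low end of the recommendation)
frameBytes(16000, 60); // 1920 bytes (60 ms, the high end)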
Output Audio Format
| Parameter | Value | Notes |
|---|---|---|
| Encoding | PCM16LE | 16-bit signed integer, little-endian |
| Sample Rate | 24,000 Hz | 24 kHz mono |
| Channels | 1 (mono) | Single channel output |
| Byte Order | Little-endian | Intel/ARM standard |
Converting Audio Formats
From Float32 to PCM16LE (for input):
function floatTo16BitPCM(float32Array) {
const length = float32Array.length;
const buffer = new ArrayBuffer(length * 2);
const view = new DataView(buffer);
for (let i = 0; i < length; i++) {
// Clamp value to [-1, 1]
let sample = Math.max(-1, Math.min(1, float32Array[i]));
// Convert to 16-bit signed integer
view.setInt16(i * 2, sample < 0 ? sample * 0x8000 : sample * 0x7fff, true);
}
return buffer;
}
From PCM16LE to Float32 (for output playback):
function pcm16ToFloat32(pcm16Buffer) {
const dataInt16 = new Int16Array(pcm16Buffer);
const dataFloat32 = new Float32Array(dataInt16.length);
for (let i = 0; i < dataInt16.length; i++) {
dataFloat32[i] = dataInt16[i] / 32768.0;
}
return dataFloat32;
}
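A quick round-trip through both converters shows the expected 16-bit quantization behavior (values are approximate):

const input = new Float32Array([0, 0.5, -0.5, 1, -1]);
const roundTripped = pcm16ToFloat32(floatTo16BitPCM(input));
// roundTripped ≈ [0, 0.49997, -0.5, 0.99997, -1]
// +1.0 maps to 32767 (0x7fff); -1.0 maps to -32768 (0x8000)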
Browser Audio Context Setup
// Input context at 16kHz for capturing
const inputCtx = new AudioContext({ sampleRate: 16000 });
// Output context at 24kHz for playback
const outputCtx = new AudioContext({ sampleRate: 24000 });
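To capture microphone input as Float32 frames, one common pattern is getUserMedia plus a processing node. A minimal sketch using ScriptProcessorNode for brevity (it is deprecated; production code should prefer an AudioWorklet, and onFrame is a hypothetical handler, e.g. convert with floatTo16BitPCM above and send over the WebSocket):

async function startCapture(onFrame: (frame: Float32Array) => void) {
  const inputCtx = new AudioContext({ sampleRate: 16000 });
  const media = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = inputCtx.createMediaStreamSource(media);

  // 512 samples at 16 kHz = 32 ms per callback, within the 20-60 ms guidance
  const processor = inputCtx.createScriptProcessor(512, 1, 1);
  processor.onaudioprocess = (event) => {
    // Copy the channel data: the underlying buffer is reused between callbacks
    onFrame(new Float32Array(event.inputBuffer.getChannelData(0)));
  };
  source.connect(processor);
  processor.connect(inputCtx.destination);
}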
SDK API Reference
AudioInputSpec
Configuration for streaming audio input.
type AudioInputSpec = {
/**
* Async iterator yielding PCM16LE audio frames
* Each frame should be 20-60ms of audio (mono)
*/
frames: AsyncIterable<Buffer>;
/**
* Input sample rate in Hz
* @default 16000
*/
sampleRateHz?: number;
/**
* Audio encoding format
* @default "PCM16LE"
*/
encoding?: "PCM16LE";
/**
* Number of audio channels
* Phase 1 only supports mono
* @default 1
*/
channels?: 1;
};
AudioChunk
Audio output chunk received from streaming responses.
type AudioChunk = {
/**
* Raw audio data buffer (PCM16LE format)
*/
data: Buffer;
/**
* Sample rate of the audio data
* Gemini typically outputs at 24000 Hz
*/
sampleRateHz: number;
/**
* Number of audio channels (typically 1 for mono)
*/
channels: number;
/**
* Audio encoding format
*/
encoding: "PCM16LE";
};
StreamOptions with Audio
type StreamOptions = {
input: {
text?: string; // Optional when audio input is provided
audio?: AudioInputSpec; // Optional audio input
// ... other input options
};
provider: string;
model?: string;
disableTools?: boolean; // Required true for audio streaming
// ... other options
};
Stream Result Events
// Stream yields different event types
type StreamEvent =
| { content: string } // Text chunk
| { type: "audio"; audio: AudioChunk } // Audio chunk
| { type: "image"; imageOutput: { base64: string } }; // Image output
// Usage
for await (const event of result.stream) {
if ("content" in event) {
// Text content
console.log(event.content);
} else if (event.type === "audio") {
// Audio data
playAudio(event.audio.data);
}
}
AudioContent (File-based - Future)
For file-based audio input (planned feature).
type AudioContent = {
type: "audio";
data: Buffer | string; // Buffer, base64, URL, or file path
mediaType?:
| "audio/mpeg" // MP3
| "audio/wav" // WAV
| "audio/ogg" // OGG
| "audio/webm" // WebM
| "audio/aac" // AAC
| "audio/flac" // FLAC
| "audio/mp4"; // M4A
metadata?: {
filename?: string;
duration?: number; // in seconds
sampleRate?: number;
channels?: number;
transcription?: string; // Pre-existing transcription
};
};
Roadmap
Phase 1 (Current)
- Real-time voice with Gemini Live
- Bidirectional audio streaming via SDK
- Voice demo example application
- TTS output integration
Phase 2 (Planned)
- CLI Voice Commands

  # Start interactive voice chat
  neurolink voice chat --provider google-ai

  # Launch voice demo server
  neurolink voice demo --port 5175

- Audio Transcription

  # Transcribe audio file
  neurolink audio transcribe recording.mp3 --provider openai

  # Analyze audio content
  neurolink audio analyze podcast.mp3 --prompt "Summarize key points"
Phase 3 (Planned)
- OpenAI Whisper Integration

  const transcription = await neurolink.transcribe({
    audioFile: "./recording.mp3",
    provider: "openai",
    model: "whisper-1",
    language: "en",
  });

- Cross-provider Audio Support
  - Anthropic voice capabilities
  - Azure Speech Services
  - AWS Transcribe

- File-based Audio Input

  const result = await neurolink.generate({
    input: {
      text: "Analyze this audio file",
      audioFiles: ["./meeting.mp3"],
    },
    provider: "openai",
  });
Environment Setup
Required Environment Variables
# For Google AI Studio (Gemini Live)
export GOOGLE_AI_API_KEY="your-api-key"
# OR
export GEMINI_API_KEY="your-api-key"
# For TTS (Google Cloud)
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
# OR use the same GOOGLE_AI_API_KEY with Cloud TTS API enabled
API Key Configuration
For Gemini Live and TTS to work with an API key:
- Go to Google Cloud Console > APIs & Services > Credentials
- Create or select your API key
- Under "API restrictions", enable:
- Generative Language API (for Gemini)
- Cloud Text-to-Speech API (for TTS output)
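If you prefer the command line, the same APIs can typically be enabled with gcloud (assumes the gcloud CLI is installed and a project is configured; the service names shown are the standard Google Cloud identifiers):

# Enable the required APIs for the active project
gcloud services enable generativelanguage.googleapis.com
gcloud services enable texttospeech.googleapis.com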
Troubleshooting
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| No audio output | Missing API key | Set GOOGLE_AI_API_KEY or GEMINI_API_KEY |
| "disableTools required" | Tools enabled with audio | Add disableTools: true to stream options |
| Choppy audio playback | Buffer underrun | Increase playback buffering or use larger frames |
| Wrong sample rate | Mismatched audio context | Use 16kHz input, 24kHz output contexts |
| WebSocket disconnects | Network timeout | Implement reconnection logic |
| "Model not found" | Invalid model name | Use gemini-2.5-flash-preview-native-audio-dialog |
Audio Quality Issues
Clipping/Distortion:
- Ensure input samples are normalized to [-1, 1] range
- Check gain levels before PCM conversion
Echo/Feedback:
- Mute microphone during AI audio playback
- Implement voice activity detection (VAD); see the sketch below
Latency:
- Use smaller frame sizes (20ms)
- Process audio in real-time, avoid buffering
- Use WebSocket for low-latency transport
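A minimal energy-based VAD sketch; the RMS threshold is an illustrative value that needs tuning per microphone, and real VADs are considerably more sophisticated:

// Returns true when a PCM16LE frame's RMS energy exceeds the threshold
function isSpeech(frame: Buffer, threshold = 500): boolean {
  const samples = frame.length / 2;
  if (samples === 0) return false;
  let sumSquares = 0;
  for (let i = 0; i < samples; i++) {
    const sample = frame.readInt16LE(i * 2);
    sumSquares += sample * sample;
  }
  return Math.sqrt(sumSquares / samples) > threshold;
}

Frames that fail the check can be dropped before they reach the frame queue, which also helps suppress echo during AI playback.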
Debug Mode
Enable debug logging to troubleshoot audio issues:
export NEUROLINK_DEBUG=true
const neurolink = new NeuroLink({
debug: true,
});
Related Features
Audio & Voice:
- TTS Integration Guide - Complete Text-to-Speech documentation
- Video Generation - AI-powered video with audio
- PPT Generation - AI-powered PowerPoint presentations
Multimodal Capabilities:
- Multimodal Guide - Images, PDFs, CSV inputs
- PDF Support - Document processing
Advanced Features:
- Streaming - Stream AI responses in real-time
- Provider Orchestration - Multi-provider failover
Documentation:
- CLI Commands - Complete CLI reference
- SDK API Reference - Full API documentation
- Troubleshooting - Extended error catalog
Summary
NeuroLink's audio input capabilities provide:
Currently Available:
- Real-time voice conversations via Gemini Live
- Bidirectional audio streaming (speak and hear)
- TTS output via Google Cloud
- Voice demo example application
- PCM16LE audio format support
Planned:
- CLI voice commands (voice chat, audio transcribe)
- OpenAI Whisper transcription
- Cross-provider audio support
- File-based audio processing
Next Steps:
- Set up environment variables
- Try the voice demo application
- Integrate real-time voice in your SDK code
- Explore TTS output for text-to-speech
- Check troubleshooting if you encounter issues