Multimodal Capabilities Guide

NeuroLink provides comprehensive multimodal support, allowing you to combine text with various media types in a single AI interaction. This guide covers all supported input types, provider capabilities, and best practices.

Overview

Supported Input Types:

  • Images - JPEG, PNG, GIF, WebP, HEIC (vision-capable models)
  • PDFs - Document analysis and content extraction
  • CSV/Spreadsheets - Data analysis and tabular content processing
  • Audio - Transcription, analysis, and real-time voice input (Audio Input Guide)
  • Documents - Excel, Word, RTF, OpenDocument formats (File Processors Guide)
  • Data Files - JSON, YAML, XML with validation and formatting
  • Markup - HTML, SVG, Markdown with security sanitization
  • Source Code - 50+ programming languages with syntax detection

All multimodal inputs work seamlessly across both the CLI and SDK, with automatic format detection and provider-specific optimization.
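As a rough sketch of how extension-based auto-detection can work (the names and mappings below are illustrative, not NeuroLink's internal API):

```typescript
// Hypothetical extension-based file-kind detection, mirroring the
// auto-detection behavior described above. Illustrative only.
type FileKind = "image" | "pdf" | "csv" | "audio" | "data" | "unknown";

const EXTENSION_MAP: Record<string, FileKind> = {
  jpg: "image", jpeg: "image", png: "image", gif: "image", webp: "image", heic: "image",
  pdf: "pdf",
  csv: "csv",
  mp3: "audio", wav: "audio",
  json: "data", yaml: "data", xml: "data",
};

function detectFileKind(path: string): FileKind {
  // Take the text after the last dot, lower-cased; no dot means no extension
  const ext = path.split(".").pop()?.toLowerCase() ?? "";
  return EXTENSION_MAP[ext] ?? "unknown";
}
```

A real implementation would typically also sniff magic bytes, since extensions can lie.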

New in 2026: NeuroLink now supports 17+ file types through the ProcessorRegistry system. See the File Processors Guide for comprehensive documentation.


Provider Support Matrix

Not all providers support all multimodal capabilities. Use this matrix to select the right provider for your use case.

Vision (Images)

| Provider | Supported | Recommended Models | Max Images | Max Size | Notes |
| --- | --- | --- | --- | --- | --- |
| OpenAI | ✅ | gpt-4o, gpt-4o-mini, gpt-5.2 | 10 | ~20 MB | Best for general vision tasks |
| Azure OpenAI | ✅ | gpt-4o, gpt-4o-mini | 10 | ~20 MB | Same as OpenAI |
| Google AI Studio | ✅ | gemini-2.5-pro, gemini-2.5-flash, gemini-3-flash | 16 | ~20 MB | Excellent for visual reasoning |
| Google Vertex AI | ✅ | gemini-2.5-pro, gemini-2.5-flash, Claude models | 16/20 | ~20 MB | Gemini: 16 images, Claude: 20 images |
| Anthropic | ✅ | claude-3.5-sonnet, claude-3.7-sonnet | 20 | ~20 MB | Strong visual understanding |
| AWS Bedrock | ✅ | Claude models | 20 | ~20 MB | Same as Anthropic |
| Ollama | ✅ | llava, bakllava, llava-phi3 | 10 | Varies | Local vision models |
| LiteLLM | ✅ | Depends on upstream | 10 | Varies | Proxy to vision-capable models |
| Mistral | ✅ | pixtral-12b-2409, pixtral-large-2411 | 10 | ~20 MB | Multimodal Mistral models |
| OpenRouter | ✅ | Depends on model | 10 | Varies | Routes to various vision models |
| Hugging Face | ⚠️ | Limited | Varies | Varies | Model-dependent |
| AWS SageMaker | ❌ | N/A | - | - | Not supported |
| OpenAI Compatible | ⚠️ | Depends on endpoint | Varies | Varies | Server-dependent |

Legend:

  • ✅ Full support with multiple models
  • ⚠️ Limited or server-dependent support
  • ❌ Not supported
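The matrix above can be encoded as a small lookup for programmatic provider selection. A minimal sketch, assuming the provider ID strings used elsewhere in this guide (`openai`, `google-ai`, `vertex`, etc.); this is not an official NeuroLink API:

```typescript
// Illustrative capability map derived from the vision matrix above.
type VisionSupport = "full" | "limited" | "none";

const VISION_SUPPORT: Record<string, VisionSupport> = {
  "openai": "full",
  "azure-openai": "full",
  "google-ai": "full",
  "vertex": "full",
  "anthropic": "full",
  "bedrock": "full",
  "ollama": "full",
  "litellm": "full",
  "mistral": "full",
  "openrouter": "full",
  "huggingface": "limited",
  "sagemaker": "none",
  "openai-compatible": "limited",
};

function supportsVision(provider: string): boolean {
  // Unknown providers are treated as unsupported
  return (VISION_SUPPORT[provider] ?? "none") !== "none";
}
```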

PDF Documents

| Provider | Supported | Max Size | Max Pages | Processing Mode | Notes |
| --- | --- | --- | --- | --- | --- |
| Google Vertex AI | ✅ | 5 MB | 100 | Native PDF | Best for document analysis |
| Anthropic | ✅ | 5 MB | 100 | Native PDF | Claude excels at document understanding |
| AWS Bedrock | ✅ | 5 MB | 100 | Native PDF | Via Claude models |
| Google AI Studio | ✅ | 2000 MB | 100 | Native PDF | Handles very large files |
| OpenAI | ✅ | 10 MB | 100 | Files API | gpt-4o, gpt-4o-mini, o1 |
| Azure OpenAI | ✅ | 10 MB | 100 | Files API | Uses OpenAI Files API |
| LiteLLM | ✅ | 10 MB | 100 | Proxy | Depends on upstream model |
| OpenAI Compatible | ✅ | 10 MB | 100 | Varies | Server-dependent |
| Mistral | ✅ | 10 MB | 100 | Native PDF | Native support |
| Hugging Face | ✅ | 10 MB | 100 | Model-dependent | Varies by model |
| Ollama | ❌ | - | - | - | Not supported |
| OpenRouter | ⚠️ | Varies | Varies | Depends on model | Route-dependent |
| AWS SageMaker | ❌ | - | - | - | Not supported |

CSV/Spreadsheet Data

| Provider | Supported | Max Rows | Format Options | Notes |
| --- | --- | --- | --- | --- |
| All Providers | ✅ | 10,000 | raw, json, markdown | Universal support - processed as text |

CSV support works with all providers because files are converted to text before sending to the AI model. The file is parsed and formatted (raw CSV, JSON, or Markdown table) before inclusion in the prompt.

Format Recommendations:

  • Raw format - Best for large files (minimal token usage)
  • JSON format - Best for structured data processing
  • Markdown format - Best for small datasets (<100 rows), readable tables
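The recommendations above can be expressed as a simple selection helper. A minimal sketch (thresholds follow this guide; the function name is illustrative, not part of the SDK):

```typescript
// Pick a CSV format style per the guidance above.
type CsvFormat = "raw" | "json" | "markdown";

function recommendCsvFormat(rowCount: number, needsStructuredProcessing = false): CsvFormat {
  if (needsStructuredProcessing) return "json"; // easiest for the model to manipulate
  if (rowCount < 100) return "markdown";        // small datasets: readable tables
  return "raw";                                 // large files: minimal token usage
}
```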

Audio Input

| Provider | Native Audio | Transcription | Real-time | Max Duration | Notes |
| --- | --- | --- | --- | --- | --- |
| Google AI Studio | ✅ | ✅ | ✅ | 1 hour | Best for real-time voice |
| Google Vertex AI | ✅ | ✅ | ✅ | 1 hour | Native Gemini audio support |
| OpenAI | ❌ | ✅ Whisper | ❌ | 25 MB | Excellent transcription accuracy |
| Azure OpenAI | ❌ | ✅ Whisper | ❌ | 25 MB | Via Whisper integration |
| Anthropic | ❌ | Via fallback | ❌ | - | Uses transcription approach |
| AWS Bedrock | ❌ | Via fallback | ❌ | - | Uses transcription approach |
| Others | ❌ | Via fallback | ❌ | - | Audio transcribed before processing |

For comprehensive audio documentation, see the Audio Input Guide.


Image Input

Quick Start

CLI:

# Single image
npx @juspay/neurolink generate "Describe this interface" \
--image ./designs/dashboard.png --provider google-ai

# Remote URL
npx @juspay/neurolink generate "Analyze this diagram" \
--image https://example.com/architecture.png --provider openai

# Multiple images
npx @juspay/neurolink generate "Compare these screenshots" \
--image ./before.png \
--image ./after.png \
--provider anthropic

SDK:

import { readFileSync } from "node:fs";
import { NeuroLink } from "@juspay/neurolink";

const neurolink = new NeuroLink({ enableOrchestration: true });

const result = await neurolink.generate({
  input: {
    text: "Analyze these product screenshots",
    images: [
      readFileSync("./homepage.png"), // Local file as Buffer
      "https://example.com/chart.png", // Remote URL
    ],
  },
  provider: "google-ai",
});

Image Formats Supported

Accepted formats:

  • JPEG (.jpg, .jpeg)
  • PNG (.png)
  • GIF (.gif)
  • WebP (.webp)
  • HEIC (.heic, .heif) - iOS photos

Input methods:

  • Buffer objects - readFileSync() from Node.js
  • Local file paths - Relative or absolute paths
  • HTTPS URLs - Remote images (auto-downloaded)
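The three input methods can be distinguished with a small classifier. A sketch of the idea (not NeuroLink's internal logic; the function name is illustrative):

```typescript
// Classify an image input as Buffer, remote HTTPS URL, or local path,
// matching the three accepted input methods listed above.
type ImageSource = "buffer" | "url" | "path";

function classifyImageInput(input: Buffer | string): ImageSource {
  if (Buffer.isBuffer(input)) return "buffer";
  if (/^https:\/\//i.test(input)) return "url"; // remote images are auto-downloaded
  return "path"; // relative or absolute local path
}
```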

Image Alt Text (Accessibility)

NeuroLink supports alt text for images, improving accessibility and providing additional context to AI models.

const result = await neurolink.generate({
  input: {
    text: "Compare these revenue charts",
    images: [
      {
        data: readFileSync("./q1-revenue.png"),
        altText: "Q1 2024 revenue chart showing 15% growth",
      },
      {
        data: "https://example.com/q2-revenue.png",
        altText: "Q2 2024 revenue chart showing 22% growth",
      },
    ],
  },
  provider: "openai",
});

Alt text best practices:

  • Keep concise (under 125 characters ideal)
  • Focus on key information the image conveys
  • Alt text is automatically included as context in prompts
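These guidelines can be enforced with a trivial validation helper, shown here as an illustrative sketch (not part of the SDK):

```typescript
// Check alt text against the guidance above: non-empty and concise.
function isGoodAltText(altText: string, maxLength = 125): boolean {
  const trimmed = altText.trim();
  return trimmed.length > 0 && trimmed.length <= maxLength;
}
```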

Image Size Limits

Provider-specific limits:

  • Most providers: ~20 MB per image
  • Recommended: Resize images to < 2 MP for faster processing
  • Token usage: ~7,000 tokens per image (varies by provider)

Optimization tips:

  • Compress images before sending for large batches
  • Use appropriate resolution (1920x1080 often sufficient)
  • Pre-process images to reduce unnecessary detail
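The resolution guidance above amounts to a simple megapixel check before sending. A sketch (the helper name and the 2 MP default are taken from this guide's recommendation, not from the SDK):

```typescript
// Flag images larger than ~2 megapixels for pre-processing,
// per the optimization tips above. Dimensions are in pixels.
function shouldResize(width: number, height: number, maxMegapixels = 2): boolean {
  const megapixels = (width * height) / 1_000_000;
  return megapixels > maxMegapixels;
}
```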

PDF Document Input

Quick Start

CLI:

# Auto-detect PDF
npx @juspay/neurolink generate "Summarize this report" \
--file ./financial-report.pdf --provider vertex

# Explicit PDF
npx @juspay/neurolink generate "Extract key terms from contract" \
--pdf ./contract.pdf --provider anthropic

# Multiple PDFs
npx @juspay/neurolink generate "Compare these documents" \
--pdf ./version1.pdf \
--pdf ./version2.pdf \
--provider vertex

SDK:

// Auto-detect (recommended)
await neurolink.generate({
  input: {
    text: "Analyze this document",
    files: ["./report.pdf", "./data.csv"], // Mixed file types
  },
  provider: "vertex",
});

// Explicit PDF
await neurolink.generate({
  input: {
    text: "Compare Q1 and Q2 reports",
    pdfFiles: ["./q1-report.pdf", "./q2-report.pdf"],
  },
  provider: "anthropic",
});

PDF Processing Modes

Provider-specific approaches:

ProviderModeToken UsageBest For
Vertex AI, Anthropic, BedrockNative PDF~1,000 tokens/3 pagesVisual + text extraction
Google AI StudioNative PDF~1,000 tokens/3 pagesLarge files (up to 2 GB)
OpenAI, AzureFiles API~1,000 tokens/3 pagesText-only mode optimal

Visual vs. Text-only mode:

  • Visual mode: Preserves layout, tables, charts (~7,000 tokens/3 pages)
  • Text-only mode: Extracts text content only (~1,000 tokens/3 pages)
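Using the per-mode figures above, a back-of-the-envelope token estimate looks like this (illustrative only; real usage varies by provider and document):

```typescript
// Estimate PDF token usage from page count and processing mode,
// using the rough figures above: ~7,000 tokens per 3 pages (visual),
// ~1,000 tokens per 3 pages (text-only).
function estimatePdfTokens(pages: number, mode: "visual" | "text"): number {
  const tokensPerThreePages = mode === "visual" ? 7000 : 1000;
  return Math.ceil((pages / 3) * tokensPerThreePages);
}
```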

PDF Best Practices

  • Choose the right provider: Vertex AI or Anthropic for best results
  • Check file size: Most providers limit to 5 MB (AI Studio supports 2 GB)
  • Use streaming: For large documents, streaming provides faster initial results
  • Combine with other files: Mix PDFs with CSV data and images
  • Be specific in prompts: "Extract all monetary values" vs. "Tell me about this PDF"
  • Set appropriate token limits: Recommended 2000-8000 tokens for PDF analysis
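The size-limit check can be done as a pre-flight step before uploading. A minimal sketch using the limits from the PDF matrix above (provider keys and function name are illustrative):

```typescript
// Pre-flight check of a PDF's size against per-provider limits (in MB),
// taken from the PDF support matrix in this guide.
const PDF_SIZE_LIMIT_MB: Record<string, number> = {
  "vertex": 5,
  "anthropic": 5,
  "bedrock": 5,
  "google-ai": 2000,
  "openai": 10,
  "azure": 10,
  "mistral": 10,
};

function pdfFitsProvider(provider: string, fileSizeBytes: number): boolean {
  const limitMb = PDF_SIZE_LIMIT_MB[provider];
  if (limitMb === undefined) return false; // unknown or unsupported provider
  return fileSizeBytes <= limitMb * 1024 * 1024;
}
```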

CSV/Spreadsheet Input

Quick Start

CLI:

# Auto-detect CSV
npx @juspay/neurolink generate "Analyze sales trends" \
--file ./sales_2024.csv

# Explicit CSV with options
npx @juspay/neurolink generate "Summarize data" \
--csv ./data.csv \
--csv-max-rows 500 \
--csv-format raw

SDK:

// Auto-detect (recommended)
await neurolink.generate({
  input: {
    text: "Analyze this sales data",
    files: ["./sales.csv"], // Auto-detected as CSV
  },
});

// Explicit CSV with options
await neurolink.generate({
  input: {
    text: "Compare quarterly data",
    csvFiles: ["./q1.csv", "./q2.csv"],
  },
  csvOptions: {
    maxRows: 1000,
    formatStyle: "json", // or "raw", "markdown"
  },
});

CSV Format Options

Three format styles:

  1. Raw format (default)

     • Best for large files
     • Minimal token usage
     • Preserves original CSV structure

     name,age,city
     Alice,30,NYC
     Bob,25,LA

  2. JSON format

     • Structured data processing
     • Easier for AI to parse
     • Higher token usage

     [
       { "name": "Alice", "age": 30, "city": "NYC" },
       { "name": "Bob", "age": 25, "city": "LA" }
     ]

  3. Markdown format

     • Readable tables
     • Good for small datasets (<100 rows)
     • Moderate token usage

     | name  | age | city |
     | ----- | --- | ---- |
     | Alice | 30  | NYC  |
     | Bob   | 25  | LA   |
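To make the three styles concrete, here is a minimal converter for a simple CSV string (no quoted fields). This is an illustrative sketch, not NeuroLink's actual parser:

```typescript
// Render a simple CSV string in one of the three format styles above.
// Assumes no quoted or escaped fields; a real parser must handle those.
function formatCsv(csv: string, style: "raw" | "json" | "markdown"): string {
  if (style === "raw") return csv.trim();
  const [headerLine, ...rows] = csv.trim().split("\n");
  const headers = headerLine.split(",");
  const records = rows.map((r) => r.split(","));
  if (style === "json") {
    // One object per row, keyed by header names
    return JSON.stringify(
      records.map((cells) => Object.fromEntries(headers.map((h, i) => [h, cells[i]]))),
    );
  }
  // Markdown table: header row, separator, then data rows
  const line = (cells: string[]) => `| ${cells.join(" | ")} |`;
  return [line(headers), line(headers.map(() => "---")), ...records.map(line)].join("\n");
}
```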

CSV Configuration

const result = await neurolink.generate({
  input: {
    text: "Analyze customer data",
    csvFiles: ["./customers.csv"],
  },
  csvOptions: {
    maxRows: 1000, // Limit rows (default: 1000, max: 10000)
    formatStyle: "json", // Format: "raw" | "json" | "markdown"
    includeHeaders: true, // Include header row (default: true)
  },
});

CSV Best Practices

  • Use raw format for large files to minimize token usage
  • Use JSON format for structured processing when AI needs to manipulate data
  • Limit to 1000 rows by default (configurable up to 10,000)
  • Combine CSV with visualization images for comprehensive analysis
  • Works with ALL providers (not just vision-capable models)

Combining Multiple Input Types

NeuroLink excels at combining different media types in a single request.

Mixed Media Example

const result = await neurolink.generate({
  input: {
    text: "Analyze this product launch: review the presentation, compare sales data, and assess the promotional materials",
    pdfFiles: ["./presentation.pdf"], // Slides
    csvFiles: ["./sales-data.csv"], // Numbers
    images: [
      readFileSync("./promo-banner.png"), // Marketing material
      "https://example.com/ad-campaign.jpg",
    ],
  },
  provider: "vertex", // Supports all input types
});

Streaming with Multimodal

const stream = await neurolink.stream({
  input: {
    text: "Analyze this floor plan and cost breakdown",
    images: ["./floor-plan.jpg"],
    csvFiles: ["./costs.csv"],
  },
  provider: "google-ai",
});

for await (const chunk of stream) {
  process.stdout.write(chunk.text ?? "");
}

Configuration & Fine-tuning

Image-Specific Options

const result = await neurolink.generate({
  input: {
    text: "Analyze these screenshots",
    images: [
      {
        data: readFileSync("./screenshot.png"),
        altText: "Product dashboard showing KPIs",
      },
    ],
  },
  provider: "openai",
  maxTokens: 2000, // Increase for detailed image analysis
});

PDF-Specific Options

const result = await neurolink.generate({
  input: {
    text: "Extract financial data from this report",
    pdfFiles: ["./annual-report.pdf"],
  },
  provider: "vertex",
  maxTokens: 8000, // Large token budget for comprehensive extraction
});

Regional Routing

Some providers require regional configuration for optimal performance:

const result = await neurolink.generate({
  input: {
    text: "Analyze this document",
    pdfFiles: ["./contract.pdf"],
  },
  provider: "vertex",
  region: "us-central1", // Vertex AI region
});

Best Practices

General Guidelines

  1. Provide descriptive prompts - Reference specific images/files by name
  2. Use alt text for accessibility - Helps both AI and screen readers
  3. Combine analytics + evaluation - Benchmark multimodal quality before production
  4. Cache remote assets locally - Avoid repeated downloads for frequently used files
  5. Stream for user-facing apps - Use stream() for responsive interfaces; use generate() when you need complete structured JSON output

Image Best Practices

  • Provide short captions describing each image in the prompt
  • Pre-compress large images to reduce processing time
  • Use appropriate image formats (JPEG for photos, PNG for diagrams)
  • Consider token limits when sending multiple images

PDF Best Practices

  • Choose providers with native PDF support (Vertex, Anthropic, Bedrock)
  • Be specific about what you need extracted
  • Use streaming for large documents
  • Set appropriate maxTokens (2000-8000 recommended)

CSV Best Practices

  • Use raw format for large datasets
  • Use JSON format when AI needs structured data manipulation
  • Limit rows to avoid token exhaustion
  • Combine with images for visual + numerical analysis

Troubleshooting

Common Issues

| Issue | Solution |
| --- | --- |
| "Image not found" | Check file paths are relative to CWD where CLI is invoked |
| "Provider does not support images" | Switch to vision-capable provider (see matrix above) |
| "Error downloading image" | Ensure URL returns HTTP 200 and doesn't require authentication |
| "Large response latency" | Pre-compress images and reduce resolution to < 2 MP |
| "Streaming ends early" | Disable tools (--disableTools) to avoid tool call interruptions |
| "PDF too large" | Use Google AI Studio (2 GB limit) or split into smaller chunks |
| "CSV token overflow" | Reduce maxRows or use raw format instead of JSON/markdown |

Provider-Specific Issues

OpenAI/Azure:

  • Images must be < 20 MB
  • PDFs processed via Files API (may take longer)

Google AI Studio/Vertex:

  • Best for large PDFs (AI Studio supports up to 2 GB)
  • Gemini models have excellent visual reasoning

Anthropic/Bedrock:

  • Claude excels at document understanding
  • Strong visual and text analysis capabilities

Ollama:

  • Use vision-capable models like llava, bakllava
  • Local processing - no cloud API required


Examples & Recipes

Example 1: Product Analysis

Analyze a product page with screenshot, description, and pricing data:

const analysis = await neurolink.generate({
  input: {
    text: "Analyze this product: review the screenshot, pricing data, and provide recommendations",
    images: [readFileSync("./product-screenshot.png")],
    csvFiles: ["./pricing-tiers.csv"],
  },
  provider: "google-ai",
  maxTokens: 3000,
});

Example 2: Document Comparison

Compare two versions of a contract:

const comparison = await neurolink.generate({
  input: {
    text: "Compare these two contract versions and highlight key differences",
    pdfFiles: ["./contract-v1.pdf", "./contract-v2.pdf"],
  },
  provider: "anthropic",
  maxTokens: 5000,
});

Example 3: Data Visualization Analysis

Analyze charts and underlying data together:

const dataAnalysis = await neurolink.generate({
  input: {
    text: "Analyze these sales charts and verify against the raw data",
    images: [
      "https://example.com/q1-chart.png",
      "https://example.com/q2-chart.png",
    ],
    csvFiles: ["./sales-data.csv"],
  },
  provider: "vertex",
  enableAnalytics: true,
  enableEvaluation: true,
});

Summary

NeuroLink's multimodal capabilities provide:

  ✅ Universal input support - Images, PDFs, CSV files
  ✅ Provider flexibility - Extensive provider compatibility matrix
  ✅ Automatic format detection - Smart file type recognition
  ✅ Accessibility features - Alt text support for images
  ✅ Production-ready - Battle-tested at enterprise scale
  ✅ Developer-friendly - Works seamlessly across CLI and SDK

Next Steps:

  1. Review the provider support matrix to select the right provider
  2. Try the quick start examples with your use case
  3. Explore advanced recipes for complex scenarios
  4. Check troubleshooting if you encounter issues