File Processors Guide

NeuroLink includes a file processing system that supports 20+ file types, with content extraction, security sanitization, and provider-agnostic formatting. Processed files can be passed to any of the 13 supported providers as part of a multimodal request.

Overview

The file processor system is organized into a modular architecture:

src/lib/processors/
├── base/ # BaseFileProcessor abstract class and types
├── registry/ # ProcessorRegistry singleton for processor selection
├── config/ # MIME types, extensions, language maps, size limits
├── errors/ # FileErrorCode enum and error helpers
├── document/ # Excel, Word, RTF, OpenDocument processors
├── media/ # Video and Audio processors (metadata extraction)
├── archive/ # ZIP, TAR, GZ archive processors (file listing + content extraction)
├── markup/ # SVG, HTML, Markdown, Text processors
├── code/ # SourceCode, Config processors
├── data/ # JSON, YAML, XML processors
├── integration/ # FileProcessorIntegration for registry usage
└── cli/ # CLI helpers for file processing

Quick Start

import { NeuroLink } from "@juspay/neurolink";

const neurolink = new NeuroLink();

const result = await neurolink.generate({
  input: {
    text: "Summarize this document",
    files: ["./report.pdf"],
  },
});

console.log(result.content);

Supported File Types

Documents

| Type | Extensions | Processor | Features |
| --- | --- | --- | --- |
| Excel | .xlsx, .xls | ExcelProcessor | Multi-sheet extraction, cell formatting, data tables |
| Word | .docx, .doc | WordProcessor | Text extraction, paragraph preservation |
| RTF | .rtf | RtfProcessor | Rich text to plain text conversion |
| OpenDocument | .odt, .ods, .odp | OpenDocumentProcessor | LibreOffice/OpenOffice format support |

Data Files

| Type | Extensions | Processor | Features |
| --- | --- | --- | --- |
| JSON | .json | JsonProcessor | Validation, pretty-printing, syntax highlighting |
| YAML | .yaml, .yml | YamlProcessor | Validation, formatting, multi-document support |
| XML | .xml | XmlProcessor | Parsing, validation, entity handling |

Markup Files

| Type | Extensions | Processor | Features |
| --- | --- | --- | --- |
| HTML | .html, .htm | HtmlProcessor | OWASP-compliant sanitization, text extraction |
| SVG | .svg | SvgProcessor | XSS prevention, text injection (not binary) |
| Markdown | .md, .markdown | MarkdownProcessor | Formatting preservation, metadata extraction |
| Text | .txt | TextProcessor | Plain text handling, encoding detection |

Source Code

| Type | Extensions | Processor | Features |
| --- | --- | --- | --- |
| TypeScript | .ts, .tsx | SourceCodeProcessor | Language detection, syntax metadata |
| JavaScript | .js, .jsx, .mjs | SourceCodeProcessor | Module detection |
| Python | .py | SourceCodeProcessor | Docstring preservation |
| Java | .java | SourceCodeProcessor | Package detection |
| Go | .go | SourceCodeProcessor | Module awareness |
| Rust | .rs | SourceCodeProcessor | Crate detection |
| C/C++ | .c, .cpp, .h, .hpp | SourceCodeProcessor | Header handling |
| C# | .cs | SourceCodeProcessor | Namespace detection |
| Ruby | .rb | SourceCodeProcessor | Gem awareness |
| PHP | .php | SourceCodeProcessor | Tag handling |
| Swift | .swift | SourceCodeProcessor | Framework detection |
| Kotlin | .kt, .kts | SourceCodeProcessor | Android/JVM awareness |
| Scala | .scala | SourceCodeProcessor | SBT integration |
| Shell | .sh, .bash, .zsh | SourceCodeProcessor | Shebang detection |
| SQL | .sql | SourceCodeProcessor | Dialect hints |
| And 35+ more... | Various | SourceCodeProcessor | Automatic language detection |

Configuration Files

| Type | Extensions | Processor | Features |
| --- | --- | --- | --- |
| Environment | .env, .env.* | ConfigProcessor | Secret masking option |
| INI | .ini, .cfg | ConfigProcessor | Section parsing |
| TOML | .toml | ConfigProcessor | Cargo.toml, pyproject.toml support |
| Properties | .properties | ConfigProcessor | Java properties format |

Media Files

| Type | Extensions | Processor | Features |
| --- | --- | --- | --- |
| Video | .mp4, .mkv, .webm, .avi, .mov, .m4v | VideoProcessor | Duration, resolution, codec, frame rate, bitrate extraction via music-metadata |
| Audio | .mp3, .wav, .ogg, .flac, .aac, .m4a, .wma | AudioProcessor | Codec, bitrate, sample rate, channels, duration extraction via music-metadata |

Video and audio files are not sent as binary to the AI provider. Instead, the processors extract structured metadata and return it as formatted text, keeping token usage minimal (~50-200 tokens per file).

Example video output:

Video File: presentation.mp4
Duration: 13s | Resolution: 640x360 | Video Codec: h264
Frame Rate: 29.97 fps | Bitrate: 345 kbps
Audio: aac, 48000 Hz, 2 channels

Example audio output:

Audio File: recording.mp3
Codec: MPEG 1 Layer 3 | Bitrate: 128 kbps
Sample Rate: 44100 Hz | Channels: 2 (Stereo) | Duration: 1:46
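
The step from extracted metadata to the text summaries shown above can be sketched as follows. This is a minimal illustration, not the VideoProcessor implementation: the `VideoSummaryInput` shape and the `formatDurationSketch` and `formatVideoSummary` helpers are hypothetical names chosen for this example, and the duration format is inferred from the two sample outputs.

```typescript
// Hypothetical metadata shape; field names are assumptions for this sketch,
// not the library's actual types.
interface VideoSummaryInput {
  filename: string;
  durationSeconds: number;
  width: number;
  height: number;
  videoCodec: string;
  frameRate: number;
  bitrateKbps: number;
}

// Render a duration as "13s" under a minute, otherwise as "m:ss".
function formatDurationSketch(totalSeconds: number): string {
  const seconds = Math.round(totalSeconds);
  if (seconds < 60) {
    return `${seconds}s`;
  }
  const minutes = Math.floor(seconds / 60);
  const rest = seconds % 60;
  return `${minutes}:${rest < 10 ? "0" : ""}${rest}`;
}

// Build a compact, line-oriented text summary instead of sending binary data.
function formatVideoSummary(meta: VideoSummaryInput): string {
  return [
    `Video File: ${meta.filename}`,
    `Duration: ${formatDurationSketch(meta.durationSeconds)} | Resolution: ${meta.width}x${meta.height} | Video Codec: ${meta.videoCodec}`,
    `Frame Rate: ${meta.frameRate} fps | Bitrate: ${meta.bitrateKbps} kbps`,
  ].join("\n");
}
```

Keeping the summary to a few labelled lines is what holds the per-file cost to a small, predictable number of tokens.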

Archives

| Type | Extensions | Processor | Features |
| --- | --- | --- | --- |
| ZIP | .zip | ArchiveProcessor | File listing with sizes, nested content extraction, ZIP bomb detection |
| TAR | .tar | ArchiveProcessor | File listing with sizes |
| GZ | .gz, .tar.gz, .tgz | ArchiveProcessor | Gzip decompression, tar content listing |

Archive files return a structured listing of their contents with file sizes and optionally extract text from contained files (routing through existing processors).

Example archive output:

Archive: project.tar.gz
Total entries: 6

Files:
- code/sample.json (60 B)
- code/sample.py (195 B)
- document/sample.txt (607 B)

Security: Archive processing includes ZIP bomb detection (compression ratio limits), path traversal prevention, symlink blocking, entry count limits, and aggregate decompression size limits.
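
To make these checks concrete, here is a rough sketch of how an archive's entry list might be validated before any content is extracted. It is not the ArchiveProcessor code: the `ArchiveEntrySketch` and `ArchiveLimitsSketch` shapes, the `findArchiveViolation` helper, and any limit values passed to it are assumptions made for illustration.

```typescript
// Hypothetical entry and limit shapes; names are assumptions for this sketch.
interface ArchiveEntrySketch {
  path: string;
  compressedSize: number;
  uncompressedSize: number;
  isSymlink: boolean;
}

interface ArchiveLimitsSketch {
  maxEntries: number;
  maxCompressionRatio: number;
  maxTotalUncompressedBytes: number;
}

// Returns a description of the first violated rule, or null if the listing looks safe.
function findArchiveViolation(
  entries: ArchiveEntrySketch[],
  limits: ArchiveLimitsSketch,
): string | null {
  // Entry count limit
  if (entries.length > limits.maxEntries) {
    return "entry count limit exceeded";
  }
  let totalUncompressed = 0;
  for (const entry of entries) {
    // Symlink blocking
    if (entry.isSymlink) {
      return `symlink blocked: ${entry.path}`;
    }
    // Path traversal prevention: reject absolute paths and any ".." segment
    const segments = entry.path.split(/[\\/]/);
    const isAbsolute = /^([\\/]|[A-Za-z]:)/.test(entry.path);
    if (isAbsolute || segments.indexOf("..") !== -1) {
      return `path traversal blocked: ${entry.path}`;
    }
    // ZIP bomb detection: cap the per-entry compression ratio
    const ratio = entry.uncompressedSize / Math.max(entry.compressedSize, 1);
    if (ratio > limits.maxCompressionRatio) {
      return `compression ratio too high: ${entry.path}`;
    }
    // Aggregate decompression size limit
    totalUncompressed += entry.uncompressedSize;
    if (totalUncompressed > limits.maxTotalUncompressedBytes) {
      return "aggregate size limit exceeded";
    }
  }
  return null;
}
```

Running all checks against declared sizes first means a hostile archive can be rejected before any decompression work is done; a real implementation would also need to enforce the size limits while streaming, since declared sizes can be forged.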

Usage

SDK Usage

import { NeuroLink } from "@juspay/neurolink";

const neurolink = new NeuroLink();

// Process multiple file types in a single request
const result = await neurolink.generate({
  input: {
    text: "Analyze these files and summarize the key information",
    files: [
      "./data/report.xlsx", // Excel spreadsheet
      "./config/settings.yaml", // YAML configuration
      "./src/main.ts", // TypeScript source
      "./docs/architecture.svg", // SVG diagram (injected as text)
      "./api/schema.json", // JSON schema
    ],
  },
  provider: "vertex",
});

console.log(result.content);

CLI Usage

# Single file
neurolink generate "Analyze this spreadsheet" --file ./data.xlsx

# Multiple files
neurolink generate "Compare these configs" \
  --file ./config.yaml \
  --file ./settings.json \
  --file ./app.toml

# Mixed with images and PDFs
neurolink generate "Explain this codebase" \
  --file ./src/main.ts \
  --file ./docs/diagram.svg \
  --pdf ./docs/spec.pdf \
  --image ./screenshot.png

Stream Mode

// Streaming with file processing
const result = await neurolink.stream({
  input: {
    text: "Walk me through this code step by step",
    files: ["./src/algorithm.py"],
  },
});

for await (const chunk of result.stream) {
  if ("content" in chunk) {
    process.stdout.write(chunk.content);
  }
}

Architecture

ProcessorRegistry

The ProcessorRegistry is a singleton that manages all file processors with priority-based selection:

import { ProcessorRegistry } from "@juspay/neurolink";

// Get the singleton instance
const registry = ProcessorRegistry.getInstance();

// Register a custom processor (lower priority = higher precedence)
registry.register(new MyCustomProcessor(), 50);

// Find processor for a file
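// Describe the file, then find a processor for it
const fileInfo = {
  filename: "data.xlsx",
  mimeType: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  size: 1024,
};
const processor = registry.findProcessor(fileInfo);

// Process a file (fileContent holds the file's bytes, e.g. a Buffer read from disk)
const result = await processor.process(fileInfo, fileContent);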
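
Priority-based selection can be pictured with a small stand-alone registry. The sketch below is not the ProcessorRegistry source: `SketchRegistry`, `SketchProcessor`, and `SketchFileInfo` are hypothetical names, and the only behavior carried over from this guide is that a lower priority number wins.

```typescript
// Hypothetical minimal types; not the library's FileInfo or processor interfaces.
interface SketchFileInfo {
  filename: string;
  mimeType: string;
}

interface SketchProcessor {
  name: string;
  canProcess(file: SketchFileInfo): boolean;
}

class SketchRegistry {
  private entries: { processor: SketchProcessor; priority: number }[] = [];

  // Keep entries ordered so that the lowest priority number is checked first.
  register(processor: SketchProcessor, priority: number): void {
    this.entries.push({ processor, priority });
    this.entries.sort((a, b) => a.priority - b.priority);
  }

  // Return the first processor, in priority order, that accepts the file.
  findProcessor(file: SketchFileInfo): SketchProcessor | null {
    for (const entry of this.entries) {
      if (entry.processor.canProcess(file)) {
        return entry.processor;
      }
    }
    return null;
  }
}

// True when the filename ends with the given extension.
function hasExtension(file: SketchFileInfo, extension: string): boolean {
  const start = file.filename.length - extension.length;
  return start >= 0 && file.filename.slice(start) === extension;
}
```

With this ordering, a custom processor registered at priority 50 takes precedence over a generic one registered at 100 when both accept the same file. Whether the real registry returns null, undefined, or throws when nothing matches is not stated in this guide, so callers should check the result before using it.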

BaseFileProcessor

All processors extend the abstract BaseFileProcessor class:

import { BaseFileProcessor, FileInfo, ProcessedFile } from "@juspay/neurolink";

export class MyProcessor extends BaseFileProcessor {
  readonly name = "my-processor";
  readonly supportedMimeTypes = ["application/x-my-format"];
  readonly supportedExtensions = [".myf"];

  // Decide whether this processor handles the given file
  canProcess(file: FileInfo): boolean {
    return this.supportedExtensions.includes(file.extension);
  }

  // Convert the raw bytes into text the provider can consume
  async process(file: FileInfo, content: Buffer): Promise<ProcessedFile> {
    const text = this.extractText(content);
    return {
      type: "text",
      content: text,
      metadata: {
        processor: this.name,
        originalFilename: file.filename,
      },
    };
  }

  // Describe the processor; the return type is inferred
  getInfo() {
    return {
      name: this.name,
      description: "Processes MY format files",
      supportedMimeTypes: this.supportedMimeTypes,
      supportedExtensions: this.supportedExtensions,
    };
  }
}

FileDetector

The FileDetector utility automatically identifies file types:

import { FileDetector } from "@juspay/neurolink";

const detector = new FileDetector();

// Detect by extension
const type1 = detector.detect("report.xlsx");
// Returns: { type: "xlsx", mimeType: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" }

// Detect by content (magic bytes)
const type2 = detector.detectFromContent(buffer);

// SVG special handling - returns "svg" type, not "image"
const type3 = detector.detect("diagram.svg");
// Returns: { type: "svg", mimeType: "image/svg+xml" }
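
Content-based detection generally works by comparing the first bytes of a file against known signatures. The following is a minimal sketch of that idea, not the FileDetector implementation: `MAGIC_SIGNATURES` and `sniffType` are hypothetical names, and only a few well-known signatures are included.

```typescript
// A few well-known file signatures (magic bytes) at offset 0.
const MAGIC_SIGNATURES: { type: string; bytes: number[] }[] = [
  { type: "pdf", bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { type: "zip", bytes: [0x50, 0x4b, 0x03, 0x04] }, // "PK\x03\x04"
  { type: "gzip", bytes: [0x1f, 0x8b] },
  { type: "png", bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
];

// Return the first signature type whose bytes prefix the content, or null.
function sniffType(content: Uint8Array): string | null {
  for (const signature of MAGIC_SIGNATURES) {
    if (content.length < signature.bytes.length) {
      continue;
    }
    let matches = true;
    for (let i = 0; i < signature.bytes.length; i++) {
      if (content[i] !== signature.bytes[i]) {
        matches = false;
        break;
      }
    }
    if (matches) {
      return signature.type;
    }
  }
  return null;
}
```

Magic bytes alone are not always enough: .xlsx and .docx files are ZIP containers and share the ZIP signature, so a detector typically combines content sniffing with the extension or MIME type.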

Security Features

OWASP-Compliant Sanitization

The markup processors include security sanitization to prevent XSS and injection attacks:

HTML Sanitization

// HtmlProcessor automatically sanitizes HTML content
// - Removes <script> tags
// - Strips event handlers (onclick, onerror, etc.)
// - Removes javascript: URLs
// - Sanitizes style attributes
// - Blocks dangerous protocols

const result = await neurolink.generate({
  input: {
    text: "Summarize this HTML content",
    files: ["./untrusted-content.html"], // Automatically sanitized
  },
});

SVG Sanitization

// SvgProcessor sanitizes SVG before injection
// - Removes embedded scripts
// - Strips foreignObject elements
// - Sanitizes use/href attributes
// - Blocks external entity references

// SVG is injected as TEXT, not as binary image
// This prevents image-based attacks while preserving vector content

File Size Limits

Default size limits prevent denial-of-service attacks:

| Category | Default Limit | Configurable |
| --- | --- | --- |
| Documents | 50 MB | Yes |
| Data files | 10 MB | Yes |
| Code files | 5 MB | Yes |
| Config files | 1 MB | Yes |
| Images | 20 MB | Yes |

import { ProcessorConfig } from "@juspay/neurolink";

// Configure size limits
ProcessorConfig.setLimits({
  maxDocumentSize: 100 * 1024 * 1024, // 100 MB
  maxCodeSize: 10 * 1024 * 1024, // 10 MB
});
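
As an illustration of how per-category limits with overrides might be applied, here is a small stand-alone sketch. The default values come from the table above; `DEFAULT_SIZE_LIMITS`, `SizeCategory`, and `exceedsSizeLimit` are hypothetical names and not part of the ProcessorConfig API.

```typescript
const MB = 1024 * 1024;

// Defaults taken from the size limit table in this guide.
const DEFAULT_SIZE_LIMITS = {
  documents: 50 * MB,
  data: 10 * MB,
  code: 5 * MB,
  config: 1 * MB,
  images: 20 * MB,
};

type SizeCategory = keyof typeof DEFAULT_SIZE_LIMITS;

// True when the file is larger than the effective limit for its category.
// An override, when present, replaces the default for that category only.
function exceedsSizeLimit(
  category: SizeCategory,
  sizeBytes: number,
  overrides: Partial<Record<SizeCategory, number>> = {},
): boolean {
  const override = overrides[category];
  const limit = override !== undefined ? override : DEFAULT_SIZE_LIMITS[category];
  return sizeBytes > limit;
}
```

Checking the size before reading or parsing the file is what makes the limit useful against denial-of-service: an oversized upload is rejected without spending memory on it.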

Error Handling

FileErrorCode Enum

import { FileErrorCode } from "@juspay/neurolink";

try {
  const result = await neurolink.generate({
    input: { files: ["./corrupted.xlsx"] },
  });
} catch (error) {
  if (error && typeof error === "object" && "code" in error) {
    switch (error.code) {
      case FileErrorCode.UNSUPPORTED_TYPE:
        console.log("File type not supported");
        break;
      case FileErrorCode.FILE_TOO_LARGE:
        console.log("File too large");
        break;
      case FileErrorCode.CORRUPTED_FILE:
        console.log("File is corrupted");
        break;
      case FileErrorCode.DOWNLOAD_AUTH_FAILED:
        console.log("Cannot read file");
        break;
    }
  }
}
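
The narrowing done inline in the catch block can be factored into a reusable type guard with a fallback for unrecognized codes. This is a sketch under stated assumptions: `SketchFileErrorCode` is a hypothetical stand-in for FileErrorCode (the real enum's members and values may differ), and `hasErrorCode` and `describeFileError` are names invented for this example.

```typescript
// Hypothetical stand-in for FileErrorCode; values are assumptions.
enum SketchFileErrorCode {
  UNSUPPORTED_TYPE = "UNSUPPORTED_TYPE",
  FILE_TOO_LARGE = "FILE_TOO_LARGE",
  CORRUPTED_FILE = "CORRUPTED_FILE",
}

// Type guard: narrows an unknown caught value to an object carrying a `code`.
function hasErrorCode(error: unknown): error is { code: unknown } {
  return typeof error === "object" && error !== null && "code" in error;
}

// Map a caught value to a user-facing message, with explicit fallbacks.
function describeFileError(error: unknown): string {
  if (!hasErrorCode(error)) {
    return "Unknown error";
  }
  switch (error.code) {
    case SketchFileErrorCode.UNSUPPORTED_TYPE:
      return "File type not supported";
    case SketchFileErrorCode.FILE_TOO_LARGE:
      return "File too large";
    case SketchFileErrorCode.CORRUPTED_FILE:
      return "File is corrupted";
    default:
      return "Unhandled file error code";
  }
}
```

The default branch matters in practice: without it, an error code added in a later release would be silently ignored.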

Provider Compatibility

All file processors work across all 13 AI providers. The processed content is formatted as text that any provider can understand:

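| Provider | Documents | Data | Markup | Code | Config |
| --- | --- | --- | --- | --- | --- |
| OpenAI | Yes | Yes | Yes | Yes | Yes |
| Anthropic | Yes | Yes | Yes | Yes | Yes |
| Google AI Studio | Yes | Yes | Yes | Yes | Yes |
| Google Vertex | Yes | Yes | Yes | Yes | Yes |
| AWS Bedrock | Yes | Yes | Yes | Yes | Yes |
| Azure OpenAI | Yes | Yes | Yes | Yes | Yes |
| Mistral | Yes | Yes | Yes | Yes | Yes |
| LiteLLM | Yes | Yes | Yes | Yes | Yes |
| Ollama | Yes | Yes | Yes | Yes | Yes |
| Hugging Face | Yes | Yes | Yes | Yes | Yes |
| SageMaker | Yes | Yes | Yes | Yes | Yes |
| OpenAI Compatible | Yes | Yes | Yes | Yes | Yes |
| OpenRouter | Yes | Yes | Yes | Yes | Yes |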

Note: For binary files like images and PDFs, provider-specific adapters handle the formatting. See PDF Support and Multimodal Chat.

Best Practices

1. Use Appropriate File Types

// Good: Use structured data formats for data
files: ["./data.json", "./config.yaml"];

// Avoid: Using unstructured text for structured data
files: ["./data.txt"]; // Harder for AI to parse

2. Group Related Files

// Good: Group related files together
const result = await neurolink.generate({
  input: {
    text: "Review this module for best practices",
    files: [
      "./src/module.ts", // Implementation
      "./src/module.test.ts", // Tests
      "./src/module.types.ts", // Types
    ],
  },
});

3. Be Mindful of Token Limits

// For large files, consider chunking or summarization
import { ProcessorConfig } from "@juspay/neurolink";

// Enable automatic truncation for very large files
ProcessorConfig.setTruncation({
  enabled: true,
  maxTokens: 50000,
  strategy: "head-tail", // Keep beginning and end
});
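
A "head-tail" strategy keeps the beginning and end of the content and drops the middle. The sketch below shows one way that could work; it is not the library's truncation code. `truncateHeadTail` and `CHARS_PER_TOKEN_ESTIMATE` are hypothetical names, and the four-characters-per-token figure is a rough heuristic assumed here in place of a real tokenizer.

```typescript
// Assumption: roughly 4 characters per token, a coarse heuristic for English text.
const CHARS_PER_TOKEN_ESTIMATE = 4;

// Keep the start and end of the text and replace the middle with a marker.
function truncateHeadTail(text: string, maxTokens: number): string {
  const maxChars = maxTokens * CHARS_PER_TOKEN_ESTIMATE;
  if (text.length <= maxChars) {
    return text; // Already within budget
  }
  const marker = "\n[... truncated ...]\n";
  // Budget left for real content once the marker is accounted for
  const budget = Math.max(maxChars - marker.length, 0);
  const headLength = Math.ceil(budget / 2);
  const tailLength = budget - headLength;
  const head = text.slice(0, headLength);
  const tail = text.slice(text.length - tailLength);
  return head + marker + tail;
}
```

Keeping both ends suits files where the opening (imports, headers, schema) and the closing (exports, totals, conclusions) carry most of the context. For very small budgets this sketch returns only the marker.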

4. Use Specific Prompts

// Good: Be specific about what to analyze
const result = await neurolink.generate({
  input: {
    text: "Find security vulnerabilities in this code, focusing on SQL injection and XSS",
    files: ["./src/api.ts"],
  },
});

// Less effective: Vague prompt
const vagueResult = await neurolink.generate({
  input: {
    text: "Look at this",
    files: ["./src/api.ts"],
  },
});

Extending the System

Creating a Custom Processor

import {
  BaseFileProcessor,
  FileInfo,
  ProcessedFile,
  ProcessorRegistry,
} from "@juspay/neurolink";

class ProtobufProcessor extends BaseFileProcessor {
  readonly name = "protobuf-processor";
  readonly supportedMimeTypes = ["application/x-protobuf"];
  readonly supportedExtensions = [".proto"];

  canProcess(file: FileInfo): boolean {
    return file.extension === ".proto";
  }

  async process(file: FileInfo, content: Buffer): Promise<ProcessedFile> {
    const protoText = content.toString("utf-8");

    // Add syntax highlighting hints
    const formatted = `\`\`\`protobuf\n${protoText}\n\`\`\``;

    return {
      type: "text",
      content: formatted,
      metadata: {
        processor: this.name,
        language: "protobuf",
        filename: file.filename,
      },
    };
  }

  getInfo() {
    return {
      name: this.name,
      description: "Processes Protocol Buffer definition files",
      supportedMimeTypes: this.supportedMimeTypes,
      supportedExtensions: this.supportedExtensions,
    };
  }
}

// Register with priority 50 (lower = higher precedence)
ProcessorRegistry.getInstance().register(new ProtobufProcessor(), 50);