File Processors Guide
NeuroLink's file processing system supports 20+ file types, with content extraction, security sanitization, and provider-agnostic formatting. It lets you attach files to multimodal requests across all 13 supported providers.
Overview
The file processor system is organized into a modular architecture:
src/lib/processors/
├── base/ # BaseFileProcessor abstract class and types
├── registry/ # ProcessorRegistry singleton for processor selection
├── config/ # MIME types, extensions, language maps, size limits
├── errors/ # FileErrorCode enum and error helpers
├── document/ # Excel, Word, RTF, OpenDocument processors
├── media/ # Video and Audio processors (metadata extraction)
├── archive/ # ZIP, TAR, GZ archive processors (file listing + content extraction)
├── markup/ # SVG, HTML, Markdown, Text processors
├── code/ # SourceCode, Config processors
├── data/ # JSON, YAML, XML processors
├── integration/ # FileProcessorIntegration for registry usage
└── cli/ # CLI helpers for file processing
Quick Start
import { NeuroLink } from "@juspay/neurolink";
const neurolink = new NeuroLink();
const result = await neurolink.generate({
input: {
text: "Summarize this document",
files: ["./report.pdf"],
},
});
console.log(result.content);
Supported File Types
Documents
| Type | Extensions | Processor | Features |
|---|---|---|---|
| Excel | .xlsx, .xls | ExcelProcessor | Multi-sheet extraction, cell formatting, data tables |
| Word | .docx, .doc | WordProcessor | Text extraction, paragraph preservation |
| RTF | .rtf | RtfProcessor | Rich text to plain text conversion |
| OpenDocument | .odt, .ods, .odp | OpenDocumentProcessor | LibreOffice/OpenOffice format support |
Data Files
| Type | Extensions | Processor | Features |
|---|---|---|---|
| JSON | .json | JsonProcessor | Validation, pretty-printing, syntax highlighting |
| YAML | .yaml, .yml | YamlProcessor | Validation, formatting, multi-document support |
| XML | .xml | XmlProcessor | Parsing, validation, entity handling |
Markup Files
| Type | Extensions | Processor | Features |
|---|---|---|---|
| HTML | .html, .htm | HtmlProcessor | OWASP-compliant sanitization, text extraction |
| SVG | .svg | SvgProcessor | XSS prevention, text injection (not binary) |
| Markdown | .md, .markdown | MarkdownProcessor | Formatting preservation, metadata extraction |
| Text | .txt | TextProcessor | Plain text handling, encoding detection |
Source Code
| Type | Extensions | Processor | Features |
|---|---|---|---|
| TypeScript | .ts, .tsx | SourceCodeProcessor | Language detection, syntax metadata |
| JavaScript | .js, .jsx, .mjs | SourceCodeProcessor | Module detection |
| Python | .py | SourceCodeProcessor | Docstring preservation |
| Java | .java | SourceCodeProcessor | Package detection |
| Go | .go | SourceCodeProcessor | Module awareness |
| Rust | .rs | SourceCodeProcessor | Crate detection |
| C/C++ | .c, .cpp, .h, .hpp | SourceCodeProcessor | Header handling |
| C# | .cs | SourceCodeProcessor | Namespace detection |
| Ruby | .rb | SourceCodeProcessor | Gem awareness |
| PHP | .php | SourceCodeProcessor | Tag handling |
| Swift | .swift | SourceCodeProcessor | Framework detection |
| Kotlin | .kt, .kts | SourceCodeProcessor | Android/JVM awareness |
| Scala | .scala | SourceCodeProcessor | SBT integration |
| Shell | .sh, .bash, .zsh | SourceCodeProcessor | Shebang detection |
| SQL | .sql | SourceCodeProcessor | Dialect hints |
| And 35+ more... | Various | SourceCodeProcessor | Automatic language detection |
Configuration Files
| Type | Extensions | Processor | Features |
|---|---|---|---|
| Environment | .env, .env.* | ConfigProcessor | Secret masking option |
| INI | .ini, .cfg | ConfigProcessor | Section parsing |
| TOML | .toml | ConfigProcessor | Cargo.toml, pyproject.toml support |
| Properties | .properties | ConfigProcessor | Java properties format |
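The secret masking option for .env files can be pictured roughly as follows. This is a minimal sketch, not ConfigProcessor's actual implementation: `maskEnvSecrets`, the key pattern, and the `****` placeholder are all hypothetical choices made for illustration.

```typescript
// Hypothetical pattern: keys containing these words are treated as secrets.
const SECRET_KEY_PATTERN = /(SECRET|TOKEN|PASSWORD|API_KEY|PRIVATE)/i;

// Hypothetical helper: replaces the value of secret-looking keys with "****".
function maskEnvSecrets(envText: string): string {
  return envText
    .split("\n")
    .map((line) => {
      const trimmed = line.trim();
      // Leave blank lines and comments untouched.
      if (trimmed === "" || trimmed.startsWith("#")) return line;
      const eq = line.indexOf("=");
      // Lines without "=" are not key/value pairs; pass them through.
      if (eq === -1) return line;
      const key = line.slice(0, eq).trim();
      return SECRET_KEY_PATTERN.test(key) ? `${key}=****` : line;
    })
    .join("\n");
}

const sampleEnv = "# app settings\nPORT=3000\nAPI_KEY=abc123\nDB_PASSWORD=hunter2";
console.log(maskEnvSecrets(sampleEnv));
```

Masking by key name rather than by value keeps the structure of the file visible to the model while hiding the sensitive part.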
Media Files
| Type | Extensions | Processor | Features |
|---|---|---|---|
| Video | .mp4, .mkv, .webm, .avi, .mov, .m4v | VideoProcessor | Duration, resolution, codec, frame rate, bitrate extraction via music-metadata |
| Audio | .mp3, .wav, .ogg, .flac, .aac, .m4a, .wma | AudioProcessor | Codec, bitrate, sample rate, channels, duration extraction via music-metadata |
Video and audio files are not sent as binary to the AI provider. Instead, the processors extract structured metadata and return it as formatted text, keeping token usage minimal (~50-200 tokens per file).
Example video output:
Video File: presentation.mp4
Duration: 13s | Resolution: 640x360 | Video Codec: h264
Frame Rate: 29.97 fps | Bitrate: 345 kbps
Audio: aac, 48000 Hz, 2 channels
Example audio output:
Audio File: recording.mp3
Codec: MPEG 1 Layer 3 | Bitrate: 128 kbps
Sample Rate: 44100 Hz | Channels: 2 (Stereo) | Duration: 1:46
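To show how extracted metadata might be turned into the compact text above, here is a minimal sketch. The `AudioSummary` shape and both function names are assumptions for illustration; the real AudioProcessor's types and formatting logic may differ.

```typescript
// Hypothetical metadata shape; not the processor's actual type.
interface AudioSummary {
  filename: string;
  codec: string;
  bitrateKbps: number;
  sampleRateHz: number;
  channels: number;
  durationSeconds: number;
}

// Formats seconds as m:ss.
function formatDuration(totalSeconds: number): string {
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = Math.floor(totalSeconds % 60);
  return `${minutes}:${String(seconds).padStart(2, "0")}`;
}

// Builds the three-line text summary that is sent instead of the binary file.
function formatAudioSummary(a: AudioSummary): string {
  const layout =
    a.channels === 1 ? "Mono" : a.channels === 2 ? "Stereo" : `${a.channels}ch`;
  return [
    `Audio File: ${a.filename}`,
    `Codec: ${a.codec} | Bitrate: ${a.bitrateKbps} kbps`,
    `Sample Rate: ${a.sampleRateHz} Hz | Channels: ${a.channels} (${layout}) | Duration: ${formatDuration(a.durationSeconds)}`,
  ].join("\n");
}

const recording: AudioSummary = {
  filename: "recording.mp3",
  codec: "MPEG 1 Layer 3",
  bitrateKbps: 128,
  sampleRateHz: 44100,
  channels: 2,
  durationSeconds: 106,
};
console.log(formatAudioSummary(recording));
```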
Archives
| Type | Extensions | Processor | Features |
|---|---|---|---|
| ZIP | .zip | ArchiveProcessor | File listing with sizes, nested content extraction, ZIP bomb detection |
| TAR | .tar | ArchiveProcessor | File listing with sizes |
| GZ | .gz, .tar.gz, .tgz | ArchiveProcessor | Gzip decompression, tar content listing |
Archive processors return a structured listing of the archive's contents with file sizes, and can optionally extract text from contained files by routing them through the existing processors.
Example archive output:
Archive: project.tar.gz
Total entries: 6
Files:
- code/sample.json (60 B)
- code/sample.py (195 B)
- document/sample.txt (607 B)
Security: Archive processing includes ZIP bomb detection (compression ratio limits), path traversal prevention, symlink blocking, entry count limits, and aggregate decompression size limits.
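The checks listed above can be sketched as a single validation pass over the archive's entry list. This is an illustration under stated assumptions, not the ArchiveProcessor's code: the `ArchiveEntry` shape, the function names, and every threshold in `ARCHIVE_LIMITS` are hypothetical.

```typescript
// Hypothetical entry shape and thresholds, chosen only for illustration.
interface ArchiveEntry {
  name: string;
  compressedSize: number;
  uncompressedSize: number;
  isSymlink: boolean;
}

const ARCHIVE_LIMITS = {
  maxEntries: 10000,
  maxRatio: 100, // uncompressed / compressed
  maxTotalBytes: 500 * 1024 * 1024,
};

// True if the entry name is absolute or climbs above the extraction root.
function escapesRoot(name: string): boolean {
  const unified = name.replace(/\\/g, "/");
  if (unified.startsWith("/") || /^[A-Za-z]:/.test(unified)) return true;
  let depth = 0;
  for (const segment of unified.split("/")) {
    if (segment === "" || segment === ".") continue;
    depth += segment === ".." ? -1 : 1;
    if (depth < 0) return true;
  }
  return false;
}

// Returns a list of human-readable problems; an empty list means "looks safe".
function checkArchive(entries: ArchiveEntry[]): string[] {
  const problems: string[] = [];
  if (entries.length > ARCHIVE_LIMITS.maxEntries) problems.push("too many entries");
  let totalBytes = 0;
  for (const entry of entries) {
    if (escapesRoot(entry.name)) problems.push(`path traversal: ${entry.name}`);
    if (entry.isSymlink) problems.push(`symlink: ${entry.name}`);
    if (
      entry.compressedSize > 0 &&
      entry.uncompressedSize / entry.compressedSize > ARCHIVE_LIMITS.maxRatio
    ) {
      problems.push(`suspicious ratio: ${entry.name}`);
    }
    totalBytes += entry.uncompressedSize;
  }
  if (totalBytes > ARCHIVE_LIMITS.maxTotalBytes) problems.push("aggregate size exceeded");
  return problems;
}
```

Running every check against declared sizes before decompressing anything is what makes the checks cheap. A stricter implementation would also count bytes during extraction, since declared sizes can be forged.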
Usage
SDK Usage
import { NeuroLink } from "@juspay/neurolink";
const neurolink = new NeuroLink();
// Process multiple file types in a single request
const result = await neurolink.generate({
input: {
text: "Analyze these files and summarize the key information",
files: [
"./data/report.xlsx", // Excel spreadsheet
"./config/settings.yaml", // YAML configuration
"./src/main.ts", // TypeScript source
"./docs/architecture.svg", // SVG diagram (injected as text)
"./api/schema.json", // JSON schema
],
},
provider: "vertex",
});
console.log(result.content);
CLI Usage
# Single file
neurolink generate "Analyze this spreadsheet" --file ./data.xlsx
# Multiple files
neurolink generate "Compare these configs" \
--file ./config.yaml \
--file ./settings.json \
--file ./app.toml
# Mixed with images and PDFs
neurolink generate "Explain this codebase" \
--file ./src/main.ts \
--file ./docs/diagram.svg \
--pdf ./docs/spec.pdf \
--image ./screenshot.png
Stream Mode
// Streaming with file processing
const result = await neurolink.stream({
input: {
text: "Walk me through this code step by step",
files: ["./src/algorithm.py"],
},
});
for await (const chunk of result.stream) {
if ("content" in chunk) {
process.stdout.write(chunk.content);
}
}
Architecture
ProcessorRegistry
The ProcessorRegistry is a singleton that manages all file processors with priority-based selection:
import { ProcessorRegistry } from "@juspay/neurolink";
// Get the singleton instance
const registry = ProcessorRegistry.getInstance();
// Register a custom processor (lower priority = higher precedence)
registry.register(new MyCustomProcessor(), 50);
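// Describe the file, then find a matching processor
const fileInfo = {
  filename: "data.xlsx",
  mimeType: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  size: 1024,
};
const processor = registry.findProcessor(fileInfo);
// Process the file. fileContent is a Buffer holding the file's bytes
// (for example, read with fs.promises.readFile). Guard in case no processor matches.
if (processor) {
  const result = await processor.process(fileInfo, fileContent);
}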
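Priority-based selection can be pictured with a small standalone sketch. `MiniRegistry`, `ProcessorLike`, and `FileInfoLike` are hypothetical stand-ins written for this example and make no claim about how ProcessorRegistry is implemented internally.

```typescript
// Hypothetical minimal types, independent of the NeuroLink package.
interface FileInfoLike {
  filename: string;
  mimeType: string;
}

interface ProcessorLike {
  name: string;
  canProcess(file: FileInfoLike): boolean;
}

class MiniRegistry {
  private entries: { processor: ProcessorLike; priority: number }[] = [];

  register(processor: ProcessorLike, priority: number): void {
    this.entries.push({ processor, priority });
    // Keep entries sorted ascending so lower numbers are tried first.
    this.entries.sort((a, b) => a.priority - b.priority);
  }

  // Returns the first processor (in priority order) that accepts the file.
  findProcessor(file: FileInfoLike): ProcessorLike | null {
    const match = this.entries.find((e) => e.processor.canProcess(file));
    return match ? match.processor : null;
  }
}

const miniRegistry = new MiniRegistry();
miniRegistry.register(
  { name: "generic-text", canProcess: (f) => f.filename.endsWith(".txt") },
  100,
);
miniRegistry.register(
  { name: "custom-text", canProcess: (f) => f.filename.endsWith(".txt") },
  50,
);
const chosen = miniRegistry.findProcessor({ filename: "notes.txt", mimeType: "text/plain" });
console.log(chosen ? chosen.name : "no processor");
```

Both processors accept `.txt` files here, so the one registered with the lower number wins, which is the precedence rule described above.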
BaseFileProcessor
All processors extend the abstract BaseFileProcessor class:
import { BaseFileProcessor, FileInfo, ProcessedFile, ProcessorInfo } from "@juspay/neurolink";
export class MyProcessor extends BaseFileProcessor {
readonly name = "my-processor";
readonly supportedMimeTypes = ["application/x-my-format"];
readonly supportedExtensions = [".myf"];
canProcess(file: FileInfo): boolean {
return this.supportedExtensions.includes(file.extension);
}
async process(file: FileInfo, content: Buffer): Promise<ProcessedFile> {
const text = this.extractText(content);
return {
type: "text",
content: text,
metadata: {
processor: this.name,
originalFilename: file.filename,
},
};
}
getInfo(): ProcessorInfo {
return {
name: this.name,
description: "Processes MY format files",
supportedMimeTypes: this.supportedMimeTypes,
supportedExtensions: this.supportedExtensions,
};
}
}
FileDetector
The FileDetector utility automatically identifies file types:
import { FileDetector } from "@juspay/neurolink";
const detector = new FileDetector();
// Detect by extension
const type1 = detector.detect("report.xlsx");
// Returns: { type: "xlsx", mimeType: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" }
// Detect by content (magic bytes); `buffer` is a Buffer holding the file's bytes
const type2 = detector.detectFromContent(buffer);
// SVG special handling - returns "svg" type, not "image"
const type3 = detector.detect("diagram.svg");
// Returns: { type: "svg", mimeType: "image/svg+xml" }
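Content-based detection generally works by comparing the first bytes of a file against known signatures. The sketch below illustrates the idea with four well-known signatures. `sniffType` and the `SIGNATURES` table are hypothetical and are not FileDetector's implementation.

```typescript
// A few well-known file signatures (magic bytes).
const SIGNATURES: { type: string; bytes: number[] }[] = [
  { type: "pdf", bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { type: "png", bytes: [0x89, 0x50, 0x4e, 0x47] }, // "\x89PNG"
  { type: "zip", bytes: [0x50, 0x4b, 0x03, 0x04] }, // "PK\x03\x04"; also the container for .xlsx and .docx
  { type: "gzip", bytes: [0x1f, 0x8b] },
];

// Hypothetical helper: returns the first signature that matches, or null.
function sniffType(content: Uint8Array): string | null {
  for (const signature of SIGNATURES) {
    const matches =
      content.length >= signature.bytes.length &&
      signature.bytes.every((byte, i) => content[i] === byte);
    if (matches) return signature.type;
  }
  return null;
}

const gzipHeader = Uint8Array.from([0x1f, 0x8b, 0x08, 0x00]);
console.log(sniffType(gzipHeader));
```

Because .xlsx and .docx files are ZIP containers, magic bytes alone cannot tell them apart from a plain .zip. That is one reason a detector may combine content sniffing with the file extension.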
Security Features
OWASP-Compliant Sanitization
The markup processors include security sanitization to prevent XSS and injection attacks:
HTML Sanitization
// HtmlProcessor automatically sanitizes HTML content
// - Removes <script> tags
// - Strips event handlers (onclick, onerror, etc.)
// - Removes javascript: URLs
// - Sanitizes style attributes
// - Blocks dangerous protocols
const result = await neurolink.generate({
input: {
text: "Summarize this HTML content",
files: ["./untrusted-content.html"], // Automatically sanitized
},
});
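To make the first three rules above concrete, here is a deliberately naive, regex-based illustration. It is an assumption-laden sketch: `naiveSanitize` is a hypothetical function, it is not how HtmlProcessor works, and regex filtering of this kind is easy to bypass, so it should not be used in place of a real HTML sanitizer.

```typescript
// Hypothetical, intentionally simplistic sanitizer for illustration only.
function naiveSanitize(html: string): string {
  return (
    html
      // 1. Remove <script>...</script> blocks, including their contents.
      .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "")
      // 2. Strip inline event handlers such as onclick="..." or onerror='...'.
      .replace(/\son\w+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/gi, "")
      // 3. Neutralize quoted javascript: URLs in href/src attributes.
      .replace(/(href|src)\s*=\s*(["'])\s*javascript:[^"']*\2/gi, '$1="#"')
  );
}

const untrustedHtml =
  '<p onclick="alert(1)">Hi</p><script>evil()</script><a href="javascript:void(0)">x</a>';
console.log(naiveSanitize(untrustedHtml));
```

Production sanitizers parse the markup into a tree and apply an allowlist of tags and attributes. The sketch only shows what each rule removes.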
SVG Sanitization
// SvgProcessor sanitizes SVG before injection
// - Removes embedded scripts
// - Strips foreignObject elements
// - Sanitizes use/href attributes
// - Blocks external entity references
// SVG is injected as TEXT, not as binary image
// This prevents image-based attacks while preserving vector content
File Size Limits
Default size limits help guard against denial-of-service through oversized files:
| Category | Default Limit | Configurable |
|---|---|---|
| Documents | 50 MB | Yes |
| Data files | 10 MB | Yes |
| Code files | 5 MB | Yes |
| Config files | 1 MB | Yes |
| Images | 20 MB | Yes |
import { ProcessorConfig } from "@juspay/neurolink";
// Configure size limits
ProcessorConfig.setLimits({
maxDocumentSize: 100 * 1024 * 1024, // 100 MB
maxCodeSize: 10 * 1024 * 1024, // 10 MB
});
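The defaults in the table, combined with per-category overrides, amount to a simple lookup before processing. The sketch below illustrates that lookup. `checkSize`, `DEFAULT_LIMITS`, and the category names are hypothetical and do not reflect ProcessorConfig's internals.

```typescript
type Category = "document" | "data" | "code" | "config" | "image";

const MEGABYTE = 1024 * 1024;

// Default limits copied from the table above.
const DEFAULT_LIMITS: Record<Category, number> = {
  document: 50 * MEGABYTE,
  data: 10 * MEGABYTE,
  code: 5 * MEGABYTE,
  config: 1 * MEGABYTE,
  image: 20 * MEGABYTE,
};

// Hypothetical helper: an override, if present, replaces the default limit.
function checkSize(
  category: Category,
  sizeBytes: number,
  overrides: Partial<Record<Category, number>> = {},
): { ok: boolean; limit: number } {
  const limit = overrides[category] ?? DEFAULT_LIMITS[category];
  return { ok: sizeBytes <= limit, limit };
}

// A 6 MB source file exceeds the 5 MB default but passes with a 10 MB override.
console.log(checkSize("code", 6 * MEGABYTE));
console.log(checkSize("code", 6 * MEGABYTE, { code: 10 * MEGABYTE }));
```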
Error Handling
FileErrorCode Enum
import { FileErrorCode } from "@juspay/neurolink";
try {
const result = await neurolink.generate({
input: { files: ["./corrupted.xlsx"] },
});
} catch (error) {
if (error && typeof error === "object" && "code" in error) {
switch (error.code) {
case FileErrorCode.UNSUPPORTED_TYPE:
console.log("File type not supported");
break;
case FileErrorCode.FILE_TOO_LARGE:
console.log("File too large");
break;
case FileErrorCode.CORRUPTED_FILE:
console.log("File is corrupted");
break;
case FileErrorCode.DOWNLOAD_AUTH_FAILED:
console.log("Download failed: authentication error");
break;
}
}
}
Provider Compatibility
All file processors work across all 13 AI providers. The processed content is formatted as text that any provider can understand:
| Provider | Documents | Data | Markup | Code | Config |
|---|---|---|---|---|---|
| OpenAI | ✅ | ✅ | ✅ | ✅ | ✅ |
| Anthropic | ✅ | ✅ | ✅ | ✅ | ✅ |
| Google AI Studio | ✅ | ✅ | ✅ | ✅ | ✅ |
| Google Vertex | ✅ | ✅ | ✅ | ✅ | ✅ |
| AWS Bedrock | ✅ | ✅ | ✅ | ✅ | ✅ |
| Azure OpenAI | ✅ | ✅ | ✅ | ✅ | ✅ |
| Mistral | ✅ | ✅ | ✅ | ✅ | ✅ |
| LiteLLM | ✅ | ✅ | ✅ | ✅ | ✅ |
| Ollama | ✅ | ✅ | ✅ | ✅ | ✅ |
| Hugging Face | ✅ | ✅ | ✅ | ✅ | ✅ |
| SageMaker | ✅ | ✅ | ✅ | ✅ | ✅ |
| OpenAI Compatible | ✅ | ✅ | ✅ | ✅ | ✅ |
| OpenRouter | ✅ | ✅ | ✅ | ✅ | ✅ |
Note: For binary files like images and PDFs, provider-specific adapters handle the formatting. See PDF Support and Multimodal Chat.
Best Practices
1. Use Appropriate File Types
// Good: Use structured data formats for data
files: ["./data.json", "./config.yaml"];
// Avoid: Using unstructured text for structured data
files: ["./data.txt"]; // Harder for AI to parse
2. Combine Related Files
// Good: Group related files together
const result = await neurolink.generate({
input: {
text: "Review this module for best practices",
files: [
"./src/module.ts", // Implementation
"./src/module.test.ts", // Tests
"./src/module.types.ts", // Types
],
},
});
3. Be Mindful of Token Limits
// For large files, consider chunking or summarization
import { ProcessorConfig } from "@juspay/neurolink";
// Enable automatic truncation for very large files
ProcessorConfig.setTruncation({
enabled: true,
maxTokens: 50000,
strategy: "head-tail", // Keep beginning and end
});
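The "head-tail" idea, keeping the beginning and end and dropping the middle, can be sketched as below. This is an illustration rather than the library's truncation code: `truncateHeadTail` is hypothetical, and it counts characters as a rough stand-in for tokens.

```typescript
// Hypothetical helper: keeps the start and end of the text within maxChars.
// Note: if maxChars is smaller than the marker, only the marker is returned.
function truncateHeadTail(
  text: string,
  maxChars: number,
  marker = "\n...[truncated]...\n",
): string {
  if (text.length <= maxChars) return text;
  // Reserve room for the marker, then split the rest between head and tail.
  const budget = Math.max(0, maxChars - marker.length);
  const headLength = Math.ceil(budget / 2);
  const tailLength = budget - headLength;
  const head = text.slice(0, headLength);
  // slice(text.length - 0) would be empty anyway, but the guard makes intent explicit.
  const tail = tailLength > 0 ? text.slice(text.length - tailLength) : "";
  return head + marker + tail;
}

const longText = "a".repeat(50) + "b".repeat(50);
console.log(truncateHeadTail(longText, 30, "..."));
```

Keeping both ends tends to suit source files and logs, where imports or headers sit at the top and the most recent or concluding content sits at the bottom.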
4. Use Specific Prompts
// Good: Be specific about what to analyze
const result = await neurolink.generate({
input: {
text: "Find security vulnerabilities in this code, focusing on SQL injection and XSS",
files: ["./src/api.ts"],
},
});
// Less effective: Vague prompt
const vagueResult = await neurolink.generate({
input: {
text: "Look at this",
files: ["./src/api.ts"],
},
});
Extending the System
Creating a Custom Processor
import {
BaseFileProcessor,
FileInfo,
ProcessedFile,
ProcessorRegistry,
} from "@juspay/neurolink";
class ProtobufProcessor extends BaseFileProcessor {
readonly name = "protobuf-processor";
readonly supportedMimeTypes = ["application/x-protobuf"];
readonly supportedExtensions = [".proto"];
canProcess(file: FileInfo): boolean {
return file.extension === ".proto";
}
async process(file: FileInfo, content: Buffer): Promise<ProcessedFile> {
const protoText = content.toString("utf-8");
// Add syntax highlighting hints
const formatted = `\`\`\`protobuf\n${protoText}\n\`\`\``;
return {
type: "text",
content: formatted,
metadata: {
processor: this.name,
language: "protobuf",
originalFilename: file.filename,
},
};
}
getInfo() {
return {
name: this.name,
description: "Processes Protocol Buffer definition files",
supportedMimeTypes: this.supportedMimeTypes,
supportedExtensions: this.supportedExtensions,
};
}
}
// Register with priority 50 (lower = higher precedence)
ProcessorRegistry.getInstance().register(new ProtobufProcessor(), 50);
Related Documentation
- Multimodal Chat - Image and media handling
- PDF Support - PDF-specific features
- CSV Support - CSV processing details
- CLI Commands - CLI file options
- SDK API Reference - Full API documentation