Neurolink Workflow Engine - High-Level Design (HLD)
Version: 1.0
Date: November 28, 2025
Status: Implementation Complete
Author: Neurolink Team
📋 Executive Summary
The Neurolink Workflow Engine is a new subsystem that enables advanced AI orchestration patterns through multi-model ensembles and judge-based scoring. It extends Neurolink's existing provider abstraction to support complex workflows in which multiple AI models collaborate, with judge-based evaluation, to produce higher-quality outputs.
Current Phase: Testing & Evaluation - workflows return original responses with scores for A/B testing.
Key Value Propositions
- 🎯 Improved Accuracy: Leverage multiple models to cross-validate responses
- ⚖️ Objective Evaluation: Use judge models to score and select best responses (0-10 scale)
- 📊 Comprehensive Logging: Detailed metrics for A/B testing and workflow evaluation
- 🔧 Declarative Configuration: Define workflows as composable configs
- 💰 Cost Transparency: Track ensemble performance and costs
🎯 Goals & Non-Goals
Goals (Testing Phase)
- Enable Multi-Model Workflows: Run N models in parallel for the same prompt
- Intelligent Evaluation: Use judge models to score (0-10) and rank responses
- Comprehensive Logging: Detailed metrics for A/B testing and evaluation
- Original Output: Return best response unchanged for production safety
- Cost Transparency: Provide clear cost/performance metrics
- Seamless Integration: Work with existing Neurolink provider layer
Non-Goals (Phase 1 - Testing)
- ❌ Response conditioning/modification (deferred until testing validates workflows)
- ❌ Streaming workflow execution (deferred to Phase 2)
- ❌ Stateful/resumable workflows (deferred to Phase 2)
- ❌ DAG-based workflow chaining (deferred to Phase 3)
- ❌ Human-in-the-loop approval steps (deferred to Phase 3)
- ❌ Workflow versioning/migration (deferred to Phase 3)
🏗️ Architecture Overview
System Context
┌─────────────────────────────────────────────────────────────┐
│ Neurolink SDK │
│ │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ NeuroLink │────────▶│ Workflow Engine │◀─────┐ │
│ │ Class │ └─────────────────┘ │ │
│ └──────────────┘ │ │ │
│ │ ▼ │ │
│ │ ┌──────────────┐ │ │
│ │ │ Workflow │ │ │
│ │ │ Registry │ │ │
│ │ └──────────────┘ │ │
│ │ │ │ │
│ ▼ ▼ │ │
│ ┌──────────────┐ ┌──────────────┐ │ │
│ │ AI Provider │ │ Ensemble │────────┘ │
│ │ Factory │◀────────│ Executor │ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ │ │ Judge │ │
│ │ │ Scorer │ │
│ │ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ BaseProvider Layer │ │
│ │ (OpenAI, Anthropic, Google, etc.) │ │
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Component Hierarchy
src/lib/types/
└── workflowTypes.ts # All workflow type definitions (centralized)
src/lib/workflow/
├── index.ts # Public API exports
├── types.ts # Re-exports from types/workflowTypes.ts
├── config.ts # Configuration schemas & defaults
│
├── core/
│ ├── workflowRunner.ts # Main orchestrator
│ ├── workflowRegistry.ts # Workflow template registry
│ ├── ensembleExecutor.ts # Multi-model parallel execution
│ ├── judgeScorer.ts # Judge model scoring
│ └── responseConditioner.ts # Response post-processing
│
├── workflows/ # Built-in workflow implementations
│ ├── consensusWorkflow.ts # 3-5 models + judge
│ ├── fallbackWorkflow.ts # Sequential fallback chain
│ ├── multiJudgeWorkflow.ts # Multiple judges with voting
│ └── adaptiveWorkflow.ts # Dynamic model selection
│
└── utils/
├── workflowValidation.ts # Config validation
└── workflowMetrics.ts # Performance tracking
🔄 Workflow Execution Flow
High-Level Process
┌────────────────────────────────────────────────────────────┐
│ 1. USER REQUEST │
│ neuro.generate({ │
│ workflowConfig: { workflowId: 'consensus-3' }, │
│ input: { text: 'Explain quantum computing' } │
│ }) │
└────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────┐
│ 2. WORKFLOW RESOLUTION │
│ - Load workflow config from registry │
│ - Validate configuration │
│ - Apply runtime overrides (if any) │
└────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────┐
│ 3. ENSEMBLE EXECUTION (Parallel) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Model 1 │ │ Model 2 │ │ Model 3 │ │
│ │ GPT-4o │ │ Claude │ │ Gemini │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │
│ └─────────────┴─────────────┘ │
│ ↓ │
│ [Response 1, Response 2, Response 3] │
└────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────┐
│ 4. JUDGE SCORING (Optional) │
│ - Format responses for judge evaluation │
│ - Call judge model with structured schema │
│ - Parse scores: { resp1: 8.5, resp2: 9.2, resp3: 7.8 } │
│ - Rank/select best response │
└────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────┐
│ 5. RESPONSE CONDITIONING (Optional) │
│ - Calculate confidence score │
│ - Adjust tone based on confidence │
│ - Add metadata (models used, scores, timing) │
│ - Format final response │
└────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────┐
│ 6. RETURN WORKFLOW RESULT │
│ { │
│ content: "Quantum computing is...", │
│ confidence: 0.92, │
│ ensembleResponses: [...], │
│ judgeScores: {...}, │
│ totalTime: 3421 │
│ } │
└────────────────────────────────────────────────────────────┘
🧩 Core Components
1. Workflow Runner
Purpose: Main orchestrator that executes workflows end-to-end
Responsibilities:
- Load and validate workflow configurations
- Coordinate ensemble → judge → conditioning pipeline
- Handle errors and partial failures
- Aggregate results with comprehensive metrics
Key Methods:
class WorkflowRunner {
  async execute(
    config: WorkflowConfig,
    input: WorkflowInput,
  ): Promise<WorkflowResult>;

  async executeWithRetry(
    config: WorkflowConfig,
    input: WorkflowInput,
    retries: number,
  ): Promise<WorkflowResult>;
}
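As a sketch under the assumption that `execute()` rejects on failure, `executeWithRetry()` can be a thin wrapper; the `run` callback below stands in for a bound `WorkflowRunner.execute` call and is not the actual implementation:

```typescript
// Sketch: re-run a workflow on failure, up to `retries` extra attempts.
// `run` is a stand-in for a bound WorkflowRunner.execute call.
async function executeWithRetry<T>(
  run: () => Promise<T>,
  retries: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await run();
    } catch (err) {
      lastError = err; // remember the failure and try the next attempt
    }
  }
  throw lastError; // all attempts failed; surface the last error
}
```

The caller sees either the first successful result or the final error, so partial failures inside earlier attempts never leak out.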
2. Workflow Registry
Purpose: Manage workflow templates (built-in + custom)
Responsibilities:
- Store workflow configurations
- Provide workflow discovery API
- Validate configs before registration
- Support workflow CRUD operations
Key Methods:
class WorkflowRegistry {
  register(config: WorkflowConfig): void;
  get(id: string): WorkflowConfig | undefined;
  list(): WorkflowConfig[];
  validate(config: WorkflowConfig): ValidationResult;
}
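A minimal in-memory sketch of these methods; the types are trimmed stand-ins, and `validate()` here checks only the fields every workflow needs, an illustrative assumption rather than the real validation rules:

```typescript
// Simplified stand-ins for the real workflow types.
type WorkflowConfig = { id: string; name: string; models: unknown[] };
type ValidationResult = { valid: boolean; errors: string[] };

class WorkflowRegistry {
  private configs = new Map<string, WorkflowConfig>();

  // Checks only the universally required fields (illustrative).
  validate(config: WorkflowConfig): ValidationResult {
    const errors: string[] = [];
    if (!config.id) errors.push("id is required");
    if (!config.name) errors.push("name is required");
    if (!config.models || config.models.length === 0) {
      errors.push("at least one model is required");
    }
    return { valid: errors.length === 0, errors };
  }

  // Validation happens before registration, as described above.
  register(config: WorkflowConfig): void {
    const result = this.validate(config);
    if (!result.valid) throw new Error(result.errors.join("; "));
    this.configs.set(config.id, config);
  }

  get(id: string): WorkflowConfig | undefined {
    return this.configs.get(id);
  }

  list(): WorkflowConfig[] {
    return [...this.configs.values()];
  }
}
```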
3. Ensemble Executor
Purpose: Execute multiple models in parallel
Responsibilities:
- Create provider instances for each model
- Execute requests concurrently via Promise.all()
- Collect responses with timing/usage data
- Handle individual model failures gracefully
Key Methods:
class EnsembleExecutor {
  async execute(
    models: ModelConfig[],
    input: string,
  ): Promise<EnsembleResponse[]>;

  async executeWithTimeout(
    models: ModelConfig[],
    input: string,
    timeout: number,
  ): Promise<EnsembleResponse[]>;
}
Integration Points:
- Uses AIProviderFactory.createProvider() for model instantiation
- Calls BaseProvider.generate() for each model
- Leverages existing analytics from core/analytics.ts
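The parallel execution and failure handling described above can be sketched as follows; `ModelConfig` and `EnsembleResponse` are simplified stand-ins for the real types, and `generate()` is injected in place of `BaseProvider.generate()` so the logic is self-contained:

```typescript
// Simplified stand-ins for the real workflow types.
type ModelConfig = { provider: string; model: string; timeout?: number };
type EnsembleResponse = {
  provider: string;
  model: string;
  content?: string;
  error?: string;
  latencyMs: number;
};

async function executeEnsemble(
  models: ModelConfig[],
  input: string,
  generate: (m: ModelConfig, input: string) => Promise<string>,
): Promise<EnsembleResponse[]> {
  // Promise.allSettled runs all models concurrently and never rejects,
  // so one failing model cannot take down the whole ensemble.
  const settled = await Promise.allSettled(
    models.map(async (m) => {
      const start = Date.now();
      const content = await new Promise<string>((resolve, reject) => {
        // Per-model timeout guard; cleared as soon as the call settles.
        const timer = setTimeout(
          () => reject(new Error("timeout")),
          m.timeout ?? 30_000,
        );
        generate(m, input).then(
          (v) => { clearTimeout(timer); resolve(v); },
          (e) => { clearTimeout(timer); reject(e); },
        );
      });
      return {
        provider: m.provider,
        model: m.model,
        content,
        latencyMs: Date.now() - start,
      };
    }),
  );
  // Rejected models become error entries with the reason preserved.
  return settled.map((r, i) =>
    r.status === "fulfilled"
      ? r.value
      : {
          provider: models[i].provider,
          model: models[i].model,
          error: String(r.reason),
          latencyMs: 0,
        },
  );
}
```

Using `Promise.allSettled` rather than `Promise.all` is what makes "handle individual model failures gracefully" possible: failed models are reported alongside successful ones instead of aborting the ensemble.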
4. Judge Scorer
Purpose: Evaluate and rank ensemble responses
Responsibilities:
- Format ensemble results for judge evaluation
- Call judge model with structured output schema
- Parse scores/rankings from judge response
- Support multiple scoring strategies (numeric, ranking, best-pick)
Key Methods:
class JudgeScorer {
  async score(
    responses: EnsembleResponse[],
    judgeConfig: JudgeConfig,
  ): Promise<JudgeScores>;

  async scoreMultiJudge(
    responses: EnsembleResponse[],
    judgeConfigs: JudgeConfig[],
  ): Promise<MultiJudgeScores>;
}
</class>
Scoring Strategies:
- Numeric Scoring: Return 0-10 scores for each response
- Ranking: Order responses from best to worst
- Best Pick: Select single best response with reasoning
- Multi-Judge Voting: Average scores from multiple judges
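The multi-judge voting strategy can be sketched as a weighted average over per-judge score maps; the `weights` parameter and the map shape keyed by response id are illustrative assumptions, not the real JudgeScores type:

```typescript
// Simplified stand-in: scores keyed by response id, e.g. { "0": 8.5, "1": 9.2 }.
type JudgeScores = Record<string, number>;

// Average scores across judges, with optional per-judge weights (default: equal).
function averageJudgeScores(
  perJudge: JudgeScores[],
  weights?: number[],
): JudgeScores {
  const w = weights ?? perJudge.map(() => 1);
  const totalWeight = w.reduce((a, b) => a + b, 0);
  const combined: JudgeScores = {};
  perJudge.forEach((scores, j) => {
    for (const [id, score] of Object.entries(scores)) {
      combined[id] = (combined[id] ?? 0) + (score * w[j]) / totalWeight;
    }
  });
  return combined;
}

// Best-pick over the combined scores (highest average wins).
function pickBest(scores: JudgeScores): string {
  return Object.entries(scores).reduce((best, cur) =>
    cur[1] > best[1] ? cur : best,
  )[0];
}
```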
5. Response Conditioner
Purpose: Post-process responses based on confidence
Responsibilities:
- Calculate overall confidence score
- Adjust tone based on confidence level
- Add structured metadata
- Format final user-facing response
Key Methods:
class ResponseConditioner {
  async condition(
    response: string,
    confidence: number,
    config: ConditioningConfig,
  ): Promise<ConditionedResponse>;

  calculateConfidence(scores: JudgeScores, consensus: number): number;
}
Conditioning Rules:
- High confidence (>0.8): Direct, assertive language
- Medium confidence (0.5-0.8): Balanced, qualified language
- Low confidence (<0.5): Tentative, exploratory language
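These rules, and the `calculateConfidence()` signature above, can be sketched as below; the 50/50 blend of normalized judge score (0-10 scale) and consensus ratio is an illustrative assumption, not the actual formula:

```typescript
type Tone = "assertive" | "balanced" | "tentative";

// Maps a confidence value to a tone label using the thresholds in the rules above.
function toneForConfidence(confidence: number): Tone {
  if (confidence > 0.8) return "assertive"; // high: direct, assertive language
  if (confidence >= 0.5) return "balanced"; // medium: balanced, qualified language
  return "tentative"; // low: tentative, exploratory language
}

// Hypothetical confidence blend: mean judge score normalized from 0-10 to [0, 1],
// averaged with an inter-model consensus ratio that is already in [0, 1].
function calculateConfidence(judgeScores: number[], consensus: number): number {
  const mean = judgeScores.reduce((a, b) => a + b, 0) / judgeScores.length;
  return (mean / 10 + consensus) / 2;
}
```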
📊 Data Models
WorkflowConfig
type WorkflowConfig = {
  id: string; // Unique identifier
  name: string; // Human-readable name
  description?: string; // Workflow purpose
  type: WorkflowType; // 'ensemble' | 'chain' | 'adaptive'
  models: ModelConfig[]; // Ensemble models
  judge?: JudgeConfig; // Optional judge configuration
  conditioning?: ConditioningConfig; // Optional conditioning
  execution?: ExecutionConfig; // Execution settings
  metadata?: Record<string, unknown>; // Custom metadata
};
ModelConfig
type ModelConfig = {
  provider: AIProviderName; // e.g., 'openai', 'anthropic'
  model: string; // e.g., 'gpt-4o', 'claude-3-5-sonnet'
  weight?: number; // Weight for voting (0-1)
  temperature?: number; // Model temperature
  maxTokens?: number; // Max response tokens
  systemPrompt?: string; // Custom system prompt
  timeout?: number; // Per-model timeout (ms)
};
JudgeConfig
type JudgeConfig = {
  provider: AIProviderName; // Judge model provider
  model: string; // Judge model name
  criteria: string[]; // Evaluation criteria
  outputFormat: JudgeOutputFormat; // 'scores' | 'ranking' | 'best'
  systemPrompt?: string; // Custom judge prompt
  blindEvaluation?: boolean; // Hide provider names
};
WorkflowResult
type WorkflowResult = {
  content: string; // Final conditioned response
  ensembleResponses: EnsembleResponse[]; // All model responses
  judgeScores?: JudgeScores; // Judge evaluation
  selectedResponse?: string; // Selected best response
  confidence: number; // Overall confidence (0-1)
  totalTime: number; // Total execution time (ms)
  workflow: string; // Workflow ID used
  usage?: AggregatedUsage; // Token usage across all models
  analytics?: WorkflowAnalytics; // Detailed analytics
  metadata?: Record<string, unknown>; // Custom metadata
};
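Populating the `usage` field can be sketched as a simple reduction over per-model usage records; the field names below are illustrative, not the real AggregatedUsage type:

```typescript
// Illustrative per-model usage record (not the real type).
type Usage = { inputTokens: number; outputTokens: number };

// Sum token usage across all ensemble (and judge) calls into one record.
function aggregateUsage(perModel: Usage[]): Usage & { totalTokens: number } {
  const sum = perModel.reduce(
    (acc, u) => ({
      inputTokens: acc.inputTokens + u.inputTokens,
      outputTokens: acc.outputTokens + u.outputTokens,
    }),
    { inputTokens: 0, outputTokens: 0 },
  );
  return { ...sum, totalTokens: sum.inputTokens + sum.outputTokens };
}
```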
🔌 Integration Points
With Existing Neurolink Infrastructure
1. AIProviderFactory
// Workflow uses existing factory for provider creation
const provider = await AIProviderFactory.createProvider(
  modelConfig.provider,
  modelConfig.model,
);
2. BaseProvider
// All models use standard generate() method
const result = await provider.generate({
  input: { text: prompt },
  temperature: modelConfig.temperature,
  systemPrompt: modelConfig.systemPrompt,
});
3. Analytics & Evaluation
// Workflow aggregates existing analytics
const analytics = createAnalytics(provider, model, result, time);
const evaluation = await evaluateResponse(query, response);
4. NeuroLink Class Extension
// Workflow execution via generate() method
export class NeuroLink {
  async generate(
    options: GenerateOptions & { workflowConfig?: WorkflowGenerateOptions },
  ): Promise<GenerateResult | WorkflowResult> {
    if (options.workflowConfig) {
      const workflow = workflowRegistry.get(options.workflowConfig.workflowId);
      if (!workflow) {
        throw new Error(`Unknown workflow: ${options.workflowConfig.workflowId}`);
      }
      return await workflowRunner.execute(workflow, options);
    }
    // ... existing generate logic
  }
}
// Standalone registry functions
import {
  registerWorkflow,
  listWorkflows,
  getWorkflow,
} from "@juspay/neurolink/workflow";
registerWorkflow(config);
const workflows = listWorkflows();
const workflow = getWorkflow("consensus-3");
🎨 Built-in Workflows
1. Consensus Workflow (consensus-3)
Purpose: Cross-validate responses across 3 models with judge scoring
{
  id: 'consensus-3',
  name: 'Three Model Consensus',
  type: 'ensemble',
  models: [
    { provider: 'openai', model: 'gpt-4o' },
    { provider: 'anthropic', model: 'claude-3-5-sonnet' },
    { provider: 'google-ai', model: 'gemini-2.5-flash' }
  ],
  judge: {
    provider: 'openai',
    model: 'gpt-4o',
    criteria: ['accuracy', 'clarity', 'completeness'],
    outputFormat: 'best'
  },
  conditioning: {
    useConfidence: true,
    toneAdjustment: 'neutral'
  }
}
Use Cases: High-stakes decisions, factual queries, technical explanations
2. Fast Fallback Workflow (fast-fallback)
Purpose: Try fast model first, fallback to powerful model if needed
{
  id: 'fast-fallback',
  name: 'Fast with Quality Fallback',
  type: 'chain',
  models: [
    { provider: 'google-ai', model: 'gemini-2.5-flash', timeout: 5000 },
    { provider: 'anthropic', model: 'claude-3-5-sonnet', timeout: 10000 }
  ],
  conditioning: {
    useConfidence: true,
    metadata: { strategy: 'fast-first' }
  }
}
Use Cases: Cost optimization, performance-sensitive applications
3. Quality Max Workflow (quality-max)
Purpose: Maximum quality with dual powerful models
{
  id: 'quality-max',
  name: 'Maximum Quality Ensemble',
  type: 'ensemble',
  models: [
    { provider: 'openai', model: 'gpt-4o', temperature: 0.3 },
    { provider: 'anthropic', model: 'claude-3-5-sonnet', temperature: 0.3 }
  ],
  judge: {
    provider: 'anthropic',
    model: 'claude-3-5-sonnet',
    criteria: ['depth', 'reasoning', 'accuracy', 'safety'],
    outputFormat: 'scores'
  },
  conditioning: {
    useConfidence: true,
    toneAdjustment: 'strengthen'
  }
}
Use Cases: Research, analysis, critical business decisions
4. Multi-Judge Workflow (multi-judge-5)
Purpose: Use multiple judges to eliminate bias
{
  id: 'multi-judge-5',
  name: 'Multi-Judge Consensus',
  type: 'ensemble',
  models: [
    { provider: 'openai', model: 'gpt-4o' },
    { provider: 'anthropic', model: 'claude-3-5-sonnet' },
    { provider: 'google-ai', model: 'gemini-2.5-pro' }
  ],
  judges: [ // Multiple judges instead of a single judge
    { provider: 'openai', model: 'gpt-4o', criteria: ['accuracy'] },
    { provider: 'anthropic', model: 'claude-3-5-sonnet', criteria: ['safety'] }
  ],
  conditioning: {
    useConfidence: true,
    toneAdjustment: 'neutral'
  }
}
Use Cases: Bias-sensitive applications, fairness requirements
📈 Performance Characteristics
Expected Latency
| Workflow Type | Models | Judge | Expected Latency | Cost Multiplier |
|---|---|---|---|---|
| Consensus-3 | 3 | 1 | 3-5 seconds | 4x |
| Fast-Fallback | 1-2 | 0 | 1-3 seconds | 1-2x |
| Quality-Max | 2 | 1 | 3-4 seconds | 3x |
| Multi-Judge-5 | 3 | 2 | 4-6 seconds | 5x |
Optimization Strategies
- Parallel Execution: All ensemble models run concurrently
- Timeout Controls: Per-model timeout prevents hanging
- Early Termination: Optional "first N responses" mode
- Model Selection: Lightweight models for speed, powerful for quality
- Concurrency Control: p-limit for controlled parallel execution
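The p-limit style of concurrency control can be sketched with a self-contained limiter (a minimal stand-in for the p-limit package, not its actual implementation):

```typescript
// Minimal stand-in for p-limit: cap how many ensemble calls run at once.
function createLimiter(max: number) {
  let active = 0;
  const queue: Array<() => void> = [];
  const next = () => {
    active--;
    queue.shift()?.(); // start the next queued task, if any
  };
  return function limit<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const run = () => {
        active++;
        fn().then(resolve, reject).finally(next);
      };
      if (active < max) run();
      else queue.push(run); // over the cap: wait for a slot
    });
  };
}
```

Wrapping each ensemble call in `limit(() => provider.generate(...))` keeps at most `max` model requests in flight, which bounds provider rate-limit pressure without serializing the whole ensemble.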