Neurolink Workflow Engine - High-Level Design (HLD)
Version: 1.0
Date: November 28, 2025
Status: Implementation Complete
Author: Neurolink Team
📋 Executive Summary
The Neurolink Workflow Engine is a new subsystem that enables advanced AI orchestration patterns through multi-model ensembles and judge-based scoring. It extends Neurolink's existing provider abstraction to support complex workflows in which multiple AI models collaborate and a judge model evaluates their responses to produce higher-quality outputs.
Current Phase: Testing & Evaluation - workflows return original responses with scores for AB testing.
Key Value Propositions
- 🎯 Improved Accuracy: Leverage multiple models to cross-validate responses
- ⚖️ Objective Evaluation: Use judge models to score and select best responses (0-100 scale)
- 📊 Comprehensive Logging: Detailed metrics for AB testing and workflow evaluation
- 🔧 Declarative Configuration: Define workflows as composable configs
- 💰 Cost Transparency: Track ensemble performance and costs
🎯 Goals & Non-Goals
Goals (Testing Phase)
- Enable Multi-Model Workflows: Run N models in parallel for the same prompt
- Intelligent Evaluation: Use judge models to score (0-100) and rank responses
- Comprehensive Logging: Detailed metrics for AB testing and evaluation
- Original Output: Return best response unchanged for production safety
- Cost Transparency: Provide clear cost/performance metrics
- Seamless Integration: Work with existing Neurolink provider layer
Non-Goals (Phase 1 - Testing)
- ❌ Response conditioning/modification (deferred until testing validates workflows)
- ❌ Streaming workflow execution (deferred to Phase 2)
- ❌ Stateful/resumable workflows (deferred to Phase 2)
- ❌ DAG-based workflow chaining (deferred to Phase 3)
- ❌ Human-in-the-loop approval steps (deferred to Phase 3)
- ❌ Workflow versioning/migration (deferred to Phase 3)
🏗️ Architecture Overview
System Context
┌─────────────────────────────────────────────────────────────┐
│ Neurolink SDK │
│ │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ NeuroLink │────────▶│ Workflow Engine │◀─────┐ │
│ │ Class │ └─────────────────┘ │ │
│ └──────────────┘ │ │ │
│ │ ▼ │ │
│ │ ┌──────────────┐ │ │
│ │ │ Workflow │ │ │
│ │ │ Registry │ │ │
│ │ └──────────────┘ │ │
│ │ │ │ │
│ ▼ ▼ │ │
│ ┌──────────────┐ ┌──────────────┐ │ │
│ │ AI Provider │ │ Ensemble │────────┘ │
│ │ Factory │◀────────│ Executor │ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ │ │ Judge │ │
│ │ │ Scorer │ │
│ │ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ BaseProvider Layer │ │
│ │ (OpenAI, Anthropic, Google, etc.) │ │
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Component Hierarchy
src/lib/types/
└── workflowTypes.ts # All workflow type definitions (centralized)
src/lib/workflow/
├── index.ts # Public API exports
├── types.ts # Re-exports from types/workflowTypes.ts
├── config.ts # Configuration schemas & defaults
│
├── core/
│ ├── workflowRunner.ts # Main orchestrator
│ ├── workflowRegistry.ts # Workflow template registry
│ ├── ensembleExecutor.ts # Multi-model parallel execution
│ ├── judgeScorer.ts # Judge model scoring
│ └── responseConditioner.ts # Response post-processing
│
├── workflows/ # Built-in workflow implementations
│ ├── consensusWorkflow.ts # 3-5 models + judge
│ ├── fallbackWorkflow.ts # Sequential fallback chain
│ ├── multiJudgeWorkflow.ts # Multiple judges with voting
│ └── adaptiveWorkflow.ts # Dynamic model selection
│
└── utils/
├── workflowValidation.ts # Config validation
└── workflowMetrics.ts # Performance tracking
🔄 Workflow Execution Flow
High-Level Process
┌────────────────────────────────────────────────────────────┐
│ 1. USER REQUEST │
│ neuro.generate({ │
│ workflowConfig: { workflowId: 'consensus-3' }, │
│ input: { text: 'Explain quantum computing' } │
│ }) │
└────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────┐
│ 2. WORKFLOW RESOLUTION │
│ - Load workflow config from registry │
│ - Validate configuration │
│ - Apply runtime overrides (if any) │
└────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────┐
│ 3. ENSEMBLE EXECUTION (Parallel) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Model 1 │ │ Model 2 │ │ Model 3 │ │
│ │ GPT-4o │ │ Claude │ │ Gemini │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │
│ └─────────────┴─────────────┘ │
│ ↓ │
│ [Response 1, Response 2, Response 3] │
└────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────┐
│ 4. JUDGE SCORING (Optional) │
│ - Format responses for judge evaluation │
│ - Call judge model with structured schema │
│   - Parse scores: { resp1: 85, resp2: 92, resp3: 78 }     │
│ - Rank/select best response │
└────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────┐
│ 5. RESPONSE CONDITIONING (Optional) │
│ - Calculate confidence score │
│ - Adjust tone based on confidence │
│ - Add metadata (models used, scores, timing) │
│ - Format final response │
└────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────┐
│ 6. RETURN WORKFLOW RESULT │
│ { │
│ content: "Quantum computing is...", │
│ confidence: 0.92, │
│ ensembleResponses: [...], │
│ judgeScores: {...}, │
│ totalTime: 3421 │
│ } │
└────────────────────────────────────────────────────────────┘
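The six-step flow above can be condensed into a short TypeScript sketch. The names here (`runWorkflowSketch`, the injected `generate` and `judge` callbacks) are illustrative stand-ins for the real WorkflowRunner, EnsembleExecutor, and JudgeScorer components, not the actual API:

```typescript
// Simplified stand-ins for the workflow module's types.
type EnsembleResponse = { model: string; content: string; timeMs: number };
type WorkflowResult = {
  content: string;
  confidence: number;
  ensembleResponses: EnsembleResponse[];
  judgeScores?: Record<string, number>;
  totalTime: number;
};

async function runWorkflowSketch(
  prompt: string,
  models: string[],
  generate: (model: string, prompt: string) => Promise<string>,
  judge?: (responses: EnsembleResponse[]) => Promise<Record<string, number>>,
): Promise<WorkflowResult> {
  const start = Date.now();
  // Step 3: ensemble execution — all models run in parallel.
  const ensembleResponses = await Promise.all(
    models.map(async (model) => {
      const t0 = Date.now();
      const content = await generate(model, prompt);
      return { model, content, timeMs: Date.now() - t0 };
    }),
  );
  // Step 4: optional judge scoring — pick the highest-scored response.
  let judgeScores: Record<string, number> | undefined;
  let best = ensembleResponses[0];
  if (judge) {
    const scores = await judge(ensembleResponses);
    judgeScores = scores;
    best = ensembleResponses.reduce((a, b) =>
      (scores[b.model] ?? 0) > (scores[a.model] ?? 0) ? b : a,
    );
  }
  // Steps 5-6: testing phase — return the best response unchanged,
  // with scores (0-100) normalized into a 0-1 confidence.
  const confidence = judgeScores ? (judgeScores[best.model] ?? 0) / 100 : 0.5;
  return {
    content: best.content,
    confidence,
    ensembleResponses,
    judgeScores,
    totalTime: Date.now() - start,
  };
}
```

Note how the testing-phase contract is preserved: the winning response is returned unchanged, with evaluation data attached as metadata.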
🧩 Core Components
1. Workflow Runner
Purpose: Main orchestrator that executes workflows end-to-end
Responsibilities:
- Load and validate workflow configurations
- Coordinate ensemble → judge → conditioning pipeline
- Handle errors and partial failures
- Aggregate results with comprehensive metrics
Key Methods:
class WorkflowRunner {
async execute(
config: WorkflowConfig,
input: WorkflowInput,
): Promise<WorkflowResult>;
async executeWithRetry(
config: WorkflowConfig,
input: WorkflowInput,
retries: number,
): Promise<WorkflowResult>;
}
2. Workflow Registry
Purpose: Manage workflow templates (built-in + custom)
Responsibilities:
- Store workflow configurations
- Provide workflow discovery API
- Validate configs before registration
- Support workflow CRUD operations
Key Methods:
class WorkflowRegistry {
register(config: WorkflowConfig): void;
get(id: string): WorkflowConfig | undefined;
list(): WorkflowConfig[];
validate(config: WorkflowConfig): ValidationResult;
}
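A minimal in-memory sketch of this interface (the validation shown checks only required fields; the real registry presumably validates the full config schema):

```typescript
// Simplified config type for the sketch; the real WorkflowConfig is richer.
type WorkflowConfig = { id: string; name: string; type: string };
type ValidationResult = { valid: boolean; errors: string[] };

class WorkflowRegistry {
  private workflows = new Map<string, WorkflowConfig>();

  register(config: WorkflowConfig): void {
    const result = this.validate(config);
    if (!result.valid) {
      throw new Error(`Invalid workflow config: ${result.errors.join(", ")}`);
    }
    this.workflows.set(config.id, config);
  }

  get(id: string): WorkflowConfig | undefined {
    return this.workflows.get(id);
  }

  list(): WorkflowConfig[] {
    return [...this.workflows.values()];
  }

  // Illustrative minimal validation: required fields only.
  validate(config: WorkflowConfig): ValidationResult {
    const errors: string[] = [];
    if (!config.id) errors.push("id is required");
    if (!config.name) errors.push("name is required");
    return { valid: errors.length === 0, errors };
  }
}
```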
3. Ensemble Executor
Purpose: Execute multiple models in parallel
Responsibilities:
- Create provider instances for each model
- Execute requests concurrently via Promise.all()
- Collect responses with timing/usage data
- Handle individual model failures gracefully
Key Methods:
class EnsembleExecutor {
async execute(
models: ModelConfig[],
input: string,
): Promise<EnsembleResponse[]>;
async executeWithTimeout(
models: ModelConfig[],
input: string,
timeout: number,
): Promise<EnsembleResponse[]>;
}
Integration Points:
- Uses AIProviderFactory.createProvider() for model instantiation
- Calls BaseProvider.generate() for each model
- Leverages existing analytics from core/analytics.ts
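Graceful failure handling can be sketched with `Promise.allSettled` plus a per-model timeout race. `callModel` below is a hypothetical stand-in for the `BaseProvider.generate()` call:

```typescript
type ModelConfig = { provider: string; model: string; timeout?: number };
type EnsembleResponse =
  | { ok: true; model: string; content: string; timeMs: number }
  | { ok: false; model: string; error: string };

async function executeEnsemble(
  models: ModelConfig[],
  input: string,
  callModel: (m: ModelConfig, input: string) => Promise<string>,
): Promise<EnsembleResponse[]> {
  const settled = await Promise.allSettled(
    models.map(async (m) => {
      const t0 = Date.now();
      const timeoutMs = m.timeout ?? 30_000;
      let timer: ReturnType<typeof setTimeout> | undefined;
      try {
        // Race the provider call against a per-model timeout.
        const content = await Promise.race([
          callModel(m, input),
          new Promise<never>((_, reject) => {
            timer = setTimeout(() => reject(new Error("timeout")), timeoutMs);
          }),
        ]);
        return { model: m.model, content, timeMs: Date.now() - t0 };
      } finally {
        if (timer !== undefined) clearTimeout(timer);
      }
    }),
  );
  // One failed or timed-out model does not fail the whole ensemble.
  return settled.map((r, i) =>
    r.status === "fulfilled"
      ? { ok: true as const, ...r.value }
      : { ok: false as const, model: models[i].model, error: String(r.reason) },
  );
}
```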
4. Judge Scorer
Purpose: Evaluate and rank ensemble responses
Responsibilities:
- Format ensemble results for judge evaluation
- Call judge model with structured output schema
- Parse scores/rankings from judge response
- Support multiple scoring strategies (numeric, ranking, best-pick)
Key Methods:
class JudgeScorer {
async score(
responses: EnsembleResponse[],
judgeConfig: JudgeConfig,
): Promise<JudgeScores>;
async scoreMultiJudge(
responses: EnsembleResponse[],
judgeConfigs: JudgeConfig[],
): Promise<MultiJudgeScores>;
}
Scoring Strategies:
- Numeric Scoring: Return 0-100 scores for each response
- Ranking: Order responses from best to worst
- Best Pick: Select single best response with reasoning
- Multi-Judge Voting: Average scores from multiple judges
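Multi-judge voting reduces to a per-response weighted average across judges. A sketch (per-judge weights are an assumption for illustration — the HLD specifies only averaging; scores use the 0-100 scale):

```typescript
type JudgeScores = Record<string, number>; // responseId -> score (0-100)

// Average each response's score across judges, with optional judge weights.
function averageJudgeScores(
  perJudge: JudgeScores[],
  weights?: number[],
): JudgeScores {
  const totals: Record<string, { sum: number; weight: number }> = {};
  perJudge.forEach((scores, i) => {
    const w = weights?.[i] ?? 1; // unweighted by default
    for (const [id, score] of Object.entries(scores)) {
      const entry = (totals[id] ??= { sum: 0, weight: 0 });
      entry.sum += score * w;
      entry.weight += w;
    }
  });
  const averaged: JudgeScores = {};
  for (const [id, { sum, weight }] of Object.entries(totals)) {
    averaged[id] = sum / weight;
  }
  return averaged;
}
```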
5. Response Conditioner
Purpose: Post-process responses based on confidence
Responsibilities:
- Calculate overall confidence score
- Adjust tone based on confidence level
- Add structured metadata
- Format final user-facing response
Key Methods:
class ResponseConditioner {
async condition(
response: string,
confidence: number,
config: ConditioningConfig,
): Promise<ConditionedResponse>;
calculateConfidence(scores: JudgeScores, consensus: number): number;
}
Conditioning Rules:
- High confidence (>0.8): Direct, assertive language
- Medium confidence (0.5-0.8): Balanced, qualified language
- Low confidence (<0.5): Tentative, exploratory language
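The rules above map directly to a threshold function. The confidence blend below (70% best judge score, 30% consensus) is an illustrative formula, not the engine's actual weighting:

```typescript
type Tone = "assertive" | "qualified" | "tentative";

// Thresholds mirror the conditioning rules: boundaries at 0.5 and 0.8.
function toneForConfidence(confidence: number): Tone {
  if (confidence > 0.8) return "assertive";
  if (confidence >= 0.5) return "qualified";
  return "tentative";
}

// Illustrative confidence calculation: blend the best judge score
// (0-100, normalized to 0-1) with the ensemble consensus level (0-1).
function calculateConfidence(
  scores: Record<string, number>,
  consensus: number,
): number {
  const best = Math.max(...Object.values(scores));
  return 0.7 * (best / 100) + 0.3 * consensus;
}
```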
📊 Data Models
WorkflowConfig
type WorkflowConfig = {
id: string; // Unique identifier
name: string; // Human-readable name
description?: string; // Workflow purpose
type: WorkflowType; // 'ensemble' | 'chain' | 'adaptive'
models: ModelConfig[]; // Ensemble models
  judge?: JudgeConfig; // Optional judge configuration
  judges?: JudgeConfig[]; // Optional multiple judges (multi-judge workflows)
conditioning?: ConditioningConfig; // Optional conditioning
execution?: ExecutionConfig; // Execution settings
metadata?: Record<string, unknown>; // Custom metadata
};
ModelConfig
type ModelConfig = {
provider: AIProviderName; // e.g., 'openai', 'anthropic'
model: string; // e.g., 'gpt-4o', 'claude-sonnet-4-6'
weight?: number; // Weight for voting (0-1)
temperature?: number; // Model temperature
maxTokens?: number; // Max response tokens
systemPrompt?: string; // Custom system prompt
timeout?: number; // Per-model timeout (ms)
};
JudgeConfig
type JudgeConfig = {
provider: AIProviderName; // Judge model provider
model: string; // Judge model name
criteria: string[]; // Evaluation criteria
outputFormat: JudgeOutputFormat; // 'scores' | 'ranking' | 'best'
systemPrompt?: string; // Custom judge prompt
blindEvaluation?: boolean; // Hide provider names
};
WorkflowResult
type WorkflowResult = {
content: string; // Final conditioned response
ensembleResponses: EnsembleResponse[]; // All model responses
judgeScores?: JudgeScores; // Judge evaluation
selectedResponse?: string; // Selected best response
confidence: number; // Overall confidence (0-1)
totalTime: number; // Total execution time (ms)
workflow: string; // Workflow ID used
usage?: AggregatedUsage; // Token usage across all models
analytics?: WorkflowAnalytics; // Detailed analytics
metadata?: Record<string, unknown>; // Custom metadata
};
🔌 Integration Points
With Existing Neurolink Infrastructure
1. AIProviderFactory
// Workflow uses existing factory for provider creation
const provider = await AIProviderFactory.createProvider(
modelConfig.provider,
modelConfig.model,
);
2. BaseProvider
// All models use standard generate() method
const result = await provider.generate({
input: { text: prompt },
temperature: modelConfig.temperature,
systemPrompt: modelConfig.systemPrompt,
});
3. Analytics & Evaluation
// Workflow aggregates existing analytics
const analytics = createAnalytics(provider, model, result, time);
const evaluation = await evaluateResponse(query, response);
4. NeuroLink Class Extension
// Workflow execution via generate() method
export class NeuroLink {
async generate(
options: GenerateOptions & { workflowConfig?: WorkflowGenerateOptions },
): Promise<GenerateResult | WorkflowResult> {
    if (options.workflowConfig) {
      const workflow = workflowRegistry.get(options.workflowConfig.workflowId);
      if (!workflow) {
        throw new Error(
          `Unknown workflow: ${options.workflowConfig.workflowId}`,
        );
      }
      return await workflowRunner.execute(workflow, options);
    }
// ... existing generate logic
}
}
// Standalone registry functions
import {
registerWorkflow,
listWorkflows,
getWorkflow,
} from "@juspay/neurolink/workflow";
registerWorkflow(config);
const workflows = listWorkflows();
const workflow = getWorkflow("consensus-3");
🎨 Built-in Workflows
1. Consensus Workflow (consensus-3)
Purpose: Cross-validate responses across 3 models with judge scoring
{
id: 'consensus-3',
name: 'Three Model Consensus',
type: 'ensemble',
models: [
{ provider: 'openai', model: 'gpt-4o' },
{ provider: 'anthropic', model: 'claude-sonnet-4-6' },
{ provider: 'google-ai', model: 'gemini-3-flash-preview' }
],
judge: {
provider: 'openai',
model: 'gpt-4o',
criteria: ['accuracy', 'clarity', 'completeness'],
outputFormat: 'best'
},
conditioning: {
useConfidence: true,
toneAdjustment: 'neutral'
}
}
Use Cases: High-stakes decisions, factual queries, technical explanations
2. Fast Fallback Workflow (fast-fallback)
Purpose: Try fast model first, fallback to powerful model if needed
{
id: 'fast-fallback',
name: 'Fast with Quality Fallback',
type: 'chain',
models: [
{ provider: 'google-ai', model: 'gemini-3-flash-preview', timeout: 5000 },
{ provider: 'anthropic', model: 'claude-sonnet-4-6', timeout: 10000 }
],
conditioning: {
useConfidence: true,
metadata: { strategy: 'fast-first' }
}
}
Use Cases: Cost optimization, performance-sensitive applications
3. Quality Max Workflow (quality-max)
Purpose: Maximum quality with dual powerful models
{
id: 'quality-max',
name: 'Maximum Quality Ensemble',
type: 'ensemble',
models: [
{ provider: 'openai', model: 'gpt-4o', temperature: 0.3 },
{ provider: 'anthropic', model: 'claude-sonnet-4-6', temperature: 0.3 }
],
judge: {
provider: 'anthropic',
model: 'claude-sonnet-4-6',
criteria: ['depth', 'reasoning', 'accuracy', 'safety'],
outputFormat: 'scores'
},
conditioning: {
useConfidence: true,
toneAdjustment: 'strengthen'
}
}
Use Cases: Research, analysis, critical business decisions
4. Multi-Judge Workflow (multi-judge-5)
Purpose: Use multiple judges to eliminate bias
{
id: 'multi-judge-5',
name: 'Multi-Judge Consensus',
type: 'ensemble',
models: [
{ provider: 'openai', model: 'gpt-4o' },
{ provider: 'anthropic', model: 'claude-sonnet-4-6' },
{ provider: 'google-ai', model: 'gemini-3-pro-preview' }
],
judges: [ // Multiple judges
{ provider: 'openai', model: 'gpt-4o', criteria: ['accuracy'] },
{ provider: 'anthropic', model: 'claude-sonnet-4-6', criteria: ['safety'] }
],
conditioning: {
useConfidence: true,
toneAdjustment: 'neutral'
}
}
Use Cases: Bias-sensitive applications, fairness requirements
📈 Performance Characteristics
Expected Latency
| Workflow Type | Models | Judge | Expected Latency | Cost Multiplier |
|---|---|---|---|---|
| Consensus-3 | 3 | 1 | 3-5 seconds | 4x |
| Fast-Fallback | 1-2 | 0 | 1-3 seconds | 1-2x |
| Quality-Max | 2 | 1 | 3-4 seconds | 3x |
| Multi-Judge-5 | 3 | 2 | 4-6 seconds | 5x |
Optimization Strategies
- Parallel Execution: All ensemble models run concurrently
- Timeout Controls: Per-model timeout prevents hanging
- Early Termination: Optional "first N responses" mode
- Model Selection: Lightweight models for speed, powerful for quality
- Concurrency Control: p-limit for controlled parallel execution
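The concurrency-control strategy (the document names p-limit) boils down to capping the number of in-flight tasks. A dependency-free sketch of the same idea:

```typescript
// Minimal hand-rolled equivalent of p-limit: at most `max` tasks in flight.
function createLimiter(max: number) {
  let active = 0;
  const queue: (() => void)[] = [];
  return async function limit<T>(task: () => Promise<T>): Promise<T> {
    if (active >= max) {
      // Queue until a finishing task hands us its slot.
      await new Promise<void>((resolve) => queue.push(resolve));
      // Slot was transferred; `active` already counts it.
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      const next = queue.shift();
      if (next) {
        next(); // hand our slot directly to the next queued task
      } else {
        active--;
      }
    }
  };
}
```

Usage: wrap each per-model call as `limit(() => callModel(m, input))` so an ensemble of many models never exceeds the configured parallelism.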
🔒 Security & Safety
Input Validation
- Validate workflow configs before execution
- Sanitize user inputs before passing to models
- Enforce token limits per model
- Validate judge output schemas
Cost Controls
- Pre-execution cost estimation
- Per-workflow budget limits
- Cost tracking and alerting
- Rate limiting on workflow execution
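Pre-execution cost estimation can be as simple as summing expected token usage against per-model rates. The rate shape and figures below are placeholders for illustration, not real pricing:

```typescript
// Hypothetical per-model rate card: USD per 1M tokens.
type ModelRate = { inputPerMTok: number; outputPerMTok: number };

// Estimate workflow cost: every ensemble model sees the same input and
// produces roughly the same expected output length.
function estimateWorkflowCost(
  models: { rate: ModelRate }[],
  expectedInputTokens: number,
  expectedOutputTokens: number,
): number {
  return models.reduce(
    (total, { rate }) =>
      total +
      (expectedInputTokens / 1e6) * rate.inputPerMTok +
      (expectedOutputTokens / 1e6) * rate.outputPerMTok,
    0,
  );
}
```

An estimate like this can be checked against a per-workflow budget limit before any provider call is made.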
Error Handling
- Graceful degradation on partial failures
- Retry logic with exponential backoff
- Detailed error logging and metrics
- Fallback to single-model execution
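Retry with exponential backoff, as listed above, in sketch form (the base delay and cap are illustrative defaults, not the engine's actual values):

```typescript
async function withRetry<T>(
  fn: () => Promise<T>,
  retries = 3,
  baseDelayMs = 250,
  maxDelayMs = 5_000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === retries) break;
      // Exponential backoff: 250ms, 500ms, 1000ms, ... capped at maxDelayMs.
      const delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```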
📊 Observability
Metrics to Track
- Execution Metrics
  - Total workflow execution time
  - Per-model response time
  - Judge scoring time
  - Ensemble success rate
- Quality Metrics
  - Judge scores distribution
  - Consensus levels
  - Confidence scores
  - Response variation
- Cost Metrics
  - Total tokens used
  - Cost per workflow
  - Cost breakdown by model
  - Budget utilization
- Error Metrics
  - Model failure rate
  - Timeout frequency
  - Validation errors
  - Retry attempts
Logging
- Structured JSON logs for all workflow executions
- Debug mode for detailed execution traces
- Performance profiling for optimization
- Audit trail for compliance
🚀 API Design
Public API
// Import from main package
import { NeuroLink, WorkflowConfig } from "@juspay/neurolink";
import {
registerWorkflow,
listWorkflows,
getWorkflow,
} from "@juspay/neurolink/workflow";
// Initialize
const neuro = new NeuroLink();
// Execute built-in workflow (TESTING PHASE)
const result = await neuro.generate({
workflowConfig: {
workflowId: "consensus-3",
timeout: 30000,
},
input: { text: "Explain machine learning" },
});
// Result contains original response + evaluation metrics
console.log(result.content); // Original best response (unchanged)
console.log(result.score); // 87 (out of 100)
console.log(result.reasoning); // "Clear and accurate explanation"
// Detailed metrics for AB testing
console.log(result.ensembleResponses); // All 3 model responses
console.log(result.judgeScores); // Individual scores
console.log(result.confidence); // 0.87
console.log(result.totalTime); // 3200ms
// Register custom workflow using standalone function
registerWorkflow({
id: "custom-workflow",
name: "My Custom Workflow",
type: "ensemble",
models: [
{ provider: "openai", model: "gpt-4o" },
{ provider: "anthropic", model: "claude-sonnet-4-6" },
],
});
// Execute custom workflow
const customResult = await neuro.generate({
workflowConfig: { workflowId: "custom-workflow" },
input: { text: "Custom query" },
});
// List available workflows (standalone function)
const workflows = listWorkflows();
// Get workflow details (standalone function)
const workflowConfig = getWorkflow("consensus-3");
🎯 Success Criteria
Phase 1 (MVP)
- ✅ Support 3+ ensemble models running in parallel
- ✅ Implement judge-based scoring with structured output
- ✅ Response conditioning with confidence-based tone adjustment
- ✅ 3 built-in workflows (consensus, fallback, quality-max)
- ✅ Custom workflow registration API
- ✅ Comprehensive analytics and metrics
- ✅ Full TypeScript type safety
- ✅ Integration tests with real providers
Performance Targets
- Latency: <5 seconds for 3-model ensemble
- Success Rate: >95% workflow completion
- Cost Accuracy: Cost estimates within ±5% of actual spend
- Error Recovery: Gracefully handle failure of up to 2 of 3 ensemble models
Documentation
- High-Level Design (this document)
- Low-Level Design with implementation details
- API Reference documentation
- Tutorial with 5+ examples
- Migration guide for existing users
🔮 Future Enhancements (Post-MVP)
Phase 2: Streaming & Advanced Patterns
- Streaming Workflows: Progressive results with streamWorkflow()
- Workflow State Management: Persistent workflow state
- Async Workflows: Background execution with callbacks
- Workflow Chaining: Connect workflows in pipelines
Phase 3: Enterprise Features
- DAG-based Workflows: Complex multi-stage orchestration
- Human-in-the-Loop: Manual approval/judging steps
- Workflow Versioning: Manage workflow evolution
- A/B Testing: Compare workflow performance
- Workflow Marketplace: Share and discover workflows
Phase 4: Advanced Intelligence
- Adaptive Workflows: Auto-select models based on query
- Self-Improving Workflows: Learn from past executions
- Cost Optimization: Auto-route to cheapest viable models
- Quality Prediction: Predict confidence before execution
📝 Appendix
Glossary
- Ensemble: Running multiple models in parallel for the same input
- Judge Model: AI model that evaluates and scores responses
- Conditioning: Post-processing response based on metadata/confidence
- Workflow: Declarative configuration of ensemble + judge + conditioning
- Consensus: Agreement level between ensemble models
- Confidence: Calculated metric representing response reliability
Assumptions
- All providers support concurrent requests
- Judge models support structured output (Zod schemas)
- Sufficient API rate limits for parallel execution
- Network latency is manageable (<1s per model)
Constraints
- Maximum 10 models per ensemble (performance/cost)
- Maximum 3 judges per workflow (complexity)
- Minimum 2 models for meaningful ensemble
- Judge model must differ from ensemble models (bias prevention)
Document Status: ✅ Approved for Implementation
Next Step: Low-Level Design (LLD) document