
Neurolink Workflow Engine - High-Level Design (HLD)

Version: 1.0
Date: November 28, 2025
Status: Implementation Complete
Author: Neurolink Team


📋 Executive Summary

The Neurolink Workflow Engine is a new subsystem that enables advanced AI orchestration patterns through multi-model ensembles and judge-based scoring. It extends Neurolink's existing provider abstraction to support complex workflows in which multiple AI models collaborate, with judge-based evaluation, to produce higher-quality outputs.

Current Phase: Testing & Evaluation - workflows return original responses with scores for AB testing.

Key Value Propositions

  • 🎯 Improved Accuracy: Leverage multiple models to cross-validate responses
  • ⚖️ Objective Evaluation: Use judge models to score and select best responses (0-100 scale)
  • 📊 Comprehensive Logging: Detailed metrics for AB testing and workflow evaluation
  • 🔧 Declarative Configuration: Define workflows as composable configs
  • 💰 Cost Transparency: Track ensemble performance and costs

🎯 Goals & Non-Goals

Goals (Testing Phase)

  1. Enable Multi-Model Workflows: Run N models in parallel for the same prompt
  2. Intelligent Evaluation: Use judge models to score (0-100) and rank responses
  3. Comprehensive Logging: Detailed metrics for AB testing and evaluation
  4. Original Output: Return best response unchanged for production safety
  5. Cost Transparency: Provide clear cost/performance metrics
  6. Seamless Integration: Work with existing Neurolink provider layer

Non-Goals (Phase 1 - Testing)

  • ❌ Response conditioning/modification (deferred until testing validates workflows)
  • ❌ Streaming workflow execution (deferred to Phase 2)
  • ❌ Stateful/resumable workflows (deferred to Phase 2)
  • ❌ DAG-based workflow chaining (deferred to Phase 3)
  • ❌ Human-in-the-loop approval steps (deferred to Phase 3)
  • ❌ Workflow versioning/migration (deferred to Phase 3)

🏗️ Architecture Overview

System Context

┌─────────────────────────────────────────────────────────────┐
│ Neurolink SDK │
│ │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ NeuroLink │────────▶│ Workflow Engine │◀─────┐ │
│ │ Class │ └─────────────────┘ │ │
│ └──────────────┘ │ │ │
│ │ ▼ │ │
│ │ ┌──────────────┐ │ │
│ │ │ Workflow │ │ │
│ │ │ Registry │ │ │
│ │ └──────────────┘ │ │
│ │ │ │ │
│ ▼ ▼ │ │
│ ┌──────────────┐ ┌──────────────┐ │ │
│ │ AI Provider │ │ Ensemble │────────┘ │
│ │ Factory │◀────────│ Executor │ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ │ │ Judge │ │
│ │ │ Scorer │ │
│ │ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ BaseProvider Layer │ │
│ │ (OpenAI, Anthropic, Google, etc.) │ │
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

Component Hierarchy

src/lib/types/
└── workflowTypes.ts # All workflow type definitions (centralized)

src/lib/workflow/
├── index.ts # Public API exports
├── types.ts # Re-exports from types/workflowTypes.ts
├── config.ts # Configuration schemas & defaults

├── core/
│ ├── workflowRunner.ts # Main orchestrator
│ ├── workflowRegistry.ts # Workflow template registry
│ ├── ensembleExecutor.ts # Multi-model parallel execution
│ ├── judgeScorer.ts # Judge model scoring
│ └── responseConditioner.ts # Response post-processing

├── workflows/ # Built-in workflow implementations
│ ├── consensusWorkflow.ts # 3-5 models + judge
│ ├── fallbackWorkflow.ts # Sequential fallback chain
│ ├── multiJudgeWorkflow.ts # Multiple judges with voting
│ └── adaptiveWorkflow.ts # Dynamic model selection

└── utils/
    ├── workflowValidation.ts # Config validation
    └── workflowMetrics.ts # Performance tracking

🔄 Workflow Execution Flow

High-Level Process

┌────────────────────────────────────────────────────────────┐
│ 1. USER REQUEST │
│ neuro.generate({ │
│ workflowConfig: { workflowId: 'consensus-3' }, │
│ input: { text: 'Explain quantum computing' } │
│ }) │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│ 2. WORKFLOW RESOLUTION │
│ - Load workflow config from registry │
│ - Validate configuration │
│ - Apply runtime overrides (if any) │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│ 3. ENSEMBLE EXECUTION (Parallel) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Model 1 │ │ Model 2 │ │ Model 3 │ │
│ │ GPT-4o │ │ Claude │ │ Gemini │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │
│ └─────────────┴─────────────┘ │
│ ↓ │
│ [Response 1, Response 2, Response 3] │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│ 4. JUDGE SCORING (Optional) │
│ - Format responses for judge evaluation │
│ - Call judge model with structured schema │
│ - Parse scores: { resp1: 85, resp2: 92, resp3: 78 } │
│ - Rank/select best response │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│ 5. RESPONSE CONDITIONING (Optional) │
│ - Calculate confidence score │
│ - Adjust tone based on confidence │
│ - Add metadata (models used, scores, timing) │
│ - Format final response │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│ 6. RETURN WORKFLOW RESULT │
│ { │
│ content: "Quantum computing is...", │
│ confidence: 0.92, │
│ ensembleResponses: [...], │
│ judgeScores: {...}, │
│ totalTime: 3421 │
│ } │
└────────────────────────────────────────────────────────────┘

🧩 Core Components

1. Workflow Runner

Purpose: Main orchestrator that executes workflows end-to-end

Responsibilities:

  • Load and validate workflow configurations
  • Coordinate ensemble → judge → conditioning pipeline
  • Handle errors and partial failures
  • Aggregate results with comprehensive metrics

Key Methods:

class WorkflowRunner {
  async execute(
    config: WorkflowConfig,
    input: WorkflowInput,
  ): Promise<WorkflowResult>;

  async executeWithRetry(
    config: WorkflowConfig,
    input: WorkflowInput,
    retries: number,
  ): Promise<WorkflowResult>;
}

2. Workflow Registry

Purpose: Manage workflow templates (built-in + custom)

Responsibilities:

  • Store workflow configurations
  • Provide workflow discovery API
  • Validate configs before registration
  • Support workflow CRUD operations

Key Methods:

class WorkflowRegistry {
  register(config: WorkflowConfig): void;
  get(id: string): WorkflowConfig | undefined;
  list(): WorkflowConfig[];
  validate(config: WorkflowConfig): ValidationResult;
}

3. Ensemble Executor

Purpose: Execute multiple models in parallel

Responsibilities:

  • Create provider instances for each model
  • Execute requests concurrently via Promise.all()
  • Collect responses with timing/usage data
  • Handle individual model failures gracefully

Key Methods:

class EnsembleExecutor {
  async execute(
    models: ModelConfig[],
    input: string,
  ): Promise<EnsembleResponse[]>;

  async executeWithTimeout(
    models: ModelConfig[],
    input: string,
    timeout: number,
  ): Promise<EnsembleResponse[]>;
}

Integration Points:

  • Uses AIProviderFactory.createProvider() for model instantiation
  • Calls BaseProvider.generate() for each model
  • Leverages existing analytics from core/analytics.ts
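
The fan-out described above can be sketched in a few lines. This is an illustrative simplification, not the actual `EnsembleExecutor`: `ModelConfig` is reduced to two fields and the provider call is passed in as a plain function standing in for `BaseProvider.generate()`. Using `Promise.allSettled` rather than `Promise.all` is what makes individual model failures non-fatal.

```typescript
// Sketch of parallel ensemble execution with graceful per-model failure handling.
type ModelConfig = { provider: string; model: string };

type EnsembleResponse = {
  model: ModelConfig;
  content?: string;
  error?: string;
  responseTime: number; // ms
};

async function executeEnsemble(
  models: ModelConfig[],
  generate: (m: ModelConfig) => Promise<string>, // stand-in for BaseProvider.generate()
): Promise<EnsembleResponse[]> {
  // allSettled (rather than all) keeps partial results when one model fails
  const settled = await Promise.allSettled(
    models.map(async (model) => {
      const start = Date.now();
      const content = await generate(model);
      return { model, content, responseTime: Date.now() - start };
    }),
  );
  return settled.map((result, i) =>
    result.status === "fulfilled"
      ? result.value
      : { model: models[i], error: String(result.reason), responseTime: 0 },
  );
}
```

A failed model thus yields an entry with `error` set instead of rejecting the whole ensemble, which is what allows the judge to score whatever responses did arrive.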

4. Judge Scorer

Purpose: Evaluate and rank ensemble responses

Responsibilities:

  • Format ensemble results for judge evaluation
  • Call judge model with structured output schema
  • Parse scores/rankings from judge response
  • Support multiple scoring strategies (numeric, ranking, best-pick)

Key Methods:

class JudgeScorer {
  async score(
    responses: EnsembleResponse[],
    judgeConfig: JudgeConfig,
  ): Promise<JudgeScores>;

  async scoreMultiJudge(
    responses: EnsembleResponse[],
    judgeConfigs: JudgeConfig[],
  ): Promise<MultiJudgeScores>;
}

Scoring Strategies:

  1. Numeric Scoring: Return 0-100 scores for each response
  2. Ranking: Order responses from best to worst
  3. Best Pick: Select single best response with reasoning
  4. Multi-Judge Voting: Average scores from multiple judges
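
Strategy 4 reduces to a per-response mean across judges. A minimal sketch of that aggregation, assuming equal judge weights (the real `JudgeConfig` may carry per-judge weights) and the 0-100 scale used elsewhere in this document:

```typescript
// Multi-judge voting: average each response's score across all judges.
type JudgeScores = Record<string, number>; // responseId -> score (0-100)

function averageJudgeScores(perJudge: JudgeScores[]): JudgeScores {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const scores of perJudge) {
    for (const [id, score] of Object.entries(scores)) {
      sums[id] ??= { total: 0, count: 0 };
      sums[id].total += score;
      sums[id].count += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(sums).map(([id, s]) => [id, s.total / s.count]),
  );
}
```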

5. Response Conditioner

Purpose: Post-process responses based on confidence

Responsibilities:

  • Calculate overall confidence score
  • Adjust tone based on confidence level
  • Add structured metadata
  • Format final user-facing response

Key Methods:

class ResponseConditioner {
  async condition(
    response: string,
    confidence: number,
    config: ConditioningConfig,
  ): Promise<ConditionedResponse>;

  calculateConfidence(scores: JudgeScores, consensus: number): number;
}

Conditioning Rules:

  • High confidence (>0.8): Direct, assertive language
  • Medium confidence (0.5-0.8): Balanced, qualified language
  • Low confidence (<0.5): Tentative, exploratory language
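
The three rules above amount to a small threshold lookup. The labels `"assertive"`, `"qualified"`, and `"tentative"` below are illustrative names, not Neurolink's actual `ConditioningConfig` values:

```typescript
// Map a 0-1 confidence score to a tone bucket per the conditioning rules.
type Tone = "assertive" | "qualified" | "tentative";

function toneForConfidence(confidence: number): Tone {
  if (confidence > 0.8) return "assertive"; // high confidence: direct language
  if (confidence >= 0.5) return "qualified"; // medium: balanced, qualified
  return "tentative"; // low: tentative, exploratory
}
```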

📊 Data Models

WorkflowConfig

type WorkflowConfig = {
  id: string; // Unique identifier
  name: string; // Human-readable name
  description?: string; // Workflow purpose
  type: WorkflowType; // 'ensemble' | 'chain' | 'adaptive'

  models: ModelConfig[]; // Ensemble models
  judge?: JudgeConfig; // Optional judge configuration
  conditioning?: ConditioningConfig; // Optional conditioning
  execution?: ExecutionConfig; // Execution settings

  metadata?: Record<string, unknown>; // Custom metadata
};

ModelConfig

type ModelConfig = {
  provider: AIProviderName; // e.g., 'openai', 'anthropic'
  model: string; // e.g., 'gpt-4o', 'claude-3-5-sonnet'
  weight?: number; // Weight for voting (0-1)
  temperature?: number; // Model temperature
  maxTokens?: number; // Max response tokens
  systemPrompt?: string; // Custom system prompt
  timeout?: number; // Per-model timeout (ms)
};

JudgeConfig

type JudgeConfig = {
  provider: AIProviderName; // Judge model provider
  model: string; // Judge model name
  criteria: string[]; // Evaluation criteria
  outputFormat: JudgeOutputFormat; // 'scores' | 'ranking' | 'best'
  systemPrompt?: string; // Custom judge prompt
  blindEvaluation?: boolean; // Hide provider names
};

WorkflowResult

type WorkflowResult = {
  content: string; // Final conditioned response

  ensembleResponses: EnsembleResponse[]; // All model responses
  judgeScores?: JudgeScores; // Judge evaluation
  selectedResponse?: string; // Selected best response

  confidence: number; // Overall confidence (0-1)
  totalTime: number; // Total execution time (ms)
  workflow: string; // Workflow ID used

  usage?: AggregatedUsage; // Token usage across all models
  analytics?: WorkflowAnalytics; // Detailed analytics
  metadata?: Record<string, unknown>; // Custom metadata
};
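
The `usage` field rolls up token counts across every model call in the workflow. A hypothetical sketch of that aggregation; the field names `inputTokens`/`outputTokens` mirror common provider usage objects but are an assumption here, not the definition of `AggregatedUsage`:

```typescript
// Sum per-model token usage into a single workflow-level usage record.
type Usage = { inputTokens: number; outputTokens: number };

function aggregateUsage(usages: Usage[]): Usage & { totalTokens: number } {
  const inputTokens = usages.reduce((sum, u) => sum + u.inputTokens, 0);
  const outputTokens = usages.reduce((sum, u) => sum + u.outputTokens, 0);
  return { inputTokens, outputTokens, totalTokens: inputTokens + outputTokens };
}
```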

🔌 Integration Points

1. AIProviderFactory

// Workflow uses existing factory for provider creation
const provider = await AIProviderFactory.createProvider(
  modelConfig.provider,
  modelConfig.model,
);

2. BaseProvider

// All models use standard generate() method
const result = await provider.generate({
input: { text: prompt },
temperature: modelConfig.temperature,
systemPrompt: modelConfig.systemPrompt,
});

3. Analytics & Evaluation

// Workflow aggregates existing analytics
const analytics = createAnalytics(provider, model, result, time);
const evaluation = await evaluateResponse(query, response);

4. NeuroLink Class

// Workflow execution via generate() method
export class NeuroLink {
  async generate(
    options: GenerateOptions & { workflowConfig?: WorkflowGenerateOptions },
  ): Promise<GenerateResult | WorkflowResult> {
    if (options.workflowConfig) {
      const workflow = workflowRegistry.get(options.workflowConfig.workflowId);
      return await workflowRunner.execute(workflow, options);
    }
    // ... existing generate logic
  }
}

// Standalone registry functions
import {
  registerWorkflow,
  listWorkflows,
  getWorkflow,
} from "@juspay/neurolink/workflow";

registerWorkflow(config);
const workflows = listWorkflows();
const workflow = getWorkflow("consensus-3");

🎨 Built-in Workflows

1. Consensus Workflow (consensus-3)

Purpose: Cross-validate responses across 3 models with judge scoring

{
  id: 'consensus-3',
  name: 'Three Model Consensus',
  type: 'ensemble',
  models: [
    { provider: 'openai', model: 'gpt-4o' },
    { provider: 'anthropic', model: 'claude-3-5-sonnet' },
    { provider: 'google-ai', model: 'gemini-2.5-flash' }
  ],
  judge: {
    provider: 'openai',
    model: 'gpt-4o',
    criteria: ['accuracy', 'clarity', 'completeness'],
    outputFormat: 'best'
  },
  conditioning: {
    useConfidence: true,
    toneAdjustment: 'neutral'
  }
}

Use Cases: High-stakes decisions, factual queries, technical explanations


2. Fast Fallback Workflow (fast-fallback)

Purpose: Try fast model first, fallback to powerful model if needed

{
  id: 'fast-fallback',
  name: 'Fast with Quality Fallback',
  type: 'chain',
  models: [
    { provider: 'google-ai', model: 'gemini-2.5-flash', timeout: 5000 },
    { provider: 'anthropic', model: 'claude-3-5-sonnet', timeout: 10000 }
  ],
  conditioning: {
    useConfidence: true,
    metadata: { strategy: 'fast-first' }
  }
}

Use Cases: Cost optimization, performance-sensitive applications


3. Quality Max Workflow (quality-max)

Purpose: Maximum quality with dual powerful models

{
  id: 'quality-max',
  name: 'Maximum Quality Ensemble',
  type: 'ensemble',
  models: [
    { provider: 'openai', model: 'gpt-4o', temperature: 0.3 },
    { provider: 'anthropic', model: 'claude-3-5-sonnet', temperature: 0.3 }
  ],
  judge: {
    provider: 'anthropic',
    model: 'claude-3-5-sonnet',
    criteria: ['depth', 'reasoning', 'accuracy', 'safety'],
    outputFormat: 'scores'
  },
  conditioning: {
    useConfidence: true,
    toneAdjustment: 'strengthen'
  }
}

Use Cases: Research, analysis, critical business decisions


4. Multi-Judge Workflow (multi-judge-5)

Purpose: Use multiple judges to eliminate bias

{
  id: 'multi-judge-5',
  name: 'Multi-Judge Consensus',
  type: 'ensemble',
  models: [
    { provider: 'openai', model: 'gpt-4o' },
    { provider: 'anthropic', model: 'claude-3-5-sonnet' },
    { provider: 'google-ai', model: 'gemini-2.5-pro' }
  ],
  judges: [ // Multiple judges
    { provider: 'openai', model: 'gpt-4o', criteria: ['accuracy'] },
    { provider: 'anthropic', model: 'claude-3-5-sonnet', criteria: ['safety'] }
  ],
  conditioning: {
    useConfidence: true,
    toneAdjustment: 'neutral'
  }
}

Use Cases: Bias-sensitive applications, fairness requirements


📈 Performance Characteristics

Expected Latency

Workflow Type   Models  Judges  Expected Latency  Cost Multiplier
Consensus-3     3       1       3-5 seconds       4x
Fast-Fallback   1-2     0       1-3 seconds       1-2x
Quality-Max     2       1       3-4 seconds       3x
Multi-Judge-5   3       2       4-6 seconds       5x
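
The cost multipliers follow directly from call counts: each ensemble model and each judge is one extra full generation per request. A rough sketch of that estimate (it assumes roughly equal per-call cost across models, which real pricing will not satisfy exactly):

```typescript
// Naive cost multiplier: one generation per ensemble model, one per judge.
function estimateCostMultiplier(modelCount: number, judgeCount: number): number {
  return modelCount + judgeCount;
}
// consensus-3: 3 models + 1 judge -> 4x; multi-judge-5: 3 + 2 -> 5x
```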

Optimization Strategies

  1. Parallel Execution: All ensemble models run concurrently
  2. Timeout Controls: Per-model timeout prevents hanging
  3. Early Termination: Optional "first N responses" mode
  4. Model Selection: Lightweight models for speed, powerful for quality
  5. Concurrency Control: p-limit for controlled parallel execution
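
Strategy 2 (per-model timeouts) is commonly implemented by racing the model call against a timer. A minimal sketch, not Neurolink's actual implementation:

```typescript
// Race a promise against a timeout; reject if the timer fires first.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  // Clear the timer either way so the event loop is not held open.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Note that the losing model call is not cancelled, only ignored; true cancellation would need provider-level support (e.g., an `AbortSignal`).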

🔒 Security & Safety

Input Validation

  • Validate workflow configs before execution
  • Sanitize user inputs before passing to models
  • Enforce token limits per model
  • Validate judge output schemas

Cost Controls

  • Pre-execution cost estimation
  • Per-workflow budget limits
  • Cost tracking and alerting
  • Rate limiting on workflow execution

Error Handling

  • Graceful degradation on partial failures
  • Retry logic with exponential backoff
  • Detailed error logging and metrics
  • Fallback to single-model execution
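
The retry-with-exponential-backoff behavior listed above can be sketched as a small wrapper; the attempt count and base delay below are illustrative defaults, not Neurolink's configured values:

```typescript
// Retry an async operation with exponential backoff between attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  retries = 3,
  baseDelayMs = 250,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of attempts: surface the error
      // Delays double each attempt: 250ms, 500ms, 1000ms, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}
```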

📊 Observability

Metrics to Track

  1. Execution Metrics

    • Total workflow execution time
    • Per-model response time
    • Judge scoring time
    • Ensemble success rate
  2. Quality Metrics

    • Judge scores distribution
    • Consensus levels
    • Confidence scores
    • Response variation
  3. Cost Metrics

    • Total tokens used
    • Cost per workflow
    • Cost breakdown by model
    • Budget utilization
  4. Error Metrics

    • Model failure rate
    • Timeout frequency
    • Validation errors
    • Retry attempts

Logging

  • Structured JSON logs for all workflow executions
  • Debug mode for detailed execution traces
  • Performance profiling for optimization
  • Audit trail for compliance

🚀 API Design

Public API

// Import from main package
import { NeuroLink, WorkflowConfig } from "@juspay/neurolink";
import {
  registerWorkflow,
  listWorkflows,
  getWorkflow,
} from "@juspay/neurolink/workflow";

// Initialize
const neuro = new NeuroLink();

// Execute built-in workflow (TESTING PHASE)
const result = await neuro.generate({
  workflowConfig: {
    workflowId: "consensus-3",
    timeout: 30000,
  },
  input: { text: "Explain machine learning" },
});

// Result contains original response + evaluation metrics
console.log(result.content); // Original best response (unchanged)
console.log(result.score); // 87 (out of 100)
console.log(result.reasoning); // "Clear and accurate explanation"

// Detailed metrics for AB testing
console.log(result.ensembleResponses); // All 3 model responses
console.log(result.judgeScores); // Individual scores
console.log(result.confidence); // 0.87
console.log(result.totalTime); // 3200ms

// Register custom workflow using standalone function
registerWorkflow({
  id: "custom-workflow",
  name: "My Custom Workflow",
  type: "ensemble",
  models: [
    { provider: "openai", model: "gpt-4o" },
    { provider: "anthropic", model: "claude-3-5-sonnet" },
  ],
});

// Execute custom workflow
const customResult = await neuro.generate({
  workflowConfig: { workflowId: "custom-workflow" },
  input: { text: "Custom query" },
});

// List available workflows (standalone function)
const workflows = listWorkflows();

// Get workflow details (standalone function)
const workflowConfig = getWorkflow("consensus-3");

🎯 Success Criteria

Phase 1 (MVP)

  • ✅ Support 3+ ensemble models running in parallel
  • ✅ Implement judge-based scoring with structured output
  • ✅ Response conditioning with confidence-based tone adjustment
  • ✅ 3 built-in workflows (consensus, fallback, quality-max)
  • ✅ Custom workflow registration API
  • ✅ Comprehensive analytics and metrics
  • ✅ Full TypeScript type safety
  • ✅ Integration tests with real providers

Performance Targets

  • Latency: <5 seconds for 3-model ensemble
  • Success Rate: >95% workflow completion
  • Cost Accuracy: ±5% cost estimation accuracy
  • Error Recovery: Handle 2/3 model failures gracefully

Documentation

  • High-Level Design (this document)
  • Low-Level Design with implementation details
  • API Reference documentation
  • Tutorial with 5+ examples
  • Migration guide for existing users

🔮 Future Enhancements (Post-MVP)

Phase 2: Streaming & Advanced Patterns

  • Streaming Workflows: Progressive results with streamWorkflow()
  • Workflow State Management: Persistent workflow state
  • Async Workflows: Background execution with callbacks
  • Workflow Chaining: Connect workflows in pipelines

Phase 3: Enterprise Features

  • DAG-based Workflows: Complex multi-stage orchestration
  • Human-in-the-Loop: Manual approval/judging steps
  • Workflow Versioning: Manage workflow evolution
  • A/B Testing: Compare workflow performance
  • Workflow Marketplace: Share and discover workflows

Phase 4: Advanced Intelligence

  • Adaptive Workflows: Auto-select models based on query
  • Self-Improving Workflows: Learn from past executions
  • Cost Optimization: Auto-route to cheapest viable models
  • Quality Prediction: Predict confidence before execution



📝 Appendix

Glossary

  • Ensemble: Running multiple models in parallel for the same input
  • Judge Model: AI model that evaluates and scores responses
  • Conditioning: Post-processing response based on metadata/confidence
  • Workflow: Declarative configuration of ensemble + judge + conditioning
  • Consensus: Agreement level between ensemble models
  • Confidence: Calculated metric representing response reliability

Assumptions

  1. All providers support concurrent requests
  2. Judge models support structured output (Zod schemas)
  3. Sufficient API rate limits for parallel execution
  4. Network latency is manageable (<1s per model)

Constraints

  1. Maximum 10 models per ensemble (performance/cost)
  2. Maximum 3 judges per workflow (complexity)
  3. Minimum 2 models for meaningful ensemble
  4. Judge model must differ from ensemble models (bias prevention)
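
These constraints can be checked mechanically at registration time. A sketch of such a validator, using a pared-down config shape rather than the full `WorkflowConfig`; the `ValidationResult` shape here is an assumption:

```typescript
// Validate ensemble size and judge/ensemble separation constraints.
type ValidationResult = { valid: boolean; errors: string[] };

function validateConstraints(config: {
  models: { provider: string; model: string }[];
  judge?: { provider: string; model: string };
}): ValidationResult {
  const errors: string[] = [];
  if (config.models.length < 2) errors.push("minimum 2 models for an ensemble");
  if (config.models.length > 10) errors.push("maximum 10 models per ensemble");
  const judge = config.judge;
  if (
    judge &&
    config.models.some((m) => m.provider === judge.provider && m.model === judge.model)
  ) {
    errors.push("judge model must differ from ensemble models");
  }
  return { valid: errors.length === 0, errors };
}
```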

Document Status: ✅ Approved for Implementation
Next Step: Low-Level Design (LLD) document