Auto Evaluation Engine

NeuroLink provides an automated quality gate that scores every response using an LLM-as-judge pipeline. Scores, rationales, and severity flags are surfaced in both CLI and SDK workflows so you can monitor drift and enforce minimum quality thresholds.

What It Does

  • Generates a structured evaluation payload (result.evaluation) for every call with enableEvaluation: true.
  • Calculates relevance, accuracy, completeness, and an overall score (1–10) using a RAGAS-style rubric.
  • Supports retry loops: re-ask the provider when the score falls below your threshold.
  • Emits analytics-friendly JSON so you can pipe results into dashboards.
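
The retry-loop behavior above can be sketched as a small wrapper. The result and evaluation shapes below are assumptions modeled on the payload shown in Quick Start, not the SDK's actual types; adapt the wrapper to your real generate call:

```typescript
// Minimal retry-loop sketch: re-ask the provider when the overall score
// falls below a threshold. `Evaluation` / `GenerateResult` are assumed
// shapes based on the example payload in this page.
interface Evaluation {
  overall: number;
  isPassing: boolean;
}

interface GenerateResult {
  text: string;
  evaluation?: Evaluation;
}

async function generateWithRetry(
  generate: () => Promise<GenerateResult>,
  threshold = 7,
  maxAttempts = 2,
): Promise<GenerateResult> {
  let last!: GenerateResult;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = await generate();
    const score = last.evaluation?.overall ?? 0;
    if (score >= threshold) return last; // quality gate passed
  }
  return last; // surface the final attempt even if it failed the gate
}
```

In practice `generate` would close over your neurolink.generate call with enableEvaluation: true.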

Quick Start

import { NeuroLink } from "@juspay/neurolink";

const neurolink = new NeuroLink();

const result = await neurolink.generate({
  input: { text: "Explain quantum computing" },
  enableEvaluation: true,
  evaluationDomain: "science",
});

console.log(result.evaluation);
// { relevance: 9.0, accuracy: 8.5, completeness: 8.0, overall: 8.5, isPassing: true, ... }

LLM Costs

Evaluation uses additional AI calls to the judge model (default: gemini-2.5-flash). Each evaluated response incurs extra API costs. For high-volume production workloads, consider sampling (e.g., evaluate 10% of requests) or disabling evaluation after quality stabilizes.
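
One way to implement the sampling idea is a deterministic helper (hypothetical, not part of the NeuroLink API) that hashes a stable request id, so the same request always gets the same decision across retries:

```typescript
// Sampling sketch: evaluate only a fraction of requests to cap judge-model
// cost. Uses a 32-bit FNV-1a hash for a cheap, well-spread bucket assignment.
function shouldEvaluate(requestId: string, rate = 0.1): boolean {
  let hash = 0x811c9dc5;
  for (let i = 0; i < requestId.length; i++) {
    hash ^= requestId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  // Map the hash into [0, 1) and compare against the sampling rate
  return hash / 0x100000000 < rate;
}
```

You would then pass enableEvaluation: shouldEvaluate(requestId) per call instead of a hard-coded true.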

Usage Examples

import { NeuroLink } from "@juspay/neurolink";

// Enable orchestration for automatic provider/model selection
const neurolink = new NeuroLink({ enableOrchestration: true });

const result = await neurolink.generate({
  // Task classifier analyzes the prompt to determine the best provider
  input: { text: "Create quarterly performance summary" },
  enableEvaluation: true, // Enable LLM-as-judge quality scoring
  evaluationDomain: "Enterprise Finance", // Domain context shapes the evaluation rubric
  factoryConfig: {
    enhancementType: "domain-configuration", // Apply domain-specific prompt enhancements
    domainType: "finance",
  },
});

// Check if response passes the configured quality threshold
if (result.evaluation && !result.evaluation.isPassing) {
  console.warn("Quality gate failed", result.evaluation.details?.message);
}

Streaming with Evaluation

const stream = await neurolink.stream({
  input: { text: "Walk through the incident postmortem" },
  enableEvaluation: true, // evaluation works in streaming mode
});

let final;
for await (const chunk of stream) {
  if (chunk.evaluation) {
    // The evaluation payload arrives in the final chunks
    final = chunk.evaluation;
  }
}
console.log(final?.overallScore); // overall score (1-10) plus sub-scores

Configuration Options

  • enableEvaluation (CLI flag / request option): Turns the middleware on for this call.
  • evaluationDomain (CLI flag / request option): Provides context to the judge model (e.g., "Healthcare").
  • NEUROLINK_EVALUATION_THRESHOLD (env variable / loop session var): Minimum passing score; failures trigger retries or errors.
  • NEUROLINK_EVALUATION_MODEL (env variable / middleware config): Overrides the default judge model (gemini-2.5-flash).
  • NEUROLINK_EVALUATION_PROVIDER (env variable): Forces the judge provider (google-ai by default).
  • NEUROLINK_EVALUATION_RETRY_ATTEMPTS (env variable): Number of re-evaluation attempts before surfacing failure.
  • NEUROLINK_EVALUATION_TIMEOUT (env variable): Millisecond timeout for judge requests.
  • offTopicThreshold (middleware config): Score below which a response is flagged as off-topic.
  • highSeverityThreshold (middleware config): Score threshold for triggering high-severity alerts.

Set global defaults by exporting environment variables in your .env:

NEUROLINK_EVALUATION_PROVIDER="google-ai"
NEUROLINK_EVALUATION_MODEL="gemini-2.5-flash"
NEUROLINK_EVALUATION_THRESHOLD=7
NEUROLINK_EVALUATION_RETRY_ATTEMPTS=2
NEUROLINK_EVALUATION_TIMEOUT=15000

Loop sessions respect these values. Inside neurolink loop, use set NEUROLINK_EVALUATION_THRESHOLD 8 or unset NEUROLINK_EVALUATION_THRESHOLD to adjust the gate on the fly.
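
As an illustration of how the threshold variable might be consumed, here is a small parsing helper; it is our sketch, not NeuroLink's internal logic, and the fallback of 7 simply mirrors the .env example above:

```typescript
// Read the evaluation threshold from an env map, falling back to 7 when the
// variable is unset or not a valid number.
function evaluationThreshold(env: Record<string, string | undefined>): number {
  const raw = env.NEUROLINK_EVALUATION_THRESHOLD;
  const parsed = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(parsed) ? parsed : 7;
}
```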

Best Practices

Cost Optimization

Only enable evaluation when needed: during prompt engineering, quality regression testing, or high-stakes production calls. For routine operations, disable evaluation and rely on Analytics for zero-cost observability.

  • Pair evaluation with analytics to track cost vs. quality trends.
  • Lower the threshold during experimentation, then tighten once prompts stabilize.
  • Register a custom onEvaluationComplete handler to forward scores to BI systems.
  • Exclude massive prompts from evaluation when latency matters; analytics is zero-cost without evaluation.
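
A forwarding handler for BI systems might look like the sketch below. The handler signature and the sink transport are assumptions — this page names the onEvaluationComplete hook but not its shape — so treat this as one plausible wiring:

```typescript
// Sketch: turn an evaluation payload into a flat row for a BI/metrics sink.
// `EvaluationEvent` is an assumed shape based on the scores shown on this page.
interface EvaluationEvent {
  overall: number;
  isPassing: boolean;
  domain?: string;
}

function makeEvaluationForwarder(sink: (row: Record<string, unknown>) => void) {
  return (evaluation: EvaluationEvent) => {
    sink({
      metric: "neurolink.evaluation.overall",
      value: evaluation.overall,
      passing: evaluation.isPassing,
      domain: evaluation.domain ?? "general",
      ts: Date.now(), // timestamp for trend dashboards
    });
  };
}
```

The returned function would be registered as your onEvaluationComplete handler, with the sink being whatever transport your BI pipeline expects (HTTP, queue, log line).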

Troubleshooting

  • "Evaluation model not configured": Ensure judge-provider API keys are present, or set NEUROLINK_EVALUATION_PROVIDER.
  • CLI exits with failure: Lower NEUROLINK_EVALUATION_THRESHOLD or configure the middleware with blocking: false.
  • Evaluation takes too long: Reduce NEUROLINK_EVALUATION_RETRY_ATTEMPTS or switch to a smaller judge model (e.g., gemini-2.5-flash-lite).
  • Off-topic false positives: Lower offTopicThreshold (e.g., to 3) so fewer responses are flagged as off-topic.
  • JSON output missing evaluation block: Confirm both --format json and --enableEvaluation are set.
