
Monitoring & Observability Guide

Comprehensive monitoring for AI applications with Prometheus, Grafana, and cloud-native tools


Overview

Production AI applications require robust monitoring to track performance, costs, errors, and usage patterns. This guide covers implementing comprehensive observability using industry-standard tools and cloud-native services.

Key Metrics to Track

  • 📊 Request Metrics: Count, rate, latency percentiles
  • 💰 Cost Tracking: Token usage, per-model costs
  • ❌ Error Rates: Failures, rate limits, timeouts
  • ⚡ Performance: Latency, throughput, queue depth
  • 🎯 Model Usage: Distribution across providers/models
  • 👥 User Analytics: Per-user costs, quotas
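These per-request fields are the raw material for every metric above. As an illustrative sketch (the record shape and field names are assumptions for this guide, not part of any SDK), each call can be captured as one record and summarized for a dashboard:

```typescript
// Illustrative per-request record; field names are assumptions,
// not an SDK contract.
interface AIRequestRecord {
  provider: string; // e.g. "openai", "anthropic"
  model: string;
  latencyMs: number; // end-to-end request latency
  promptTokens: number; // input token usage
  completionTokens: number; // output token usage
  costUsd: number; // estimated cost of this call
  status: "success" | "error";
}

// Derive the headline numbers a dashboard needs from a batch of records.
function summarize(records: AIRequestRecord[]) {
  const total = records.length;
  const errors = records.filter((r) => r.status === "error").length;
  const costUsd = records.reduce((sum, r) => sum + r.costUsd, 0);
  return { total, errorRate: total ? errors / total : 0, costUsd };
}
```

In practice you rarely store these records directly; the Prometheus metrics below are aggregated views of exactly this data.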

Monitoring Stack

  • Prometheus: Metrics collection and storage
  • Grafana: Visualization and dashboards
  • CloudWatch: AWS-native monitoring
  • Application Insights: Azure monitoring
  • Cloud Logging: Google Cloud logging

Quick Start

1. Setup Prometheus

# Docker Compose setup
cat > docker-compose.yml <<EOF
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

volumes:
  prometheus-data:
  grafana-data:
EOF

# Start services
docker-compose up -d

2. Configure Prometheus

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "neurolink-api"
    static_configs:
      - targets: ["localhost:3001"] # Your API metrics endpoint

3. Add Metrics to Application

npm install prom-client

// metrics.ts
import { Registry, Counter, Histogram, Gauge } from "prom-client";

export const register = new Registry();

// Request counter
export const aiRequestsTotal = new Counter({
  name: "ai_requests_total",
  help: "Total AI requests",
  labelNames: ["provider", "model", "status"],
  registers: [register],
});

// Latency histogram
export const aiRequestDuration = new Histogram({
  name: "ai_request_duration_seconds",
  help: "AI request duration in seconds",
  labelNames: ["provider", "model"],
  buckets: [0.1, 0.5, 1, 2, 5, 10, 30],
  registers: [register],
});

// Token usage counter
export const aiTokensUsed = new Counter({
  name: "ai_tokens_used_total",
  help: "Total tokens consumed",
  labelNames: ["provider", "model", "type"],
  registers: [register],
});

// Cost tracking
export const aiCostTotal = new Counter({
  name: "ai_cost_total_usd",
  help: "Total AI cost in USD",
  labelNames: ["provider", "model"],
  registers: [register],
});

// Active requests gauge
export const aiRequestsActive = new Gauge({
  name: "ai_requests_active",
  help: "Currently active AI requests",
  labelNames: ["provider"],
  registers: [register],
});

// Error counter
export const aiErrorsTotal = new Counter({
  name: "ai_errors_total",
  help: "Total AI request errors",
  labelNames: ["provider", "model", "error_type"],
  registers: [register],
});
// app.ts
import express from "express";
import { NeuroLink } from "@juspay/neurolink";
import {
  register,
  aiRequestsTotal,
  aiRequestDuration,
  aiTokensUsed,
  aiCostTotal,
  aiRequestsActive,
  aiErrorsTotal,
} from "./metrics";

const app = express();

const ai = new NeuroLink({
  providers: [
    { name: "openai", config: { apiKey: process.env.OPENAI_API_KEY } },
    { name: "anthropic", config: { apiKey: process.env.ANTHROPIC_API_KEY } },
  ],
  onRequest: (req) => {
    aiRequestsActive.inc({ provider: req.provider });
  },
  onSuccess: (result) => {
    // Record request
    aiRequestsTotal.inc({
      provider: result.provider,
      model: result.model,
      status: "success",
    });

    // Record latency
    aiRequestDuration.observe(
      { provider: result.provider, model: result.model },
      result.latency / 1000, // Convert ms to seconds
    );

    // Record tokens
    aiTokensUsed.inc(
      { provider: result.provider, model: result.model, type: "input" },
      result.usage.promptTokens,
    );
    aiTokensUsed.inc(
      { provider: result.provider, model: result.model, type: "output" },
      result.usage.completionTokens,
    );

    // Record cost
    aiCostTotal.inc(
      { provider: result.provider, model: result.model },
      result.cost,
    );

    // Decrement active
    aiRequestsActive.dec({ provider: result.provider });
  },
  onError: (error, provider, model) => {
    // Record error
    aiErrorsTotal.inc({
      provider,
      model: model || "unknown",
      error_type: error.message.includes("rate limit")
        ? "rate_limit"
        : error.message.includes("timeout")
          ? "timeout"
          : "other",
    });

    // Record failed request
    aiRequestsTotal.inc({
      provider,
      model: model || "unknown",
      status: "error",
    });

    // Decrement active
    aiRequestsActive.dec({ provider });
  },
});

// Metrics endpoint
app.get("/metrics", async (req, res) => {
  res.setHeader("Content-Type", register.contentType);
  res.send(await register.metrics());
});

// Listen on the port Prometheus scrapes (see prometheus.yml)
app.listen(3001);

Grafana Dashboards

Create Dashboard

{
  "dashboard": {
    "title": "NeuroLink Monitoring",
    "panels": [
      {
        "title": "Requests Per Second",
        "targets": [
          {
            "expr": "rate(ai_requests_total[5m])",
            "legendFormat": "{{provider}} - {{model}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Average Latency",
        "targets": [
          {
            "expr": "rate(ai_request_duration_seconds_sum[5m]) / rate(ai_request_duration_seconds_count[5m])",
            "legendFormat": "{{provider}} - {{model}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(ai_errors_total[5m])",
            "legendFormat": "{{provider}} - {{error_type}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Hourly Cost",
        "targets": [
          {
            "expr": "rate(ai_cost_total_usd[1h]) * 3600",
            "legendFormat": "{{provider}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Token Usage",
        "targets": [
          {
            "expr": "rate(ai_tokens_used_total[5m])",
            "legendFormat": "{{provider}} - {{type}}"
          }
        ],
        "type": "graph"
      }
    ]
  }
}

Key Dashboard Panels

1. Request Rate

rate(ai_requests_total[5m])

2. P95 Latency

histogram_quantile(0.95, rate(ai_request_duration_seconds_bucket[5m]))

3. Success Rate

sum(rate(ai_requests_total{status="success"}[5m])) / sum(rate(ai_requests_total[5m])) * 100

4. Cost Per Hour

rate(ai_cost_total_usd[1h]) * 3600

5. Tokens Per Request

rate(ai_tokens_used_total[5m]) / rate(ai_requests_total[5m])
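The P95 query in panel 2 leans on histogram_quantile, which estimates a quantile by linear interpolation inside the cumulative bucket where the target rank falls. To make that less opaque, here is a hedged TypeScript sketch of the same arithmetic over the bucket bounds defined earlier in this guide (an illustration of the principle, not Prometheus's exact implementation):

```typescript
// Cumulative histogram buckets as Prometheus exposes them:
// `le` is the inclusive upper bound, `count` is observations <= le.
interface Bucket {
  le: number; // use Infinity for the +Inf bucket
  count: number;
}

// Estimate a quantile the way histogram_quantile does: find the bucket
// containing the target rank, then interpolate linearly within it.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let lowerBound = 0;
  let lowerCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      // Rank falls in the +Inf bucket: return the last finite bound
      if (b.le === Infinity) return lowerBound;
      const fraction = (rank - lowerCount) / (b.count - lowerCount);
      return lowerBound + (b.le - lowerBound) * fraction;
    }
    lowerBound = b.le;
    lowerCount = b.count;
  }
  return lowerBound;
}
```

The practical consequence: resolution is bounded by your bucket layout, so choose buckets (as in metrics.ts above) that bracket the latencies you actually care about.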

Cloud-Native Monitoring

AWS CloudWatch

import { CloudWatch } from "@aws-sdk/client-cloudwatch";

const cloudwatch = new CloudWatch({ region: "us-east-1" });

async function publishMetrics(result: any) {
  await cloudwatch.putMetricData({
    Namespace: "NeuroLink/AI",
    MetricData: [
      {
        MetricName: "Requests",
        Value: 1,
        Unit: "Count",
        Dimensions: [
          { Name: "Provider", Value: result.provider },
          { Name: "Model", Value: result.model },
        ],
        Timestamp: new Date(),
      },
      {
        MetricName: "Latency",
        Value: result.latency,
        Unit: "Milliseconds",
        Dimensions: [{ Name: "Provider", Value: result.provider }],
        Timestamp: new Date(),
      },
      {
        MetricName: "TokensUsed",
        Value: result.usage.totalTokens,
        Unit: "Count",
        Dimensions: [
          { Name: "Provider", Value: result.provider },
          { Name: "Model", Value: result.model },
        ],
        Timestamp: new Date(),
      },
      {
        MetricName: "Cost",
        Value: result.cost,
        Unit: "None",
        Dimensions: [{ Name: "Provider", Value: result.provider }],
        Timestamp: new Date(),
      },
    ],
  });
}

const ai = new NeuroLink({
  providers: [
    /* ... */
  ],
  onSuccess: async (result) => {
    await publishMetrics(result);
  },
});

Azure Application Insights

// npm install applicationinsights
// The classic trackEvent/trackMetric/trackException API lives in the
// "applicationinsights" Node SDK via its TelemetryClient.
import * as appInsights from "applicationinsights";

appInsights
  .setup(process.env.APPLICATIONINSIGHTS_CONNECTION_STRING)
  .start();

const client = appInsights.defaultClient;

const ai = new NeuroLink({
  providers: [
    /* ... */
  ],
  onSuccess: (result) => {
    client.trackEvent({
      name: "AI_Request",
      properties: {
        provider: result.provider,
        model: result.model,
      },
      measurements: {
        latency: result.latency,
        tokensUsed: result.usage.totalTokens,
        cost: result.cost,
      },
    });

    client.trackMetric({
      name: "AI_Latency",
      value: result.latency,
      properties: { provider: result.provider },
    });
  },
  onError: (error, provider) => {
    client.trackException({
      exception: error,
      properties: { provider },
    });
  },
});

Google Cloud Operations

import { Logging } from "@google-cloud/logging";
import { MetricServiceClient } from "@google-cloud/monitoring";

const logging = new Logging();
const log = logging.log("neurolink-requests");

const metrics = new MetricServiceClient();

const ai = new NeuroLink({
  providers: [
    /* ... */
  ],
  onSuccess: async (result) => {
    // Log to Cloud Logging
    await log.write(
      log.entry(
        {
          resource: { type: "global" },
          severity: "INFO",
        },
        {
          event: "ai_request",
          provider: result.provider,
          model: result.model,
          tokens: result.usage.totalTokens,
          latency: result.latency,
          cost: result.cost,
        },
      ),
    );

    // Send to Cloud Monitoring
    await metrics.createTimeSeries({
      name: metrics.projectPath(process.env.GCP_PROJECT_ID!),
      timeSeries: [
        {
          metric: {
            type: "custom.googleapis.com/neurolink/latency",
            labels: { provider: result.provider },
          },
          resource: { type: "global" },
          points: [
            {
              // endTime.seconds must be an integer epoch value
              interval: { endTime: { seconds: Math.floor(Date.now() / 1000) } },
              value: { doubleValue: result.latency },
            },
          ],
        },
      ],
    });
  },
});

Alerting

Prometheus Alerts

# alerts.yml
groups:
  - name: neurolink_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighAIErrorRate
        expr: rate(ai_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High AI error rate detected"
          description: "Error rate is {{ $value }} errors/sec for {{ $labels.provider }}"

      # High latency
      - alert: HighAILatency
        expr: histogram_quantile(0.95, rate(ai_request_duration_seconds_bucket[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High AI latency detected"
          description: "P95 latency is {{ $value }}s for {{ $labels.provider }}"

      # High cost
      - alert: HighAICost
        expr: rate(ai_cost_total_usd[1h]) * 3600 > 100
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "High AI costs detected"
          description: "Hourly cost is ${{ $value }}"

      # Provider down
      - alert: AIProviderDown
        expr: up{job="neurolink-api"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "AI provider is down"
          description: "{{ $labels.instance }} has been down for 2 minutes"

Alertmanager Configuration

# alertmanager.yml
global:
  slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

route:
  group_by: ["alertname", "provider"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "slack-notifications"

receivers:
  - name: "slack-notifications"
    slack_configs:
      - channel: "#ai-alerts"
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"

  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_KEY"

Custom Monitoring Dashboards

Real-Time Cost Dashboard

class CostDashboard {
  private costs = new Map<string, number>();
  private hourlySnapshots: number[] = [];
  private lastSnapshotTotal = 0;

  recordCost(provider: string, cost: number) {
    const current = this.costs.get(provider) || 0;
    this.costs.set(provider, current + cost);
  }

  private totalCost(): number {
    return Array.from(this.costs.values()).reduce(
      (sum, cost) => sum + cost,
      0,
    );
  }

  takeHourlySnapshot() {
    // Record the cost accrued since the last snapshot, not the running
    // total, so the trend and projection are not cumulative double-counts
    const total = this.totalCost();
    this.hourlySnapshots.push(total - this.lastSnapshotTotal);
    this.lastSnapshotTotal = total;

    // Keep last 24 hours
    if (this.hourlySnapshots.length > 24) {
      this.hourlySnapshots.shift();
    }
  }

  getDashboardData() {
    return {
      totalToday: this.totalCost(),
      byProvider: Object.fromEntries(this.costs),
      hourlyTrend: this.hourlySnapshots,
      // Last 24 hourly deltas (roughly one day) extrapolated to 30 days
      projectedMonthly: this.hourlySnapshots.reduce((a, b) => a + b, 0) * 30,
    };
  }
}

// Usage
const dashboard = new CostDashboard();

const ai = new NeuroLink({
  providers: [
    /* ... */
  ],
  onSuccess: (result) => {
    dashboard.recordCost(result.provider, result.cost);
  },
});

// Snapshot every hour
setInterval(() => dashboard.takeHourlySnapshot(), 3600000);

// API endpoint
app.get("/dashboard/costs", (req, res) => {
  res.json(dashboard.getDashboardData());
});

Best Practices

1. ✅ Track All Key Metrics

// ✅ Good: Comprehensive tracking
onSuccess: (result) => {
  metrics.recordLatency(result.latency);
  metrics.recordTokens(result.usage.totalTokens);
  metrics.recordCost(result.cost);
  metrics.recordProvider(result.provider);
};

2. ✅ Set Up Alerts

# ✅ Good: Proactive alerting
- alert: HighCosts
  expr: rate(ai_cost_total_usd[1h]) * 3600 > 100

3. ✅ Use Histograms for Latency

// ✅ Good: Percentile tracking (name and help are required by prom-client)
const latencyHistogram = new Histogram({
  name: "ai_request_duration_seconds",
  help: "AI request duration in seconds",
  buckets: [0.1, 0.5, 1, 2, 5, 10, 30],
});

4. ✅ Monitor Error Rates

// ✅ Good: Error categorization
aiErrorsTotal.inc({
  provider,
  error_type: categorizeError(error),
});
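
The categorizeError helper is not defined elsewhere in this guide, so here is a minimal sketch. The match patterns are assumptions; adjust them to the error messages your providers actually return. The goal is a small fixed set of label values, since every distinct error_type creates a new Prometheus time series:

```typescript
// Hypothetical helper: map an error to a low-cardinality label value.
// The substring patterns below are assumptions, not provider guarantees.
function categorizeError(error: Error): string {
  const msg = error.message.toLowerCase();
  if (msg.includes("rate limit") || msg.includes("429")) return "rate_limit";
  if (msg.includes("timeout") || msg.includes("timed out")) return "timeout";
  if (msg.includes("unauthorized") || msg.includes("401")) return "auth";
  return "other";
}
```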

5. ✅ Dashboard for Stakeholders

// ✅ Good: Business-friendly dashboard
app.get("/dashboard/summary", (req, res) => {
  res.json({
    requestsToday: getRequestCount(),
    costToday: getTotalCost(),
    avgLatency: getAvgLatency(),
    errorRate: getErrorRate(),
  });
});


Need Help? Join our GitHub Discussions or open an issue.