
Cost Optimization

Problem

AI API costs can accumulate quickly, especially with:

  • Large context windows
  • Frequent API calls
  • Expensive models (GPT-4, Claude Opus)
  • Inefficient prompt engineering

Solution

Implement cost optimization strategies:

  1. Use cheaper models when appropriate
  2. Minimize context size
  3. Cache responses
  4. Implement token counting
  5. Use model routing based on complexity

Code

import { NeuroLink } from "@juspay/neurolink";

type CostOptimizer = {
  maxTokens?: number;
  cacheResponses?: boolean;
  useSmartRouting?: boolean;
};

class CostEfficientNeuroLink {
  private neurolink: NeuroLink;
  private options: CostOptimizer;
  private cache = new Map<string, any>();

  // Approximate USD prices per 1K tokens (check current provider pricing)
  private tokenCosts = {
    "gpt-4": { input: 0.03, output: 0.06 },
    "gpt-3.5-turbo": { input: 0.0015, output: 0.002 },
    "claude-3-opus": { input: 0.015, output: 0.075 },
    "claude-3-sonnet": { input: 0.003, output: 0.015 },
    "claude-3-haiku": { input: 0.00025, output: 0.00125 },
    "gemini-pro": { input: 0.00025, output: 0.0005 },
  };

  constructor(options: CostOptimizer = {}) {
    this.neurolink = new NeuroLink();
    this.options = options; // Used as defaults in generateCostEffective()
  }

  /**
   * Route to a cheaper model for simple queries
   */
  selectModel(
    prompt: string,
    forceModel?: string,
  ): {
    provider: string;
    model: string;
  } {
    if (forceModel) {
      // Infer the provider from the model name rather than assuming OpenAI
      const provider = forceModel.startsWith("claude") ? "anthropic" : "openai";
      return { provider, model: forceModel };
    }

    // Simple keyword/length heuristics (case-insensitive)
    const text = prompt.toLowerCase();
    const isComplex =
      prompt.length > 500 ||
      text.includes("analyze") ||
      text.includes("complex") ||
      text.includes("reasoning");

    const requiresCreativity =
      text.includes("creative") ||
      text.includes("story") ||
      text.includes("poem");

    if (isComplex && requiresCreativity) {
      return { provider: "openai", model: "gpt-4" };
    }

    if (isComplex) {
      return { provider: "anthropic", model: "claude-3-sonnet-20240229" };
    }

    // Simple queries → cheapest model
    return { provider: "anthropic", model: "claude-3-haiku-20240307" };
  }

  /**
   * Generate a cache key from the prompt and model
   */
  private getCacheKey(prompt: string, model: string): string {
    const normalized = prompt.trim().toLowerCase();
    return `${model}:${normalized}`;
  }

  /**
   * Estimate cost (USD) for a request
   */
  estimateCost(
    inputTokens: number,
    outputTokens: number,
    model: string,
  ): number {
    // Dated model IDs like "claude-3-haiku-20240307" match their
    // price entry by prefix
    const key = Object.keys(this.tokenCosts).find((k) =>
      model.startsWith(k),
    ) as keyof typeof this.tokenCosts | undefined;
    const costs = key ? this.tokenCosts[key] : undefined;
    if (!costs) return 0;

    return (
      (inputTokens / 1000) * costs.input + (outputTokens / 1000) * costs.output
    );
  }

  /**
   * Generate with cost optimization
   */
  async generateCostEffective(
    prompt: string,
    options: {
      useCache?: boolean;
      maxTokens?: number;
      forceModel?: string;
    } = {},
  ) {
    const { provider, model } = this.selectModel(prompt, options.forceModel);
    const cacheKey = this.getCacheKey(prompt, model);
    const useCache = options.useCache ?? this.options.cacheResponses ?? true;

    // Check cache first
    if (useCache && this.cache.has(cacheKey)) {
      console.log("💰 Using cached response (cost: $0.00)");
      return this.cache.get(cacheKey);
    }

    // Truncate very long prompts
    const maxPromptLength = 2000;
    const truncatedPrompt =
      prompt.length > maxPromptLength
        ? prompt.slice(0, maxPromptLength) + "..."
        : prompt;

    const result = await this.neurolink.generate({
      input: { text: truncatedPrompt },
      provider,
      model,
      maxTokens: options.maxTokens ?? this.options.maxTokens ?? 500, // Limit output tokens
    });

    // Estimate and log cost
    const inputTokens = this.estimateTokens(truncatedPrompt);
    const outputTokens = this.estimateTokens(result.content);
    const cost = this.estimateCost(inputTokens, outputTokens, model);

    console.log(`💰 Cost estimate: $${cost.toFixed(4)} (${model})`);

    // Cache the response
    if (useCache) {
      this.cache.set(cacheKey, result);
    }

    return result;
  }

  /**
   * Estimate token count (rough approximation)
   */
  private estimateTokens(text: string): number {
    // Rough estimate: ~4 characters per token for English text
    return Math.ceil(text.length / 4);
  }

  /**
   * Batch similar requests so repeated prompts hit the cache
   */
  async batchGenerate(prompts: string[]) {
    const results = [];
    let totalCost = 0;

    for (const prompt of prompts) {
      const result = await this.generateCostEffective(prompt, {
        useCache: true,
      });
      results.push(result);

      // Track cumulative cost with a flat-rate approximation
      // ($0.002 per 1K output tokens)
      const cost = (this.estimateTokens(result.content) * 0.002) / 1000;
      totalCost += cost;
    }

    console.log(`\n💰 Total batch cost: $${totalCost.toFixed(4)}`);
    return results;
  }

  /**
   * Clear cache to free memory
   */
  clearCache() {
    this.cache.clear();
  }

  /**
   * Get cache statistics
   */
  getCacheStats() {
    return {
      entries: this.cache.size,
      estimatedSavings: this.cache.size * 0.01, // Rough estimate: ~$0.01 per cached entry
    };
  }
}

// Usage Example
async function main() {
  const optimizer = new CostEfficientNeuroLink();

  // Simple query → uses cheap model (Haiku)
  const simple = await optimizer.generateCostEffective("What is 2+2?", {
    useCache: true,
  });
  console.log("Simple:", simple.content);

  // Complex query → uses better model (Sonnet)
  const complex = await optimizer.generateCostEffective(
    "Analyze the economic implications of quantum computing on financial markets",
    { maxTokens: 300 },
  );
  console.log("Complex:", complex.content);

  // Batch processing with caching
  const prompts = [
    "What is TypeScript?",
    "What is TypeScript?", // Cached!
    "Explain async/await",
  ];
  await optimizer.batchGenerate(prompts);

  // Check savings
  const stats = optimizer.getCacheStats();
  console.log(
    `\n📊 Cache stats: ${stats.entries} entries, ~$${stats.estimatedSavings.toFixed(2)} saved`,
  );
}

main();

Explanation

1. Smart Model Routing

The selectModel() method analyzes the prompt to choose the most cost-effective model (example calls follow the list):

  • Simple queries → Claude Haiku ($0.00025/1K input tokens)
  • Complex queries → Claude Sonnet ($0.003/1K input tokens)
  • Complex + Creative → GPT-4 ($0.03/1K input tokens)
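
For example, the heuristics above route prompts like these:

const optimizer = new CostEfficientNeuroLink();

console.log(optimizer.selectModel("What is 2+2?"));
// → { provider: "anthropic", model: "claude-3-haiku-20240307" }

console.log(optimizer.selectModel("Analyze the impact of rising interest rates"));
// → { provider: "anthropic", model: "claude-3-sonnet-20240229" }

console.log(optimizer.selectModel("Write a creative story about a complex moral dilemma"));
// → { provider: "openai", model: "gpt-4" }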

2. Response Caching

Identical prompts return cached responses at zero cost (a TTL-expiring variant is sketched after this list). Perfect for:

  • Repeated queries
  • Development/testing
  • Common questions in production
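
In production you may also want cached entries to expire. A minimal TTL sketch, assuming lazy expiry on read is acceptable (the TtlCache name and ttlMs parameter are illustrative, not part of NeuroLink):

class TtlCache<V> {
  private store = new Map<string, { value: V; expires: number }>();

  constructor(private ttlMs: number = 60 * 60 * 1000) {} // Default: 1 hour

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expires) {
      this.store.delete(key); // Evict stale entries lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V) {
    this.store.set(key, { value, expires: Date.now() + this.ttlMs });
  }
}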

3. Token Limiting

Set maxTokens to prevent unexpectedly long (expensive) responses; one way to encode these budgets is sketched after the list:

  • Summaries: 200-300 tokens
  • Explanations: 500-1000 tokens
  • Creative content: 1000-2000 tokens
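
A small lookup table keeps these budgets in one place; a sketch (TOKEN_BUDGETS is an illustrative name, not a NeuroLink API):

const TOKEN_BUDGETS: Record<string, number> = {
  summary: 300,
  explanation: 1000,
  creative: 2000,
};

const optimizer = new CostEfficientNeuroLink();
await optimizer.generateCostEffective("Summarize this article: ...", {
  maxTokens: TOKEN_BUDGETS["summary"],
});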

4. Cost Tracking

Estimate costs per request to monitor spending. For example, at Claude Sonnet rates:

Input:  250 tokens × $0.003/1K = $0.00075
Output: 500 tokens × $0.015/1K = $0.00750
Total:  $0.00825
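
The estimateCost() helper from the recipe reproduces this arithmetic (the prefix lookup maps the dated model ID to the "claude-3-sonnet" price entry):

const optimizer = new CostEfficientNeuroLink();

// 250 input tokens and 500 output tokens at Claude Sonnet rates
const cost = optimizer.estimateCost(250, 500, "claude-3-sonnet-20240229");
console.log(cost.toFixed(5)); // "0.00825"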

5. Prompt Truncation

Very long prompts increase costs without adding value. Truncate to essential context.
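
Plain slicing (as in the recipe) can cut off the most recent, and often most relevant, part of the prompt. A sketch of an alternative that keeps both ends (truncateMiddle is an illustrative helper, not part of NeuroLink):

// Keep the head and tail of an over-long prompt, dropping the middle
function truncateMiddle(prompt: string, maxLength = 2000): string {
  if (prompt.length <= maxLength) return prompt;
  const half = Math.floor((maxLength - 5) / 2); // Reserve 5 chars for "\n...\n"
  return `${prompt.slice(0, half)}\n...\n${prompt.slice(-half)}`;
}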

Variations

Context Window Compression

Compress conversation history to reduce tokens:

function compressContext(messages: Array<{ role: string; content: string }>) {
  // Keep the system message and the last N messages verbatim
  const system = messages.find((m) => m.role === "system");
  const recent = messages.slice(-5); // Last 5 messages

  // Replace older messages with a one-line summary
  const older = messages.slice(1, -5);
  const summary =
    older.length > 0
      ? {
          role: "assistant",
          content: `[Previous conversation: ${older.length} messages covering ${older
            .map((m) => m.content.slice(0, 20))
            .join(", ")}...]`,
        }
      : null; // Omit the summary message when there is nothing to summarize

  return [system, summary, ...recent].filter(
    (m): m is { role: string; content: string } => Boolean(m),
  );
}
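
For example, the compressed history can be flattened into a single text prompt for the text-based generate() call used above (the conversation data is illustrative):

const conversation = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain generics in TypeScript." },
  { role: "assistant", content: "Generics let you parameterize types..." },
  { role: "user", content: "How do constraints work?" },
  { role: "assistant", content: "Use `extends` to bound a type parameter..." },
  { role: "user", content: "Show an example with keyof." },
  { role: "assistant", content: "Sure: function get<T, K extends keyof T>..." },
  { role: "user", content: "And what about edge cases?" },
];

const promptText = compressContext(conversation)
  .map((m) => `${m.role}: ${m.content}`)
  .join("\n");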

Model Tier System

Explicitly define cost tiers:

enum ModelTier {
  ULTRA_CHEAP = "claude-3-haiku-20240307", // $0.00025/1K input
  CHEAP = "gpt-3.5-turbo", // $0.0015/1K input
  BALANCED = "claude-3-sonnet-20240229", // $0.003/1K input
  POWERFUL = "gpt-4", // $0.03/1K input
}

const neurolink = new NeuroLink();

async function generateWithTier(prompt: string, tier: ModelTier) {
  return neurolink.generate({
    input: { text: prompt },
    model: tier,
  });
}
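
Call sites then choose a tier explicitly, trading quality for cost per request:

// Cheap tier for routine classification, powerful tier for hard reasoning
const quick = await generateWithTier("Classify this support ticket", ModelTier.CHEAP);
const deep = await generateWithTier(
  "Walk through the proof step by step",
  ModelTier.POWERFUL,
);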

Budget Enforcement

Set spending limits:

class BudgetEnforcer {
  private spentToday = 0;
  private dailyLimit = 10.0; // $10/day
  private day = new Date().toDateString();

  async generate(neurolink: NeuroLink, prompt: string) {
    // Reset the counter when the day rolls over
    const today = new Date().toDateString();
    if (today !== this.day) {
      this.day = today;
      this.spentToday = 0;
    }

    const estimatedCost = 0.01; // Rough per-request estimate

    if (this.spentToday + estimatedCost > this.dailyLimit) {
      throw new Error(
        `Budget exceeded: $${this.spentToday.toFixed(2)}/$${this.dailyLimit}`,
      );
    }

    const result = await neurolink.generate({ input: { text: prompt } });
    this.spentToday += estimatedCost;

    return result;
  }
}
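
Usage, assuming the enforcer wraps every call site (the fixed $0.01 per-request estimate is a placeholder; swap in estimateCost() for real tracking):

const enforcer = new BudgetEnforcer();
const neurolink = new NeuroLink();

try {
  const result = await enforcer.generate(neurolink, "Draft a product update email");
  console.log(result.content);
} catch (err) {
  console.error(err); // Thrown once the $10 daily limit would be exceeded
}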

Cost Comparison

Task Type          Best Model       Cost (per 1K input tokens)   Use Case
Simple Q&A         Claude Haiku     $0.00025                     FAQs, basic queries
Data extraction    GPT-3.5 Turbo    $0.0015                      JSON parsing, classification
Analysis           Claude Sonnet    $0.003                       Summaries, explanations
Deep reasoning     GPT-4            $0.03                        Complex problem-solving
Creative writing   GPT-4            $0.03                        Stories, marketing copy
