AI in Production

Taking AI from Prototype to Production

Building an AI demo is easy. Deploying a reliable, cost-effective AI application that serves thousands of users is hard. This lesson covers the engineering practices that separate production AI systems from prototypes.

⚙️ Production Challenges:

Latency, cost, reliability, hallucinations, context limits, rate limits, security, evaluation, monitoring — all need engineering solutions.

Latency Optimization

🚀 Streaming Responses

Don't wait for the full response — stream tokens as they're generated. Reduces perceived latency from seconds to milliseconds for first token.

// Stream AI responses in a Next.js API route
import { OpenAI } from 'openai';

const openai = new OpenAI();

export async function POST(req) {
  const { prompt } = await req.json();

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  });

  // Convert OpenAI stream to ReadableStream
  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        const text = chunk.choices[0]?.delta?.content || '';
        controller.enqueue(encoder.encode(text));
      }
      controller.close();
    },
  });

  return new Response(readable, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}

💾 Caching

Cache LLM responses for identical or similar queries. Semantic caching goes beyond exact match — cache responses for semantically similar questions.

// Simple LLM response cache
const cache = new Map();

async function cachedLLMCall(prompt, options = {}) {
  const cacheKey = JSON.stringify({ prompt, model: options.model });

  if (cache.has(cacheKey)) {
    console.log('Cache hit!');
    return cache.get(cacheKey);
  }

  const response = await openai.chat.completions.create({
    model: options.model || 'gpt-4o-mini',
    messages: [{ role: 'user', content: prompt }],
  });

  const result = response.choices[0].message.content;
  cache.set(cacheKey, result);

  // TTL: expire after 1 hour
  setTimeout(() => cache.delete(cacheKey), 3600000);

  return result;
}

⚡ Model Selection

Use the smallest model that meets quality requirements. Route simple queries to cheaper/faster models, complex ones to capable models.

// Smart model routing
function selectModel(query) {
  const complexity = estimateComplexity(query);

  if (complexity === 'simple') return 'gpt-4o-mini';    // Fast, cheap
  if (complexity === 'medium') return 'gpt-4o';          // Balanced
  return 'gpt-4';                                         // Most capable
}

function estimateComplexity(query) {
  const wordCount = query.split(' ').length;
  const hasCode = /```|function |const |class /.test(query);
  const needsReasoning = /why|how|explain|compare|analyze/.test(query.toLowerCase());

  if (wordCount < 20 && !hasCode && !needsReasoning) return 'simple';
  if (hasCode || needsReasoning) return 'complex';
  return 'medium';
}

Cost Management

📊 Token Budgeting

Set max_tokens limits. Track usage per user/feature. Set spending alerts. GPT-4 is 10-30x more expensive than GPT-4o-mini.

🔄 Prompt Compression

Summarize long context before injecting into prompts. Use system prompts efficiently — they're sent with every request.

📉 Batch Processing

For non-real-time tasks, use batch APIs (50% cheaper with OpenAI). Process overnight for reports, data enrichment.

🏠 Self-hosting

For high volume, run open models (Llama, Mistral) on your infrastructure. Higher upfront cost but zero per-token cost.

Monitoring & Observability

// Production AI monitoring
class AIMonitor {
  constructor() {
    this.metrics = [];
  }

  async trackCall(name, fn) {
    const start = Date.now();
    let success = true;
    let tokensUsed = 0;
    let error = null;

    try {
      const result = await fn();
      tokensUsed = result.usage?.total_tokens || 0;
      return result;
    } catch (err) {
      success = false;
      error = err.message;
      throw err;
    } finally {
      const metric = {
        name,
        duration: Date.now() - start,
        success,
        tokensUsed,
        error,
        timestamp: new Date().toISOString(),
      };

      this.metrics.push(metric);
      await this.sendToAnalytics(metric);

      // Alert on anomalies
      if (metric.duration > 30000) {
        console.warn(`⚠️ Slow AI call: ${name} took ${metric.duration}ms`);
      }
      if (!success) {
        console.error(`❌ AI call failed: ${name} - ${error}`);
      }
    }
  }

  getAverageLatency(name) {
    const relevant = this.metrics.filter(m => m.name === name && m.success);
    return relevant.reduce((sum, m) => sum + m.duration, 0) / relevant.length;
  }

  getSuccessRate(name) {
    const relevant = this.metrics.filter(m => m.name === name);
    const successful = relevant.filter(m => m.success).length;
    return (successful / relevant.length * 100).toFixed(1) + '%';
  }
}

Handling Failures Gracefully

// Retry with exponential backoff + fallback
async function resilientAICall(prompt, options = {}) {
  const models = ['gpt-4o', 'gpt-4o-mini']; // Primary + fallback
  const maxRetries = 3;

  for (const model of models) {
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
      try {
        return await openai.chat.completions.create({
          model,
          messages: [{ role: 'user', content: prompt }],
          timeout: 30000, // 30s timeout
          ...options,
        });
      } catch (err) {
        const isRateLimit = err.status === 429;
        const isServerError = err.status >= 500;

        if (isRateLimit || isServerError) {
          const delay = Math.pow(2, attempt) * 1000; // 2s, 4s, 8s
          console.warn(`Retry ${attempt}/${maxRetries} for ${model} in ${delay}ms`);
          await new Promise(r => setTimeout(r, delay));
          continue;
        }
        throw err; // Non-retryable error
      }
    }
    console.warn(`All retries failed for ${model}, trying fallback...`);
  }

  throw new Error('All AI models and retries exhausted');
}

Security Considerations

Prompt Injection

Users may try to override system prompts. Sanitize user input. Use separate system and user message roles. Never include secrets in prompts.

Data Leakage

Don't send PII or sensitive data to external APIs. Redact before sending. Use data processing agreements with AI providers.

Output Validation

Validate AI outputs before acting on them. Parse structured outputs with try/catch. Never execute AI-generated code without sandboxing.

🔑 Key Takeaways

• Stream responses to reduce perceived latency to milliseconds
• Route simple queries to cheaper models; reserve expensive models for complex tasks
• Cache responses, compress prompts, and batch non-urgent requests to control costs
• Implement retry logic with fallback models for reliability
• Monitor latency, success rate, token usage, and costs per feature
• Guard against prompt injection and never send PII to external APIs

📚 Continue Learning

Apply your AI knowledge with these hands-on tutorials: