AI in Production
Deploy AI applications at scale — latency optimization, cost management, monitoring, evaluation, and reliability patterns
Taking AI from Prototype to Production
Building an AI demo is easy. Deploying a reliable, cost-effective AI application that serves thousands of users is hard. This lesson covers the engineering practices that separate production AI systems from prototypes.
⚙️ Production Challenges:
Latency, cost, reliability, hallucinations, context limits, rate limits, security, evaluation, monitoring — all need engineering solutions.
Latency Optimization
🚀 Streaming Responses
Don't wait for the full response — stream tokens as they're generated. Reduces perceived latency from seconds to milliseconds for first token.
// Stream AI responses in a Next.js API route
import { OpenAI } from 'openai';
const openai = new OpenAI();
export async function POST(req) {
const { prompt } = await req.json();
const stream = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [{ role: 'user', content: prompt }],
stream: true,
});
// Convert OpenAI stream to ReadableStream
const encoder = new TextEncoder();
const readable = new ReadableStream({
async start(controller) {
for await (const chunk of stream) {
const text = chunk.choices[0]?.delta?.content || '';
controller.enqueue(encoder.encode(text));
}
controller.close();
},
});
return new Response(readable, {
headers: { 'Content-Type': 'text/plain; charset=utf-8' },
});
}
💾 Caching
Cache LLM responses for identical or similar queries. Semantic caching goes beyond exact match — cache responses for semantically similar questions.
// Simple LLM response cache
const cache = new Map();
async function cachedLLMCall(prompt, options = {}) {
const cacheKey = JSON.stringify({ prompt, model: options.model });
if (cache.has(cacheKey)) {
console.log('Cache hit!');
return cache.get(cacheKey);
}
const response = await openai.chat.completions.create({
model: options.model || 'gpt-4o-mini',
messages: [{ role: 'user', content: prompt }],
});
const result = response.choices[0].message.content;
cache.set(cacheKey, result);
// TTL: expire after 1 hour
setTimeout(() => cache.delete(cacheKey), 3600000);
return result;
}
⚡ Model Selection
Use the smallest model that meets quality requirements. Route simple queries to cheaper/faster models, complex ones to capable models.
// Smart model routing
function selectModel(query) {
const complexity = estimateComplexity(query);
if (complexity === 'simple') return 'gpt-4o-mini'; // Fast, cheap
if (complexity === 'medium') return 'gpt-4o'; // Balanced
return 'gpt-4'; // Most capable
}
function estimateComplexity(query) {
const wordCount = query.split(' ').length;
const hasCode = /```|function |const |class /.test(query);
const needsReasoning = /why|how|explain|compare|analyze/.test(query.toLowerCase());
if (wordCount < 20 && !hasCode && !needsReasoning) return 'simple';
if (hasCode || needsReasoning) return 'complex';
return 'medium';
}
Cost Management
📊 Token Budgeting
Set max_tokens limits. Track usage per user/feature. Set spending alerts. GPT-4 is 10-30x more expensive than GPT-4o-mini.
🔄 Prompt Compression
Summarize long context before injecting into prompts. Use system prompts efficiently — they're sent with every request.
📉 Batch Processing
For non-real-time tasks, use batch APIs (50% cheaper with OpenAI). Process overnight for reports, data enrichment.
🏠 Self-hosting
For high volume, run open models (Llama, Mistral) on your infrastructure. Higher upfront cost but zero per-token cost.
Monitoring & Observability
// Production AI monitoring
class AIMonitor {
constructor() {
this.metrics = [];
}
async trackCall(name, fn) {
const start = Date.now();
let success = true;
let tokensUsed = 0;
let error = null;
try {
const result = await fn();
tokensUsed = result.usage?.total_tokens || 0;
return result;
} catch (err) {
success = false;
error = err.message;
throw err;
} finally {
const metric = {
name,
duration: Date.now() - start,
success,
tokensUsed,
error,
timestamp: new Date().toISOString(),
};
this.metrics.push(metric);
await this.sendToAnalytics(metric);
// Alert on anomalies
if (metric.duration > 30000) {
console.warn(`⚠️ Slow AI call: ${name} took ${metric.duration}ms`);
}
if (!success) {
console.error(`❌ AI call failed: ${name} - ${error}`);
}
}
}
getAverageLatency(name) {
const relevant = this.metrics.filter(m => m.name === name && m.success);
return relevant.reduce((sum, m) => sum + m.duration, 0) / relevant.length;
}
getSuccessRate(name) {
const relevant = this.metrics.filter(m => m.name === name);
const successful = relevant.filter(m => m.success).length;
return (successful / relevant.length * 100).toFixed(1) + '%';
}
}
Handling Failures Gracefully
// Retry with exponential backoff + fallback
async function resilientAICall(prompt, options = {}) {
const models = ['gpt-4o', 'gpt-4o-mini']; // Primary + fallback
const maxRetries = 3;
for (const model of models) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await openai.chat.completions.create({
model,
messages: [{ role: 'user', content: prompt }],
timeout: 30000, // 30s timeout
...options,
});
} catch (err) {
const isRateLimit = err.status === 429;
const isServerError = err.status >= 500;
if (isRateLimit || isServerError) {
const delay = Math.pow(2, attempt) * 1000; // 2s, 4s, 8s
console.warn(`Retry ${attempt}/${maxRetries} for ${model} in ${delay}ms`);
await new Promise(r => setTimeout(r, delay));
continue;
}
throw err; // Non-retryable error
}
}
console.warn(`All retries failed for ${model}, trying fallback...`);
}
throw new Error('All AI models and retries exhausted');
}
Security Considerations
Prompt Injection
Users may try to override system prompts. Sanitize user input. Use separate system and user message roles. Never include secrets in prompts.
Data Leakage
Don't send PII or sensitive data to external APIs. Redact before sending. Use data processing agreements with AI providers.
Output Validation
Validate AI outputs before acting on them. Parse structured outputs with try/catch. Never execute AI-generated code without sandboxing.
🔑 Key Takeaways
- • Stream responses to reduce perceived latency to milliseconds
- • Route simple queries to cheaper models; reserve expensive models for complex tasks
- • Cache responses, compress prompts, and batch non-urgent requests to control costs
- • Implement retry logic with fallback models for reliability
- • Monitor latency, success rate, token usage, and costs per feature
- • Guard against prompt injection and never send PII to external APIs
📚 Continue Learning
Apply your AI knowledge with these hands-on tutorials:
LangChain Tutorial
Build AI apps with chains, agents, RAG, and tool calling using LangChain.js.
Vercel AI SDK
Build streaming AI chat applications with React and Next.js.
Prompt Engineering
Master the art of writing effective prompts for LLMs.
Docker & DevOps
Containerize and deploy your AI applications to production.