Deployment Patterns: Serverless, Edge, and Containers
How to deploy AI systems in production. Compare serverless, edge, container, and self-hosted options.
TL;DR
Deploying AI is different from traditional apps. Models are large, inference can be slow, and costs scale unpredictably. Your deployment choice affects latency, cost, and operational overhead. Cloud APIs (OpenAI, Anthropic) are fastest to ship but costly at scale. Serverless (Lambda, Vercel) works for bursty traffic but has cold starts. Containers (ECS, Kubernetes) give you control and better economics at volume. Edge deployment reduces latency for global users. Self-hosting maximizes cost efficiency but requires ML ops expertise. Most production systems use a hybrid approach: cloud APIs for complex tasks, self-hosted models for high-volume simple ones.
Why Deployment Matters for AI
Shipping AI to production isn't like deploying a CRUD app. Your model might be gigabytes in size. Inference can take seconds, not milliseconds. Costs can balloon from dollars to thousands overnight if traffic spikes.
Your deployment pattern determines:
- Latency: Can users wait 2 seconds for a response, or do you need sub-200ms?
- Cost: Are you paying $0.01 per request or $0.0001?
- Reliability: What happens when your provider has an outage?
- Control: Can you optimize the model, change providers, or audit what's happening?
There's no single best answer. A chatbot handling 100 requests/day has different needs than a translation service processing millions. Let's break down your options.
Deployment Pattern 1: Cloud APIs
What it is: Call OpenAI, Anthropic, Google, or other hosted AI APIs directly. No infrastructure to manage.
import anthropic

client = anthropic.Anthropic(api_key="your-key")
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": "Summarize this article"}]
)
Pros:
- Zero infrastructure work. Ship in minutes.
- Always get the latest models without redeployment.
- Handles scale automatically (up to rate limits).
- Top-tier model quality (GPT-4, Claude, Gemini).
Cons:
- Expensive at scale ($0.003-$0.06 per 1K tokens adds up fast).
- Vendor lock-in and rate limits.
- No control over model, data handling, or latency.
- Dependent on third-party uptime.
When to use:
- Prototyping or validating product-market fit.
- Low to medium volume (<10M tokens/month).
- Tasks requiring frontier model capabilities.
- When you need to ship fast without ML expertise.
Reality check: Most startups begin here. Smart ones plan their exit strategy. Track your token usage from day one. When you hit $5K/month in API costs, start exploring alternatives.
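A minimal sketch of day-one usage tracking, building on the Anthropic example above. The per-token prices are illustrative assumptions, not quoted rates; substitute your provider's current pricing.
# Track token usage and estimated spend from day one.
# Prices below are illustrative assumptions -- use your provider's current rates.
import anthropic

PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000    # assumed $/input token
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000  # assumed $/output token

client = anthropic.Anthropic(api_key="your-key")

def tracked_call(prompt: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    usage = response.usage  # the Messages API reports input/output token counts
    cost = (usage.input_tokens * PRICE_PER_INPUT_TOKEN
            + usage.output_tokens * PRICE_PER_OUTPUT_TOKEN)
    # Printed here for brevity; in production, ship this to your metrics system.
    print(f"tokens_in={usage.input_tokens} tokens_out={usage.output_tokens} cost=${cost:.5f}")
    return response.content[0].text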
Deployment Pattern 2: Serverless Functions
What it is: Deploy your AI logic to AWS Lambda, Google Cloud Functions, Vercel, or Cloudflare Workers. Infrastructure scales automatically, you pay per request.
// Vercel Edge Function (Next.js route handler)
export const runtime = 'edge';

export async function POST(request) {
  const { prompt } = await request.json();
  const response = await fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'x-api-key': process.env.ANTHROPIC_KEY,
      'anthropic-version': '2023-06-01',
      'content-type': 'application/json',
    },
    body: JSON.stringify({
      model: 'claude-3-haiku-20240307',
      max_tokens: 1024,
      messages: [{ role: 'user', content: prompt }],
    }),
  });
  return Response.json(await response.json());
}
Pros:
- Zero server management. Auto-scales from 0 to 1000s of requests.
- Pay only for execution time (no idle costs).
- Perfect for bursty or unpredictable traffic.
- Fast deployment and iteration.
Cons:
- Cold starts (1-5 second delays when scaling from zero).
- Execution time limits (Lambda: 15 min max, Vercel: 60s on hobby tier).
- Memory constraints make running large models impractical.
- Still calling external APIs (same cost issues).
When to use:
- Orchestrating calls to cloud AI APIs.
- Lightweight inference with small models (<500MB).
- Applications with sporadic traffic patterns.
- Rapid prototyping and iteration.
Reality check: Serverless is great for the API layer but terrible for running large models yourself. Use it to route requests, handle authentication, and cache results. Run the actual AI elsewhere.
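A rough sketch of that routing-and-caching layer as a Python Lambda-style handler. call_model_api is a placeholder for the cloud API call from Pattern 1, and a real deployment would back the cache with Redis or DynamoDB rather than process memory.
# Hypothetical AWS Lambda handler: the serverless layer only routes, validates,
# and caches; the model call itself goes to an external API.
# A module-level dict only survives warm invocations -- production would use
# Redis, DynamoDB, or similar shared storage.
import hashlib
import json

_cache: dict[str, str] = {}

def call_model_api(prompt: str) -> str:
    # Placeholder for the cloud API call shown in Pattern 1.
    return f"(model response for: {prompt})"

def handler(event, context):
    body = json.loads(event["body"])
    prompt = body["prompt"]
    key = hashlib.sha256(prompt.encode()).hexdigest()

    if key in _cache:  # cache hit: skip the paid API call entirely
        return {"statusCode": 200,
                "body": json.dumps({"answer": _cache[key], "cached": True})}

    answer = call_model_api(prompt)
    _cache[key] = answer
    return {"statusCode": 200,
            "body": json.dumps({"answer": answer, "cached": False})}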
Deployment Pattern 3: Containers
What it is: Package your model and dependencies in Docker containers. Deploy to AWS ECS, Google Cloud Run, or Kubernetes. You control the infrastructure.
# Container image for a FastAPI inference service (see the main.py sketch below)
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
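The CMD above expects a main.py that exposes app. A minimal sketch, assuming FastAPI in front of a locally hosted model; generate_locally is a placeholder for whatever inference runtime you actually use.
# main.py -- minimal sketch of the service the Dockerfile's CMD expects.
# generate_locally stands in for your inference runtime
# (vLLM, llama.cpp bindings, transformers, an ONNX runtime, ...).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

def generate_locally(prompt: str, max_tokens: int) -> str:
    # Placeholder: run your local model here.
    return f"(completion for: {prompt[:40]})"

@app.post("/v1/completions")
def complete(req: CompletionRequest):
    return {"completion": generate_locally(req.prompt, req.max_tokens)}

@app.get("/healthz")
def health():
    # Health endpoint for load balancer and orchestrator checks.
    return {"status": "ok"}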
Pros:
- Full control over model, environment, and dependencies.
- Can run models yourself (often 10-100x cheaper per request than cloud APIs at volume).
- Predictable scaling with auto-scaling groups.
- Easy local development and testing.
- Works with any model (open-source or fine-tuned).
Cons:
- You manage infrastructure (scaling, health checks, deployments).
- Baseline costs even with zero traffic (containers must stay warm).
- Requires DevOps knowledge (Docker, orchestration, monitoring).
- GPU instances are expensive ($1-10/hour per instance).
When to use:
- Medium to high volume (>10M requests/month).
- Self-hosting open-source models (Llama, Mistral, Stable Diffusion).
- When you need consistent low latency (no cold starts).
- Applications requiring custom model configurations.
Example architecture:
Load Balancer → ECS Cluster (3-10 containers)
├─ Container 1: FastAPI + Llama 3.1 8B
├─ Container 2: FastAPI + Llama 3.1 8B
└─ Container 3: FastAPI + Llama 3.1 8B
Reality check: Containers hit the sweet spot for production AI. You get control without managing bare metal. Start with Cloud Run or ECS Fargate (managed containers) before diving into Kubernetes complexity.
Deployment Pattern 4: Edge Deployment
What it is: Run AI models at edge locations close to users. Cloudflare Workers AI, Vercel Edge, or self-managed edge nodes.
// Cloudflare Workers AI
export default {
async fetch(request, env) {
const response = await env.AI.run('@cf/meta/llama-2-7b-chat-int8', {
prompt: "What is the capital of France?"
});
return new Response(JSON.stringify(response));
}
};
Pros:
- Ultra-low latency (models run near users, not in one region).
- Reduced egress costs (data doesn't travel far).
- Better user experience for global audiences.
- Built-in CDN and DDoS protection.
Cons:
- Limited model size (must fit in edge environment, typically <1GB).
- Fewer model options (mostly quantized or distilled models).
- Higher per-request costs than centralized hosting.
- Debugging distributed systems is harder.
When to use:
- Global user base requiring consistent low latency.
- Simple inference tasks (classification, small completions).
- Real-time applications (chat, autocomplete, moderation).
- When you can use smaller, optimized models.
Reality check: Edge is amazing for user-facing features where 100ms matters. But don't over-engineer. If 90% of users are in one region, a well-placed container cluster beats edge complexity.
Deployment Pattern 5: Self-Hosted on Dedicated Hardware
What it is: Run models on your own servers or cloud instances with GPUs. Full control, maximum cost efficiency at scale.
Pros:
- Lowest cost per request at high volume (amortize hardware over millions of requests).
- Complete control over data, model, and infrastructure.
- No vendor lock-in or rate limits.
- Can optimize everything (model quantization, batching, caching).
Cons:
- High upfront investment (GPUs, servers, or reserved instances).
- Requires ML ops expertise (model optimization, monitoring, scaling).
- You're responsible for uptime, security, and compliance.
- Slower iteration vs. managed services.
When to use:
- Very high volume (>100M requests/month).
- Strict data privacy or compliance requirements.
- Cost optimization is critical (you've done the math).
- You have ML/DevOps team capacity.
Economics example:
- Cloud API: $0.01 per request × 10M requests = $100K/month
- Self-hosted: 4× A100 GPUs ($10K/month) + engineering = $15K/month
- Break-even: ~1-2M requests/month
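A quick sanity check of that break-even point, using the illustrative numbers above:
# Break-even sketch using the illustrative numbers above.
api_cost_per_request = 0.01   # $/request via a cloud API
self_hosted_monthly = 15_000  # $/month: ~$10K in GPUs plus engineering overhead

break_even = self_hosted_monthly / api_cost_per_request
print(f"Break-even at ~{break_even:,.0f} requests/month")  # ~1.5M

requests = 10_000_000
print(f"At {requests:,} req/month: API ${requests * api_cost_per_request:,.0f} vs self-hosted ${self_hosted_monthly:,}")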
Reality check: Self-hosting makes sense when you've validated product-market fit and have predictable volume. Don't self-host to save $500/month—the engineering time costs more.
Decision Framework: Choosing Your Deployment Pattern
Ask yourself these questions:
1. What's your traffic volume?
- <100K requests/month → Cloud APIs
- 100K-10M → Serverless + Cloud APIs
- 10M-100M → Containers
- >100M → Self-hosted or hybrid
2. What's your latency requirement?
- <100ms → Edge deployment
- <500ms → Containers in user regions
- <2s → Serverless or cloud APIs acceptable
3. What's your budget?
- Prototype/MVP → Cloud APIs (ship fast, optimize later)
- Cost-sensitive → Containers with open-source models
- Volume play → Self-hosted at scale
4. What's your team's expertise?
- No ML/DevOps → Cloud APIs
- DevOps team → Containers
- ML ops team → Self-hosted
5. What model do you need?
- Frontier models (GPT-4, Claude Opus) → Cloud APIs (no choice)
- Open-source works → Containers or self-hosted
- Simple tasks → Edge with small models
Hybrid Approach: The Real-World Pattern
Most production systems use multiple deployment patterns:
User Request
├─ Simple/high-volume (80% traffic) → Self-hosted Llama 3.1 on containers
├─ Complex reasoning (15% traffic) → Claude API
├─ Real-time autocomplete (5% traffic) → Edge-deployed small model
└─ Fallback → Cloud API if self-hosted is down
Why hybrid works:
- Cloud APIs for complex tasks that need frontier models.
- Self-hosted for high-volume, cost-sensitive workflows.
- Edge for latency-critical user interactions.
- Automatic cost optimization (route based on request complexity).
Implementation tip: Build abstraction layers. Don't hard-code provider calls everywhere. Create a service that routes requests based on complexity, cost, and availability.
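A minimal sketch of such a routing layer. The backend names, per-token costs, complexity heuristic, and threshold are all illustrative assumptions, not a prescribed design.
# Sketch of an abstraction layer that routes by task complexity, with a fallback.
# Backend names, costs, the heuristic, and thresholds are illustrative only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Backend:
    name: str
    cost_per_1k_tokens: float
    call: Callable[[str], str]

def self_hosted_llama(prompt: str) -> str:
    return f"(llama answer to: {prompt[:30]})"   # placeholder for the container fleet

def claude_api(prompt: str) -> str:
    return f"(claude answer to: {prompt[:30]})"  # placeholder for the cloud API call

BACKENDS = {
    "cheap": Backend("self-hosted-llama-3.1-8b", 0.0002, self_hosted_llama),
    "frontier": Backend("claude-api", 0.015, claude_api),
}

def estimate_complexity(prompt: str) -> float:
    # Crude heuristic: longer prompts with reasoning keywords score higher.
    score = min(len(prompt) / 2000, 1.0)
    if any(w in prompt.lower() for w in ("analyze", "plan", "prove", "compare")):
        score += 0.5
    return score

def route(prompt: str) -> str:
    backend = BACKENDS["frontier"] if estimate_complexity(prompt) > 0.6 else BACKENDS["cheap"]
    try:
        return backend.call(prompt)
    except Exception:
        # Availability fallback: if the primary backend fails, try the other one.
        other = BACKENDS["cheap"] if backend is BACKENDS["frontier"] else BACKENDS["frontier"]
        return other.call(prompt)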
Scaling Strategies Across Patterns
Auto-scaling:
- Serverless: Automatic, configure concurrency limits to control costs.
- Containers: Set CPU/memory thresholds, scale horizontally (more containers).
- Self-hosted: Pre-warm instances during known traffic spikes, use predictive scaling.
Caching:
- Cache responses for identical prompts (save 30-60% of API costs).
- Semantic caching: Store embeddings, return similar cached responses.
- TTL based on content freshness requirements.
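A hedged sketch of semantic caching with a TTL. embed() is a toy placeholder for a real embedding model, and the similarity threshold and TTL are assumptions you would tune.
# Semantic cache sketch: reuse a cached answer when a new prompt is close enough
# to a previous one. Swap embed() for a real embedding model; threshold and TTL
# are illustrative.
import math
import time

def embed(text: str) -> list[float]:
    # Toy placeholder embedding: letter-frequency histogram.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

_entries: list[tuple[list[float], str, float]] = []  # (embedding, response, stored_at)
TTL_SECONDS = 3600  # tie this to how fresh the content needs to be

def lookup(prompt: str, threshold: float = 0.95):
    query = embed(prompt)
    now = time.time()
    for vec, response, stored_at in _entries:
        if now - stored_at < TTL_SECONDS and cosine(query, vec) >= threshold:
            return response
    return None

def store(prompt: str, response: str) -> None:
    _entries.append((embed(prompt), response, time.time()))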
Load balancing:
- Round-robin for stateless inference.
- Least-connections for long-running requests.
- Sticky sessions if maintaining conversation state.
Batching:
- Self-hosted: Batch multiple requests for GPU efficiency (5-10x throughput).
- Trade-off: Slight latency increase for massive cost reduction.
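A sketch of micro-batching with asyncio. run_batch is a placeholder for your actual batched GPU inference call, and the batch size and wait window are illustrative.
# Micro-batching sketch for a self-hosted model server: hold requests briefly,
# then run them through the model as one batch. run_batch is a placeholder.
import asyncio

MAX_BATCH_SIZE = 16
MAX_WAIT_SECONDS = 0.02  # accept ~20 ms extra latency to fill a batch

_queue: asyncio.Queue = asyncio.Queue()

def run_batch(prompts: list[str]) -> list[str]:
    # Placeholder: one batched forward pass on the GPU.
    return [f"(completion for: {p[:30]})" for p in prompts]

async def submit(prompt: str) -> str:
    future = asyncio.get_running_loop().create_future()
    await _queue.put((prompt, future))
    return await future

async def batcher():
    while True:
        prompt, future = await _queue.get()
        batch = [(prompt, future)]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(_queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        results = run_batch([p for p, _ in batch])
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)
Start batcher() as a background task at server startup (asyncio.create_task), then have each request handler await submit(prompt).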
Monitoring and Observability
Track these metrics regardless of deployment pattern:
Performance:
- Latency (p50, p95, p99)
- Throughput (requests/second)
- Error rates and types
Cost:
- Cost per request
- Token usage (for LLMs)
- Infrastructure costs vs. API costs
Quality:
- Response quality scores
- User feedback/ratings
- Model drift detection
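A tiny sketch of deriving the performance and cost numbers from raw per-request records; the sample data is made up.
# Derive latency percentiles, error rate, and cost per request from raw records.
records = [  # (latency_seconds, cost_dollars, error) -- illustrative sample data
    (0.42, 0.0031, False),
    (0.85, 0.0044, False),
    (2.10, 0.0090, True),
    (0.51, 0.0029, False),
]

def percentile(values, pct):
    ordered = sorted(values)
    idx = min(round(pct / 100 * (len(ordered) - 1)), len(ordered) - 1)
    return ordered[idx]

latencies = [r[0] for r in records]
print(f"p50={percentile(latencies, 50):.2f}s p95={percentile(latencies, 95):.2f}s p99={percentile(latencies, 99):.2f}s")
print(f"error_rate={sum(r[2] for r in records) / len(records):.1%}")
print(f"cost_per_request=${sum(r[1] for r in records) / len(records):.4f}")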
Tools: Use Langfuse, LangSmith, or Helicone for AI-specific observability. They track prompts, completions, costs, and latency in one dashboard.
Start Simple, Optimize Later
The best deployment pattern is the one that gets you to production fastest. Start with cloud APIs. Measure everything. When costs or latency hurt, optimize strategically.
Don't prematurely optimize. A startup burning $2K/month on Claude API but finding product-market fit is winning. A startup spending 3 months self-hosting to save $500/month is losing.
Deployment evolution path:
- Week 1: Cloud APIs (ship the MVP)
- Month 3: Add caching and prompt optimization (cut costs 40%)
- Month 6: Hybrid approach (self-host simple tasks, APIs for complex)
- Year 1: Full self-hosted if volume justifies it
Deploy smart. Measure ruthlessly. Optimize when it matters.