Monitoring AI Systems in Production
By Marcin Piekarski · builtweb.com.au · Last Updated: 11 February 2026
TL;DR
AI systems in production need continuous monitoring because their performance degrades over time. Track response quality, latency, error rates, cost, and data drift. Set up alerts for anomalies, build dashboards for visibility, and create runbooks so your team knows exactly what to do when something goes wrong. Monitoring is not optional — it is the difference between a reliable product and one that silently breaks.
Why it matters
Traditional software either works or it does not. A function that returns the correct result today will return the correct result tomorrow. AI is different. A model that performs brilliantly at launch can gradually degrade as the real world changes around it. Customer language shifts, new topics emerge, seasonal patterns change, and the model — frozen in time from its training data — falls behind.
Without monitoring, you will not know this is happening until users complain, revenue drops, or something goes publicly wrong. By then, the damage is done. Companies that deploy AI without robust monitoring are essentially flying blind, hoping that the model keeps working but having no way to verify it.
The good news is that effective AI monitoring builds on traditional software monitoring practices. If you already track uptime, error rates, and response times, you are halfway there. The AI-specific additions — quality metrics, drift detection, and evaluation pipelines — are what make the difference.
What to monitor: the four pillars
Effective AI monitoring covers four categories. Missing any one of them leaves a blind spot.
Performance metrics tell you whether the system is technically healthy:
- Latency (response time): How long does each request take? Track p50, p95, and p99 percentiles, not just averages. An average of 500ms might hide the fact that 5% of users are waiting 5 seconds.
- Error rate: What percentage of requests fail? At scale, even a 1% error rate means thousands of users per day having a bad experience.
- Throughput: How many requests are you handling per second? Is it trending up (growth) or down (potential issue)?
- Availability: What is your uptime? Can your system handle traffic spikes without degrading?
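To make the percentile point concrete, here is a minimal nearest-rank percentile sketch in Python; the latency numbers are invented to show how an average hides the tail:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value >= p% of the sample."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Invented latencies (ms): mostly fast, with two slow outliers.
latencies = [120, 95, 110, 4800, 105, 130, 98, 102, 5100, 115]
mean = sum(latencies) / len(latencies)   # ~1078 ms, misleading on its own
p50 = percentile(latencies, 50)          # 110 ms: the typical user
p95 = percentile(latencies, 95)          # 5100 ms: the tail users actually feel
```

The mean sits nowhere near what any real user experienced, which is why p50/p95/p99 belong on the dashboard and the average does not.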
Quality metrics tell you whether the AI is doing its job well:
- Accuracy, precision, and recall for classification tasks. Are the model's predictions actually correct?
- User ratings and feedback: Thumbs up/down, star ratings, or explicit feedback. This is the most direct signal of whether users find the AI helpful.
- Task success rate: For goal-oriented AI (like a customer support bot), what percentage of interactions successfully resolve the user's issue?
- Hallucination rate: For generative AI, how often does the model produce factually incorrect or fabricated information?
Cost metrics tell you whether the AI is financially sustainable:
- Token usage per request and in aggregate. Are costs trending up?
- API spend broken down by model, feature, and user segment.
- Cost per interaction: How much does each user interaction cost? This is critical for pricing and profitability decisions.
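Cost per interaction falls straight out of token counts. A sketch, using hypothetical per-1,000-token prices; substitute your provider's actual rates:

```python
# Hypothetical prices per 1,000 tokens -- NOT real provider rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}

def cost_per_request(input_tokens, output_tokens, prices=PRICE_PER_1K):
    """Dollar cost of a single request from its token counts."""
    return (input_tokens / 1000 * prices["input"]
            + output_tokens / 1000 * prices["output"])
```

Summing this per feature, model, or user segment gives the breakdowns listed above.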
Usage patterns tell you how the AI is being used:
- Query volume over time. Traffic patterns reveal peak hours, seasonal trends, and growth trajectories.
- Feature popularity: Which AI features are users engaging with most?
- Geographic distribution: Are you seeing traffic from unexpected regions?
- User segments: Do different user groups have different quality experiences?
Data drift: the silent model killer
Data drift is the most insidious problem in production AI. It happens when the real-world data the model encounters gradually diverges from the data it was trained on.
Feature drift occurs when the distribution of inputs changes. A model trained on English-language customer support queries might start receiving more queries in Spanish. A product recommendation model trained on pre-pandemic shopping data might struggle with post-pandemic patterns. The model is not broken — the world has changed.
Concept drift is even harder to detect. The relationships between inputs and outputs change. Slang shifts, so a review calling a product "sick" once signalled a complaint but now signals praise. Economic conditions shift, and the risk factors for loan default change. The model's learned relationships become stale.
How to detect drift:
- Compare the statistical distributions of your inputs over time. If today's input distribution looks significantly different from last month's, you may have feature drift.
- Track performance metrics over time. A gradual decline in accuracy is a classic drift signal.
- Use statistical tests (Kolmogorov-Smirnov test, chi-square test) to formally measure distribution changes.
- Monitor the model's confidence scores. A sudden increase in low-confidence predictions often indicates the model is seeing unfamiliar inputs.
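One dependency-free way to quantify the distribution comparison in the first step is the Population Stability Index (PSI). A sketch; the bin count and the usual 0.1/0.25 interpretation cut-offs are rules of thumb, not universal constants:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # Small floor avoids log(0) when a bin is empty.
        return [max(c / len(sample), 1e-4) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Run it weekly against a frozen baseline sample of training-era inputs; a rising PSI on a key feature is an early drift warning.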
How to respond to drift:
- Retrain the model on recent data.
- Update prompts if you are using a language model with prompt-based instructions.
- Expand your training data to cover the new patterns.
- In extreme cases, deploy a completely new model architecture better suited to the changed environment.
Setting up your monitoring stack
Logging is the foundation. Log every request and response, including the full prompt (or a hash of it for privacy), the model's output, latency, token counts, error codes, and user metadata. Store logs in a searchable system (like Elasticsearch or a cloud logging service) so you can investigate issues quickly.
For cost and privacy reasons, you may not want to log every single request in full. Common strategies include sampling (log 10% of requests in full, metadata for all), or aggregating (store summaries every minute rather than individual requests).
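The sampling strategy above can be sketched as a small helper; the field names and the 10% rate are illustrative, not prescriptive:

```python
import random

SAMPLE_RATE = 0.10  # fraction of requests logged in full; illustrative

def log_request(record, sample_rate=SAMPLE_RATE, rng=random.random):
    """Return the entry to persist: full payload for a sample of
    requests, metadata only for the rest."""
    meta = {k: record[k]
            for k in ("request_id", "latency_ms", "tokens", "status")
            if k in record}
    if rng() < sample_rate:
        return {**meta, "prompt": record.get("prompt"),
                "output": record.get("output")}
    return meta
```

Injecting `rng` makes the sampling decision deterministic in tests while staying random in production.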
Dashboards turn raw logs into actionable information. Build dashboards that show:
- Real-time request volume and error rates
- Latency percentile trends over the past 24 hours, 7 days, and 30 days
- Daily cost broken down by feature and model
- Quality metric trends (user ratings, task success rates)
- Drift indicators (input distribution changes)
The best dashboards tell you at a glance whether everything is healthy or whether something needs attention. Avoid the temptation to add every possible metric. Focus on the 5-10 numbers that matter most for your specific application.
Alerts notify you when something goes wrong before users notice. Good alerts are specific, actionable, and not too noisy.
Configuring alerts that work
Alert fatigue is real. If your team receives 50 alerts per day, they will start ignoring all of them, including the critical ones. Setting thresholds carefully is an art:
Start conservatively. Begin with wide thresholds and tighten them as you learn what is normal for your system. Some recommended starting points:
- Error rate above 5% (adjust based on your baseline)
- p95 latency more than 3x the normal value
- Daily cost more than 20% above the trailing average
- User satisfaction score dropping more than 10% week over week
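Those starting thresholds can be encoded directly. A sketch; the metric names and the shape of the baseline dict are invented for illustration:

```python
def check_alerts(current, baseline):
    """Apply the starting thresholds from the text; tune to your baseline."""
    alerts = []
    if current["error_rate"] > 0.05:                                # above 5%
        alerts.append("error_rate")
    if current["p95_latency_ms"] > 3 * baseline["p95_latency_ms"]:  # 3x normal
        alerts.append("latency")
    if current["daily_cost"] > 1.20 * baseline["daily_cost"]:       # +20% vs trailing avg
        alerts.append("cost")
    if current["satisfaction"] < 0.90 * baseline["satisfaction"]:   # -10% week over week
        alerts.append("satisfaction")
    return alerts
```

Keeping the thresholds in one function makes the quarterly threshold review a one-file change.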
Use multiple severity levels. Not every alert needs to wake someone up at 3am. Use tiers: informational (log it), warning (investigate during business hours), critical (page the on-call engineer immediately).
Include context in alerts. "Error rate is 8%" is less helpful than "Error rate is 8% (normally 2%), started 15 minutes ago, concentrated in the summarisation feature, possibly related to the deployment at 14:30."
Building response runbooks
When an alert fires, your team should not have to figure out what to do from scratch. Runbooks document the steps for common scenarios:
High error rate:
- Check the AI provider's status page for outages.
- Review the most recent deployment for changes.
- Examine error logs for patterns (is it one model, one feature, or one user segment?).
- If caused by a deployment, roll back.
- If caused by a provider outage, activate fallback (cached responses, alternative provider, graceful degradation).
Performance degradation:
- Check for data drift by comparing recent input distributions to baseline.
- Run your evaluation suite on a sample of recent requests.
- If drift is detected, prioritise a prompt update or model retrain.
- If no drift, investigate infrastructure issues (memory, CPU, network).
Cost spike:
- Identify the source (which feature, model, or user segment?).
- Check for abuse patterns (one user making 100,000 requests).
- Implement rate limiting if needed.
- Review whether the cost spike correlates with a feature launch or traffic event.
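The abuse check in the runbook above can be as simple as counting requests per user over the window you are investigating; the default threshold here is illustrative:

```python
from collections import Counter

def flag_heavy_users(request_user_ids, threshold=100_000):
    """Return users whose request count exceeds the threshold.
    The default threshold is illustrative, not a recommendation."""
    counts = Counter(request_user_ids)
    return {user: n for user, n in counts.items() if n > threshold}
```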
Continuous evaluation
Do not wait for problems to become visible. Run automated evaluations continuously:
Automated testing runs a predefined set of test cases against your production model on a schedule (daily or weekly). If the model's scores on these test cases drop below a threshold, you get an early warning before users are affected.
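A scheduled evaluation run can be very small. A sketch, assuming exact-match scoring; real suites usually need softer comparisons (semantic similarity, rubric grading):

```python
def run_eval_suite(model_fn, cases, threshold=0.9):
    """Score model_fn against (input, expected) pairs, exact match only.
    Returns (score, passed) so the scheduler can alert on a failing run."""
    correct = sum(1 for x, want in cases if model_fn(x) == want)
    score = correct / len(cases)
    return score, score >= threshold
```

Wire this to a daily cron job and alert when `passed` is false; that is the early warning the paragraph above describes.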
Human review complements automated testing. Sample 50-100 outputs per week and have a human evaluate them for quality, accuracy, and safety. Automated metrics cannot catch everything — a response might be technically correct but unhelpfully worded, or accurate but insensitive.
User feedback loops close the circle. Make it easy for users to flag bad responses (a simple thumbs down button works). Aggregate this feedback and review it weekly. Patterns in negative feedback reveal systematic issues that individual metrics might miss.
Tools and platforms
General APM (Application Performance Monitoring): Datadog, New Relic, and Grafana are excellent for tracking latency, error rates, and throughput. They integrate easily with most tech stacks and provide dashboarding and alerting out of the box.
AI-specific platforms: Weights & Biases, Arize AI, and Langfuse are purpose-built for AI monitoring. They offer features like drift detection, prompt tracking, evaluation pipelines, and AI-specific dashboards that general APM tools lack.
LLM observability: Tools like LangSmith (from LangChain) and Helicone provide detailed logging and analysis specifically for language model applications, tracking prompts, completions, latency, and cost at the request level.
Custom solutions give you maximum control but require more engineering effort. Building on top of your existing stack (Prometheus + Grafana, ELK stack, or cloud-native tools) works well if you have the engineering bandwidth.
Common mistakes
Not monitoring until something breaks. Set up monitoring before your first production user. Retrofitting monitoring after an incident is stressful, error-prone, and always happens at the worst time.
Monitoring only technical metrics and ignoring quality. Your system can have 99.9% uptime and sub-second latency while giving terrible answers. Quality metrics are just as important as performance metrics.
Setting too many alerts. Alert fatigue is a real and dangerous problem. Start with a small number of high-signal alerts and expand gradually. Every alert should be actionable.
Not tracking costs in real time. AI costs can spike suddenly due to bugs, abuse, or traffic surges. By the time you see the monthly bill, the damage is done. Set up real-time cost tracking and daily alerts.
Treating monitoring as a one-time setup. Your monitoring needs will evolve as your product changes. Review and update your dashboards, alerts, and runbooks quarterly.
What's next?
Build on your monitoring knowledge with these related guides:
- AI Incident Response for handling production issues effectively
- AI Evaluation Metrics for choosing the right quality metrics
- AI Cost Management for keeping your AI spend under control
- Responsible AI Deployment for the full picture of production AI best practices
Frequently Asked Questions
How quickly should I set up monitoring for a new AI feature?
Before it goes to production. At minimum, have logging, basic dashboards (latency, error rate, cost), and critical alerts in place before your first real user. You can refine thresholds and add quality metrics in the first few weeks, but flying blind from day one is a recipe for surprises.
How often should I review my AI monitoring dashboards?
Check your real-time dashboard daily during the first few weeks after launch, then weekly once things stabilise. Do a deep review monthly, looking at long-term trends in quality, cost, and usage patterns. Review and update alert thresholds quarterly as your system evolves and you learn what is normal.
What is the minimum monitoring setup for a small AI application?
At minimum, log all requests and responses (or a representative sample), track error rates and latency, set up a cost alert for daily spend, and implement a way for users to flag bad responses (even a simple thumbs down button). This takes a few hours to set up and covers the most critical blind spots. You can expand from there as your application grows.
How do I know if my AI model needs retraining versus just a prompt update?
Check what has changed. If the model is getting new types of inputs it was not designed for, a prompt update that provides better instructions may be enough. If the model's accuracy has degraded gradually across all input types, that usually indicates data drift and calls for retraining. If the model is making consistent errors on a specific topic, adding examples or instructions to the prompt is usually the fastest fix.
About the Authors
Marcin Piekarski · Frontend Lead & AI Educator
Marcin is a Frontend Lead with 20+ years in tech. Currently building headless ecommerce at Harvey Norman (Next.js, Node.js, GraphQL). He created Field Guide to AI to help others understand AI tools practically—without the jargon.
Credentials & Experience:
- 20+ years web development experience
- Frontend Lead at Harvey Norman (10 years)
- Worked with: Gumtree, CommBank, Woolworths, Optus, M&C Saatchi
- Runs AI workshops for teams
- Founder of builtweb.com.au
- Daily AI tools user: ChatGPT, Claude, Gemini, AI coding assistants
- Specializes in React ecosystem: React, Next.js, Node.js
Prism AI · AI Research & Writing Assistant
Prism AI is the AI ghostwriter behind Field Guide to AI — a collaborative ensemble of frontier models (Claude, ChatGPT, Gemini, and others) that assist with research, drafting, and content synthesis. Like light through a prism, human expertise is refracted through multiple AI perspectives to create clear, comprehensive guides.
Transparency Note: All AI-assisted content is thoroughly reviewed, fact-checked, and refined by Marcin Piekarski before publication.
Related Guides
- AI Deployment Lifecycle: From Development to Production (Intermediate, 11 min read). Learn the stages of deploying AI systems safely. From staging to production — practical guidance for each phase of the AI deployment lifecycle.
- AI Incident Response: Handling AI System Failures (Intermediate, 10 min read). Learn to respond effectively when AI systems fail. From detection to resolution — practical procedures for managing AI incidents and minimizing harm.
- AI Cost Management: Controlling AI Spending (Intermediate, 10 min read). Learn to manage and optimize AI costs. From usage tracking to cost optimization strategies — practical guidance for keeping AI spending under control.