TL;DR

Monitor AI systems continuously for performance degradation, data drift, cost spikes, and drops in user satisfaction. Alert on anomalies and keep runbooks for common failures.

What to monitor

Performance metrics:

  • Latency (response time)
  • Error rate
  • Throughput (requests/sec)
  • Availability (uptime)

Quality metrics:

  • Accuracy, precision, recall
  • User ratings/feedback
  • Task success rate

Cost metrics:

  • Token usage
  • API spend
  • Per-user cost

Usage patterns:

  • Query volume
  • Popular features
  • Geographic distribution
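
These signals are easiest to act on when every request emits them. A minimal sketch, assuming a Prometheus/Grafana setup via the prometheus_client library and a model call that returns token counts; metric and label names are illustrative:

  # Per-request instrumentation sketch. Assumes prometheus_client is installed
  # and that call_fn returns a dict with token counts; adapt names to your stack.
  import time
  from prometheus_client import Counter, Histogram, start_http_server

  REQUESTS = Counter("ai_requests_total", "Requests served", ["model", "status"])
  LATENCY = Histogram("ai_request_latency_seconds", "End-to-end latency", ["model"])
  TOKENS = Counter("ai_tokens_total", "Tokens consumed", ["model", "kind"])

  def instrumented_call(model, prompt, call_fn):
      """Wrap a model call; record latency, errors, and token usage."""
      start = time.monotonic()
      try:
          response = call_fn(prompt)  # your actual model/API call
          REQUESTS.labels(model=model, status="ok").inc()
          TOKENS.labels(model=model, kind="prompt").inc(response["prompt_tokens"])
          TOKENS.labels(model=model, kind="completion").inc(response["completion_tokens"])
          return response
      except Exception:
          REQUESTS.labels(model=model, status="error").inc()
          raise
      finally:
          LATENCY.labels(model=model).observe(time.monotonic() - start)

  start_http_server(8000)  # exposes /metrics for Prometheus to scrape

Quality and usage metrics (ratings, task success, feature counts) can be emitted the same way as labeled counters.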

Data drift

What is it?

  • Production data gradually diverges from the data the model was trained on
  • Model performance degrades over time as a result

Types:

  • Feature drift: The distribution of input features changes
  • Concept drift: The relationship between inputs and the correct output changes

Detection:

  • Compare input distributions over time
  • Track performance trends
  • Statistical tests (KS test, chi-square)
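
For numeric features, a two-sample Kolmogorov-Smirnov test is a common starting point (chi-square suits categorical features). A minimal sketch assuming SciPy; the windows and significance threshold are illustrative:

  # Feature-drift check: compare a reference window (e.g., training data)
  # against recent production inputs with a two-sample KS test.
  import numpy as np
  from scipy.stats import ks_2samp

  def drift_detected(reference, recent, alpha=0.01):
      """True if the recent distribution differs significantly from the reference."""
      _statistic, p_value = ks_2samp(reference, recent)
      return p_value < alpha  # small p-value: distributions likely differ

  # Example: prompt lengths trending longer than at training time
  reference = np.random.normal(200, 50, size=5_000)  # stand-in for training-time lengths
  recent = np.random.normal(260, 60, size=1_000)     # stand-in for last week's lengths
  print("drift:", drift_detected(reference, recent))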

Response:

  • Retrain or fine-tune on recent data
  • Update prompts or features to reflect current inputs
  • Re-run the eval suite to confirm the fix

Setting up monitoring

Logging:

  • Log all requests and responses
  • Include metadata (user, timestamp, model version)
  • Sample or aggregate payloads to keep logging costs down
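
A minimal sketch of structured, sampled logging; the field names and the 10% sample rate are illustrative choices, not a standard:

  # Log lightweight metadata for every request; keep full payloads only for a sample.
  import json, random, time, uuid

  SAMPLE_RATE = 0.1  # retain full prompts/responses for ~10% of traffic

  def log_request(user_id, model_version, prompt, response, latency_ms, tokens):
      record = {
          "request_id": str(uuid.uuid4()),
          "timestamp": time.time(),
          "user_id": user_id,
          "model_version": model_version,
          "latency_ms": latency_ms,
          "tokens": tokens,
      }
      if random.random() < SAMPLE_RATE:
          record["prompt"] = prompt
          record["response"] = response
      print(json.dumps(record))  # replace with your log shipper or warehouse writer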

Dashboards:

  • Real-time metrics
  • Historical trends
  • Segment by user, feature, region

Alerts:

  • Error rate spikes
  • Latency increases
  • Cost anomalies
  • Negative feedback surge

Alert thresholds

Set carefully:

  • Too sensitive: Alert fatigue
  • Too loose: Miss real issues

Start with:

  • Error rate > 5% (adjust based on baseline)
  • p95 latency > 3x normal
  • Daily cost > 20% above average
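
Expressed as code, the starter thresholds above look like the sketch below; the baseline values are placeholders you would compute from your own traffic history:

  # Starter alert rules from the list above; tune thresholds to your baseline.
  def check_alerts(error_rate, p95_latency_ms, daily_cost, baseline):
      alerts = []
      if error_rate > 0.05:
          alerts.append(f"error rate {error_rate:.1%} exceeds 5%")
      if p95_latency_ms > 3 * baseline["p95_latency_ms"]:
          alerts.append(f"p95 latency {p95_latency_ms:.0f} ms is over 3x normal")
      if daily_cost > 1.2 * baseline["daily_cost"]:
          alerts.append(f"daily cost ${daily_cost:.2f} is over 20% above average")
      return alerts

  baseline = {"p95_latency_ms": 800, "daily_cost": 120.0}  # illustrative history
  print(check_alerts(error_rate=0.07, p95_latency_ms=2600, daily_cost=160.0, baseline=baseline))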

Response runbooks

High error rate:

  1. Check API status
  2. Review recent changes
  3. Rollback if needed
  4. Escalate to on-call

Performance degradation:

  1. Check for data drift
  2. Review recent data
  3. Retrain or adjust prompts

Cost spike:

  1. Identify source (user, feature)
  2. Check for abuse
  3. Rate limit if needed
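
For step 1 of the cost-spike runbook, attributing spend to users (or features) usually makes the source obvious. A rough sketch over usage logs; the record fields and per-token price are placeholders:

  # Attribute estimated spend per user from usage records to find the spike's source.
  from collections import Counter

  def top_spenders(usage_records, cost_per_1k_tokens=0.002, n=5):
      spend = Counter()
      for rec in usage_records:
          spend[rec["user_id"]] += rec["tokens"] / 1000 * cost_per_1k_tokens
      return spend.most_common(n)

  records = [
      {"user_id": "u1", "tokens": 1_200_000},
      {"user_id": "u2", "tokens": 90_000},
      {"user_id": "u1", "tokens": 800_000},
  ]
  print(top_spenders(records))  # one user dominating spend -> check for abuse, then rate limit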

Continuous evaluation

Automated testing:

  • Run eval suite regularly
  • Catch regressions early
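
One way to keep this honest is a scheduled regression check against a fixed set of golden cases. A sketch with a placeholder run_model and an arbitrarily chosen pass-rate floor:

  # Scheduled eval: run golden cases and flag a regression if the pass rate drops.
  EVAL_CASES = [
      {"prompt": "What is 2 + 2?", "expected": "4"},
      # add more cases with known-good answers
  ]

  def run_model(prompt):
      raise NotImplementedError("call your deployed model here")

  def eval_suite(min_pass_rate=0.9):
      passed = sum(
          1 for case in EVAL_CASES
          if case["expected"].lower() in run_model(case["prompt"]).lower()
      )
      pass_rate = passed / len(EVAL_CASES)
      return pass_rate >= min_pass_rate  # alert or block deploys when False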

Human review:

  • Sample outputs weekly
  • Qualitative assessment
  • Discover edge cases

User feedback:

  • Thumbs up/down
  • Detailed surveys
  • Support ticket analysis
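
Thumbs up/down signals are most useful as a rolling rate you can alert on. A small sketch; the window size and the 20% threshold are arbitrary:

  # Track recent ratings and flag a surge in negative feedback.
  from collections import deque

  recent_ratings = deque(maxlen=500)  # last 500 thumbs up/down

  def record_feedback(thumbs_up):
      recent_ratings.append(bool(thumbs_up))

  def negative_rate():
      return 0.0 if not recent_ratings else 1 - sum(recent_ratings) / len(recent_ratings)

  # e.g. page someone when negative_rate() > 0.20 over a full window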

Tools and platforms

APM tools:

  • Datadog, New Relic, Grafana

ML-specific:

  • Arize, WhyLabs, Evidently, LangSmith

Custom:

  • Build on your stack
  • More control, more effort

Best practices

  1. Monitor from day 1
  2. Set up alerts before incidents
  3. Review dashboards weekly
  4. Update thresholds as you learn
  5. Document common issues and fixes

What's next

  • Responsible AI Deployment
  • A/B Testing AI
  • Incident Response for AI