TL;DR

MLOps for LLMs covers prompt versioning, evaluation in CI/CD, production monitoring, incident response, and management of the full model lifecycle from development to retirement.

Versioning

What to version:

  • System prompts
  • Model versions
  • Retrieval configurations
  • Evaluation datasets

Tools: Git for prompts, retrieval configurations, and evaluation datasets; a model registry (such as MLflow) for model versions
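
A minimal sketch of what a versioned prompt manifest might look like, committed to Git next to the application code. The schema, the field names, and the idea of fingerprinting the evaluation dataset with SHA-256 are illustrative assumptions, not a fixed standard:

```python
"""Versioned prompt manifest, tracked in Git alongside application code.

The field names and layout are assumptions for this sketch, not a required schema.
"""
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptManifest:
    prompt_id: str            # stable identifier, e.g. "support-triage"
    prompt_version: str       # bumped on every prompt edit
    system_prompt: str        # the prompt text itself
    model: str                # pinned model/version string
    retrieval_config: dict    # index name, top_k, chunking params, ...
    eval_dataset_sha256: str  # ties this version to the eval set it was validated on


def dataset_fingerprint(raw: bytes) -> str:
    """Hash the evaluation dataset so the manifest records what it was tested against."""
    return hashlib.sha256(raw).hexdigest()


if __name__ == "__main__":
    manifest = PromptManifest(
        prompt_id="support-triage",
        prompt_version="2026.02.1",
        system_prompt="You are a support triage assistant...",
        model="gpt-4o-2024-08-06",
        retrieval_config={"index": "kb-prod", "top_k": 5},
        eval_dataset_sha256=dataset_fingerprint(b"...contents of the eval set..."),
    )
    # The serialized manifest is committed and reviewed like any other code change.
    print(json.dumps(asdict(manifest), indent=2))
```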

CI/CD pipeline

  1. Commit prompt change
  2. Run automated evaluations
  3. Compare results to the stored baseline (see the gate sketch after this list)
  4. If the evaluations pass, deploy to staging
  5. Canary deployment to production
  6. Monitor and roll back if needed
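
Steps 2 to 4 can be sketched as a small gate script that a CI runner executes on every prompt change: score the evaluation set, compare the mean to a stored baseline, and exit non-zero to block the deploy on regression. The scorer, file paths, and tolerance below are placeholder assumptions:

```python
"""CI evaluation gate: block the deploy if the candidate prompt regresses.

Assumptions for this sketch: candidate outputs were generated by an earlier
pipeline step into evals/candidate_outputs.jsonl, the baseline score lives in
evals/baseline.json, and a simple exact-match scorer stands in for the real
metric (LLM-as-judge, rubric scoring, ...).
"""
import json
import sys
from pathlib import Path

REGRESSION_TOLERANCE = 0.02  # allow at most a 0.02 drop in mean score


def score_case(case: dict) -> float:
    """Toy scorer: exact match between model output and expected answer."""
    return 1.0 if case.get("output", "").strip() == case.get("expected", "").strip() else 0.0


def mean_score(path: Path) -> float:
    cases = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    if not cases:
        raise ValueError(f"no evaluation cases found in {path}")
    return sum(score_case(c) for c in cases) / len(cases)


def main() -> int:
    baseline = json.loads(Path("evals/baseline.json").read_text())["mean_score"]
    candidate = mean_score(Path("evals/candidate_outputs.jsonl"))
    print(f"baseline={baseline:.3f} candidate={candidate:.3f}")
    if candidate < baseline - REGRESSION_TOLERANCE:
        print("FAIL: candidate regresses beyond tolerance; blocking deploy")
        return 1
    print("PASS: candidate meets the baseline; promote to staging")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

In GitHub Actions, GitLab CI, or a similar runner, a non-zero exit code from this script stops the pipeline before the staging deploy.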

Monitoring

System metrics: Latency, error rate, throughput
Quality metrics: LLM-as-judge scores, user feedback
Cost metrics: Token usage, API spend
Business metrics: User engagement, task completion
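
One way to capture the system and cost metrics is a thin wrapper around the model call that records latency, token counts, and an estimated spend for every request. In this sketch the price table is illustrative and emit() is a placeholder for a Datadog, Prometheus/Grafana, or StatsD client:

```python
"""Per-request instrumentation for latency, token usage, and estimated cost.

Assumption for this sketch: the wrapped model call returns
(text, prompt_tokens, completion_tokens).
"""
import time
from dataclasses import dataclass

PRICE_PER_1K_TOKENS = {"prompt": 0.0025, "completion": 0.01}  # look up your provider's real rates


@dataclass
class RequestMetrics:
    latency_s: float
    prompt_tokens: int
    completion_tokens: int
    estimated_cost_usd: float
    error: bool


def emit(metrics: RequestMetrics) -> None:
    """Placeholder exporter; replace with your metrics backend client."""
    print(metrics)


def instrumented_call(generate, prompt: str) -> str:
    """Wrap generate(prompt) -> (text, prompt_tokens, completion_tokens)."""
    start = time.monotonic()
    text, p_tok, c_tok, error = "", 0, 0, False
    try:
        text, p_tok, c_tok = generate(prompt)
    except Exception:
        error = True
        raise
    finally:
        cost = (p_tok * PRICE_PER_1K_TOKENS["prompt"]
                + c_tok * PRICE_PER_1K_TOKENS["completion"]) / 1000
        emit(RequestMetrics(time.monotonic() - start, p_tok, c_tok, cost, error))
    return text


# Example with a stubbed model call:
# instrumented_call(lambda p: ("ok", 120, 35), "hello")
```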

Incident response

  • Automated alerts on degradation (sketched after this list)
  • Runbooks for common issues
  • Rollback procedures
  • Post-incident reviews
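
A sketch of automated degradation alerting: keep a rolling window of quality scores and errors, then page on-call and trigger the rollback procedure when thresholds are breached. The window size, thresholds, and the page_oncall()/trigger_rollback() hooks are placeholder assumptions to be wired to your alerting and deploy tooling:

```python
"""Degradation alerting sketch: rolling-window checks over quality and error metrics."""
from collections import deque

WINDOW = 200             # number of recent requests to consider
MIN_JUDGE_SCORE = 0.80   # alert if the mean LLM-as-judge score drops below this
MAX_ERROR_RATE = 0.05    # alert if more than 5% of requests error


class DegradationMonitor:
    def __init__(self) -> None:
        self.judge_scores = deque(maxlen=WINDOW)
        self.errors = deque(maxlen=WINDOW)

    def record(self, judge_score: float, errored: bool) -> None:
        self.judge_scores.append(judge_score)
        self.errors.append(1 if errored else 0)
        self._check()

    def _check(self) -> None:
        if len(self.errors) < WINDOW:
            return  # not enough data yet
        mean_score = sum(self.judge_scores) / len(self.judge_scores)
        error_rate = sum(self.errors) / len(self.errors)
        if mean_score < MIN_JUDGE_SCORE or error_rate > MAX_ERROR_RATE:
            page_oncall(f"LLM degradation: judge={mean_score:.2f} errors={error_rate:.1%}")
            trigger_rollback()  # or point on-call at the rollback runbook


def page_oncall(message: str) -> None:
    print("ALERT:", message)  # placeholder for PagerDuty, Slack, etc.


def trigger_rollback() -> None:
    print("Rolling back to the previous prompt/model version")  # placeholder
```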

Evaluation

  • Regression test suite
  • A/B testing framework
  • Shadow deployments (sketched after this list)
  • Human evaluation sampling
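
A shadow deployment can be sketched as a request handler that always serves the production model's response while sending a copy of the traffic to the candidate in the background and logging both outputs for offline comparison. The log_comparison() sink is a placeholder assumption (for example, a table that evaluation jobs read):

```python
"""Shadow deployment sketch: the candidate model receives a copy of production
traffic but never serves users; outputs are only logged for offline comparison."""
import asyncio
from typing import Awaitable, Callable

ModelFn = Callable[[str], Awaitable[str]]


def log_comparison(prompt: str, primary_out: str, shadow_out: str) -> None:
    print({"prompt": prompt, "primary": primary_out, "shadow": shadow_out})


async def handle_request(prompt: str, primary: ModelFn, shadow: ModelFn) -> str:
    primary_out = await primary(prompt)  # the user always gets the production answer

    async def run_shadow() -> None:
        try:
            shadow_out = await shadow(prompt)
            log_comparison(prompt, primary_out, shadow_out)
        except Exception:
            pass  # shadow failures must never affect user traffic

    # In real code, keep a reference to the task so it is not garbage collected.
    _task = asyncio.create_task(run_shadow())
    return primary_out


async def _demo() -> None:
    async def prod_model(p: str) -> str:
        return "production answer"

    async def candidate_model(p: str) -> str:
        return "candidate answer"

    print(await handle_request("hello", prod_model, candidate_model))
    await asyncio.sleep(0)  # give the shadow task a chance to log before exiting


if __name__ == "__main__":
    asyncio.run(_demo())
```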

Tools

  • LangSmith: Tracing, evaluation, monitoring
  • Weights & Biases: Experiment tracking
  • MLflow: Model registry, deployment
  • Datadog/Grafana: Monitoring

Best practices

  • Automate evaluation, deployment, and rollback
  • Monitor proactively
  • Document thoroughly
  • Practice rollbacks
  • Run blameless postmortems