
Chapter 11: Model Monitoring in Production

"A deployed model is not the end β€” it's the beginning. Models degrade silently without monitoring."


11.1 Why Models Degrade

ML models in production face a fundamental challenge: the world changes, but the model doesn't (unless retrained).

MODEL AT TRAINING TIME:              MODEL 6 MONTHS LATER:
  User avg age: 28                     User avg age: 35 (shifted!)
  Avg income: $45,000                  Avg income: $62,000 (shifted!)
  Accuracy: 0.91                       Accuracy: 0.71 ← Silent degradation!

These shifts are broadly known as data drift or concept drift; the next section distinguishes the variants.
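
The "silent" part is what makes this dangerous: nothing crashes, and the API keeps returning 200s while quality erodes. Even a crude statistical check catches a shift like the one above. A minimal stdlib sketch (the sample ages and the 2-sigma threshold are illustrative assumptions):

```python
# drift_sniff.py β€” toy illustration of catching a silent feature shift.
# Flags drift when the live mean moves more than `threshold` standard
# deviations away from the frozen training-time baseline.
from statistics import mean, stdev

def mean_shift(baseline: list[float], live: list[float],
               threshold: float = 2.0) -> bool:
    """Return True when the live mean drifts away from the baseline mean."""
    base_mu, base_sigma = mean(baseline), stdev(baseline)
    return abs(mean(live) - base_mu) > threshold * base_sigma

# Training-time ages vs. six-months-later ages (numbers echo the example above)
train_ages = [26, 27, 28, 28, 29, 30]
live_ages  = [33, 34, 35, 35, 36, 37]
print(mean_shift(train_ages, live_ages))  # β†’ True (shift detected)
```

A mean test is deliberately naive β€” it misses shifts that keep the mean constant β€” which is why production systems compare whole distributions, as shown in 11.7.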


11.2 Types of Drift

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    TYPES OF DRIFT                            β”‚
β”‚                                                              β”‚
β”‚  DATA DRIFT (Covariate Shift):                               β”‚
β”‚  Input feature distribution changes                          β”‚
β”‚  P(X) changes, P(Y|X) stays same                             β”‚
β”‚  Example: New user demographics                              β”‚
β”‚                                                              β”‚
β”‚  CONCEPT DRIFT:                                              β”‚
β”‚  Relationship between features and label changes             β”‚
β”‚  P(Y|X) changes                                              β”‚
β”‚  Example: Fraud patterns evolve                              β”‚
β”‚                                                              β”‚
β”‚  LABEL DRIFT:                                                β”‚
β”‚  Target variable distribution changes                        β”‚
β”‚  P(Y) changes                                                β”‚
β”‚  Example: More fraud cases overall                           β”‚
β”‚                                                              β”‚
β”‚  PREDICTION DRIFT:                                           β”‚
β”‚  Model output distribution shifts                            β”‚
β”‚  Example: Model starts predicting one class too much         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
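
Data drift in a single feature can be quantified by comparing its empirical distribution in production against the training baseline. Libraries such as Evidently or Alibi-Detect (see 11.7) do this properly; as a sketch of the underlying idea, here is a toy two-sample Kolmogorov-Smirnov statistic in pure Python (the sample values are made up):

```python
# ks_drift.py β€” toy two-sample Kolmogorov-Smirnov statistic, stdlib only.
def ks_statistic(sample_a, sample_b):
    """Max gap between the two empirical CDFs; 0.0 = identical, 1.0 = disjoint."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] == x:   # step past ties in a
            i += 1
        while j < len(b) and b[j] == x:   # step past ties in b
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

baseline = [28, 29, 30, 27, 28, 31, 29]   # training-time ages
current  = [34, 36, 35, 33, 37, 35, 36]   # production window
print(ks_statistic(baseline, current))    # β†’ 1.0 (the samples don't overlap)
```

In practice you would use `scipy.stats.ks_2samp`, which also returns a p-value, and run the test per feature against a stored training snapshot.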

11.3 What to Monitor

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              MONITORING DIMENSIONS                       β”‚
β”‚                                                          β”‚
β”‚  1️⃣  MODEL PERFORMANCE METRICS                           β”‚
β”‚      β”œβ”€β”€ Accuracy / F1 / AUC (if labels available)       β”‚
β”‚      β”œβ”€β”€ Prediction confidence scores                    β”‚
β”‚      └── Prediction latency                              β”‚
β”‚                                                          β”‚
β”‚  2️⃣  DATA QUALITY                                         β”‚
β”‚      β”œβ”€β”€ Missing values in input                         β”‚
β”‚      β”œβ”€β”€ Feature value ranges (out-of-range inputs?)     β”‚
β”‚      └── Feature distribution vs training baseline       β”‚
β”‚                                                          β”‚
β”‚  3️⃣  INFRASTRUCTURE                                       β”‚
β”‚      β”œβ”€β”€ CPU / Memory usage                              β”‚
β”‚      β”œβ”€β”€ Requests per second                             β”‚
β”‚      β”œβ”€β”€ Error rates (HTTP 4xx, 5xx)                     β”‚
β”‚      └── Container health                                β”‚
β”‚                                                          β”‚
β”‚  4️⃣  BUSINESS METRICS                                     β”‚
β”‚      β”œβ”€β”€ Click-through rate (for recommendation models)  β”‚
β”‚      β”œβ”€β”€ Revenue per prediction                          β”‚
β”‚      └── Model-specific KPIs                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
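
Dimension 2️⃣ (data quality) is usually the cheapest to implement and often catches problems first. A minimal validation gate might look like this sketch (the field names and ranges are assumptions for illustration):

```python
# quality_gate.py β€” reject payloads with missing or out-of-range fields
# before they silently reach the model. Ranges here are assumed examples.
EXPECTED_RANGES = {"age": (18, 100), "income": (0, 1_000_000)}

def validate(payload: dict) -> list[str]:
    """Return a list of data-quality problems; an empty list means clean."""
    problems = []
    for field, (lo, hi) in EXPECTED_RANGES.items():
        value = payload.get(field)
        if value is None:
            problems.append(f"missing: {field}")
        elif not lo <= value <= hi:
            problems.append(f"out of range: {field}={value}")
    return problems

print(validate({"age": 150, "income": 52_000}))  # β†’ ['out of range: age=150']
```

Each problem found can also be counted in a Prometheus `Counter` so that a spike in bad inputs shows up on the same dashboard as latency and accuracy.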

11.4 Monitoring Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   MONITORING ARCHITECTURE                            β”‚
β”‚                                                                      β”‚
β”‚  ML Model API                                                        β”‚
β”‚  (FastAPI/Flask)                                                     β”‚
β”‚       β”‚                                                              β”‚
β”‚       β”‚ expose metrics on /metrics                                   β”‚
β”‚       β–Ό                                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                                    β”‚
β”‚  β”‚  Prometheus  β”‚ ← scrapes metrics every 15 seconds                 β”‚
β”‚  β”‚  (time-seriesβ”‚                                                    β”‚
β”‚  β”‚   database)  β”‚                                                    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                                                    β”‚
β”‚         β”‚                                                            β”‚
β”‚         β–Ό                                                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                  β”‚
β”‚  β”‚   Grafana    │──── β”‚    Alert Manager         β”‚                  β”‚
β”‚  β”‚  (dashboards)β”‚     β”‚  (sends to Slack/Email)  β”‚                  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  β”‚
β”‚                                                                      β”‚
β”‚  (For data drift:)                                                   β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                    β”‚
β”‚  β”‚  Evidently AI / Alibi-Detect β”‚ ← statistical drift detection      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
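
One way to run this stack locally is Docker Compose. The sketch below wires the four services together, reusing the `ml-api` and `alertmanager` hostnames that the Prometheus configuration in 11.6 expects (image tags, ports, and volume paths are assumptions):

```yaml
# docker-compose.yml β€” local monitoring stack (a sketch, not a production setup)
services:
  ml-api:
    build: .
    ports: ["8000:8000"]
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/alert_rules.yml:/etc/prometheus/alert_rules.yml
    ports: ["9090:9090"]
  alertmanager:
    image: prom/alertmanager:latest
    ports: ["9093:9093"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
```

Compose's default network resolves service names as hostnames, which is why the Prometheus targets can be written as `ml-api:8000` and `alertmanager:9093`.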

11.5 Instrumenting Your Model API

# src/serve.py β€” with Prometheus metrics
from fastapi import FastAPI
from prometheus_client import Counter, Histogram, generate_latest
from prometheus_client import CONTENT_TYPE_LATEST
import time
import pickle
import numpy as np
from starlette.responses import Response

app = FastAPI()

# Define metrics
PREDICTION_COUNT = Counter(
    'ml_predictions_total',
    'Total number of predictions',
    ['model_version', 'result_class']
)

PREDICTION_LATENCY = Histogram(
    'ml_prediction_latency_seconds',
    'Prediction request latency',
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
)

PREDICTION_CONFIDENCE = Histogram(
    'ml_prediction_confidence',
    'Confidence score of predictions',
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)

FEATURE_VALUES = Histogram(
    'ml_input_feature_age',
    'Distribution of age feature in requests',
    buckets=[18, 25, 35, 45, 55, 65, 75]
)

with open("models/model.pkl", "rb") as f:
    model = pickle.load(f)

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.post("/predict")
def predict(data: dict):
    start_time = time.time()

    features = np.array(data["features"]).reshape(1, -1)

    # Track input feature distribution
    FEATURE_VALUES.observe(data["features"][0])  # assumes features[0] is 'age'

    # Make prediction
    prediction = model.predict(features)[0]
    confidence = model.predict_proba(features).max()

    # Record metrics
    latency = time.time() - start_time
    PREDICTION_LATENCY.observe(latency)
    PREDICTION_COUNT.labels(model_version="v2", result_class=str(prediction)).inc()
    PREDICTION_CONFIDENCE.observe(confidence)

    return {"prediction": int(prediction), "confidence": float(confidence)}

11.6 Prometheus Configuration

# monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: "ml-model"
    static_configs:
      - targets: ["ml-api:8000"]
    metrics_path: /metrics

  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod

# monitoring/alert_rules.yml
groups:
  - name: ml-model-alerts
    rules:

      - alert: HighPredictionLatency
        expr: histogram_quantile(0.95, rate(ml_prediction_latency_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ML model p95 latency > 500ms"

      - alert: LowPredictionConfidence
        expr: histogram_quantile(0.5, rate(ml_prediction_confidence_bucket[5m])) < 0.6
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Model confidence dropping β€” possible drift!"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5%"

11.7 Data Drift Detection with Evidently

# drift_check.py β€” run periodically in CI/CT
import pandas as pd
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset

# Load reference (training) data and current production data
reference_data = pd.read_csv("data/train.csv")
current_data = pd.read_csv("data/production_logs_last_7days.csv")

# Define columns
column_mapping = ColumnMapping(
    target="label",
    prediction="predicted_label",
    numerical_features=["age", "income", "score"],
    categorical_features=["category", "region"],
)

# Create and run drift report
report = Report(metrics=[
    DataDriftPreset(),
    DataQualityPreset(),
])

report.run(
    reference_data=reference_data,
    current_data=current_data,
    column_mapping=column_mapping,
)

# Save HTML report
report.save_html("reports/drift_report.html")

# Check whether drift was detected (dataset_drift is a boolean flag)
results = report.as_dict()
drift_detected = results["metrics"][0]["result"]["dataset_drift"]

if drift_detected:
    print("⚠️  DATA DRIFT DETECTED β€” triggering retraining!")
    # Trigger Jenkins or GitHub Actions retraining job
    import subprocess
    subprocess.run(["curl", "-X", "POST", "http://jenkins:8080/job/retrain/build"])
else:
    print("βœ… No significant drift detected.")

11.8 Grafana Dashboard Setup

Key panels to include in your ML monitoring dashboard:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    GRAFANA ML DASHBOARD                          β”‚
β”‚                                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                 β”‚
β”‚  β”‚  Predictions/sec   β”‚  β”‚  p95 Latency (ms)  β”‚                 β”‚
β”‚  β”‚       β–‚β–„β–†β–„β–‚        β”‚  β”‚    ─────────────   β”‚                 β”‚
β”‚  β”‚       123 req/s    β”‚  β”‚      42ms          β”‚                 β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                 β”‚
β”‚  β”‚ Confidence Scores  β”‚  β”‚  Feature Drift     β”‚                 β”‚
β”‚  β”‚ Histogram          β”‚  β”‚  Score             β”‚                 β”‚
β”‚  β”‚ ▇▇▇▅▃▁▁            β”‚  β”‚  age: 0.12 βœ…      β”‚                 β”‚
β”‚  β”‚ Most > 0.8 βœ…       β”‚  β”‚  income: 0.43 ⚠️  β”‚                 β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                β”‚
β”‚  β”‚  Prediction Class Distribution Over Time   β”‚                β”‚
β”‚  β”‚  Class 0: β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 72%             β”‚                β”‚
β”‚  β”‚  Class 1: β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 28%                        β”‚                β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
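
Each panel is backed by a PromQL query. Assuming the metric names from the instrumentation in 11.5, the queries behind the panels might look like the following sketches:

```promql
# Predictions per second (top-left panel)
sum(rate(ml_predictions_total[5m]))

# p95 latency (histogram_quantile needs the _bucket series, aggregated by le)
histogram_quantile(0.95, sum(rate(ml_prediction_latency_seconds_bucket[5m])) by (le))

# Prediction class distribution over time (bottom panel)
sum(rate(ml_predictions_total[5m])) by (result_class)
  / ignoring(result_class) group_left sum(rate(ml_predictions_total[5m]))
```

Note the `_bucket` suffix: a Prometheus `Histogram` exposes `_bucket`, `_sum`, and `_count` series, and quantiles are only computable from the buckets.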

11.9 Retraining Triggers

TRIGGER TYPES:
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚                                                     β”‚
  β”‚  πŸ“… SCHEDULE       β†’ Retrain every Monday at 2am    β”‚
  β”‚                                                     β”‚
  β”‚  πŸ“Š PERFORMANCE    β†’ Retrain if accuracy < 0.80     β”‚
  β”‚                                                     β”‚
  β”‚  πŸ“‰ DRIFT DETECTED β†’ Retrain if drift score > 0.3   β”‚
  β”‚                                                     β”‚
  β”‚  πŸ“¦ NEW DATA       β†’ Retrain when 10k new rows land β”‚
  β”‚                                                     β”‚
  β”‚  🚨 INCIDENT       β†’ Manual emergency retrain       β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
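
These triggers can be combined into a single auditable policy function. A sketch using the thresholds from the table above (the schedule trigger is left to cron, and the metric inputs are assumed to come from your monitoring stack β€” Prometheus queries, drift reports, warehouse row counts):

```python
# retrain_policy.py β€” sketch of an auditable retraining decision.
def should_retrain(accuracy: float, drift_score: float, new_rows: int,
                   manual_override: bool = False) -> tuple[bool, list[str]]:
    """Return (decision, reasons) so every retrain is explainable, not just a bool."""
    reasons = []
    if manual_override:
        reasons.append("incident: manual emergency retrain")
    if accuracy < 0.80:
        reasons.append(f"performance: accuracy {accuracy:.2f} < 0.80")
    if drift_score > 0.3:
        reasons.append(f"drift: score {drift_score:.2f} > 0.3")
    if new_rows >= 10_000:
        reasons.append(f"new data: {new_rows} rows >= 10k")
    return bool(reasons), reasons

decision, why = should_retrain(accuracy=0.76, drift_score=0.12, new_rows=4_200)
print(decision, why)  # β†’ True ['performance: accuracy 0.76 < 0.80']
```

Logging the reasons alongside each retrain makes it easy to see, months later, which trigger fires most often β€” a useful signal for tuning the thresholds themselves.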

Next Chapter β†’ 12: End-to-End Project