
Chapter 11: Model Monitoring in Production

"A deployed model is not the end β€” it's the beginning. Models degrade silently without monitoring."


11.1 Why Models Degrade

ML models in production face a fundamental challenge: the world changes, but the model doesn't (unless retrained).

MODEL AT TRAINING TIME:              MODEL 6 MONTHS LATER:
  User avg age: 28                     User avg age: 35 (shifted!)
  Avg income: $45,000                  Avg income: $62,000 (shifted!)
  Accuracy: 0.91                       Accuracy: 0.71 ← Silent degradation!

These shifts are broadly known as data drift or concept drift; the next section distinguishes the variants.
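
The "silent" part is what makes this dangerous: nothing crashes, and the API keeps returning 200s while quality erodes. Even a crude statistical check catches a shift like the one above. A minimal stdlib sketch (the sample ages and the 2-sigma threshold are illustrative assumptions):

```python
# drift_sniff.py β€” toy illustration of catching a silent feature shift.
# Flags drift when the live mean moves more than `threshold` standard
# deviations away from the frozen training-time baseline.
from statistics import mean, stdev

def mean_shift(baseline: list[float], live: list[float],
               threshold: float = 2.0) -> bool:
    """Return True when the live mean drifts away from the baseline mean."""
    base_mu, base_sigma = mean(baseline), stdev(baseline)
    return abs(mean(live) - base_mu) > threshold * base_sigma

# Training-time ages vs. six-months-later ages (numbers echo the example above)
train_ages = [26, 27, 28, 28, 29, 30]
live_ages  = [33, 34, 35, 35, 36, 37]
print(mean_shift(train_ages, live_ages))  # β†’ True (shift detected)
```

A mean test is deliberately naive β€” it misses shifts that keep the mean constant β€” which is why production systems compare whole distributions, as shown in 11.7.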


11.2 Types of Drift

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    TYPES OF DRIFT                            β”‚
β”‚                                                              β”‚
β”‚  DATA DRIFT (Covariate Shift):                               β”‚
β”‚  Input feature distribution changes                          β”‚
β”‚  P(X) changes, P(Y|X) stays same                             β”‚
β”‚  Example: New user demographics                              β”‚
β”‚                                                              β”‚
β”‚  CONCEPT DRIFT:                                              β”‚
β”‚  Relationship between features and label changes             β”‚
β”‚  P(Y|X) changes                                              β”‚
β”‚  Example: Fraud patterns evolve                              β”‚
β”‚                                                              β”‚
β”‚  LABEL DRIFT:                                                β”‚
β”‚  Target variable distribution changes                        β”‚
β”‚  P(Y) changes                                                β”‚
β”‚  Example: More fraud cases overall                           β”‚
β”‚                                                              β”‚
β”‚  PREDICTION DRIFT:                                           β”‚
β”‚  Model output distribution shifts                            β”‚
β”‚  Example: Model starts predicting one class too much         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
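
Data drift in a single feature can be quantified by comparing its empirical distribution in production against the training baseline. Libraries such as Evidently or Alibi-Detect (see 11.7) do this properly; as a sketch of the underlying idea, here is a toy two-sample Kolmogorov-Smirnov statistic in pure Python (the sample values are made up):

```python
# ks_drift.py β€” toy two-sample Kolmogorov-Smirnov statistic, stdlib only.
def ks_statistic(sample_a, sample_b):
    """Max gap between the two empirical CDFs; 0.0 = identical, 1.0 = disjoint."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] == x:   # step past ties in a
            i += 1
        while j < len(b) and b[j] == x:   # step past ties in b
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

baseline = [28, 29, 30, 27, 28, 31, 29]   # training-time ages
current  = [34, 36, 35, 33, 37, 35, 36]   # production window
print(ks_statistic(baseline, current))    # β†’ 1.0 (the samples don't overlap)
```

In practice you would use `scipy.stats.ks_2samp`, which also returns a p-value, and run the test per feature against a stored training snapshot.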

11.3 What to Monitor

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              MONITORING DIMENSIONS                       β”‚
β”‚                                                          β”‚
β”‚  1️⃣  MODEL PERFORMANCE METRICS                           β”‚
β”‚      β”œβ”€β”€ Accuracy / F1 / AUC (if labels available)       β”‚
β”‚      β”œβ”€β”€ Prediction confidence scores                    β”‚
β”‚      └── Prediction latency                              β”‚
β”‚                                                          β”‚
β”‚  2️⃣  DATA QUALITY                                         β”‚
β”‚      β”œβ”€β”€ Missing values in input                         β”‚
β”‚      β”œβ”€β”€ Feature value ranges (out-of-range inputs?)     β”‚
β”‚      └── Feature distribution vs training baseline       β”‚
β”‚                                                          β”‚
β”‚  3️⃣  INFRASTRUCTURE                                       β”‚
β”‚      β”œβ”€β”€ CPU / Memory usage                              β”‚
β”‚      β”œβ”€β”€ Requests per second                             β”‚
β”‚      β”œβ”€β”€ Error rates (HTTP 4xx, 5xx)                     β”‚
β”‚      └── Container health                                β”‚
β”‚                                                          β”‚
β”‚  4️⃣  BUSINESS METRICS                                     β”‚
β”‚      β”œβ”€β”€ Click-through rate (for recommendation models)  β”‚
β”‚      β”œβ”€β”€ Revenue per prediction                          β”‚
β”‚      └── Model-specific KPIs                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
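
Dimension 2️⃣ (data quality) is usually the cheapest to implement and often catches problems first. A minimal validation gate might look like this sketch (the field names and ranges are assumptions for illustration):

```python
# quality_gate.py β€” reject payloads with missing or out-of-range fields
# before they silently reach the model. Ranges here are assumed examples.
EXPECTED_RANGES = {"age": (18, 100), "income": (0, 1_000_000)}

def validate(payload: dict) -> list[str]:
    """Return a list of data-quality problems; an empty list means clean."""
    problems = []
    for field, (lo, hi) in EXPECTED_RANGES.items():
        value = payload.get(field)
        if value is None:
            problems.append(f"missing: {field}")
        elif not lo <= value <= hi:
            problems.append(f"out of range: {field}={value}")
    return problems

print(validate({"age": 150, "income": 52_000}))  # β†’ ['out of range: age=150']
```

Each problem found can also be counted in a Prometheus `Counter` so that a spike in bad inputs shows up on the same dashboard as latency and accuracy.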

11.4 Monitoring Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   MONITORING ARCHITECTURE                            β”‚
β”‚                                                                      β”‚
β”‚  ML Model API                                                        β”‚
β”‚  (FastAPI/Flask)                                                     β”‚
β”‚       β”‚                                                              β”‚
β”‚       β”‚ expose metrics on /metrics                                   β”‚
β”‚       β–Ό                                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                                    β”‚
β”‚  β”‚  Prometheus  β”‚ ← scrapes metrics every 15 seconds                 β”‚
β”‚  β”‚  (time-seriesβ”‚                                                    β”‚
β”‚  β”‚   database)  β”‚                                                    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                                                    β”‚
β”‚         β”‚                                                            β”‚
β”‚         β–Ό                                                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                  β”‚
β”‚  β”‚   Grafana    │──── β”‚    Alert Manager         β”‚                  β”‚
β”‚  β”‚  (dashboards)β”‚     β”‚  (sends to Slack/Email)  β”‚                  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  β”‚
β”‚                                                                      β”‚
β”‚  (For data drift:)                                                   β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                    β”‚
β”‚  β”‚  Evidently AI / Alibi-Detect β”‚ ← statistical drift detection      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
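
One way to run this stack locally is Docker Compose. The sketch below wires the four services together, reusing the `ml-api` and `alertmanager` hostnames that the Prometheus configuration in 11.6 expects (image tags, ports, and volume paths are assumptions):

```yaml
# docker-compose.yml β€” local monitoring stack (a sketch, not a production setup)
services:
  ml-api:
    build: .
    ports: ["8000:8000"]
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/alert_rules.yml:/etc/prometheus/alert_rules.yml
    ports: ["9090:9090"]
  alertmanager:
    image: prom/alertmanager:latest
    ports: ["9093:9093"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
```

Compose's default network resolves service names as hostnames, which is why the Prometheus targets can be written as `ml-api:8000` and `alertmanager:9093`.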

11.5 Instrumenting Your Model API

# src/serve.py β€” with Prometheus metrics
from fastapi import FastAPI
from prometheus_client import Counter, Histogram, generate_latest
from prometheus_client import CONTENT_TYPE_LATEST
import time
import pickle
import numpy as np
from starlette.responses import Response

app = FastAPI()

# Define metrics
PREDICTION_COUNT = Counter(
    'ml_predictions_total',
    'Total number of predictions',
    ['model_version', 'result_class']
)

PREDICTION_LATENCY = Histogram(
    'ml_prediction_latency_seconds',
    'Prediction request latency',
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
)

PREDICTION_CONFIDENCE = Histogram(
    'ml_prediction_confidence',
    'Confidence score of predictions',
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)

FEATURE_VALUES = Histogram(
    'ml_input_feature_age',
    'Distribution of age feature in requests',
    buckets=[18, 25, 35, 45, 55, 65, 75]
)

with open("models/model.pkl", "rb") as f:
    model = pickle.load(f)

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.post("/predict")
def predict(data: dict):
    start_time = time.time()

    features = np.array(data["features"]).reshape(1, -1)

    # Track input feature distribution
    FEATURE_VALUES.observe(data["features"][0])  # assumes features[0] is 'age'

    # Make prediction
    prediction = model.predict(features)[0]
    confidence = model.predict_proba(features).max()

    # Record metrics
    latency = time.time() - start_time
    PREDICTION_LATENCY.observe(latency)
    PREDICTION_COUNT.labels(model_version="v2", result_class=str(prediction)).inc()
    PREDICTION_CONFIDENCE.observe(confidence)

    return {"prediction": int(prediction), "confidence": float(confidence)}

11.6 Prometheus Configuration

# monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: "ml-model"
    static_configs:
      - targets: ["ml-api:8000"]
    metrics_path: /metrics

  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod

# monitoring/alert_rules.yml
groups:
  - name: ml-model-alerts
    rules:

      - alert: HighPredictionLatency
        expr: histogram_quantile(0.95, rate(ml_prediction_latency_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ML model p95 latency > 500ms"

      - alert: LowPredictionConfidence
        expr: histogram_quantile(0.5, rate(ml_prediction_confidence_bucket[5m])) < 0.6
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Model confidence dropping β€” possible drift!"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5%"

11.7 Data Drift Detection with Evidently

# drift_check.py β€” run periodically in CI/CT
import pandas as pd
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset

# Load reference (training) data and current production data
reference_data = pd.read_csv("data/train.csv")
current_data = pd.read_csv("data/production_logs_last_7days.csv")

# Define columns
column_mapping = ColumnMapping(
    target="label",
    prediction="predicted_label",
    numerical_features=["age", "income", "score"],
    categorical_features=["category", "region"],
)

# Create and run drift report
report = Report(metrics=[
    DataDriftPreset(),
    DataQualityPreset(),
])

report.run(
    reference_data=reference_data,
    current_data=current_data,
    column_mapping=column_mapping,
)

# Save HTML report
report.save_html("reports/drift_report.html")

# Check whether drift was detected (dataset_drift is a boolean flag)
results = report.as_dict()
drift_detected = results["metrics"][0]["result"]["dataset_drift"]

if drift_detected:
    print("⚠️  DATA DRIFT DETECTED β€” triggering retraining!")
    # Trigger Jenkins or GitHub Actions retraining job
    import subprocess
    subprocess.run(["curl", "-X", "POST", "http://jenkins:8080/job/retrain/build"])
else:
    print("βœ… No significant drift detected.")

11.8 Grafana Dashboard Setup

Key panels to include in your ML monitoring dashboard:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    GRAFANA ML DASHBOARD                          β”‚
β”‚                                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                 β”‚
β”‚  β”‚  Predictions/sec   β”‚  β”‚  p95 Latency (ms)  β”‚                 β”‚
β”‚  β”‚       β–‚β–„β–†β–„β–‚        β”‚  β”‚    ─────────────   β”‚                 β”‚
β”‚  β”‚       123 req/s    β”‚  β”‚      42ms          β”‚                 β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                 β”‚
β”‚  β”‚ Confidence Scores  β”‚  β”‚  Feature Drift     β”‚                 β”‚
β”‚  β”‚ Histogram          β”‚  β”‚  Score             β”‚                 β”‚
β”‚  β”‚ ▇▇▇▅▃▁▁            β”‚  β”‚  age: 0.12 βœ…      β”‚                 β”‚
β”‚  β”‚ Most > 0.8 βœ…       β”‚  β”‚  income: 0.43 ⚠️  β”‚                 β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                β”‚
β”‚  β”‚  Prediction Class Distribution Over Time   β”‚                β”‚
β”‚  β”‚  Class 0: β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 72%             β”‚                β”‚
β”‚  β”‚  Class 1: β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 28%                        β”‚                β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
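
Each panel is backed by a PromQL query. Assuming the metric names from the instrumentation in 11.5, the queries behind the panels might look like the following sketches:

```promql
# Predictions per second (top-left panel)
sum(rate(ml_predictions_total[5m]))

# p95 latency (histogram_quantile needs the _bucket series, aggregated by le)
histogram_quantile(0.95, sum(rate(ml_prediction_latency_seconds_bucket[5m])) by (le))

# Prediction class distribution over time (bottom panel)
sum(rate(ml_predictions_total[5m])) by (result_class)
  / ignoring(result_class) group_left sum(rate(ml_predictions_total[5m]))
```

Note the `_bucket` suffix: a Prometheus `Histogram` exposes `_bucket`, `_sum`, and `_count` series, and quantiles are only computable from the buckets.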

11.9 Retraining Triggers

TRIGGER TYPES:
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚                                                     β”‚
  β”‚  πŸ“… SCHEDULE       β†’ Retrain every Monday at 2am    β”‚
  β”‚                                                     β”‚
  β”‚  πŸ“Š PERFORMANCE    β†’ Retrain if accuracy < 0.80     β”‚
  β”‚                                                     β”‚
  β”‚  πŸ“‰ DRIFT DETECTED β†’ Retrain if drift score > 0.3   β”‚
  β”‚                                                     β”‚
  β”‚  πŸ“¦ NEW DATA       β†’ Retrain when 10k new rows land β”‚
  β”‚                                                     β”‚
  β”‚  🚨 INCIDENT       β†’ Manual emergency retrain       β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
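
These triggers can be combined into a single auditable policy function. A sketch using the thresholds from the table above (the schedule trigger is left to cron, and the metric inputs are assumed to come from your monitoring stack β€” Prometheus queries, drift reports, warehouse row counts):

```python
# retrain_policy.py β€” sketch of an auditable retraining decision.
def should_retrain(accuracy: float, drift_score: float, new_rows: int,
                   manual_override: bool = False) -> tuple[bool, list[str]]:
    """Return (decision, reasons) so every retrain is explainable, not just a bool."""
    reasons = []
    if manual_override:
        reasons.append("incident: manual emergency retrain")
    if accuracy < 0.80:
        reasons.append(f"performance: accuracy {accuracy:.2f} < 0.80")
    if drift_score > 0.3:
        reasons.append(f"drift: score {drift_score:.2f} > 0.3")
    if new_rows >= 10_000:
        reasons.append(f"new data: {new_rows} rows >= 10k")
    return bool(reasons), reasons

decision, why = should_retrain(accuracy=0.76, drift_score=0.12, new_rows=4_200)
print(decision, why)  # β†’ True ['performance: accuracy 0.76 < 0.80']
```

Logging the reasons alongside each retrain makes it easy to see, months later, which trigger fires most often β€” a useful signal for tuning the thresholds themselves.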

Next Chapter β†’ 12: End-to-End Project