Chapter 11: Model Monitoring in Production
"A deployed model is not the end – it's the beginning. Models degrade silently without monitoring."
11.1 Why Models Degrade
ML models in production face a fundamental challenge: the world changes, but the model doesn't (unless retrained).
MODEL AT TRAINING TIME:           MODEL 6 MONTHS LATER:
  User avg age:  28                 User avg age:  35       (shifted!)
  Avg income:    $45,000            Avg income:    $62,000  (shifted!)
  Accuracy:      0.91               Accuracy:      0.71     ← silent degradation!
This is called data drift or concept drift.
11.2 Types of Drift
┌────────────────────────────────────────────────────────────┐
│                       TYPES OF DRIFT                       │
│                                                            │
│  DATA DRIFT (Covariate Shift):                             │
│    Input feature distribution changes                      │
│    P(X) changes, P(Y|X) stays the same                     │
│    Example: new user demographics                          │
│                                                            │
│  CONCEPT DRIFT:                                            │
│    Relationship between features and label changes         │
│    P(Y|X) changes                                          │
│    Example: fraud patterns evolve                          │
│                                                            │
│  LABEL DRIFT:                                              │
│    Target variable distribution changes                    │
│    P(Y) changes                                            │
│    Example: more fraud cases overall                       │
│                                                            │
│  PREDICTION DRIFT:                                         │
│    Model output distribution shifts                        │
│    Example: model starts predicting one class too often    │
└────────────────────────────────────────────────────────────┘
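Data drift can be quantified with a single number. One common choice is the Population Stability Index (PSI), sketched below in dependency-free Python. The bin count and the usual rule of thumb (PSI < 0.1 stable, 0.1–0.25 moderate drift, > 0.25 severe drift) are conventions, not outputs of any specific library:

```python
import math

def psi(reference, current, n_bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def proportions(sample):
        counts = [0] * n_bins
        for x in sample:
            # Clamp into [0, n_bins - 1] so out-of-range production values
            # land in the edge bins instead of crashing
            idx = min(max(int((x - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # Smooth empty bins so log() is always defined
        return [max(c / len(sample), 1e-4) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))

# Identical samples score ~0; a large mean shift scores well above 0.25
print(psi(list(range(1000)), list(range(1000))))
print(psi(list(range(1000)), [x + 500 for x in range(1000)]))
```

Note that the bins are defined on the *reference* (training) sample, so the score measures how far production traffic has moved away from what the model saw during training.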
11.3 What to Monitor
┌────────────────────────────────────────────────────────────┐
│                   MONITORING DIMENSIONS                    │
│                                                            │
│  1. MODEL PERFORMANCE METRICS                              │
│     ├── Accuracy / F1 / AUC (if labels available)          │
│     ├── Prediction confidence scores                       │
│     └── Prediction latency                                 │
│                                                            │
│  2. DATA QUALITY                                           │
│     ├── Missing values in input                            │
│     ├── Feature value ranges (out-of-range inputs?)        │
│     └── Feature distribution vs training baseline          │
│                                                            │
│  3. INFRASTRUCTURE                                         │
│     ├── CPU / Memory usage                                 │
│     ├── Requests per second                                │
│     ├── Error rates (HTTP 4xx, 5xx)                        │
│     └── Container health                                   │
│                                                            │
│  4. BUSINESS METRICS                                       │
│     ├── Click-through rate (for recommendation models)     │
│     ├── Revenue per prediction                             │
│     └── Model-specific KPIs                                │
└────────────────────────────────────────────────────────────┘
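The "data quality" dimension is the cheapest to implement: validate every incoming request against ranges recorded at training time. The sketch below is illustrative only; the `BASELINE` values and the `check_request` helper are made-up names, not part of any library:

```python
# Expected ranges per feature, recorded once at training time.
# (Hypothetical values for illustration.)
BASELINE = {
    "age":    {"min": 18, "max": 95,      "required": True},
    "income": {"min": 0,  "max": 500_000, "required": True},
}

def check_request(features: dict) -> list[str]:
    """Return a list of data-quality warnings (empty list = clean input)."""
    warnings = []
    for name, rule in BASELINE.items():
        value = features.get(name)
        if value is None:
            if rule["required"]:
                warnings.append(f"{name}: missing")
            continue
        if not rule["min"] <= value <= rule["max"]:
            warnings.append(f"{name}: {value} outside [{rule['min']}, {rule['max']}]")
    return warnings

# An out-of-range age and a missing income are both flagged
print(check_request({"age": 140}))
```

In production these warnings would be incremented on a Prometheus counter rather than printed, so spikes in bad input become visible on a dashboard.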
11.4 Monitoring Architecture
┌──────────────────────────────────────────────────────────────────────┐
│                       MONITORING ARCHITECTURE                        │
│                                                                      │
│        ML Model API                                                  │
│        (FastAPI/Flask)                                               │
│             │                                                        │
│             │  exposes metrics on /metrics                           │
│             ▼                                                        │
│      ┌──────────────┐                                                │
│      │  Prometheus  │  ← scrapes metrics every 15 seconds            │
│      │ (time-series │                                                │
│      │  database)   │                                                │
│      └──────┬───────┘                                                │
│             │                                                        │
│             ▼                                                        │
│      ┌──────────────┐      ┌──────────────────────────┐              │
│      │   Grafana    │      │       Alertmanager       │              │
│      │ (dashboards) │      │ (sends to Slack/Email)   │              │
│      └──────────────┘      └──────────────────────────┘              │
│                                                                      │
│  (For data drift:)                                                   │
│      ┌──────────────────────────────┐                                │
│      │ Evidently AI / Alibi Detect  │  ← statistical drift detection │
│      └──────────────────────────────┘                                │
└──────────────────────────────────────────────────────────────────────┘
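One way to run this stack locally is Docker Compose. The fragment below is a minimal sketch: service names, ports, and image tags are conventional defaults, not requirements, and it assumes the two Prometheus config files from sections 11.6 live under `monitoring/`:

```yaml
# docker-compose.yml – illustrative wiring of the stack above
version: "3.8"
services:
  ml-api:
    build: .
    ports: ["8000:8000"]

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/alert_rules.yml:/etc/prometheus/alert_rules.yml
    ports: ["9090:9090"]

  alertmanager:
    image: prom/alertmanager:latest
    ports: ["9093:9093"]

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    depends_on: [prometheus]
```

The service names (`ml-api`, `alertmanager`) double as hostnames on the Compose network, which is why the Prometheus config in 11.6 can target `ml-api:8000` and `alertmanager:9093` directly.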
11.5 Instrumenting Your Model API
# src/serve.py – FastAPI model server with Prometheus metrics
import pickle
import time

import numpy as np
from fastapi import FastAPI
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest
from starlette.responses import Response

app = FastAPI()

# Define metrics
PREDICTION_COUNT = Counter(
    'ml_predictions_total',
    'Total number of predictions',
    ['model_version', 'result_class']
)
PREDICTION_LATENCY = Histogram(
    'ml_prediction_latency_seconds',
    'Prediction request latency',
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
)
PREDICTION_CONFIDENCE = Histogram(
    'ml_prediction_confidence',
    'Confidence score of predictions',
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)
FEATURE_VALUES = Histogram(
    'ml_input_feature_age',
    'Distribution of the age feature in requests',
    buckets=[18, 25, 35, 45, 55, 65, 75]
)

with open("models/model.pkl", "rb") as f:
    model = pickle.load(f)

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.post("/predict")
def predict(data: dict):
    start_time = time.time()
    features = np.array(data["features"]).reshape(1, -1)

    # Track input feature distribution
    FEATURE_VALUES.observe(data["features"][0])  # track the 'age' feature

    # Make prediction
    prediction = model.predict(features)[0]
    confidence = model.predict_proba(features).max()

    # Record metrics
    latency = time.time() - start_time
    PREDICTION_LATENCY.observe(latency)
    PREDICTION_COUNT.labels(model_version="v2", result_class=str(prediction)).inc()
    PREDICTION_CONFIDENCE.observe(confidence)

    return {"prediction": int(prediction), "confidence": float(confidence)}
11.6 Prometheus Configuration
# monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: "ml-model"
    static_configs:
      - targets: ["ml-api:8000"]
    metrics_path: /metrics

  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
# monitoring/alert_rules.yml
groups:
  - name: ml-model-alerts
    rules:
      - alert: HighPredictionLatency
        # histogram_quantile needs the per-bucket rate, not the raw histogram
        expr: histogram_quantile(0.95, rate(ml_prediction_latency_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ML model p95 latency > 500ms"

      - alert: LowPredictionConfidence
        expr: histogram_quantile(0.5, rate(ml_prediction_confidence_bucket[15m])) < 0.6
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Median model confidence dropping – possible drift"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5%"
11.7 Data Drift Detection with Evidently
# drift_check.py – run periodically in CI/CT
import subprocess

import pandas as pd
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset

# Load reference (training) data and current production data
reference_data = pd.read_csv("data/train.csv")
current_data = pd.read_csv("data/production_logs_last_7days.csv")

# Define columns
column_mapping = ColumnMapping(
    target="label",
    prediction="predicted_label",
    numerical_features=["age", "income", "score"],
    categorical_features=["category", "region"],
)

# Create and run the drift report
report = Report(metrics=[
    DataDriftPreset(),
    DataQualityPreset(),
])
report.run(
    reference_data=reference_data,
    current_data=current_data,
    column_mapping=column_mapping,
)

# Save HTML report
report.save_html("reports/drift_report.html")

# Check whether drift was detected (dataset_drift is a boolean flag)
results = report.as_dict()
drift_detected = results["metrics"][0]["result"]["dataset_drift"]

if drift_detected:
    print("⚠️ DATA DRIFT DETECTED – triggering retraining!")
    # Trigger the Jenkins (or GitHub Actions) retraining job
    subprocess.run(["curl", "-X", "POST", "http://jenkins:8080/job/retrain/build"])
else:
    print("✅ No significant drift detected.")
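When pulling in Evidently is too heavy for a quick per-column check, the core idea (comparing reference and current distributions) can be sketched with a plain two-sample Kolmogorov–Smirnov statistic. This is a dependency-free illustration, not Evidently's actual test; it also omits the p-value that SciPy or Evidently would give you:

```python
def ks_statistic(reference, current):
    """Two-sample KS statistic: max distance between the two empirical CDFs.

    0.0 means the samples look identical; values near 1.0 mean they barely
    overlap. On large samples, anything above ~0.1 is worth investigating.
    """
    ref = sorted(reference)
    cur = sorted(current)
    i = j = 0
    d = 0.0
    while i < len(ref) and j < len(cur):
        x = min(ref[i], cur[j])
        # Advance both counters past every copy of x before comparing CDFs,
        # so ties don't inflate the distance
        while i < len(ref) and ref[i] == x:
            i += 1
        while j < len(cur) and cur[j] == x:
            j += 1
        d = max(d, abs(i / len(ref) - j / len(cur)))
    return d

# Identical samples score 0.0; fully disjoint samples score 1.0
print(ks_statistic([1, 2, 3], [1, 2, 3]))
print(ks_statistic([1, 2], [10, 20]))
```

Running this per numeric column over a sliding window of production logs gives a cheap drift signal that can feed the same retraining trigger as the Evidently report.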
11.8 Grafana Dashboard Setup
Key panels to include in your ML monitoring dashboard:
┌────────────────────────────────────────────────────────────────────┐
│                        GRAFANA ML DASHBOARD                        │
│                                                                    │
│   ┌────────────────────┐     ┌────────────────────┐                │
│   │  Predictions/sec   │     │  p95 Latency (ms)  │                │
│   │     123 req/s      │     │        42ms        │                │
│   └────────────────────┘     └────────────────────┘                │
│   ┌────────────────────┐     ┌────────────────────┐                │
│   │ Confidence Scores  │     │   Feature Drift    │                │
│   │     histogram      │     │       score        │                │
│   │    most > 0.8      │     │   age:    0.12     │                │
│   │                    │     │   income: 0.43 ⚠   │                │
│   └────────────────────┘     └────────────────────┘                │
│   ┌──────────────────────────────────────────────┐                 │
│   │  Prediction Class Distribution Over Time     │                 │
│   │  Class 0: ████████████████ 72%               │                 │
│   │  Class 1: █████ 28%                          │                 │
│   └──────────────────────────────────────────────┘                 │
└────────────────────────────────────────────────────────────────────┘
11.9 Retraining Triggers
TRIGGER TYPES:
┌──────────────────────────────────────────────────────┐
│                                                      │
│  SCHEDULE          → Retrain every Monday at 2am     │
│                                                      │
│  PERFORMANCE       → Retrain if accuracy < 0.80      │
│                                                      │
│  DRIFT DETECTED    → Retrain if drift score > 0.3    │
│                                                      │
│  NEW DATA          → Retrain when 10k new rows land  │
│                                                      │
│  INCIDENT          → Manual emergency retrain        │
└──────────────────────────────────────────────────────┘
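In practice these triggers are combined into one periodic check that reports *why* a retrain fired. The sketch below mirrors the thresholds in the table; the function name and signature are illustrative, and the inputs (current accuracy, drift score, new-row count) would come from your monitoring stack:

```python
from datetime import datetime

def should_retrain(now: datetime, accuracy: float, drift_score: float,
                   new_rows: int, incident: bool = False) -> list[str]:
    """Return the list of fired retraining triggers (empty = no retrain)."""
    reasons = []
    if now.weekday() == 0 and now.hour == 2:   # Monday at 2am
        reasons.append("schedule")
    if accuracy < 0.80:                        # performance trigger
        reasons.append("performance")
    if drift_score > 0.3:                      # drift trigger
        reasons.append("drift")
    if new_rows >= 10_000:                     # new-data trigger
        reasons.append("new_data")
    if incident:                               # manual emergency trigger
        reasons.append("incident")
    return reasons

# A healthy model on a Tuesday afternoon fires nothing
print(should_retrain(datetime(2024, 1, 2, 12), accuracy=0.91,
                     drift_score=0.1, new_rows=500))
```

Logging the returned reasons alongside each retraining run makes it easy to audit later whether retrains were mostly scheduled or mostly drift-driven.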
Next Chapter → 12: End-to-End Project