Chapter 11: CI/CD for Machine Learning
"CI/CD is the heartbeat of MLOps — automating what would otherwise be manual, error-prone, and slow."
11.1 CI/CD in ML Context
TRADITIONAL CI/CD:   Code → Test → Build → Deploy
ML CI/CD:            Code + Data + Model → Test + Validate + Train
                     → Evaluate → Package → Deploy → Monitor
The ML CI/CD pipeline has more stages and more failure modes than standard software CI/CD.
11.2 The Complete ML CI/CD Pipeline
┌─────────────────────────────────────────────────────────────────────────┐
│ ML CI/CD PIPELINE │
│ │
│ TRIGGER: git push / new data / schedule / drift alert │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ CONTINUOUS INTEGRATION (CI) │ │
│ │ ① Lint code (flake8, black) ② Unit tests (pytest) │ │
│ │ ③ Data schema validation ④ Feature validation │ │
│ │ ⑤ Security scan (Snyk/Trivy) │ │
│ └──────────────────────────────────────┬──────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ CONTINUOUS TRAINING (CT) │ │
│ │ ⑥ Pull data (dvc pull) ⑦ Train model │ │
│ │ ⑧ Log to MLflow/W&B ⑨ Evaluate metrics │ │
│ │ ⑩ Quality gate (accuracy ≥ threshold?) │ │
│ │ ⑪ Fairness check ⑫ Register model │ │
│ └──────────────────────────────────────┬──────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ CONTINUOUS DELIVERY (CD) │ │
│ │ ⑬ Build Docker image ⑭ Scan image (Trivy) │ │
│ │ ⑮ Push to registry (GAR) ⑯ Deploy to staging (K8s) │ │
│ │ ⑰ Integration tests ⑱ Load test │ │
│ │ ⑲ Manual approval gate │ │
│ │ ⑳ Deploy to production (canary → full) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ CONTINUOUS MONITORING (CM) │
│ ㉑ Prometheus metrics → Grafana → Alerts → Retrain trigger │
└─────────────────────────────────────────────────────────────────────────┘
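The "drift alert" trigger in the diagram can be wired up by having the monitoring layer fire a GitHub `repository_dispatch` event, which a workflow listens for via `on: repository_dispatch`. A minimal stdlib sketch; the repo name, event type, and payload shape are placeholders, not part of the chapter's pipeline:

```python
# Sketch: fire a repository_dispatch event to kick off retraining when drift
# is detected. The event_type "drift-alert" and repo name are assumptions.
import json
import urllib.request

def build_retrain_dispatch(repo: str, token: str, drift_score: float) -> urllib.request.Request:
    """Build the GitHub API request that triggers a retraining workflow run."""
    body = json.dumps({
        "event_type": "drift-alert",                     # matched by the workflow's `types:` filter
        "client_payload": {"drift_score": drift_score},  # exposed as github.event.client_payload
    }).encode()
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{repo}/dispatches",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )

# A monitoring alert handler would then call:
# urllib.request.urlopen(build_retrain_dispatch("org/ml-repo", token, drift_score))
```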
11.3 Testing Strategy for ML (The Testing Pyramid)
                ▲   Rare, expensive
               / \
              /   \
             / E2E \
            / Tests \              ← Full pipeline end-to-end
           /─────────\
          /Integration\
         /    Tests    \           ← API responses, model loading
        /───────────────\
       /   Model Tests   \
      / (accuracy, bias,  \        ← Model quality gates
     / fairness, latency)  \
    /───────────────────────\
   /       Unit Tests        \
  / (features, preprocessing, \    ← Fast, numerous
 /   utilities, transforms)    \
▼───────────────────────────────▼
        Base (lots of tests)
Unit Tests
# tests/unit/test_preprocess.py
import pytest
import pandas as pd
import numpy as np
from src.preprocess import encode_plan, handle_missing, create_features

def test_encode_plan_known_values():
    plans = pd.Series(["basic", "standard", "premium"])
    encoded = encode_plan(plans)
    assert list(encoded) == [0, 1, 2]

def test_encode_plan_unknown_raises():
    with pytest.raises(ValueError, match="Unknown plan"):
        encode_plan(pd.Series(["enterprise"]))

def test_handle_missing_fills_median():
    df = pd.DataFrame({"age": [25, np.nan, 35], "income": [50000, 60000, np.nan]})
    result = handle_missing(df)
    assert result.isnull().sum().sum() == 0

def test_create_features_output_shape():
    df = pd.DataFrame({"age": [25], "income": [50000], "tenure": [12]})
    features = create_features(df)
    assert "income_per_year_tenure" in features.columns
    assert features.shape[0] == 1
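The tests above assume an interface like the following for `src/preprocess`. This is a hypothetical sketch that satisfies them; the plan codes and the `income_per_year_tenure` formula are inferred from the tests, not the chapter's actual implementation:

```python
# src/preprocess.py (hypothetical sketch matching the unit tests above)
import numpy as np
import pandas as pd

PLAN_CODES = {"basic": 0, "standard": 1, "premium": 2}  # assumed ordering

def encode_plan(plans: pd.Series) -> pd.Series:
    """Map plan names to ordinal codes; raise on anything unexpected."""
    unknown = set(plans) - set(PLAN_CODES)
    if unknown:
        raise ValueError(f"Unknown plan(s): {sorted(unknown)}")
    return plans.map(PLAN_CODES)

def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing numeric values with each column's median."""
    return df.fillna(df.median(numeric_only=True))

def create_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive features; the ratio below is one plausible definition."""
    out = df.copy()
    years_tenure = (out["tenure"] / 12).clip(lower=1)  # avoid division blow-up
    out["income_per_year_tenure"] = out["income"] / years_tenure
    return out
```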
Model Quality Tests
# tests/model/test_model_quality.py
import time
import pickle

import pytest
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

@pytest.fixture
def model():
    with open("models/model.pkl", "rb") as f:
        return pickle.load(f)

@pytest.fixture
def test_data():
    df = pd.read_csv("data/test.csv")
    X = df.drop("churned", axis=1)
    y = df["churned"]
    return X, y

def test_model_accuracy_above_threshold(model, test_data):
    X, y = test_data
    acc = accuracy_score(y, model.predict(X))
    assert acc >= 0.85, f"Accuracy {acc:.3f} below threshold 0.85"

def test_model_f1_above_threshold(model, test_data):
    X, y = test_data
    f1 = f1_score(y, model.predict(X))
    assert f1 >= 0.80, f"F1 {f1:.3f} below threshold 0.80"

def test_model_no_all_one_predictions(model, test_data):
    X, _ = test_data
    preds = model.predict(X)
    assert preds.mean() < 0.95, "Model predicts churn for >95% — possible issue"
    assert preds.mean() > 0.05, "Model predicts churn for <5% — possible issue"

def test_model_inference_speed(model, test_data):
    X, _ = test_data
    sample = X.iloc[:100]
    start = time.time()
    model.predict(sample)
    elapsed_ms = (time.time() - start) * 1000
    assert elapsed_ms < 500, f"100-sample inference took {elapsed_ms:.0f}ms (>500ms threshold)"
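The fixtures above expect `models/model.pkl` and `data/test.csv` to already exist. A minimal helper that could produce both artifacts is sketched below, assuming a scikit-learn classifier and the `churned` label column used by the fixtures; the model type and paths are placeholders:

```python
# Hypothetical helper producing the artifacts the quality tests load.
import pickle
from pathlib import Path

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def build_test_artifacts(df: pd.DataFrame, model_dir: str = "models", data_dir: str = "data"):
    """Train a baseline classifier and write model.pkl plus a held-out test.csv."""
    X = df.drop("churned", axis=1)
    y = df["churned"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    Path(model_dir).mkdir(parents=True, exist_ok=True)
    Path(data_dir).mkdir(parents=True, exist_ok=True)
    with open(f"{model_dir}/model.pkl", "wb") as f:
        pickle.dump(model, f)
    # The held-out split becomes the fixed test set the fixtures read
    pd.concat([X_test, y_test], axis=1).to_csv(f"{data_dir}/test.csv", index=False)
    return model
```

Pinning the test split to a file (rather than re-splitting on every run) keeps the quality gate deterministic between pipeline runs.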
Integration Tests
# tests/integration/test_api.py
import requests
import pytest

BASE_URL = "http://localhost:8000"

def test_health_endpoint():
    r = requests.get(f"{BASE_URL}/health")
    assert r.status_code == 200
    assert r.json()["status"] == "healthy"

def test_prediction_endpoint_valid_input():
    payload = {
        "customer_id": "C001",
        "age": 35,
        "income": 65000,
        "tenure_months": 12,
        "monthly_charges": 75.5,
        "plan": "standard"
    }
    r = requests.post(f"{BASE_URL}/predict", json=payload)
    assert r.status_code == 200
    body = r.json()
    assert "prediction" in body
    assert "confidence" in body
    assert 0.0 <= body["confidence"] <= 1.0
    assert body["prediction"] in [0, 1]
def test_prediction_endpoint_invalid_age():
    payload = {
        "customer_id": "C001",
        "age": -5,  # invalid: negative age
        "income": 65000,
        "tenure_months": 12,
        "monthly_charges": 75.5,
        "plan": "standard"
    }
    r = requests.post(f"{BASE_URL}/predict", json=payload)
    assert r.status_code == 422  # validation error
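The 422 response tested above comes from request validation in the serving app (in FastAPI this falls out of a Pydantic model with constrained fields). A framework-agnostic sketch of equivalent checks; the field names and bounds are assumptions taken from the test payloads:

```python
# Sketch of the server-side validation the integration tests exercise.
def validate_payload(payload: dict):
    """Return (status_code, errors): 200 if valid, 422 with reasons otherwise."""
    errors = []
    required = {"customer_id", "age", "income", "tenure_months",
                "monthly_charges", "plan"}
    missing = required - payload.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    age = payload.get("age")
    if not isinstance(age, (int, float)) or age < 0:
        errors.append("age must be a non-negative number")
    if payload.get("plan") not in {"basic", "standard", "premium"}:
        errors.append("plan must be one of: basic, standard, premium")
    return (422, errors) if errors else (200, [])
```

In a real service this logic lives in the framework's request-model layer, so the endpoint code never sees an invalid payload.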
11.4 Quality Gates
Quality gates are checks in the pipeline that stop deployment if criteria aren't met.
# src/quality_gate.py
import json
import sys

def check_quality_gate(metrics_path: str, thresholds: dict) -> bool:
    with open(metrics_path) as f:
        metrics = json.load(f)
    failed = []
    for metric, threshold in thresholds.items():
        value = metrics.get(metric, 0)
        if value < threshold:
            failed.append(f"{metric}: {value:.4f} < {threshold}")
    if failed:
        print("❌ Quality Gate FAILED:")
        for reason in failed:
            print(f"  - {reason}")
        return False
    print("✅ Quality Gate PASSED")
    return True

if __name__ == "__main__":
    passed = check_quality_gate(
        metrics_path="metrics/results.json",
        thresholds={
            "accuracy": 0.85,
            "f1_score": 0.80,
            # The gate only checks value >= threshold, so the evaluation step
            # writes the fairness gap negated: -gap >= -0.10 ⇔ gap <= 0.10
            "fairness_gap": -0.10,
        }
    )
    sys.exit(0 if passed else 1)
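Because the gate only enforces lower bounds, metrics where smaller is better (like the fairness gap) must be written negated by the evaluation step so that "bigger is better" holds for every key. A sketch of that convention; the `write_gate_metrics` helper is hypothetical:

```python
# Sketch: how an evaluation step could write metrics/results.json so the
# lower-bound gate behaves as intended for a smaller-is-better metric.
import json

def write_gate_metrics(path: str, accuracy: float, f1: float, fairness_gap: float) -> dict:
    """Write the metrics file consumed by check_quality_gate."""
    metrics = {
        "accuracy": accuracy,
        "f1_score": f1,
        # Store the negated absolute gap: -gap >= -0.10 is equivalent to gap <= 0.10
        "fairness_gap": -abs(fairness_gap),
    }
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)
    return metrics
```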
11.5 GitHub Actions — Complete ML Workflow
# .github/workflows/ml-pipeline.yml
name: ML CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  schedule:
    - cron: "0 2 * * 1"  # Weekly retraining, Mondays 02:00 UTC

env:
  GCP_PROJECT: ${{ secrets.GCP_PROJECT }}
  GAR_REPO: us-central1-docker.pkg.dev/${{ secrets.GCP_PROJECT }}/ml-repo

jobs:
  ci:
    name: CI — Lint, Test, Validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with: {python-version: "3.10"}
      - run: pip install -r requirements.txt
      - run: flake8 src/ --max-line-length=100
      - run: black --check src/
      - run: pytest tests/unit/ -v --tb=short

  train-and-evaluate:
    name: CT — Train, Evaluate, Gate
    needs: ci
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with: {python-version: "3.10"}
      - run: pip install -r requirements.txt
      - name: Authenticate GCP
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - run: dvc pull
      - run: python src/validate_data.py
      - run: python src/train.py
      - run: python src/quality_gate.py
      - name: Upload metrics
        uses: actions/upload-artifact@v4
        with:
          name: model-metrics
          path: metrics/

  build-and-push:
    name: CD — Build & Push Docker
    needs: train-and-evaluate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - run: gcloud auth configure-docker us-central1-docker.pkg.dev
      - run: |
          docker build -t $GAR_REPO/churn-model:${{ github.sha }} .
          docker push $GAR_REPO/churn-model:${{ github.sha }}

  deploy:
    name: CD — Deploy to Production
    needs: build-and-push
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production  # requires manual approval in GitHub
    steps:
      - uses: google-github-actions/auth@v2  # get-gke-credentials requires prior auth
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - uses: google-github-actions/get-gke-credentials@v2
        with:
          cluster_name: ml-cluster
          location: us-central1
      - run: |
          kubectl set image deployment/churn-model \
            churn-model=$GAR_REPO/churn-model:${{ github.sha }} \
            -n production
          kubectl rollout status deployment/churn-model -n production
Next → Chapter 12: Jenkins