Chapter 11: CI/CD for Machine Learning
"CI/CD is the heartbeat of MLOps — automating what would otherwise be manual, error-prone, and slow."
11.1 CI/CD in ML Context
TRADITIONAL CI/CD:   Code → Test → Build → Deploy
ML CI/CD:            Code + Data + Model → Test + Validate + Train
                     → Evaluate → Package → Deploy → Monitor
The ML CI/CD pipeline has more stages and more failure modes than standard software CI/CD.
11.2 The Complete ML CI/CD Pipeline
┌─────────────────────────────────────────────────────────────────────────┐
│ ML CI/CD PIPELINE │
│ │
│ TRIGGER: git push / new data / schedule / drift alert │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ CONTINUOUS INTEGRATION (CI) │ │
│ │ ① Lint code (flake8, black) ② Unit tests (pytest) │ │
│ │ ③ Data schema validation ④ Feature validation │ │
│ │ ⑤ Security scan (Snyk/Trivy) │ │
│ └──────────────────────────────────────┬──────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ CONTINUOUS TRAINING (CT) │ │
│ │ ⑥ Pull data (dvc pull) ⑦ Train model │ │
│ │ ⑧ Log to MLflow/W&B ⑨ Evaluate metrics │ │
│ │ ⑩ Quality gate (accuracy ≥ threshold?) │ │
│ │ ⑪ Fairness check ⑫ Register model │ │
│ └──────────────────────────────────────┬──────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ CONTINUOUS DELIVERY (CD) │ │
│ │ ⑬ Build Docker image ⑭ Scan image (Trivy) │ │
│ │ ⑮ Push to registry (GAR) ⑯ Deploy to staging (K8s) │ │
│ │ ⑰ Integration tests ⑱ Load test │ │
│ │ ⑲ Manual approval gate │ │
│ │ ⑳ Deploy to production (canary → full) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ CONTINUOUS MONITORING (CM) │
│ ㉑ Prometheus metrics → Grafana → Alerts → Retrain trigger │
└─────────────────────────────────────────────────────────────────────────┘
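The "drift alert" trigger in the diagram can be wired up by having the monitoring layer fire a GitHub `repository_dispatch` event, which a workflow listens for via `on: repository_dispatch`. A minimal stdlib sketch; the repo name, event type, and payload shape are placeholders, not part of the chapter's pipeline:

```python
# Sketch: fire a repository_dispatch event to kick off retraining when drift
# is detected. The event_type "drift-alert" and repo name are assumptions.
import json
import urllib.request

def build_retrain_dispatch(repo: str, token: str, drift_score: float) -> urllib.request.Request:
    """Build the GitHub API request that triggers a retraining workflow run."""
    body = json.dumps({
        "event_type": "drift-alert",                     # matched by the workflow's `types:` filter
        "client_payload": {"drift_score": drift_score},  # exposed as github.event.client_payload
    }).encode()
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{repo}/dispatches",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )

# A monitoring alert handler would then call:
# urllib.request.urlopen(build_retrain_dispatch("org/ml-repo", token, drift_score))
```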
11.3 Testing Strategy for ML (The Testing Pyramid)
                ▲   Rare, expensive
               / \
              /   \
             / E2E \
            / Tests \              ← Full pipeline end-to-end
           /─────────\
          /Integration\
         /    Tests    \           ← API responses, model loading
        /───────────────\
       /   Model Tests   \
      / (accuracy, bias,  \        ← Model quality gates
     / fairness, latency)  \
    /───────────────────────\
   /       Unit Tests        \
  / (features, preprocessing, \    ← Fast, numerous
 /   utilities, transforms)    \
▼───────────────────────────────▼
        Base (lots of tests)
Unit Tests
# tests/unit/test_preprocess.py
import pytest
import pandas as pd
import numpy as np
from src.preprocess import encode_plan, handle_missing, create_features

def test_encode_plan_known_values():
    plans = pd.Series(["basic", "standard", "premium"])
    encoded = encode_plan(plans)
    assert list(encoded) == [0, 1, 2]

def test_encode_plan_unknown_raises():
    with pytest.raises(ValueError, match="Unknown plan"):
        encode_plan(pd.Series(["enterprise"]))

def test_handle_missing_fills_median():
    df = pd.DataFrame({"age": [25, np.nan, 35], "income": [50000, 60000, np.nan]})
    result = handle_missing(df)
    assert result.isnull().sum().sum() == 0

def test_create_features_output_shape():
    df = pd.DataFrame({"age": [25], "income": [50000], "tenure": [12]})
    features = create_features(df)
    assert "income_per_year_tenure" in features.columns
    assert features.shape[0] == 1
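The tests above assume an interface like the following for `src/preprocess`. This is a hypothetical sketch that satisfies them; the plan codes and the `income_per_year_tenure` formula are inferred from the tests, not the chapter's actual implementation:

```python
# src/preprocess.py (hypothetical sketch matching the unit tests above)
import numpy as np
import pandas as pd

PLAN_CODES = {"basic": 0, "standard": 1, "premium": 2}  # assumed ordering

def encode_plan(plans: pd.Series) -> pd.Series:
    """Map plan names to ordinal codes; raise on anything unexpected."""
    unknown = set(plans) - set(PLAN_CODES)
    if unknown:
        raise ValueError(f"Unknown plan(s): {sorted(unknown)}")
    return plans.map(PLAN_CODES)

def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing numeric values with each column's median."""
    return df.fillna(df.median(numeric_only=True))

def create_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive features; the ratio below is one plausible definition."""
    out = df.copy()
    years_tenure = (out["tenure"] / 12).clip(lower=1)  # avoid division blow-up
    out["income_per_year_tenure"] = out["income"] / years_tenure
    return out
```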
Model Quality Tests
# tests/model/test_model_quality.py
import time
import pickle

import pytest
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

@pytest.fixture
def model():
    with open("models/model.pkl", "rb") as f:
        return pickle.load(f)

@pytest.fixture
def test_data():
    df = pd.read_csv("data/test.csv")
    X = df.drop("churned", axis=1)
    y = df["churned"]
    return X, y

def test_model_accuracy_above_threshold(model, test_data):
    X, y = test_data
    acc = accuracy_score(y, model.predict(X))
    assert acc >= 0.85, f"Accuracy {acc:.3f} below threshold 0.85"

def test_model_f1_above_threshold(model, test_data):
    X, y = test_data
    f1 = f1_score(y, model.predict(X))
    assert f1 >= 0.80, f"F1 {f1:.3f} below threshold 0.80"

def test_model_no_all_one_predictions(model, test_data):
    X, _ = test_data
    preds = model.predict(X)
    assert preds.mean() < 0.95, "Model predicts churn for >95% — possible issue"
    assert preds.mean() > 0.05, "Model predicts churn for <5% — possible issue"

def test_model_inference_speed(model, test_data):
    X, _ = test_data
    sample = X.iloc[:100]
    start = time.time()
    model.predict(sample)
    elapsed_ms = (time.time() - start) * 1000
    assert elapsed_ms < 500, f"100-sample inference took {elapsed_ms:.0f}ms (>500ms threshold)"
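The fixtures above expect `models/model.pkl` and `data/test.csv` to already exist. A minimal helper that could produce both artifacts is sketched below, assuming a scikit-learn classifier and the `churned` label column used by the fixtures; the model type and paths are placeholders:

```python
# Hypothetical helper producing the artifacts the quality tests load.
import pickle
from pathlib import Path

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def build_test_artifacts(df: pd.DataFrame, model_dir: str = "models", data_dir: str = "data"):
    """Train a baseline classifier and write model.pkl plus a held-out test.csv."""
    X = df.drop("churned", axis=1)
    y = df["churned"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    Path(model_dir).mkdir(parents=True, exist_ok=True)
    Path(data_dir).mkdir(parents=True, exist_ok=True)
    with open(f"{model_dir}/model.pkl", "wb") as f:
        pickle.dump(model, f)
    # The held-out split becomes the fixed test set the fixtures read
    pd.concat([X_test, y_test], axis=1).to_csv(f"{data_dir}/test.csv", index=False)
    return model
```

Pinning the test split to a file (rather than re-splitting on every run) keeps the quality gate deterministic between pipeline runs.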
Integration Tests
# tests/integration/test_api.py
import requests
import pytest

BASE_URL = "http://localhost:8000"

def test_health_endpoint():
    r = requests.get(f"{BASE_URL}/health")
    assert r.status_code == 200
    assert r.json()["status"] == "healthy"

def test_prediction_endpoint_valid_input():
    payload = {
        "customer_id": "C001",
        "age": 35,
        "income": 65000,
        "tenure_months": 12,
        "monthly_charges": 75.5,
        "plan": "standard"
    }
    r = requests.post(f"{BASE_URL}/predict", json=payload)
    assert r.status_code == 200
    body = r.json()
    assert "prediction" in body
    assert "confidence" in body
    assert 0.0 <= body["confidence"] <= 1.0
    assert body["prediction"] in [0, 1]
def test_prediction_endpoint_invalid_age():
    payload = {
        "customer_id": "C001",
        "age": -5,  # invalid: negative age
        "income": 65000,
        "tenure_months": 12,
        "monthly_charges": 75.5,
        "plan": "standard"
    }
    r = requests.post(f"{BASE_URL}/predict", json=payload)
    assert r.status_code == 422  # validation error
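The 422 response tested above comes from request validation in the serving app (in FastAPI this falls out of a Pydantic model with constrained fields). A framework-agnostic sketch of equivalent checks; the field names and bounds are assumptions taken from the test payloads:

```python
# Sketch of the server-side validation the integration tests exercise.
def validate_payload(payload: dict):
    """Return (status_code, errors): 200 if valid, 422 with reasons otherwise."""
    errors = []
    required = {"customer_id", "age", "income", "tenure_months",
                "monthly_charges", "plan"}
    missing = required - payload.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    age = payload.get("age")
    if not isinstance(age, (int, float)) or age < 0:
        errors.append("age must be a non-negative number")
    if payload.get("plan") not in {"basic", "standard", "premium"}:
        errors.append("plan must be one of: basic, standard, premium")
    return (422, errors) if errors else (200, [])
```

In a real service this logic lives in the framework's request-model layer, so the endpoint code never sees an invalid payload.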
11.4 Quality Gates
Quality gates are checks in the pipeline that stop deployment if criteria aren't met.
# src/quality_gate.py
import json
import sys

def check_quality_gate(metrics_path: str, thresholds: dict) -> bool:
    with open(metrics_path) as f:
        metrics = json.load(f)
    failed = []
    for metric, threshold in thresholds.items():
        value = metrics.get(metric, 0)
        if value < threshold:
            failed.append(f"{metric}: {value:.4f} < {threshold}")
    if failed:
        print("❌ Quality Gate FAILED:")
        for reason in failed:
            print(f"  - {reason}")
        return False
    print("✅ Quality Gate PASSED")
    return True

if __name__ == "__main__":
    passed = check_quality_gate(
        metrics_path="metrics/results.json",
        thresholds={
            "accuracy": 0.85,
            "f1_score": 0.80,
            # The gate only checks value >= threshold, so the evaluation step
            # writes the fairness gap negated: -gap >= -0.10 ⇔ gap <= 0.10
            "fairness_gap": -0.10,
        }
    )
    sys.exit(0 if passed else 1)
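Because the gate only enforces lower bounds, metrics where smaller is better (like the fairness gap) must be written negated by the evaluation step so that "bigger is better" holds for every key. A sketch of that convention; the `write_gate_metrics` helper is hypothetical:

```python
# Sketch: how an evaluation step could write metrics/results.json so the
# lower-bound gate behaves as intended for a smaller-is-better metric.
import json

def write_gate_metrics(path: str, accuracy: float, f1: float, fairness_gap: float) -> dict:
    """Write the metrics file consumed by check_quality_gate."""
    metrics = {
        "accuracy": accuracy,
        "f1_score": f1,
        # Store the negated absolute gap: -gap >= -0.10 is equivalent to gap <= 0.10
        "fairness_gap": -abs(fairness_gap),
    }
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)
    return metrics
```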
11.5 GitHub Actions — Complete ML Workflow
# .github/workflows/ml-pipeline.yml
name: ML CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  schedule:
    - cron: "0 2 * * 1"  # Weekly retraining, Mondays 02:00 UTC

env:
  GCP_PROJECT: ${{ secrets.GCP_PROJECT }}
  GAR_REPO: us-central1-docker.pkg.dev/${{ secrets.GCP_PROJECT }}/ml-repo

jobs:
  ci:
    name: CI — Lint, Test, Validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with: {python-version: "3.10"}
      - run: pip install -r requirements.txt
      - run: flake8 src/ --max-line-length=100
      - run: black --check src/
      - run: pytest tests/unit/ -v --tb=short

  train-and-evaluate:
    name: CT — Train, Evaluate, Gate
    needs: ci
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with: {python-version: "3.10"}
      - run: pip install -r requirements.txt
      - name: Authenticate GCP
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - run: dvc pull
      - run: python src/validate_data.py
      - run: python src/train.py
      - run: python src/quality_gate.py
      - name: Upload metrics
        uses: actions/upload-artifact@v4
        with:
          name: model-metrics
          path: metrics/

  build-and-push:
    name: CD — Build & Push Docker
    needs: train-and-evaluate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - run: gcloud auth configure-docker us-central1-docker.pkg.dev
      - run: |
          docker build -t $GAR_REPO/churn-model:${{ github.sha }} .
          docker push $GAR_REPO/churn-model:${{ github.sha }}

  deploy:
    name: CD — Deploy to Production
    needs: build-and-push
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production  # requires manual approval in GitHub
    steps:
      - uses: google-github-actions/auth@v2  # get-gke-credentials requires prior auth
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - uses: google-github-actions/get-gke-credentials@v2
        with:
          cluster_name: ml-cluster
          location: us-central1
      - run: |
          kubectl set image deployment/churn-model \
            churn-model=$GAR_REPO/churn-model:${{ github.sha }} \
            -n production
          kubectl rollout status deployment/churn-model -n production
Next → Chapter 12: Jenkins