
Chapter 10: Experiment Tracking with MLflow & DVC🔗

"You can't improve what you don't measure. Experiment tracking is your ML lab notebook."


10.1 The Problem Without Tracking🔗

WITHOUT experiment tracking:
  - Tried XGBoost with lr=0.1 last Tuesday... what accuracy did I get?
  - Which model did we deploy last month?
  - My colleague's model was better — what were their settings?
  - We need to reproduce model v3... where's the data?

WITH MLflow:
  ✅ Every run logged: params, metrics, artifacts
  ✅ Compare runs visually in the UI
  ✅ Register and version models
  ✅ Reproduce any experiment anytime

10.2 What is MLflow?🔗

MLflow is an open-source platform for managing the ML lifecycle. It has 4 core components:

┌───────────────────────────────────────────────────────────┐
│                    MLFLOW COMPONENTS                      │
│                                                           │
│  ┌──────────────────┐   ┌──────────────────────────────┐ │
│  │  MLflow Tracking │   │  MLflow Projects             │ │
│  │                  │   │                              │ │
│  │  Log experiments:│   │  Reproducible runs:          │ │
│  │  - Parameters    │   │  - MLproject file            │ │
│  │  - Metrics       │   │  - Conda/Docker env          │ │
│  │  - Artifacts     │   │  - Run with: mlflow run .    │ │
│  └──────────────────┘   └──────────────────────────────┘ │
│  ┌──────────────────┐   ┌──────────────────────────────┐ │
│  │  MLflow Models   │   │  MLflow Model Registry       │ │
│  │                  │   │                              │ │
│  │  Standard format │   │  Lifecycle stages:           │ │
│  │  for packaging:  │   │  None → Staging → Production │ │
│  │  - sklearn       │   │  + Version control           │ │
│  │  - pytorch       │   │  + Annotations               │ │
│  │  - tensorflow    │   │                              │ │
│  └──────────────────┘   └──────────────────────────────┘ │
└───────────────────────────────────────────────────────────┘
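The Projects component centers on an MLproject file at the repository root. A sketch of what that file looks like (the entry point, script name, and parameters here are illustrative, not taken from this chapter's code):

```yaml
# MLproject — declares the environment and entry points for reproducible runs
name: churn-prediction

conda_env: conda.yaml   # or use docker_env for a container-based environment

entry_points:
  main:
    parameters:
      n_estimators: {type: int, default: 200}
      learning_rate: {type: float, default: 0.05}
    command: "python train.py --n-estimators {n_estimators} --learning-rate {learning_rate}"
```

Run it with `mlflow run . -P n_estimators=300`; MLflow recreates the declared environment before executing the command.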

10.3 MLflow Tracking — Logging Experiments🔗

import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Set the tracking server (or use local ./mlruns by default)
mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("churn-prediction")

# Start an experiment run
with mlflow.start_run(run_name="GBM-experiment-v3"):

    # ── Log parameters ──────────────────────────────
    params = {
        "n_estimators": 200,
        "learning_rate": 0.05,
        "max_depth": 5,
        "subsample": 0.8,
    }
    mlflow.log_params(params)

    # ── Train the model ─────────────────────────────
    # (assumes X_train, X_test, y_train, y_test come from your data pipeline)
    model = GradientBoostingClassifier(**params)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    # ── Log metrics ─────────────────────────────────
    mlflow.log_metrics({
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_score": f1_score(y_test, y_pred),
        "auc_roc": roc_auc_score(y_test, y_proba),
    })

    # ── Log the model artifact ──────────────────────
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="churn-classifier"
    )

    # ── Log other artifacts ─────────────────────────
    mlflow.log_artifact("reports/confusion_matrix.png")
    mlflow.log_artifact("data/processed/features.csv")

    print("Run ID:", mlflow.active_run().info.run_id)

10.4 MLflow Tracking UI🔗

MLflow UI (http://localhost:5000):

  Experiments
  └── churn-prediction
       ├── Run: GBM-v1    params: lr=0.1, n=100   accuracy: 0.82  ← old
       ├── Run: GBM-v2    params: lr=0.05, n=200  accuracy: 0.87  ← better
       └── Run: GBM-v3    params: lr=0.05, n=200  accuracy: 0.89  ← BEST ✓
            │
            ├── Parameters: {n_estimators: 200, lr: 0.05, ...}
            ├── Metrics:    {accuracy: 0.89, f1: 0.87, auc: 0.93}
            └── Artifacts:  model.pkl, confusion_matrix.png

10.5 Model Registry🔗

The MLflow Model Registry provides lifecycle management for production models. Note that MLflow 2.9+ deprecates stages in favor of model version aliases; the stage-based API below still works and remains common in existing deployments.

Model Lifecycle:

  Training ──▶ None ──▶ Staging ──▶ Production ──▶ Archived
                │          │            │
            New model   Testing    Live traffic
           (unreviewed)  (QA)       (serving)

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a new model version (run_id is from the training run in 10.3)
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="churn-classifier"
)

# Transition to Staging
client.transition_model_version_stage(
    name="churn-classifier",
    version=result.version,
    stage="Staging"
)

# After testing passes, promote to Production
client.transition_model_version_stage(
    name="churn-classifier",
    version=result.version,
    stage="Production"
)

# Load production model anywhere
model = mlflow.sklearn.load_model("models:/churn-classifier/Production")
prediction = model.predict(new_data)

10.6 MLflow + Jenkins Integration🔗

// In Jenkinsfile: evaluate model and gate deployment
stage('Evaluate & Register Model') {
    steps {
        sh '''
            python -c "
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Get latest run
experiment = client.get_experiment_by_name('churn-prediction')
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=['metrics.accuracy DESC'],
    max_results=1
)
best_run = runs[0]
accuracy = best_run.data.metrics['accuracy']
print(f'Best accuracy: {accuracy}')

# Gate: only register if better than threshold
assert accuracy >= 0.85, f'Model accuracy {accuracy} below threshold!'

# Register model
mlflow.register_model(
    model_uri=f'runs:/{best_run.info.run_id}/model',
    name='churn-classifier'
)
print('Model registered successfully!')
"
        '''
    }
}

10.7 Comparing Experiments🔗

# Compare multiple runs programmatically
import mlflow
import pandas as pd

client = mlflow.tracking.MlflowClient()
experiment = client.get_experiment_by_name("churn-prediction")

runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.accuracy DESC"]
)

# Build comparison dataframe
comparison = pd.DataFrame([{
    "run_id": r.info.run_id[:8],
    "run_name": r.data.tags.get("mlflow.runName"),
    "accuracy": r.data.metrics.get("accuracy"),
    "f1_score": r.data.metrics.get("f1_score"),
    "n_estimators": r.data.params.get("n_estimators"),
    "learning_rate": r.data.params.get("learning_rate"),
} for r in runs])

print(comparison.to_string(index=False))
# Output:
# run_id  run_name    accuracy  f1_score  n_estimators  learning_rate
# a3f1bc  GBM-v3      0.8900    0.8700    200           0.05   ← BEST
# 7e2d9a  GBM-v2      0.8700    0.8500    200           0.10
# 2f4c11  RF-v1       0.8200    0.8000    100           None

10.8 MLflow Setup (Docker Compose)🔗

# docker-compose.yml addition for MLflow
mlflow:
  image: ghcr.io/mlflow/mlflow:latest  # stock image may need psycopg2-binary / google-cloud-storage added
  ports:
    - "5000:5000"
  command: >
    mlflow server
    --backend-store-uri postgresql://mlflow:password@db:5432/mlflow
    --default-artifact-root gs://my-project/mlflow-artifacts
    --host 0.0.0.0
    --port 5000
  depends_on:
    - db

db:
  image: postgres:15
  environment:
    POSTGRES_USER: mlflow
    POSTGRES_PASSWORD: password
    POSTGRES_DB: mlflow
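Once the stack is up, training scripts and CI agents select this server through the standard tracking-URI environment variable (the code equivalent is mlflow.set_tracking_uri):

```shell
# Point any MLflow client (training scripts, Jenkins agents) at the server
export MLFLOW_TRACKING_URI=http://localhost:5000
```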

Next Chapter → 11: Model Monitoring