Chapter 12: End-to-End MLOps Project
"Putting it all together: Git → CI/CD → Docker → Kubernetes → GCP → Monitor"
12.1 Project Overview
Goal: Build and deploy a Customer Churn Prediction model with a full MLOps pipeline.
┌──────────────────────────────────────────────────────────────────────────┐
│                          COMPLETE MLOPS PROJECT                          │
│                                                                          │
│  Problem: Predict if a telecom customer will churn (leave service)       │
│  Model:   GradientBoostingClassifier                                     │
│  Stack:   GitHub → Jenkins → Docker → GKE → GCP → Prometheus/Grafana     │
│                                                                          │
│ ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐ │
│ │  GitHub  │──▶│ Jenkins  │──▶│  Docker  │──▶│   GKE    │──▶│ Monitor  │ │
│ │ (code +  │   │ (CI/CD)  │   │(package) │   │ (deploy) │   │(Grafana) │ │
│ │  data)   │   │          │   │          │   │          │   │          │ │
│ └──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
12.2 Project Structure
churn-prediction-mlops/
│
├── .github/
│   └── workflows/
│       └── pr-checks.yml        ← GitHub Actions (PR validation)
│
├── data/
│   ├── raw/
│   │   └── churn_data.csv.dvc   ← DVC pointer (actual data in GCS)
│   └── processed/
│       └── features.csv.dvc
│
├── src/
│   ├── preprocess.py            ← data cleaning + feature engineering
│   ├── train.py                 ← model training + MLflow logging
│   ├── evaluate.py              ← model evaluation + report
│   ├── validate_data.py         ← data quality checks
│   └── serve.py                 ← FastAPI model server
│
├── tests/
│   ├── unit/
│   │   ├── test_preprocess.py
│   │   └── test_features.py
│   └── integration/
│       └── test_api.py
│
├── k8s/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── hpa.yaml
│   └── namespace.yaml
│
├── monitoring/
│   ├── prometheus.yml
│   ├── alert_rules.yml
│   └── grafana-dashboard.json
│
├── Dockerfile                   ← container definition
├── Jenkinsfile                  ← CI/CD pipeline
├── docker-compose.yml           ← local dev stack
├── requirements.txt
├── dvc.yaml                     ← DVC pipeline
└── README.md
12.3 Step 1: Data & DVC Setup
# Initialize project
git init churn-prediction-mlops
cd churn-prediction-mlops
dvc init
# Add GCS as DVC remote
dvc remote add -d gcs-remote gs://my-project-data/dvc-store
dvc remote modify gcs-remote credentialpath /path/to/sa-key.json
# Track data
dvc add data/raw/churn_data.csv
git add data/raw/churn_data.csv.dvc .gitignore
git commit -m "track raw churn dataset"
dvc push
# dvc.yaml: define a reproducible pipeline
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/churn_data.csv
    outs:
      - data/processed/features.csv

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/features.csv
    params:
      - config/train_config.yaml:
          - n_estimators
          - learning_rate
    outs:
      - models/model.pkl
    metrics:
      - metrics/results.json

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/features.csv
    metrics:
      # DVC rejects two stages declaring the same output file, so the
      # evaluation report gets its own metrics file
      - metrics/eval_report.json
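The pipeline above (and the Jenkins stage in 12.6) calls `src/validate_data.py`, which this chapter never shows. A minimal sketch of what such a check could look like — the column names and the specific rules here are illustrative assumptions, not the project's actual file:

```python
# src/validate_data.py (illustrative sketch, not the project's actual file)
import pandas as pd


def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality problems; an empty list means OK."""
    problems = []
    if df.empty:
        problems.append("dataset is empty")
    if "churned" not in df.columns:
        problems.append("missing target column 'churned'")
    elif not set(df["churned"].unique()) <= {0, 1}:
        problems.append("target column 'churned' is not binary 0/1")
    null_cols = df.columns[df.isnull().any()].tolist()
    if null_cols:
        problems.append(f"null values in columns: {null_cols}")
    dupes = int(df.duplicated().sum())
    if dupes:
        problems.append(f"{dupes} duplicate rows")
    return problems


# Demo on a tiny inline frame; the real script would read
# data/processed/features.csv and sys.exit(1) if problems is non-empty,
# which is what makes the Jenkins stage fail.
demo = pd.DataFrame({"churned": [0, 1], "tenure": [3, 12]})
print(validate(demo))  # → []
```

The non-zero exit code is the whole contract with CI: Jenkins' `sh 'python src/validate_data.py'` step fails exactly when the script exits non-zero.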
12.4 Step 2: Training Script with MLflow
# src/train.py
import json
import os
import pickle

import mlflow
import mlflow.sklearn
import pandas as pd
import yaml
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Load config
with open("config/train_config.yaml") as f:
    config = yaml.safe_load(f)

# Load data
df = pd.read_csv("data/processed/features.csv")
X = df.drop("churned", axis=1)
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# MLflow setup
mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI", "http://mlflow:5000"))
mlflow.set_experiment("churn-prediction")

with mlflow.start_run(run_name=f"GBM-{config['version']}"):
    # Log config as params
    mlflow.log_params(config)

    # Train
    model = GradientBoostingClassifier(
        n_estimators=config["n_estimators"],
        learning_rate=config["learning_rate"],
        max_depth=config["max_depth"],
        random_state=42,
    )
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_score": f1_score(y_test, y_pred),
        "auc_roc": roc_auc_score(y_test, y_proba),
    }
    mlflow.log_metrics(metrics)
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn-classifier")

    # Save metrics for DVC and the CI quality gate
    os.makedirs("metrics", exist_ok=True)
    with open("metrics/results.json", "w") as f:
        json.dump(metrics, f)

    print(f"✅ Training complete: {metrics}")

# Save model locally for Docker
os.makedirs("models", exist_ok=True)
with open("models/model.pkl", "wb") as f:
    pickle.dump(model, f)
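The script reads `config/train_config.yaml`, which is not shown in this chapter. A sample matching the keys the script and `dvc.yaml` expect — the values are illustrative, not the project's tuned settings:

```yaml
# config/train_config.yaml (sample; values are illustrative)
version: "1.0"
n_estimators: 200
learning_rate: 0.1
max_depth: 3
```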
12.5 Step 3: Dockerfile
# Dockerfile
FROM python:3.10-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --prefix=/install --no-cache-dir -r requirements.txt
FROM python:3.10-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY src/serve.py .
COPY models/ ./models/
RUN useradd --create-home appuser && chown -R appuser /app
USER appuser
EXPOSE 8000
# python:3.10-slim does not ship curl, so probe with the stdlib instead
HEALTHCHECK --interval=30s CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8000"]
12.6 Step 4: Full Jenkinsfile
// Jenkinsfile
pipeline {
    // Note: the docker/gcloud/kubectl stages below assume those CLIs exist
    // on the agent; python:3.10-slim alone is not enough, so in practice
    // use per-stage agents or a custom CI image that bundles them.
    agent { docker { image 'python:3.10-slim' } }

    environment {
        GCR_IMAGE          = "gcr.io/my-project/churn-model"
        GKE_CLUSTER        = "ml-cluster"
        GKE_REGION         = "us-central1"
        ACCURACY_THRESHOLD = "0.85"
    }

    stages {
        stage('Setup') {
            steps { sh 'pip install -r requirements.txt' }
        }
        stage('Lint') {
            steps { sh 'flake8 src/ && black --check src/' }
        }
        stage('Unit Tests') {
            steps { sh 'pytest tests/unit/ -v' }
        }
        stage('Data Pull & Validate') {
            steps {
                sh 'dvc pull'
                sh 'python src/validate_data.py'
            }
        }
        stage('Train') {
            steps {
                sh 'python src/train.py'
                archiveArtifacts 'models/*.pkl'
            }
        }
        stage('Quality Gate') {
            steps {
                // Read the threshold from the environment rather than
                // interpolating it into an inline python -c string, which
                // is fragile to quote correctly.
                sh '''python - <<'EOF'
import json, os
m = json.load(open("metrics/results.json"))
print(f"Accuracy: {m['accuracy']}")
assert m["accuracy"] >= float(os.environ["ACCURACY_THRESHOLD"]), "Below threshold!"
print("Quality gate PASSED ✅")
EOF'''
            }
        }
        stage('Build & Push Docker') {
            steps {
                sh "docker build -t ${GCR_IMAGE}:${BUILD_NUMBER} ."
                withCredentials([file(credentialsId: 'gcp-sa-key', variable: 'KEY')]) {
                    // single quotes: let the shell expand $KEY so the secret
                    // is never interpolated into the Groovy string
                    sh 'gcloud auth activate-service-account --key-file=$KEY'
                    sh 'gcloud auth configure-docker --quiet'
                    sh "docker push ${GCR_IMAGE}:${BUILD_NUMBER}"
                    sh "docker tag ${GCR_IMAGE}:${BUILD_NUMBER} ${GCR_IMAGE}:latest"
                    sh "docker push ${GCR_IMAGE}:latest"
                }
            }
        }
        stage('Deploy Staging') {
            steps {
                sh "gcloud container clusters get-credentials ${GKE_CLUSTER} --region ${GKE_REGION}"
                sh "kubectl set image deployment/churn-model churn-model=${GCR_IMAGE}:${BUILD_NUMBER} -n staging"
                sh "kubectl rollout status deployment/churn-model -n staging --timeout=120s"
            }
        }
        stage('Integration Tests') {
            steps { sh 'pytest tests/integration/ -v' }
        }
        stage('Deploy Production') {
            when { branch 'main' }
            input { message "Deploy to production?" }
            steps {
                sh "kubectl set image deployment/churn-model churn-model=${GCR_IMAGE}:${BUILD_NUMBER} -n production"
                sh "kubectl rollout status deployment/churn-model -n production --timeout=180s"
            }
        }
    }

    post {
        success { slackSend color: 'good', message: "✅ Churn Model Deployed: v${BUILD_NUMBER}" }
        failure { slackSend color: 'danger', message: "❌ Pipeline Failed: v${BUILD_NUMBER}" }
        always  { cleanWs() }
    }
}
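The quality-gate logic can equally live in a small standalone script that the pipeline calls, which keeps the Jenkinsfile free of inline Python and makes the gate unit-testable. A hypothetical `scripts/quality_gate.py` (the path and demo file name are assumptions):

```python
# scripts/quality_gate.py (hypothetical refactor of the inline gate)
import json


def check(metrics_path: str, threshold: float) -> bool:
    """Return True if the trained model's accuracy clears the threshold."""
    with open(metrics_path) as f:
        metrics = json.load(f)
    accuracy = metrics["accuracy"]
    print(f"Accuracy: {accuracy:.4f} (threshold: {threshold})")
    return accuracy >= threshold


# Demo: write a metrics file like train.py's and gate on it.
with open("demo_metrics.json", "w") as f:
    json.dump({"accuracy": 0.91}, f)
print(check("demo_metrics.json", 0.85))  # → True
```

In the pipeline, the `__main__` entry point would call `sys.exit(0 if check(...) else 1)`; the non-zero exit is what makes the `sh` step, and therefore the stage, fail.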
12.7 Full Pipeline Flow Diagram
┌──────────────────────────────────────────────────────────────────────────┐
│                         COMPLETE END-TO-END FLOW                         │
│                                                                          │
│  DEVELOPER                                                               │
│   │  git push feature/new-model                                          │
│   ▼                                                                      │
│  GITHUB                                                                  │
│   ├── GitHub Actions: PR checks (lint + unit tests) ←── PR must pass     │
│   └── Merge to main → webhook → triggers Jenkins                         │
│                         │                                                │
│  JENKINS CI/CD PIPELINE ▼                                                │
│   ├── 1. Lint + Format                                                   │
│   ├── 2. Unit Tests (pytest)                                             │
│   ├── 3. DVC Pull data from GCS                                          │
│   ├── 4. Data Validation (Great Expectations)                            │
│   ├── 5. Train Model (logs to MLflow)                                    │
│   ├── 6. Quality Gate: accuracy ≥ 0.85? ──▶ fail → STOP ✗                │
│   ├── 7. Build Docker Image                                              │
│   ├── 8. Push Image to GCR                                               │
│   ├── 9. Deploy to GKE Staging                                           │
│   ├── 10. Integration Tests                                              │
│   └── 11. Manual Approval → Deploy to GKE Production                     │
│                         │                                                │
│  GKE PRODUCTION         ▼                                                │
│   ├── 3 Pods running churn-model:vX                                      │
│   ├── HPA: auto-scales 2→10 pods on load                                 │
│   └── Service: LoadBalancer → REST API                                   │
│                         │                                                │
│  MONITORING             ▼                                                │
│   ├── Prometheus: scrapes /metrics every 15s                             │
│   ├── Grafana: real-time dashboards                                      │
│   ├── Evidently: weekly drift reports                                    │
│   └── Alerts → Slack if latency/drift/errors spike                       │
│                         │                                                │
│  RETRAINING LOOP        ▼                                                │
│   └── Drift detected → trigger Jenkins retrain job ──▶ back to step 5    │
└──────────────────────────────────────────────────────────────────────────┘
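The retraining loop at the bottom of the diagram needs a concrete drift signal. The stack uses Evidently for its weekly reports; purely as a self-contained illustration of the statistical idea behind such a signal, here is a per-feature two-sample Kolmogorov–Smirnov check (the feature names, sample data, and 0.05 threshold are assumptions for the sketch):

```python
# Minimal drift-check sketch: the project uses Evidently; this only
# illustrates the per-feature statistical test underneath such tools.
import numpy as np
from scipy import stats


def drifted_features(reference: dict, current: dict, p_threshold: float = 0.05):
    """Compare each feature's live distribution to its training-time
    reference with a two-sample KS test; a low p-value suggests drift."""
    drifted = []
    for name in reference:
        statistic, p_value = stats.ks_2samp(reference[name], current[name])
        if p_value < p_threshold:
            drifted.append(name)
    return drifted


rng = np.random.default_rng(42)
ref = {"tenure": rng.normal(24, 6, 1000), "charges": rng.normal(70, 15, 1000)}
cur = {"tenure": rng.normal(24, 6, 1000), "charges": rng.normal(90, 15, 1000)}
print(drifted_features(ref, cur))  # "charges" shifted, so it should flag
```

A scheduled job running a check like this (or an Evidently report) would call the Jenkins API to kick off the retrain pipeline, re-entering the flow at step 5.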
12.8 Key Files Summary
| File | Purpose |
|---|---|
| Jenkinsfile | Full CI/CD pipeline definition |
| Dockerfile | Container packaging |
| dvc.yaml | Reproducible data + training pipeline |
| src/train.py | Model training + MLflow logging |
| src/serve.py | FastAPI inference server + Prometheus metrics |
| k8s/deployment.yaml | K8s production deployment |
| k8s/hpa.yaml | Auto-scaling config |
| monitoring/prometheus.yml | Metrics scraping |
| monitoring/alert_rules.yml | Alerting rules |
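`k8s/deployment.yaml` appears in the table but not in the chapter; a minimal sketch consistent with the flow diagram (3 replicas, the GCR image, a readiness probe on `/health`) — names, labels, and the namespace are assumptions:

```yaml
# k8s/deployment.yaml (minimal sketch; labels and namespace are assumptions)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-model
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: churn-model
  template:
    metadata:
      labels:
        app: churn-model
    spec:
      containers:
        - name: churn-model
          image: gcr.io/my-project/churn-model:latest
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
```

The `kubectl set image deployment/churn-model ...` steps in the Jenkinsfile update the `image:` field of exactly this Deployment, pinning it to the build-numbered tag rather than `latest`.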
12.9 Tools Summary Table
| Phase | Tool | Role |
|---|---|---|
| Version Control | Git + GitHub | Code tracking, PRs, collaboration |
| Data Versioning | DVC + GCS | Track datasets and model files |
| CI/CD | Jenkins / GitHub Actions | Automate build, test, deploy |
| Containerization | Docker | Package model + dependencies |
| Orchestration | Kubernetes / GKE | Deploy, scale, heal containers |
| Cloud | GCP (GCS, GCR, GKE) | Storage, registry, compute |
| AutoML | Vertex AI AutoML / Optuna | Automated model selection + HPO |
| Experiment Tracking | MLflow | Log + compare experiments |
| Monitoring | Prometheus + Grafana | Metrics + dashboards |
| Drift Detection | Evidently AI | Data + model drift |
| Alerting | PagerDuty / Slack | Notify on issues |
🎉 Congratulations! You now have a complete MLOps foundation.
Go back to README for the full table of contents.