Chapter 12: End-to-End MLOps Project

"Putting it all together: Git β†’ CI/CD β†’ Docker β†’ Kubernetes β†’ GCP β†’ Monitor"


12.1 Project Overview

Goal: Build and deploy a Customer Churn Prediction model with a full MLOps pipeline.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    COMPLETE MLOPS PROJECT                               β”‚
β”‚                                                                         β”‚
β”‚  Problem:  Predict if a telecom customer will churn (leave service)     β”‚
β”‚  Model:    GradientBoostingClassifier                                   β”‚
β”‚  Stack:    GitHub β†’ Jenkins β†’ Docker β†’ GKE β†’ GCP β†’ Prometheus/Grafana  β”‚
β”‚                                                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚  GitHub  │─▢│ Jenkins  │─▢│  Docker  │─▢│   GKE    │─▢│ Monitor  β”‚ β”‚
β”‚  β”‚  (code + β”‚  β”‚ (CI/CD)  β”‚  β”‚(package) β”‚  β”‚(deploy)  β”‚  β”‚(Grafana) β”‚ β”‚
β”‚  β”‚   data)  β”‚  β”‚          β”‚  β”‚          β”‚  β”‚          β”‚  β”‚          β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

12.2 Project Structure

churn-prediction-mlops/
β”‚
β”œβ”€β”€ .github/
β”‚   └── workflows/
β”‚       └── pr-checks.yml           ← GitHub Actions (PR validation)
β”‚
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ raw/
β”‚   β”‚   └── churn_data.csv.dvc      ← DVC pointer (actual file in GCS)
β”‚   └── processed/
β”‚       └── features.csv.dvc
β”‚
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ preprocess.py               ← data cleaning + feature engineering
β”‚   β”œβ”€β”€ train.py                    ← model training + MLflow logging
β”‚   β”œβ”€β”€ evaluate.py                 ← model evaluation + report
β”‚   β”œβ”€β”€ validate_data.py            ← data quality checks
β”‚   └── serve.py                    ← FastAPI model server
β”‚
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ unit/
β”‚   β”‚   β”œβ”€β”€ test_preprocess.py
β”‚   β”‚   └── test_features.py
β”‚   └── integration/
β”‚       └── test_api.py
β”‚
β”œβ”€β”€ k8s/
β”‚   β”œβ”€β”€ deployment.yaml
β”‚   β”œβ”€β”€ service.yaml
β”‚   β”œβ”€β”€ hpa.yaml
β”‚   └── namespace.yaml
β”‚
β”œβ”€β”€ monitoring/
β”‚   β”œβ”€β”€ prometheus.yml
β”‚   β”œβ”€β”€ alert_rules.yml
β”‚   └── grafana-dashboard.json
β”‚
β”œβ”€β”€ Dockerfile                      ← container definition
β”œβ”€β”€ Jenkinsfile                     ← CI/CD pipeline
β”œβ”€β”€ docker-compose.yml              ← local dev stack
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ dvc.yaml                        ← DVC pipeline
└── README.md

12.3 Step 1: Data & DVC Setup

# Initialize project
git init churn-prediction-mlops
cd churn-prediction-mlops
dvc init

# Add GCS as DVC remote
dvc remote add -d gcs-remote gs://my-project-data/dvc-store
dvc remote modify gcs-remote credentialpath /path/to/sa-key.json

# Track data
dvc add data/raw/churn_data.csv
git add data/raw/churn_data.csv.dvc .gitignore
git commit -m "track raw churn dataset"
dvc push

# dvc.yaml β€” define the reproducible pipeline
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/churn_data.csv
    outs:
      - data/processed/features.csv

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/features.csv
    params:
      - config/train_config.yaml:
          - n_estimators
          - learning_rate
    outs:
      - models/model.pkl
    metrics:
      - metrics/results.json

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/features.csv
    metrics:
      - metrics/results.json
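
Both dvc.yaml and src/train.py read config/train_config.yaml, which is not reproduced in this chapter. A minimal version covering the keys they reference (n_estimators, learning_rate, max_depth, version) might look like this; the values are illustrative:

```yaml
# config/train_config.yaml β€” hyperparameters tracked as DVC params
version: "1.0.0"
n_estimators: 200
learning_rate: 0.1
max_depth: 3
```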

12.4 Step 2: Training Script with MLflow

# src/train.py
import mlflow
import mlflow.sklearn
import yaml
import pickle
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
import pandas as pd
import json
import os

# Load config
with open("config/train_config.yaml") as f:
    config = yaml.safe_load(f)

# Load data
df = pd.read_csv("data/processed/features.csv")
X = df.drop("churned", axis=1)
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# MLflow setup
mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI", "http://mlflow:5000"))
mlflow.set_experiment("churn-prediction")

with mlflow.start_run(run_name=f"GBM-{config['version']}"):
    # Log config as params
    mlflow.log_params(config)

    # Train
    model = GradientBoostingClassifier(
        n_estimators=config["n_estimators"],
        learning_rate=config["learning_rate"],
        max_depth=config["max_depth"],
        random_state=42
    )
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_score": f1_score(y_test, y_pred),
        "auc_roc": roc_auc_score(y_test, y_proba),
    }

    mlflow.log_metrics(metrics)
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn-classifier")

    # Save metrics for DVC and CI gate
    os.makedirs("metrics", exist_ok=True)
    with open("metrics/results.json", "w") as f:
        json.dump(metrics, f)

    print(f"βœ… Training complete: {metrics}")

# Save model locally for Docker
os.makedirs("models", exist_ok=True)
with open("models/model.pkl", "wb") as f:
    pickle.dump(model, f)

12.5 Step 3: Dockerfile

# Dockerfile
FROM python:3.10-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --prefix=/install --no-cache-dir -r requirements.txt

FROM python:3.10-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY src/serve.py .
COPY models/ ./models/
RUN useradd --create-home appuser && chown -R appuser /app
USER appuser
EXPOSE 8000
# curl is not installed in python:3.10-slim, so use Python for the health probe
HEALTHCHECK --interval=30s CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8000"]

12.6 Step 4: Full Jenkinsfile

// Jenkinsfile
pipeline {
    agent any  // the agent needs python3, docker, gcloud and kubectl installed

    environment {
        GCR_IMAGE  = "gcr.io/my-project/churn-model"
        GKE_CLUSTER = "ml-cluster"
        GKE_REGION  = "us-central1"
        ACCURACY_THRESHOLD = "0.85"
    }

    stages {
        stage('Setup') {
            steps { sh 'pip install -r requirements.txt' }
        }
        stage('Lint') {
            steps { sh 'flake8 src/ && black --check src/' }
        }
        stage('Unit Tests') {
            steps { sh 'pytest tests/unit/ -v' }
        }
        stage('Data Pull & Validate') {
            steps {
                sh 'dvc pull'
                sh 'python src/validate_data.py'
            }
        }
        stage('Train') {
            steps {
                sh 'python src/train.py'
                archiveArtifacts 'models/*.pkl'
            }
        }
        stage('Quality Gate') {
            steps {
                // Heredoc avoids the Groovy/shell/Python quoting clash of an inline python -c;
                // ACCURACY_THRESHOLD is exported from the environment block above.
                sh '''python - <<'EOF'
import json, os
m = json.load(open("metrics/results.json"))
print("Accuracy:", m["accuracy"])
assert m["accuracy"] >= float(os.environ["ACCURACY_THRESHOLD"]), "Below threshold!"
print("Quality gate PASSED βœ…")
EOF
'''
            }
        }
        stage('Build & Push Docker') {
            steps {
                sh "docker build -t ${GCR_IMAGE}:${BUILD_NUMBER} ."
                withCredentials([file(credentialsId: 'gcp-sa-key', variable: 'KEY')]) {
                    sh "gcloud auth activate-service-account --key-file=$KEY"
                    sh "gcloud auth configure-docker --quiet"
                    sh "docker push ${GCR_IMAGE}:${BUILD_NUMBER}"
                    sh "docker tag ${GCR_IMAGE}:${BUILD_NUMBER} ${GCR_IMAGE}:latest"
                    sh "docker push ${GCR_IMAGE}:latest"
                }
            }
        }
        stage('Deploy Staging') {
            steps {
                sh "gcloud container clusters get-credentials ${GKE_CLUSTER} --region ${GKE_REGION}"
                sh "kubectl set image deployment/churn-model churn-model=${GCR_IMAGE}:${BUILD_NUMBER} -n staging"
                sh "kubectl rollout status deployment/churn-model -n staging --timeout=120s"
            }
        }
        stage('Integration Tests') {
            steps { sh 'pytest tests/integration/ -v' }
        }
        stage('Deploy Production') {
            when {
                beforeInput true   // evaluate the branch check before prompting
                branch 'main'
            }
            input { message "Deploy to production?" }
            steps {
                sh "kubectl set image deployment/churn-model churn-model=${GCR_IMAGE}:${BUILD_NUMBER} -n production"
                sh "kubectl rollout status deployment/churn-model -n production --timeout=180s"
            }
        }
    }

    post {
        success { slackSend color: 'good', message: "βœ… Churn Model Deployed: v${BUILD_NUMBER}" }
        failure { slackSend color: 'danger', message: "❌ Pipeline Failed: v${BUILD_NUMBER}" }
        always  { cleanWs() }
    }
}
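
The pipeline's Data Pull & Validate stage calls src/validate_data.py. The flow diagram in 12.7 names Great Expectations for this step; as a dependency-light stand-in, the same idea (schema, nulls, value ranges) can be sketched with plain pandas. Column names here are illustrative:

```python
# src/validate_data.py β€” lightweight data-quality gate (pandas stand-in for
# Great Expectations; column names are illustrative)
import sys

import pandas as pd

REQUIRED_COLUMNS = {"churned", "tenure_months", "monthly_charges"}


def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable failures; an empty list means the data passed."""
    errors = []
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
        return errors  # remaining checks assume the schema is present
    if df["churned"].isna().any():
        errors.append("target column 'churned' contains nulls")
    if not df["churned"].isin([0, 1]).all():
        errors.append("target column 'churned' must be binary 0/1")
    if (df["monthly_charges"] < 0).any():
        errors.append("negative values in 'monthly_charges'")
    return errors


if __name__ == "__main__":
    failures = validate(pd.read_csv("data/raw/churn_data.csv"))
    for msg in failures:
        print(f"VALIDATION FAILED: {msg}")
    sys.exit(1 if failures else 0)
```

A non-zero exit code fails the Jenkins stage, which is exactly the behavior the `sh 'python src/validate_data.py'` step relies on.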

12.7 Full Pipeline Flow Diagram

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     COMPLETE END-TO-END FLOW                                β”‚
β”‚                                                                             β”‚
β”‚  DEVELOPER                                                                  β”‚
β”‚     β”‚ git push feature/new-model                                            β”‚
β”‚     β–Ό                                                                       β”‚
β”‚  GITHUB                                                                     β”‚
β”‚     β”œβ”€β”€ GitHub Actions: PR checks (lint + unit tests) ─── PR must pass βœ…   β”‚
β”‚     └── Merge to main β†’ webhook β†’ triggers Jenkins                          β”‚
β”‚                                          β”‚                                  β”‚
β”‚  JENKINS CI/CD PIPELINE                  β–Ό                                  β”‚
β”‚     β”œβ”€β”€ 1. Lint + Format                                                    β”‚
β”‚     β”œβ”€β”€ 2. Unit Tests (pytest)                                              β”‚
β”‚     β”œβ”€β”€ 3. DVC Pull data from GCS                                           β”‚
β”‚     β”œβ”€β”€ 4. Data Validation (Great Expectations)                             β”‚
β”‚     β”œβ”€β”€ 5. Train Model (logs to MLflow)                                     β”‚
β”‚     β”œβ”€β”€ 6. Quality Gate: accuracy β‰₯ 0.85? ─── fail β†’ STOP ❌                β”‚
β”‚     β”œβ”€β”€ 7. Build Docker Image                                               β”‚
β”‚     β”œβ”€β”€ 8. Push Image to GCR                                                β”‚
β”‚     β”œβ”€β”€ 9. Deploy to GKE Staging                                            β”‚
β”‚     β”œβ”€β”€ 10. Integration Tests                                               β”‚
β”‚     └── 11. Manual Approval β†’ Deploy to GKE Production                     β”‚
β”‚                                          β”‚                                  β”‚
β”‚  GKE PRODUCTION                          β–Ό                                  β”‚
β”‚     β”œβ”€β”€ 3 Pods running churn-model:vX                                       β”‚
β”‚     β”œβ”€β”€ HPA: auto-scales 2β†’10 pods on load                                  β”‚
β”‚     └── Service: LoadBalancer β†’ REST API                                    β”‚
β”‚                                          β”‚                                  β”‚
β”‚  MONITORING                              β–Ό                                  β”‚
β”‚     β”œβ”€β”€ Prometheus: scrapes /metrics every 15s                              β”‚
β”‚     β”œβ”€β”€ Grafana: real-time dashboards                                       β”‚
β”‚     β”œβ”€β”€ Evidently: weekly drift reports                                     β”‚
β”‚     └── Alerts β†’ Slack if latency/drift/errors spike                        β”‚
β”‚                                          β”‚                                  β”‚
β”‚  RETRAINING LOOP                         β–Ό                                  β”‚
β”‚     └── Drift detected β†’ trigger Jenkins retrain job ─── back to step 5 ↑  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
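
The retraining loop above hinges on detecting drift. The chapter's stack uses Evidently for this; the core idea can be illustrated with a small Population Stability Index (PSI) calculation using only NumPy. The 0.2 threshold is a conventional rule of thumb, not Evidently's API:

```python
# Drift-check sketch: PSI between a reference (training-time) distribution and
# current production traffic. PSI > 0.2 is a common retrain trigger.
import numpy as np


def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index over bins defined by the reference distribution."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Clip to avoid log(0) / division by zero in sparse bins
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    train_dist = rng.normal(50.0, 10.0, 5000)  # e.g. monthly_charges at training time
    live_dist = rng.normal(65.0, 10.0, 5000)   # production traffic has shifted
    score = psi(train_dist, live_dist)
    if score > 0.2:
        print(f"PSI={score:.3f} β†’ drift detected, trigger the Jenkins retrain job")
```

In the pipeline above, a scheduled job would run this weekly against fresh traffic and call the Jenkins retrain webhook when the threshold is crossed.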

12.8 Key Files Summary

File                         Purpose
Jenkinsfile                  Full CI/CD pipeline definition
Dockerfile                   Container packaging
dvc.yaml                     Reproducible data + training pipeline
src/train.py                 Model training + MLflow logging
src/serve.py                 FastAPI inference server + Prometheus metrics
k8s/deployment.yaml          K8s production deployment
k8s/hpa.yaml                 Auto-scaling config
monitoring/prometheus.yml    Metrics scraping
monitoring/alert_rules.yml   Alerting rules
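
The k8s/deployment.yaml and k8s/hpa.yaml manifests listed above are not reproduced in this chapter. A minimal sketch consistent with the Jenkinsfile (deployment name churn-model, container port 8000) might be as follows; the resource numbers are illustrative:

```yaml
# k8s/deployment.yaml (sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-model
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels: {app: churn-model}
  template:
    metadata:
      labels: {app: churn-model}
    spec:
      containers:
        - name: churn-model
          image: gcr.io/my-project/churn-model:latest
          ports: [{containerPort: 8000}]
          resources:
            requests: {cpu: 250m, memory: 256Mi}
            limits: {cpu: "1", memory: 512Mi}
---
# k8s/hpa.yaml (sketch) β€” matches the "2β†’10 pods" scaling in the flow diagram
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: churn-model
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: churn-model
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: {type: Utilization, averageUtilization: 70}
```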

12.9 Tools Summary Table

Phase                 Tool                        Role
Version Control       Git + GitHub                Code tracking, PRs, collaboration
Data Versioning       DVC + GCS                   Track datasets and model files
CI/CD                 Jenkins / GitHub Actions    Automate build, test, deploy
Containerization      Docker                      Package model + dependencies
Orchestration         Kubernetes / GKE            Deploy, scale, heal containers
Cloud                 GCP (GCS, GCR, GKE)         Storage, registry, compute
AutoML                Vertex AI AutoML / Optuna   Automated model selection + HPO
Experiment Tracking   MLflow                      Log + compare experiments
Monitoring            Prometheus + Grafana        Metrics + dashboards
Drift Detection       Evidently AI                Data + model drift
Alerting              PagerDuty / Slack           Notify on issues

πŸŽ‰ Congratulations! You now have a complete MLOps foundation.

Go back to README for the full table of contents.