Chapter 12: End-to-End MLOps Project
"Putting it all together: Git → CI/CD → Docker → Kubernetes → GCP → Monitor"
12.1 Project Overview
Goal: Build and deploy a Customer Churn Prediction model with a full MLOps pipeline.
┌──────────────────────────────────────────────────────────────────────────┐
│                          COMPLETE MLOPS PROJECT                          │
│                                                                          │
│  Problem: Predict if a telecom customer will churn (leave service)       │
│  Model:   GradientBoostingClassifier                                     │
│  Stack:   GitHub → Jenkins → Docker → GKE → GCP → Prometheus/Grafana     │
│                                                                          │
│ ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐ │
│ │  GitHub  │──▶│ Jenkins  │──▶│  Docker  │──▶│   GKE    │──▶│ Monitor  │ │
│ │ (code +  │   │ (CI/CD)  │   │(package) │   │ (deploy) │   │(Grafana) │ │
│ │  data)   │   │          │   │          │   │          │   │          │ │
│ └──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
12.2 Project Structure
churn-prediction-mlops/
│
├── .github/
│   └── workflows/
│       └── pr-checks.yml        ← GitHub Actions (PR validation)
│
├── data/
│   ├── raw/
│   │   └── churn_data.csv.dvc   ← DVC pointer (actual data in GCS)
│   └── processed/
│       └── features.csv.dvc
│
├── src/
│   ├── preprocess.py            ← data cleaning + feature engineering
│   ├── train.py                 ← model training + MLflow logging
│   ├── evaluate.py              ← model evaluation + report
│   ├── validate_data.py         ← data quality checks
│   └── serve.py                 ← FastAPI model server
│
├── tests/
│   ├── unit/
│   │   ├── test_preprocess.py
│   │   └── test_features.py
│   └── integration/
│       └── test_api.py
│
├── k8s/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── hpa.yaml
│   └── namespace.yaml
│
├── monitoring/
│   ├── prometheus.yml
│   ├── alert_rules.yml
│   └── grafana-dashboard.json
│
├── Dockerfile                   ← container definition
├── Jenkinsfile                  ← CI/CD pipeline
├── docker-compose.yml           ← local dev stack
├── requirements.txt
├── dvc.yaml                     ← DVC pipeline
└── README.md
12.3 Step 1: Data & DVC Setup
# Initialize project
git init churn-prediction-mlops
cd churn-prediction-mlops
dvc init
# Add GCS as DVC remote
dvc remote add -d gcs-remote gs://my-project-data/dvc-store
dvc remote modify gcs-remote credentialpath /path/to/sa-key.json
# Track data
dvc add data/raw/churn_data.csv
git add data/raw/churn_data.csv.dvc .gitignore
git commit -m "track raw churn dataset"
dvc push
# dvc.yaml: define a reproducible pipeline
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/churn_data.csv
    outs:
      - data/processed/features.csv

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/features.csv
    params:
      - config/train_config.yaml:
          - n_estimators
          - learning_rate
    outs:
      - models/model.pkl
    metrics:
      - metrics/results.json

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/features.csv
    metrics:
      # DVC rejects two stages declaring the same output file, so the
      # evaluation report gets its own metrics file
      - metrics/eval_report.json
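The pipeline above (and the Jenkins stage in 12.6) calls `src/validate_data.py`, which this chapter never shows. A minimal sketch of what such a check could look like — the column names and the specific rules here are illustrative assumptions, not the project's actual file:

```python
# src/validate_data.py (illustrative sketch, not the project's actual file)
import pandas as pd


def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality problems; an empty list means OK."""
    problems = []
    if df.empty:
        problems.append("dataset is empty")
    if "churned" not in df.columns:
        problems.append("missing target column 'churned'")
    elif not set(df["churned"].unique()) <= {0, 1}:
        problems.append("target column 'churned' is not binary 0/1")
    null_cols = df.columns[df.isnull().any()].tolist()
    if null_cols:
        problems.append(f"null values in columns: {null_cols}")
    dupes = int(df.duplicated().sum())
    if dupes:
        problems.append(f"{dupes} duplicate rows")
    return problems


# Demo on a tiny inline frame; the real script would read
# data/processed/features.csv and sys.exit(1) if problems is non-empty,
# which is what makes the Jenkins stage fail.
demo = pd.DataFrame({"churned": [0, 1], "tenure": [3, 12]})
print(validate(demo))  # → []
```

The non-zero exit code is the whole contract with CI: Jenkins' `sh 'python src/validate_data.py'` step fails exactly when the script exits non-zero.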
12.4 Step 2: Training Script with MLflow
# src/train.py
import json
import os
import pickle

import mlflow
import mlflow.sklearn
import pandas as pd
import yaml
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Load config
with open("config/train_config.yaml") as f:
    config = yaml.safe_load(f)

# Load data
df = pd.read_csv("data/processed/features.csv")
X = df.drop("churned", axis=1)
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# MLflow setup
mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI", "http://mlflow:5000"))
mlflow.set_experiment("churn-prediction")

with mlflow.start_run(run_name=f"GBM-{config['version']}"):
    # Log config as params
    mlflow.log_params(config)

    # Train
    model = GradientBoostingClassifier(
        n_estimators=config["n_estimators"],
        learning_rate=config["learning_rate"],
        max_depth=config["max_depth"],
        random_state=42,
    )
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_score": f1_score(y_test, y_pred),
        "auc_roc": roc_auc_score(y_test, y_proba),
    }
    mlflow.log_metrics(metrics)
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn-classifier")

    # Save metrics for DVC and the CI quality gate
    os.makedirs("metrics", exist_ok=True)
    with open("metrics/results.json", "w") as f:
        json.dump(metrics, f)

    print(f"✅ Training complete: {metrics}")

# Save model locally for Docker
os.makedirs("models", exist_ok=True)
with open("models/model.pkl", "wb") as f:
    pickle.dump(model, f)
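The script reads `config/train_config.yaml`, which is not shown in this chapter. A sample matching the keys the script and `dvc.yaml` expect — the values are illustrative, not the project's tuned settings:

```yaml
# config/train_config.yaml (sample; values are illustrative)
version: "1.0"
n_estimators: 200
learning_rate: 0.1
max_depth: 3
```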
12.5 Step 3: Dockerfile
# Dockerfile
FROM python:3.10-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --prefix=/install --no-cache-dir -r requirements.txt
FROM python:3.10-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY src/serve.py .
COPY models/ ./models/
RUN useradd --create-home appuser && chown -R appuser /app
USER appuser
EXPOSE 8000
# python:3.10-slim does not ship curl, so probe with the stdlib instead
HEALTHCHECK --interval=30s CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8000"]
12.6 Step 4: Full Jenkinsfile
// Jenkinsfile
pipeline {
    // Note: the docker/gcloud/kubectl stages below assume those CLIs exist
    // on the agent; python:3.10-slim alone is not enough, so in practice
    // use per-stage agents or a custom CI image that bundles them.
    agent { docker { image 'python:3.10-slim' } }

    environment {
        GCR_IMAGE          = "gcr.io/my-project/churn-model"
        GKE_CLUSTER        = "ml-cluster"
        GKE_REGION         = "us-central1"
        ACCURACY_THRESHOLD = "0.85"
    }

    stages {
        stage('Setup') {
            steps { sh 'pip install -r requirements.txt' }
        }
        stage('Lint') {
            steps { sh 'flake8 src/ && black --check src/' }
        }
        stage('Unit Tests') {
            steps { sh 'pytest tests/unit/ -v' }
        }
        stage('Data Pull & Validate') {
            steps {
                sh 'dvc pull'
                sh 'python src/validate_data.py'
            }
        }
        stage('Train') {
            steps {
                sh 'python src/train.py'
                archiveArtifacts 'models/*.pkl'
            }
        }
        stage('Quality Gate') {
            steps {
                // Read the threshold from the environment rather than
                // interpolating it into an inline python -c string, which
                // is fragile to quote correctly.
                sh '''python - <<'EOF'
import json, os
m = json.load(open("metrics/results.json"))
print(f"Accuracy: {m['accuracy']}")
assert m["accuracy"] >= float(os.environ["ACCURACY_THRESHOLD"]), "Below threshold!"
print("Quality gate PASSED ✅")
EOF'''
            }
        }
        stage('Build & Push Docker') {
            steps {
                sh "docker build -t ${GCR_IMAGE}:${BUILD_NUMBER} ."
                withCredentials([file(credentialsId: 'gcp-sa-key', variable: 'KEY')]) {
                    // single quotes: let the shell expand $KEY so the secret
                    // is never interpolated into the Groovy string
                    sh 'gcloud auth activate-service-account --key-file=$KEY'
                    sh 'gcloud auth configure-docker --quiet'
                    sh "docker push ${GCR_IMAGE}:${BUILD_NUMBER}"
                    sh "docker tag ${GCR_IMAGE}:${BUILD_NUMBER} ${GCR_IMAGE}:latest"
                    sh "docker push ${GCR_IMAGE}:latest"
                }
            }
        }
        stage('Deploy Staging') {
            steps {
                sh "gcloud container clusters get-credentials ${GKE_CLUSTER} --region ${GKE_REGION}"
                sh "kubectl set image deployment/churn-model churn-model=${GCR_IMAGE}:${BUILD_NUMBER} -n staging"
                sh "kubectl rollout status deployment/churn-model -n staging --timeout=120s"
            }
        }
        stage('Integration Tests') {
            steps { sh 'pytest tests/integration/ -v' }
        }
        stage('Deploy Production') {
            when { branch 'main' }
            input { message "Deploy to production?" }
            steps {
                sh "kubectl set image deployment/churn-model churn-model=${GCR_IMAGE}:${BUILD_NUMBER} -n production"
                sh "kubectl rollout status deployment/churn-model -n production --timeout=180s"
            }
        }
    }

    post {
        success { slackSend color: 'good', message: "✅ Churn Model Deployed: v${BUILD_NUMBER}" }
        failure { slackSend color: 'danger', message: "❌ Pipeline Failed: v${BUILD_NUMBER}" }
        always  { cleanWs() }
    }
}
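The quality-gate logic can equally live in a small standalone script that the pipeline calls, which keeps the Jenkinsfile free of inline Python and makes the gate unit-testable. A hypothetical `scripts/quality_gate.py` (the path and demo file name are assumptions):

```python
# scripts/quality_gate.py (hypothetical refactor of the inline gate)
import json


def check(metrics_path: str, threshold: float) -> bool:
    """Return True if the trained model's accuracy clears the threshold."""
    with open(metrics_path) as f:
        metrics = json.load(f)
    accuracy = metrics["accuracy"]
    print(f"Accuracy: {accuracy:.4f} (threshold: {threshold})")
    return accuracy >= threshold


# Demo: write a metrics file like train.py's and gate on it.
with open("demo_metrics.json", "w") as f:
    json.dump({"accuracy": 0.91}, f)
print(check("demo_metrics.json", 0.85))  # → True
```

In the pipeline, the `__main__` entry point would call `sys.exit(0 if check(...) else 1)`; the non-zero exit is what makes the `sh` step, and therefore the stage, fail.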
12.7 Full Pipeline Flow Diagram
┌──────────────────────────────────────────────────────────────────────────┐
│                         COMPLETE END-TO-END FLOW                         │
│                                                                          │
│  DEVELOPER                                                               │
│   │  git push feature/new-model                                          │
│   ▼                                                                      │
│  GITHUB                                                                  │
│   ├── GitHub Actions: PR checks (lint + unit tests) ←── PR must pass     │
│   └── Merge to main → webhook → triggers Jenkins                         │
│                         │                                                │
│  JENKINS CI/CD PIPELINE ▼                                                │
│   ├── 1. Lint + Format                                                   │
│   ├── 2. Unit Tests (pytest)                                             │
│   ├── 3. DVC Pull data from GCS                                          │
│   ├── 4. Data Validation (Great Expectations)                            │
│   ├── 5. Train Model (logs to MLflow)                                    │
│   ├── 6. Quality Gate: accuracy ≥ 0.85? ──▶ fail → STOP ✗                │
│   ├── 7. Build Docker Image                                              │
│   ├── 8. Push Image to GCR                                               │
│   ├── 9. Deploy to GKE Staging                                           │
│   ├── 10. Integration Tests                                              │
│   └── 11. Manual Approval → Deploy to GKE Production                     │
│                         │                                                │
│  GKE PRODUCTION         ▼                                                │
│   ├── 3 Pods running churn-model:vX                                      │
│   ├── HPA: auto-scales 2→10 pods on load                                 │
│   └── Service: LoadBalancer → REST API                                   │
│                         │                                                │
│  MONITORING             ▼                                                │
│   ├── Prometheus: scrapes /metrics every 15s                             │
│   ├── Grafana: real-time dashboards                                      │
│   ├── Evidently: weekly drift reports                                    │
│   └── Alerts → Slack if latency/drift/errors spike                       │
│                         │                                                │
│  RETRAINING LOOP        ▼                                                │
│   └── Drift detected → trigger Jenkins retrain job ──▶ back to step 5    │
└──────────────────────────────────────────────────────────────────────────┘
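The retraining loop at the bottom of the diagram needs a concrete drift signal. The stack uses Evidently for its weekly reports; purely as a self-contained illustration of the statistical idea behind such a signal, here is a per-feature two-sample Kolmogorov–Smirnov check (the feature names, sample data, and 0.05 threshold are assumptions for the sketch):

```python
# Minimal drift-check sketch: the project uses Evidently; this only
# illustrates the per-feature statistical test underneath such tools.
import numpy as np
from scipy import stats


def drifted_features(reference: dict, current: dict, p_threshold: float = 0.05):
    """Compare each feature's live distribution to its training-time
    reference with a two-sample KS test; a low p-value suggests drift."""
    drifted = []
    for name in reference:
        statistic, p_value = stats.ks_2samp(reference[name], current[name])
        if p_value < p_threshold:
            drifted.append(name)
    return drifted


rng = np.random.default_rng(42)
ref = {"tenure": rng.normal(24, 6, 1000), "charges": rng.normal(70, 15, 1000)}
cur = {"tenure": rng.normal(24, 6, 1000), "charges": rng.normal(90, 15, 1000)}
print(drifted_features(ref, cur))  # "charges" shifted, so it should flag
```

A scheduled job running a check like this (or an Evidently report) would call the Jenkins API to kick off the retrain pipeline, re-entering the flow at step 5.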
12.8 Key Files Summary
| File | Purpose |
|---|---|
| Jenkinsfile | Full CI/CD pipeline definition |
| Dockerfile | Container packaging |
| dvc.yaml | Reproducible data + training pipeline |
| src/train.py | Model training + MLflow logging |
| src/serve.py | FastAPI inference server + Prometheus metrics |
| k8s/deployment.yaml | K8s production deployment |
| k8s/hpa.yaml | Auto-scaling config |
| monitoring/prometheus.yml | Metrics scraping |
| monitoring/alert_rules.yml | Alerting rules |
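`k8s/deployment.yaml` appears in the table but not in the chapter; a minimal sketch consistent with the flow diagram (3 replicas, the GCR image, a readiness probe on `/health`) — names, labels, and the namespace are assumptions:

```yaml
# k8s/deployment.yaml (minimal sketch; labels and namespace are assumptions)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-model
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: churn-model
  template:
    metadata:
      labels:
        app: churn-model
    spec:
      containers:
        - name: churn-model
          image: gcr.io/my-project/churn-model:latest
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
```

The `kubectl set image deployment/churn-model ...` steps in the Jenkinsfile update the `image:` field of exactly this Deployment, pinning it to the build-numbered tag rather than `latest`.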
12.9 Tools Summary Table
| Phase | Tool | Role |
|---|---|---|
| Version Control | Git + GitHub | Code tracking, PRs, collaboration |
| Data Versioning | DVC + GCS | Track datasets and model files |
| CI/CD | Jenkins / GitHub Actions | Automate build, test, deploy |
| Containerization | Docker | Package model + dependencies |
| Orchestration | Kubernetes / GKE | Deploy, scale, heal containers |
| Cloud | GCP (GCS, GCR, GKE) | Storage, registry, compute |
| AutoML | Vertex AI AutoML / Optuna | Automated model selection + HPO |
| Experiment Tracking | MLflow | Log + compare experiments |
| Monitoring | Prometheus + Grafana | Metrics + dashboards |
| Drift Detection | Evidently AI | Data + model drift |
| Alerting | PagerDuty / Slack | Notify on issues |
🎉 Congratulations! You now have a complete MLOps foundation.
Go back to README for the full table of contents.