Chapter 36: Cost Optimization in MLOps
"MLOps done wrong is expensive. MLOps done right pays for itself."
36.1 Where the Money Goes in MLOps
COST BREAKDOWN (typical ML system):
Training Compute: 30% ← GPUs, long jobs
Serving Infrastructure: 35% ← Always-on endpoints
Storage: 10% ← Data, models, logs
Orchestration: 10% ← Airflow, Kubeflow
Monitoring: 5% ← Prometheus, tools
Developer Tools: 5% ← MLflow, W&B licenses
Data Pipeline Compute: 5% ← Spark, Dataflow
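Expressed as code, the breakdown above doubles as a quick budgeting helper. The percentages are this chapter's typical split; the $20k monthly total is a made-up example, not a benchmark:

```python
# Rough monthly spend projection from the category breakdown above.
COST_BREAKDOWN = {
    "training_compute": 0.30,
    "serving_infrastructure": 0.35,
    "storage": 0.10,
    "orchestration": 0.10,
    "monitoring": 0.05,
    "developer_tools": 0.05,
    "data_pipeline_compute": 0.05,
}

def project_spend(monthly_total_usd):
    """Split a total monthly budget across the typical ML cost categories."""
    return {k: round(monthly_total_usd * share, 2)
            for k, share in COST_BREAKDOWN.items()}

print(project_spend(20_000))
```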
36.2 Training Cost Optimization
```shell
# Use spot/preemptible instances (60-90% discount)
# GCP: preemptible VMs
gcloud compute instances create training-vm \
    --preemptible \
    --machine-type=n1-highmem-8 \
    --accelerator=type=nvidia-tesla-t4,count=1
```

```python
# Vertex AI: spot instances
job.run(
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    spot=True,  # ~60% cheaper, may be interrupted
    restart_job_on_worker_restart=True,  # auto-resume from checkpoint
)
```
Checkpointing for Spot Instances
```python
# Always checkpoint so a spot interruption doesn't waste progress.
# Note: plain open() can't write to gs:// paths; use fsspec (with gcsfs installed).
import pickle
import fsspec

CHECKPOINT_DIR = "gs://my-bucket/checkpoints"

def save_checkpoint(model, epoch, metrics):
    checkpoint = {
        "epoch": epoch,
        "model": model,
        "metrics": metrics,
    }
    # Zero-pad the epoch so lexicographic sorting matches numeric order
    path = f"{CHECKPOINT_DIR}/checkpoint_epoch_{epoch:04d}.pkl"
    with fsspec.open(path, "wb") as f:
        pickle.dump(checkpoint, f)
    print(f"Checkpoint saved: {path}")

def load_latest_checkpoint():
    # Find the latest checkpoint and resume from it
    fs, _ = fsspec.core.url_to_fs(CHECKPOINT_DIR)
    paths = sorted(fs.glob(f"{CHECKPOINT_DIR}/checkpoint_epoch_*.pkl"))
    if not paths:
        return None
    with fs.open(paths[-1], "rb") as f:
        return pickle.load(f)
```
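Checkpoints only save money if the restarted job actually resumes from them. A minimal local-filesystem sketch of the resume loop (swap the directory for a gs:// bucket in practice; the loss values are placeholders):

```python
import pickle
from pathlib import Path

CKPT_DIR = Path("checkpoints")  # local stand-in for the gs:// bucket above
CKPT_DIR.mkdir(exist_ok=True)

def save_state(state, epoch):
    # Zero-padded epoch so lexicographic sort equals numeric sort
    (CKPT_DIR / f"epoch_{epoch:04d}.pkl").write_bytes(pickle.dumps(state))

def load_latest():
    ckpts = sorted(CKPT_DIR.glob("epoch_*.pkl"))
    return pickle.loads(ckpts[-1].read_bytes()) if ckpts else None

def train(total_epochs):
    # On a fresh VM this starts at epoch 0; after a preemption it resumes
    # from the last checkpointed epoch instead of restarting from scratch.
    state = load_latest() or {"epoch": -1, "loss": None}
    for epoch in range(state["epoch"] + 1, total_epochs):
        state = {"epoch": epoch, "loss": 1.0 / (epoch + 1)}  # placeholder metric
        save_state(state, epoch)
    return state
```

If the spot VM is preempted after epoch 7 of 50, the restarted job begins at epoch 8 rather than epoch 0.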
36.3 Serving Cost Optimization
```shell
# Scale to zero for low-traffic endpoints.
# Cloud Run auto-scales to 0 and costs $0 when idle;
# --min-instances=0 is what enables scale to zero.
gcloud run deploy churn-model \
    --image=gcr.io/my-project/churn-model:latest \
    --min-instances=0 \
    --max-instances=10 \
    --memory=2Gi \
    --cpu=2
```
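Back-of-the-envelope arithmetic shows why scale-to-zero matters for low-traffic services. The $/hour price below is hypothetical, not a published Cloud Run rate:

```python
# With scale-to-zero you pay only for busy hours; with min-instances=1
# you pay for the full month regardless of traffic.
HOURS_PER_MONTH = 730

def monthly_cost(price_per_hour, busy_hours_per_month, min_instances):
    always_on_hours = min_instances * HOURS_PER_MONTH
    billed_hours = max(always_on_hours, busy_hours_per_month)
    return round(price_per_hour * billed_hours, 2)

# 10 busy hours/month at a hypothetical $0.10/hour:
print(monthly_cost(0.10, 10, min_instances=1))  # -> 73.0 (always-on)
print(monthly_cost(0.10, 10, min_instances=0))  # -> 1.0 (scale-to-zero)
```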
```shell
# K8s: scale down at night
kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  minReplicas: 1  # 1 at night (not 0: avoids cold starts in prod)
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
EOF
```
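The HPA above scales using Kubernetes' documented formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A sketch:

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=20):
    # Core HPA formula: desired = ceil(current * currentMetric / targetMetric),
    # then clamped to the [minReplicas, maxReplicas] bounds from the manifest.
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# With the manifest above (target 70% CPU): 4 pods averaging 140% CPU
print(desired_replicas(4, 140, 70))  # -> 8
```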
36.4 Cost Optimization Strategies Summary
TRAINING:
✅ Use spot/preemptible instances (60-90% discount)
✅ Checkpoint frequently → resume after interruption
✅ Right-size compute (don't use A100 for sklearn)
✅ Use caching in pipelines (DVC, Vertex AI enable_caching=True)
✅ Early stopping → don't train longer than needed
✅ Smaller batch experiments first → only scale winning configs
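The early-stopping item above can be as simple as a patience counter; a framework-agnostic sketch (the validation-loss curve is made up):

```python
class EarlyStopping:
    """Stop training once validation loss hasn't improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss       # improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1       # no improvement this epoch
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
losses = [0.9, 0.7, 0.71, 0.72, 0.73]  # hypothetical validation curve
stopped_at = next(i for i, loss in enumerate(losses) if stopper.should_stop(loss))
print(stopped_at)  # -> 3 (epochs 4+ never run, and never get billed)
```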
SERVING:
✅ Scale to zero for dev/staging endpoints
✅ Use autoscaling (HPA) — don't over-provision
✅ Batch requests where possible (avoid per-request overhead)
✅ Use smaller models where quality allows
✅ Cache predictions for repeated inputs (Redis/Memcached)
✅ Model quantization → 2-4x smaller → cheaper inference
STORAGE:
✅ Set data lifecycle policies (delete raw data after 90 days)
✅ Compress old model checkpoints
✅ Use nearline/coldline for rarely accessed data
✅ Delete failed experiment artifacts automatically
LLMs:
✅ Use smaller model (GPT-3.5 vs GPT-4) where possible
✅ Prompt compression (fewer tokens = less cost)
✅ Implement semantic caching (cache similar queries)
✅ Fine-tune small model → replace large model API calls
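The semantic-caching item from the LLM list, sketched with a toy similarity measure. Production systems compare embedding vectors; token-set Jaccard overlap here is only a stand-in, and the threshold is arbitrary:

```python
def similarity(a, b):
    # Toy stand-in for embedding similarity: token-set Jaccard overlap
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.6):
        self.threshold = threshold
        self.entries = []  # list of (prompt, answer)

    def get(self, prompt):
        best = max(self.entries, key=lambda e: similarity(prompt, e[0]), default=None)
        if best and similarity(prompt, best[0]) >= self.threshold:
            return best[1]  # close enough: skip the expensive LLM call
        return None

    def put(self, prompt, answer):
        self.entries.append((prompt, answer))

cache = SemanticCache(threshold=0.6)
cache.put("what is our refund policy", "30 days, no questions asked")
print(cache.get("what is the refund policy"))  # -> 30 days, no questions asked
```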
36.5 Cost Monitoring
```python
from google.cloud import aiplatform

# Track compute costs with labels: jobs then show up in billing exports
# grouped by team / cost-center / project.
job = aiplatform.CustomTrainingJob(
    display_name="churn-training",
    script_path="train.py",                              # illustrative value
    container_uri="gcr.io/my-project/training:latest",   # illustrative value
    labels={
        "team": "ml-platform",
        "cost-center": "product-analytics",
        "project": "churn-prediction",
    },
)
```
```shell
# Set budget alerts in GCP (threshold percents are fractions: 0.8 = 80%)
gcloud billing budgets create \
    --billing-account=BILLING_ACCOUNT_ID \
    --display-name="ML Training Budget" \
    --budget-amount=1000USD \
    --threshold-rule=percent=0.8,basis=current-spend \
    --threshold-rule=percent=1.0,basis=current-spend \
    --all-updates-rule-monitoring-notification-channels=CHANNEL_ID
```
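The same 80%/100% thresholds can also be checked pipeline-side, e.g. as a pre-flight guard before launching an expensive job (a minimal sketch):

```python
def triggered_thresholds(current_spend, budget, thresholds=(0.8, 1.0)):
    """Return the budget-threshold fractions the current spend has crossed."""
    return [t for t in thresholds if current_spend >= t * budget]

print(triggered_thresholds(850, 1000))   # -> [0.8]
print(triggered_thresholds(1200, 1000))  # -> [0.8, 1.0]
```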
36.6 Cost vs Quality Trade-offs
COST vs QUALITY MATRIX:
High quality, high cost → fine for critical decisions (fraud, healthcare)
High quality, low cost  → IDEAL TARGET (quantized model, batch serving)
Low quality, high cost  → AVOID (large model, low accuracy)
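One way to operationalize the matrix: among candidate models, deploy the cheapest one that still clears the quality bar. The candidates and numbers below are hypothetical:

```python
CANDIDATES = [
    {"name": "large-fp32", "cost_per_1k": 0.40, "accuracy": 0.95},
    {"name": "large-int8", "cost_per_1k": 0.12, "accuracy": 0.94},
    {"name": "small-fp32", "cost_per_1k": 0.05, "accuracy": 0.88},
]

def cheapest_meeting_bar(candidates, min_accuracy):
    # Filter to models that clear the quality bar, then take the cheapest
    ok = [c for c in candidates if c["accuracy"] >= min_accuracy]
    return min(ok, key=lambda c: c["cost_per_1k"]) if ok else None

print(cheapest_meeting_bar(CANDIDATES, 0.93)["name"])  # -> large-int8
```

If no candidate meets the bar, the function returns None: that is the "low quality, low cost" corner, and the right answer is usually to improve a model, not to ship one.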