
Chapter 36: Cost Optimization in MLOps🔗

"MLOps done wrong is expensive. MLOps done right pays for itself."


36.1 Where the Money Goes in MLOps🔗

COST BREAKDOWN (typical ML system):

  Training Compute:           30%   ← GPUs, long jobs
  Serving Infrastructure:     35%   ← Always-on endpoints
  Storage:                    10%   ← Data, models, logs
  Orchestration:              10%   ← Airflow, Kubeflow
  Monitoring:                  5%   ← Prometheus, tools
  Developer Tools:             5%   ← MLflow, W&B licenses
  Data Pipeline Compute:       5%   ← Spark, Dataflow
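To see what these proportions mean for your own bill, a quick sketch (the percentages are the illustrative figures above, not universal constants; plug in numbers from your billing export):

```python
# Rough monthly-spend estimator using the illustrative breakdown above.
BREAKDOWN = {
    "training_compute": 0.30,
    "serving_infra": 0.35,
    "storage": 0.10,
    "orchestration": 0.10,
    "monitoring": 0.05,
    "developer_tools": 0.05,
    "data_pipeline": 0.05,
}

def estimate_costs(monthly_total: float) -> dict:
    """Split a total monthly bill across categories."""
    assert abs(sum(BREAKDOWN.values()) - 1.0) < 1e-9  # shares must sum to 100%
    return {k: round(monthly_total * v, 2) for k, v in BREAKDOWN.items()}

print(estimate_costs(10_000))
```

At a $10k/month bill, serving alone is $3.5k: the single biggest lever, which is why sections 36.2 and 36.3 focus on training and serving.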

36.2 Training Cost Optimization🔗

# Use spot/preemptible instances (70-90% discount)
# GCP: preemptible VMs
gcloud compute instances create training-vm \
  --preemptible \
  --machine-type=n1-highmem-8 \
  --accelerator=type=nvidia-tesla-t4,count=1

# Vertex AI: spot instances
job.run(
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    spot=True,                  # ~60% cheaper, may be interrupted
    restart_job_on_worker_restart=True,  # auto-resume from checkpoint
)

Checkpointing for Spot Instances🔗

# Always checkpoint so a spot interruption doesn't waste progress.
# Note: plain open() cannot write to gs:// paths; fsspec (with the gcsfs
# backend installed) handles them transparently.
import pickle

import fsspec

CHECKPOINT_DIR = "gs://my-bucket/checkpoints"

def save_checkpoint(model, epoch, metrics):
    checkpoint = {
        "epoch": epoch,
        "model": model,
        "metrics": metrics,
    }
    path = f"{CHECKPOINT_DIR}/checkpoint_epoch_{epoch}.pkl"
    with fsspec.open(path, "wb") as f:
        pickle.dump(checkpoint, f)
    print(f"Checkpoint saved: {path}")

def load_latest_checkpoint():
    # Find the newest checkpoint by epoch number and resume from it
    fs = fsspec.filesystem("gs")
    paths = fs.glob(f"{CHECKPOINT_DIR}/checkpoint_epoch_*.pkl")
    if not paths:
        return None  # no checkpoint yet: start from scratch
    latest = max(paths, key=lambda p: int(p.rsplit("_", 1)[-1].removesuffix(".pkl")))
    with fs.open(latest, "rb") as f:
        return pickle.load(f)

36.3 Serving Cost Optimization🔗

# Scale to zero for low-traffic endpoints
# Cloud Run: auto-scales to 0, costs $0 when idle
# (--min-instances=0 is the scale-to-zero setting; a trailing comment
#  after a backslash would break the line continuation)
gcloud run deploy churn-model \
  --image=gcr.io/my-project/churn-model:latest \
  --min-instances=0 \
  --max-instances=10 \
  --memory=2Gi \
  --cpu=2

# K8s: autoscale on load (replicas drop to the minimum overnight)
kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  minReplicas: 1     # keep 1 warm in prod (a plain HPA cannot scale to 0; avoids cold starts)
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
EOF

36.4 Cost Optimization Strategies Summary🔗

TRAINING:
  ✅ Use spot/preemptible instances (60-90% discount)
  ✅ Checkpoint frequently → resume after interruption
  ✅ Right-size compute (don't use A100 for sklearn)
  ✅ Use caching in pipelines (DVC, Vertex AI enable_caching=True)
  ✅ Early stopping → don't train longer than needed
  ✅ Smaller batch experiments first → only scale winning configs
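Early stopping from the list above can be as small as a patience counter. A framework-agnostic sketch (class and parameter names are mine):

```python
class EarlyStopper:
    """Stop training when the validation loss hasn't improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1      # no improvement this epoch
        return self.bad_epochs >= self.patience
```

In the training loop: `if stopper.should_stop(val_loss): break`. Every epoch skipped is GPU time not billed.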

SERVING:
  ✅ Scale to zero for dev/staging endpoints
  ✅ Use autoscaling (HPA) — don't over-provision
  ✅ Batch requests where possible (avoid per-request overhead)
  ✅ Use smaller models where quality allows
  ✅ Cache predictions for repeated inputs (Redis/Memcached)
  ✅ Model quantization → 2-4x smaller → cheaper inference
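Prediction caching for repeated inputs is only a few lines. A sketch keyed on a hash of the input features; a production version would swap the dict for Redis or Memcached with a TTL, as the list suggests:

```python
import hashlib
import json

_cache: dict = {}

def cached_predict(model_predict, features: dict):
    """Return a cached prediction for previously seen inputs."""
    # Stable key: hash the JSON-serialized features (sorted keys)
    key = hashlib.sha256(json.dumps(features, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = model_predict(features)  # cache miss: pay for inference once
    return _cache[key]
```

Whether this helps depends entirely on how often identical inputs recur; measure the hit rate before relying on it.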

STORAGE:
  ✅ Set data lifecycle policies (delete raw data after 90 days)
  ✅ Compress old model checkpoints
  ✅ Use nearline/coldline for rarely accessed data
  ✅ Delete failed experiment artifacts automatically
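The storage items above map directly onto a GCS lifecycle config. A sketch (the bucket name and the 30/90-day windows are illustrative; tune them to your retention requirements):

```shell
# lifecycle.json: move objects to Coldline after 30 days, delete after 90
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
     "condition": {"age": 30}},
    {"action": {"type": "Delete"}, "condition": {"age": 90}}
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://my-raw-data-bucket
```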

LLMs:
  ✅ Use smaller model (GPT-3.5 vs GPT-4) where possible
  ✅ Prompt compression (fewer tokens = less cost)
  ✅ Implement semantic caching (cache similar queries)
  ✅ Fine-tune small model → replace large model API calls
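Semantic caching means embedding each incoming query and reusing a stored answer when cosine similarity to a past query clears a threshold. A minimal in-memory sketch; `embed` is a stand-in for a real embedding model, and a production version would use a vector store instead of a linear scan:

```python
import math

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed        # callable: text -> list[float]
        self.threshold = threshold
        self.entries = []         # list of (vector, response)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query: str):
        q = self.embed(query)
        for vec, response in self.entries:
            if self._cosine(q, vec) >= self.threshold:
                return response   # close enough: skip the LLM call
        return None

    def put(self, query: str, response: str):
        self.entries.append((self.embed(query), response))
```

On a cache hit you pay one cheap embedding call instead of a full LLM completion; the threshold trades hit rate against the risk of serving a stale or mismatched answer.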

36.5 Cost Monitoring🔗

# Track compute costs with labels (Vertex AI example)
from google.cloud import aiplatform

job = aiplatform.CustomTrainingJob(
    display_name="churn-training",
    script_path="train.py",                      # required by the SDK (illustrative)
    container_uri="gcr.io/my-project/trainer",   # required by the SDK (illustrative)
    labels={
        "team": "ml-platform",
        "cost-center": "product-analytics",
        "project": "churn-prediction",
    },
)
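Once jobs carry labels, billing exports can be sliced by them. A sketch that sums a billing CSV export per label; the column names here are assumptions about your export format, not a fixed schema:

```python
import csv
from collections import defaultdict

def cost_by_label(csv_path: str, label_col: str = "label.team",
                  cost_col: str = "cost") -> dict:
    """Sum exported billing rows by a label column; unlabeled rows grouped together."""
    totals = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            team = row.get(label_col) or "unlabeled"
            totals[team] += float(row[cost_col])
    return dict(totals)
```

Rows with no label land in an "unlabeled" bucket; a large unlabeled total is itself a signal that jobs are being launched without cost attribution.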

# Set budget alerts in GCP
gcloud billing budgets create \
  --billing-account=BILLING_ACCOUNT_ID \
  --display-name="ML Training Budget" \
  --budget-amount=1000USD \
  --threshold-rule=percent=0.8,basis=current-spend \
  --threshold-rule=percent=1.0,basis=current-spend \
  --all-updates-rule-monitoring-notification-channels=CHANNEL_ID

36.6 Cost vs Quality Trade-offs🔗

                    COST vs QUALITY MATRIX

  HIGH QUALITY ▲
               │  High Q, Low Cost:       │  High Q, High Cost:
               │  IDEAL TARGET            │  fine for critical
               │  (quantized model,       │  decisions
               │  batch serving)          │  (fraud, health)
               ├──────────────────────────┼──────────────────────────
               │                          │  Low Q, High Cost:
               │                          │  AVOID (large model,
               │                          │  low accuracy)
               └──────────────────────────┴─────────────────────────▶
                                                    Cost ($)    HIGH

Next → Chapter 37: End-to-End Project