
Chapter 07: Kubernetes (K8s) for MLOps

"Kubernetes is the operating system for your cloud β€” it runs, scales, and heals your ML services automatically."


7.1 What is Kubernetes?

Kubernetes (K8s) is an open-source container orchestration platform. It manages how Docker containers are deployed, scaled, and maintained across a cluster of machines.

Without vs With Kubernetes

WITHOUT K8s:                        WITH K8s:
  Server 1: runs container          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  Container crashes β†’ πŸ’₯ down!      β”‚   KUBERNETES CLUSTER    β”‚
  Traffic spikes β†’ πŸ’₯ slow!         β”‚                         β”‚
  Update β†’ πŸ’₯ downtime!             β”‚  β”Œβ”€β”€β” β”Œβ”€β”€β” β”Œβ”€β”€β” β”Œβ”€β”€β”   β”‚
                                    β”‚  β”‚P1β”‚ β”‚P2β”‚ β”‚P3β”‚ β”‚P4β”‚   β”‚
                                    β”‚  β””β”€β”€β”˜ β””β”€β”€β”˜ β””β”€β”€β”˜ β””β”€β”€β”˜   β”‚
                                    β”‚  Auto-heal βœ…            β”‚
                                    β”‚  Auto-scale βœ…           β”‚
                                    β”‚  Zero-downtime βœ…        β”‚
                                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

7.2 Kubernetes Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       KUBERNETES CLUSTER                             β”‚
β”‚                                                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                            β”‚
β”‚  β”‚           CONTROL PLANE              β”‚                            β”‚
β”‚  β”‚                                      β”‚                            β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚                            β”‚
β”‚  β”‚  β”‚  API     β”‚  β”‚   Scheduler      β”‚  β”‚                            β”‚
β”‚  β”‚  β”‚ Server   β”‚  β”‚ (where to place  β”‚  β”‚                            β”‚
β”‚  β”‚  β”‚          β”‚  β”‚  new pods)       β”‚  β”‚                            β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚                            β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚                            β”‚
β”‚  β”‚  β”‚Controllerβ”‚  β”‚   etcd           β”‚  β”‚                            β”‚
β”‚  β”‚  β”‚ Manager  β”‚  β”‚ (cluster state   β”‚  β”‚                            β”‚
β”‚  β”‚  β”‚          β”‚  β”‚  database)       β”‚  β”‚                            β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚                            β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                            β”‚
β”‚                                                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚
β”‚  β”‚   WORKER NODE  β”‚  β”‚   WORKER NODE  β”‚  β”‚   WORKER NODE  β”‚          β”‚
β”‚  β”‚                β”‚  β”‚                β”‚  β”‚                β”‚          β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β” β”‚  β”‚  β”Œβ”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β” β”‚  β”‚  β”Œβ”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β” β”‚          β”‚
β”‚  β”‚  β”‚Pod β”‚ β”‚Pod β”‚ β”‚  β”‚  β”‚Pod β”‚ β”‚Pod β”‚ β”‚  β”‚  β”‚Pod β”‚ β”‚Pod β”‚ β”‚          β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”˜ β”‚  β”‚  β””β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”˜ β”‚  β”‚  β””β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”˜ β”‚          β”‚
β”‚  β”‚  Kubelet       β”‚  β”‚  Kubelet       β”‚  β”‚  Kubelet       β”‚          β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

7.3 Key Kubernetes Objects

Object       What It Does
-----------  ----------------------------------------------------------
Pod          Smallest unit β€” wraps one or more containers
Deployment   Manages Pods (desired state, rolling updates)
Service      Stable network endpoint to reach Pods
Ingress      HTTP routing / load balancing for external traffic
ConfigMap    Stores non-secret config data
Secret       Stores sensitive data (API keys, passwords)
HPA          Horizontal Pod Autoscaler β€” scales Pods based on CPU/memory
Namespace    Logical isolation (dev, staging, prod)
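
You rarely create bare Pods by hand (the Deployment in 7.5 manages them for you), but a standalone Pod manifest makes the smallest unit concrete. A minimal sketch, reusing the image and port from this chapter's examples:

```yaml
# k8s/pod.yaml (illustrative -- in practice, let a Deployment manage Pods)
apiVersion: v1
kind: Pod
metadata:
  name: ml-model-pod
  labels:
    app: ml-model
spec:
  containers:
  - name: ml-model
    image: gcr.io/my-project/ml-model:v2   # image from this chapter's examples
    ports:
    - containerPort: 8000
```

A bare Pod is not rescheduled if its node dies; that self-healing comes from the Deployment/ReplicaSet layer above it.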

7.4 Kubernetes Objects Hierarchy

Cluster
  └── Namespace (prod)
        β”œβ”€β”€ Deployment (ml-model)
        β”‚     └── ReplicaSet
        β”‚           β”œβ”€β”€ Pod 1 [container: ml-model:v2]
        β”‚           β”œβ”€β”€ Pod 2 [container: ml-model:v2]
        β”‚           └── Pod 3 [container: ml-model:v2]
        β”œβ”€β”€ Service (ml-model-svc)  β†’ routes traffic to Pods
        β”œβ”€β”€ Ingress                 β†’ routes external HTTP
        β”œβ”€β”€ ConfigMap (model-config)
        └── Secret (gcp-credentials)

7.5 Deploying an ML Model to Kubernetes

Deployment YAML

# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model
  namespace: production
  labels:
    app: ml-model
    version: v2

spec:
  replicas: 3                    # run 3 instances

  selector:
    matchLabels:
      app: ml-model

  strategy:
    type: RollingUpdate          # zero-downtime updates
    rollingUpdate:
      maxSurge: 1                # add 1 extra pod during update
      maxUnavailable: 0          # never kill pod before new one is ready

  template:
    metadata:
      labels:
        app: ml-model

    spec:
      containers:
      - name: ml-model
        image: gcr.io/my-project/ml-model:v2
        ports:
        - containerPort: 8000

        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"

        env:
        - name: MODEL_VERSION
          value: "v2"
        - name: GCP_PROJECT
          valueFrom:
            secretKeyRef:
              name: gcp-credentials
              key: project_id

        readinessProbe:          # don't route traffic until ready
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 10
          periodSeconds: 5

        livenessProbe:           # restart if unhealthy
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
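
Both probes above assume the container serves a /health endpoint on port 8000. What that endpoint looks like is up to your serving code; a minimal stdlib sketch (handler name and response body are illustrative, not from this chapter):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Answers the kubelet's readiness/liveness probes with 200 OK."""

    def do_GET(self):
        if self.path == "/health":
            body = b'{"status": "ok"}'
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep probe traffic out of the logs

# To serve on the containerPort from the manifest (blocking call):
# HTTPServer(("0.0.0.0", 8000), HealthHandler).serve_forever()
```

A real readiness check should also verify the model is actually loaded; returning 200 before the model is in memory defeats the purpose of the probe.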

Service YAML

# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
  namespace: production

spec:
  selector:
    app: ml-model             # routes to pods with this label

  type: LoadBalancer          # exposes externally (GCP creates a Load Balancer)

  ports:
  - protocol: TCP
    port: 80                  # external port
    targetPort: 8000          # container port

Horizontal Pod Autoscaler

# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
  namespace: production

spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model

  minReplicas: 2
  maxReplicas: 10

  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70    # scale up when avg CPU > 70%
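
The rule the HPA applies is proportional: desiredReplicas = ceil(currentReplicas Γ— currentMetric / targetMetric), clamped to the min/max bounds. A sketch of that arithmetic (function name is ours; the real controller also adds stabilization windows and a tolerance band):

```python
import math

def desired_replicas(current, current_util, target_util,
                     min_replicas=2, max_replicas=10):
    """HPA core rule: scale in proportion to the metric ratio, then clamp."""
    desired = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))

# 3 pods averaging 90% CPU against a 70% target:
# ceil(3 * 90 / 70) = 4, so one pod is added.
```

Note the symmetry: the same formula scales down when average utilization falls below the target, bounded by minReplicas.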

7.6 kubectl Commands Cheatsheet

# ── Cluster info ──────────────────────────────────
kubectl cluster-info
kubectl get nodes

# ── Apply configs ─────────────────────────────────
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/                    # apply all yamls in folder

# ── View resources ────────────────────────────────
kubectl get pods -n production
kubectl get deployments -n production
kubectl get services -n production
kubectl get all -n production

# ── Debugging ────────────────────────────────────
kubectl logs pod/ml-model-xyz-abc -n production
kubectl describe pod ml-model-xyz-abc -n production
kubectl exec -it ml-model-xyz-abc -n production -- bash

# ── Scaling ───────────────────────────────────────
kubectl scale deployment ml-model --replicas=5 -n production

# ── Updates ───────────────────────────────────────
kubectl set image deployment/ml-model ml-model=gcr.io/my-project/ml-model:v3 -n production
kubectl rollout status deployment/ml-model -n production

# ── Rollback ──────────────────────────────────────
kubectl rollout undo deployment/ml-model -n production
kubectl rollout history deployment/ml-model -n production

7.7 Namespaces for Environments

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚          NAMESPACE ISOLATION              β”‚
β”‚                                           β”‚
β”‚  namespace: dev       β†’ developers        β”‚
β”‚  namespace: staging   β†’ QA/testing        β”‚
β”‚  namespace: production β†’ live traffic     β”‚
β”‚                                           β”‚
β”‚  Each namespace has own:                  β”‚
β”‚  β”œβ”€β”€ Deployments                          β”‚
β”‚  β”œβ”€β”€ Services                             β”‚
β”‚  β”œβ”€β”€ ConfigMaps                           β”‚
β”‚  └── Secrets                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

7.8 Rolling Update Flow

BEFORE UPDATE (3 pods running v1):
  [v1] [v1] [v1]

ROLLING UPDATE to v2:
Step 1: Add v2 pod
  [v1] [v1] [v1] [v2]

Step 2: Remove 1 v1 pod
  [v1] [v1] [v2]

Step 3: Add another v2
  [v1] [v1] [v2] [v2]

Step 4: Remove another v1
  [v1] [v2] [v2]

Step 5: Add last v2
  [v1] [v2] [v2] [v2]

Step 6: Remove last v1 (update complete)
  [v2] [v2] [v2]

β†’ Zero downtime! Traffic served continuously.
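
A toy simulation of this maxSurge: 1 / maxUnavailable: 0 policy (our own model: every listed pod counts as Ready; a real rollout waits on readiness probes between steps):

```python
def rolling_update(replicas=3, max_surge=1):
    """Simulate a RollingUpdate from v1 to v2 with maxUnavailable: 0.
    Returns the pod list after each step."""
    pods = ["v1"] * replicas
    history = [list(pods)]
    while pods.count("v2") < replicas or len(pods) > replicas:
        if len(pods) < replicas + max_surge and pods.count("v2") < replicas:
            pods.append("v2")    # surge: bring up a new-version pod
        else:
            pods.remove("v1")    # new pod is Ready, retire an old one
        history.append(list(pods))
    return history

for step in rolling_update():
    print(step)
```

With 3 replicas the invariant holds at every step: at least 3 pods serving, at most 4 running, which is exactly what maxUnavailable: 0 and maxSurge: 1 promise.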

Next Chapter β†’ 08: Google Cloud Platform (GCP)