
Chapter 32: MLOps Security

"An insecure ML system is a liability. Secure MLOps protects your data, your models, and your users."


32.1 MLOps Security Threat Landscape

THREATS TO ML SYSTEMS:

  DATA THREATS                     MODEL THREATS
  β”œβ”€β”€ Data poisoning                β”œβ”€β”€ Model theft/extraction
  β”‚   (inject bad training data)    β”‚   (adversary queries β†’ steal model)
  β”œβ”€β”€ Training data leakage         β”œβ”€β”€ Adversarial attacks
  β”‚   (PII in model weights)        β”‚   (crafted inputs β†’ wrong output)
  └── Unauthorized access           β”œβ”€β”€ Model inversion
      to datasets                   β”‚   (recover training data from model)
                                    └── Prompt injection (LLMs)

  INFRASTRUCTURE THREATS            SUPPLY CHAIN THREATS
  β”œβ”€β”€ Compromised CI/CD             β”œβ”€β”€ Malicious dependencies
  β”œβ”€β”€ Container vulnerabilities     β”œβ”€β”€ Compromised base images
  β”œβ”€β”€ K8s RBAC misconfiguration     └── Poisoned datasets from 3rd party
  └── Exposed API endpoints
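One practical defense against the supply-chain threats above (poisoned or silently swapped third-party datasets) is to pin every training file to a known-good checksum and refuse to train on anything that does not match. A minimal sketch; the manifest format (filename to SHA-256 digest) is an illustrative convention, not a standard:

```python
import hashlib
from pathlib import Path

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large datasets are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_dataset(path: str, manifest: dict) -> bool:
    """Check a file against a pinned manifest mapping filename -> sha256 digest."""
    expected = manifest.get(Path(path).name)
    return expected is not None and sha256_of(path) == expected
```

Run the check as the first step of the training job and fail loudly on a mismatch; the manifest itself should live in version control so changes to data are reviewed like changes to code.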

32.2 IAM & Access Control

Principle of Least Privilege: Grant only the minimum permissions needed.

# GCP: Create service account with minimum permissions

# Training job SA β€” only needs storage and Vertex AI training
gcloud iam service-accounts create ml-trainer-sa

gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:ml-trainer-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"  # read training data

gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:ml-trainer-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectCreator"  # write model artifacts

gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:ml-trainer-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"  # submit Vertex AI jobs

# Serving SA β€” read-only access to model artifacts
gcloud iam service-accounts create ml-serving-sa

gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:ml-serving-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

Kubernetes RBAC for ML

# k8s/rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-model-role
  namespace: production
rules:
  # Only allow reading pods (for health checks)
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
  # No write access to deployments β€” only CI/CD has that
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-model-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: ml-serving-sa
    namespace: production  # required for ServiceAccount subjects
roleRef:
  kind: Role
  name: ml-model-role
  apiGroup: rbac.authorization.k8s.io

32.3 Secrets Management

Never hard-code API keys, passwords, or credentials in code or Docker images.

# Use Kubernetes Secrets
kubectl create secret generic ml-secrets \
  --from-literal=mlflow-uri="http://mlflow:5000" \
  --from-literal=gcp-project="my-project" \
  --from-file=gcp-sa-key=./sa-key.json \
  -n production

# Use the secrets in the deployment spec
spec:
  containers:
  - name: ml-model
    env:
    - name: MLFLOW_TRACKING_URI
      valueFrom:
        secretKeyRef:
          name: ml-secrets
          key: mlflow-uri
    - name: GCP_PROJECT
      valueFrom:
        secretKeyRef:
          name: ml-secrets
          key: gcp-project
    # Point Google client libraries at the mounted key file
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: /secrets/gcp-sa-key
    volumeMounts:
    - name: gcp-key
      mountPath: /secrets
      readOnly: true
  volumes:
  - name: gcp-key
    secret:
      secretName: ml-secrets

// In Jenkins: use credentials binding
withCredentials([
    string(credentialsId: 'mlflow-uri', variable: 'MLFLOW_URI'),
    file(credentialsId: 'gcp-sa-key', variable: 'GOOGLE_APPLICATION_CREDENTIALS'),
]) {
    sh 'python src/train.py'
}

Never put secrets in:
- Git repositories (even private)
- Docker images (build args are visible in the image history; use BuildKit secret mounts or runtime injection)
- CI/CD logs
- Environment variables in Dockerfiles
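The "never" list above can be enforced mechanically in CI. Dedicated scanners such as gitleaks or truffleHog are the usual choice; as an illustration of the idea, here is a toy regex-based check (the patterns shown are a small, deliberately incomplete sample):

```python
import re

# A few common credential shapes -- real scanners ship hundreds of rules.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                          # AWS access key ID
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),  # PEM private key
    re.compile(r"(?i)(api[_-]?key|password)\s*[:=]\s*['\"][^'\"]{8,}"),
]

def find_secrets(text: str) -> list:
    """Return matched substrings so CI can fail the build and report them."""
    hits = []
    for pattern in SECRET_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits
```

Wire this (or better, a real scanner) into a pre-commit hook and the CI pipeline so a leaked key blocks the merge instead of landing in Git history.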


32.4 Container Security

# Security best practices in Dockerfile

# 1. Use specific image versions (not :latest)
FROM python:3.10.12-slim

# 2. Run as non-root user
RUN useradd --create-home --uid 1001 appuser
USER appuser

# 3. Read-only filesystem
# (set in K8s: securityContext.readOnlyRootFilesystem: true)

# 4. Don't copy unnecessary files
COPY --chown=appuser:appuser src/ ./src/
COPY --chown=appuser:appuser models/ ./models/
# DON'T: COPY . .  (copies .env, keys, etc.)

# K8s Pod security context
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1001
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    seccompProfile:
      type: RuntimeDefault
    capabilities:
      drop: ["ALL"]
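The Dockerfile rules above are easy to gate in CI. Real linters and scanners (hadolint, trivy) cover far more ground; purely to illustrate the idea, a toy check for two of the rules:

```python
def lint_dockerfile(text: str) -> list:
    """Flag unpinned base images and a missing USER directive (toy linter)."""
    problems = []
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    for line in lines:
        tokens = line.split()
        if tokens[0].upper() == "FROM" and len(tokens) > 1:
            image = tokens[1]
            if ":" not in image or image.endswith(":latest"):
                problems.append(f"unpinned base image: {image}")
    if not any(l.split()[0].upper() == "USER" for l in lines):
        problems.append("no USER directive -- container runs as root")
    return problems
```

An empty result means the two checks pass; anything returned should fail the build.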

32.5 Network Security

# K8s NetworkPolicy β€” restrict communication
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ml-model-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: ml-model
  policyTypes: ["Ingress", "Egress"]

  # Allow incoming traffic only from API gateway
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: api-gateway
      ports:
        - port: 8000

  # Allow outgoing traffic only to the feature store and MLflow
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: feature-store
      ports:
        - port: 6566
    - to:
        - podSelector:
            matchLabels:
              app: mlflow   # label assumed; match your MLflow deployment
      ports:
        - port: 5000

32.6 Audit Logging

# Log all predictions for audit trail
import logging
import json
from datetime import datetime

audit_logger = logging.getLogger("audit")
audit_logger.setLevel(logging.INFO)  # default effective level (WARNING) would drop these records
audit_handler = logging.FileHandler("logs/audit.log")
audit_logger.addHandler(audit_handler)

@app.post("/predict")
def predict(req: PredictionRequest, request: Request):
    response = make_prediction(req)

    # Audit log every prediction
    audit_logger.info(json.dumps({
        "timestamp": datetime.utcnow().isoformat(),
        "request_id": request.headers.get("X-Request-ID"),
        "customer_id": req.customer_id,
        "features": req.dict(),
        "prediction": response.prediction,
        "confidence": response.confidence,
        "model_version": MODEL_VERSION,
        "user_agent": request.headers.get("User-Agent"),
    }))

    return response
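Note that the audit log itself becomes a PII liability if raw identifiers like customer_id are stored verbatim. A common mitigation is to pseudonymize identifiers with a keyed hash before logging, so records can still be correlated without exposing the raw value. A sketch; the AUDIT_HMAC_KEY name and the field list are assumptions for illustration:

```python
import hashlib
import hmac
import os

# In production, fetch this key from a secrets manager, not an env default.
AUDIT_HMAC_KEY = os.environ.get("AUDIT_HMAC_KEY", "dev-only-key").encode()

def pseudonymize(value: str) -> str:
    """Keyed hash: stable for correlation, not reversible without the key."""
    return hmac.new(AUDIT_HMAC_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def redact_record(record: dict, pii_fields=("customer_id", "email")) -> dict:
    """Replace PII fields with pseudonyms before writing to the audit log."""
    return {
        k: pseudonymize(str(v)) if k in pii_fields and v is not None else v
        for k, v in record.items()
    }
```

Passing each record through redact_record before audit_logger.info keeps the trail useful for investigations while limiting what an attacker gains from the log file alone.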

Chapter 33: Model Governance & Compliance

"Governance turns MLOps from a technical discipline into an organizational one β€” with accountability at every step."


33.1 Model Governance Framework

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  MODEL GOVERNANCE FRAMEWORK                      β”‚
β”‚                                                                  β”‚
β”‚  DEVELOP          VALIDATE          APPROVE          OPERATE     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ Build    │───▢│ Test     │───▢│ Review   │───▢│ Monitor  β”‚   β”‚
β”‚  β”‚ Document β”‚    β”‚ Evaluate β”‚    β”‚ Approve  β”‚    β”‚ Audit    β”‚   β”‚
β”‚  β”‚ Version  β”‚    β”‚ Bias chk β”‚    β”‚ Sign off β”‚    β”‚ Alert    β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                  β”‚
β”‚  ARTIFACTS:                                                      β”‚
β”‚  β”œβ”€β”€ Model Card          ← What does this model do?             β”‚
β”‚  β”œβ”€β”€ Data Lineage        ← Where did training data come from?   β”‚
β”‚  β”œβ”€β”€ Evaluation Report   ← Performance + fairness metrics       β”‚
β”‚  β”œβ”€β”€ Approval Sign-off   ← Who approved this for production?    β”‚
β”‚  └── Monitoring Alerts   ← What triggers re-review?            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

33.2 EU AI Act Relevance for MLOps

The EU AI Act (in force since August 2024) classifies AI systems by risk level and imposes requirements proportional to that risk.

RISK LEVELS:
  β”œβ”€β”€ Unacceptable Risk β†’ PROHIBITED
  β”‚     (social scoring, biometric mass surveillance)
  β”‚
  β”œβ”€β”€ High Risk β†’ STRICT REQUIREMENTS
  β”‚     (credit scoring, hiring, medical, critical infrastructure)
  β”‚     Requirements: explainability, bias testing, human oversight,
  β”‚                   documentation, registration in EU database
  β”‚
  β”œβ”€β”€ Limited Risk β†’ TRANSPARENCY OBLIGATIONS
  β”‚     (chatbots must disclose they're AI)
  β”‚
  └── Minimal Risk β†’ Voluntary code of conduct
        (spam filters, recommendation engines)

MLOps IMPLICATIONS for High-Risk AI:
  βœ… Log all training data sources (data lineage)
  βœ… Document model design and training process
  βœ… Test for bias across demographics
  βœ… Maintain performance metrics history
  βœ… Enable human override of model decisions
  βœ… Provide explanations for individual decisions
  βœ… Alert when model performance degrades
  βœ… Audit trail of all model changes
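In a pipeline, several of these obligations reduce to "block deployment unless the required artifacts exist." A hedged sketch of such a gate; the field names are illustrative, not a standard schema:

```python
# Artifacts a high-risk model must document before release (illustrative schema).
REQUIRED_GOVERNANCE_FIELDS = [
    "model_card_uri",
    "data_lineage_uri",
    "bias_report_uri",
    "approver",
    "human_oversight_plan",
]

def governance_gate(metadata: dict) -> list:
    """Return the missing artifacts; an empty list means the gate passes."""
    return [f for f in REQUIRED_GOVERNANCE_FIELDS if not metadata.get(f)]
```

The CI/CD deploy stage calls this against the model registry entry and aborts with the missing-field list, turning the checklist above into an enforced contract rather than a convention.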

33.3 ML Metadata & Lineage Tracking

# MLMD β€” ML Metadata (used by TFX and Vertex AI Pipelines)
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Connect to metadata store
connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.sqlite.filename_uri = "mlmd.db"
store = metadata_store.MetadataStore(connection_config)

# Register artifact types
dataset_type = metadata_store_pb2.ArtifactType(name="Dataset")
dataset_type.properties["source"] = metadata_store_pb2.STRING
dataset_type.properties["num_rows"] = metadata_store_pb2.INT
dataset_type_id = store.put_artifact_type(dataset_type)

# Log artifacts and executions
dataset = metadata_store_pb2.Artifact(
    type_id=dataset_type_id,
    uri="gs://my-bucket/data/train.csv",
    properties={
        "source": metadata_store_pb2.Value(string_value="BigQuery"),
        "num_rows": metadata_store_pb2.Value(int_value=150000),
    }
)
dataset_id = store.put_artifacts([dataset])[0]

# Now you can trace: which model used which data

33.4 Model Approval Workflow

Code + Data Changes
        β”‚
        β–Ό
  Automated CI checks
  (tests, data validation, training)
        β”‚
        β–Ό
  Auto-evaluation report
  (accuracy, fairness, SHAP)
        β”‚
        β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ PEER REVIEW         β”‚
  β”‚ ML Engineer review  β”‚
  β”‚ Data Scientist sign β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
        β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ GOVERNANCE REVIEW   β”‚
  β”‚ (for High-Risk AI)  β”‚
  β”‚ Risk Officer review β”‚
  β”‚ Legal team sign-off β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
        β–Ό
  Deploy to Staging β†’ Test β†’ Deploy to Production
  (tagged in Git with who approved what)
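The workflow above can be enforced in the deploy pipeline by checking that sign-offs were recorded in order before promoting. A minimal sketch; the role names mirror the diagram, and the approval-record format is an assumption:

```python
# Sign-offs required before a production deploy, in the order of the diagram.
REQUIRED_APPROVALS = ["ml_engineer", "data_scientist", "risk_officer", "legal"]

def can_deploy(approvals: dict, high_risk: bool = True) -> bool:
    """approvals maps role -> approver name (falsy if not yet signed off).

    Non-high-risk models only need the peer-review stage; high-risk models
    also need the governance review (risk officer + legal).
    """
    required = REQUIRED_APPROVALS if high_risk else REQUIRED_APPROVALS[:2]
    return all(approvals.get(role) for role in required)
```

Storing the approvals dict alongside the Git tag gives the "who approved what" audit trail the diagram calls for.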

Next β†’ Chapter 34: LLMOps