Chapter 21: GCP & Vertex AI — Complete Deep Dive
"Vertex AI is Google's unified ML platform — one place to build, train, deploy, and monitor ML models at scale."
21.1 GCP MLOps Services Overview
┌──────────────────────────────────────────────────────────────────────────┐
│ GCP MLOPS SERVICES MAP │
│ │
│ COMPUTE STORAGE DATA & ANALYTICS │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ GKE │ │ GCS │ │ BigQuery │ │
│ │ (K8s) │ │ (object) │ │ Dataflow (Beam) │ │
│ │ Cloud Run │ │ Filestore │ │ Dataproc (Spark) │ │
│ │ Vertex AI │ │ AlloyDB │ │ Pub/Sub (streaming) │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
│ │
│ VERTEX AI PLATFORM (Unified ML) │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Workbench Datasets Training AutoML Experiments │ │
│ │ Pipelines Feature Model Endpoints Monitoring │ │
│ │ (Kubeflow) Store Registry (serving) (drift/perf) │ │
│ │ Vizier(HPO) Match Eng Metadata Model Model Garden │ │
│ │ (search) (lineage) Cards (foundation) │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
│ CI/CD & INFRA │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Cloud Build │ Artifact Registry │ Cloud Deploy │ IAM │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
21.2 Vertex AI Workbench
Vertex AI Workbench is a managed JupyterLab environment with GPU support and GCP integrations pre-configured.
# Create a managed notebook
gcloud workbench instances create my-notebook \
    --location=us-central1-a \
    --machine-type=n1-standard-4 \
    --accelerator-type=NVIDIA_TESLA_T4 \
    --accelerator-core-count=1 \
    --vm-image-project=deeplearning-platform-release \
    --vm-image-family=tf-latest-gpu
Key features:
- Pre-installed ML frameworks (TF, PyTorch, sklearn, XGBoost)
- Direct access to GCS, BigQuery, Vertex AI
- Git integration
- Scheduled notebook execution
- Secure, VPC-native
21.3 Vertex AI Datasets
Manage datasets used for training — supports tabular, image, text, video.
from google.cloud import aiplatform
aiplatform.init(project="my-project", location="us-central1")
# Create Tabular Dataset from BigQuery
dataset = aiplatform.TabularDataset.create(
    display_name="churn-dataset-v2",
    bq_source="bq://my-project.ml_data.customer_features",
)
print(f"Dataset created: {dataset.resource_name}")

# Create from GCS CSV
dataset = aiplatform.TabularDataset.create(
    display_name="churn-dataset-csv",
    gcs_source="gs://my-bucket/data/churn.csv",
)

# List datasets
datasets = aiplatform.TabularDataset.list()
for ds in datasets:
    print(ds.display_name, ds.resource_name)
21.4 Vertex AI Training — Custom Jobs
from google.cloud import aiplatform
aiplatform.init(project="my-project", location="us-central1")
# ── Option 1: Custom Training Job (script-based) ──────────────────
job = aiplatform.CustomTrainingJob(
    display_name="churn-gbm-training",
    script_path="src/train.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/scikit-learn-cpu.1-3:latest",
    requirements=["xgboost==1.7.0", "mlflow"],
    model_serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest"
    ),
)

model = job.run(
    dataset=dataset,
    target_column="churned",
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
    machine_type="n1-standard-8",
    replica_count=1,
    model_display_name="churn-gbm-v1",
    sync=True,
)

# ── Option 2: Custom Container Job ────────────────────────────────
job = aiplatform.CustomContainerTrainingJob(
    display_name="churn-custom-container",
    container_uri="us-central1-docker.pkg.dev/my-project/ml-repo/trainer:v2",
    model_serving_container_image_uri=(
        "us-central1-docker.pkg.dev/my-project/ml-repo/serving:v2"
    ),
)

model = job.run(
    args=["--epochs=50", "--batch-size=64"],
    machine_type="n1-highmem-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
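Inside the training container, Vertex AI passes hyperparameters through the `args` list and tells the script where to write artifacts via the `AIP_MODEL_DIR` environment variable. A minimal, hypothetical `train.py` skeleton showing that contract — the real training loop is elided, and the saved fields are illustrative:

```python
import argparse
import json
import os


def main(argv=None):
    """Minimal Vertex AI training entrypoint: parse args, train, save to AIP_MODEL_DIR."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=32)
    args = parser.parse_args(argv)

    # Vertex AI sets AIP_MODEL_DIR to a gs:// URI; the same bucket is also
    # mounted via Cloud Storage FUSE under /gcs/. Fall back to a local
    # directory so the script also runs outside Vertex.
    model_dir = os.environ.get("AIP_MODEL_DIR", "/tmp/model")
    if model_dir.startswith("gs://"):
        model_dir = model_dir.replace("gs://", "/gcs/", 1)
    os.makedirs(model_dir, exist_ok=True)

    # ... training loop would go here ...

    # Save the artifact where the serving container expects to find it.
    model_path = os.path.join(model_dir, "model.json")
    with open(model_path, "w") as f:
        json.dump({"epochs": args.epochs, "batch_size": args.batch_size}, f)
    return model_path


if __name__ == "__main__":
    main()
```

The serving container later reads the artifact from the same location, so the save path is the only coupling between training and serving code.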
21.5 Vertex AI Experiments & MLflow Integration
from google.cloud import aiplatform
aiplatform.init(
    project="my-project",
    location="us-central1",
    experiment="churn-experiments",  # link to Vertex AI Experiment
)

# Log experiment run
with aiplatform.start_run("gbm-v3-run"):
    # Log params
    aiplatform.log_params({
        "n_estimators": 200,
        "learning_rate": 0.05,
    })

    # Train model (your code here)
    model.fit(X_train, y_train)

    # Log metrics
    aiplatform.log_metrics({
        "accuracy": 0.92,
        "f1": 0.89,
    })

# Compare experiments in Vertex AI UI
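Runs logged this way can also be pulled down for programmatic comparison (`aiplatform.get_experiment_df()` returns them as a DataFrame); the selection logic itself is plain Python. A small, self-contained sketch — `best_run` is a hypothetical helper over run dicts shaped like the logged params/metrics:

```python
def best_run(runs, metric="f1", higher_is_better=True):
    """Pick the best experiment run from a list of {name, params, metrics} dicts."""
    if not runs:
        raise ValueError("no runs to compare")
    key = lambda r: r["metrics"][metric]
    return max(runs, key=key) if higher_is_better else min(runs, key=key)


# Illustrative run data, shaped like the params/metrics logged above
runs = [
    {"name": "gbm-v1-run", "params": {"learning_rate": 0.1}, "metrics": {"f1": 0.85}},
    {"name": "gbm-v2-run", "params": {"learning_rate": 0.05}, "metrics": {"f1": 0.89}},
    {"name": "gbm-v3-run", "params": {"learning_rate": 0.05}, "metrics": {"f1": 0.87}},
]
print(best_run(runs)["name"])  # → gbm-v2-run
```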
21.6 Vertex AI Pipelines (KFP on GCP)
# Same KFP pipeline code → runs on Vertex AI
from google.cloud import aiplatform
import kfp
# preprocess_op, train_op, and evaluate_op are KFP components
# (e.g. @kfp.dsl.component-decorated functions) defined elsewhere
@kfp.dsl.pipeline(name="churn-pipeline")
def churn_pipeline(data_path: str, threshold: float = 0.85):
    preprocess = preprocess_op(data_path=data_path)
    train = train_op(dataset=preprocess.output)
    evaluate = evaluate_op(model=train.output, threshold=threshold)
# Compile
kfp.compiler.Compiler().compile(churn_pipeline, "churn_pipeline.yaml")
# Submit to Vertex AI Pipelines
aiplatform.init(project="my-project", location="us-central1")
job = aiplatform.PipelineJob(
    display_name="churn-pipeline-weekly",
    template_path="churn_pipeline.yaml",
    pipeline_root="gs://my-bucket/pipeline-root",
    parameter_values={
        "data_path": "gs://my-bucket/data/churn.csv",
        "threshold": 0.85,
    },
    enable_caching=True,  # skip re-running unchanged stages
)
job.submit()  # non-blocking; use job.run() to wait for completion

# Schedule: run every Monday at 2am
job.create_schedule(
    display_name="churn-weekly",
    cron="0 2 * * 1",
    max_concurrent_run_count=1,
    max_run_count=52,  # 52 weeks = one year of runs
)
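The schedule string follows standard five-field cron syntax: "0 2 * * 1" means minute 0, hour 2, any day-of-month, any month, weekday 1 (Monday, in cron's 0=Sunday convention). A stdlib-only sanity checker makes the semantics concrete — `matches_cron` is a hypothetical helper supporting only "*" and plain numbers, enough for schedules like the one above:

```python
from datetime import datetime


def matches_cron(expr: str, dt: datetime) -> bool:
    """Check a datetime against a 5-field cron expression.

    Only supports '*' and plain numbers. Note the weekday mismatch:
    cron counts from 0=Sunday, datetime.weekday() from 0=Monday.
    """
    minute, hour, dom, month, dow = expr.split()
    cron_dow = (dt.weekday() + 1) % 7  # Monday: weekday()=0 → cron 1
    checks = [
        (minute, dt.minute),
        (hour, dt.hour),
        (dom, dt.day),
        (month, dt.month),
        (dow, cron_dow),
    ]
    return all(field == "*" or int(field) == value for field, value in checks)


# "0 2 * * 1" → 02:00 every Monday (Jan 1 2024 was a Monday)
print(matches_cron("0 2 * * 1", datetime(2024, 1, 1, 2, 0)))  # → True
print(matches_cron("0 2 * * 1", datetime(2024, 1, 2, 2, 0)))  # → False (Tuesday)
```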
21.7 Vertex AI Model Registry
# Upload model to registry
model = aiplatform.Model.upload(
    display_name="churn-classifier-v2",
    artifact_uri="gs://my-bucket/models/churn-v2/",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest",
    labels={"team": "ml-platform", "use-case": "churn"},
)

# List models
models = aiplatform.Model.list(filter="labels.use-case=churn")
# Create a new model version: upload with parent_model pointing at the
# existing Model resource (Model.copy is for cross-region copies, not versions)
model_v2 = aiplatform.Model.upload(
    display_name="churn-classifier-v2",
    parent_model=model.resource_name,
    artifact_uri="gs://my-bucket/models/churn-v2/",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest",
    version_aliases=["stable"],
)
# Evaluate model with Vertex AI model evaluation
eval_job = model.evaluate(
    prediction_type="classification",
    target_field_name="churned",
    gcs_source_uris=["gs://my-bucket/test_data.jsonl"],
    class_labels=["not_churned", "churned"],
)
21.8 Vertex AI Endpoints (Online Prediction)
# Create endpoint
endpoint = aiplatform.Endpoint.create(
    display_name="churn-production-endpoint",
    labels={"env": "production"},
)

# Deploy model (traffic_split controls A/B testing)
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="churn-v2",
    machine_type="n1-standard-2",
    min_replica_count=2,
    max_replica_count=10,
    traffic_split={"0": 100},  # 100% to this deployment
)

# Online prediction
response = endpoint.predict(
    instances=[
        {"age": 35, "income": 65000, "tenure": 12, "plan": "standard"},
        {"age": 52, "income": 45000, "tenure": 6, "plan": "basic"},
    ]
)
print(response.predictions)
# [{"prediction": 0, "confidence": 0.87}, {"prediction": 1, "confidence": 0.93}]
# A/B test: split traffic between v2 and v3
# (traffic_split keys are *deployed model* IDs, not Model resource IDs)
deployed = {m.display_name: m.id for m in endpoint.list_models()}
endpoint.update(traffic_split={
    deployed["churn-v2"]: 80,  # 80% to v2 (stable)
    deployed["churn-v3"]: 20,  # 20% to v3 (canary)
})
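Vertex AI requires the percentages in traffic_split to sum to exactly 100. A tiny helper for building a stable/canary split makes the invariant explicit — `canary_split` is a hypothetical convenience, not part of the SDK:

```python
def canary_split(stable_id: str, canary_id: str, canary_pct: int) -> dict:
    """Build a Vertex-style traffic_split dict; percentages must total 100."""
    if not 0 <= canary_pct <= 100:
        raise ValueError("canary_pct must be between 0 and 100")
    split = {stable_id: 100 - canary_pct, canary_id: canary_pct}
    assert sum(split.values()) == 100  # Vertex rejects splits that don't total 100
    return split


# IDs here are illustrative deployed-model IDs
print(canary_split("1234567890", "9876543210", 20))
# → {'1234567890': 80, '9876543210': 20}
```

Ramping the canary is then just a sequence of `endpoint.update(traffic_split=...)` calls with increasing percentages.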
21.9 Vertex AI Batch Prediction
# Batch prediction (score large datasets)
batch_job = model.batch_predict(
    job_display_name="churn-batch-scoring-2024-01",
    instances_format="csv",
    predictions_format="csv",
    gcs_source=["gs://my-bucket/data/customers_to_score.csv"],
    gcs_destination_prefix="gs://my-bucket/predictions/",
    machine_type="n1-standard-4",
    starting_replica_count=10,
    max_replica_count=20,
    sync=False,  # async — don't block
)

# Monitor
print(batch_job.state)
21.10 Vertex AI Model Monitoring
# Set up continuous monitoring on deployed model
job = aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="churn-monitoring",
    endpoint=endpoint,
    logging_sampling_strategy={
        "random_sample_config": {"sample_rate": 0.1}  # log 10% of requests
    },
    model_deployment_monitoring_schedule_config={
        "monitor_interval": {"seconds": 3600}  # check hourly
    },
    model_deployment_monitoring_objective_configs=[
        {
            "deployed_model_id": model_v2.id,
            "objective_config": {
                "training_dataset": {
                    "dataset": dataset.resource_name,
                    "target_field": "churned",
                },
                "training_prediction_skew_detection_config": {
                    "skew_thresholds": {
                        "age": {"value": 0.3},
                        "income": {"value": 0.3},
                    }
                },
                "prediction_drift_detection_config": {
                    "drift_thresholds": {
                        "age": {"value": 0.3},
                        "income": {"value": 0.3},
                    }
                },
            },
        }
    ],
    email_alert_config={
        "user_emails": ["mlteam@company.com"]
    },
)
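Behind those thresholds, Model Monitoring compares each feature's serving distribution against its training baseline using distance measures (L-infinity distance for categorical features, Jensen-Shannon divergence for numerical ones). A self-contained sketch of the L-infinity check, with synthetic data standing in for logged requests:

```python
from collections import Counter


def l_infinity_distance(baseline, current):
    """L-infinity distance between two categorical distributions.

    Largest absolute difference in category probability; a feature is
    flagged when this exceeds its configured threshold (0.3 above).
    """
    def normalize(values):
        counts = Counter(values)
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}

    p, q = normalize(baseline), normalize(current)
    categories = set(p) | set(q)
    return max(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Synthetic example: the "plan" feature shifts from mostly basic to mostly standard
baseline = ["basic"] * 70 + ["standard"] * 30
serving = ["basic"] * 30 + ["standard"] * 70
distance = l_infinity_distance(baseline, serving)
print(distance)        # → 0.4
print(distance > 0.3)  # → True: exceeds the threshold, drift alert fires
```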
21.11 Vertex AI Vizier (Managed HPO)
from google.cloud import aiplatform
from google.cloud.aiplatform import vizier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Create a Study
study_config = {
    "parameters": [
        {"parameter_id": "learning_rate",
         "double_value_spec": {"min_value": 0.001, "max_value": 0.3}},
        {"parameter_id": "n_estimators",
         "integer_value_spec": {"min_value": 50, "max_value": 500}},
    ],
    "metrics": [{"metric_id": "accuracy", "goal": "MAXIMIZE"}],
    "algorithm": "GAUSSIAN_PROCESS_BANDIT",  # Bayesian optimization
}

study = vizier.Study.create_or_load(
    display_name="churn-hpo-study",
    problem=study_config,
)

# Suggest trials, evaluate each, and report results back
trials = study.suggest(count=5)
for trial in trials:
    lr = trial.parameters["learning_rate"].as_float
    n = trial.parameters["n_estimators"].as_integer
    model = GradientBoostingClassifier(learning_rate=lr, n_estimators=n)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    trial.add_measurement(metrics={"accuracy": accuracy})
    trial.complete()
21.12 GCP Cost Estimates for MLOps
Service Typical Cost
──────────────────────────────────────────────────────
GCS (data storage) $0.02/GB/month
GKE (worker nodes) ~$0.10-0.48/hr per node
Vertex AI Training $0.16/hr (n1-standard-4)
Vertex AI Prediction $0.07/hr per deployed node
Cloud Build $0.003/min (first 120 min/day free)
Artifact Registry $0.10/GB/month
Cloud Composer $300+/month (managed Airflow)
Vertex AI Pipelines Pay per task run (no idle cost)
BigQuery $0.01/GB storage, $5/TB queried
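Per-hour rates are easiest to reason about as monthly figures. A back-of-envelope calculation using the numbers from the table (~730 hours in a month); the helper and workload sizes are illustrative:

```python
HOURS_PER_MONTH = 730  # average hours in a month

# Approximate rates from the table above
prediction_node_hr = 0.07   # Vertex AI Prediction, per deployed node
gcs_gb_month = 0.02         # GCS storage, per GB
artifact_gb_month = 0.10    # Artifact Registry, per GB


def monthly_serving_cost(min_replicas: int, data_gb: float, images_gb: float) -> float:
    """Rough monthly cost floor: always-on endpoint replicas plus storage."""
    endpoint = min_replicas * prediction_node_hr * HOURS_PER_MONTH
    storage = data_gb * gcs_gb_month + images_gb * artifact_gb_month
    return round(endpoint + storage, 2)


# Two always-on replicas (min_replica_count=2), 500 GB data, 20 GB images
print(monthly_serving_cost(2, 500, 20))  # → 114.2
```

Note that always-on endpoint replicas dominate: storage is ~$12/month here versus ~$102/month for the two prediction nodes, which is why batch prediction and scale-to-zero options matter for cost control.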
Next → Chapter 22: AWS SageMaker