Chapter 21: GCP & Vertex AI — Complete Deep Dive
"Vertex AI is Google's unified ML platform — one place to build, train, deploy, and monitor ML models at scale."
21.1 GCP MLOps Services Overview
┌──────────────────────────────────────────────────────────────────────────┐
│ GCP MLOPS SERVICES MAP │
│ │
│ COMPUTE STORAGE DATA & ANALYTICS │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ GKE │ │ GCS │ │ BigQuery │ │
│ │ (K8s) │ │ (object) │ │ Dataflow (Beam) │ │
│ │ Cloud Run │ │ Filestore │ │ Dataproc (Spark) │ │
│ │ Vertex AI │ │ AlloyDB │ │ Pub/Sub (streaming) │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
│ │
│ VERTEX AI PLATFORM (Unified ML) │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Workbench Datasets Training AutoML Experiments │ │
│ │ Pipelines Feature Model Endpoints Monitoring │ │
│ │ (Kubeflow) Store Registry (serving) (drift/perf) │ │
│ │ Vizier(HPO) Match Eng Metadata Model Model Garden │ │
│ │ (search) (lineage) Cards (foundation) │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
│ CI/CD & INFRA │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Cloud Build │ Artifact Registry │ Cloud Deploy │ IAM │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
21.2 Vertex AI Workbench
Vertex AI Workbench is a managed JupyterLab environment with GPU support and GCP integrations pre-configured.
# Create a managed notebook
gcloud workbench instances create my-notebook \
    --location=us-central1-a \
    --machine-type=n1-standard-4 \
    --accelerator-type=NVIDIA_TESLA_T4 \
    --accelerator-core-count=1 \
    --vm-image-project=deeplearning-platform-release \
    --vm-image-family=tf-latest-gpu
Key features:
- Pre-installed ML frameworks (TF, PyTorch, sklearn, XGBoost)
- Direct access to GCS, BigQuery, Vertex AI
- Git integration
- Scheduled notebook execution
- Secure, VPC-native
21.3 Vertex AI Datasets
Manage datasets used for training — supports tabular, image, text, video.
from google.cloud import aiplatform
aiplatform.init(project="my-project", location="us-central1")
# Create Tabular Dataset from BigQuery
dataset = aiplatform.TabularDataset.create(
    display_name="churn-dataset-v2",
    bq_source="bq://my-project.ml_data.customer_features",
)
print(f"Dataset created: {dataset.resource_name}")

# Create from GCS CSV
dataset = aiplatform.TabularDataset.create(
    display_name="churn-dataset-csv",
    gcs_source="gs://my-bucket/data/churn.csv",
)

# List datasets
datasets = aiplatform.TabularDataset.list()
for ds in datasets:
    print(ds.display_name, ds.resource_name)
21.4 Vertex AI Training — Custom Jobs
from google.cloud import aiplatform
aiplatform.init(project="my-project", location="us-central1")
# ── Option 1: Custom Training Job (script-based) ──────────────────
job = aiplatform.CustomTrainingJob(
    display_name="churn-gbm-training",
    script_path="src/train.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/scikit-learn-cpu.1-3:latest",
    requirements=["xgboost==1.7.0", "mlflow"],
    model_serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest"
    ),
)

model = job.run(
    dataset=dataset,
    target_column="churned",
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
    machine_type="n1-standard-8",
    replica_count=1,
    model_display_name="churn-gbm-v1",
    sync=True,
)

# ── Option 2: Custom Container Job ────────────────────────────────
job = aiplatform.CustomContainerTrainingJob(
    display_name="churn-custom-container",
    container_uri="us-central1-docker.pkg.dev/my-project/ml-repo/trainer:v2",
    model_serving_container_image_uri=(
        "us-central1-docker.pkg.dev/my-project/ml-repo/serving:v2"
    ),
)

model = job.run(
    args=["--epochs=50", "--batch-size=64"],
    machine_type="n1-highmem-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
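Inside the training container, Vertex AI passes hyperparameters through the `args` list and tells the script where to write artifacts via the `AIP_MODEL_DIR` environment variable. A minimal, hypothetical `train.py` skeleton showing that contract — the real training loop is elided, and the saved fields are illustrative:

```python
import argparse
import json
import os


def main(argv=None):
    """Minimal Vertex AI training entrypoint: parse args, train, save to AIP_MODEL_DIR."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=32)
    args = parser.parse_args(argv)

    # Vertex AI sets AIP_MODEL_DIR to a gs:// URI; the same bucket is also
    # mounted via Cloud Storage FUSE under /gcs/. Fall back to a local
    # directory so the script also runs outside Vertex.
    model_dir = os.environ.get("AIP_MODEL_DIR", "/tmp/model")
    if model_dir.startswith("gs://"):
        model_dir = model_dir.replace("gs://", "/gcs/", 1)
    os.makedirs(model_dir, exist_ok=True)

    # ... training loop would go here ...

    # Save the artifact where the serving container expects to find it.
    model_path = os.path.join(model_dir, "model.json")
    with open(model_path, "w") as f:
        json.dump({"epochs": args.epochs, "batch_size": args.batch_size}, f)
    return model_path


if __name__ == "__main__":
    main()
```

The serving container later reads the artifact from the same location, so the save path is the only coupling between training and serving code.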
21.5 Vertex AI Experiments & MLflow Integration
from google.cloud import aiplatform
aiplatform.init(
    project="my-project",
    location="us-central1",
    experiment="churn-experiments",  # link to Vertex AI Experiment
)

# Log experiment run
with aiplatform.start_run("gbm-v3-run"):
    # Log params
    aiplatform.log_params({
        "n_estimators": 200,
        "learning_rate": 0.05,
    })

    # Train model (your code here)
    model.fit(X_train, y_train)

    # Log metrics
    aiplatform.log_metrics({
        "accuracy": 0.92,
        "f1": 0.89,
    })

# Compare experiments in Vertex AI UI
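Runs logged this way can also be pulled down for programmatic comparison (`aiplatform.get_experiment_df()` returns them as a DataFrame); the selection logic itself is plain Python. A small, self-contained sketch — `best_run` is a hypothetical helper over run dicts shaped like the logged params/metrics:

```python
def best_run(runs, metric="f1", higher_is_better=True):
    """Pick the best experiment run from a list of {name, params, metrics} dicts."""
    if not runs:
        raise ValueError("no runs to compare")
    key = lambda r: r["metrics"][metric]
    return max(runs, key=key) if higher_is_better else min(runs, key=key)


# Illustrative run data, shaped like the params/metrics logged above
runs = [
    {"name": "gbm-v1-run", "params": {"learning_rate": 0.1}, "metrics": {"f1": 0.85}},
    {"name": "gbm-v2-run", "params": {"learning_rate": 0.05}, "metrics": {"f1": 0.89}},
    {"name": "gbm-v3-run", "params": {"learning_rate": 0.05}, "metrics": {"f1": 0.87}},
]
print(best_run(runs)["name"])  # → gbm-v2-run
```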
21.6 Vertex AI Pipelines (KFP on GCP)
# Same KFP pipeline code → runs on Vertex AI
from google.cloud import aiplatform
import kfp
# preprocess_op, train_op, and evaluate_op are KFP components
# (e.g. @kfp.dsl.component-decorated functions) defined elsewhere
@kfp.dsl.pipeline(name="churn-pipeline")
def churn_pipeline(data_path: str, threshold: float = 0.85):
    preprocess = preprocess_op(data_path=data_path)
    train = train_op(dataset=preprocess.output)
    evaluate = evaluate_op(model=train.output, threshold=threshold)
# Compile
kfp.compiler.Compiler().compile(churn_pipeline, "churn_pipeline.yaml")
# Submit to Vertex AI Pipelines
aiplatform.init(project="my-project", location="us-central1")
job = aiplatform.PipelineJob(
    display_name="churn-pipeline-weekly",
    template_path="churn_pipeline.yaml",
    pipeline_root="gs://my-bucket/pipeline-root",
    parameter_values={
        "data_path": "gs://my-bucket/data/churn.csv",
        "threshold": 0.85,
    },
    enable_caching=True,  # skip re-running unchanged stages
)
job.submit()  # non-blocking; use job.run() to wait for completion

# Schedule: run every Monday at 2am
job.create_schedule(
    display_name="churn-weekly",
    cron="0 2 * * 1",
    max_concurrent_run_count=1,
    max_run_count=52,  # 52 weeks = one year of runs
)
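The schedule string follows standard five-field cron syntax: "0 2 * * 1" means minute 0, hour 2, any day-of-month, any month, weekday 1 (Monday, in cron's 0=Sunday convention). A stdlib-only sanity checker makes the semantics concrete — `matches_cron` is a hypothetical helper supporting only "*" and plain numbers, enough for schedules like the one above:

```python
from datetime import datetime


def matches_cron(expr: str, dt: datetime) -> bool:
    """Check a datetime against a 5-field cron expression.

    Only supports '*' and plain numbers. Note the weekday mismatch:
    cron counts from 0=Sunday, datetime.weekday() from 0=Monday.
    """
    minute, hour, dom, month, dow = expr.split()
    cron_dow = (dt.weekday() + 1) % 7  # Monday: weekday()=0 → cron 1
    checks = [
        (minute, dt.minute),
        (hour, dt.hour),
        (dom, dt.day),
        (month, dt.month),
        (dow, cron_dow),
    ]
    return all(field == "*" or int(field) == value for field, value in checks)


# "0 2 * * 1" → 02:00 every Monday (Jan 1 2024 was a Monday)
print(matches_cron("0 2 * * 1", datetime(2024, 1, 1, 2, 0)))  # → True
print(matches_cron("0 2 * * 1", datetime(2024, 1, 2, 2, 0)))  # → False (Tuesday)
```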
21.7 Vertex AI Model Registry
# Upload model to registry
model = aiplatform.Model.upload(
    display_name="churn-classifier-v2",
    artifact_uri="gs://my-bucket/models/churn-v2/",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest",
    labels={"team": "ml-platform", "use-case": "churn"},
)

# List models
models = aiplatform.Model.list(filter="labels.use-case=churn")
# Create a new model version: upload with parent_model pointing at the
# existing Model resource (Model.copy is for cross-region copies, not versions)
model_v2 = aiplatform.Model.upload(
    display_name="churn-classifier-v2",
    parent_model=model.resource_name,
    artifact_uri="gs://my-bucket/models/churn-v2/",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest",
    version_aliases=["stable"],
)
# Evaluate model with Vertex AI model evaluation
eval_job = model.evaluate(
    prediction_type="classification",
    target_field_name="churned",
    gcs_source_uris=["gs://my-bucket/test_data.jsonl"],
    class_labels=["not_churned", "churned"],
)
21.8 Vertex AI Endpoints (Online Prediction)
# Create endpoint
endpoint = aiplatform.Endpoint.create(
    display_name="churn-production-endpoint",
    labels={"env": "production"},
)

# Deploy model (traffic_split controls A/B testing)
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="churn-v2",
    machine_type="n1-standard-2",
    min_replica_count=2,
    max_replica_count=10,
    traffic_split={"0": 100},  # 100% to this deployment
)

# Online prediction
response = endpoint.predict(
    instances=[
        {"age": 35, "income": 65000, "tenure": 12, "plan": "standard"},
        {"age": 52, "income": 45000, "tenure": 6, "plan": "basic"},
    ]
)
print(response.predictions)
# [{"prediction": 0, "confidence": 0.87}, {"prediction": 1, "confidence": 0.93}]
# A/B test: split traffic between v2 and v3
# (traffic_split keys are *deployed model* IDs, not Model resource IDs)
deployed = {m.display_name: m.id for m in endpoint.list_models()}
endpoint.update(traffic_split={
    deployed["churn-v2"]: 80,  # 80% to v2 (stable)
    deployed["churn-v3"]: 20,  # 20% to v3 (canary)
})
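Vertex AI requires the percentages in traffic_split to sum to exactly 100. A tiny helper for building a stable/canary split makes the invariant explicit — `canary_split` is a hypothetical convenience, not part of the SDK:

```python
def canary_split(stable_id: str, canary_id: str, canary_pct: int) -> dict:
    """Build a Vertex-style traffic_split dict; percentages must total 100."""
    if not 0 <= canary_pct <= 100:
        raise ValueError("canary_pct must be between 0 and 100")
    split = {stable_id: 100 - canary_pct, canary_id: canary_pct}
    assert sum(split.values()) == 100  # Vertex rejects splits that don't total 100
    return split


# IDs here are illustrative deployed-model IDs
print(canary_split("1234567890", "9876543210", 20))
# → {'1234567890': 80, '9876543210': 20}
```

Ramping the canary is then just a sequence of `endpoint.update(traffic_split=...)` calls with increasing percentages.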
21.9 Vertex AI Batch Prediction
# Batch prediction (score large datasets)
batch_job = model.batch_predict(
    job_display_name="churn-batch-scoring-2024-01",
    instances_format="csv",
    predictions_format="csv",
    gcs_source=["gs://my-bucket/data/customers_to_score.csv"],
    gcs_destination_prefix="gs://my-bucket/predictions/",
    machine_type="n1-standard-4",
    starting_replica_count=10,
    max_replica_count=20,
    sync=False,  # async — don't block
)

# Monitor
print(batch_job.state)
21.10 Vertex AI Model Monitoring
# Set up continuous monitoring on deployed model
job = aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="churn-monitoring",
    endpoint=endpoint,
    logging_sampling_strategy={
        "random_sample_config": {"sample_rate": 0.1}  # log 10% of requests
    },
    model_deployment_monitoring_schedule_config={
        "monitor_interval": {"seconds": 3600}  # check hourly
    },
    model_deployment_monitoring_objective_configs=[
        {
            "deployed_model_id": model_v2.id,
            "objective_config": {
                "training_dataset": {
                    "dataset": dataset.resource_name,
                    "target_field": "churned",
                },
                "training_prediction_skew_detection_config": {
                    "skew_thresholds": {
                        "age": {"value": 0.3},
                        "income": {"value": 0.3},
                    }
                },
                "prediction_drift_detection_config": {
                    "drift_thresholds": {
                        "age": {"value": 0.3},
                        "income": {"value": 0.3},
                    }
                },
            },
        }
    ],
    email_alert_config={
        "user_emails": ["mlteam@company.com"]
    },
)
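Behind those thresholds, Model Monitoring compares each feature's serving distribution against its training baseline using distance measures (L-infinity distance for categorical features, Jensen-Shannon divergence for numerical ones). A self-contained sketch of the L-infinity check, with synthetic data standing in for logged requests:

```python
from collections import Counter


def l_infinity_distance(baseline, current):
    """L-infinity distance between two categorical distributions.

    Largest absolute difference in category probability; a feature is
    flagged when this exceeds its configured threshold (0.3 above).
    """
    def normalize(values):
        counts = Counter(values)
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}

    p, q = normalize(baseline), normalize(current)
    categories = set(p) | set(q)
    return max(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Synthetic example: the "plan" feature shifts from mostly basic to mostly standard
baseline = ["basic"] * 70 + ["standard"] * 30
serving = ["basic"] * 30 + ["standard"] * 70
distance = l_infinity_distance(baseline, serving)
print(distance)        # → 0.4
print(distance > 0.3)  # → True: exceeds the threshold, drift alert fires
```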
21.11 Vertex AI Vizier (Managed HPO)
from google.cloud import aiplatform
from google.cloud.aiplatform import vizier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Create a Study
study_config = {
    "parameters": [
        {"parameter_id": "learning_rate",
         "double_value_spec": {"min_value": 0.001, "max_value": 0.3}},
        {"parameter_id": "n_estimators",
         "integer_value_spec": {"min_value": 50, "max_value": 500}},
    ],
    "metrics": [{"metric_id": "accuracy", "goal": "MAXIMIZE"}],
    "algorithm": "GAUSSIAN_PROCESS_BANDIT",  # Bayesian optimization
}

study = vizier.Study.create_or_load(
    display_name="churn-hpo-study",
    problem=study_config,
)

# Suggest trials, evaluate each, and report results back
trials = study.suggest(count=5)
for trial in trials:
    lr = trial.parameters["learning_rate"].as_float
    n = trial.parameters["n_estimators"].as_integer
    model = GradientBoostingClassifier(learning_rate=lr, n_estimators=n)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    trial.add_measurement(metrics={"accuracy": accuracy})
    trial.complete()
21.12 GCP Cost Estimates for MLOps
Service Typical Cost
──────────────────────────────────────────────────────
GCS (data storage) $0.02/GB/month
GKE (worker nodes) ~$0.10-0.48/hr per node
Vertex AI Training $0.16/hr (n1-standard-4)
Vertex AI Prediction $0.07/hr per deployed node
Cloud Build $0.003/min (first 120 min/day free)
Artifact Registry $0.10/GB/month
Cloud Composer $300+/month (managed Airflow)
Vertex AI Pipelines Pay per task run (no idle cost)
BigQuery $0.01/GB storage, $5/TB queried
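Per-hour rates are easiest to reason about as monthly figures. A back-of-envelope calculation using the numbers from the table (~730 hours in a month); the helper and workload sizes are illustrative:

```python
HOURS_PER_MONTH = 730  # average hours in a month

# Approximate rates from the table above
prediction_node_hr = 0.07   # Vertex AI Prediction, per deployed node
gcs_gb_month = 0.02         # GCS storage, per GB
artifact_gb_month = 0.10    # Artifact Registry, per GB


def monthly_serving_cost(min_replicas: int, data_gb: float, images_gb: float) -> float:
    """Rough monthly cost floor: always-on endpoint replicas plus storage."""
    endpoint = min_replicas * prediction_node_hr * HOURS_PER_MONTH
    storage = data_gb * gcs_gb_month + images_gb * artifact_gb_month
    return round(endpoint + storage, 2)


# Two always-on replicas (min_replica_count=2), 500 GB data, 20 GB images
print(monthly_serving_cost(2, 500, 20))  # → 114.2
```

Note that always-on endpoint replicas dominate: storage is ~$12/month here versus ~$102/month for the two prediction nodes, which is why batch prediction and scale-to-zero options matter for cost control.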
Next → Chapter 22: AWS SageMaker