Chapter 09: AutoML β€” Automated Machine LearningπŸ”—

"AutoML democratizes ML β€” automating the tedious parts so experts can focus on strategy."


9.1 What is AutoML?πŸ”—

AutoML (Automated Machine Learning) automates the process of selecting, training, and tuning machine learning models. Instead of manually trying different algorithms and hyperparameters, AutoML explores the entire search space automatically.

Manual ML vs AutoMLπŸ”—

MANUAL ML:
  Data β†’ You pick algorithm β†’ You tune hyperparameters β†’ You evaluate β†’ Repeat
  Time: Days to weeks

AUTOML:
  Data β†’ AutoML searches algorithms + hyperparameters β†’ Best model returned
  Time: Hours to days (with less human effort)

9.2 What AutoML AutomatesπŸ”—

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              AUTOML AUTOMATION SCOPE                    β”‚
β”‚                                                         β”‚
β”‚  βœ… Feature Preprocessing (encoding, scaling, imputing) β”‚
β”‚  βœ… Algorithm Selection (RF, XGB, SVM, Neural Net...)   β”‚
β”‚  βœ… Hyperparameter Optimization (HPO)                   β”‚
β”‚  βœ… Model Ensembling (stack best models)                β”‚
β”‚  βœ… Neural Architecture Search (NAS)                    β”‚
β”‚  βœ… Model Evaluation & Comparison                       β”‚
β”‚  βœ… Cross-validation strategy                           β”‚
β”‚                                                         β”‚
β”‚  ❌ NOT automated: Problem definition, data collection, β”‚
β”‚     business metrics, deployment decisions              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

9.3 AutoML Search ProcessπŸ”—

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚     AUTOML SEARCH        β”‚
                    β”‚                          β”‚
  Input Data ──────▢│  Algorithm Space:        β”‚
                    β”‚  β”œβ”€β”€ Random Forest       β”‚
                    β”‚  β”œβ”€β”€ Gradient Boosting   β”‚
                    β”‚  β”œβ”€β”€ SVM                 β”‚
                    β”‚  β”œβ”€β”€ Neural Networks     β”‚
                    β”‚  └── Linear Models       β”‚
                    β”‚                          β”‚
                    β”‚  Hyperparameter Space:   β”‚
                    β”‚  β”œβ”€β”€ learning_rate       β”‚
                    β”‚  β”œβ”€β”€ max_depth           β”‚
                    β”‚  β”œβ”€β”€ n_estimators        β”‚
                    β”‚  └── dropout_rate        β”‚
                    β”‚                          β”‚
                    β”‚  Search Strategy:        β”‚
                    β”‚  β”œβ”€β”€ Bayesian Opt        β”‚
                    β”‚  β”œβ”€β”€ Random Search       β”‚
                    β”‚  └── Evolutionary        β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β–Ό
                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                         β”‚  Best Model   β”‚
                         β”‚  + Config     β”‚
                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
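
The search loop in the diagram can be sketched in a few lines. Below is a minimal toy version (assuming scikit-learn and a small example dataset, not a real AutoML framework) that randomly samples both an algorithm and its hyperparameters, evaluates each candidate, and keeps the best:

```python
import random
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
random.seed(0)

# Joint search space: each entry pairs an algorithm with a hyperparameter sampler
search_space = [
    (RandomForestClassifier, lambda: {"n_estimators": random.choice([50, 100, 200]),
                                      "max_depth": random.choice([3, 5, None])}),
    (LogisticRegression, lambda: {"C": random.choice([0.1, 1.0, 10.0]),
                                  "max_iter": 1000}),
]

best_score, best_config = -1.0, None
for _ in range(10):                                  # fixed trial budget
    algo, sampler = random.choice(search_space)      # pick an algorithm
    params = sampler()                               # pick its hyperparameters
    score = cross_val_score(algo(**params), X, y, cv=3).mean()
    if score > best_score:
        best_score, best_config = score, (algo.__name__, params)

print(best_score, best_config)                       # "Best Model + Config"
```

Real AutoML systems replace the random sampling with smarter strategies (Bayesian optimization, evolutionary search) and add preprocessing and ensembling, but the loop structure is the same.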

9.4 Hyperparameter Optimization (HPO)πŸ”—

Hyperparameters are settings you choose before training (not learned by the model).
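
The distinction in one short sketch (using scikit-learn): `max_depth` is a hyperparameter you fix before calling `fit`, while the tree's split features and thresholds are parameters learned from the data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameter: chosen BEFORE training
clf = DecisionTreeClassifier(max_depth=3, random_state=0)

# Parameters: learned DURING training (which features to split on, at what values)
clf.fit(X, y)
print(clf.get_depth())   # learned depth, bounded by the max_depth hyperparameter
```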

HPO Strategies ComparedπŸ”—

GRID SEARCH:                     RANDOM SEARCH:
β”Œβ”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”                 β”Œβ”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”
β”‚  β”‚  β”‚  β”‚  β”‚  β”‚                 β”‚  β”‚  β”‚βœ“ β”‚  β”‚  β”‚
β”œβ”€β”€β”Όβ”€β”€β”Όβ”€β”€β”Όβ”€β”€β”Όβ”€β”€β”€                 β”œβ”€β”€β”Όβ”€β”€β”Όβ”€β”€β”Όβ”€β”€β”Όβ”€β”€β”€
β”‚  β”‚  β”‚  β”‚  β”‚  β”‚                 β”‚  β”‚  β”‚  β”‚βœ“ β”‚  β”‚
β”œβ”€β”€β”Όβ”€β”€β”Όβ”€β”€β”Όβ”€β”€β”Όβ”€β”€β”€                 β”œβ”€β”€β”Όβ”€β”€β”Όβ”€β”€β”Όβ”€β”€β”Όβ”€β”€β”€
β”‚  β”‚  β”‚  β”‚  β”‚  β”‚                 β”‚βœ“ β”‚  β”‚  β”‚  β”‚  β”‚
β””β”€β”€β”΄β”€β”€β”΄β”€β”€β”΄β”€β”€β”΄β”€β”€β”˜                 β””β”€β”€β”΄β”€β”€β”΄β”€β”€β”΄β”€β”€β”΄β”€β”€β”˜
Tests every combo                 Random samples
Slow, exhaustive                  Faster, misses some
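
Both strategies ship with scikit-learn. A minimal sketch comparing them on the same model (dataset and parameter ranges are illustrative):

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Grid search: tests every combination (3 x 2 = 6 configs per CV fold)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [3, 5]},
    cv=3,
).fit(X, y)

# Random search: samples a fixed number of points, here from integer distributions
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 300), "max_depth": randint(2, 10)},
    n_iter=6, cv=3, random_state=0,
).fit(X, y)

print(grid.best_score_, grid.best_params_)
print(rand.best_score_, rand.best_params_)
```

With the same budget (6 configurations each), random search can explore values the grid never touches, which is why it often wins in high-dimensional spaces.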

BAYESIAN OPTIMIZATION (usually the most sample-efficient):
Start β†’ Try 5 random points β†’ Build a model of "which areas are good"
      β†’ Sample from good areas β†’ Update model β†’ Repeat
  β†’ Smart, efficient, finds optimum faster

HPO with Optuna (Python)πŸ”—

import optuna
from sklearn.datasets import load_breast_cancer   # example data; substitute your own
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_train, y_train = load_breast_cancer(return_X_y=True)

def objective(trial):
    # Define hyperparameter search space
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2']),
    }

    clf = RandomForestClassifier(**params, random_state=42)
    score = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy').mean()
    return score

# Run optimization (100 trials)
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print(f"Best accuracy: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

9.5 Vertex AI AutoML (GCP)πŸ”—

Vertex AI AutoML lets you train high-quality models without writing ML code.

Supported AutoML TypesπŸ”—

Task           What You Provide    What You Get
-------------  ------------------  -------------------------------
AutoML Tables  Tabular CSV data    Classification/Regression model
AutoML Vision  Labeled images      Image classifier
AutoML NLP     Labeled text        Text classifier
AutoML Video   Labeled videos      Video classifier

AutoML Tables WorkflowπŸ”—

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           VERTEX AI AUTOML TABLES WORKFLOW           β”‚
β”‚                                                      β”‚
β”‚  1. Upload CSV to BigQuery or GCS                    β”‚
β”‚        β”‚                                             β”‚
β”‚        β–Ό                                             β”‚
β”‚  2. Create Dataset in Vertex AI                      β”‚
β”‚        β”‚                                             β”‚
β”‚        β–Ό                                             β”‚
β”‚  3. Configure target column + training budget        β”‚
β”‚     (e.g., 1 hour, 8 hours, 24 hours)                β”‚
β”‚        β”‚                                             β”‚
β”‚        β–Ό                                             β”‚
β”‚  4. AutoML trains 100s of models internally          β”‚
β”‚     (feature engineering + HPO + ensembling)         β”‚
β”‚        β”‚                                             β”‚
β”‚        β–Ό                                             β”‚
β”‚  5. Evaluate: AUC, Precision, Recall, Confusion Mat  β”‚
β”‚        β”‚                                             β”‚
β”‚        β–Ό                                             β”‚
β”‚  6. Deploy best model to Vertex AI Endpoint          β”‚
β”‚        β”‚                                             β”‚
β”‚        β–Ό                                             β”‚
β”‚  7. Online predictions via REST API                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Vertex AI AutoML via Python SDKπŸ”—

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Create a Dataset from BigQuery
dataset = aiplatform.TabularDataset.create(
    display_name="customer-churn-dataset",
    bq_source="bq://my-project.ml_datasets.customer_churn",
)

# Train AutoML Tabular model
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="churn-automl-v1",
    optimization_prediction_type="classification",   # or 'regression'
    optimization_objective="maximize-au-roc",
)

model = job.run(
    dataset=dataset,
    target_column="churned",            # column to predict
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
    budget_milli_node_hours=1000,       # 1 hour of training budget
    model_display_name="churn-model-v1",
)

# Deploy
endpoint = model.deploy(machine_type="n1-standard-2")

# Predict
prediction = endpoint.predict(instances=[{
    "age": 35,
    "tenure_months": 12,
    "monthly_charges": 65.5
}])
print(prediction.predictions)

9.6 Neural Architecture Search (NAS)πŸ”—

NAS automates the design of neural network architectures (layer types, depths, connections).

TRADITIONAL: Data Scientist manually designs network architecture
  β”Œβ”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”
  β”‚Conv│─│Conv│─│Pool│─│FC  │─│Softβ”‚  (hand-designed)
  β””β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”˜

NAS: Algorithm searches for best architecture
  Search Space: [Conv3x3, Conv5x5, Pool, Skip, Dense, ...]
       β”‚
       β”œβ”€β”€ Try Architecture 1 β†’ eval β†’ 0.82 accuracy
       β”œβ”€β”€ Try Architecture 2 β†’ eval β†’ 0.85 accuracy
       β”œβ”€β”€ Try Architecture N β†’ eval β†’ 0.91 accuracy ← WINNER
       └── Return best architecture
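
The NAS loop above can be mimicked at toy scale with scikit-learn's MLPClassifier, treating `hidden_layer_sizes` as the "architecture" being searched. This is only a sketch of the idea, not real NAS (which searches layer types and connections, not just widths):

```python
import random
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
random.seed(0)

def sample_architecture():
    """Sample a candidate architecture: 1-3 hidden layers of varying width."""
    depth = random.choice([1, 2, 3])
    return tuple(random.choice([32, 64, 128]) for _ in range(depth))

best_score, best_arch = -1.0, None
for _ in range(5):                                    # small trial budget
    arch = sample_architecture()                      # "Try Architecture i"
    net = MLPClassifier(hidden_layer_sizes=arch, max_iter=300, random_state=0)
    score = cross_val_score(net, X, y, cv=3).mean()   # "eval"
    if score > best_score:
        best_score, best_arch = score, arch           # keep the winner

print(best_arch, best_score)                          # "Return best architecture"
```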

9.7 AutoML Libraries & ToolsπŸ”—

Library           Best For                       Notes
----------------  -----------------------------  -----------------------
Vertex AI AutoML  GCP production workloads       Fully managed, no code
H2O AutoML        Tabular data, fast             Open source
TPOT              Sklearn pipeline optimization  Uses genetic algorithms
AutoKeras         Deep learning architectures    Keras/TF-based
Optuna            HPO only                       Flexible, widely used
Ray Tune          Distributed HPO                Scales to clusters

9.8 AutoML in the MLOps PipelineπŸ”—

MLOps Pipeline with AutoML:

  New Data Arrives
       β”‚
       β–Ό
  Data Validation
       β”‚
       β–Ό
  Trigger AutoML Training (Vertex AI)
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚  AutoML internally:             β”‚
  β”‚  - Tries 100s of configs        β”‚
  β”‚  - Logs all experiments         β”‚
  β”‚  - Returns best model           β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β”‚
                 β–Ό
  Evaluate vs current prod model
       β”‚
       β”œβ”€β”€ Better? β†’ Deploy (CI/CD)
       └── Worse?  β†’ Alert team
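
The "Better? β†’ Deploy" gate is often implemented as a simple comparison with a minimum-improvement threshold, so tiny metric fluctuations don't trigger redeploys. A sketch (function name, metric, and threshold are illustrative, not from any specific framework):

```python
def should_deploy(challenger_auc: float, champion_auc: float,
                  min_improvement: float = 0.005) -> bool:
    """Deploy the new AutoML model only if it beats prod by a clear margin."""
    return challenger_auc >= champion_auc + min_improvement

# Challenger 0.912 AUC vs current prod 0.905 AUC: improvement 0.007 >= 0.005
print(should_deploy(0.912, 0.905))   # -> deploy
# No improvement at all: alert the team instead
print(should_deploy(0.905, 0.905))
```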

Next Chapter β†’ 10: Experiment Tracking