Chapter 02: ML Lifecycle & Pipeline

"An ML system is not just a model β€” it's data + code + infrastructure + people."


2.1 The Machine Learning Lifecycle

The ML lifecycle has 8 stages that repeat in a continuous loop:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   ML LIFECYCLE (Circular Flow)                  β”‚
β”‚                                                                 β”‚
β”‚   1. Business    2. Data        3. Data        4. Model         β”‚
β”‚   Understanding  Collection ──▢ Preparation ──▢ Development     β”‚
β”‚        β”‚                                           β”‚           β”‚
β”‚        β”‚         (Feedback Loop)                   β”‚           β”‚
β”‚        β”‚                                           β–Ό           β”‚
β”‚   8. Monitor ◀── 7. Deploy ◀── 6. Package ◀── 5. Evaluation    β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.2 Stage-by-Stage Breakdown

Stage 1: Business Understanding

  • Define the problem (classification? regression? anomaly detection?)
  • Identify success metrics (accuracy, F1, RMSE, ROI)
  • Determine data availability

Stage 2: Data Collection

Sources:
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚  Databases β”‚   β”‚ External APIs β”‚   β”‚  CSV/Files β”‚
  β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚
                    β”Œβ”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”
                    β”‚ Data Lake  β”‚
                    β”‚ (GCS/S3)   β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
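
The fan-in above can be sketched in a few lines; the source names and fetch callables here are illustrative stand-ins, not a real ingestion API. Each record lands in one list playing the role of the data lake, tagged with its origin so lineage is not lost:

```python
# Toy consolidation step: pull records from several sources into one "lake".
# `sources` maps a source name to a zero-argument callable returning records.
def collect(sources):
    lake = []
    for name, fetch in sources.items():
        for record in fetch():
            # Tag each record with where it came from (simple lineage).
            lake.append({**record, "_source": name})
    return lake
```

In production the callables would wrap database queries, API clients, or file readers, and the "lake" would be object storage (GCS/S3) rather than a Python list.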

Stage 3: Data Preparation

Tasks:
- Cleaning (handle nulls, duplicates, outliers)
- Feature Engineering (create meaningful features)
- Splitting (train / validation / test)
- Versioning with DVC

# Example: register the preprocessing step as a DVC pipeline stage
# (dvc run in DVC 1.x; newer releases use `dvc stage add` with the same flags)
dvc run -n preprocess \
  -d data/raw/dataset.csv \
  -d src/preprocess.py \
  -o data/processed/clean.csv \
  python src/preprocess.py
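
The src/preprocess.py referenced by the DVC stage could look roughly like this sketch (the paths and the cleaning rules are assumptions, not the chapter's actual script). It handles the first cleaning tasks listed above: dropping rows with missing values and exact duplicates:

```python
import csv
from pathlib import Path

def clean_rows(rows):
    """Drop rows with missing values and exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(v is None or v == "" for v in row.values()):
            continue  # handle nulls: skip incomplete records
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        cleaned.append(row)
    return cleaned

def preprocess(src="data/raw/dataset.csv", dst="data/processed/clean.csv"):
    with open(src, newline="") as f:
        rows = list(csv.DictReader(f))
    cleaned = clean_rows(rows)
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    with open(dst, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=cleaned[0].keys())
        writer.writeheader()
        writer.writerows(cleaned)

if __name__ == "__main__":
    preprocess()
```

Because the script is listed as a DVC dependency, editing it invalidates the stage and `dvc repro` re-runs it.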

Stage 4: Model Development

   Raw Features
       β”‚
       β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚  Model   β”‚      β”‚  Model   β”‚      β”‚  Model   β”‚
  β”‚    A     β”‚      β”‚    B     β”‚      β”‚    C     β”‚
  β”‚(RF/XGB)  β”‚      β”‚  (SVM)   β”‚      β”‚ (Neural) β”‚
  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                    β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”
                    β”‚ Best     β”‚
                    β”‚ Model βœ“  β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
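
The selection step in the diagram is just "fit every candidate, score each on the validation split, keep the winner". A generic sketch, where `train` and `validate` are placeholder callables standing in for real fit/score routines:

```python
def select_best(candidates, train, validate):
    """candidates: dict mapping name -> model object.
    Returns (best_name, best_score) by validation score."""
    scores = {}
    for name, model in candidates.items():
        train(model)                    # fit on the training split
        scores[name] = validate(model)  # score on the validation split
    best = max(scores, key=scores.get)
    return best, scores[best]
```

With scikit-learn, `train` would call `model.fit(X_train, y_train)` and `validate` would return `model.score(X_val, y_val)` or a cross-validation mean.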

Stage 5: Model Evaluation

Key metrics tracked:

Task Type       β”‚ Metrics
────────────────┼──────────────────────────────────────────
Classification  β”‚ Accuracy, Precision, Recall, F1, AUC-ROC
Regression      β”‚ MAE, MSE, RMSE, RΒ²
Clustering      β”‚ Silhouette Score, Davies-Bouldin Index
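
For classification, the core metrics all derive from the confusion-matrix counts. A from-scratch sketch of the arithmetic (in practice you would use `sklearn.metrics`, which computes the same quantities):

```python
def classification_metrics(y_true, y_pred, positive=1):
    # Confusion-matrix counts for the positive class.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```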

Stage 6: Model Packaging

Model artifacts β†’ Docker Image β†’ Container Registry
     β”‚
     β”œβ”€β”€ model.pkl / model.pt / model.h5
     β”œβ”€β”€ preprocessor.pkl
     β”œβ”€β”€ requirements.txt
     └── inference.py (FastAPI/Flask server)
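
A minimal sketch of producing the pickled artifacts listed above (the model and preprocessor here are stand-in objects rather than real trained estimators; in practice they would be the fitted sklearn/PyTorch objects):

```python
import pickle
from pathlib import Path

def package_artifacts(model, preprocessor, out_dir="artifacts"):
    """Serialize the model and preprocessor together, so inference.py
    can reload both and apply the exact same preprocessing."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "model.pkl", "wb") as f:
        pickle.dump(model, f)
    with open(out / "preprocessor.pkl", "wb") as f:
        pickle.dump(preprocessor, f)
    return sorted(p.name for p in out.iterdir())
```

Shipping the preprocessor alongside the model is what lets the Docker image reproduce training-time transformations at inference time.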

Stage 7: Deployment

  Model Package
       β”‚
       β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚   Deployment Targets             β”‚
  β”‚                                  β”‚
  β”‚  🌐 REST API (Flask/FastAPI)     β”‚
  β”‚  πŸ“¦ Docker Container             β”‚
  β”‚  ☸️  Kubernetes Cluster (GKE)    β”‚
  β”‚  ☁️  Cloud Functions             β”‚
  β”‚  πŸ“Š Batch Scoring                β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Stage 8: Monitoring

What to monitor:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚          MONITORING DIMENSIONS         β”‚
β”‚                                        β”‚
β”‚  πŸ“Š Model Performance                  β”‚
β”‚     └── Accuracy drift over time       β”‚
β”‚                                        β”‚
β”‚  πŸ“ˆ Data Quality                       β”‚
β”‚     └── Input feature distribution     β”‚
β”‚                                        β”‚
β”‚  βš™οΈ  Infrastructure                    β”‚
β”‚     └── Latency, Memory, CPU           β”‚
β”‚                                        β”‚
β”‚  πŸ’Ό Business KPIs                      β”‚
β”‚     └── Revenue impact, conversions    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
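
As one concrete example of the data-quality dimension, a naive drift check can compare a live feature's mean against the training baseline. Production monitors use proper statistical tests (e.g. Kolmogorov-Smirnov or PSI); the z-score threshold below is an illustrative assumption:

```python
import statistics

def drift_alert(baseline, live, threshold=3.0):
    """Flag drift when the live mean sits more than `threshold`
    training-time standard deviations away from the baseline mean."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    live_mu = statistics.mean(live)
    z = abs(live_mu - mu) / sigma if sigma else float("inf")
    return z > threshold
```

A check like this would run on a schedule over recent inference inputs and page the team (or trigger retraining) when it fires.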

2.3 ML Pipeline vs ML System

ML PIPELINE (what you build):
β”Œβ”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Data │──▢│ Feature │──▢│ Train │──▢│ Serve  β”‚
β”‚ Prep β”‚   β”‚  Eng    β”‚   β”‚ Model β”‚   β”‚ Model  β”‚
β””β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜

ML SYSTEM (what runs in production):
  ML Pipeline + Configuration + Data Collection +
  Feature Platform + Model Registry + Serving System +
  Monitoring + CI/CD infrastructure

2.4 Key Artifacts in the Pipeline

Artifact        β”‚ What It Is                   β”‚ How It's Tracked
────────────────┼──────────────────────────────┼──────────────────────────
Raw Data        β”‚ Original collected data      β”‚ DVC + GCS
Processed Data  β”‚ Cleaned, engineered features β”‚ DVC
Model Weights   β”‚ Trained model parameters     β”‚ MLflow + DVC
Hyperparameters β”‚ Config used for training     β”‚ MLflow
Metrics         β”‚ Evaluation scores            β”‚ MLflow
Docker Image    β”‚ Packaged model server        β”‚ GCR (Container Registry)

2.5 Feature Store Concept

A Feature Store is a centralized repository for ML features. It ensures that the features used at training time match the features served at inference time, preventing training-serving skew.

                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  Batch Data ─────▢│             │◀──── Streaming Data
                   β”‚  FEATURE    β”‚
  Training ◀───────│   STORE     │──────▢ Online Serving
                   β”‚  (Feast/    β”‚
  Historical ◀─────│  Tecton)    │──────▢ Batch Scoring
                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
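
The skew-prevention idea can be shown with a toy in-memory store. This API is invented purely for illustration (Feast and Tecton provide the production-grade equivalents): the point is that training and serving both read through the same `get_features()` call, so the two code paths cannot diverge.

```python
class FeatureStore:
    """Toy feature store: one write path, one read path for everyone."""

    def __init__(self):
        self._features = {}  # entity_id -> {feature_name: value}

    def ingest(self, entity_id, features):
        """Write path, fed by batch and streaming pipelines alike."""
        self._features.setdefault(entity_id, {}).update(features)

    def get_features(self, entity_id, names):
        """Read path, shared by training jobs and online serving."""
        row = self._features.get(entity_id, {})
        return [row.get(n) for n in names]
```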

Next Chapter β†’ 03: Git & GitHub