Chapter 02: ML Lifecycle & Pipeline

"An ML system is not just a model β€” it's data + code + infrastructure + people."


2.1 The Machine Learning Lifecycle

The ML lifecycle has 8 stages that repeat in a continuous loop:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   ML LIFECYCLE (Circular Flow)                  β”‚
β”‚                                                                 β”‚
β”‚   1. Business    2. Data        3. Data        4. Model         β”‚
β”‚   Understanding  Collection ──▢ Preparation ──▢ Development     β”‚
β”‚        β”‚                                           β”‚           β”‚
β”‚        β”‚         (Feedback Loop)                   β”‚           β”‚
β”‚        β”‚                                           β–Ό           β”‚
β”‚   8. Monitor ◀── 7. Deploy ◀── 6. Package ◀── 5. Evaluation    β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.2 Stage-by-Stage Breakdown

Stage 1: Business Understanding

  • Define the problem (classification? regression? anomaly detection?)
  • Identify success metrics (accuracy, F1, RMSE, ROI)
  • Determine data availability

Stage 2: Data Collection

Sources:
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚  Databases β”‚   β”‚ External APIs β”‚   β”‚  CSV/Files β”‚
  β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚
                    β”Œβ”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”
                    β”‚ Data Lake  β”‚
                    β”‚ (GCS/S3)   β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
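
The fan-in above can be sketched in a few lines; the source names and fetch callables here are illustrative stand-ins, not a real ingestion API. Each record lands in one list playing the role of the data lake, tagged with its origin so lineage is not lost:

```python
# Toy consolidation step: pull records from several sources into one "lake".
# `sources` maps a source name to a zero-argument callable returning records.
def collect(sources):
    lake = []
    for name, fetch in sources.items():
        for record in fetch():
            # Tag each record with where it came from (simple lineage).
            lake.append({**record, "_source": name})
    return lake
```

In production the callables would wrap database queries, API clients, or file readers, and the "lake" would be object storage (GCS/S3) rather than a Python list.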

Stage 3: Data Preparation

Tasks:
- Cleaning (handle nulls, duplicates, outliers)
- Feature Engineering (create meaningful features)
- Splitting (train / validation / test)
- Versioning with DVC

# Example: register the preprocessing step as a DVC pipeline stage
# (dvc run in DVC 1.x; newer releases use `dvc stage add` with the same flags)
dvc run -n preprocess \
  -d data/raw/dataset.csv \
  -d src/preprocess.py \
  -o data/processed/clean.csv \
  python src/preprocess.py
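
The src/preprocess.py referenced by the DVC stage could look roughly like this sketch (the paths and the cleaning rules are assumptions, not the chapter's actual script). It handles the first cleaning tasks listed above: dropping rows with missing values and exact duplicates:

```python
import csv
from pathlib import Path

def clean_rows(rows):
    """Drop rows with missing values and exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(v is None or v == "" for v in row.values()):
            continue  # handle nulls: skip incomplete records
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        cleaned.append(row)
    return cleaned

def preprocess(src="data/raw/dataset.csv", dst="data/processed/clean.csv"):
    with open(src, newline="") as f:
        rows = list(csv.DictReader(f))
    cleaned = clean_rows(rows)
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    with open(dst, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=cleaned[0].keys())
        writer.writeheader()
        writer.writerows(cleaned)

if __name__ == "__main__":
    preprocess()
```

Because the script is listed as a DVC dependency, editing it invalidates the stage and `dvc repro` re-runs it.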

Stage 4: Model Development

   Raw Features
       β”‚
       β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚  Model   β”‚      β”‚  Model   β”‚      β”‚  Model   β”‚
  β”‚    A     β”‚      β”‚    B     β”‚      β”‚    C     β”‚
  β”‚(RF/XGB)  β”‚      β”‚  (SVM)   β”‚      β”‚ (Neural) β”‚
  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                    β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”
                    β”‚ Best     β”‚
                    β”‚ Model βœ“  β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
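
The selection step in the diagram is just "fit every candidate, score each on the validation split, keep the winner". A generic sketch, where `train` and `validate` are placeholder callables standing in for real fit/score routines:

```python
def select_best(candidates, train, validate):
    """candidates: dict mapping name -> model object.
    Returns (best_name, best_score) by validation score."""
    scores = {}
    for name, model in candidates.items():
        train(model)                    # fit on the training split
        scores[name] = validate(model)  # score on the validation split
    best = max(scores, key=scores.get)
    return best, scores[best]
```

With scikit-learn, `train` would call `model.fit(X_train, y_train)` and `validate` would return `model.score(X_val, y_val)` or a cross-validation mean.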

Stage 5: Model Evaluation

Key metrics tracked:

Task Type       β”‚ Metrics
────────────────┼──────────────────────────────────────────
Classification  β”‚ Accuracy, Precision, Recall, F1, AUC-ROC
Regression      β”‚ MAE, MSE, RMSE, RΒ²
Clustering      β”‚ Silhouette Score, Davies-Bouldin Index
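
For classification, the core metrics all derive from the confusion-matrix counts. A from-scratch sketch of the arithmetic (in practice you would use `sklearn.metrics`, which computes the same quantities):

```python
def classification_metrics(y_true, y_pred, positive=1):
    # Confusion-matrix counts for the positive class.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```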

Stage 6: Model Packaging

Model artifacts β†’ Docker Image β†’ Container Registry
     β”‚
     β”œβ”€β”€ model.pkl / model.pt / model.h5
     β”œβ”€β”€ preprocessor.pkl
     β”œβ”€β”€ requirements.txt
     └── inference.py (FastAPI/Flask server)
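
A minimal sketch of producing the pickled artifacts listed above (the model and preprocessor here are stand-in objects rather than real trained estimators; in practice they would be the fitted sklearn/PyTorch objects):

```python
import pickle
from pathlib import Path

def package_artifacts(model, preprocessor, out_dir="artifacts"):
    """Serialize the model and preprocessor together, so inference.py
    can reload both and apply the exact same preprocessing."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "model.pkl", "wb") as f:
        pickle.dump(model, f)
    with open(out / "preprocessor.pkl", "wb") as f:
        pickle.dump(preprocessor, f)
    return sorted(p.name for p in out.iterdir())
```

Shipping the preprocessor alongside the model is what lets the Docker image reproduce training-time transformations at inference time.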

Stage 7: Deployment

  Model Package
       β”‚
       β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚   Deployment Targets             β”‚
  β”‚                                  β”‚
  β”‚  🌐 REST API (Flask/FastAPI)     β”‚
  β”‚  πŸ“¦ Docker Container             β”‚
  β”‚  ☸️  Kubernetes Cluster (GKE)    β”‚
  β”‚  ☁️  Cloud Functions             β”‚
  β”‚  πŸ“Š Batch Scoring                β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Stage 8: Monitoring

What to monitor:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚          MONITORING DIMENSIONS         β”‚
β”‚                                        β”‚
β”‚  πŸ“Š Model Performance                  β”‚
β”‚     └── Accuracy drift over time       β”‚
β”‚                                        β”‚
β”‚  πŸ“ˆ Data Quality                       β”‚
β”‚     └── Input feature distribution     β”‚
β”‚                                        β”‚
β”‚  βš™οΈ  Infrastructure                    β”‚
β”‚     └── Latency, Memory, CPU           β”‚
β”‚                                        β”‚
β”‚  πŸ’Ό Business KPIs                      β”‚
β”‚     └── Revenue impact, conversions    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
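
As one concrete example of the data-quality dimension, a naive drift check can compare a live feature's mean against the training baseline. Production monitors use proper statistical tests (e.g. Kolmogorov-Smirnov or PSI); the z-score threshold below is an illustrative assumption:

```python
import statistics

def drift_alert(baseline, live, threshold=3.0):
    """Flag drift when the live mean sits more than `threshold`
    training-time standard deviations away from the baseline mean."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    live_mu = statistics.mean(live)
    z = abs(live_mu - mu) / sigma if sigma else float("inf")
    return z > threshold
```

A check like this would run on a schedule over recent inference inputs and page the team (or trigger retraining) when it fires.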

2.3 ML Pipeline vs ML System

ML PIPELINE (what you build):
β”Œβ”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Data │──▢│ Feature │──▢│ Train │──▢│ Serve  β”‚
β”‚ Prep β”‚   β”‚  Eng    β”‚   β”‚ Model β”‚   β”‚ Model  β”‚
β””β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜

ML SYSTEM (what runs in production):
  ML Pipeline + Configuration + Data Collection +
  Feature Platform + Model Registry + Serving System +
  Monitoring + CI/CD infrastructure

2.4 Key Artifacts in the Pipeline

Artifact        β”‚ What It Is                   β”‚ How It's Tracked
────────────────┼──────────────────────────────┼──────────────────────────
Raw Data        β”‚ Original collected data      β”‚ DVC + GCS
Processed Data  β”‚ Cleaned, engineered features β”‚ DVC
Model Weights   β”‚ Trained model parameters     β”‚ MLflow + DVC
Hyperparameters β”‚ Config used for training     β”‚ MLflow
Metrics         β”‚ Evaluation scores            β”‚ MLflow
Docker Image    β”‚ Packaged model server        β”‚ GCR (Container Registry)

2.5 Feature Store Concept

A Feature Store is a centralized repository for ML features. It ensures that the features used at training time match the features served at inference time, preventing training-serving skew.

                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  Batch Data ─────▢│             │◀──── Streaming Data
                   β”‚  FEATURE    β”‚
  Training ◀───────│   STORE     │──────▢ Online Serving
                   β”‚  (Feast/    β”‚
  Historical ◀─────│  Tecton)    │──────▢ Batch Scoring
                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
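
The skew-prevention idea can be shown with a toy in-memory store. This API is invented purely for illustration (Feast and Tecton provide the production-grade equivalents): the point is that training and serving both read through the same `get_features()` call, so the two code paths cannot diverge.

```python
class FeatureStore:
    """Toy feature store: one write path, one read path for everyone."""

    def __init__(self):
        self._features = {}  # entity_id -> {feature_name: value}

    def ingest(self, entity_id, features):
        """Write path, fed by batch and streaming pipelines alike."""
        self._features.setdefault(entity_id, {}).update(features)

    def get_features(self, entity_id, names):
        """Read path, shared by training jobs and online serving."""
        row = self._features.get(entity_id, {})
        return [row.get(n) for n in names]
```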

Next Chapter β†’ 03: Git & GitHub