Chapter 03: Git & GitHub for MLOps

"Version control is the backbone of reproducible ML. No Git = no collaboration = no MLOps."

3.1 What is Git?

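Git is a distributed version control system that tracks changes to files over time. Every committed change is recorded in the repository's history, so you can revert to an earlier state, compare versions, and collaborate without overwriting each other's work.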

WITHOUT GIT:                 WITH GIT:
model_v1.py                  main branch ──▶ prod model
model_v2_final.py            feature branch ──▶ experiment
model_v2_final_USE_THIS.py   hotfix branch ──▶ quick fix
model_v2_ACTUAL_FINAL.py     All tracked, reversible ✓

3.2 Core Git Concepts

Term	Meaning
Repository (Repo)	Folder tracked by Git
Commit	A snapshot of the tracked files at a point in time
Branch	Independent line of development
Merge	Combine branches together
Clone	Copy a remote repo locally
Push	Send local commits to remote
Pull	Fetch + merge remote changes
PR (Pull Request)	Request to merge a branch

3.3 Git Workflow for ML Teams

Branching Strategy (GitFlow for ML)

main          ●─────────────────────────────────●  (production)
               \                               /
develop         ●───────────────────●─────────●   (integration)
                 \                 / \       /
feature/          ●───────────────●   ●─────●      (experiments)
model-experiment
                               ↑
                         PR + Code Review

Standard Commands

# Initialize a new ML project
git init my-ml-project
cd my-ml-project

# Stage and commit
git add src/train.py data/features.csv
git commit -m "feat: add gradient boosting model with feature engineering"

# Create and switch to experiment branch
git checkout -b experiment/xgboost-v2

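# Push to remote (assumes a remote named "origin" is already configured;
# -u sets the upstream so later pushes and pulls need no arguments)
git push -u origin experiment/xgboost-v2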

# Merge via PR (on GitHub UI) then pull locally
git checkout main
git pull origin main
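
The commit message above follows a "type: summary" pattern. A team that wants to enforce that pattern could check messages with a few lines of Python, for example from a commit-msg hook. This is a minimal sketch: the function name is_valid_commit_message and the list of accepted types are assumptions for illustration, not something Git itself defines.

```python
import re

# Accepted types are an assumption for illustration; adjust to your team's convention.
ALLOWED_TYPES = ("feat", "fix", "docs", "refactor", "test", "chore")

# Matches "type: summary" or "type(scope): summary" at the start of a line.
PATTERN = re.compile(r"^(" + "|".join(ALLOWED_TYPES) + r")(\([\w-]+\))?: \S.*")


def is_valid_commit_message(message: str) -> bool:
    """Return True when the first line of the message follows the pattern."""
    lines = message.splitlines()
    first_line = lines[0] if lines else ""
    return bool(PATTERN.match(first_line))
```

Only the first line is checked, so a longer body after the summary line does not affect the result.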

3.4 What is GitHub?

GitHub is a cloud-hosted Git platform with collaboration tools: Pull Requests, Issues, Actions (CI/CD), Packages, and more.

┌──────────────────────────────────────────────┐
│              GITHUB FEATURES                 │
│                                              │
│  📁 Repositories    🔀 Pull Requests          │
│  🐛 Issues          🏷️  Tags & Releases        │
│  ⚡ GitHub Actions  📦 GitHub Packages        │
│  🔒 Branch Rules    🔍 Code Review            │
│  📊 Insights        🤝 Teams & Permissions    │
└──────────────────────────────────────────────┘

3.5 GitHub Actions for ML CI/CD

GitHub Actions automates workflows directly in your repo. Each workflow is a YAML file under .github/workflows/ and runs when a matching event occurs, such as a push, a pull request, or a schedule.

How It Works

Event (push to main)
       │
       ▼
  .github/workflows/ml-pipeline.yml
       │
       ├── Job 1: lint & test code
       ├── Job 2: validate data
       ├── Job 3: train model
       ├── Job 4: evaluate model
       └── Job 5: deploy if metrics pass

Sample GitHub Actions Workflow

# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  train-and-evaluate:
    runs-on: ubuntu-latest

    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.10'

    - name: Install dependencies
      run: pip install -r requirements.txt

    - name: Run data validation
      run: python src/validate_data.py

    - name: Train model
      run: python src/train.py

    - name: Evaluate model
      run: python src/evaluate.py

    - name: Check accuracy threshold
      run: |
        ACCURACY=$(cat metrics/accuracy.txt)
        echo "Model accuracy: $ACCURACY"
        python -c "assert float('$ACCURACY') > 0.85, 'Accuracy below threshold!'"

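    # Build and push only on pushes to main, so pull requests cannot publish images
    - name: Build Docker image
      if: github.event_name == 'push'
      run: docker build -t my-ml-model:${{ github.sha }} .

    # Assumes registry authentication was configured in an earlier step and that
    # PROJECT_ID exists as a repository variable; neither is shown here
    - name: Push to GCR
      if: github.event_name == 'push'
      env:
        PROJECT_ID: ${{ vars.PROJECT_ID }}
      run: |
        docker tag my-ml-model:${{ github.sha }} gcr.io/$PROJECT_ID/ml-model:latest
        docker push gcr.io/$PROJECT_ID/ml-model:latest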
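
The inline assertion in the "Check accuracy threshold" step could also live in a small script, which is easier to test and reports clearer errors. The following is a minimal sketch, assuming the metric is a single number stored in metrics/accuracy.txt as in the workflow above. The names check_accuracy and main are hypothetical, and the script is not part of the original pipeline.

```python
import sys
from pathlib import Path

THRESHOLD = 0.85  # same threshold as the workflow step above


def check_accuracy(metrics_file: Path, threshold: float = THRESHOLD) -> float:
    """Read the accuracy from metrics_file and raise ValueError unless it exceeds threshold."""
    accuracy = float(metrics_file.read_text().strip())  # ValueError if not a number
    # "not accuracy > threshold" also rejects NaN, which a plain "<=" test would let through.
    if not accuracy > threshold:
        raise ValueError(f"accuracy {accuracy:.4f} is not above threshold {threshold}")
    return accuracy


def main(argv: list) -> int:
    """Return 0 when the check passes and 1 otherwise, for use as a process exit code."""
    path = Path(argv[0]) if argv else Path("metrics/accuracy.txt")
    try:
        accuracy = check_accuracy(path)
    except (OSError, ValueError) as err:
        print(f"Accuracy check failed: {err}", file=sys.stderr)
        return 1
    print(f"Model accuracy: {accuracy:.4f}")
    return 0
```

Saved as a script that ends with sys.exit(main(sys.argv[1:])), this could replace the three shell lines in the workflow step. A non-zero exit code fails the step, which stops the later build and push steps.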

3.6 DVC: Data Version Control

Git tracks code. DVC tracks large data files and ML models by storing metadata in Git and the actual files in remote storage (GCS, S3, Azure Blob).

┌─────────────────────────────────────────────┐
│              DVC WORKFLOW                   │
│                                             │
│  Large File (data.csv, model.pkl)           │
│       │                                     │
│       ├─── DVC ──▶ data.csv.dvc (tiny)      │
│       │           ├── saved to Git ✓         │
│       │           └── points to GCS         │
│       │                                     │
│       └─────────────▶ GCS Bucket            │
│                       (actual large file)   │
└─────────────────────────────────────────────┘
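
The pointer idea in the diagram can be illustrated in a few lines of Python. This is a conceptual sketch only, not DVC's implementation: it hashes a file's contents and returns a small record of the kind that could be committed to Git in place of the file itself. The function name make_pointer, the field names, and the choice of SHA-256 are assumptions for illustration. Real .dvc files are written by dvc add.

```python
import hashlib
from pathlib import Path


def make_pointer(path: Path, chunk_size: int = 1 << 20) -> dict:
    """Hash a file in chunks and return a small record that identifies its contents."""
    digest = hashlib.sha256()
    size = 0
    with path.open("rb") as fh:
        # Read in chunks so that large files never have to fit in memory.
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    return {"path": path.name, "sha256": digest.hexdigest(), "size": size}
```

Because the record depends only on the file's bytes, two machines holding the same data produce the same record, and any change to the data produces a different one. That property is what lets a tiny file in Git stand in for a large file in remote storage.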

DVC Commands

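# Initialize DVC in Git repo
dvc init

# Add remote storage (GCS)
dvc remote add -d gcs-remote gs://my-bucket/dvc-store

# Commit the DVC setup so teammates get the same remote configuration
git add .dvc .dvcignore
git commit -m "chore: initialize DVC with GCS remote"

# Track large data file (DVC writes the .gitignore next to the data file)
dvc add data/raw/dataset.csv
git add data/raw/dataset.csv.dvc data/raw/.gitignore
git commit -m "track raw dataset with DVC"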

# Push data to remote
dvc push

# Pull data on another machine
dvc pull

3.7 Git + DVC Together

DEVELOPER A (trains model):
  1. git pull                  ← get latest code
  2. dvc pull                  ← get latest data
  3. python train.py           ← train
  4. dvc add model/model.pkl   ← track model (writes model/model.pkl.dvc)
  5. dvc push                  ← upload model to GCS
  6. git add + commit + push   ← record code and the .dvc pointer

DEVELOPER B (reproduces):
  1. git clone <repo>
  2. dvc pull                  ← downloads exact same data + model
  ✓  Same code, data, and model as Developer A

3.8 Best Practices

✅ DO:
  - Commit often with meaningful messages ("feat: add SMOTE for class imbalance")
  - Use .gitignore for __pycache__, .env, *.pkl
  - Use .dvcignore for large interim files
  - Protect main branch with required PR reviews
  - Tag releases and push the tag: git tag v1.2.0, then git push origin v1.2.0

❌ DON'T:
  - Commit large files (>100 MB) to Git; track them with DVC instead
  - Commit secrets or API keys
  - Push directly to main
  - Delete branches that still hold unmerged work you need
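
The large-file rule can be checked automatically before a commit. The following is a minimal sketch, assuming the 100 MB limit from the list above. The helper find_oversized is hypothetical and is not part of Git or DVC.

```python
from pathlib import Path
from typing import Iterable, List

LIMIT_BYTES = 100 * 1024 * 1024  # 100 MB, the limit named in the list above


def find_oversized(paths: Iterable[Path], limit: int = LIMIT_BYTES) -> List[Path]:
    """Return the existing files among paths whose size exceeds limit bytes."""
    return [p for p in paths if p.is_file() and p.stat().st_size > limit]
```

In a pre-commit hook, the paths could come from the output of git diff --cached --name-only, and the hook would exit with a non-zero status whenever the returned list is not empty.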
