Chapter 03: Git & GitHub for MLOps
"Version control is the backbone of reproducible ML. No Git = no collaboration = no MLOps."
3.1 What is Git?
Git is a distributed version control system that tracks changes to files over time. Every change is recorded, enabling you to revert, compare, and collaborate.
WITHOUT GIT:                      WITH GIT:
model_v1.py                       main branch     ──▶ prod model
model_v2_final.py                 feature branch  ──▶ experiment
model_v2_final_USE_THIS.py        hotfix branch   ──▶ quick fix
model_v2_ACTUAL_FINAL.py          All tracked, reversible ✓
3.2 Core Git Concepts
| Term | Meaning |
|---|---|
| Repository (Repo) | Folder tracked by Git |
| Commit | A snapshot of the project at a point in time, with a message and a link to its parent commit |
| Branch | Independent line of development |
| Merge | Combine branches together |
| Clone | Copy a remote repo locally |
| Push | Send local commits to remote |
| Pull | Fetch + merge remote changes |
| PR (Pull Request) | Request to review and merge a branch (a GitHub feature, not part of Git itself) |
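To make "snapshot" concrete: Git identifies every stored file version by a hash of its content, so a file that has not changed between two commits is stored only once. The sketch below reproduces how a blob ID is derived in repositories that use the default SHA-1 object format (SHA-1 over a `blob <size>\0` header followed by the file bytes), which is the value `git hash-object` prints. The function name `git_blob_id` and the sample file contents are ours, chosen for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID Git assigns to a file with this content (SHA-1 repos)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    v1 = b"learning_rate = 0.1\n"
    v2 = b"learning_rate = 0.05\n"
    print(git_blob_id(v1))
    print(git_blob_id(v2))                     # different content, different ID
    print(git_blob_id(v1) == git_blob_id(v1))  # same content always maps to the same ID
```

Because the ID depends only on content, renaming a file or committing it again unchanged creates no new blob.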
3.3 Git Workflow for ML Teams
Branching Strategy (GitFlow for ML)
main      ●─────────────────────────────────●   (production)
           \                               /
develop     ●───────────────────●─────────●     (integration)
             \                 / \       /
feature/      ●───────────────●   ●─────●       (experiments)
model-experiment
                              ↑
                      PR + Code Review
Standard Commands
# Initialize a new ML project
git init my-ml-project
cd my-ml-project
# Stage and commit
git add src/train.py data/features.csv
git commit -m "feat: add gradient boosting model with feature engineering"
# Create and switch to experiment branch
git checkout -b experiment/xgboost-v2
# Push to remote
git push origin experiment/xgboost-v2
# Merge via PR (on GitHub UI) then pull locally
git checkout main
git pull origin main
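The commit message above follows a `type: summary` pattern, in the style of Conventional Commits. A team that wants to enforce this could run a small check in a Git hook or in CI. The sketch below is one possible version; the list of allowed types and the function name `is_valid_commit_message` are assumptions, not an established standard for this project.

```python
import re

# Assumed set of prefixes; adjust to whatever the team agrees on.
ALLOWED_TYPES = ("feat", "fix", "docs", "refactor", "test", "chore", "data", "exp")

# type, optional (scope), colon, space, non-empty summary
_PATTERN = re.compile(r"^(?P<type>[a-z]+)(\([\w./-]+\))?: (?P<summary>\S.*)$")


def is_valid_commit_message(message: str) -> bool:
    """Check the first line of a commit message for a 'type: summary' prefix."""
    first_line = message.splitlines()[0] if message.strip() else ""
    match = _PATTERN.match(first_line)
    return bool(match) and match.group("type") in ALLOWED_TYPES


if __name__ == "__main__":
    for msg in (
        "feat: add gradient boosting model with feature engineering",
        "updated stuff",
    ):
        print(f"{msg!r}: {is_valid_commit_message(msg)}")
```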
3.4 What is GitHub?
GitHub is a cloud-hosted Git platform with collaboration tools: Pull Requests, Issues, Actions (CI/CD), Packages, and more.
┌──────────────────────────────────────────────┐
│               GITHUB FEATURES                │
│                                              │
│  📁 Repositories      🔀 Pull Requests        │
│  🐛 Issues            🏷️ Tags & Releases      │
│  ⚡ GitHub Actions    📦 GitHub Packages      │
│  🔒 Branch Rules      🔍 Code Review          │
│  📊 Insights          🤝 Teams & Permissions  │
└──────────────────────────────────────────────┘
3.5 GitHub Actions for ML CI/CD
GitHub Actions runs automated workflows defined in YAML files inside your repository. A workflow is triggered by an event such as a push, a pull request, or a schedule.
How It Works
Event (push to main)
        │
        ▼
.github/workflows/train.yml
        │
        ├── Job 1: lint & test code
        ├── Job 2: validate data
        ├── Job 3: train model
        ├── Job 4: evaluate model
        └── Job 5: deploy if metrics pass
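One detail the diagram hides: jobs in a GitHub Actions workflow run in parallel unless a job declares `needs:` on another job. The sketch below models the five stages as a dependency graph and derives an order that respects those dependencies. The job names are illustrative and do not come from a real workflow file.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a job; each value is the set of jobs it must wait for
# (the role `needs:` plays in a workflow file). Names are hypothetical.
JOB_NEEDS = {
    "lint-and-test": set(),
    "validate-data": {"lint-and-test"},
    "train": {"validate-data"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}


def run_order(job_needs: dict[str, set[str]]) -> list[str]:
    """Return jobs in an order that respects every dependency."""
    return list(TopologicalSorter(job_needs).static_order())


if __name__ == "__main__":
    print(" -> ".join(run_order(JOB_NEEDS)))
```

If `deploy` had no dependency on `evaluate`, it could start before the metrics exist, which is why the deploy stage needs an explicit dependency.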
Sample GitHub Actions Workflow
# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  train-and-evaluate:
    runs-on: ubuntu-latest
    env:
      # Assumption: the GCP project ID is stored as a repository secret with this name
      PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run data validation
        run: python src/validate_data.py

      - name: Train model
        run: python src/train.py

      - name: Evaluate model
        run: python src/evaluate.py

      - name: Check accuracy threshold
        run: |
          ACCURACY=$(cat metrics/accuracy.txt)
          echo "Model accuracy: $ACCURACY"
          python -c "assert float('$ACCURACY') > 0.85, 'Accuracy below threshold!'"

      - name: Build Docker image
        # Build and push only on pushes to main, not on pull requests
        if: github.event_name == 'push'
        run: docker build -t my-ml-model:${{ github.sha }} .

      - name: Push to GCR
        # Assumes an earlier step has authenticated Docker against gcr.io
        if: github.event_name == 'push'
        run: |
          docker tag my-ml-model:${{ github.sha }} gcr.io/$PROJECT_ID/ml-model:latest
          docker push gcr.io/$PROJECT_ID/ml-model:latest
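The "Check accuracy threshold" step assumes that `src/evaluate.py` writes a bare number to `metrics/accuracy.txt`. That script is not shown here, so the following is only a minimal sketch of what a standalone gate could look like if the check were moved out of the inline shell command. The 0.85 threshold and the file path mirror the workflow; the function names are assumptions.

```python
import sys
from pathlib import Path


def read_accuracy(path: Path) -> float:
    """Parse a single float from a metrics file such as metrics/accuracy.txt."""
    return float(path.read_text().strip())


def passes_threshold(accuracy: float, threshold: float = 0.85) -> bool:
    """Mirror the workflow's rule: accuracy must be strictly above the threshold."""
    return accuracy > threshold


def main(metrics_path: str = "metrics/accuracy.txt") -> int:
    """Return a process exit code: 0 if the model passes, 1 otherwise."""
    path = Path(metrics_path)
    if not path.exists():
        print(f"Missing metrics file: {path}", file=sys.stderr)
        return 1
    accuracy = read_accuracy(path)
    print(f"Model accuracy: {accuracy}")
    return 0 if passes_threshold(accuracy) else 1
```

Saved as a script, this would end with `raise SystemExit(main())` so that a failing model produces a non-zero exit code and stops the job before the Docker steps run.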
3.6 DVC: Data Version Control
Git tracks code. DVC tracks large data files and ML models by storing metadata in Git and the actual files in remote storage (GCS, S3, Azure Blob).
┌─────────────────────────────────────────────┐
│                DVC WORKFLOW                 │
│                                             │
│  Large File (data.csv, model.pkl)           │
│       │                                     │
│       ├─── DVC ──▶ data.csv.dvc (tiny)      │
│       │            ├── saved to Git ✓       │
│       │            └── points to GCS        │
│       │                                     │
│       └─────────────▶ GCS Bucket            │
│                       (actual large file)   │
└─────────────────────────────────────────────┘
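The split in the diagram (a tiny pointer in Git, the bytes somewhere else) can be sketched in a few lines. The code below hashes a file, copies it into a content-addressed directory that stands in for the bucket, and writes a small JSON pointer next to the original. This is not DVC's implementation or file format (real `.dvc` files are YAML, and DVC manages its own cache); the names `track`, `restore`, and the `.ptr.json` suffix are invented for illustration.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, store: Path) -> Path:
    """Copy data_file into the store under its hash and write a small pointer file."""
    checksum = file_checksum(data_file)
    store.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, store / checksum)
    pointer = data_file.with_name(data_file.name + ".ptr.json")
    pointer.write_text(json.dumps(
        {"sha256": checksum, "size": data_file.stat().st_size, "path": data_file.name}
    ))
    return pointer


def restore(pointer: Path, store: Path) -> Path:
    """Recreate the data file next to its pointer and verify its checksum."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copyfile(store / meta["sha256"], target)
    if file_checksum(target) != meta["sha256"]:
        raise ValueError(f"Checksum mismatch for {target}")
    return target
```

Only the pointer file would be committed to Git. Anyone with the pointer and access to the store can rebuild the exact bytes, which is the property the rest of this section relies on.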
DVC Commands
# Initialize DVC in an existing Git repo (GCS support needs: pip install "dvc[gs]")
dvc init
# Add remote storage (GCS) and stage the updated DVC config
dvc remote add -d gcs-remote gs://my-bucket/dvc-store
git add .dvc/config
# Track large data file (DVC writes a .gitignore next to the tracked file)
dvc add data/raw/dataset.csv
git add data/raw/dataset.csv.dvc data/raw/.gitignore
git commit -m "track raw dataset with DVC"
# Push data to remote
dvc push
# Pull data on another machine
dvc pull
3.7 Git + DVC Together
DEVELOPER A (trains model):
  1. git pull                    ← get latest code
  2. dvc pull                    ← get latest data
  3. python train.py             ← train
  4. dvc add model/model.pkl     ← track model (writes model/model.pkl.dvc)
  5. git add + commit + push     ← track code changes and the updated .dvc file
  6. dvc push                    ← push model to GCS
DEVELOPER B (reproduces):
  1. git clone <repo>, then cd into it
  2. dvc pull                    ← downloads exact same data + model
  ✓ Same code, data, and model as Developer A
3.8 Best Practices
✅ DO:
- Commit often with meaningful messages ("feat: add SMOTE for class imbalance")
- Use .gitignore for __pycache__, .env, *.pkl
- Use .dvcignore to keep large interim files out of DVC's scans
- Protect the main branch with required PR reviews
- Tag releases: git tag v1.2.0, then git push origin v1.2.0
❌ DON'T:
- Commit large files to Git (GitHub rejects files over 100 MB); track them with DVC
- Commit secrets or API keys
- Push directly to main
- Delete branches that still hold unmerged work
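Two of these rules (no oversized files, no secrets) are easy to check mechanically before committing. The sketch below walks a directory and reports offending files. The list of suspicious file names and the function name `find_violations` are assumptions; a real setup would more likely rely on a pre-commit hook plus a dedicated secret scanner.

```python
from pathlib import Path

MAX_BYTES = 100 * 1024 * 1024  # the 100 MB limit from the list above
# Hypothetical list of file names that usually hold credentials.
SECRET_NAMES = {".env", "credentials.json", "id_rsa"}


def find_violations(root: Path, max_bytes: int = MAX_BYTES) -> list[str]:
    """Return human-readable problems for files under root, skipping .git."""
    problems = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        rel = path.relative_to(root).as_posix()
        if path.name in SECRET_NAMES:
            problems.append(f"{rel}: looks like a secrets file")
        if path.stat().st_size > max_bytes:
            problems.append(f"{rel}: larger than {max_bytes} bytes")
    return problems


if __name__ == "__main__":
    for problem in find_violations(Path(".")):
        print(problem)
```

This checks everything on disk, including files already covered by `.gitignore`, so treat its output as a prompt to double-check rather than a hard failure.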
Next Chapter → 04: CI/CD for ML