Chapter 05: DVC - Data Version Control
"Git tracks code. DVC tracks data, models, and experiments. Together they make ML reproducible."
5.1 What is DVC?
DVC (Data Version Control) is an open-source tool that adds Git-like version control to large data files, ML models, and experiment pipelines. It stores metadata in Git while the actual files live in remote storage (GCS, S3, Azure Blob, SSH, etc.).
The Problem DVC Solves
WITHOUT DVC:
- data_v1.csv, data_v2.csv, data_final.csv (chaos)
- Can't tell which dataset trained which model
- 5 GB CSV committed to Git → repo destroyed
- Colleague can't reproduce your results
WITH DVC:
- data.csv.dvc (tiny text file) committed to Git
- Actual data in GCS/S3 (versioned, deduplicated)
- Every model links to exact data that created it
- Anyone can: git checkout + dvc pull → exact environment
5.2 How DVC Works
+--------------------------------------------------------+
|                   DVC ARCHITECTURE                     |
|                                                        |
|  Your Files:        Git Repo:          Remote Storage: |
|  data.csv  ------>  data.csv.dvc  ------------------>  |
|  model.pkl ------>  model.pkl.dvc       GCS / S3       |
|                                         / Azure Blob   |
|             +----------------------+                   |
|             | data.csv.dvc         |                   |
|             | -------------------- |                   |
|             | outs:                |                   |
|             | - md5: a3f1b2...     |                   |
|             |   path: data.csv     |                   |
|             +----------------------+                   |
|             (tiny, tracked by Git)                     |
+--------------------------------------------------------+
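The md5 in the .dvc file doubles as the storage key: DVC keeps the actual file in a content-addressed cache named after its hash, which is how identical data deduplicates across versions. A minimal sketch of that addressing scheme (DVC 2.x-style cache layout; the helper function is illustrative, not DVC's API):

```python
import hashlib

def dvc_cache_path(content: bytes) -> str:
    """Content-addressed path, DVC 2.x style: the first two hex
    characters become a directory, the rest the filename."""
    digest = hashlib.md5(content).hexdigest()
    return f".dvc/cache/{digest[:2]}/{digest[2:]}"

# Identical content always maps to the same cache entry,
# so re-adding the same data costs no extra storage.
print(dvc_cache_path(b"col_a,col_b\n1,2\n"))
```

`dvc push` then uploads these cache entries to the configured remote, and `dvc pull` restores them by looking up the hashes recorded in the .dvc files.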
5.3 DVC Setup
# 1. Initialize DVC in a Git repo
git init my-ml-project
cd my-ml-project
dvc init
git commit -m "Initialize DVC"
# 2. Configure remote storage (GCS)
dvc remote add -d gcs-remote gs://my-ml-bucket/dvc-store
dvc remote modify gcs-remote credentialpath /path/to/gcp-sa-key.json
git add .dvc/config
git commit -m "Configure DVC remote"
# 3. Track large files
dvc add data/raw/dataset.csv
git add data/raw/dataset.csv.dvc data/raw/.gitignore
git commit -m "Track dataset with DVC"
# 4. Push to remote
dvc push
# 5. On another machine: pull
git clone <repo>
dvc pull # downloads exact same data
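After step 2, the committed .dvc/config should look roughly like this (INI-style; values taken from the commands above):

```ini
[core]
    remote = gcs-remote
['remote "gcs-remote"']
    url = gs://my-ml-bucket/dvc-store
    credentialpath = /path/to/gcp-sa-key.json
```

Because this file lives in Git, every clone of the repo knows where the data lives; only the credentials stay out of version control.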
5.4 DVC Pipelines (dvc.yaml)
DVC pipelines define reproducible, dependency-tracked ML workflows:
# dvc.yaml
stages:
  # Stage 1: Data ingestion
  ingest:
    cmd: python src/ingest.py --source gs://raw-data/ --output data/raw/
    deps:
      - src/ingest.py
    outs:
      - data/raw/dataset.csv

  # Stage 2: Data preprocessing
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/dataset.csv
    outs:
      - data/processed/features.csv
      - data/processed/train.csv
      - data/processed/test.csv

  # Stage 3: Train model
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/train.csv
    params:
      - params.yaml:          # read hyperparams from params.yaml
          - model.n_estimators
          - model.learning_rate
          - model.max_depth
    outs:
      - models/model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false        # don't cache; always regenerate

  # Stage 4: Evaluate
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/test.csv
    metrics:
      - metrics/eval_metrics.json:
          cache: false
    plots:
      - plots/confusion_matrix.json
      - plots/roc_curve.json
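Running the pipeline produces a dvc.lock file next to dvc.yaml that pins each stage to the exact hashes of its inputs and outputs; committing it is what makes a run reproducible. A hypothetical fragment (hashes and sizes illustrative):

```yaml
schema: '2.0'
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - path: data/raw/dataset.csv
        md5: a3f1b2c4d5e6f7a8b9c0d1e2f3a4b5c6
        size: 5242880
    outs:
      - path: data/processed/train.csv
        md5: 0f1e2d3c4b5a69788796a5b4c3d2e1f0
        size: 3145728
```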
# params.yaml - tracked by Git, read by DVC
model:
  n_estimators: 200
  learning_rate: 0.05
  max_depth: 5
  random_state: 42
data:
  test_size: 0.2
  random_state: 42
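The `metrics:` entries in dvc.yaml point at plain JSON files that the training and evaluation scripts must write themselves. A minimal sketch of that side of the contract (the metric names and values are illustrative, taken from the experiment table in the next section):

```python
import json
import os

def write_metrics(path: str, metrics: dict) -> None:
    """Write a flat JSON file in the shape `dvc metrics show` expects."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)

# e.g. at the end of src/train.py:
write_metrics("metrics/train_metrics.json",
              {"accuracy": 0.8923, "f1_score": 0.8812})
```

Since the file is declared with `cache: false`, Git (not the DVC cache) versions it, so metric history travels with every commit.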
5.5 Reproducing and Comparing Experiments
# Reproduce the full pipeline (only re-runs changed stages)
dvc repro
# View the DAG
dvc dag
# Check what changed
dvc status
# Run experiment with different params
dvc exp run --set-param model.learning_rate=0.01 --name "lr-0.01"
dvc exp run --set-param model.n_estimators=500 --name "n500"
# Compare experiments
dvc exp show
# Output:
# | Experiment | accuracy | f1_score | n_estimators | learning_rate |
# | workspace  |   0.8923 |   0.8812 |          200 |          0.05 |
# | lr-0.01    |   0.8755 |   0.8643 |          200 |          0.01 |
# | n500       |   0.8967 |   0.8856 |          500 |          0.05 |  <- BEST
# Apply the best experiment
dvc exp apply n500
git add .
git commit -m "Apply best experiment: n500"
dvc push
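Conceptually, `dvc exp show` is comparing the metrics JSON files across experiments. A simplified sketch of what that comparison amounts to (not DVC's implementation; values from the table above):

```python
def metrics_diff(old: dict, new: dict) -> dict:
    """Per-metric delta between two flat metrics dicts."""
    return {k: round(new[k] - old[k], 4)
            for k in old.keys() & new.keys()}

# workspace vs. the n500 experiment
workspace = {"accuracy": 0.8923, "f1_score": 0.8812}
n500      = {"accuracy": 0.8967, "f1_score": 0.8856}
print(metrics_diff(workspace, n500))  # both metrics improve by 0.0044
```

The same idea underlies `dvc metrics diff`, which compares the committed metrics files of two Git revisions.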
5.6 DVC with GCS (Production Setup)
# src/preprocess.py - its inputs/outputs are tracked by DVC
import pandas as pd
import os
# DVC handles the path resolution
df = pd.read_csv("data/raw/dataset.csv")
# Preprocessing steps
df = df.dropna()
df = df.drop_duplicates()
# ... feature engineering ...
os.makedirs("data/processed", exist_ok=True)
df.to_csv("data/processed/features.csv", index=False)
print(f"Processed {len(df)} rows -> data/processed/features.csv")
# DVC remote storage options
dvc remote add myremote gs://bucket/path # Google Cloud Storage
dvc remote add myremote s3://bucket/path # AWS S3
dvc remote add myremote azure://container # Azure Blob
dvc remote add myremote ssh://server/path # SSH server
dvc remote add myremote /local/path # Local filesystem
5.7 DVC + Git: The Combined Workflow
FULL WORKFLOW:
Developer A (trains model):
    git pull                    → get latest code
    dvc pull                    → get latest data + model
    dvc exp run --set-param model.n_estimators=300
    git add dvc.lock params.yaml metrics/
    git commit -m "experiment: n=300 estimators"
    git push
    dvc push                    → upload new model to GCS

Developer B (reviews):
    git pull
    dvc pull                    → downloads Dev A's exact model
    dvc metrics show            → compare metrics
    dvc params diff HEAD~1      → see what params changed
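`dvc params diff` does the analogous comparison on params.yaml. For the nested dicts that YAML parses into, the core idea can be sketched as (simplified; not DVC's implementation):

```python
def params_diff(old: dict, new: dict, prefix: str = "") -> dict:
    """Flatten two nested param dicts and report changed leaves."""
    changes = {}
    for key in old.keys() | new.keys():
        dotted = f"{prefix}{key}"
        a, b = old.get(key), new.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            changes.update(params_diff(a, b, dotted + "."))
        elif a != b:
            changes[dotted] = {"old": a, "new": b}
    return changes

before = {"model": {"n_estimators": 200, "learning_rate": 0.05}}
after  = {"model": {"n_estimators": 300, "learning_rate": 0.05}}
print(params_diff(before, after))
# {'model.n_estimators': {'old': 200, 'new': 300}}
```

Dotted keys like `model.n_estimators` are exactly the names used in the `params:` section of dvc.yaml and in `dvc exp run --set-param`.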
5.8 DVC vs Git-LFS vs Plain S3
| Feature | DVC | Git-LFS | Plain S3 |
|---|---|---|---|
| Free storage | Yes (bring your own remote) | Limited quota | Costs money |
| Data versioning | Full | Full | Manual |
| Pipeline tracking | Yes | No | No |
| Experiment compare | Yes | No | No |
| Implemented in | Python | Go | N/A |
| Works with any remote | Yes | Limited | N/A |
| ML-specific features | Yes | No | No |
DVC wins for MLOps. Git-LFS is fine for binary game assets; DVC is purpose-built for ML.