
Chapter 05: DVC β€” Data Version Control

"Git tracks code. DVC tracks data, models, and experiments. Together they make ML reproducible."


5.1 What is DVC?

DVC (Data Version Control) is an open-source tool that adds Git-like version control to large data files, ML models, and experiment pipelines. It stores metadata in Git while the actual files live in remote storage (GCS, S3, Azure Blob, SSH, etc.).

The Problem DVC Solves

WITHOUT DVC:
  - data_v1.csv, data_v2.csv, data_final.csv (chaos)
  - Can't tell which dataset trained which model
  - 5 GB CSV committed to Git β†’ repo bloated and unusable
  - Colleague can't reproduce your results

WITH DVC:
  - data.csv.dvc (tiny text file) committed to Git
  - Actual data in GCS/S3 (versioned, deduplicated)
  - Every model links to exact data that created it
  - Anyone can: git checkout + dvc pull β†’ exact environment

5.2 How DVC Works

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  DVC ARCHITECTURE                     β”‚
β”‚                                                       β”‚
β”‚  Your Files:          Git Repo:        Remote Storage:β”‚
β”‚  data.csv  ─────────▢ data.csv.dvc ──▢ GCS / S3 /     β”‚
β”‚  model.pkl ─────────▢ model.pkl.dvc    Azure Blob     β”‚
β”‚                                                       β”‚
β”‚              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                 β”‚
β”‚              β”‚  data.csv.dvc       β”‚                 β”‚
β”‚              β”‚  ─────────────      β”‚                 β”‚
β”‚              β”‚  outs:              β”‚                 β”‚
β”‚              β”‚  - md5: a3f1b2...   β”‚                 β”‚
β”‚              β”‚    path: data.csv   β”‚                 β”‚
β”‚              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚
β”‚                (tiny, tracked by Git)                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
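The md5 in the metafile is what ties a tiny pointer file to exact content: DVC hashes the file and uses the digest both as a version identifier and as the file's address in its content-addressable cache. A minimal sketch of that hashing step (simplified; real DVC also records size and path, and hashes directories differently):

```python
# Simplified sketch of what `dvc add` records: an MD5 digest of the file
# contents. Real DVC also stores size/path fields and uses the digest to
# address the file in the cache and remote.
import hashlib

def file_md5(data: bytes) -> str:
    """Return the hex MD5 digest DVC would record in the .dvc metafile."""
    return hashlib.md5(data).hexdigest()

content = b"id,value\n1,foo\n"
digest = file_md5(content)
print(digest)  # 32-char hex string, recorded as 'md5: <digest>'
```

If the file changes by one byte, the digest changes, the .dvc file changes, and Git sees a new version β€” that is the whole linking mechanism.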

5.3 DVC Setup

# 1. Initialize DVC in a Git repo
git init my-ml-project
cd my-ml-project
dvc init
git commit -m "Initialize DVC"

# 2. Configure remote storage (GCS)
dvc remote add -d gcs-remote gs://my-ml-bucket/dvc-store
dvc remote modify gcs-remote credentialpath /path/to/gcp-sa-key.json
git add .dvc/config
git commit -m "Configure DVC remote"

# 3. Track large files
dvc add data/raw/dataset.csv
git add data/raw/dataset.csv.dvc data/raw/.gitignore
git commit -m "Track dataset with DVC"

# 4. Push to remote
dvc push

# 5. On another machine: pull
git clone <repo>
cd <repo>
dvc pull   # downloads the exact same data

5.4 DVC Pipelines (dvc.yaml)

DVC pipelines define reproducible, dependency-tracked ML workflows:

# dvc.yaml
stages:
  # Stage 1: Data ingestion
  ingest:
    cmd: python src/ingest.py --source gs://raw-data/ --output data/raw/
    deps:
      - src/ingest.py
    outs:
      - data/raw/dataset.csv

  # Stage 2: Data preprocessing
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/dataset.csv
    outs:
      - data/processed/features.csv
      - data/processed/train.csv
      - data/processed/test.csv

  # Stage 3: Train model
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/train.csv
    params:
      - params.yaml:            # reads hyperparams from yaml
          - model.n_estimators
          - model.learning_rate
          - model.max_depth
    outs:
      - models/model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false          # keep in Git, not the DVC cache

  # Stage 4: Evaluate
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/test.csv
    metrics:
      - metrics/eval_metrics.json:
          cache: false
    plots:
      - plots/confusion_matrix.json
      - plots/roc_curve.json

# params.yaml β€” parameter values tracked by DVC (the file itself lives in Git)
model:
  n_estimators: 200
  learning_rate: 0.05
  max_depth: 5
  random_state: 42

data:
  test_size: 0.2
  random_state: 42
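On the script side, the train stage typically loads these values itself. A minimal sketch assuming PyYAML is installed; the YAML is inlined as a string here so the snippet is self-contained, whereas a real src/train.py would open params.yaml:

```python
# Hypothetical sketch of how src/train.py could read its hyperparameters.
# The YAML is inlined for illustration; a real script would do
# yaml.safe_load(open("params.yaml")).
import yaml

PARAMS_YAML = """
model:
  n_estimators: 200
  learning_rate: 0.05
  max_depth: 5
"""

params = yaml.safe_load(PARAMS_YAML)
model_cfg = params["model"]
print(model_cfg["n_estimators"], model_cfg["learning_rate"])
```

DVC also exposes dvc.api.params_show() to load params programmatically when the dvc package itself is importable.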

5.5 Reproducing and Comparing Experiments

# Reproduce the full pipeline (only re-runs changed stages)
dvc repro

# View the DAG
dvc dag

# Check what changed
dvc status

# Run experiment with different params
dvc exp run --set-param model.learning_rate=0.01 --name "lr-0.01"
dvc exp run --set-param model.n_estimators=500 --name "n500"

# Compare experiments
dvc exp show

# Output:
# ┃ Experiment         ┃ accuracy ┃ f1_score ┃ n_estimators ┃ learning_rate ┃
# ┃ workspace          ┃ 0.8923   ┃ 0.8812   ┃ 200          ┃ 0.05          ┃
# ┃ lr-0.01            ┃ 0.8755   ┃ 0.8643   ┃ 200          ┃ 0.01          ┃
# ┃ n500               ┃ 0.8967   ┃ 0.8856   ┃ 500          ┃ 0.05          ┃ ← BEST

# Apply the best experiment
dvc exp apply n500
git add .
git commit -m "Apply best experiment: n500"
dvc push
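What dvc exp show summarizes is, at bottom, a comparison of the metrics JSON each run wrote. A rough Python equivalent of that comparison, using placeholder values from the table above (diff_metrics is a hypothetical helper, not part of DVC):

```python
# Hypothetical helper mirroring what `dvc metrics diff` computes: the change
# in each metric shared between two runs. Values below are placeholders.

def diff_metrics(old: dict, new: dict) -> dict:
    """Return new-minus-old for every metric present in both runs."""
    return {k: round(new[k] - old[k], 4) for k in old.keys() & new.keys()}

baseline = {"accuracy": 0.8923, "f1_score": 0.8812}   # workspace
candidate = {"accuracy": 0.8967, "f1_score": 0.8856}  # experiment n500
print(diff_metrics(baseline, candidate))
```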

5.6 DVC with GCS (Production Setup)

# src/preprocess.py β€” reads/writes tracked by DVC
import pandas as pd
import os

# This path is declared as a dep in dvc.yaml; the script just reads it
df = pd.read_csv("data/raw/dataset.csv")

# Preprocessing steps
df = df.dropna()
df = df.drop_duplicates()
# ... feature engineering ...

os.makedirs("data/processed", exist_ok=True)
df.to_csv("data/processed/features.csv", index=False)
print(f"Processed {len(df)} rows β†’ data/processed/features.csv")
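The evaluate stage does the mirror image on the output side: it writes metrics/eval_metrics.json, which dvc metrics show and dvc exp show then read. A minimal sketch with placeholder metric values:

```python
# Hypothetical sketch of src/evaluate.py's output step: write metrics as
# JSON so DVC can track and compare them. Values here are placeholders,
# not real model output.
import json
import os

metrics = {"accuracy": 0.8923, "f1_score": 0.8812}  # placeholder values

os.makedirs("metrics", exist_ok=True)
with open("metrics/eval_metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
print(f"Wrote {len(metrics)} metrics β†’ metrics/eval_metrics.json")
```
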

# DVC remote storage options
dvc remote add myremote gs://bucket/path      # Google Cloud Storage
dvc remote add myremote s3://bucket/path      # AWS S3
dvc remote add myremote azure://container/path  # Azure Blob
dvc remote add myremote ssh://server/path     # SSH server
dvc remote add myremote /local/path           # Local filesystem

5.7 DVC + Git: The Combined Workflow

FULL WORKFLOW:

Developer A (trains model):
  git pull                ← get latest code
  dvc pull                ← get latest data + model
  dvc exp run --set-param model.n_estimators=300
  git add dvc.lock params.yaml metrics/
  git commit -m "experiment: n=300 estimators"
  git push
  dvc push                ← upload new model to GCS

Developer B (reviews):
  git pull
  dvc pull                ← downloads Dev A's exact model
  dvc metrics show        ← compare metrics
  dvc params diff HEAD~1  ← see what params changed

5.8 DVC vs Git-LFS vs Plain S3

Feature                  DVC              Git-LFS     Plain S3
Free storage             Yes (external)   Limited     Costs money
Data versioning          Full             Full        Manual
Pipeline tracking        βœ… Yes           ❌ No        ❌ No
Experiment compare       βœ… Yes           ❌ No        ❌ No
Language                 Python           Any         Any
Works with any remote    βœ… Yes           Limited     N/A
ML-specific features     βœ… Yes           ❌ No        ❌ No

DVC wins for MLOps. Git-LFS is fine for binary game assets; DVC is purpose-built for ML.


Next β†’ Chapter 06: Data Quality & Validation