Chapter 20: Feature StoresπŸ”—

"A Feature Store prevents training-serving skew β€” one of the biggest sources of ML bugs in production."


20.1 What is a Feature Store?πŸ”—

A Feature Store is a centralized data layer that stores, manages, and serves ML features consistently for both training and serving (inference).

The Problem Without a Feature StoreπŸ”—

WITHOUT FEATURE STORE:

  Training Pipeline:              Serving Pipeline:
    age = row["birth_year"]         age = current_year - user.dob
          vs                              β‰  DIFFERENT CALCULATION!
    income = annual / 12            income = monthly_salary
          vs                              β‰  DIFFERENT COLUMN!

Result:
  Training accuracy: 0.92
  Production accuracy: 0.71 ← Training-serving skew!
  Root cause: Features computed differently at training vs serving

WITH FEATURE STORE:
  One definition, used everywhere.
  Training and serving use IDENTICAL feature values.
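The fix can be made concrete: both pipelines import the same feature functions instead of re-deriving them independently. A minimal sketch (module and function names are illustrative, not from any library):

```python
# shared_features.py -- one definition, imported by BOTH training and serving
from datetime import date

def age_from_birth_year(birth_year: int) -> int:
    """Derive age the same way everywhere, from one canonical input."""
    return date.today().year - birth_year

def monthly_income(annual_income: float) -> float:
    """Normalize income to a monthly figure in one place."""
    return annual_income / 12
```

With this pattern, a change to a feature definition automatically propagates to both pipelines β€” which is exactly the guarantee a feature store generalizes.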

20.2 Feature Store ArchitectureπŸ”—

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       FEATURE STORE                                  β”‚
β”‚                                                                      β”‚
β”‚  DATA SOURCES          FEATURE PIPELINE          FEATURE STORE       β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚Streaming │──────────▢              β”‚          β”‚                β”‚  β”‚
β”‚  β”‚(Kafka)   β”‚          β”‚  Feature     │─────────▢│  Online Store  β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚  Engineering β”‚          β”‚  (Redis/       β”‚  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚  (Python /   β”‚          β”‚   Bigtable)    β”‚  β”‚
β”‚  β”‚Batch DB  │──────────▢  dbt/Spark)  β”‚          β”‚  Low latency   β”‚  β”‚
β”‚  β”‚(BigQuery)β”‚          β”‚              │─────────▢│  Offline Store β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚  (GCS/BigQuery)β”‚  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                    β”‚  High volume   β”‚  β”‚
β”‚  β”‚Files/CSV │──────────────────────────────────▢ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                             β”‚          β”‚
β”‚                                                           β”‚          β”‚
β”‚               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€          β”‚
β”‚               β–Ό                                          β–Ό          β”‚
β”‚         Training                                    Serving          β”‚
β”‚         Pipeline                                    System           β”‚
β”‚         (offline store                             (online store     β”‚
β”‚          β†’ batch read)                              β†’ real-time)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
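The offline/online split in the diagram can be illustrated with a toy in-memory model (purely hypothetical β€” real stores use a warehouse and a key-value database): the offline store is an append-only history of timestamped rows, and materialization copies the latest value per entity into the online store for fast key lookups.

```python
# Toy illustration of the two stores (in-memory, not a real feature store)
offline_store = [  # append-only history, keyed by entity + event time
    {"customer_id": "C001", "event_timestamp": "2024-01-01", "age": 35},
    {"customer_id": "C001", "event_timestamp": "2024-02-01", "income": 66000},
]

def materialize(offline_rows):
    """Copy the latest feature values per entity into an online key-value store."""
    online = {}
    for row in sorted(offline_rows, key=lambda r: r["event_timestamp"]):
        online.setdefault(row["customer_id"], {}).update(
            {k: v for k, v in row.items()
             if k not in ("customer_id", "event_timestamp")}
        )
    return online

online_store = materialize(offline_store)
# online_store["C001"] now holds the freshest value of every feature
```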

20.3 Feast β€” Open Source Feature StoreπŸ”—

Feast is the most widely used open-source feature store, with pluggable backends for BigQuery, GCS, Redis, Datastore, and more.

Feast SetupπŸ”—

# pip install feast[gcp]

# feature_repo/feature_store.yaml
project: churn_feature_store
registry: gs://my-bucket/feast/registry.db
provider: gcp
online_store:
  type: datastore
offline_store:
  type: bigquery
  dataset: feast_offline

# feature_repo/features.py
from datetime import timedelta
from feast import Entity, Feature, FeatureView, BigQuerySource, ValueType

# ── Define Entity ──────────────────────────────────────────────────
customer = Entity(
    name="customer_id",
    description="Customer identifier",
    value_type=ValueType.STRING,
)

# ── Define Data Source ─────────────────────────────────────────────
customer_source = BigQuerySource(
    table="my-project.feast_data.customer_features",
    event_timestamp_column="event_timestamp",
)

# ── Define Feature View ────────────────────────────────────────────
customer_features = FeatureView(
    name="customer_features",
    entities=["customer_id"],
    ttl=timedelta(days=30),           # features expire after 30 days
    features=[
        Feature(name="age", dtype=ValueType.INT64),
        Feature(name="income", dtype=ValueType.FLOAT),
        Feature(name="tenure_months", dtype=ValueType.INT64),
        Feature(name="num_products", dtype=ValueType.INT64),
        Feature(name="monthly_charges", dtype=ValueType.FLOAT),
        Feature(name="avg_support_calls_30d", dtype=ValueType.FLOAT),
    ],
    source=customer_source,
)

# Shell: apply feature definitions to the registry
feast apply

# Materialize features (batch load into online store)
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")

Using Features in TrainingπŸ”—

from feast import FeatureStore
import pandas as pd
from datetime import datetime

store = FeatureStore(repo_path="feature_repo/")

# Training: pull historical features
entity_df = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003"],
    "event_timestamp": [datetime(2024, 1, 1)] * 3,
})

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_features:age",
        "customer_features:income",
        "customer_features:tenure_months",
        "customer_features:monthly_charges",
    ],
).to_df()

print(training_df.head())
# customer_id  age  income  tenure_months  monthly_charges
# C001         35   65000   12            75.50
# C002         28   42000   6             45.00
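Under the hood, get_historical_features performs a point-in-time join: for each entity row it picks the latest feature value at or before that row's event_timestamp, never a later one, which prevents label leakage from "future" data. A simplified sketch of that lookup (Feast actually executes this in the offline store's warehouse, not in Python):

```python
# Simplified point-in-time lookup (illustrative only)
def point_in_time_lookup(feature_rows, entity_id, as_of):
    """Return the newest feature row for entity_id with timestamp <= as_of."""
    candidates = [r for r in feature_rows
                  if r["customer_id"] == entity_id
                  and r["event_timestamp"] <= as_of]
    return max(candidates, key=lambda r: r["event_timestamp"], default=None)

rows = [
    {"customer_id": "C001", "event_timestamp": "2023-12-15", "income": 64000},
    {"customer_id": "C001", "event_timestamp": "2024-01-20", "income": 65000},
]
result = point_in_time_lookup(rows, "C001", "2024-01-01")
# Picks the 2023-12-15 row: the January update is "future" data for this label
```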

Using Features in Serving (Real-Time)πŸ”—

from feast import FeatureStore

store = FeatureStore(repo_path="feature_repo/")

# Online serving: get latest features for a customer
feature_vector = store.get_online_features(
    features=[
        "customer_features:age",
        "customer_features:income",
        "customer_features:tenure_months",
        "customer_features:monthly_charges",
    ],
    entity_rows=[{"customer_id": "C001"}]
).to_dict()

print(feature_vector)
# {'age': [35], 'income': [65000], 'tenure_months': [13], 'monthly_charges': [75.5]}
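In a serving endpoint, this dict-of-lists is typically flattened into the column order the model was trained on before calling predict. A hedged sketch (the feature order constant is illustrative):

```python
# Assemble Feast's online feature dict into an ordered model input row
FEATURE_ORDER = ["age", "income", "tenure_months", "monthly_charges"]

def to_model_input(feature_vector: dict) -> list:
    """Flatten the dict-of-lists into one row, in training column order."""
    return [feature_vector[name][0] for name in FEATURE_ORDER]

row = to_model_input({"age": [35], "income": [65000],
                      "tenure_months": [13], "monthly_charges": [75.5]})
# row -> [35, 65000, 13, 75.5], ready for model.predict([row])
```

Pinning the column order in one shared constant is another small guard against skew: training and serving cannot silently disagree on feature position.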

20.4 Vertex AI Feature StoreπŸ”—

Vertex AI Feature Store is Google's managed feature store, deeply integrated with BigQuery and Vertex AI.

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Create Feature Store
featurestore = aiplatform.Featurestore.create(
    featurestore_id="churn_features",
    online_store_fixed_node_count=1,
)

# Create Entity Type (like a table)
customer_entity = featurestore.create_entity_type(
    entity_type_id="customer",
    description="Customer entity for churn prediction",
)

# Create Features
customer_entity.batch_create_features(
    feature_configs={
        "age":              {"value_type": "INT64"},
        "income":           {"value_type": "DOUBLE"},
        "tenure_months":    {"value_type": "INT64"},
        "monthly_charges":  {"value_type": "DOUBLE"},
        "plan":             {"value_type": "STRING"},
    }
)

# Ingest features from BigQuery
customer_entity.ingest_from_bq(
    feature_ids=["age", "income", "tenure_months", "monthly_charges"],
    feature_time="event_timestamp",
    bq_source_uri="bq://my-project.features.customer_features",
    entity_id_field="customer_id",
)

# Read for training (batch)
featurestore.batch_serve_to_bq(
    bq_destination_output_uri="bq://my-project.training_data.features",
    serving_feature_ids={"customer": ["age", "income", "tenure_months"]},
    read_instances_uri="bq://my-project.training_data.entity_list",
)

# Read for serving (online) β€” latest values as a pandas DataFrame
feature_df = customer_entity.read(
    entity_ids=["C001"],
    feature_ids=["age", "income", "tenure_months", "monthly_charges"],
)

20.5 Feature Store ComparisonπŸ”—

Feature           Feast              Vertex AI FS      Hopsworks         Tecton
────────────────  ─────────────────  ────────────────  ────────────────  ────────────────
Type              Open source        Managed (GCP)     Commercial        Commercial
Cost              Free + infra       Per node/hour     Enterprise        Enterprise
Online serving    Redis/Datastore    Managed           Managed           Managed
Offline serving   BigQuery/File      BigQuery          HDFS/S3           Any
Streaming         Kafka support      Pub/Sub           Kafka             Kafka
GCP integration   Good               Native            Good              Good
Best for          Open-source teams  GCP-first orgs    Large enterprise  Large enterprise

20.6 When to Use a Feature StoreπŸ”—

USE a Feature Store when:
  βœ… Multiple models use the same features
  βœ… Training-serving skew is causing issues
  βœ… Features take hours to compute
  βœ… Team has 5+ data scientists
  βœ… Real-time predictions needed
  βœ… Feature reuse across teams is important

SKIP for now if:
  ❌ Single model, single team
  ❌ Batch predictions only
  ❌ Small dataset, fast feature computation
  ❌ Early-stage exploration
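The checklist above can be encoded as a rough adoption heuristic. This is a sketch with illustrative thresholds, not a prescriptive rule β€” the function name and the "two or more signals" cutoff are assumptions:

```python
# Rough heuristic encoding the checklist above (thresholds are illustrative)
def should_use_feature_store(num_models_sharing_features: int,
                             team_size: int,
                             needs_realtime: bool,
                             feature_compute_hours: float) -> bool:
    signals = [
        num_models_sharing_features > 1,   # feature reuse across models
        team_size >= 5,                    # enough people to amortize the cost
        needs_realtime,                    # online serving path required
        feature_compute_hours >= 1,        # expensive features worth caching
    ]
    return sum(signals) >= 2  # adopt once a couple of signals appear
```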

Next β†’ Chapter 21: GCP & Vertex AI Deep Dive