Chapter 30: Metadata & Lineage Tracking

"Lineage answers the question: where did this model come from, and what data created it?"


Why Metadata Matters

WITHOUT LINEAGE:
  Production model accuracy dropped.
  Question: "Which version of the data trained this model?"
  Answer: "We don't know."  ← ☠️

WITH LINEAGE:
  model:prod:v3.2
    trained_on: dataset:churn:v8
      sourced_from: BigQuery query #4521 (2024-01-15)
    training_code: github.com/org/repo/commit/a3f1bc2
    metrics: accuracy=0.91, f1=0.88
    deployed_by: jenkins/build/1234 (2024-01-16)
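
The record above is a chain of "derived from" links, so answering "which data trained this model?" amounts to following those links back to the source. A minimal sketch in plain Python, assuming a hypothetical `LineageRecord` class and `trace` helper (neither belongs to any library) and reusing the example values from the record:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LineageRecord:
    """One node in a lineage chain (hypothetical structure for illustration)."""
    name: str
    derived_from: Optional["LineageRecord"] = None
    details: Dict[str, str] = field(default_factory=dict)


def trace(record: Optional[LineageRecord]) -> List[str]:
    """Follow derived_from links from a record back to its original source."""
    chain = []
    while record is not None:
        chain.append(record.name)
        record = record.derived_from
    return chain


# Values copied from the example record above.
query = LineageRecord("BigQuery query #4521 (2024-01-15)")
dataset = LineageRecord("dataset:churn:v8", derived_from=query)
model = LineageRecord(
    "model:prod:v3.2",
    derived_from=dataset,
    details={
        "training_code": "github.com/org/repo/commit/a3f1bc2",
        "deployed_by": "jenkins/build/1234 (2024-01-16)",
    },
)

print(" <- ".join(trace(model)))
```

Without the `derived_from` links, `trace` would have nothing to follow, which is the "We don't know" case.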

ML Metadata (MLMD)

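MLMD is the open-source library that TFX uses to record pipeline metadata; Vertex AI Pipelines records the same kind of information through Vertex ML Metadata, which builds on MLMD's concepts. It models lineage with three kinds of records: artifacts (datasets, models), executions (pipeline steps such as a training run), and events that mark an artifact as the input or output of an execution.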

from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Setup store
connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.sqlite.filename_uri = "mlmd.sqlite"
store = metadata_store.MetadataStore(connection_config)

# Register types
dataset_type = metadata_store_pb2.ArtifactType(name="Dataset")
dataset_type.properties["location"] = metadata_store_pb2.STRING
model_type = metadata_store_pb2.ArtifactType(name="Model")
model_type.properties["accuracy"] = metadata_store_pb2.DOUBLE
store.put_artifact_type(dataset_type)
store.put_artifact_type(model_type)

# Log dataset artifact
dataset = metadata_store_pb2.Artifact(
    type_id=store.get_artifact_type("Dataset").id,
    uri="gs://bucket/data/train_v8.csv",
    properties={"location": metadata_store_pb2.Value(string_value="gs://bucket/")}
)
[dataset_id] = store.put_artifacts([dataset])

# Log model artifact
model_art = metadata_store_pb2.Artifact(
    type_id=store.get_artifact_type("Model").id,
    uri="gs://bucket/models/v3.2/",
    properties={"accuracy": metadata_store_pb2.Value(double_value=0.91)}
)
[model_id] = store.put_artifacts([model_art])

# Link dataset to model (lineage!)
training_exec_type = metadata_store_pb2.ExecutionType(name="Training")
store.put_execution_type(training_exec_type)

execution = metadata_store_pb2.Execution(
    type_id=store.get_execution_type("Training").id
)
[exec_id] = store.put_executions([execution])

# Events: this dataset was INPUT, this model was OUTPUT
input_event = metadata_store_pb2.Event(
    artifact_id=dataset_id,
    execution_id=exec_id,
    type=metadata_store_pb2.Event.INPUT
)
output_event = metadata_store_pb2.Event(
    artifact_id=model_id,
    execution_id=exec_id,
    type=metadata_store_pb2.Event.OUTPUT
)
store.put_events([input_event, output_event])

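# Query lineage: walk the events backwards from the model to its inputs
model_events = store.get_events_by_artifact_ids([model_id])
producer_ids = [e.execution_id for e in model_events
                if e.type == metadata_store_pb2.Event.OUTPUT]
input_ids = [e.artifact_id
             for e in store.get_events_by_execution_ids(producer_ids)
             if e.type == metadata_store_pb2.Event.INPUT]
for upstream in store.get_artifacts_by_id(input_ids):
    print(f"Model traced to: {upstream.uri}")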
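
The INPUT and OUTPUT events recorded above form a graph, and real pipelines usually have more than one step between the raw data and the model. The sketch below shows how a multi-hop upstream walk could work. It deliberately uses plain tuples rather than MLMD calls so it runs without a metadata store; the `upstream_artifacts` function and the `raw_table` and `extract` names are hypothetical, introduced only for illustration.

```python
from collections import defaultdict, deque

INPUT, OUTPUT = "INPUT", "OUTPUT"


def upstream_artifacts(events, artifact_id):
    """Return every artifact that contributed, directly or indirectly, to artifact_id.

    events: iterable of (artifact_id, execution_id, event_type) tuples,
    mirroring the fields of the Event records logged above.
    """
    produced_by = defaultdict(set)  # artifact id -> executions that output it
    consumed_by = defaultdict(set)  # execution id -> artifacts it read
    for art, exe, kind in events:
        if kind == OUTPUT:
            produced_by[art].add(exe)
        elif kind == INPUT:
            consumed_by[exe].add(art)

    # Breadth-first walk: artifact -> producing execution -> its inputs.
    seen = set()
    queue = deque([artifact_id])
    while queue:
        current = queue.popleft()
        for exe in produced_by[current]:
            for parent in consumed_by[exe]:
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
    return seen


# Hypothetical two-step pipeline: extract a dataset, then train a model on it.
events = [
    ("raw_table", "extract", INPUT),
    ("dataset", "extract", OUTPUT),
    ("dataset", "training", INPUT),
    ("model", "training", OUTPUT),
]

print(sorted(upstream_artifacts(events, "model")))  # ['dataset', 'raw_table']
```

Against a real store, the same walk could be driven by `get_events_by_artifact_ids` and `get_events_by_execution_ids` instead of an in-memory list.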
