Chapter 30: Metadata & Lineage Tracking
"Lineage answers the question: where did this model come from, and what data created it?"
Why Metadata Matters
WITHOUT LINEAGE:
  Production model accuracy dropped.
  Question: "Which version of the data trained this model?"
  Answer: "We don't know." ← ☠️

WITH LINEAGE:
  model:prod:v3.2
    trained_on: dataset:churn:v8
    sourced_from: BigQuery query #4521 (2024-01-15)
    training_code: github.com/org/repo/commit/a3f1bc2
    metrics: accuracy=0.91, f1=0.88
    deployed_by: jenkins/build/1234 (2024-01-16)
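The WITH LINEAGE record above can be sketched as a small data structure. This is a minimal illustration using only the standard library; the `LineageRecord` class, its field names, and the `training_data_for` helper are hypothetical, chosen to mirror the fields listed above, and are not part of any library.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence


@dataclass
class LineageRecord:
    """Hypothetical record mirroring the fields in the example above."""
    model: str
    trained_on: str
    sourced_from: str
    training_code: str
    metrics: Dict[str, float] = field(default_factory=dict)
    deployed_by: str = ""


def training_data_for(records: Sequence[LineageRecord], model: str) -> Optional[str]:
    """Answer "which version of the data trained this model?" from stored records."""
    for record in records:
        if record.model == model:
            return record.trained_on
    return None  # No lineage recorded for this model.


# Values copied from the example above.
record = LineageRecord(
    model="model:prod:v3.2",
    trained_on="dataset:churn:v8",
    sourced_from="BigQuery query #4521 (2024-01-15)",
    training_code="github.com/org/repo/commit/a3f1bc2",
    metrics={"accuracy": 0.91, "f1": 0.88},
    deployed_by="jenkins/build/1234 (2024-01-16)",
)

print(training_data_for([record], "model:prod:v3.2"))
```

With a record like this stored for every deployed model, the question from the WITHOUT LINEAGE case becomes a lookup rather than an investigation.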
ML Metadata (MLMD)
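ML Metadata (MLMD) is the open-source metadata library that TFX uses to record pipeline runs; Vertex AI Pipelines relies on Vertex ML Metadata, a managed service built on the same concepts. MLMD models lineage with three record types: artifacts (datasets, models), executions (pipeline steps such as a training run), and events that link an artifact to an execution as an input or an output. The example below registers a dataset and a model, then links them through a training execution.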
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2
# Setup store
connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.sqlite.filename_uri = "mlmd.sqlite"
store = metadata_store.MetadataStore(connection_config)
# Register types
dataset_type = metadata_store_pb2.ArtifactType(name="Dataset")
dataset_type.properties["location"] = metadata_store_pb2.STRING
model_type = metadata_store_pb2.ArtifactType(name="Model")
model_type.properties["accuracy"] = metadata_store_pb2.DOUBLE
store.put_artifact_type(dataset_type)
store.put_artifact_type(model_type)
# Log dataset artifact
dataset = metadata_store_pb2.Artifact(
    type_id=store.get_artifact_type("Dataset").id,
    uri="gs://bucket/data/train_v8.csv",
    properties={"location": metadata_store_pb2.Value(string_value="gs://bucket/")},
)
[dataset_id] = store.put_artifacts([dataset])
# Log model artifact
model_art = metadata_store_pb2.Artifact(
    type_id=store.get_artifact_type("Model").id,
    uri="gs://bucket/models/v3.2/",
    properties={"accuracy": metadata_store_pb2.Value(double_value=0.91)},
)
[model_id] = store.put_artifacts([model_art])
# Link dataset to model (lineage!)
training_exec_type = metadata_store_pb2.ExecutionType(name="Training")
store.put_execution_type(training_exec_type)
execution = metadata_store_pb2.Execution(
    type_id=store.get_execution_type("Training").id,
)
[exec_id] = store.put_executions([execution])
# Events: this dataset was INPUT, this model was OUTPUT
input_event = metadata_store_pb2.Event(
    artifact_id=dataset_id,
    execution_id=exec_id,
    type=metadata_store_pb2.Event.INPUT,
)
output_event = metadata_store_pb2.Event(
    artifact_id=model_id,
    execution_id=exec_id,
    type=metadata_store_pb2.Event.OUTPUT,
)
store.put_events([input_event, output_event])
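# Query lineage: what artifacts led to this model?
# Step 1: find the execution(s) that produced the model (its OUTPUT events).
producer_exec_ids = [
    e.execution_id
    for e in store.get_events_by_artifact_ids([model_id])
    if e.type == metadata_store_pb2.Event.OUTPUT
]
# Step 2: collect the artifacts those executions consumed (their INPUT events).
input_ids = [
    e.artifact_id
    for e in store.get_events_by_execution_ids(producer_exec_ids)
    if e.type == metadata_store_pb2.Event.INPUT
]
for artifact in store.get_artifacts_by_id(input_ids):
    print(f"Model traced to: {artifact.uri}")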
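The INPUT and OUTPUT events recorded above form a graph, and tracing lineage means walking that graph backwards from an artifact. The sketch below illustrates the walk over a toy in-memory event list, using only the standard library. It does not call MLMD: the `Event` tuple, the `upstream_artifacts` function, the `Preprocess` step, and the `raw_export` artifact are all hypothetical, introduced here for illustration.

```python
from collections import deque
from typing import List, NamedTuple


class Event(NamedTuple):
    """Toy stand-in for an MLMD event: links one artifact to one execution."""
    artifact: str
    execution: str
    kind: str  # "INPUT" or "OUTPUT"


def upstream_artifacts(events: List[Event], start: str) -> List[str]:
    """Return every artifact that `start` was derived from, nearest first."""
    seen: List[str] = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        # Executions that produced the current artifact.
        producers = {
            e.execution
            for e in events
            if e.artifact == current and e.kind == "OUTPUT"
        }
        # Inputs of those executions are one hop further upstream.
        for e in events:
            if e.execution in producers and e.kind == "INPUT" and e.artifact not in seen:
                seen.append(e.artifact)
                queue.append(e.artifact)
    return seen


# Hypothetical two-step pipeline: raw export -> dataset -> model.
events = [
    Event("raw_export", "Preprocess", "INPUT"),
    Event("dataset:churn:v8", "Preprocess", "OUTPUT"),
    Event("dataset:churn:v8", "Training", "INPUT"),
    Event("model:prod:v3.2", "Training", "OUTPUT"),
]

print(upstream_artifacts(events, "model:prod:v3.2"))
```

The breadth-first order lists the closest ancestors first, which is usually what a debugging session wants: check the training dataset before the raw export it was built from.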