[D] How do you track data lineage in your ML pipelines? Most teams I’ve talked to do it manually (or not at all)
I’m a PhD student researching ML reproducibility, and one thing that keeps surprising me is how many teams have no systematic way to track which data went into which model.
The typical workflow I see (and have been guilty of myself):
- Load some CSVs
- Clean and transform them through a chain of pandas operations
- Train a model
- Three months later, someone asks “what data was this model trained on?” and you’re digging through old notebooks trying to reconstruct the answer
The academic literature on reproducibility keeps pointing to data provenance as a core problem: papers can’t be replicated because the exact data pipeline isn’t documented. And now that the EU AI Act requires data documentation for high-risk AI systems (Article 10), this is becoming a regulatory requirement too, not just good practice.
I’ve been working on an approach to this as part of my PhD research: function hooking to automatically intercept pandas/numpy I/O operations and record the full lineage graph without any manual logging. The idea is that you add one import line and your existing code is tracked, with no MLflow experiment setup, no decorator syntax, and no config files.
I built it into an open-source tool called AutoLineage (pip install autolineage). It’s early, just hit v0.1.0, but it tracks reads/writes across pandas, numpy, pickle, and joblib, generates visual lineage graphs, and can produce EU AI Act compliance reports.
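For anyone unfamiliar with function hooking, here is a minimal sketch of the general idea using only the standard library. It is not AutoLineage's actual code or API: `hook_io`, `LINEAGE_LOG`, and the record fields are hypothetical names made up for illustration, and it patches `pickle.load`/`pickle.dump` rather than pandas so that it runs without extra dependencies.

```python
import functools
import hashlib
import os
import pickle
import tempfile

# Hypothetical in-memory store; a real tool would persist records somewhere.
LINEAGE_LOG = []


def _file_sha256(path):
    """Hash file contents so a record pins the exact bytes read or written."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def hook_io(module, func_name, kind, file_arg_index):
    """Replace module.func_name with a wrapper that logs the file it touches.

    Returns the original function so the hook can be undone with setattr.
    """
    original = getattr(module, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        result = original(*args, **kwargs)
        # Find the file object whether it was passed positionally or by name.
        if len(args) > file_arg_index:
            fh = args[file_arg_index]
        else:
            fh = kwargs.get("file")
        path = getattr(fh, "name", None)
        if isinstance(path, str):  # skip in-memory buffers and raw descriptors
            if kind == "write":
                fh.flush()  # make sure the bytes are on disk before hashing
            LINEAGE_LOG.append({
                "op": kind,
                "func": f"{module.__name__}.{func_name}",
                "path": os.path.abspath(path),
                "sha256": _file_sha256(path),
            })
        return result

    setattr(module, func_name, wrapper)
    return original


# Install the hooks once; later calls through the module are recorded.
hook_io(pickle, "dump", "write", file_arg_index=1)
hook_io(pickle, "load", "read", file_arg_index=0)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.pkl")
    with open(path, "wb") as fh:
        pickle.dump({"rows": 3}, fh)
    with open(path, "rb") as fh:
        data = pickle.load(fh)

for record in LINEAGE_LOG:
    print(record["op"], record["func"], os.path.basename(record["path"]))
```

One known limitation of patching module attributes this way: code that ran `from pickle import load` before the hook was installed keeps a reference to the original function and is not tracked.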
I’m curious about a few things from this community:
- How do you currently handle data lineage? MLflow? DVC? Manual documentation? Nothing?
- What’s the biggest pain point? Is it the initial tracking, or more the “6 months later someone needs to audit this” problem?
- Would zero-config automatic tracking actually be useful to you, or is the manual approach fine because you need more control over what gets logged?
Genuinely looking for feedback on whether this is a real problem worth solving or if existing tools handle it well enough. The academic framing suggests it’s a gap, but I want to hear from practitioners.
GitHub: https://github.com/kishanraj41/autolineage
PyPI: https://pypi.org/project/autolineage/
submitted by /u/Achilles_411