[D] How do you actually track which data transformations went into your trained models?
I keep running into this problem and wondering if I’m just disorganized or if this is a real gap:
**The scenario:**
- Train a model in January, get 94% accuracy
- Write paper, submit to conference
- Reviewer in March asks: "Can you reproduce this with different random seeds?"
- I go back to my code and… which dataset version did I use? Which preprocessing script? Did I merge the demographic data before or after normalization?
**What I've tried:**
- Git commits (but I forget to commit datasets)
- MLflow (tracks experiments, not data transformations)
- Detailed comments in notebooks (works until I have 50 notebooks)
- "Just being more disciplined" (lol)
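One lightweight pattern that partly closes the git/MLflow gap (a sketch, not a full lineage tool): hash the exact data files at train time and write the hashes into a manifest stored next to the run's outputs, so the run record points back at the bytes you actually trained on. Everything here is stdlib; the file names and the `extra` fields are hypothetical examples.

```python
import hashlib
import json
import pathlib


def file_sha256(path):
    """Hash a file in chunks so large datasets don't blow up memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(data_files, out_path="run_manifest.json", extra=None):
    """Record which exact bytes (by hash) went into this training run."""
    manifest = {
        "files": {str(p): file_sha256(p) for p in data_files},
        **(extra or {}),  # e.g. seed, git commit, preprocessing script name
    }
    pathlib.Path(out_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Called once at the top of a training script (with the seed and the preprocessing script's git commit passed in `extra`), the March reviewer question becomes a dict lookup instead of archaeology. It doesn't version the data itself, just pins its identity, which is often enough to notice "the CSV changed between runs."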
**My question:** How do you handle this? Do you:
1. Use a specific tool that tracks data lineage well?
2. Have a workflow/discipline that just works?
3. Also struggle with this and wing it every time?
I’m especially curious about people doing LLM fine-tuning – with multiple dataset versions, prompts, and preprocessing steps, how do you keep track of what went where?
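For the fine-tuning case specifically, one idea I've seen floated is to freeze every input (dataset hashes, the literal prompt template, the ordered preprocessing steps, the seed) into a single "recipe" record whose hash becomes the run ID, so two runs with the same recipe ID provably used the same inputs. A minimal sketch, assuming you already have per-file hashes from somewhere; all names here are made up:

```python
import hashlib
import json


def freeze_recipe(dataset_hashes, prompt_template, preprocessing_steps, seed):
    """Pin every input to a fine-tuning run in one content-addressed record."""
    recipe = {
        "dataset_hashes": dataset_hashes,      # e.g. {"train.jsonl": "ab12..."}
        "prompt_template": prompt_template,    # the literal template string
        "preprocessing": preprocessing_steps,  # ordered list of step names
        "seed": seed,
    }
    # sort_keys makes the serialization, and therefore the ID, deterministic
    blob = json.dumps(recipe, sort_keys=True).encode()
    recipe["recipe_id"] = hashlib.sha256(blob).hexdigest()[:12]
    return recipe
```

The nice property is that any change, even swapping the order of two preprocessing steps, produces a different `recipe_id`, so "what went where" is answerable by comparing twelve characters.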
Not looking for perfect solutions – just want to know I’m not alone or if there’s something obvious I’m missing.
What’s your workflow?
submitted by /u/Achilles_411