[D] How do you actually track which data transformations went into your trained models?

I keep running into this problem and wondering if I’m just disorganized or if this is a real gap:

The scenario:

- Train a model in January, get 94% accuracy
- Write paper, submit to conference
- Reviewer in March asks: “Can you reproduce this with different random seeds?”
- I go back to my code and… which dataset version did I use? Which preprocessing script? Did I merge the demographic data before or after normalization?

What I’ve tried:

- Git commits (but I forget to commit datasets)
- MLflow (tracks experiments, not data transformations; my hash-logging workaround is sketched below)
- Detailed comments in notebooks (works until I have 50 notebooks)
- “Just being more disciplined” (lol)
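For context, the closest thing I have to a working version of this is content-hashing the inputs and attaching them to the MLflow run as tags. A minimal sketch of what I mean; the paths and tag names are made up, not a real pipeline:

```python
import hashlib

import mlflow  # assumes a normal MLflow setup; tags are just key/value strings


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Content-hash a file so the run pins the exact bytes, not just a filename."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


with mlflow.start_run():
    # Pin every input I might later forget: the raw data, the preprocessing
    # script, and the flag that controls the merge/normalize order.
    mlflow.set_tag("data.train_sha256", sha256_file("data/train.parquet"))
    mlflow.set_tag("code.preprocess_sha256", sha256_file("preprocess.py"))
    mlflow.log_param("merge_demographics_before_norm", True)
    # ...train and log metrics as usual...
```

This catches “which file did I actually use”, but not transformations that only ever live inside a notebook cell and never touch disk, which is where most of mine happen.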

My question: How do you handle this? Do you:

1. Use a specific tool that tracks data lineage well?
2. Have a workflow/discipline that just works?
3. Also struggle with this and wing it every time?

I’m especially curious about people doing LLM fine-tuning – with multiple dataset versions, prompts, and preprocessing steps, how do you keep track of what went where?
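For what it’s worth, the furthest I’ve gotten on the fine-tuning side is writing a small per-run manifest that pins the dataset bytes, the prompt template, and the order of preprocessing steps. A minimal sketch; every name and path here is invented:

```python
import hashlib
import json
import time


def sha256_file(path: str) -> str:
    # Fine for SFT-sized files; stream in chunks for anything big.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"

manifest = {
    "run_id": "ft-2025-01-15-a",
    "dataset": {
        "path": "data/sft_v3.jsonl",
        "version": "v3",
        "sha256": sha256_file("data/sft_v3.jsonl"),
    },
    "prompt_template_sha256": sha256_text(PROMPT_TEMPLATE),
    # Ordered on purpose: the order is exactly what I forget by March.
    "preprocessing": ["dedupe", "strip_pii", "merge_demographics", "normalize"],
    "created_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
}

with open("ft-2025-01-15-a.manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

It only works if I remember to update the preprocessing list whenever I change a step, which is the original discipline problem all over again.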

Not looking for perfect solutions – I just want to know whether I’m alone in this or missing something obvious.

What’s your workflow?

submitted by /u/Achilles_411