[P] How do you regression-test ML systems when correctness is fuzzy? (OSS tool)
I’ve repeatedly run into the same issue when working with ML / NLP systems (and more recently LLM-based ones): there often isn’t a single correct answer – only better or worse behavior – and small changes can have non-local effects across the system.

Traditional testing approaches (assertions, snapshot tests, benchmarks) tend to break down here:

- failures don’t explain what changed
- evaluation is expensive
- tests become brittle or get ignored

We ended up building a review-driven regression testing approach […]
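To make the idea concrete, here is a minimal sketch of what "review-driven" regression testing can look like in practice (this is my own illustrative example, not the tool itself): run a fixed set of cases through the current system, diff the outputs against a committed baseline, and emit a human-reviewable report instead of a hard pass/fail. The file names and the `run_model` placeholder are hypothetical.

```python
import json
import difflib
from pathlib import Path

BASELINE = Path("baseline_outputs.json")   # committed snapshot of prior outputs (hypothetical name)
REPORT = Path("regression_report.md")      # diff report handed to a human reviewer

def run_model(prompt: str) -> str:
    """Placeholder for the system under test; swap in the real model/pipeline call."""
    return f"echo: {prompt}"

def regression_review(cases: list[str]) -> None:
    # Load the last accepted outputs, if any.
    baseline = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
    current = {case: run_model(case) for case in cases}

    lines = ["# Regression review", ""]
    for case, new_out in current.items():
        old_out = baseline.get(case, "")
        if old_out == new_out:
            continue  # unchanged behavior needs no review
        diff = difflib.unified_diff(
            old_out.splitlines(), new_out.splitlines(),
            fromfile="baseline", tofile="current", lineterm="",
        )
        lines += [f"## {case}", "```diff", *diff, "```", ""]

    REPORT.write_text("\n".join(lines))
    # A reviewer inspects the report; if the changes are acceptable,
    # the current outputs are promoted to become the new baseline, e.g.:
    # BASELINE.write_text(json.dumps(current, indent=2))

if __name__ == "__main__":
    regression_review(["summarize this ticket", "classify this email"])
```

The key difference from a snapshot test is that a behavioral change is not automatically a failure: the diff is surfaced for review, and only explicitly accepted changes update the baseline.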