[P] I built an autonomous ML agent that runs experiments on tabular data indefinitely – inspired by Karpathy’s AutoResearch
Inspired by Andrej Karpathy’s AutoResearch, I built a system where Claude Code acts as an autonomous ML researcher on tabular binary classification tasks (churn, conversion, etc.).
You give it a dataset. It loops forever: analyze the data, form a hypothesis, edit code, run the experiment, evaluate with expanding time windows (train on the past, predict the future – no leakage), keep or revert via git. It edits only 3 files – feature engineering, model hyperparams, and analysis code. Everything else is locked down.
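The keep-or-revert step above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: the file names and the `dry_run` flag are assumptions for the example.

```python
import subprocess

# Illustrative names for the three editable files; the real repo may differ.
EDITABLE_FILES = ["features.py", "model_params.py", "analysis.py"]

def keep_or_revert(score: float, best_score: float, dry_run: bool = True) -> float:
    """Commit the current edits if the score improved; otherwise revert them.

    dry_run skips the actual git calls so the logic can be tested in isolation.
    """
    if score > best_score:
        if not dry_run:
            subprocess.run(["git", "commit", "-am", f"keep: score {score:.4f}"], check=True)
        return score
    if not dry_run:
        # Throw away the failed experiment's edits to the whitelisted files.
        subprocess.run(["git", "checkout", "--", *EDITABLE_FILES], check=True)
    return best_score
```

Because only three files are ever edited, a `git checkout -- <files>` is enough to make a failed experiment leave no trace.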
Edit: To clarify based on some comments, I am using this to solve the problem of finding new signals to add to the model, not trying to overfit a limited dataset. -end Edit-
Key design decisions:
- An analysis loop in addition to the experiment loop – this allows the agent to reflect on results before committing to the next experiment.
- Optimized for experiment throughput through several decisions: LightGBM as the default model, limits on feature count and tree count, and a lock that holds the training run until it finishes.
- Constrained editing surface: only 3 files + logs. No infrastructure changes, no package installs. Without this, the agent will eventually try to modify the evaluation code to “improve” its score.
- Docker sandbox – the agent runs with full shell access (--dangerously-skip-permissions). The container keeps it contained.
- Expanding time windows instead of k-fold – the score is the mean across multiple temporal train/test splits.
- Forced logging – every experiment gets a LOG.md entry (hypothesis, result, takeaway). Significant insights go to LEARNING.md. You can read the agent’s reasoning after the fact.
- Analysis primitives built-in – univariate AUC, correlation pairs, null rates, feature importance, error analysis. The agent writes analysis code using these to save time, and they also serve as initial suggestions for the first few analyses.
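The expanding-window evaluation can be sketched with scikit-learn's `TimeSeriesSplit`, which implements exactly this scheme (each fold trains on all earlier rows and tests on the next block). Logistic regression stands in for LightGBM here to keep the example self-contained; the function name is my own, not the repo's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in for LightGBM
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

def expanding_window_score(X, y, n_splits=4):
    """Mean AUC over expanding temporal splits.

    Rows are assumed to be in time order; each fold trains on everything
    before the cut and tests on the next block, so there is no look-ahead.
    """
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        preds = model.predict_proba(X[test_idx])[:, 1]
        scores.append(roc_auc_score(y[test_idx], preds))
    return float(np.mean(scores))
```

Averaging across several temporal folds rewards signals that keep working on future data, which is the whole point of replacing k-fold.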
What I learned building this:
- Air-tight evaluation is essential for real improvement – this lesson hit me twice:
- An earlier version didn’t constrain which files the agent could edit; it eventually changed the evaluation code to make “improvement” easier for itself.
- I originally used k-fold validation; the agent found “improvements” that were actually data leakage and didn’t hold up out-of-time. After a painful manual inspection, I switched to expanding time windows.
- Do everything to protect experiment throughput – this lesson also hit twice:
- Initially, I let the model run wild and was not very impressed when it barely ran 20 experiments overnight. It turned out the agent had engineered thousands of new features that slowed down training and crashed some runs by hitting the RAM limit. I added the feature count and tree count limits to keep training time reasonable.
- Despite that, the agent still managed to crash or slow down training runs by launching many of them as background processes at the same time. I implemented a locking mechanism to prevent two experiments from running concurrently. After this, the rate of progress increased to hundreds of runs per day.
- Persistent memory is important – without forced logging, the agent would repeat experiments it had already tried. The LOG.md and LEARNING.md system gives it memory across iterations.
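A locking mechanism like the one that fixed the concurrent-runs problem can be sketched with an atomic lock file. This is a generic pattern, assumed rather than taken from the repo (the path and function name are illustrative):

```python
import atexit
import os

LOCK_PATH = "experiment.lock"  # illustrative path, not the repo's actual name

def acquire_lock(path: str = LOCK_PATH) -> bool:
    """Atomically create the lock file; return False if another run holds it."""
    try:
        # O_EXCL makes creation atomic: only one process can win the race,
        # so two experiments can never start training at the same time.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
    os.close(fd)
    # Release the lock automatically when this run exits.
    atexit.register(lambda: os.path.exists(path) and os.remove(path))
    return True
```

An agent-launched background run that fails to acquire the lock can simply exit (or wait), which is enough to serialize training.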
The code is open source (sanitized version): https://github.com/trantrikien239/autoresearch-tabular
Of course it was built with Claude Code, but it has improved so much over rounds of iteration, including manual edits, that I think it’s worth sharing.
submitted by /u/Pancake502