Helios: a verifiable-reward (RLVR) environment for ETL optimization — frozen-policy agent, ground-truth equivalence + runtime rewards
Helios is an LLM agent that proposes optimizations for Databricks ETL jobs and verifies them end-to-end — same output, faster runtime. The framing: ETL optimization as a verifiable-reward (RLVR) environment. The reward channel is diff_tables (byte-level output equivalence) and measured runtime delta — both deterministic ground truth, not learned reward models.
How it works
- Point at a prod
job_id+task_key. Helios never modifies prod — frozen mutation guards on the prod job id, application-layer write guard on every SQL. - It clones the task into a sandbox: source tables pinned via Delta
TIMESTAMP AS OFaligned to the prod task’s start time; prod boundary pinned viaVERSION AS OF. - An LLM agent investigates (
EXPLAIN, plan inspection, skew probes), proposes a rewrite, runs it in isolation, verifies viadiff_tables. Iterates within the run on failure. - Emits a
proposal.mdwith diff, equivalence proof, perf number, and the full audit trail.
The parts where most “LLM-for-SQL” demos break:
- Magnitude-relative float tolerance (
atol + rtol·max(|a|,|b|)) so a correct rewrite that perturbs DOUBLE sums at ~1e-13 (inherent to IEEE-754 reduction reorder under different parallelism) doesn’t false-fail. DECIMAL/INT/string stay byte-exact via a type gate. - LLM nondeterminism detector that reads the SQL and classifies every output column: untied
ROW_NUMBER ORDER BYargmax, order-sensitive aggregates,current_timestamp()run-stamps, etc. Self-authorizing classes (non-pure by language) get auto-excluded behind a strict name+type gate; data-derived ones (the dangerous class) are surfaced for human sign-off — never silently ignored. - Empirical tie-break corroboration: for probe-required columns, automatically joins prod-vs-sandbox on the stable key and checks whether differing carried attributes correlate with matching
ORDER BYsibling (→ tie-break, safe) or differing siblings (→ real bug, don’t ship). - Incremental task handling: detects
INSERT INTO/MERGE INTOnotebooks, materializes a partition-bounded prod-increment view (v_post WHERE date='…' EXCEPT v_pre), diffs against the sandbox’s daily increment — not against the table’s historical accumulation. - Isolation baseline for honest Tier-3 perf: runs the original notebook in the sandbox to separate true algebra impact from prod cluster co-tenant contention relief.
Live result on one prod task: 28.3M-row daily increment, byte-identical to prod, +34% runtime vs prod median.
Honest framing: Helios is the environment half of RLVR — verifiable reward, well-shaped episodes, structured trajectories (messages.json + streamed trace.jsonl with reasoning text alongside tool I/O). The agent currently operates as a frozen policy under in-context adaptation; we’re accumulating (state, action, reward) trajectories but haven’t closed the training loop with an offline RL/SFT pass yet. That’s the next step.
Repo: https://github.com/dvakhil8/helios
Happy to answer questions about the equivalence-check internals, the safety model, or where this is most likely to break.
submitted by /u/Available-Subject-76
[link] [comments]