Helios: a verifiable-reward (RLVR) environment for ETL optimization — frozen-policy agent, ground-truth equivalence + runtime rewards

Helios is an LLM agent that proposes optimizations for Databricks ETL jobs and verifies them end-to-end — same output, faster runtime. The framing: ETL optimization as a verifiable-reward (RLVR) environment. The reward channel is diff_tables (byte-level output equivalence) and measured runtime delta — both deterministic ground truth, not learned reward models.

How it works

  1. Point at a prod job_id + task_key. Helios never modifies prod — frozen mutation guards on the prod job id, application-layer write guard on every SQL.
  2. It clones the task into a sandbox: source tables pinned via Delta TIMESTAMP AS OF aligned to the prod task’s start time; prod boundary pinned via VERSION AS OF.
  3. An LLM agent investigates (EXPLAIN, plan inspection, skew probes), proposes a rewrite, runs it in isolation, verifies via diff_tables. Iterates within the run on failure.
  4. Emits a proposal.md with diff, equivalence proof, perf number, and the full audit trail.

The parts where most “LLM-for-SQL” demos break:

  • Magnitude-relative float tolerance (atol + rtol·max(|a|,|b|)) so a correct rewrite that perturbs DOUBLE sums at ~1e-13 (inherent to IEEE-754 reduction reorder under different parallelism) doesn’t false-fail. DECIMAL/INT/string stay byte-exact via a type gate.
  • LLM nondeterminism detector that reads the SQL and classifies every output column: untied ROW_NUMBER ORDER BY argmax, order-sensitive aggregates, current_timestamp() run-stamps, etc. Self-authorizing classes (non-pure by language) get auto-excluded behind a strict name+type gate; data-derived ones (the dangerous class) are surfaced for human sign-off — never silently ignored.
  • Empirical tie-break corroboration: for probe-required columns, automatically joins prod-vs-sandbox on the stable key and checks whether differing carried attributes correlate with matching ORDER BY sibling (→ tie-break, safe) or differing siblings (→ real bug, don’t ship).
  • Incremental task handling: detects INSERT INTO/MERGE INTO notebooks, materializes a partition-bounded prod-increment view (v_post WHERE date='…' EXCEPT v_pre), diffs against the sandbox’s daily increment — not against the table’s historical accumulation.
  • Isolation baseline for honest Tier-3 perf: runs the original notebook in the sandbox to separate true algebra impact from prod cluster co-tenant contention relief.

Live result on one prod task: 28.3M-row daily increment, byte-identical to prod, +34% runtime vs prod median.

Honest framing: Helios is the environment half of RLVR — verifiable reward, well-shaped episodes, structured trajectories (messages.json + streamed trace.jsonl with reasoning text alongside tool I/O). The agent currently operates as a frozen policy under in-context adaptation; we’re accumulating (state, action, reward) trajectories but haven’t closed the training loop with an offline RL/SFT pass yet. That’s the next step.

Repo: https://github.com/dvakhil8/helios

Happy to answer questions about the equivalence-check internals, the safety model, or where this is most likely to break.

submitted by /u/Available-Subject-76
[link] [comments]

Liked Liked