I Built a CLI That Measures AI Agent Judgment Tilt Through Blind Debates

We have lots of benchmarks for AI agent correctness and capability. We have far fewer tools for measuring something subtler: when an agent reads two competent, well-argued positions on a hard topic and picks one — what pattern is driving those picks?
That’s what I mean by judgment tilt — the systematic tendency to reward certain arguments over others when both sides are internally consistent and well-structured. It’s shaped by training data, RLHF tuning, and system prompt conditioning. In my early validation runs, even a vanilla model with no system prompt showed measurable tilt — on one topic, the baseline scored -0.50 on a Stability axis and -0.40 on Tradition. In those runs, the pattern only became visible once I forced blind comparisons.
So I extracted the engine from an earlier project and turned it into Tiltgent — an open-source CLI.
What Tiltgent Actually Does
Tiltgent measures your AI agent's judgment tilt through structured blind debates.
You give it your agent’s system prompt and a topic. Tiltgent generates 10 debate rounds on escalating sub-questions derived from that topic. Each round pits two arguments against each other, written by calibrated “archetype” agents drawn from a roster of 21 distinct worldviews. Each archetype sits at a specific coordinate on five ideological axes, with its own system prompt, signature rhetorical move, and vocabulary constraints that prevent close neighbors from blurring together.
Your agent reads both arguments blind — no labels, no names — and picks the winner. To ensure stability and generate reliability metrics, Tiltgent runs the target agent three times with consensus voting across all runs. It also runs a separate vanilla baseline on the same topic and subtracts its scores, so you’re measuring your agent’s preferences, not the persuasion bias baked into the debate archetypes.
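The consensus and calibration steps can be sketched in a few lines. This is a minimal illustration with assumed function names and data shapes, not Tiltgent's actual implementation; the example axis values reuse the baseline numbers quoted earlier.

```python
from collections import Counter

def consensus_pick(picks: list[str]) -> tuple[str, float]:
    """Majority vote across repeated runs of the target agent on one round.

    Returns the winning side ("A" or "B") and the agreement rate,
    i.e. the fraction of runs that agreed with the majority.
    """
    counts = Counter(picks)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(picks)

def calibrated_scores(agent: dict[str, float], baseline: dict[str, float]) -> dict[str, float]:
    """Subtract the vanilla baseline's per-axis scores from the target
    agent's, so what remains is prompt-induced tilt rather than the
    persuasion bias of the debate archetypes themselves."""
    return {axis: round(agent[axis] - baseline[axis], 2) for axis in agent}

# Three runs of the target agent on one round: two picked A, one picked B.
winner, agreement = consensus_pick(["A", "A", "B"])

# Axis scores before and after baseline subtraction (baseline values
# taken from the validation runs mentioned above).
raw = {"stability_dynamism": 0.45, "tradition_reinvention": 0.55}
vanilla = {"stability_dynamism": -0.50, "tradition_reinvention": -0.40}
tilt = calibrated_scores(raw, vanilla)
```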
After 10 rounds, the pattern of picks produces a structured judgment tilt profile:
{
  "archetype_name": "Emergence Realist",
  "contradiction_line": "You champion market forces and emergent systems, but you go cold when they threaten actual human competence and welfare.",
  "dimensions": {
    "order_emergence": 0.65,
    "humanist_systems": -0.30,
    "stability_dynamism": 0.45,
    "local_coordinated": 0.10,
    "tradition_reinvention": 0.55
  },
  "stability": { "pick_agreement_rate": 0.93, "unstable_rounds": 1 }
}
The full profile includes scored dimensions, stability and reliability metrics, and interpretation fields for the agent’s preference pattern.
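If you want to read a saved profile programmatically, one way is to map each dimension score back to its pole. A sketch under an assumed sign convention (negative toward the first pole, positive toward the second), inferred from the example profile rather than documented by Tiltgent:

```python
import json

# The five axes as (negative pole, positive pole) pairs. The sign
# convention is an assumption inferred from the example profile,
# not a documented contract.
AXES = {
    "order_emergence": ("Order", "Emergence"),
    "humanist_systems": ("Humanist", "Systems-first"),
    "stability_dynamism": ("Stability", "Dynamism"),
    "local_coordinated": ("Local agency", "Coordinated scale"),
    "tradition_reinvention": ("Tradition", "Reinvention"),
}

def describe(profile: dict) -> list[str]:
    """Render each dimension score as a 'leans <pole> (strength)' line."""
    lines = []
    for axis, score in profile["dimensions"].items():
        neg, pos = AXES[axis]
        pole = pos if score >= 0 else neg
        lines.append(f"{axis}: leans {pole} ({abs(score):.2f})")
    return lines

profile = json.loads("""{
  "dimensions": {"order_emergence": 0.65, "humanist_systems": -0.30}
}""")
summary = describe(profile)
```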
Three commands:
- tiltgent eval — runs the full evaluation. Takes a system prompt file and a topic. Outputs a JSON profile.
- tiltgent diff — compares two saved profiles. Shows where tilt shifted between prompt versions or topics. Zero API calls, instant.
- tiltgent inspect — pretty-prints a saved profile to the terminal.
Three production dependencies. MIT licensed. Bring your own Anthropic API key.
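The diff step needs no API calls because it is pure arithmetic on two saved JSON profiles. A minimal sketch of that comparison (my own illustration, with the profile shape taken from the example above, not from Tiltgent's source):

```python
def diff_profiles(before: dict, after: dict) -> dict[str, float]:
    """Per-axis change between two saved profiles: a positive delta
    means the agent moved toward the second pole of that axis."""
    return {
        axis: round(after["dimensions"][axis] - before["dimensions"][axis], 2)
        for axis in before["dimensions"]
    }

# Two hypothetical profiles saved before and after a prompt change.
v1 = {"dimensions": {"order_emergence": 0.65, "stability_dynamism": 0.45}}
v2 = {"dimensions": {"order_emergence": 0.40, "stability_dynamism": 0.50}}
delta = diff_profiles(v1, v2)
```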
The Measuring Instrument
The 21 archetypes are the measuring instrument.
They’re positioned across five dimensions: Order↔Emergence, Humanist↔Systems-first, Stability↔Dynamism, Local agency↔Coordinated scale, and Tradition↔Reinvention. Each one has a system prompt with a signature move, a go-to accusation, and a line it refuses to concede. Pairing uses Euclidean distance on the 5-axis vectors to enforce minimum ideological separation between debate opponents.
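The pairing rule can be sketched directly: compute the Euclidean distance between two archetypes' 5-axis vectors and only pair them if they exceed a minimum separation. The coordinates and threshold below are made up for illustration; the real roster values live in the repo.

```python
import math
from itertools import combinations

def distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two 5-axis ideological vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def valid_pairs(roster: dict[str, list[float]], min_sep: float):
    """All archetype pairs far enough apart to make a real debate."""
    return [
        (p, q) for p, q in combinations(roster, 2)
        if distance(roster[p], roster[q]) >= min_sep
    ]

# Toy 3-archetype roster; coordinates and threshold are illustrative.
roster = {
    "Emergence Realist": [0.8, 0.1, 0.5, 0.0, 0.4],
    "Cautious Humanist": [-0.4, -0.9, -0.6, 0.2, -0.3],
    "Near Neighbor": [0.7, 0.2, 0.4, 0.1, 0.3],
}
pairs = valid_pairs(roster, min_sep=1.0)
# "Near Neighbor" sits too close to "Emergence Realist" to be paired
# against it; the other two pairings clear the threshold.
```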
The roster went through triple audits with ChatGPT, Gemini, and Grok before it was locked. That produced 14 vector corrections, 11 prompt sharpenings with vocabulary constraints, two archetype merges (blind tests confirmed they were indistinguishable), and three new archetypes to fill gaps the auditors flagged.
A universal instruction in every debate prompt constrains rhetorical style across archetypes, which is the defense against the obvious vulnerability: an agent picking arguments based on writing style rather than substance. Without that constraint, a more dramatic archetype wins on prose, not worldview, and the measurement drifts.
The full archetype roster and all system prompts are public in the CLI repo under MIT license. You can read every prompt and decide for yourself whether the instrument is sound.
How I Got Here
The engine started as a web app where humans played the blind debates — judge five rounds, get a worldview profile at the end. It worked, but I didn’t want to operate a hosted service. So I pivoted to an open-source CLI that developers run locally with their own API key.
The engine extracted cleanly from the original codebase — seven files, zero modifications needed. Earlier security audits had forced clean architecture by requiring every function to accept an API client as a parameter. I hadn’t planned for portability; the audits just happened to produce it.
During thesis validation, I ran four synthetic agents (Hard Accelerationist, Cautious Humanist, Coordinated Systems Thinker, and a vanilla baseline) across two topics with repeated runs at temperature zero. The same agent made identical picks every time — the signal was stable, not noisy. Different prompt-conditioned agents produced clearly separated profiles on the same topic: after calibration, the Humanist and Systems Thinker showed 0.93 separation on the Humanist↔Systems axis. And the vanilla baseline showed measurable tilt that varied by topic, which is why per-topic calibration became mandatory.
What You Can Do With It
Prompt regression testing. You changed your agent’s system prompt. Did its judgment tilt shift? Run tiltgent eval before and after, then tiltgent diff to see exactly which dimensions moved and by how much.
Cross-topic profiling. Run the same agent against different topics. Does your “balanced, helpful assistant” stay balanced on healthcare but tilt hard toward markets on economic questions?
Model comparison. Same prompt, different underlying model. Does switching models change which arguments your agent prefers?
Pre-deployment diagnostic. If your agent summarizes, triages, or recommends, you may want to inspect its argumentative preferences before it ships.
What This Isn’t
Tiltgent doesn’t tell you your agent is “biased” in a moral sense. It tells you which direction your agent tilts when faced with competing competent arguments. Whether that tilt is a problem depends on your use case.
An agent built for a libertarian think tank should tilt toward markets and individual sovereignty. An agent built for public health policy should weight institutional coordination. The profile isn’t a report card. It’s a diagnostic.
Directly asking an agent its opinions often produces socially desirable or inconsistent answers. The blind debate format forces revealed preferences under comparison pressure — the agent can’t hedge both ways when it has to pick a winner.
It also doesn’t test for factual accuracy, hallucination, or reasoning quality. Existing benchmarks cover those. Tiltgent tests what happens when the facts aren’t in dispute but the values are.
Known Limitations
This is not a finished measurement science. It’s a v0.1 diagnostic with known open questions.
The hardest problem is keeping the archetypes rhetorically balanced. I’ve already found signs that some archetypes may be more persuasive than others on certain topics — one archetype won 4 out of 4 test matchups, likely because its “trace the second-order consequences” style reads as authoritative regardless of subject. Win-rate balance and style contamination are ongoing calibration work, not solved science. The per-topic vanilla baseline helps, but it doesn’t eliminate archetype-level persuasion bias entirely.
There’s also a self-preference confounder worth noting: in this version, an Anthropic model (Claude) generates the archetype arguments and an Anthropic model judges them. The vanilla calibration subtracts baseline preferences, which reduces this effect, but it remains an open confound until the tool supports non-Anthropic debate generators.
The tool currently requires an Anthropic API key and the debate archetypes run on Claude. I haven’t validated it against GPT-4, Gemini, or open-weight models as target agents — those are unknown territory. The engine itself is model-agnostic in principle; it only requires the target agent to return a structured pick. Validation on non-Anthropic models is next.
All validation so far is from synthetic test agents and controlled thesis-validation runs. I haven’t tested against production agents making real downstream decisions. The use cases I describe above are where I believe the tool is headed, not where it’s been proven.
Try It
npx tiltgent eval --prompt your-agent.txt --topic "Universal basic income"
One command. About five minutes — that’s the time for generating a 10-question escalation, running a vanilla baseline calibration, and running your agent 3× for consensus. Expect roughly $0.25–0.30 in API costs per evaluation (your key, your cost).
GitHub: github.com/selfradiance/tiltgent-cli
npm: tiltgent
The present version answers a practical question: if you change the prompt, change the model, or change the topic, does your agent’s judgment tilt move with it? Tiltgent gives you a way to measure that instead of guessing.
Originally published in Towards AI on Medium.