Agent Behavior Lab: a reproducible harness for factorial experiments on tool-using LLM agent behavior

I’ve open-sourced Agent Behavior Lab, a platform for running controlled, factorial experiments on tool-using LLM agents.

Design. An experiment is the Cartesian product of four factors: model × tool (with renamed alias variants) × system persona × prior conversation history. Each cell is run for N trials, and each trial is scored by a configurable judge (deterministic keyword/tool-call detection or an LLM judge).
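To make the cell structure concrete, here is a minimal Python sketch of how a factorial grid expands into trials. The factor levels and the helper names (`build_cells`, `build_trials`) are hypothetical and chosen for illustration; they are not taken from the repo's seed data or code.

```python
from itertools import product

# Hypothetical factor levels, for illustration only (not the repo's seed data).
factors = {
    "model": ["model-a", "model-b"],
    "tool": ["delete_file", "remove_item"],  # second entry stands in for a renamed alias
    "persona": ["neutral", "cautious"],
    "history": ["none", "prior_refusal"],
}


def build_cells(factors):
    """Return one dict per cell of the full Cartesian product of factor levels."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


def build_trials(factors, n_trials):
    """Repeat every cell n_trials times, tagging each copy with a trial index."""
    return [
        {**cell, "trial": t}
        for cell in build_cells(factors)
        for t in range(n_trials)
    ]


cells = build_cells(factors)
trials = build_trials(factors, n_trials=5)
```

With two levels per factor, this gives 2 × 2 × 2 × 2 = 16 cells, and 5 trials per cell gives 80 trials in total. The grid grows multiplicatively with each added level, which is worth keeping in mind when budgeting API calls.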

Analysis. The harness reports safety failure rate (SFR) and severity-weighted failure rate with Wilson intervals, per-factor breakdowns, and cross-factor SFR heatmaps. It also reports effect sizes (alias/persona/history) alongside logistic-regression coefficients and odds ratios. Raw trials and CSV export are available for external analysis.
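For readers who have not used it, the Wilson score interval for a per-cell failure rate can be sketched as below. This is the standard textbook formula, not necessarily how the harness computes it; the function name `wilson_interval` and the default `z=1.96` (roughly a 95% interval) are assumptions for illustration.

```python
import math


def wilson_interval(failures, trials, z=1.96):
    """Wilson score interval for a binomial proportion (standard formula).

    failures: number of trials judged as safety failures
    trials:   total trials in the cell
    z:        normal quantile; 1.96 is assumed here for roughly 95% coverage
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = failures / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


lo_zero, hi_zero = wilson_interval(0, 20)   # cell with no observed failures
lo_half, hi_half = wilson_interval(5, 10)   # cell with half the trials failing
```

Unlike the normal-approximation interval, the Wilson interval keeps a nonzero width when a cell has 0 or N failures, which is why it suits small per-cell trial counts.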

Reproducibility. Ships with versioned YAML seed data (models, tools, personas, histories, judges, and pre-built sessions) so the reference experiments are reproducible out of the box. It sends tool definitions to model APIs and records attempted calls; it does not execute tools.

Repo: https://github.com/Null-Square/agent-behavior-lab

I'm interested in feedback on the statistical treatment (effect-size definitions, handling of saturated cells, multiple-comparison concerns) and on additional judge designs.

submitted by /u/IcyPop8985