Thinking about a research round after a public AI poker agent competition
I’m working on Poker Arena, an AI poker agent competition focused on imperfect-information decision-making.
The public event has already been announced and got a surprising amount of attention: around 600,000 social views and 300+ registrations so far.
The public round is mainly for broad participation: builders submit agents, run them through the arena, and see how they perform against other bots / reference opponents.
After that, we’re planning a smaller researcher round with BenchFlow (around 25 seats).
The goal of the researcher round would be more technical: use what we learn from the public round to study evaluation design, variance, failure modes, and agent behavior under uncertainty.
Poker is tricky as a benchmark because raw bankroll and win rate are noisy. A bot can make a good decision and lose the hand, or make a bad decision and still look strong over a short sample.
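To make the noise concrete, here is a minimal sketch of reporting bb/100 with a confidence interval instead of a single number. The function name and the input format (one result per hand, in big blinds) are assumptions for illustration, not the Arena's actual scoring code, and the normal approximation treats hands as independent, which is only roughly true.

```python
import math
import statistics


def bb_per_100_with_ci(hand_results_bb, z=1.96):
    """Return (bb/100, lower bound, upper bound) for a list of per-hand results.

    hand_results_bb: hypothetical input, one win/loss per hand in big blinds.
    z: normal quantile; 1.96 gives an approximate 95% interval.
    """
    n = len(hand_results_bb)
    if n < 2:
        raise ValueError("need at least two hands to estimate variance")
    mean = statistics.fmean(hand_results_bb)
    sd = statistics.stdev(hand_results_bb)  # sample standard deviation per hand
    half_width = z * sd / math.sqrt(n)      # standard error scaled by z
    return 100 * mean, 100 * (mean - half_width), 100 * (mean + half_width)


# 100 hands alternating +1 bb and -1 bb: the point estimate is 0,
# and the interval is still roughly +/- 20 bb/100 wide.
rate, low, high = bb_per_100_with_ci([1.0, -1.0] * 50)
```

If the interval for two agents overlaps heavily, the ranking between them is probably not telling us much yet.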
The current builder loop is roughly:
- build a `decide(table)` policy
- test locally against simple bots
- run Arena previews against a reference panel
- score with bb/100
- inspect losing hands, positions, chip deltas, and traces
- patch systematic leaks
- rerun across more hands / tables
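As an illustration of the "inspect positions and chip deltas" step, a small sketch that breaks bb/100 down by seat. The record format (dicts with `position` and `delta_bb` keys) is hypothetical; the real Arena trace format may look different.

```python
from collections import defaultdict


def bb_per_100_by_position(hands):
    """Group per-hand results by position and return bb/100 for each seat.

    hands: iterable of dicts with assumed keys "position" and "delta_bb".
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for hand in hands:
        totals[hand["position"]] += hand["delta_bb"]
        counts[hand["position"]] += 1
    return {pos: 100 * totals[pos] / counts[pos] for pos in totals}


# Hypothetical trace records, just to show the shape of the output.
sample = [
    {"position": "BTN", "delta_bb": 2.0},
    {"position": "BTN", "delta_bb": 1.0},
    {"position": "SB", "delta_bb": -0.5},
    {"position": "SB", "delta_bb": -1.5},
]
by_seat = bb_per_100_by_position(sample)
```

A seat that loses far more than the blinds alone would explain is a candidate leak, though with few hands per seat the same variance caveat applies.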
For the researcher round, I’m especially interested in questions like:
– How do we separate policy quality from short-term variance?
– What metrics should matter besides bb/100?
– How should we evaluate risk management and confidence calibration?
– How do we avoid overfitting to a fixed reference panel?
– What baselines should be included: heuristics, CFR, NFSP, DeepCFR, LLM-assisted agents, solver lookup?
– What traces would be most useful for post-match analysis?
– How many hands / opponents are needed before results become meaningful?
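On the last question, a back-of-the-envelope sketch using the standard normal-approximation sample size formula, n = (z * sd / margin)^2. The per-hand standard deviation of 10 bb used below is an assumed placeholder, not something measured in Poker Arena, and the formula again assumes independent hands.

```python
import math


def hands_needed(sd_per_hand_bb, margin_bb_per_100, z=1.96):
    """Estimate hands required for a confidence interval of +/- margin_bb_per_100.

    sd_per_hand_bb: assumed standard deviation of a single hand's result, in bb.
    margin_bb_per_100: desired half-width of the interval, in bb/100.
    z: normal quantile; 1.96 corresponds to roughly 95% confidence.
    """
    if sd_per_hand_bb <= 0 or margin_bb_per_100 <= 0:
        raise ValueError("standard deviation and margin must be positive")
    margin_per_hand = margin_bb_per_100 / 100
    return math.ceil((z * sd_per_hand_bb / margin_per_hand) ** 2)


# With an assumed sd of 10 bb per hand, pinning bb/100 down to +/- 5
# takes on the order of 150,000 hands.
n = hands_needed(sd_per_hand_bb=10.0, margin_bb_per_100=5.0)
```

Halving the margin quadruples the hand count, which is one reason variance-reduction techniques look attractive for a small researcher round.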
My current instinct is that the research round should track more than the final result: exploitability, risk-adjusted return, opponent adaptation, decision consistency across similar spots, and failure modes like overconfidence under uncertainty.
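One possible way to put a number on "decision consistency across similar spots" is sketched below. The idea of a `spot_key` that buckets similar situations is an assumption for illustration; how to bucket spots well is itself an open design question.

```python
from collections import Counter, defaultdict


def decision_consistency(decisions):
    """Return, per spot, the share of decisions that match the most common action.

    decisions: iterable of (spot_key, action) pairs, where spot_key is a
    hypothetical label grouping similar situations.
    """
    by_spot = defaultdict(Counter)
    for spot_key, action in decisions:
        by_spot[spot_key][action] += 1
    return {
        spot: counts.most_common(1)[0][1] / sum(counts.values())
        for spot, counts in by_spot.items()
    }


# Hypothetical log: the agent folds twice and calls once in the same spot.
log = [
    ("BTN_vs_3bet", "fold"),
    ("BTN_vs_3bet", "fold"),
    ("BTN_vs_3bet", "call"),
    ("BB_vs_minraise", "call"),
]
scores = decision_consistency(log)
```

A low score is not automatically bad, since a sound poker strategy often mixes actions in the same spot. The metric is probably most useful for flagging spots where the variation looks unintentional.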
For people working on RL, self-play, imperfect-information games, or agent evaluation:
What would you want to see in a post-public research round to make the results useful as a benchmark, not just a tournament?
If this is your area and you’d like to participate or give feedback, happy to chat. We’re trying to make the researcher round small and high-signal.
submitted by /u/xoleni