[R] Higher effort settings reduce deep research accuracy for GPT-5 and Gemini 3 Flash

We evaluated 22 model configurations across different effort/thinking levels on Deep Research Bench (169 web research tasks, human-verified answers). For two of the most capable models, higher effort settings scored worse.

GPT-5 at low effort scored 0.496 on DRB. At high effort, it dropped to 0.481 while costing 56% more per query ($0.25 → $0.39). Gemini 3 Flash showed a 2.5-point drop, going from 0.504 at low effort to 0.479 at high effort.

Most models cluster well under a dollar per task, making deep research surprisingly affordable. The methodology and a Pareto analysis of accuracy vs. cost are at https://everyrow.io/docs/notebooks/deep-research-bench-pareto-analysis
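For readers unfamiliar with the Pareto framing: a configuration is "dominated" if some other configuration is at least as cheap and at least as accurate (and strictly better on one of the two). A minimal sketch of that filter, using only the two GPT-5 data points quoted above (the function and tuple layout are illustrative, not the benchmark's actual code):

```python
def pareto_frontier(points):
    """Return the configs not dominated by any other config.

    Each point is (cost_per_query_usd, accuracy, name). A point is
    dominated when another point has cost <= and accuracy >= it,
    with at least one of the two comparisons strict.
    """
    frontier = []
    for cost, acc, name in points:
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for c, a, _ in points
        )
        if not dominated:
            frontier.append((cost, acc, name))
    return sorted(frontier)

# The two GPT-5 effort settings reported in the post:
points = [
    (0.25, 0.496, "gpt-5-low-effort"),
    (0.39, 0.481, "gpt-5-high-effort"),
]

# High effort is both pricier and less accurate, so only low effort survives.
print(pareto_frontier(points))  # → [(0.25, 0.496, 'gpt-5-low-effort')]
```

This is the sense in which "higher effort scored worse" is a Pareto result: the high-effort setting is strictly dominated, so there is no cost/accuracy tradeoff to weigh.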

submitted by /u/ddp26