[R] IDP Leaderboard: Open benchmark for document AI across 16 VLMs, 9,000+ documents, 3 benchmark suites

We’re releasing the IDP Leaderboard, an open evaluation framework for document understanding tasks. We tested 16 models across three suites: OlmOCR, OmniDoc, and our own IDP Core benchmark (covering key information extraction (KIE), table extraction, VQA, OCR, classification, and long-document processing).

Key results:

– Gemini 3.1 Pro leads overall (83.2), but the margin is tight: the top 5 models are within 2.4 points of each other.

– Cheaper model variants (Flash, Sonnet) produce nearly identical extraction quality to flagship models. Differentiation appears only on reasoning-heavy tasks like VQA.

– GPT-5.4 shows a significant jump over GPT-4.1 (70 to 81 overall, 42% to 91% on DocVQA).

– Sparse, unstructured tables remain the hardest task. Most models score below 55%.

– Handwriting OCR tops out at 76%.

We also built a Results Explorer that shows the ground truth alongside every model’s raw prediction for every document, not just aggregate scores. This lets you decide which model works for you by actually inspecting the predictions against the ground truth.

Findings: https://nanonets.com/blog/idp-leaderboard-1-5/

Datasets: huggingface.co/collections/nanonets/idp-leaderboard

Leaderboard + Results Explorer: idp-leaderboard.org

submitted by /u/shhdwi
