Do Large Language Model Benchmarks Test Reliability?

Large language models (LLMs) have shown remarkable capabilities in areas like problem-solving, knowledge retrieval, and code generation. Yet these models still sometimes fail on surprisingly simple tasks. Two examples that recently went viral involved models such as ChatGPT and Claude failing on the questions “how many r’s are in the word strawberry?” and “which is greater, 9.11 or 9.9?” These examples might seem amusing but inconsequential. However, in safety-critical contexts such as […]
