[D] Why evaluating only final outputs is misleading for local LLM agents
Been running local agents with Ollama + LangChain lately and noticed something kind of uncomfortable — you can get a completely correct final answer while the agent is doing absolute nonsense internally.
I’m talking about stuff like calling the wrong tool first and then “recovering,” using tools it didn’t need at all, looping a few times before converging, or even getting dangerously close to calling something it shouldn’t. And if you’re only checking the final output, all of that just… passes.
It made me realize that for agents, the output is almost the least interesting part. The process is where all the signal is.
Like imagine two agents both summarizing a document correctly. One does read → summarize in two clean steps. The other does read → search → read again → summarize → retry. Same result, but one is clearly way more efficient and way less risky. If you’re not looking at the trace, you’d treat them as equal.
So I started thinking about what actually matters to evaluate for local setups. Stuff like whether the agent picked the right tools, whether it avoided tools it shouldn’t touch, how many steps it took, whether it got stuck in loops, and whether the reasoning even makes sense. Basically judging how it got there, not just where it ended up.
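To make that concrete, here's a minimal sketch of what "judging the trace" could look like. The trace format, tool names, and metric names are all made up for illustration, not from the repo linked below:

```python
# Hypothetical sketch: score an agent's trace instead of just its final answer.
# A "trace" here is just the ordered list of tool names the agent called.

def score_trace(trace, expected_tools, forbidden_tools, max_steps):
    """Return per-metric results for one agent run."""
    used = set(trace)
    return {
        "used_expected": set(expected_tools) <= used,      # hit every required tool?
        "touched_forbidden": bool(used & set(forbidden_tools)),
        "step_count": len(trace),
        "within_budget": len(trace) <= max_steps,
    }

# Two runs that both produce a correct summary:
clean = ["read_file", "summarize"]
messy = ["read_file", "web_search", "read_file", "summarize", "summarize"]

print(score_trace(clean, ["read_file", "summarize"], ["shell_exec"], max_steps=3))
print(score_trace(messy, ["read_file", "summarize"], ["shell_exec"], max_steps=3))
```

Output-only eval would score both runs identically; trace scoring flags the second one for blowing its step budget, even though its answer is fine.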
I haven’t seen a lot of people talking about this on the local side specifically. Most eval setups I’ve come across still focus heavily on final answers, or assume you’re fine sending data to an external API for judging.
Curious how people here are handling this. Are you evaluating traces at all, or just outputs? And if you are, what kind of metrics are you using for things like loop detection or tool efficiency?
I actually ran into this enough that I hacked together a small local eval setup for it.
Nothing fancy, but it can:
– check tool usage (expected vs forbidden)
– penalize loops / extra steps
– run fully local (I’m using Ollama as the judge)
If anyone wants to poke at it:
https://github.com/Kareem-Rashed/rubric-eval
Would genuinely love ideas for better trace metrics
submitted by /u/MundaneAlternative47