Your AI Agent Got It Right. But Did It Reason Right?

Do You Trust Your AI Agent’s Reasoning?

How knowledge graphs catch flawed reasoning

Agentic AI evaluation has come a long way, and the next step is an exciting one: testing not just whether agents get the right answer, but whether they reasoned well to get there.

We are getting better at testing whether agents get the right answer. We are not yet good at testing whether they reasoned well to get there.

An agent can reach a correct conclusion through incomplete or flawed reasoning. It can skip critical considerations, ignore important trade-offs, and still land on the right output. Current testing frameworks (benchmarks, trajectory matching, LLM-as-judge) are evolving rapidly, but none yet addresses this directly.

In high-stakes domains, “right answer, incomplete reasoning” is a hidden risk. The agent that succeeded without thorough reasoning will eventually encounter a case where the shortcuts do not work.

I first hit this problem while building a hybrid rules-and-LLM engine for variant classification. It kept breaking on edge cases: missed evidence combinations, invalid paths, connections the LLM could not trace. That experience led me to write LLM Reasoning and Recall, and it convinced me that complex systems need strict safeguards and validation.

The Structure of Real Decisions

Here is something easy to miss: most real-world decisions have graph-like logic. Not simple yes/no checks, but conditional paths with sequences, required combinations, and exclusions.

This is not an edge case. It is what agents face every day:

  • Some conclusions require multiple pieces of evidence together (AND)
  • Some conclusions can be reached through alternative paths (OR)
  • Some paths are only valid when certain conditions are absent (NOT)
  • Some steps must come before others (sequences)
  • Sometimes you need to revisit earlier steps (loops)

Every domain where agents make decisions has this structure. Customer service, loan approval, legal research, medical diagnosis, code review — the logic is always more complex than a checklist.

The difference is whether the rules are written down.

Note: In practice, systems typically involve multiple agents, and the reasoning path would span them all. For simplicity, we will focus on validating a single agent’s reasoning here, but the same graph structure applies to the combined path across agents.

Where Current Approaches Leave a Gap

Existing evaluation methods are valuable, and they are improving. But they are optimized for different questions.

Benchmarks measure outcomes. Did the agent complete the task? This is necessary but not sufficient. A passing score does not tell you whether the agent followed valid reasoning paths to get there.

Trajectory evaluation checks steps. Did the agent take the expected actions? But this does not tell you whether it satisfied required evidence combinations or respected exclusions.

LLM-as-judge evaluates coherence. Is the reasoning internally consistent? But without a reference for what valid reasoning looks like in a given domain, a judge cannot identify what is missing or what invalid paths were taken.

These approaches complement each other. What is missing is a framework that asks: did the agent’s reasoning follow a valid path through the decision logic? And what happens when multiple valid paths intersect?

The Knowledge Graph Concept

A knowledge graph for validation encodes what valid reasoning looks like for a given decision type:

  • Required information — What must be considered
  • Valid sequences — What order steps can occur in
  • Required combinations (AND) — What must be true together
  • Valid alternatives (OR) — Different legitimate paths to the same conclusion
  • Exclusions (NOT) — Conditions that invalidate a path
  • Loops — Where iteration and re-evaluation are legitimate

The graph is built independently of the agent. It represents what a domain expert would expect from valid reasoning. It constitutes the space of all valid paths an agent can take.
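
To make this concrete, here is a minimal sketch of one way such a graph could be encoded. The names (Rule, requires_all, and so on) are my own illustrative choices, not an existing library or a production schema.

```python
# Illustrative sketch only: one way to encode "what valid reasoning looks like"
# as data. None of these names come from an existing library.
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One valid path to a conclusion: what must hold together, and in what order."""
    conclusion: str
    requires_all: frozenset = frozenset()  # AND: evidence that must appear together
    requires_any: frozenset = frozenset()  # OR: alternative paths, any one suffices
    excludes: frozenset = frozenset()      # NOT: conditions that invalidate the path
    sequence: tuple = ()                   # ordered steps, earlier must precede later


# The "graph" is simply the set of rules a domain expert accepts as valid reasoning.
VALID_REASONING = (
    Rule(
        conclusion="pathogenic",
        requires_all=frozenset({"near_oncogene", "in_hotspot"}),
        excludes=frozenset({"high_population_frequency"}),
        sequence=("variant_identified", "in_hotspot"),
    ),
)
```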

Why a Graph, Not a Checklist

A checklist says: “Did you consider these things?”

A knowledge graph says: “Did you reason through these things in valid combinations and sequences?”

The difference matters. A checklist is flat: items to tick off. A graph encodes structure.

Consider a simple example: “near a gene” might not be enough to support a conclusion. “In a hotspot region” might not be enough either. But “near a gene AND in a hotspot” together might be sufficient. The checklist approach (did you check gene proximity? did you check hotspot status?) misses that both are required together.

Or consider exclusions: a reasoning path might be valid only when certain conditions are absent. “This evidence supports the conclusion, but NOT when this other condition is present.”

Agents face this complexity constantly. The question is whether we are testing for it.
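
A tiny sketch of the difference, assuming evidence is recorded as booleans (the item names are hypothetical):

```python
# Hypothetical evidence the agent actually established about a variant.
checked = {"near_oncogene": True, "in_hotspot": False}

# Checklist view: "did you consider these things?" -> yes, both were considered.
checklist_pass = all(item in checked for item in ("near_oncogene", "in_hotspot"))

# Graph view: "are the required conditions true together?" -> the conjunction fails.
conjunction_pass = checked["near_oncogene"] and checked["in_hotspot"]

print(checklist_pass, conjunction_pass)  # True False
```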

A Genomics Use Case

The AMP/ASCO/CAP guidelines define how genetic variants should be classified. They are not simple rules — they are complex conditional logic with exactly the structure described above:

Conjunctions (AND): Classification requires evidence combinations: functional data AND population frequency AND clinical observations. No single evidence type is sufficient alone.

Disjunctions (OR): Tier I (strong clinical significance) requires FDA-approved therapies OR professional guideline consensus. Either path is valid.

Exclusions (NOT): Certain evidence only counts in specific contexts. Evidence from solid tumor cell lines might not apply when classifying variants for hematologic malignancies.

Sequences: You need to establish what variant you are examining before you can assess its population frequency. You need frequency data before you can weigh it against clinical observations.

Context dependencies: The same variant type can be pathogenic in one gene and tolerated in another. The reasoning path depends on context.

These guidelines are essentially a knowledge graph expressed in prose. They encode which combinations of evidence, in what relationships, support which classifications.
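
As a sketch of what “guidelines as a graph” could look like in code, here is one fragment of Tier I logic expressed as AND/OR/NOT sets. This illustrates the structure only; it is not a faithful or complete encoding of the guidelines, and the evidence labels are placeholders.

```python
# Illustrative only: one fragment of Tier I logic expressed as AND / OR / NOT sets.
TIER_I = {
    "requires_all": {"variant_identified", "clinical_context_established"},  # AND
    "requires_any": {"fda_approved_therapy", "guideline_consensus"},         # OR
    "excludes": {"evidence_from_unrelated_tumor_type"},                      # NOT
}


def supports_tier_i(evidence: set) -> bool:
    """Does the collected evidence satisfy this fragment of the Tier I logic?"""
    return (
        TIER_I["requires_all"] <= evidence            # every required item is present
        and bool(TIER_I["requires_any"] & evidence)   # at least one alternative is met
        and not (TIER_I["excludes"] & evidence)       # no invalidating condition applies
    )


print(supports_tier_i({"variant_identified", "clinical_context_established",
                       "guideline_consensus"}))  # True
```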

Real Agents in This Space

This matters because AI agents are being deployed in genomics right now.

GeneAgent (published in Nature Methods) analyzes gene sets by querying biological databases. Stanford’s Biomni handles variant prioritization and rare disease diagnosis. GPT-4 with retrieval-augmented generation is being used to interpret millions of variant annotations.

These agents make classification decisions. The question is: are they following valid reasoning paths?

An agent might classify a variant as pathogenic. But did it:

  • Check the required evidence combinations?
  • Follow valid sequences?
  • Respect the exclusions?
  • Take a path that actually exists in the guideline logic?

Or did it skip steps, ignore required conjunctions, and arrive at the right answer through an invalid path?

Current evaluation cannot tell you. A knowledge graph can.

Example: Valid and Invalid Paths

Valid path 1: Variant identified → Check population frequency (absent from databases) → Check functional evidence (strong impact demonstrated) → Check clinical observations (seen in affected patients, absent in controls) → Classification supported

Valid path 2: Variant identified → Check existing expert classifications (ClinVar pathogenic, multiple submitters, review status high) → Verify underlying evidence quality → Classification supported

Invalid path: Variant identified → Run computational prediction → Prediction says damaging → Classify as pathogenic

The third path is invalid because computational predictions alone do not support pathogenic classification under the guidelines. An agent taking this path might get the right answer by luck, but its reasoning does not follow a valid path through the decision logic.

Conjunction example: “Near oncogene” alone is insufficient. “In hotspot” alone is insufficient. Both together support pathogenicity. An agent that checks both but misses that they are required together has ticked the checklist but missed the logic.

Exclusion example: Functional evidence supports pathogenicity, but NOT when the assay used an unrelated cell type. An agent that cites evidence without checking exclusions has followed an invalid path.
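
A minimal path-checking sketch ties these examples together. The step names and the two valid sequences mirror the paths above, and the subsequence check is one simple way to test ordering, assuming the agent’s trace arrives as a list of step names.

```python
# Illustrative step names; each inner list is one valid sequence from the graph.
VALID_SEQUENCES = [
    ["variant_identified", "population_frequency", "functional_evidence",
     "clinical_observations", "classification"],
    ["variant_identified", "expert_classifications", "evidence_quality_review",
     "classification"],
]


def follows_valid_path(trace: list) -> bool:
    """True if the trace contains some valid sequence in order (extra steps allowed)."""
    def in_order(trace, path):
        it = iter(trace)
        return all(step in it for step in path)  # membership test consumes the iterator
    return any(in_order(trace, path) for path in VALID_SEQUENCES)


# The invalid path from above: computational prediction alone, then classify.
print(follows_valid_path(
    ["variant_identified", "computational_prediction", "classification"]))  # False
```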

Building the Graph

Building knowledge graphs is labor intensive.

You do not auto-generate integration tests for critical systems. Engineers who understand the domain write them to reflect correct behavior. Knowledge graphs work the same way: domain experts define what valid reasoning requires. (Some of this can be automated, but the core rules still need expert curation.)

In regulated domains like clinical genomics, much of this work is already done. The AMP/ASCO/CAP guidelines, ClinGen specifications, and ACMG criteria are detailed reasoning frameworks. The task is encoding them in a structure you can validate agents against.

For domains without formalized guidelines, the work is harder but the logic still exists. Experts know what valid reasoning looks like. The knowledge graph makes it explicit and testable.

How Validation Works

  1. Encode the domain’s reasoning rules as a knowledge graph
  2. Run the agent on test cases
  3. Examine what knowledge the agent used, what sequence it followed, what conclusion it reached
  4. Check whether the agent’s reasoning corresponds to a valid path in the graph
  5. Score on path validity, conjunction satisfaction, and exclusion compliance
  6. Flag cases where conclusions were reached through invalid paths

The output is not pass/fail. It is a map showing where the agent’s reasoning aligned with valid paths and where it diverged. Think: fully graph-valid, graph-aligned but incomplete, or off-graph.
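
Here is a sketch of how steps 4 and 5 could map onto that three-way outcome, assuming path validity has already been computed and reusing the rule shape from the earlier sketches:

```python
def score_reasoning(evidence: set, path_ok: bool, rule: dict) -> str:
    """Combine path validity, conjunction satisfaction, and exclusion compliance."""
    conjunctions_ok = rule.get("requires_all", set()) <= evidence
    exclusions_ok = not (rule.get("excludes", set()) & evidence)
    if path_ok and conjunctions_ok and exclusions_ok:
        return "fully graph-valid"
    if path_ok or conjunctions_ok:
        return "graph-aligned but incomplete"
    return "off-graph"


# Hypothetical rule and evidence: the path was valid but a required conjunction failed.
rule = {"requires_all": {"near_oncogene", "in_hotspot"},
        "excludes": {"high_population_frequency"}}
print(score_reasoning({"near_oncogene"}, path_ok=True, rule=rule))
# -> graph-aligned but incomplete
```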

How to Start

Begin small and concrete. Choose one tightly scoped decision (for example, BRCA1 missense classification), encode a compact knowledge graph of 6–12 nodes capturing the key evidence types, and curate about a dozen representative test cases. Instrument the agent to emit structured traces (evidence items, retrieval citations, step sequence), run the pilot, and triage divergences through human review. Publish the graph, test cases, and evaluation results to invite community critique and iterative improvement.
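
One possible shape for those structured traces, sketched as a small dataclass. The field names and example values are assumptions, not a standard schema:

```python
# Assumed trace format for illustration; not a standard schema.
from dataclasses import dataclass, asdict
import json


@dataclass
class TraceStep:
    step: str        # e.g. "check_population_frequency"
    evidence: list   # evidence items the step consulted
    citations: list  # retrieval citations (database records, URLs)


trace = [
    TraceStep("variant_identified", ["BRCA1 missense (hypothetical)"], ["input VCF"]),
    TraceStep("check_population_frequency",
              ["absent from population databases"], ["gnomAD"]),
]
print(json.dumps([asdict(s) for s in trace], indent=2))
```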

Open Questions

Granularity. How detailed should the graph be? Too coarse and it misses important distinctions. Too fine and it becomes brittle.

Novel paths. What if an agent takes a valid path the graph did not anticipate? Domain knowledge evolves. The framework needs room for expert review.

Maintenance. Guidelines change. Graphs need versioning and periodic review.

Related Work

Similar thinking is emerging. Salesforce’s Agentforce uses a “Graph Runtime” to test whether agents respect business constraints. Harvard’s Zitnik Lab developed KGARevion, a knowledge graph-based agent for medical QA that validates reasoning against structured knowledge. Recent work on self-correcting Agentic Graph RAG in clinical hepatology demonstrates validate-execute-evaluate loops for reasoning paths. Paths-over-Graph shows how knowledge graph paths can improve LLM reasoning faithfulness by 18.9% through structured path validation.

These approaches share the insight that structured references help validate agents. But they focus on different questions: Did the agent violate a constraint? Did it contradict known facts?

The knowledge graph concept addresses something complementary: did the agent follow valid reasoning paths? Did it satisfy required evidence combinations? That is the gap between getting the right answer and actually reasoning correctly.

Why This Matters

Every domain where agents make decisions has graph-like reasoning logic: sequences, required combinations, exclusions, alternative valid paths. The question is whether we are testing for it.

Right now, we mostly test outcomes. Did the agent get the right answer? That is necessary but not sufficient.

Knowledge graphs let us test reasoning. Did the agent follow a valid path? Did it satisfy the required conjunctions? Did it respect the exclusions?

An agent that gets the right answer through invalid reasoning is a liability. It will fail when the shortcuts stop working. Testing valid paths, not just correct outputs, is how we catch that before it matters.

Dami is a Software Engineer and MS Data Science candidate at Boston University. She has many years of experience working in the bio-pharmaceutical industry.

If you are working on related problems in agentic AI validation, or AI-driven scientific discovery, I would love to connect.

SAMPLE CODE

https://medium.com/media/26c013bf7bfed0c7d346ca117d530eff/href


