Tracing an AI Agent’s Reasoning: Building Observability Into Your Pipeline
A user filed a support ticket on a Wednesday afternoon. Our customer-facing agent had recommended the wrong subscription tier: confidently, politely, and with a detailed breakdown of why it was the right choice. The user had followed the recommendation and was now on a plan that cost more and did less than what they had before.
I pulled up the logs. The agent had completed successfully. Status 200. No errors. No retries. Latency looked normal. Token count was typical.
From every external signal, the agent had done its job.
I had no idea what it had actually done inside.
That was the moment I understood what agent observability actually means. It is not knowing whether the agent ran. It is being able to reconstruct, step by step, exactly what it reasoned, what it retrieved, what it chose, and why, so that when something like this happens you are doing analysis, not archaeology.
Why Your Existing Logs Are Not Enough
Traditional software logs are built around events: something happened, here is what it was, here is when. That model works because the path through deterministic code is either the one you expected or an error.
Agents do not fail with errors. They fail with confident wrong answers.
The failure is not in the final output. It is in a context corruption that happened three steps ago and silently propagated forward through every subsequent decision. Your logs show the final output. They do not show the propagation.
Without observability, agent failures stay anecdotal (“it sometimes gives weird answers”) instead of becoming something you can investigate systematically. In early 2026, observability tooling for AI agents has matured significantly, and teams that invest in it ship better agents faster.
The specific things you cannot see without structured tracing:
- Which retrieval step returned irrelevant context that poisoned the reasoning downstream.
- Which tool call received a subtly wrong parameter and returned a result the agent misinterpreted.
- Which branching decision at step four set the agent down a path that made step eleven impossible to complete correctly.
- What the agent’s stated reasoning was at each decision point, not just its final output.
A hallucinating agent might pass an invalid date format or a nonexistent ID to a tool. If you are not observing the tool call itself, you might see a 500 Internal Server Error and blame your database, when the real culprit was the agent’s faulty reasoning.
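One way to observe the tool call itself is to wrap each tool so its arguments are validated and logged before the backend is touched. The sketch below is illustrative only: `observed_tool`, `ToolInputError`, `validate_usage_args`, and `fetch_usage` are hypothetical names, and the validation rules (a `usr_` prefix, an ISO date) are assumptions rather than anything from a specific library.

```python
import functools
import json
import logging
from datetime import date

logger = logging.getLogger("agent.tools")


class ToolInputError(ValueError):
    """Raised when the agent passes arguments the tool cannot accept."""


def observed_tool(validate):
    """Hypothetical decorator: validate and log tool inputs before the call runs."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            record = {"tool": fn.__name__, "input": kwargs}
            try:
                validate(**kwargs)
            except ValueError as exc:
                # Bad arguments came from the agent, so record that
                # instead of letting the backend fail with a 500.
                record.update(status="invalid_input", error=str(exc))
                logger.warning(json.dumps(record, default=str))
                raise ToolInputError(str(exc)) from exc
            result = fn(**kwargs)
            record.update(status="success", output=result)
            logger.info(json.dumps(record, default=str))
            return result
        return wrapper
    return decorator


def validate_usage_args(user_id: str, since: str) -> None:
    # Assumed conventions: IDs start with "usr_", dates are ISO formatted.
    if not user_id.startswith("usr_"):
        raise ValueError(f"malformed user_id: {user_id!r}")
    date.fromisoformat(since)  # raises ValueError on a non-ISO date


@observed_tool(validate_usage_args)
def fetch_usage(user_id: str, since: str) -> dict:
    # Stand-in for a real database call.
    return {"user_id": user_id, "since": since, "usage_gb": 4.2}
```

With a wrapper like this, a call such as `fetch_usage(user_id="usr_88271", since="06/10/2026")` fails with a logged `invalid_input` record that points at the agent's arguments rather than at the database.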
The Structured Log Every Agent Step Should Emit
The first thing to build, before any integration with any platform, is a standard log schema that every step in your pipeline emits. This is the foundation everything else sits on.
Here is the schema we settled on after about three iterations:
json
{
  "timestamp": "2026-06-10T14:22:11.334Z",
  "trace_id": "trace_7f3b9a",
  "span_id": "span_002",
  "parent_span_id": "span_001",
  "agent_id": "subscription-advisor-v2",
  "session_id": "sess_a4c1e8",
  "step_type": "tool_call",
  "step_name": "fetch_user_subscription_history",
  "input": {
    "user_id": "usr_88271",
    "lookback_days": 90
  },
  "output": {
    "plans": ["starter", "starter", "starter"],
    "churn_events": 0,
    "avg_monthly_usage_gb": 4.2
  },
  "reasoning": "User has been on Starter for 90 days with no churn. Usage is 4.2GB average. Threshold for Pro recommendation is 10GB. Recommending Starter continuation.",
  "model": "claude-haiku-4-5",
  "tokens": {
    "prompt": 843,
    "completion": 127,
    "cost_usd": 0.0003
  },
  "duration_ms": 412,
  "status": "success",
  "metadata": {
    "retry_count": 0,
    "confidence": 0.91,
    "fallback_used": false
  }
}
The field that most teams skip is reasoning. That is the agent’s stated rationale for the decision it made at this step, logged alongside the tool call itself. It is the only way to distinguish “the model chose wrong” from “the model was given wrong inputs.” Without it, you have a record of what happened: a beautiful trace of a bad decision with no way to know if it was model error or prompt error. With it, you have a record of why.
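A small helper can keep every step emitting the same shape. This is a minimal sketch using only the standard library; `build_step_record` and `emit_step` are hypothetical names, the span ID format is an assumption, and only a subset of the schema's fields is shown.

```python
import json
import logging
import uuid
from datetime import datetime, timezone

logger = logging.getLogger("agent.steps")


def build_step_record(trace_id, step_type, step_name, step_input, step_output,
                      reasoning, parent_span_id=None, status="success",
                      metadata=None):
    """Assemble one step record using field names from the schema."""
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    return {
        "timestamp": now.replace("+00:00", "Z"),
        "trace_id": trace_id,
        "span_id": f"span_{uuid.uuid4().hex[:8]}",  # assumed ID format
        "parent_span_id": parent_span_id,
        "step_type": step_type,
        "step_name": step_name,
        "input": step_input,
        "output": step_output,
        "reasoning": reasoning,
        "status": status,
        "metadata": metadata or {},
    }


def emit_step(**fields):
    """Log the record as a single JSON line and return it for chaining."""
    record = build_step_record(**fields)
    logger.info(json.dumps(record, default=str))
    return record
```

Returning the record lets the caller pass its `span_id` as the `parent_span_id` of whatever step runs next.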
The trace_id, span_id, and parent_span_id fields are how you reconstruct the full decision tree later. Every step in a run shares the same trace_id. The parent_span_id points to the step that triggered this one. Pull all spans for a given trace_id and you can rebuild the full execution graph, step by step, in order.
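The reconstruction described here can be sketched in a few lines. The function names are hypothetical, and the sketch assumes spans are plain dicts shaped like the schema, with ISO 8601 timestamps in a single consistent format so that string order matches time order.

```python
from collections import defaultdict


def build_trace_tree(spans, trace_id):
    """Return (roots, children) for one trace, ordered by timestamp."""
    in_trace = sorted(
        (s for s in spans if s["trace_id"] == trace_id),
        key=lambda s: s["timestamp"],
    )
    known_ids = {s["span_id"] for s in in_trace}
    children = defaultdict(list)
    roots = []
    for span in in_trace:
        parent = span.get("parent_span_id")
        if parent in known_ids:
            children[parent].append(span)
        else:
            # No parent, or the parent was never logged: treat as a root.
            roots.append(span)
    return roots, children


def render(roots, children, depth=0):
    """Flatten the tree into indented step_name lines for quick reading."""
    lines = []
    for span in roots:
        lines.append("  " * depth + span["step_name"])
        lines.extend(render(children[span["span_id"]], children, depth + 1))
    return lines
```

Treating orphaned spans as roots is a deliberate choice in this sketch: a span whose parent failed to log still shows up in the output instead of silently disappearing.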
LangSmith Integration: Three Lines to Full Visibility
If you are already using LangChain or LangGraph, full tracing is closer than you think. The minimal setup:
python
import os
from langchain_anthropic import ChatAnthropic
from langsmith import traceable
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-key"
os.environ["LANGCHAIN_PROJECT"] = "subscription-advisor-prod"
llm = ChatAnthropic(model="claude-sonnet-4-6")
That is it for LangChain-native code. Every LLM call, tool invocation, and chain step gets traced automatically and shows up in LangSmith with full input/output visibility.
For custom code that is not LangChain-native, the @traceable decorator wraps any function:
python
from statistics import mean

from langsmith import traceable

@traceable(name="fetch_subscription_history", run_type="tool")
def fetch_subscription_history(user_id: str, lookback_days: int = 90):
    # Your actual database call goes here. db and days_ago are placeholders
    # for your own database handle and date helper.
    result = db.query(
        "SELECT plan, usage_gb FROM subscriptions WHERE user_id = ? AND date > ?",
        [user_id, days_ago(lookback_days)],
    )
    return {
        "plans": [r["plan"] for r in result],
        "avg_monthly_usage_gb": mean(r["usage_gb"] for r in result),
    }
LangSmith captures the function name, inputs, outputs, and timing automatically. You can add metadata for anything you want to query later:
python
from langsmith import get_current_run_tree, traceable

@traceable(name="recommend_subscription_tier", run_type="chain")
def recommend_subscription_tier(user_id: str):
    run = get_current_run_tree()
    history = fetch_subscription_history(user_id)
    # build_recommendation_prompt and llm are defined elsewhere in your pipeline.
    reasoning_step = build_recommendation_prompt(history)
    recommendation = llm.invoke(reasoning_step)
    # Attach metadata you want to filter on later.
    if run:
        run.metadata["user_id"] = user_id
        run.metadata["avg_usage_gb"] = history["avg_monthly_usage_gb"]
        run.metadata["recommended_tier"] = recommendation.content
    return recommendation
LangSmith now aggregates costs across the full agent workflow, not just per LLM call. This includes retrieval, tool execution, and downstream API spend. Online evaluations let production traffic be scored in real time with LLM-as-judge evaluators, heuristic checks, and human annotation queues, surfacing quality drift before users complain.
Adding Decision Traces at Tool Selection Boundaries
The most valuable thing you can log is not what the agent did. It is what it considered and then did not do.
Tool selection is where agents most commonly go wrong. The model chooses to call fetch_pricing when it should have called fetch_user_history first. Or it calls both but uses the pricing data before the history data is loaded. Or it calls the right tool with a parameter that seems right but is subtly wrong given the context.
Here is how to log the decision at each tool selection point:
python
import json
import logging

from langsmith import traceable

logger = logging.getLogger("agent.decisions")

@traceable(name="tool_selection_step", run_type="chain")
def tool_selection_step(available_tools: list, context: dict, reasoning_model):
    """
    Explicit tool selection step that logs the decision before executing it.
    """
    selection_prompt = f"""
    Available tools: {[t['name'] for t in available_tools]}
    Current context: {json.dumps(context, indent=2)}
    Which tool should be called next and why?
    Respond in JSON: {{"tool": "tool_name", "reason": "one sentence", "parameters": {{}}}}
    """
    decision = reasoning_model.invoke(selection_prompt)
    # Raises json.JSONDecodeError if the model does not return valid JSON.
    parsed = json.loads(decision.content)
    # Log the decision before we execute it.
    logger.info(json.dumps({
        "event": "tool_selected",
        "trace_id": context.get("trace_id"),
        "tool_chosen": parsed["tool"],
        "reason": parsed["reason"],
        "parameters": parsed["parameters"],
        "tools_not_chosen": [t["name"] for t in available_tools if t["name"] != parsed["tool"]],
    }))
    return parsed
The tools_not_chosen field is the one that earns its place in a post-incident review. Suppose you are trying to understand why the agent recommended the wrong tier. The log shows that fetch_user_history was available and the agent chose not to call it, because the context made it look like history was already loaded when it was not. Seeing that is the difference between a five-minute diagnosis and a three-hour one.
Reconstructing a Full Decision Trace After an Incident
Back to the Wednesday support ticket.
After we built the tracing layer, we went back and tried to reproduce what the agent had done in that session. With LangSmith, pulling the full trace for a session is a URL:
https://smith.langchain.com/o/your-org/projects/subscription-advisor-prod/runs?filter=session_id:sess_a4c1e8
What we found: the agent had called fetch_user_subscription_history correctly. The output was accurate: the user had been on Starter for 90 days, with average usage of 4.2GB. Then it called fetch_available_plans to get current pricing.
The fetch_available_plans tool had a bug we had not caught. It was returning plans sorted by revenue, not by fit. The first plan in the response was Pro. The agent had been prompted to “consider the plans in order and recommend the best fit.” It considered Pro first, found reasons it could justify, and stopped there. It never got to the analysis that would have shown Starter was the right answer.
The reasoning log showed this explicitly:
json
{
  "step_name": "evaluate_plan_fit",
  "reasoning": "Pro plan returned as first option. User usage is 4.2GB. Pro includes 50GB. Headroom exists. Recommending Pro for growth potential.",
  "tool_chosen": "none: sufficient context from available_plans response",
  "tools_not_chosen": ["fetch_user_churn_risk", "fetch_usage_trend"]
}
The agent had enough context to make a decision, so it stopped gathering context. The decision was wrong because the context was sorted wrong. The trace showed exactly that. The fix was a one-line change to the fetch_available_plans tool: sort by relevance score, not revenue.
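To make the shape of that fix concrete, here is a hedged sketch of ordering plans by fit instead of revenue. The scoring rule, the `included_gb` field, and both function names are assumptions for illustration; the real relevance score is not shown in this post.

```python
def fit_score(plan: dict, avg_usage_gb: float) -> float:
    """Hypothetical relevance score: higher means a closer fit to actual usage."""
    if plan["included_gb"] < avg_usage_gb:
        return 0.0  # plan cannot cover current usage at all
    # Reward plans whose quota is close to what the user actually consumes.
    return avg_usage_gb / plan["included_gb"]


def sort_plans_by_fit(plans: list, avg_usage_gb: float) -> list:
    """Order plans so the best-fitting one is considered first."""
    return sorted(plans, key=lambda p: fit_score(p, avg_usage_gb), reverse=True)
```

Because the agent was prompted to consider plans in order, whatever key the tool sorts by effectively becomes part of the prompt.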
Without the trace, that diagnosis took about three hours. With tracing in place, the next similar incident took twelve minutes from ticket to root cause.
The Three Things Silent Failures Look Like in Traces
Once you have tracing in place, you start recognizing patterns. These are the three that show up most often:
Context poisoning. A retrieval step returns something subtly wrong: the right document but an outdated version, or the right record but for the wrong user ID. The agent uses it, reasons correctly from it, and produces a wrong answer. The trace shows the retrieval output directly, so you can see the bad data the moment you look.
Tool boundary failures. Most agent failures in production are not LLM reasoning failures. They are tool boundary failures. The agent calls the right tool with the wrong parameters, or interprets the return value incorrectly. Without decision traces, you see the wrong final answer but cannot tell where the chain went sideways.
Premature confidence. The agent accumulates just enough context to make a decision and stops gathering. The decision is technically supportable from available context. It is wrong because the context is incomplete. The tools_not_chosen log is the only way to catch this: you need to see what was available and not called to understand why the agent stopped where it did.
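The premature-confidence pattern is the easiest of the three to scan for mechanically. The sketch below assumes decision records for a single trace that carry step_name, tool_chosen, and tools_not_chosen, and that a tool_chosen value starting with "none" means the agent decided it had enough context. `find_premature_stops` is a hypothetical helper, not part of any library.

```python
def find_premature_stops(decisions: list) -> list:
    """Flag decision records where the agent stopped with tools still uncalled."""
    # Tools that were actually called somewhere in this trace.
    called = {
        d["tool_chosen"] for d in decisions
        if not d["tool_chosen"].startswith("none")
    }
    flagged = []
    for d in decisions:
        if not d["tool_chosen"].startswith("none"):
            continue  # the agent kept gathering context at this step
        never_called = [t for t in d.get("tools_not_chosen", []) if t not in called]
        if never_called:
            flagged.append({"step_name": d["step_name"], "never_called": never_called})
    return flagged
```

A flagged step is not automatically a bug, since stopping early is often correct. It is a shortlist of places to look first when an answer turns out to be wrong.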
Monitoring in Production: What to Alert On
Tracing gives you the data. Alerting tells you when something is wrong before users do.
The metrics that actually predict problems, not just reflect them:
python
from datetime import datetime, timedelta, timezone

from langsmith import Client

client = Client()

def check_agent_health(project_name: str, window_minutes: int = 60) -> dict:
    start = datetime.now(timezone.utc) - timedelta(minutes=window_minutes)
    # Root runs only: one per agent invocation.
    root_runs = client.list_runs(
        project_name=project_name,
        start_time=start,
        is_root=True,
    )
    total = errors = fallbacks = high_retry = 0
    for run in root_runs:
        total += 1
        # Assumes retry_count and fallback_used are attached as run metadata.
        metadata = (run.extra or {}).get("metadata") or {}
        if run.error:
            errors += 1
        if metadata.get("fallback_used"):
            fallbacks += 1
        if metadata.get("retry_count", 0) > 1:
            high_retry += 1
    # Tool runs in the same window, for the per-run average.
    tool_calls = sum(
        1 for _ in client.list_runs(
            project_name=project_name, start_time=start, run_type="tool"
        )
    )
    denominator = max(total, 1)  # avoid dividing by zero on an empty window
    return {
        "total_runs": total,
        "error_rate": errors / denominator,
        "avg_tool_calls_per_run": tool_calls / denominator,
        "fallback_rate": fallbacks / denominator,
        "high_retry_runs": high_retry / denominator,
    }
The metric that matters most and gets ignored the most: fallback rate. When your agent cannot complete a step with its primary model and falls back to a retry or a stronger model, that is a signal that something upstream is under-specified or that the task complexity has shifted. A fallback rate that climbs from 8% to 22% over a week is not noise. It is the agent telling you something changed.
Many changes that affect agents never show up as a formal deploy: someone tweaks a prompt, adds a new tool, changes a data source, or switches to a new model version. These untracked changes can degrade quality without raising error rates. Fallback rate and retry rate catch these silent shifts before they surface in user complaints.
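A week-over-week comparison of those rates can be sketched as a small pure function. The thresholds are illustrative assumptions, not recommendations, and `rate_shift_alerts` is a hypothetical name. It only flags increases, since a falling fallback rate is rarely the problem.

```python
def rate_shift_alerts(previous: dict, current: dict,
                      min_abs_change: float = 0.05,
                      min_rel_change: float = 0.5) -> list:
    """Return names of rate metrics that rose significantly between two windows."""
    alerts = []
    for name, now in current.items():
        before = previous.get(name)
        if before is None:
            continue  # no baseline to compare against
        delta = now - before
        relative = delta / before if before > 0 else float("inf")
        # Require both an absolute and a relative jump so that tiny
        # rates moving by noise do not trigger an alert.
        if delta >= min_abs_change and relative >= min_rel_change:
            alerts.append(name)
    return alerts
```

Feeding it the 8% to 22% fallback example flags fallback_rate, while a retry rate drifting from 10% to 11% stays quiet.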
The Minimum Viable Tracing Setup
If you ship nothing else from this post, ship these four things in order:
Day one. Structured JSON logs at every agent step with trace_id, span_id, step_type, input, output, reasoning, and status. Store them somewhere you can query, even if that somewhere is CloudWatch or a Postgres table.
Day two. LangSmith integration with LANGCHAIN_TRACING_V2=true if you are on LangChain, or @traceable decorators if you are not. The visual trace debugger alone pays for the setup time the first time something goes wrong.
Week one. Decision logging at tool selection boundaries. Specifically: which tool was chosen, why, and which tools were available but not chosen.
Week two. Alerting on fallback rate, retry rate, and average tool calls per run. If any of those move significantly week over week, something changed that you need to look at.
The full observability stack that teams with mature agent pipelines run takes months to build. The four things above take a few days and catch 80% of the production failures you will actually encounter.
The Thing You Cannot Reconstruct Without Traces
Traces, not code, provide the only record of what your agent did and why.
The support ticket on Wednesday afternoon cost us about three hours of engineering time to diagnose without traces. After we built the tracing layer, the next similar incident took twelve minutes. The cost difference across a year of incidents is not small.
But the more important thing is what traces change about how you think about your agent. Without them, you are guessing. You run the agent again, tweak the prompt, run it again, look at the output, try to infer what happened in between. You are doing archaeology.
With them, you are doing engineering. You have a record. You can point to the exact step where the reasoning went wrong, the exact input that poisoned it, the exact tool call that was never made. You can fix the specific thing that broke, confirm the fix with a replay against the same trace, and know that you are done.
The agent you built is non-deterministic. Your ability to understand what it did should not be.