Tool Calling Is Not an API Call: What Engineers Keep Getting Wrong

digitado ⋅ June 24, 2026

Every team that builds an LLM agent eventually hits the same wall. The model calls a tool. Something breaks. Nobody knows why.

I’ve been building tool-driven agent systems at MasTec for a while now, orchestrating enterprise APIs, operational databases, and internal services through LLM agents in production environments. And the pattern I keep seeing is the same: engineers treat tool calling as if they’re writing a REST client. Clean schema, right endpoint, valid payload, ship it.

That mental model works for about five minutes in production. Then reality shows up.

Tool calling in an agentic system is a fundamentally different contract than an API call. The caller isn’t deterministic. It doesn’t guarantee argument structure. It doesn’t always know when not to call. And it doesn’t recover gracefully when the tool returns something unexpected. Understanding that gap between how engineers expect tool calling to work and how it actually behaves under real load is what separates agents that hold up from agents that quietly corrupt your workflows.

Here are the five mistakes I’ve seen repeatedly. All of them are fixable. None of them show up in the tutorials.

Mistake 1: Writing Tool Schemas for Humans, Not for the Model

When engineers design a tool schema, they tend to write it the same way they’d write API documentation: clear enough for a developer to understand. That’s the wrong audience.

The model reads your schema at inference time and decides how to call the tool based entirely on what you wrote. If your description is ambiguous, the model fills the gap with a guess. If two of your tools have overlapping purposes, the model arbitrarily picks one. If your parameter names are terse and unexplained, the model infers meaning and is often wrong.

I’ve watched agents call a get_record tool when they should have called search_records because both descriptions mentioned “retrieving data.” The fix wasn’t changing the routing logic; it was rewriting the schema descriptions to make the behavioral boundary explicit.

A good tool schema description should answer three questions unambiguously: what this tool does, what it explicitly does not do, and under what conditions it should be called. Write it like you’re training a junior engineer who has never seen your codebase.

python

# Weak schema description: the model has to guess when to use it
weak_schema = {
    "name": "get_customer",
    "description": "Gets customer data",
}

# Production-grade schema description: states purpose, limits, and conditions
production_schema = {
    "name": "get_customer_by_id",
    "description": (
        "Retrieves a single customer record using an exact customer ID. "
        "Use this ONLY when you have a confirmed customer_id. "
        "Do NOT use this for name-based lookups or search - "
        "use search_customers instead."
    ),
    "parameters": {
        "customer_id": {
            "type": "string",
            "description": "The exact customer ID. Format: 'cust_XXXXXXXXXX'",
        }
    },
}

The investment in schema clarity pays back every time the model routes correctly without needing a retry.
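One way to catch overlapping descriptions before the model does is a small lint pass over the registered schemas. The sketch below is an illustration under loud assumptions, not part of any framework: `content_words`, `description_overlap`, `find_ambiguous_pairs`, the stop-word list, and the 0.5 threshold are all hypothetical choices, and word overlap is only a crude proxy for how a model reads a description.

```python
import re
from itertools import combinations

# Hypothetical stop-word list; tune it for your own schemas.
STOP_WORDS = {"a", "an", "the", "to", "of", "for", "and", "or", "this", "by", "from", "use"}


def content_words(description: str) -> set[str]:
    """Lowercased words in a description, minus stop words."""
    return set(re.findall(r"[a-z_]+", description.lower())) - STOP_WORDS


def description_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the content words of two descriptions."""
    words_a, words_b = content_words(a), content_words(b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def find_ambiguous_pairs(schemas: list[dict], threshold: float = 0.5) -> list[tuple[str, str]]:
    """Return pairs of tool names whose descriptions overlap at or above the threshold."""
    return [
        (s1["name"], s2["name"])
        for s1, s2 in combinations(schemas, 2)
        if description_overlap(s1["description"], s2["description"]) >= threshold
    ]


# Descriptions that both just mention "retrieving data" get flagged.
vague = [
    {"name": "get_record", "description": "Retrieving data for a record"},
    {"name": "search_records", "description": "Retrieving data for records"},
]
# Descriptions that state the behavioral boundary do not.
explicit = [
    {"name": "get_record", "description": "Fetch exactly one record when the record ID is already known."},
    {"name": "search_records", "description": "Find candidate records matching free-text filters; returns a list."},
]

print(find_ambiguous_pairs(vague))     # [('get_record', 'search_records')]
print(find_ambiguous_pairs(explicit))  # []
```

A check like this will not prove a schema is clear, but it can flag the obvious collisions in CI before they show up as misrouted calls.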

Mistake 2: No Validation Before the Tool Executes

The model sends a tool call. Your code receives it. What happens next?

In most early implementations I’ve reviewed, the arguments get passed directly to the underlying function. No validation. No type checking. No boundary checks. The assumption is that the model populated the arguments correctly.

That assumption is wrong often enough to matter.

LLMs hallucinate tool arguments. Not dramatically (not {"customer_id": "I made this up"}), but subtly. A string field gets an integer. A required parameter comes through as null. An enum field receives a value that isn’t in the allowed set. These failures don’t throw loud errors. They propagate silently into your database, your downstream services, and your audit logs.

The fix is schema validation at the MCP layer or at the function boundary before anything touches your actual systems. I enforce this with Pydantic on every tool handler we run in production:

python

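from pydantic import BaseModel, field_validator  # Pydantic v2; on v1, use `validator`


class GetCustomerInput(BaseModel):
    customer_id: str

    @field_validator("customer_id")
    @classmethod
    def must_be_valid_format(cls, v: str) -> str:
        if not v.startswith("cust_"):
            raise ValueError(f"Invalid customer_id format: {v}")
        return v


# `tool` is your agent framework's registration decorator; `db` is your data layer.
@tool
def get_customer_by_id(raw_input: dict) -> dict:
    """Fetch one customer record by exact customer_id."""
    validated = GetCustomerInput(**raw_input)  # raises before any DB call
    return db.fetch_customer(validated.customer_id)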

Schema validation at the tool boundary is one of the highest-ROI reliability patterns in agent systems. It costs almost nothing to implement, and it catches a significant percentage of hallucinated arguments before they touch anything real.
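The same boundary check can be sketched with only the standard library, which makes the three subtle failure modes described earlier (wrong type, null required value, out-of-set enum) concrete. The `validate_args` helper, the `SEARCH_ORDERS_SPEC` field spec, and the `search_orders` field names are hypothetical; in production, a validation library such as Pydantic covers the same ground with less code.

```python
# Hypothetical field spec for a `search_orders` tool:
# field name -> (expected type, required?, allowed values or None)
SEARCH_ORDERS_SPEC = {
    "customer_id": (str, True, None),
    "status": (str, True, {"open", "shipped", "cancelled"}),
    "limit": (int, False, None),
}


def validate_args(args: dict, spec: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the args are valid."""
    problems = []
    for field, (expected_type, required, allowed) in spec.items():
        value = args.get(field)
        if value is None:
            if required:
                problems.append(f"{field}: required but missing or null")
            continue
        # bool is a subclass of int in Python, so reject it explicitly for int fields
        wrong_type = not isinstance(value, expected_type) or (
            expected_type is int and isinstance(value, bool)
        )
        if wrong_type:
            problems.append(f"{field}: expected {expected_type.__name__}, got {type(value).__name__}")
            continue
        if allowed is not None and value not in allowed:
            problems.append(f"{field}: {value!r} is not one of {sorted(allowed)}")
    # Unknown keys are often a sign that the model invented a parameter
    for extra in sorted(set(args) - set(spec)):
        problems.append(f"{extra}: unexpected argument")
    return problems


# A string field gets an integer
print(validate_args({"customer_id": 12345, "status": "open"}, SEARCH_ORDERS_SPEC))
# A required parameter is null, and an enum value is outside the allowed set
print(validate_args({"customer_id": None, "status": "pending"}, SEARCH_ORDERS_SPEC))
# A valid call produces no problems
print(validate_args({"customer_id": "cust_0000000001", "status": "open"}, SEARCH_ORDERS_SPEC))
```

Collecting every problem, rather than raising on the first one, gives the model a complete error message to correct against in a single retry.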

Mistake 3: Treating Tool Errors as Terminal

An API goes down. A database query times out. A tool returns a 500. What does your agent do?

In a naive implementation, it stops. Or worse, it retries the same call with the same arguments indefinitely until you hit a rate limit or someone looks at the logs.

This is where the difference between a prototype and a production agent shows up most clearly. Production agents need structured error handling baked into the tool layer, not as an afterthought but as part of the tool’s contract with the orchestrator.

Every tool I ship has a return envelope that distinguishes recoverable failures from terminal ones:

python

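class TransientError(Exception):
    """Timeouts, rate limits, and other failures worth retrying."""


class InvalidInputError(Exception):
    """Arguments that failed validation."""


def call_tool(name: str, args: dict) -> dict:
    try:
        result = execute_tool(name, args)  # execute_tool: your tool dispatcher
        return {"status": "success", "data": result}
    except TransientError as e:
        return {"status": "retry", "reason": str(e), "retry_after": 2}
    except InvalidInputError as e:
        return {"status": "invalid_args", "reason": str(e)}
    except Exception:
        # Unknown failure: keep internal details out of the model's context
        return {"status": "error", "reason": "Tool failed. Do not retry."}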

The agent’s orchestration layer (LangGraph, in my case) reads that status field and routes accordingly. A retry status triggers exponential backoff with jitter. An invalid_args status routes back to the model with the error message so it can attempt a corrected call. An error status escalates or gracefully terminates that branch of execution.
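The routing described above can be sketched in plain Python, independent of any orchestration framework. This is not LangGraph code: `run_with_routing`, `backoff_delay`, `MAX_ATTEMPTS`, and the `ask_model_to_fix` callback are hypothetical names, and `tool_fn` stands in for any wrapper that returns the status envelope.

```python
import random
import time
from typing import Callable

MAX_ATTEMPTS = 4  # hypothetical cap so a retry loop can never run indefinitely


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def run_with_routing(
    tool_fn: Callable[[dict], dict],
    args: dict,
    ask_model_to_fix: Callable[[dict, str], dict],
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call tool_fn and route on the envelope's status field."""
    for attempt in range(MAX_ATTEMPTS):
        result = tool_fn(args)
        status = result["status"]
        if status == "success":
            return result
        if status == "retry":
            sleep(backoff_delay(attempt))  # same args, wait, try again
        elif status == "invalid_args":
            # Hand the error message back so the model can correct the call
            args = ask_model_to_fix(args, result["reason"])
        else:
            return result  # "error": terminal, escalate or end this branch
    return {"status": "error", "reason": f"Gave up after {MAX_ATTEMPTS} attempts."}


# Demo: a fake tool that fails transiently, then rejects the args, then succeeds.
responses = iter([
    {"status": "retry", "reason": "timeout", "retry_after": 2},
    {"status": "invalid_args", "reason": "customer_id must start with 'cust_'"},
    {"status": "success", "data": {"id": "cust_42"}},
])
seen_args = []


def fake_tool(args: dict) -> dict:
    seen_args.append(dict(args))
    return next(responses)


outcome = run_with_routing(
    fake_tool,
    {"customer_id": "42"},
    ask_model_to_fix=lambda args, reason: {"customer_id": "cust_42"},
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
print(outcome["status"], seen_args[-1])  # success {'customer_id': 'cust_42'}
```

The attempt cap is the important design choice here: both the retry path and the corrected-call path count against the same budget, so neither can loop forever.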

Without this structure, your agent has no way to distinguish “try again” from “stop, something is fundamentally wrong.” It guesses. And its guesses at error recovery are usually bad.

Mistake 4: Giving the Agent Too Many Tools

This one surprises engineers every time. You’d think more tools mean more capability. In practice, it often means worse performance.

When you register fifteen tools with an agent, every one of those tool schemas enters the model’s context window. The model now has to reason about fifteen possible actions on every step. That increases token usage, slows down routing decisions, and, critically, raises the probability of the model calling the wrong tool. The more overlapping or similar your tools look from a description standpoint, the worse this gets.
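A rough way to see this cost is to measure how much text the registered schemas add to every request. The sketch below assumes schemas are plain dicts serialized as JSON and uses a crude four-characters-per-token heuristic; the real count depends on the model's tokenizer and on how the provider renders tool definitions, so treat the numbers as relative rather than exact. `estimate_schema_tokens`, `make_tool`, and the placeholder tools are hypothetical.

```python
import json

CHARS_PER_TOKEN = 4  # crude heuristic for English text; use your model's tokenizer for real counts


def estimate_schema_tokens(schemas: list[dict]) -> int:
    """Approximate tokens consumed by serialized tool schemas on every model call."""
    total_chars = sum(len(json.dumps(schema)) for schema in schemas)
    return total_chars // CHARS_PER_TOKEN


def make_tool(i: int) -> dict:
    """Build a placeholder schema roughly the size of a well-documented tool."""
    return {
        "name": f"tool_{i:02d}",
        "description": "Placeholder description of what this tool does and when to call it. " * 3,
        "parameters": {"id": {"type": "string", "description": "Exact record identifier."}},
    }


all_tools = [make_tool(i) for i in range(15)]
scoped_tools = all_tools[:3]  # the subset one narrowly scoped agent might need

print("15 tools:", estimate_schema_tokens(all_tools), "tokens (approx.)")
print(" 3 tools:", estimate_schema_tokens(scoped_tools), "tokens (approx.)")
```

Because the schemas are resent on every step of the agent loop, the overhead multiplies by the number of model calls in a run, not just the number of user messages.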

Amazon Prime Video hit this directly in production. Centralizing all tool access through a single MCP server loaded enough tool definitions to consume a meaningful chunk of the context window before the agent processed a single user message.

The fix I’ve landed on: scope tools to the agent’s role. Not every agent needs access to every tool. A customer lookup agent doesn’t need write access to anything. An order status agent doesn’t need access to account management APIs. Define the minimum viable toolset for each agent’s function and enforce it at the MCP permission layer, not just at the prompt level.
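Role scoping can be sketched as an allowlist that is checked at dispatch time, so a tool outside the agent's role fails even if the prompt or the model asks for it. The role names, tool names, and the `tools_for` and `dispatch` helpers are hypothetical; in a real deployment the same check would live in the MCP server's permission layer rather than in application code.

```python
# Hypothetical role -> allowed tool names. Read-only roles get no write tools.
ROLE_TOOLSETS = {
    "customer_lookup": frozenset({"get_customer_by_id", "search_customers"}),
    "order_status": frozenset({"get_order_status"}),
}

# Hypothetical registry of tool implementations (stubs for illustration).
TOOL_REGISTRY = {
    "get_customer_by_id": lambda args: {"customer_id": args["customer_id"]},
    "search_customers": lambda args: {"matches": []},
    "get_order_status": lambda args: {"order_id": args["order_id"], "status": "shipped"},
    "update_account": lambda args: {"updated": True},
}


def tools_for(role: str) -> list[str]:
    """Names of the tools whose schemas should be sent to the model for this role."""
    return sorted(ROLE_TOOLSETS.get(role, frozenset()))


def dispatch(role: str, tool_name: str, args: dict) -> dict:
    """Execute a tool only if the role's allowlist permits it."""
    if tool_name not in ROLE_TOOLSETS.get(role, frozenset()):
        return {
            "status": "error",
            "reason": f"Tool '{tool_name}' is not permitted for role '{role}'.",
        }
    return {"status": "success", "data": TOOL_REGISTRY[tool_name](args)}


print(tools_for("order_status"))
print(dispatch("order_status", "get_order_status", {"order_id": "ord_1"}))
print(dispatch("order_status", "update_account", {"account_id": "acc_1"}))
```

The allowlist does double duty: `tools_for` keeps unneeded schemas out of the context window, and `dispatch` enforces the boundary even if a tool name leaks into the conversation.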

If a human reviewing your system can’t immediately say which tool should handle a given scenario, the model can’t either. That ambiguity is a configuration problem, not a model problem.

Mistake 5: No Observability on Tool Execution

When something goes wrong, and it will, can you reconstruct exactly what happened? Which tool was called, with what arguments, and what did it return?

Most teams can’t. Tool calls happen inside the agent loop, and unless you’ve explicitly wired in tracing, they’re invisible. You see the final output. You don’t see the three intermediate tool calls that produced it.

I trace every tool invocation (inputs, outputs, latency, and status) using structured logging tied to a trace ID that spans the full agent run:

python

import time

import structlog

log = structlog.get_logger()


def traced_tool_call(trace_id: str, tool_name: str, args: dict) -> dict:
    """Log the start and end of a tool call under one trace ID."""
    log.info("tool_call_start", trace_id=trace_id, tool=tool_name, args=args)
    start = time.perf_counter()  # monotonic clock, suited to measuring latency
    result = call_tool(tool_name, args)
    log.info(
        "tool_call_end",
        trace_id=trace_id,
        tool=tool_name,
        status=result["status"],
        latency_ms=round((time.perf_counter() - start) * 1000),
    )
    return result

This gives you the ability to replay any agent run, identify exactly where it went wrong, and determine whether the failure was a bad tool call, a bad model decision, or a downstream service issue. Without this, debugging agentic systems is guesswork, and guesswork is expensive when the system is touching live enterprise data.
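Replaying a run then comes down to filtering the structured log by trace ID. The sketch below uses only the standard library and an in-memory list in place of a real log store; `record_event`, `replay_run`, and the event fields are hypothetical and only loosely mirror the log calls above.

```python
import json
import uuid

EVENTS: list[str] = []  # stand-in for a log sink; each entry is one JSON line


def record_event(trace_id: str, event: str, **fields) -> None:
    """Append one structured event as a JSON line."""
    EVENTS.append(json.dumps({"trace_id": trace_id, "event": event, **fields}))


def replay_run(trace_id: str) -> list[dict]:
    """Return every event for one agent run, in the order it was recorded."""
    parsed = (json.loads(line) for line in EVENTS)
    return [e for e in parsed if e["trace_id"] == trace_id]


# Two interleaved agent runs, each with its own trace ID.
run_a, run_b = str(uuid.uuid4()), str(uuid.uuid4())
record_event(run_a, "tool_call_start", tool="search_customers", args={"name": "Acme"})
record_event(run_b, "tool_call_start", tool="get_order_status", args={"order_id": "ord_1"})
record_event(run_a, "tool_call_end", tool="search_customers", status="success", latency_ms=41)
record_event(run_a, "tool_call_start", tool="get_customer_by_id", args={"customer_id": "42"})
record_event(run_a, "tool_call_end", tool="get_customer_by_id", status="invalid_args", latency_ms=3)
record_event(run_b, "tool_call_end", tool="get_order_status", status="success", latency_ms=87)

# Reconstruct run A only: the second tool call is where it went wrong.
for e in replay_run(run_a):
    print(e["event"], e["tool"], e.get("status", ""))
```

Because every event carries the trace ID, interleaved runs from concurrent users separate cleanly, and the failing step is visible without rerunning the agent.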

The Through Line

Tool calling looks simple from the outside. You define a function, register it, and let the model decide when to invoke it. The complexity is in everything that surrounds that decision: schema clarity, input validation, error classification, toolset scoping, and execution observability.

Every one of these is an engineering discipline, not a model capability. The model will do its job. The question is whether you’ve built the infrastructure that makes its job possible.

The teams shipping agents to production reliably are the ones who’ve stopped treating tool calling as a convenience feature and started treating it as a first-class engineering surface. The rest are debugging production incidents and wondering why their demo worked.

Tool Calling Is Not an API Call: What Engineers Keep Getting Wrong was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
