AI Agents: Skills, Hooks, & Tool Standards

A Guest Contribution by Tanishq Singh and Jiten Oswal

If you’ve been following the AI agent space in early 2026, you’ve probably noticed something: the conversation has shifted.

We’re no longer debating whether agents work.

We’re debating how to make them reliable, reusable, and composable.

The answer, it turns out, isn’t a better model. It’s better infrastructure around the model (honestly, that’s what engineering is!).

If you’re building AI agents, you’ll relate to this: probably 80% of your time and effort is spent not on LLM reasoning, but on everything around it:

  • logging,
  • state management,
  • safety,
  • evals,
  • orchestration,
  • memory management,
  • context handling,
  • tooling, etc.

This blog covers some of these ideas. They’re new to this world, but once you understand them you’ll wonder, “why wasn’t this here before?”

Skills → Portable, Reusable Expertise for Agents

A Skill is a modular package of instructions (and optionally scripts and reference material) that teaches an AI agent how to do something specific. Think of it as a training manual that the agent loads on demand.

The concept emerged across multiple tools almost simultaneously (Claude Code, OpenCode, and others), and Vercel formalized it as an open standard with their skills.sh ecosystem, essentially building “npm for AI agents.”

At its core, a skill is just a folder with a SKILL.md file:

my-skill/
├── SKILL.md # Natural language instructions (required)
├── scripts/ # Optional helper scripts
└── references/ # Optional reference material

The SKILL.md file has two parts: YAML frontmatter that tells the agent when to use the skill, and markdown content with the instructions the agent follows when the skill is invoked.

How Skills Work — The Progressive Disclosure Pattern

One of the elegant design decisions in Claude Code’s skill implementation is progressive disclosure. Agents don’t load every skill’s full content into context upfront. Instead:

  1. Metadata scan (~100 tokens): The agent first scans only each skill’s name and description from the YAML frontmatter.
  2. Instructions Loading: If the current task matches a skill’s description, the agent loads the full instructions.
  3. Full execution: The agent follows the loaded instructions.

This means you can have dozens of specialized skills available without blowing up your context window. The agent pays attention cost only for what’s relevant.
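To make the two-pass idea concrete, here is a rough sketch of how progressive disclosure can be implemented. The helper names (`scan_skill_metadata`, `load_skill_body`) and the naive frontmatter parsing are illustrative assumptions, not Claude Code’s actual implementation:

```python
import re
from pathlib import Path

def scan_skill_metadata(skills_dir: str) -> dict[str, dict]:
    """Pass 1: read only the YAML frontmatter of each SKILL.md (cheap)."""
    metadata = {}
    for skill_file in Path(skills_dir).glob("*/SKILL.md"):
        text = skill_file.read_text()
        match = re.match(r"---\n(.*?)\n---", text, re.DOTALL)
        if not match:
            continue
        # Naive key: value parsing; a real implementation would use a YAML library.
        fields = dict(
            line.split(":", 1) for line in match.group(1).splitlines() if ":" in line
        )
        metadata[skill_file.parent.name] = {k.strip(): v.strip() for k, v in fields.items()}
    return metadata

def load_skill_body(skills_dir: str, name: str) -> str:
    """Pass 2: load the full instructions only when the task matches the description."""
    text = (Path(skills_dir) / name / "SKILL.md").read_text()
    return text.split("---", 2)[-1].strip()
```

The key point is that pass 1 touches only a few lines per skill, so an agent can index dozens of skills for roughly the context cost of loading one.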

Creating a Skill — Claude Code Example

Here’s a practical skill that teaches Claude Code to perform code reviews following your team’s standards:

# File: .claude/skills/code-review/SKILL.md
---
name: code-review
description: >
  Reviews code for quality, security, and adherence to team standards.
  Use when the user asks for a review, audit, or quality check of code.
allowed-tools:
  - Read
  - Grep
  - Glob
---

When reviewing code, follow this checklist:

1. **Security**: Check for hardcoded secrets, SQL injection, XSS vulnerabilities
2. **Naming**: Flag generic names like "helper", "utils", "data" - suggest domain-specific names
3. **Error handling**: Verify all async operations have proper error handling
4. **Types**: Ensure TypeScript types are specific, not `any`
5. **Testing**: Check that new functions have corresponding test coverage

Format your review as:
- 🔴 **Critical**: Must fix before merge
- 🟡 **Suggestion**: Would improve quality
- 🟢 **Praise**: Good patterns worth highlighting

Always explain *why* something is an issue, not just *what* to change.

Once this file exists in your project, Claude Code will automatically invoke it when you say something like “review this PR” or “audit this file for issues.” You can also trigger it explicitly with /code-review.

Skills with Subagent Execution

Skills can also delegate work to subagents. The context: fork frontmatter option runs the skill in a separate agent with its own context window:

---
name: deep-research
description: Research a topic thoroughly across the codebase
context: fork
agent: Explore
---

Research $ARGUMENTS thoroughly:
1. Find relevant files using Glob and Grep
2. Read and analyze the code
3. Summarize findings with specific file references

This keeps heavy exploration out of your main conversation and returns only a summary.

Skills with Shell Preprocessing

Skills can run shell commands before being sent to Claude, injecting live data:

---
name: pr-summary
description: Summarize changes in a pull request
context: fork
agent: Explore
allowed-tools: Bash(gh *)
---

## Pull request context
- PR diff: !`gh pr diff`
- PR comments: !`gh pr view --comments`
- Changed files: !`gh pr diff --name-only`

## Your task
Summarize this pull request...

The !`...` backtick syntax runs shell commands first. Claude receives the fully rendered prompt with actual PR data; this is preprocessing, not something Claude executes.

You might also have heard about AGENTS.md. We’ll talk about it in another blog, and cover the differences between SKILL.md and AGENTS.md.

Hooks → Deterministic Guardrails for Non-Deterministic Agents

Hooks are event-driven shell scripts that run automatically at specific points in an agent’s lifecycle. They are the deterministic layer injected into an inherently non-deterministic system.

This is a critical distinction. Unlike skills (which the agent chooses to invoke), hooks always fire when their conditions are met.

They don’t depend on the LLM remembering to:

  • run a linter,
  • validate a schema, or
  • check for secrets.

They execute every single time, period.

In short, a hook’s job is to constrain and supplement what an agent does.

How Hooks Work Under the Hood

Each hook is configured in your settings.json with three components:

  1. Event: Which lifecycle moment triggers it.
  2. Matcher: A regex filter for which tools trigger it (e.g., Edit|Write matches file modifications).
  3. Command: The shell command to run.

The hook receives JSON data via stdin containing context about the event → session ID, working directory, tool name, tool input parameters. Your script processes this and communicates back through exit codes and stdout JSON.

Exit code semantics:

  • 0 — Success. Action proceeds. Stdout is processed for JSON or added to context.
  • 2 — Block. The action is prevented. Stderr becomes the error message shown to Claude.
  • Other — Non-blocking error. Stderr shown in verbose mode.
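Put together, a hook is just a program that reads the stdin JSON and signals its decision through its exit code. Here is a minimal PreToolUse sketch in Python; the field names (`tool_input`, `content`, `new_string`) follow the hook input format described above, but the secret-detection regex is purely illustrative:

```python
#!/usr/bin/env python3
"""PreToolUse hook sketch: block file writes that contain obvious secrets."""
import json
import re
import sys

# Very rough secret pattern; a real hook would use a scanner like gitleaks.
SECRET_RE = re.compile(r"(api[_-]?key|secret)\s*[:=]\s*['\"][A-Za-z0-9]{16,}", re.IGNORECASE)

def decide(event: dict) -> int:
    """Return the hook's exit code: 0 allows the tool call, 2 blocks it."""
    tool_input = event.get("tool_input", {})
    # Write passes `content`; Edit passes `new_string`
    content = tool_input.get("content", "") or tool_input.get("new_string", "")
    if SECRET_RE.search(content):
        # On exit code 2, stderr becomes the error message shown to Claude
        print("Blocked: possible hardcoded secret in file write.", file=sys.stderr)
        return 2
    return 0

# Entry point: the agent delivers event context as JSON on stdin, e.g.
# sys.exit(decide(json.load(sys.stdin)))
```

You would register this script under a PreToolUse event with an Edit|Write matcher, exactly like the examples that follow.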

Practical Examples

Auto-format code after every edit (Boris’s exact pattern):

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "bun run format || true"
          }
        ]
      }
    ]
  }
}

Boris, the creator of Claude Code, uses this because while “Claude’s code is usually well formatted, inconsistencies can sometimes cause CI failures.” The || true ensures the hook doesn’t block on formatter errors.

Auto-approve safe commands (skip permission dialogs):

{
  "hooks": {
    "PermissionRequest": [
      {
        "matcher": "Bash(npm test*)",
        "hooks": [
          {
            "type": "command",
            "command": "/path/to/validate-test-command.sh"
          }
        ]
      }
    ]
  }
}

Instead of clicking “Allow” for npm test for the hundredth time, this auto-approves it. Boris uses /permissions to pre-allow safe bash commands like bun run build:*, bun run test:*, shared across his team.

Desktop notification when Claude needs input:

{
  "hooks": {
    "Notification": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "notify-send 'Claude Code' 'Awaiting your input'"
          }
        ]
      }
    ]
  }
}

This is how people manage several parallel terminal sessions: system notifications tell them when any Claude session needs attention.

These were some examples from practical experience, along with Boris’s recommendations. As of a recent Claude Code update (v2.0.10), PreToolUse hooks can also modify tool inputs before execution. The modified JSON is transparent to Claude; it doesn’t know the inputs were changed.

Tool Standards

Every agent capability ultimately reduces to one thing: Calling tools.

Whether it’s searching the web, reading a file, querying a database, or sending an email — the agent reasons about the task, decides which tool to call, generates structured arguments, and processes the result.

How tools are defined, how arguments are specified, and how results are returned is one of the biggest determinants of agent reliability. And yet it’s the part most developers rush through.

Standardization is Emerging → MCP

While tool schemas look provider-specific at first glance, the ecosystem is converging toward shared standards.

One of the most important emerging efforts is the Model Context Protocol (MCP).

MCP aims to standardize how AI applications expose tools and contextual resources to models — functioning for AI agents much like the Language Server Protocol (LSP) does for IDEs.

Instead of every framework inventing custom tool wiring, MCP defines: how tools are declared, how capabilities are exposed, how context is shared, how external systems integrate.

As agents move from experiments to infrastructure, protocol standardization becomes critical.

Tool standards aren’t just about schema formatting. They’re about interoperability.

The Universal Schema

Tool definitions follow a JSON schema that is remarkably consistent across providers. Here’s how you define a tool for both OpenAI and Anthropic:

{
  "name": "get_weather",
  "description": "Fetch current weather conditions for a specific city. Returns temperature in Celsius, humidity percentage, and a brief condition description.",
  "parameters": {
    "type": "object",
    "properties": {
      "city": {
        "type": "string",
        "description": "City name, e.g., 'San Francisco' or 'London'"
      },
      "units": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"],
        "description": "Temperature unit for the response"
      }
    },
    "required": ["city"]
  }
}
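While the core JSON Schema is shared, each provider wraps it slightly differently: OpenAI nests the definition under a "function" key and names the schema "parameters", while Anthropic uses a flat object whose schema field is "input_schema". A small adapter sketch makes the mapping explicit:

```python
def to_openai(tool: dict) -> dict:
    """Wrap a generic tool definition in OpenAI's function-calling envelope."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["parameters"],
        },
    }

def to_anthropic(tool: dict) -> dict:
    """Anthropic uses a flat shape with `input_schema` instead of `parameters`."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": tool["parameters"],
    }
```

Keeping one canonical definition and converting at the edge means tool descriptions stay in one place even if you switch providers.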

The Tool Calling Loop

Here’s the complete lifecycle of a tool call:

1. Context Assembly
System prompt + Tool definitions + User message → Complete context

2. LLM Decision
Model analyzes context → Decides to call a tool
Output: { "tool": "get_weather", "arguments": { "city": "Paris" } }

3. Tool Execution
Agent framework receives tool call → Executes actual function
Result: { "temperature": 18, "humidity": 65, "condition": "Partly cloudy" }

4. Result Injection (tool_result message)
Tool output is formatted as a "tool_result" message → Appended to conversation

5. LLM Continuation
Model receives updated context with tool result → Generates response or calls another tool

In the Anthropic API, this looks like:

import json

import anthropic

client = anthropic.Anthropic()

# Step 1: Define tools
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city. Returns temperature, humidity, and conditions.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "City name (e.g., 'San Francisco')"
                }
            },
            "required": ["city"]
        }
    }
]

# Step 2: Send message — model decides to use tool
response = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Paris?"}]
)

# Step 3: response.content now contains a tool_use block, e.g.:
# [{"type": "tool_use", "id": "call_123", "name": "get_weather",
#   "input": {"city": "Paris"}}]

# Step 4: Execute tool and return result
tool_result = get_weather("Paris")  # Your actual function

# Step 5: Send tool result back
final_response = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=1024,
    tools=tools,
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"},
        {"role": "assistant", "content": response.content},
        {
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": "call_123",
                    "content": json.dumps(tool_result)
                }
            ]
        }
    ]
)
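The single round trip above generalizes to a loop: keep calling the model and executing every tool_use block it emits until a response arrives with no tool calls. A sketch of that loop, where `tool_registry` is a hypothetical dict mapping tool names to local Python functions (not part of any SDK):

```python
import json

def run_agent_loop(client, tools, tool_registry, messages,
                   model="claude-sonnet-4-5-20250514"):
    """Repeat call -> execute -> inject until the model stops requesting tools."""
    while True:
        response = client.messages.create(
            model=model, max_tokens=1024, tools=tools, messages=messages
        )
        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses:
            return response  # No tool_use blocks left: this is the final answer
        messages.append({"role": "assistant", "content": response.content})
        # All results for this turn go back in ONE user message,
        # matched to their requests by tool_use_id (order doesn't matter).
        messages.append({
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(tool_registry[block.name](**block.input)),
                }
                for block in tool_uses
            ],
        })
```

A production version would also cap the number of iterations and handle tool execution errors, but the control flow is exactly this.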

Writing Effective Tool Descriptions

1. Be specific about what the tool does AND what it returns:

// ❌ Vague
{
  "name": "manage_users",
  "description": "Manages user data"
}

// ✅ Specific
{
  "name": "get_user_profile",
  "description": "Retrieves a user's profile information (name, email, registration date, subscription tier) by their unique user ID. Returns null if the user ID doesn't exist."
}

2. Use action-oriented language and specify edge cases:

{
  "name": "search_documents",
  "description": "Searches the document store using semantic similarity. Returns the top 5 most relevant documents with their relevance scores (0-1). Returns an empty array if no documents match above 0.3 threshold. Supports natural language queries; does NOT support SQL or regex syntax."
}

3. Specify parameter constraints in descriptions, not just types:

{
  "properties": {
    "limit": {
      "type": "integer",
      "description": "Maximum number of results to return. Must be between 1 and 100. Defaults to 10 if not specified."
    }
  }
}

The Token Cost of Tools

Here’s something most people miss: tool definitions are included in the context on every LLM call. They consume tokens and affect cost and latency for the entire conversation.

This is why practitioners who’ve built at scale are emphatic about keeping tool counts low:

  • Claude Code uses ~12 tools
  • Manus uses fewer than 20 tools

The GitHub MCP server, by contrast, exposes 35 tools with ~26K tokens of tool definitions. If every model call includes those definitions, you’re burning 26K tokens of context budget before your conversation even starts.

The LLMToolSelectorMiddleware pattern (introduced in LangChain 1.0) is one solution: use a cheap model to filter tools per query before passing them to the main model. Another approach is to group tools by domain and only load the relevant group.
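The domain-grouping idea can be sketched in a few lines. The groups, tool names, and keyword router below are all hypothetical (a real system might use a cheap model call instead of keywords, which is the LLMToolSelectorMiddleware idea):

```python
# Hypothetical tool definitions grouped by domain; only one group is
# loaded per query, so the model never sees every definition at once.
TOOL_GROUPS: dict[str, list[dict]] = {
    "github":   [{"name": "gh_list_prs"}, {"name": "gh_create_issue"}],
    "database": [{"name": "run_query"}, {"name": "describe_table"}],
    "email":    [{"name": "send_email"}],
}

KEYWORDS = {
    "github":   ("pull request", "issue", "repo"),
    "database": ("query", "table", "sql"),
    "email":    ("email", "mail"),
}

def select_tool_group(user_message: str) -> list[dict]:
    """Crude keyword router; a cheap LLM classifier would replace this."""
    text = user_message.lower()
    for group, words in KEYWORDS.items():
        if any(w in text for w in words):
            return TOOL_GROUPS[group]
    return []  # No match: send no optional tools, saving context tokens
```

Even this crude filter means a query like “open a pull request” pays for two tool definitions instead of thirty-five.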

Common Pitfalls Developers Hit

1. The tool_use / tool_result Pairing Problem:

This is the most common API-level error developers hit. The strict rule is: every tool_use block from the assistant must have exactly one corresponding tool_result block in the next user message, matched by tool_use_id. Three ways this breaks:

  • Orphaned tool_results: You send a tool_result whose tool_use_id doesn’t match any tool_use in the preceding assistant message. Anthropic’s API rejects with: “unexpected tool_use_id found in tool_result blocks.” This commonly happens when conversation history is truncated or summarized — if the assistant message containing the tool_use gets removed but the following tool_result stays, the pairing breaks.
  • Missing tool_results: The assistant requests two tool calls but you only return one result. OpenAI’s API rejects with: “An assistant message with ‘tool_calls’ must be followed by tool messages responding to each ‘tool_call_id’.” Same strict rule, different provider.
  • Duplicate tool_use_ids: Bedrock rejects with “messages contain duplicate Ids” if two tool_use blocks share the same ID. LiteLLM hit this when its caching layer accidentally doubled tool call blocks.

The fix for all of these: treat tool_use/tool_result pairs as atomic units. Never trim, summarize, or remove one without the other, and validate your message array before every API call.
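That validation can be a single pass over the message array. A sketch against Anthropic-style messages (content blocks as plain dicts; a real validator would also handle SDK objects):

```python
def validate_tool_pairing(messages: list[dict]) -> list[str]:
    """Return pairing errors: every tool_use in an assistant turn must have
    exactly one matching tool_result in the next user turn, and vice versa."""
    errors = []
    for i, msg in enumerate(messages):
        content = msg.get("content")
        if not isinstance(content, list):
            continue  # Plain string content: nothing to pair
        if msg["role"] == "assistant":
            use_ids = [b["id"] for b in content if b.get("type") == "tool_use"]
            if len(use_ids) != len(set(use_ids)):
                errors.append(f"message {i}: duplicate tool_use ids")
            if not use_ids:
                continue
            nxt = messages[i + 1].get("content", []) if i + 1 < len(messages) else []
            result_ids = {
                b["tool_use_id"] for b in nxt
                if isinstance(b, dict) and b.get("type") == "tool_result"
            }
            for uid in use_ids:
                if uid not in result_ids:
                    errors.append(f"message {i}: tool_use {uid} has no tool_result")
        else:  # user turn: look for orphaned results
            prev = messages[i - 1].get("content", []) if i > 0 else []
            prev_ids = {
                b["id"] for b in prev
                if isinstance(b, dict) and b.get("type") == "tool_use"
            }
            for b in content:
                if b.get("type") == "tool_result" and b["tool_use_id"] not in prev_ids:
                    errors.append(f"message {i}: orphaned tool_result {b['tool_use_id']}")
    return errors
```

Running this before every API call catches the truncation-induced orphan case before the provider rejects the request.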

2. Tool Schema Hallucination (Models inventing parameters or tools)

Smaller or quantized models sometimes generate tool calls with parameters that don’t exist in the schema, or call tool names that weren’t defined. The model is just generating JSON; it’s essentially doing next-token prediction, and sometimes it predicts plausible but invalid structures. This is especially common with complex schemas or when too many tools are loaded.
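The defense is to validate every tool call against its schema before executing anything. A minimal guard, assuming a call shaped as `{"name": ..., "arguments": ...}` (in production, use a full JSON Schema validator such as the `jsonschema` package instead of this hand-rolled check):

```python
def check_tool_call(call: dict, tool_defs: dict[str, dict]) -> list[str]:
    """Reject hallucinated tool names and invented or missing parameters.
    `tool_defs` maps tool name -> its JSON parameter schema."""
    if call["name"] not in tool_defs:
        return [f"unknown tool: {call['name']}"]
    schema = tool_defs[call["name"]]
    props = schema.get("properties", {})
    errors = [f"invented parameter: {p}" for p in call["arguments"] if p not in props]
    errors += [
        f"missing required parameter: {p}"
        for p in schema.get("required", []) if p not in call["arguments"]
    ]
    return errors
```

On failure, feed the errors back to the model as a tool_result so it can retry, rather than crashing the loop.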

3. The “Pretend to Call” Problem

The model generates a natural language response that describes calling a tool rather than actually producing a structured tool call. It says “I’ll look that up for you” and then fabricates a plausible answer. The rule of thumb: if you don’t see a structured tool_use block in the response, the model is faking it.

4. Parallel Tool Call Ordering

When the model requests multiple tool calls in a single response, developers often don’t realize all results must be returned in a single user message. Sending them as separate messages breaks the conversation structure. Also, some developers assume the results need to be in the same order as the requests; they don’t, but the tool_use_id matching must be correct.

5. Text Before Tool Results

A subtle API constraint: when sending tool results back, you can’t prepend arbitrary text content before the tool_result blocks in the same user message (at least with Anthropic’s API). The Anthropic docs explicitly flag this as a common mistake.

6. Context Window Erosion from Tool Results

Tool results can be massive: e.g., a database query returning 2000 rows, or a web search returning full page content. Developers often pipe raw results back into the conversation without truncation, rapidly eating the context window. Anthropic recently introduced automatic tool call clearing (beta) to address this, and their new Programmatic Tool Calling feature keeps intermediate results out of Claude’s context entirely.
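Even without provider features, a simple cap before injecting results goes a long way. A sketch (the 4000-character default and the "4 chars per token" heuristic are rough assumptions, not a standard):

```python
def truncate_tool_result(result: str, max_chars: int = 4000) -> str:
    """Cap a tool result before appending it to the conversation.
    ~4 chars per token is a rough heuristic, so 4000 chars is ~1K tokens."""
    if len(result) <= max_chars:
        return result
    kept = result[:max_chars]
    return kept + (
        f"\n[truncated {len(result) - max_chars} of {len(result)} chars; "
        "refine the query for more]"
    )
```

Telling the model that truncation happened (and how to get more) matters as much as the cap itself; otherwise it treats the partial result as complete.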

7. Authentication & Rate Limits at the Tool Execution Layer

The gap between “tool calling works in a demo” and “tool calling works in production” is almost entirely about execution infrastructure: OAuth flows for 5000 users, rate limit backoff the LLM doesn’t know about, pagination the LLM can’t handle.

Conclusion

The trend is clear: the agent ecosystem is converging on these primitives.

  • Skills make agents smarter by giving them reusable expertise.
  • Hooks make agents safer by enforcing invariants regardless of LLM behavior.
  • Tool standards make agents capable by defining the contract between reasoning and action.

Whether you’re using Claude Code, LangChain, OpenCode, Cursor, or any of the dozens of tools in this rapidly evolving space, skills, hooks, and well-designed tools are the building blocks you’ll need to master.

The next blog in this series will cover The Patterns → Ralph Loops, Spec-Driven Development, Plan Mode, and Context Engineering, the battle-tested workflows that practitioners are using to make these building blocks actually work in production.

References

  1. https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices
  2. https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview#how-skills-work
  3. https://code.claude.com/docs/en/hooks
  4. https://www.useparagon.com/learn/rag-best-practices-optimizing-tool-calling/
  5. https://platform.claude.com/docs/en/agents-and-tools/tool-use/overview
  6. https://www.anthropic.com/engineering/writing-tools-for-agents
  7. https://www.anthropic.com/engineering/advanced-tool-use
  8. https://vercel.com/changelog/introducing-skills-the-open-agent-skills-ecosystem

Enjoyed this deep dive?

We write, and work with industry experts, to cover AI systems, AI & data engineering, LLM internals, and platform architecture.

Got a tricky AI system or private LLM problem? Drop it in the comments, and we might write our next deep dive about it if there’s enough interest.


AI Agents: Skills, Hooks, & Tool Standards was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
