PyTrace Autopsy: Teaching AI to Debug Like a Human

How runtime function tracing turns Claude Code from a code reader into a runtime detective

You know the loop. You are building a feature with Claude Code. It writes the code, writes the tests, runs them, and the tests fail. Claude reads the traceback, stares at the source, and proposes a fix. You run the tests again. Still failing. Claude tries a different fix. Still failing. After three attempts, you are wondering: why does it keep missing this?

Here is the thing. Claude can read your code perfectly well. It can follow the logic, understand the types, and reason about control flow. But when the tests fail, it does something no experienced developer would actually do: it tries to diagnose a runtime problem by reading only the source.

That is like a mechanic diagnosing an engine knock by reading the repair manual. You can’t find a misfiring cylinder by staring at a schematic. You need to hear the engine run.

What Human Developers Actually Do

Think about the last time you hit a really stubborn bug. You probably didn’t just sit there re-reading the same function over and over. You did something active:

  • You added print() statements to see what values actually flowed through the code at runtime.
  • You fired up a debugger and stepped through line by line, watching variables change.
  • You added logging to capture the call sequence: what got called, with what arguments, what came back.
  • You may even have reached for a profiler or a tracer to see the full picture.

The common thread is observation. You watched the code execute. You compared what actually happened to what you expected to happen. And somewhere in that gap between expectation and reality, you found the bug.

AI coding assistants don’t have this ability. They live in a world of static text. They can read the source, read the error message, and reason about what might be happening. But they can’t observe what is happening. That is a significant blind spot, and it is exactly the kind that leads to those frustrating multi-attempt fix loops.

The Idea: Give Claude a Tracer

What if we could give Claude the ability to trace function execution at runtime? Not a full interactive debugger; that requires a human sitting there to press “step over” and inspect things. Instead, an automated tracer that:

  • Records every function call along with its arguments
  • Records every return value
  • Records every exception (even the ones that get caught and swallowed)
  • Writes it all to a structured log that Claude can read and analyze

This is what PyTrace Autopsy does. It is a Claude Code skill that instruments Python’s runtime to capture function-level traces, then hands Claude the trace log so it can see what actually happened during test execution.

The result: instead of guessing at runtime behavior from static code, Claude gets a complete record of the execution. Call by call, argument by argument, return value by return value.

How It Works: Two Python Hooks You Probably Didn’t Know About

The implementation is surprisingly simple because Python gives us two hooks that fit together perfectly.

Hook 1: sys.settrace(). Python has a built-in mechanism for registering a callback that gets invoked on every function call, return, and exception. Debuggers like pdb use this internally. You pass it a function, and Python calls that function every time something interesting happens during execution.
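As an illustrative sketch (not the PyTrace Autopsy tracer itself), here is sys.settrace() collecting call and return events into a list:

```python
import sys

# Minimal sys.settrace() demo: record call/return/exception events.
events = []

def trace_calls(frame, event, arg):
    # The interpreter invokes this for every new Python frame ("call").
    # Returning it re-arms tracing inside that frame, so we also receive
    # the frame's "return" and "exception" events (and "line" events,
    # which we ignore here).
    if event in ("call", "return", "exception"):
        events.append((event, frame.f_code.co_name))
    return trace_calls

def greet(name):
    return f"hello, {name}"

sys.settrace(trace_calls)
greet("world")
sys.settrace(None)

print(events)  # [('call', 'greet'), ('return', 'greet')]
```

That ten-line callback is the entire foundation: everything else in a tracer is filtering and serialization.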

Hook 2: sitecustomize.py. When Python starts up, the site module automatically tries to import a file called sitecustomize.py. If it finds one on PYTHONPATH, it executes it before your program runs. It is meant for site-wide configuration, but it is a perfect injection point for a tracer.
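The injection pattern can be sketched as follows. This is a hedged illustration, not PyTrace Autopsy's actual implementation: the temp directory and config are created inline here, whereas in real use they sit on PYTHONPATH and the code below lives in sitecustomize.py.

```python
# Sketch of the sitecustomize.py pattern: read a config, install a tracer,
# and never let a tracer failure break the host program.
import json
import os
import sys
import tempfile

trace_dir = tempfile.mkdtemp()
log_file = os.path.join(trace_dir, "trace_output.jsonl")
config = {"output": {"log_file": log_file}}  # stand-in for trace_config.json

def install_tracer(cfg):
    log = open(cfg["output"]["log_file"], "a")

    def _trace(frame, event, arg):
        if event == "call":
            log.write(json.dumps({"event": "call",
                                  "func": frame.f_code.co_name}) + "\n")
            log.flush()
        return _trace

    sys.settrace(_trace)

try:
    install_tracer(config)  # in sitecustomize.py this runs at interpreter startup
except Exception as exc:
    # A broken tracer must degrade to a warning, never to a crash.
    print(f"tracer disabled: {exc}", file=sys.stderr)

def work():
    return 42

work()
sys.settrace(None)
print(open(log_file).read())  # {"event": "call", "func": "work"}
```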

Combine them, and you get the full mechanism:

  1. Create a temporary directory.
  2. Place a sitecustomize.py in it that calls sys.settrace() with our tracing callback.
  3. Place a config file next to it specifying what to trace and where to write the log.
  4. Prepend the temp directory to PYTHONPATH.
  5. Run pytest as normal.

When Python starts, it picks up our sitecustomize.py, which installs the tracer. Every function call in your project source (and only your project source; stdlib and third-party packages are filtered out) gets logged to a JSONL file. When the tests finish, Claude reads the log.
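The "only your project source" filter comes down to a path check on each frame's source file. A sketch with assumed names (not PyTrace Autopsy's exact code):

```python
import os

def should_trace(filename, trace_paths, stdlib_dir):
    """Trace a frame only if its source file sits under a project path."""
    if not os.path.isabs(filename):      # "<string>", "<frozen ...>", etc.
        return False
    if filename.startswith(stdlib_dir):  # standard library
        return False
    if "site-packages" in filename:      # third-party packages
        return False
    return any(filename.startswith(p) for p in trace_paths)

paths = ["/home/me/proj/src"]
stdlib = "/usr/lib/python3.12"
print(should_trace("/home/me/proj/src/cart.py", paths, stdlib))             # True
print(should_trace("/usr/lib/python3.12/json/__init__.py", paths, stdlib))  # False
```

The filter runs on every call event, so keeping it to a few string prefix checks matters for overhead.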

The beautiful part: no project files are modified. The tracer lives entirely in a temp directory and is injected via an environment variable. Remove the temp directory, and it is as if nothing happened.

Here is what the setup looks like in practice:

# Create a temp directory for the tracer
TRACE_DIR=$(mktemp -d)
# Copy the tracer template and write a config
cp tracer_template.py "$TRACE_DIR/sitecustomize.py"
cat > "$TRACE_DIR/trace_config.json" << EOF
{
  "trace_targets": {
    "paths": ["/absolute/path/to/your/src"],
    "modules": [],
    "functions": []
  },
  "output": {
    "log_file": "$TRACE_DIR/trace_output.jsonl",
    "max_entries": 10000
  }
}
EOF
# Run tests with tracing active
PYTHONPATH="$TRACE_DIR:$PYTHONPATH" pytest tests/ -x -v

That is it. The tracer writes a structured JSONL log, and then Claude reads it.

A Real Example: The Phantom Items

Let us walk through a concrete case where tracing makes a difference. Here is a simple shopping cart class:

class Cart:
    items = []

    def __init__(self, customer_name):
        self.customer_name = customer_name

    def add_item(self, name, price, quantity=1):
        self.items.append({"name": name, "price": price, "quantity": quantity})

    def get_item_count(self):
        return sum(item["quantity"] for item in self.items)

And a test that checks whether two carts are independent:

def test_independent_carts():
    cart1 = Cart("Alice")
    cart1.add_item("Apple", 1.00)
    cart2 = Cart("Bob")
    cart2.add_item("Banana", 2.00)
    assert cart1.get_item_count() == 1, f"Cart1 has {cart1.get_item_count()} items"
    assert cart2.get_item_count() == 1, f"Cart2 has {cart2.get_item_count()} items"

The test fails: Cart1 has 7 items, expected 1. Seven items? Alice only added one apple. Where did the other six come from?

Without tracing, Claude would read the code and might spot the class-level items = [], or it might not. It is a subtle bug, and the code looks correct at first glance: __init__ sets self.customer_name, and add_item appends to self.items. Everything seems fine.

With tracing, the picture is immediately clear. Here are key entries from the trace log (timestamps and file paths omitted for readability):

{"event":"call", "call_id":8, "depth":0, "func":"Cart.__init__", "args":{"self":"<Cart>", "customer_name":"'Bob'"}}
{"event": "call", "call_id":9, "depth":0, "func": "Cart.add_item", "args":{"self": "Cart('Bob', 0 items)", "name":"'Apple'", "price": "1.5", "quantity": "3"}}
{"event": "call", "call_id":10, "depth":0, "func": "Cart.add_item", "args":{"self": "Cart('Bob', 1 items)", "name":"'Bread'", "price": "2.0", "quantity": "1"}}

Bob starts at 0 items (he runs first). Now look at what happens when the next test creates a cart for Charlie:

{"event":"call", "call_id":18, "depth":0, "func":"Cart.__init__", "args":{"self":"<Cart>", "customer_name":"'Charlie'"}}
{"event": "call", "call_id":19, "depth":0, "func": "Cart.add_item", "args":{"self": "Cart('Charlie', 2 items)", "name":"'Laptop'", "price": "1000", "quantity": "1"}}

Stop right there. Charlie’s brand-new cart already has 2 items: Bob’s Apple and Bread from the previous test. By the time the test_independent_carts test runs, the shared items list has accumulated items from every prior test, and get_item_count returns 7.

The bug: items = [] is a class-level attribute, not an instance attribute. Every Cart instance shares the same list. The fix is one line: add self.items = [] inside __init__.

The trace makes this obvious because you can literally watch the item count grow across unrelated cart instances. You don’t need to reason about Python’s class attribute semantics. You see it happening.
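For reference, the corrected class with that one-line fix applied:

```python
class Cart:
    def __init__(self, customer_name):
        self.customer_name = customer_name
        self.items = []  # instance attribute: every cart gets its own list

    def add_item(self, name, price, quantity=1):
        self.items.append({"name": name, "price": price, "quantity": quantity})

    def get_item_count(self):
        return sum(item["quantity"] for item in self.items)

cart1 = Cart("Alice")
cart1.add_item("Apple", 1.00)
cart2 = Cart("Bob")
cart2.add_item("Banana", 2.00)
print(cart1.get_item_count(), cart2.get_item_count())  # 1 1
```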

Another Example: The Silent Swallower

Here is a different kind of bug, one where the code does not crash when it should.

A config loader reads a JSON file, validates it, and returns the config:

def load_config(filepath, defaults=None):
    defaults = defaults or {}
    try:
        config = read_config_file(filepath)
        merged = {**defaults, **config}
        validate_config(merged)
        return merged
    except Exception:
        return defaults

The validate_config function checks that required fields are present and raises ValueError if they are not. The test expects that loading an invalid config (missing api_key and port) will raise:

def test_load_invalid_config_raises():
    path = _write_config({"database": "postgres://localhost/db"})
    with pytest.raises(ValueError, match="Missing required"):
        load_config(path)

The test fails: DID NOT RAISE <class 'ValueError'>. The validation error never reaches the test.

Now look at what the trace reveals (timestamps and file paths omitted):

{"event":"call", "call_id":7, "depth":0, "func":"load_config", "args":{"filepath":"'/tmp/tmp28ib58c7.json'", "defaults":"None"}}
{"event":"call", "call_id":8, "parent_id":7, "depth":1, "func":"read_config_file", "args":{"filepath":"'/tmp/tmp28ib58c7.json'"}}
{"event":"return", "call_id":8, "depth":1, "func":"read_config_file", "return_value":"{'database': 'postgres://localhost/db'}"}
{"event":"call", "call_id":9, "parent_id":7, "depth":1, "func":"validate_config", "args":{"config":"{'database': 'postgres://localhost/db'}"}}
{"event":"exception", "call_id":9, "depth":1, "func":"validate_config", "exc_type":"ValueError", "exc_value":"ValueError("Missing required config fields: ['api_key', 'port']")"}
{"event":"exception", "call_id":7, "depth":0, "func":"load_config", "exc_type":"ValueError", "exc_value":"ValueError("Missing required config fields: ['api_key', 'port']")"}
{"event":"return", "call_id":7, "depth":0, "func":"load_config", "return_value":"{}"}

The trace tells the whole story in seven entries:

  1. read_config_file succeeds and returns the partial config.
  2. validate_config raises a ValueError about missing fields. The tracer captures this even though it never reaches the test.
  3. load_config catches the ValueError (because except Exception catches everything) and returns {}.

The exception was raised, caught, and silently discarded. The trace makes the swallowed exception visible by logging exception events at the point they occur, regardless of whether something higher up catches them.

The fix: change except Exception to except FileNotFoundError so that only missing-file errors are caught, and validation errors propagate up as they should.

Without tracing, Claude has to reason about exception flow across multiple functions and figure out that the broad except clause is eating the ValueError. With tracing, it can see the exception appear and disappear, caught in the act.
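For reference, the corrected load_config, shown with simple stand-ins for read_config_file and validate_config (the required field names are assumed here for illustration):

```python
import json

REQUIRED_FIELDS = ("api_key", "port")  # assumed required fields

def read_config_file(filepath):
    with open(filepath) as f:
        return json.load(f)

def validate_config(config):
    missing = [k for k in REQUIRED_FIELDS if k not in config]
    if missing:
        raise ValueError(f"Missing required config fields: {missing}")

def load_config(filepath, defaults=None):
    defaults = defaults or {}
    try:
        config = read_config_file(filepath)
        merged = {**defaults, **config}
        validate_config(merged)
        return merged
    except FileNotFoundError:  # only missing-file errors fall back to defaults
        return defaults
```

With this version, a missing file still returns the defaults, but a validation ValueError propagates to the caller and the test passes.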

The 7-Phase Workflow

PyTrace Autopsy is packaged as a Claude Code skill, which means Claude follows a structured workflow when it uses tracing. Here is the high-level flow:

  1. Detect: Run the failing tests and parse the output to identify which tests failed and where.
  2. Analyze: Read the test code and the source under test. Identify which modules and functions are worth tracing.
  3. Set up: Create a temp directory, copy in the tracer, write a config targeting the relevant source paths.
  4. Run: Execute pytest with the temp directory prepended to PYTHONPATH. Verify that the tracer is activated.
  5. Analyze the trace: Read the JSONL log. Look for wrong argument values, unexpected return values, swallowed exceptions, missing calls, or state mutation between calls.
  6. Fix and verify: Apply the fix based on trace analysis. Re-run tests without tracing to confirm.
  7. Clean up: Remove the temp directory — no artifacts left behind.

Claude decides when to invoke this workflow. If a test fails and the cause is not obvious from reading the code, or if it has already tried a fix and the test still fails, it activates the tracer to get runtime visibility. It picks the trace targets, configures the filters, and interprets the output, all without human intervention.

What Makes This Approach Powerful

A few design decisions make this work well in practice:

Zero-invasive. The tracer never modifies your project files. No import logging added to your source. No print() statements. No monkey-patching. It lives entirely in a temp directory and is injected via an environment variable. Remove the temp dir, and every trace of the tracer (pun intended) is gone.

Targeted. You can filter by source path, module name, or function name. The default is to trace everything under your project’s source directory while excluding the standard library, site-packages, and test files. If the trace is too noisy, you can narrow it down to specific modules or functions.

Structured. The output is JSONL, one JSON object per line. Each entry has a call_id and parent_id so you can reconstruct the call tree. Each entry has a depth field so you can see nesting at a glance. This is much easier for an AI to parse than free-form log output.

Safe. The entire tracer is wrapped in a top-level try/except block. If anything goes wrong during tracer initialization, it prints a warning to stderr, and your program runs normally. The tracer cannot break your tests. It also caps the log at 10,000 entries by default, so that a runaway loop won’t fill your disk.

Automatic. Claude decides when to trace, what to trace, and how to interpret the results. You don’t need to configure anything manually. If you want to invoke it explicitly, say “trace” or use the /trace skill command.
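The call_id / parent_id structure is what makes automated analysis tractable. A small sketch of reconstructing the call tree from log entries (the format is assumed from the trace excerpts above):

```python
import json

# Three sample entries, mimicking the config-loader trace shown earlier.
log_lines = [
    '{"event":"call","call_id":7,"parent_id":null,"depth":0,"func":"load_config"}',
    '{"event":"call","call_id":8,"parent_id":7,"depth":1,"func":"read_config_file"}',
    '{"event":"call","call_id":9,"parent_id":7,"depth":1,"func":"validate_config"}',
]

children = {}  # parent call_id -> list of child call_ids
names = {}     # call_id -> function name
for line in log_lines:
    entry = json.loads(line)
    if entry["event"] == "call":
        names[entry["call_id"]] = entry["func"]
        children.setdefault(entry.get("parent_id"), []).append(entry["call_id"])

def render(call_id, indent=0):
    lines = [" " * indent + names[call_id]]
    for child in children.get(call_id, []):
        lines.extend(render(child, indent + 2))
    return lines

tree = []
for root in children.get(None, []):  # entries with no parent are roots
    tree.extend(render(root))
print("\n".join(tree))
# load_config
#   read_config_file
#   validate_config
```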

The Bigger Idea

PyTrace Autopsy is a small tool that solves a specific problem, but it points to something larger: AI coding assistants need better instruments, not just better models.

We have spent enormous effort making language models smarter: better at reasoning, better at understanding code, better at generating solutions. And that effort has paid off. But there is a ceiling to what you can accomplish by reasoning about static text alone. Some bugs are only visible at runtime. Some behaviors only emerge when code actually executes. No amount of intelligence can compensate for being unable to see what is happening.

The analogy is medicine. Making a doctor smarter is valuable. But giving that doctor an X-ray machine is transformative. They are not a better doctor because the X-ray is smart; they are a better doctor because the X-ray shows them what they could not see before.

That is what runtime tracing does for AI coding assistants. It doesn’t make Claude smarter. It gives Claude the ability to observe. And observation, as any scientist will tell you, is where understanding begins.

PyTrace Autopsy is open source under the MIT license. You can find the code, examples, and installation instructions at github.com/ApartsinProjects/pytrace-autopsy.

Install it in any project with one command:

curl -L https://raw.githubusercontent.com/ApartsinProjects/pytrace-autopsy/main/install.sh | bash

Contributions welcome.


PyTrace Autopsy: Teaching AI to Debug Like a Human was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
