[P] Distributed training observability for PyTorch
Hi,
I have been building TraceML, an open-source tool for low-overhead observability in distributed PyTorch training, and just pushed an update adding single-node DDP support.
It focuses on making common distributed bottlenecks visible without heavy profilers:

- Step time (median / worst / per-rank)
- Dataloader fetch time
- GPU memory usage
- Rank-aware metrics for DDP
Design goals:

- Drop-in instrumentation (no model rewrite)
- Low overhead (meant to stay enabled)
- Explicit distributed semantics (worst-rank vs averages, see the sketch below)
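To make "worst-rank vs averages" concrete, here is a minimal sketch using plain torch.distributed collectives. It is not TraceML's actual API (the function name here is hypothetical), just an illustration of the two views of per-rank step time:

```python
# Sketch only, assuming torch.distributed is already initialized for DDP.
# Not TraceML's API; it just shows "worst-rank vs average" step-time semantics.
import torch
import torch.distributed as dist

def report_step_time(step_seconds: float, device: str = "cuda") -> None:
    """Gather one training step's wall time from every rank and print both views."""
    world_size = dist.get_world_size()
    local = torch.tensor([step_seconds], device=device)
    gathered = [torch.zeros_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)  # every rank receives every rank's time

    if dist.get_rank() == 0:
        times = [t.item() for t in gathered]
        avg = sum(times) / world_size
        worst_rank = max(range(world_size), key=lambda r: times[r])
        # The average hides stragglers; the slowest rank is what the whole
        # job actually waits on at every synchronization point.
        print(f"step avg={avg * 1e3:.1f} ms, "
              f"worst={times[worst_rank] * 1e3:.1f} ms (rank {worst_rank})")
```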
This ISN’T a replacement for PyTorch Profiler or Nsight.
It is meant as always-on telemetry to answer questions like “which rank is the straggler?” or “are GPUs idle due to dataloader or sync?”
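For context, the dataloader-vs-compute split can be approximated by hand in a training loop, as in the sketch below (hypothetical names, plain PyTorch, not TraceML code); the point of the tool is to surface this kind of breakdown without writing the timers yourself:

```python
# Sketch only: timing dataloader fetch separately from the forward/backward/
# optimizer step in a DDP loop, to see whether GPUs idle on data or on compute/sync.
import time
import torch

def train_epoch(model, loader, optimizer, criterion, device="cuda"):
    data_iter = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            inputs, targets = next(data_iter)   # dataloader fetch
        except StopIteration:
            break
        fetch_s = time.perf_counter() - t0

        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad(set_to_none=True)
        loss = criterion(model(inputs), targets)
        loss.backward()                          # DDP gradient all-reduce overlaps here
        optimizer.step()
        torch.cuda.synchronize()                 # so the timing reflects finished GPU work
        step_s = time.perf_counter() - t0 - fetch_s

        print(f"fetch={fetch_s * 1e3:.1f} ms, step={step_s * 1e3:.1f} ms")
```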
Repo: https://github.com/traceopt-ai/traceml
Demo: https://www.loom.com/share/de274cbfb49e4f24b4d1d2c7f6a12705
Feedback is most welcome, especially from people debugging performance issues in distributed training.