[P] Distributed training observability for PyTorch
Hi,
I have been building TraceML, an open-source tool for low-overhead observability in distributed PyTorch training, and just pushed an update adding single-node DDP support.
It focuses on making common distributed bottlenecks visible without heavy profilers:

- Step time (median / worst / per-rank)
- Dataloader fetch time
- GPU memory usage
- Rank-aware metrics for DDP
Design goals:

- Drop-in instrumentation (no model rewrite)
- Low overhead (meant to stay enabled)
- Explicit distributed semantics (worst-rank vs averages, see the sketch below)
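To make "worst-rank vs averages" concrete, here is a minimal sketch using plain torch.distributed collectives. It is not TraceML's actual API (the function name here is hypothetical), just an illustration of the two views of per-rank step time:

```python
# Sketch only, assuming torch.distributed is already initialized for DDP.
# Not TraceML's API; it just shows "worst-rank vs average" step-time semantics.
import torch
import torch.distributed as dist

def report_step_time(step_seconds: float, device: str = "cuda") -> None:
    """Gather one training step's wall time from every rank and print both views."""
    world_size = dist.get_world_size()
    local = torch.tensor([step_seconds], device=device)
    gathered = [torch.zeros_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)  # every rank receives every rank's time

    if dist.get_rank() == 0:
        times = [t.item() for t in gathered]
        avg = sum(times) / world_size
        worst_rank = max(range(world_size), key=lambda r: times[r])
        # The average hides stragglers; the slowest rank is what the whole
        # job actually waits on at every synchronization point.
        print(f"step avg={avg * 1e3:.1f} ms, "
              f"worst={times[worst_rank] * 1e3:.1f} ms (rank {worst_rank})")
```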
This ISN’T a replacement for PyTorch Profiler or Nsight.
It is meant as always-on telemetry to answer questions like “which rank is the straggler?” or “are GPUs idle due to dataloader or sync?”
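For context, the dataloader-vs-compute split can be approximated by hand in a training loop, as in the sketch below (hypothetical names, plain PyTorch, not TraceML code); the point of the tool is to surface this kind of breakdown without writing the timers yourself:

```python
# Sketch only: timing dataloader fetch separately from the forward/backward/
# optimizer step in a DDP loop, to see whether GPUs idle on data or on compute/sync.
import time
import torch

def train_epoch(model, loader, optimizer, criterion, device="cuda"):
    data_iter = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            inputs, targets = next(data_iter)   # dataloader fetch
        except StopIteration:
            break
        fetch_s = time.perf_counter() - t0

        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad(set_to_none=True)
        loss = criterion(model(inputs), targets)
        loss.backward()                          # DDP gradient all-reduce overlaps here
        optimizer.step()
        torch.cuda.synchronize()                 # so the timing reflects finished GPU work
        step_s = time.perf_counter() - t0 - fetch_s

        print(f"fetch={fetch_s * 1e3:.1f} ms, step={step_s * 1e3:.1f} ms")
```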
Repo: https://github.com/traceopt-ai/traceml
Demo: https://www.loom.com/share/de274cbfb49e4f24b4d1d2c7f6a12705
Feedback is most welcome, especially from people debugging performance issues in distributed training.