Training pipeline is really slow – looking for advice [D]

Hi, after a long debugging process and many discussions, I wanted to ask for advice from people who may have encountered similar training bottlenecks.

My goal is imitation learning for robotics.

Model / Pipeline

  • Observation space:
    • 4 RGB robot cameras
    • image resolution: 128x128x3
    • small vector of robot joint velocities (14 dims)
  • Pipeline:
    • Shared ResNet18 encoder processes each image
    • Each image embedding dimension is 128
    • Final input to policy:
      • 4 * 128 image embedding
      • concatenated with 14-dim state vector
  • Policy backbone:
    • DiT (Diffusion Transformer)
    • ~8 layers
    • hidden dim: 512
    • 8 attention heads
    • total params: ~50M
  • Diffusion setup:
    • predict action chunks of length ~50
    • diffusion timesteps: 4
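As a quick sanity check on the shapes listed above, here is the width of the vector fed to the policy, assuming the four image embeddings and the state vector are simply concatenated as described. The constant and function names are illustrative and not taken from the actual code.

```python
# Sizes taken from the description above; names are illustrative only.
NUM_CAMERAS = 4
EMBED_DIM_PER_IMAGE = 128
STATE_DIM = 14


def policy_input_dim(num_cameras, embed_dim, state_dim):
    """Width of the concatenated [image embeddings | robot state] vector."""
    return num_cameras * embed_dim + state_dim


cond_dim = policy_input_dim(NUM_CAMERAS, EMBED_DIM_PER_IMAGE, STATE_DIM)
print(cond_dim)  # 526
```

So the conditioning input is 526 values wide, only slightly larger than the DiT hidden dim of 512.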

Dataset / Storage

  • Dataset stored in Zarr
  • Data access is indexed/reference-based (not loading huge chunks into RAM)
  • train/val split is contiguous
  • no shuffling

Current encoder setup

  • Initially trained end-to-end
  • During debugging I switched to ImageNet pretrained ResNet18
  • Encoder is currently frozen

Hardware / Software

  • GPU: NVIDIA A4500
  • RAM: 48GB
  • Storage: SSD
  • CUDA: 12.8
  • PyTorch: 2.9
  • Precision: bf16 mixed precision (also tested fp32)

Dataloader

  • batch size: 2
  • 8 persistent workers
  • pinned memory enabled

Preprocessing

  • preprocessing is minimal
  • normalization + float conversion only
  • preprocessing happens inside the multimodal encoder on GPU

Profiler results (PyTorch profiler)

Current workload split:

  • train_dataloader_next:
    • 4.41s / 41.84s = 10.5%
  • batch_to_device:
    • 0.32s / 41.84s = 0.77%
  • training_step:
    • 12.78s / 41.84s = 30.5%
  • backward:
    • 10.83s / 41.84s = 25.9%
  • optimizer_step (wrapper total):
    • 26.09s / 41.84s = 62.4%
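The percentages above add up to more than 100%, which suggests the entries overlap. A small sketch of the arithmetic, under the assumption (not confirmed by the profiler output shown) that optimizer_step is a wrapper that already contains training_step and backward:

```python
TOTAL_S = 41.84  # total profiled wall-clock time from the list above

# Seconds per profiler entry, copied from the list above.
timings = {
    "train_dataloader_next": 4.41,
    "batch_to_device": 0.32,
    "training_step": 12.78,
    "backward": 10.83,
    "optimizer_step": 26.09,
}


def share(seconds, total=TOTAL_S):
    """Percentage of the total profiled time, rounded to one decimal."""
    return round(100.0 * seconds / total, 1)


shares = {name: share(sec) for name, sec in timings.items()}

# ASSUMPTION: optimizer_step already includes training_step and backward,
# so those two are subtracted from it rather than added on top.
inside_optimizer_other = round(
    timings["optimizer_step"] - timings["training_step"] - timings["backward"], 2
)

# Time not covered by the three non-overlapping top-level entries.
top_level = (
    timings["train_dataloader_next"]
    + timings["batch_to_device"]
    + timings["optimizer_step"]
)
unattributed = round(TOTAL_S - top_level, 2)

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print("optimizer_step minus forward/backward:", inside_optimizer_other, "s")
print("not covered by listed entries:", unattributed, "s")
```

Under that assumption, about 2.48 s of optimizer_step lies outside the forward and backward passes, and roughly 11 s (about 26%) of the 41.84 s is not attributed to any listed entry.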

Problem

Training is much slower than I expected.

Current behavior:

  • CPU utilization: ~100%
  • GPU utilization: ~20–30%
  • GPU utilization can even become LOWER with synthetic data
  • VRAM usage is relatively low
  • Throughput is around 10 iterations/sec
  • Epoch of ~50k samples takes around 30 minutes
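For reference, the throughput figures can be cross-checked against each other. This sketch assumes the rate of 10 iterations/sec was measured at the batch size of 2 listed in the Dataloader section, and that one "sample" is one dataset item:

```python
import math

ITERS_PER_SEC = 10          # reported throughput
BATCH_SIZE = 2              # from the Dataloader section
SAMPLES_PER_EPOCH = 50_000  # approximate epoch size


def epoch_seconds(samples, batch_size, iters_per_sec):
    """Wall-clock seconds for one epoch at a steady iteration rate."""
    iterations = math.ceil(samples / batch_size)
    return iterations / iters_per_sec


secs = epoch_seconds(SAMPLES_PER_EPOCH, BATCH_SIZE, ITERS_PER_SEC)
ms_per_iter = 1000 / ITERS_PER_SEC
samples_per_sec = ITERS_PER_SEC * BATCH_SIZE
print(f"{samples_per_sec} samples/s, {ms_per_iter:.0f} ms per iteration, "
      f"{secs / 60:.1f} min per epoch")
```

These numbers give about 42 minutes per epoch rather than 30, so at least one of the reported figures is approximate or was measured with a different batch size.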

Additional observations

  • Increasing batch size does NOT reduce epoch wall-clock time
  • Sometimes larger batches make things slower
  • Freezing the encoder did not improve throughput much
  • Replacing dataset samples with synthetic/random tensors improved throughput by only ~50%
  • Synthetic dataset was initialized directly in memory
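To compare the real and synthetic datasets independently of the model, one option is to time the loader by itself. A minimal sketch using only the standard library; `batches_per_second` is a hypothetical helper, and it accepts any iterable, so a PyTorch DataLoader could be passed in directly:

```python
import time

_DONE = object()  # sentinel marking an exhausted iterator


def batches_per_second(loader, max_batches=200, warmup=10):
    """Iterate `loader` with no model work and return batches per second.

    The first `warmup` batches are consumed but not timed, so worker
    start-up cost does not distort the measurement.
    """
    it = iter(loader)
    for _ in range(warmup):
        if next(it, _DONE) is _DONE:
            return 0.0  # loader ran out during warm-up
    counted = 0
    start = time.perf_counter()
    for _ in it:
        counted += 1
        if counted >= max_batches:
            break
    elapsed = time.perf_counter() - start
    if counted == 0:
        return 0.0
    return counted / elapsed if elapsed > 0 else float("inf")


# Stand-in iterable; in practice this would be the training DataLoader.
rate = batches_per_second(range(1000), max_batches=100, warmup=5)
print(f"{rate:.0f} batches/s")
```

If the loader alone runs far above 10 batches/sec on both datasets, data loading is unlikely to be the limiting factor, which would be consistent with synthetic data giving only a ~50% improvement.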

I do not believe this setup should be this slow. At this rate, training takes multiple days.

For comparison, I have seen papers with somewhat similar architectures reporting ~10-hour training times on an RTX 4090. With my setup, 10 hours is nowhere near enough.

Does anyone see something obviously wrong or have suggestions for where I should investigate next?

Please help, I don't know what to do!

submitted by /u/Potential_Hippo1724