Training pipeline is really slow: looking for advice [D]
Hi, after a long debugging process and many discussions, I wanted to ask for advice from people who may have encountered similar training bottlenecks.
My goal is imitation learning for robotics.
Model / Pipeline
- Observation space:
  - 4 RGB robot cameras
  - image resolution: 128x128x3
  - small vector of robot joint velocities (14 dims)
- Pipeline:
  - Shared ResNet18 encoder processes each image
  - Each image embedding dimension is 128
  - Final input to policy:
    - 4 * 128 image embedding
    - concatenated with 14-dim state vector
- Policy backbone:
  - DiT (Diffusion Transformer)
  - ~8 layers
  - hidden dim: 512
  - 8 attention heads
  - total params: ~50M
- Diffusion setup:
  - predict action chunks of length ~50
  - diffusion timesteps: 4
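For concreteness, here is the shape arithmetic implied by the list above. It is only a sketch: the variable names are mine, it assumes the four image embeddings are simply concatenated with the state vector, and the byte count assumes images are stored as uint8 (the batch size of 2 comes from the Dataloader section).

```python
# Shape arithmetic for the policy input described above.
NUM_CAMERAS = 4
IMAGE_SHAPE = (128, 128, 3)  # height, width, channels per camera
EMBED_DIM = 128              # per-image embedding from the shared ResNet18
STATE_DIM = 14               # robot joint velocities
BATCH_SIZE = 2               # from the Dataloader section

# Conditioning vector: 4 image embeddings concatenated with the state vector.
cond_dim = NUM_CAMERAS * EMBED_DIM + STATE_DIM

# Raw pixel values per sample across all cameras.
height, width, channels = IMAGE_SHAPE
pixels_per_sample = NUM_CAMERAS * height * width * channels

# ASSUMPTION: images are stored as uint8, i.e. one byte per value.
bytes_per_batch_uint8 = BATCH_SIZE * pixels_per_sample

print(cond_dim)               # 526
print(pixels_per_sample)      # 196608
print(bytes_per_batch_uint8)  # 393216
```

Under those assumptions a batch carries well under 1 MB of image data, so the amount of data moved per step is small.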
Dataset / Storage
- Dataset stored in Zarr
- Data access is indexed/reference-based (not loading huge chunks into RAM)
- train/val split is contiguous
- no shuffling
Current encoder setup
- Initially trained end-to-end
- During debugging I switched to ImageNet pretrained ResNet18
- Encoder is currently frozen
Hardware / Software
- GPU: NVIDIA A4500
- RAM: 48GB
- Storage: SSD
- CUDA: 12.8
- PyTorch: 2.9
- Precision: bf16 mixed precision (also tested fp32)
Dataloader
- batch size: 2
- 8 persistent workers
- pinned memory enabled
Preprocessing
- preprocessing is minimal
- normalization + float conversion only
- preprocessing happens inside the multimodal encoder on GPU
Profiler results (PyTorch profiler)
Current workload split:
- train_dataloader_next: 4.41s / 41.84s = 10.5%
- batch_to_device: 0.32s / 41.84s = 0.77%
- training_step: 12.78s = 30.5%
- backward: 10.83s = 25.9%
- optimizer_step (wrapper total): 26.09s = 62.4%
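The listed shares add up to more than 100%, so some entries must overlap. The sketch below redoes the arithmetic under one explicit assumption of mine: that the optimizer_step "wrapper total" already includes training_step and backward. If the profiler nests these differently, the derived numbers do not apply.

```python
# Timings copied from the profiler output above (seconds).
total = 41.84
timings = {
    "train_dataloader_next": 4.41,
    "batch_to_device": 0.32,
    "training_step": 12.78,
    "backward": 10.83,
    "optimizer_step": 26.09,  # reported as "wrapper total"
}

# Share of total wall-clock time for each entry, in percent.
shares = {name: round(100 * t / total, 1) for name, t in timings.items()}

# ASSUMPTION: the optimizer_step wrapper encloses training_step and backward,
# so the optimizer update itself is what remains after subtracting them.
optimizer_only = round(
    timings["optimizer_step"] - timings["training_step"] - timings["backward"], 2
)

# Time not covered by data loading, host-to-device copy, or the wrapper.
unaccounted = round(
    total
    - timings["train_dataloader_next"]
    - timings["batch_to_device"]
    - timings["optimizer_step"],
    2,
)

print(shares)
print(optimizer_only)  # 2.48
print(unaccounted)     # 11.02
```

Under that assumption, roughly 2.5s is the optimizer update itself and about 11s of the 41.84s falls outside the entries listed here.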
Problem
The training is much slower than I expected.
Current behavior:
- CPU utilization: ~100%
- GPU utilization: ~20–30%
- GPU utilization can even become LOWER with synthetic data
- VRAM usage is relatively low
- Throughput is around 10 iterations/sec
- Epoch of ~50k samples takes around 30 minutes
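As a sanity check, the throughput and epoch-time figures can be related with a few lines of arithmetic. This uses only the approximate numbers quoted in this post (batch size 2, ~10 iterations/sec, ~50k samples, ~30 minutes); the variable names are mine.

```python
samples_per_epoch = 50_000
batch_size = 2
iters_per_sec = 10.0

iters_per_epoch = samples_per_epoch // batch_size     # 25000 iterations
epoch_minutes = iters_per_epoch / iters_per_sec / 60  # about 41.7 minutes
ms_per_iter = 1000 / iters_per_sec                    # 100 ms per iteration
samples_per_sec = batch_size * iters_per_sec          # 20 samples per second

# Iteration rate that a 30-minute epoch would imply instead.
implied_iters_per_sec = iters_per_epoch / (30 * 60)   # about 13.9

print(iters_per_epoch, round(epoch_minutes, 1), ms_per_iter, samples_per_sec)
print(round(implied_iters_per_sec, 1))
```

Both quoted figures are approximate, so 10 to 14 iterations/sec is the realistic range, which works out to roughly 70 to 100 ms per iteration for a batch of 2.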
Additional observations
- Increasing batch size does NOT reduce epoch wall-clock time
- Sometimes larger batches make things slower
- Freezing the encoder did not improve throughput much
- Replacing dataset samples with synthetic/random tensors improved throughput by only ~50%
- Synthetic dataset was initialized directly in memory
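To separate "waiting for the next batch" from "running the step" in a check like the synthetic-data one, a plain timing loop is one option. The sketch below is framework-agnostic and uses hypothetical stand-ins (`fake_loader`, a no-op step) rather than my actual code. One caveat: CUDA work is launched asynchronously, so a real `step_fn` would need to synchronize the device before returning for the split to be meaningful.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def split_wait_and_step(
    batches: Iterable, step_fn: Callable[[object], None]
) -> Tuple[float, float, int]:
    """Return (seconds spent waiting for batches, seconds inside step_fn, batch count)."""
    wait_s = 0.0
    step_s = 0.0
    count = 0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is input-side latency
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)        # for GPU work, synchronize inside step_fn
        t2 = time.perf_counter()
        wait_s += t1 - t0
        step_s += t2 - t1
        count += 1
    return wait_s, step_s, count


def fake_loader(n: int, delay: float) -> Iterator[int]:
    """Hypothetical stand-in for a slow data loader."""
    for i in range(n):
        time.sleep(delay)
        yield i


# Hypothetical usage: a slow loader feeding a step that does nothing.
wait_s, step_s, count = split_wait_and_step(fake_loader(5, 0.01), lambda b: None)
print(count, wait_s > step_s)
```

If the wait share stays small with both real and synthetic data, the bottleneck is more likely in the step itself or in per-iteration overhead than in reading from Zarr.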
I do not believe this setup should be this slow. At this rate, training takes multiple days.
For comparison, I saw papers with somewhat similar architectures mentioning ~10 hour training times on an RTX 4090. With my setup, 10 hours is nowhere near enough.
Does anyone see something obviously wrong or have suggestions for where I should investigate next?
Please help, I don't know what to do!
submitted by /u/Potential_Hippo1724