pipeline is really slow – consulting [D]

digitado ⋅ May 23, 2026

Hi, after a long debugging process and many discussions, I wanted to ask for advice from people who may have encountered similar training bottlenecks.

My goal is imitation learning for robotics.

Model / Pipeline

Observation space:
- 4 RGB robot cameras
- image resolution: 128x128x3
- small vector of robot joint velocities (14 dims)
Pipeline:
- Shared ResNet18 encoder processes each image
- Each image embedding dimension is 128
- Final input to policy:
  - 4 * 128 image embedding
  - concatenated with 14-dim state vector
Policy backbone:
- DiT (Diffusion Transformer)
- ~8 layers
- hidden dim: 512
- 8 attention heads
- total params: ~50M
Diffusion setup:
- predict action chunks of length ~50
- diffusion timesteps: 4

Dataset / Storage

Dataset stored in Zarr
Data access is indexed/reference-based (not loading huge chunks into RAM)
train/val split is contiguous
no shuffling

Current encoder setup

Initially trained end-to-end
During debugging I switched to ImageNet pretrained ResNet18
Encoder is currently frozen
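
By "frozen" I mean roughly the following. This is a sketch with a tiny stand-in encoder so it runs quickly; the `freeze` helper and the `policy_head` layer are hypothetical names, not part of my codebase.

```python
import torch
import torch.nn as nn


def freeze(module: nn.Module) -> nn.Module:
    """Disable gradients and switch to eval mode (fixes BatchNorm statistics)."""
    module.requires_grad_(False)
    module.eval()
    return module


# Tiny stand-in for the ResNet18 encoder, just to keep the sketch fast.
encoder = freeze(nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
))
policy_head = nn.Linear(8, 4)  # hypothetical trainable part

images = torch.randn(2, 3, 32, 32)
with torch.no_grad():            # no autograd graph is built for the encoder
    features = encoder(images)
loss = policy_head(features).sum()
loss.backward()                  # gradients flow only into policy_head
```

One thing to keep in mind: calling `.train()` on a parent module puts the encoder back into training mode, so the `eval()` call may need to be repeated after that.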

Hardware / Software

GPU: NVIDIA A4500
RAM: 48GB
Storage: SSD
CUDA: 12.8
PyTorch: 2.9
Precision: bf16 mixed precision (also tested fp32)

Dataloader

batch size: 2
8 persistent workers
pinned memory enabled
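
In code, the loader configuration looks roughly like this. It is a sketch under assumptions: `SyntheticRobotDataset` and `make_loader` are illustrative names, the sample layout (uint8 images plus a float state vector) is simplified, and the in-memory dataset stands in for the Zarr-backed one.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class SyntheticRobotDataset(Dataset):
    """In-memory stand-in: 4 uint8 camera images and a 14-dim state per sample."""

    def __init__(self, length: int = 64):
        self.images = torch.randint(0, 256, (length, 4, 3, 128, 128), dtype=torch.uint8)
        self.state = torch.randn(length, 14)

    def __len__(self) -> int:
        return self.images.shape[0]

    def __getitem__(self, idx: int) -> dict:
        return {"images": self.images[idx], "state": self.state[idx]}


def make_loader(dataset: Dataset, batch_size: int = 2, num_workers: int = 8) -> DataLoader:
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,                          # no shuffling, as in my setup
        num_workers=num_workers,
        persistent_workers=num_workers > 0,     # only valid with num_workers > 0
        pin_memory=torch.cuda.is_available(),   # pinning only helps with a GPU
    )
```

Passing `num_workers=0` runs loading in the main process, which is a quick way to check whether the worker processes are helping at all.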

Preprocessing

preprocessing is minimal
normalization + float conversion only
preprocessing happens inside the multimodal encoder on GPU
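
The preprocessing amounts to something like the sketch below. The ImageNet mean/std values are an assumption (they match the pretrained ResNet18, but the exact statistics depend on the encoder variant), and `preprocess` is an illustrative name.

```python
import torch

# Assumption: ImageNet statistics, shaped to broadcast over (B, cams, 3, H, W).
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 1, 3, 1, 1)


def preprocess(images_uint8: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Move uint8 images to the device first, then convert and normalize there."""
    # Transferring uint8 moves 4x fewer bytes than transferring float32.
    x = images_uint8.to(device, non_blocking=True).float().div_(255.0)
    return (x - IMAGENET_MEAN.to(device)) / IMAGENET_STD.to(device)
```

Keeping the batch as uint8 until it reaches the GPU is the reason the dataloader itself only does indexing.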

Profiler results (PyTorch profiler)
Current workload split:

train_dataloader_next:
- 4.41s / 41.84s = 10.5%
batch_to_device:
- 0.32s / 41.84s = 0.77%
training_step:
- 12.78s / 41.84s = 30.5%
backward:
- 10.83s / 41.84s = 25.9%
optimizer_step (wrapper total):
- 26.09s / 41.84s = 62.4%
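
A quick sanity check on these numbers: the shares add up to more than 100%, so the entries cannot all be disjoint. The sketch below assumes (as the "wrapper total" label suggests, but I have not verified) that optimizer_step already contains training_step and backward.

```python
TOTAL_S = 41.84  # total profiled wall-clock time in seconds

timings_s = {
    "train_dataloader_next": 4.41,
    "batch_to_device": 0.32,
    "training_step": 12.78,
    "backward": 10.83,
    "optimizer_step": 26.09,
}

# Share of total wall-clock time for each profiled action.
shares = {name: 100.0 * t / TOTAL_S for name, t in timings_s.items()}
naive_sum = sum(shares.values())  # exceeds 100%, so some entries overlap

# Assumption: optimizer_step wraps training_step and backward.
nested_s = timings_s["training_step"] + timings_s["backward"]
optimizer_only_s = timings_s["optimizer_step"] - nested_s

# Time covered by the non-overlapping top-level actions, and what is left over.
top_level_s = (timings_s["train_dataloader_next"]
               + timings_s["batch_to_device"]
               + timings_s["optimizer_step"])
unaccounted_s = TOTAL_S - top_level_s

for name, pct in shares.items():
    print(f"{name:24s} {pct:5.1f}%")
print(f"naive sum of shares      {naive_sum:5.1f}%")
print(f"optimizer update alone   {optimizer_only_s:5.2f}s")
print(f"unaccounted wall-clock   {unaccounted_s:5.2f}s")
```

Under that assumption, roughly a quarter of the wall-clock time falls outside every listed action, which may be worth tracking down separately.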

Problem
The training is much slower than I expected.

Current behavior:

CPU utilization: ~100%
GPU utilization: ~20–30%
GPU utilization can even become LOWER with synthetic data
VRAM usage is relatively low
Throughput is around 10 iterations/sec
Epoch of ~50k samples takes around 30 minutes
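
To show how these figures relate, here is the arithmetic with the batch size of 2 listed in the dataloader settings. All inputs are the approximate values quoted in this post, so the results are rough.

```python
SAMPLES_PER_EPOCH = 50_000
BATCH_SIZE = 2
ITERS_PER_SEC = 10.0
EPOCH_MINUTES_OBSERVED = 30.0

iters_per_epoch = SAMPLES_PER_EPOCH // BATCH_SIZE                  # optimizer steps per epoch
epoch_minutes_at_10 = iters_per_epoch / ITERS_PER_SEC / 60.0        # epoch time at 10 it/s
implied_iters_per_sec = iters_per_epoch / (EPOCH_MINUTES_OBSERVED * 60.0)
samples_per_sec = ITERS_PER_SEC * BATCH_SIZE
ms_per_iter = 1000.0 / ITERS_PER_SEC

print(f"iterations per epoch:      {iters_per_epoch}")
print(f"epoch time at 10 it/s:     {epoch_minutes_at_10:.1f} min")
print(f"it/s implied by 30 min:    {implied_iters_per_sec:.1f}")
print(f"samples per second:        {samples_per_sec:.0f}")
print(f"time budget per iteration: {ms_per_iter:.0f} ms")
```

So the two figures are consistent only to within the rounding of "around": 10 it/s gives about 42 minutes per epoch, while 30 minutes corresponds to about 14 it/s. Either way the pipeline handles only a few tens of samples per second.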

Additional observations

Increasing batch size does NOT reduce epoch wall-clock time
Sometimes larger batches make things slower
Freezing the encoder did not improve throughput much
Replacing dataset samples with synthetic/random tensors improved throughput by only ~50%
Synthetic dataset was initialized directly in memory

I do not believe this setup should be this slow. At this rate, training takes multiple days.

For comparison, I have seen papers with somewhat similar architectures report ~10 hour training times on an RTX 4090. With my setup, 10 hours is nowhere near enough.

Does anyone see something obviously wrong or have suggestions for where I should investigate next?

Please help, I don't know what to do next!

submitted by /u/Potential_Hippo1724