Deep Learning for Autonomous Drone Navigation (RGB-D only) – How would you approach this?

Hi everyone,
I’m working on a university project and could really use some advice from people with more experience in autonomous navigation / RL / simulation.

Task:
I need to design a deep learning model that directly controls a drone (x, y, z, pitch, yaw — roll probably doesn’t make much sense here 😅). The drone should autonomously patrol and map indoor and outdoor environments.

Example use case:
A warehouse where the drone automatically flies through all aisles repeatedly, covering the full area with a minimal / near-optimal path, while avoiding obstacles.

Important constraints:

  • The drone does not exist in real life
  • Training and testing must be done in simulation
  • Using existing datasets (e.g. ScanNet) is allowed
  • Only RGB-D data from the drone can be used for navigation (no external maps, no GPS, etc.)

My current idea / approach

I’m thinking about a staged approach:

  1. Procedural environments: generate simple rooms / mazes in Python (basic geometries) to get fast initial results and stable training.
  2. Fine-tuning on realistic data: fine-tune the model on something like ScanNet so it can handle complex indoor scenes (hanging lamps, cables, clutter, etc.).
  3. Policy learning: likely RL or imitation learning, where the model outputs control commands directly from RGB-D input.
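For step 1, here's a rough sketch of the kind of procedural generation I have in mind: a depth-first-backtracking grid maze (1 = wall, 0 = free) that could then be extruded into 3D boxes inside whatever simulator I end up with. Function names, grid layout, and sizes are my own assumptions, not tied to any particular tool:

```python
import random

def generate_maze(width, height, seed=None):
    """Generate a grid maze via depth-first backtracking.
    1 = wall, 0 = free space. Dimensions should be odd."""
    rng = random.Random(seed)
    grid = [[1] * width for _ in range(height)]

    def carve(x, y):
        grid[y][x] = 0
        dirs = [(2, 0), (-2, 0), (0, 2), (0, -2)]
        rng.shuffle(dirs)
        for dx, dy in dirs:
            nx, ny = x + dx, y + dy
            # Stay inside the outer wall and only carve into unvisited cells
            if 0 < nx < width - 1 and 0 < ny < height - 1 and grid[ny][nx] == 1:
                grid[(y + ny) // 2][(x + nx) // 2] = 0  # knock down wall between
                carve(nx, ny)

    carve(1, 1)
    return grid

maze = generate_maze(21, 21, seed=0)
```

Per episode I'd then randomize the seed, wall textures, and lighting so the policy doesn't overfit to one layout.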

One thing I’m unsure about:
In simulation you can’t model everything (e.g. a bird flying into the drone). How is this usually handled? Just ignore rare edge cases and focus on static / semi-static obstacles?

Simulation tools – what should I use?

This is where I’m most confused right now:

  • AirSim – seems discontinued
  • Colosseum (AirSim successor) – heard there are stability / maintenance issues
    • Pros: great graphics, RGB-D + LiDAR support
  • Gazebo + PX4
    • Unsure about RGB-D data quality and availability
    • Graphics seem quite poor → not sure if that hurts learning
  • Pegasus Simulator
    • Looks promising, but I don’t know if it fully supports what I need (RGB-D streams, flexible environments, DL training loop, etc.)

What I care most about:

  • Real-time RGB-D camera access
  • Decent visual realism
  • Ability to easily generate multiple environments
  • Reasonable integration with Python / PyTorch
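On the Python / PyTorch point: whichever simulator it is, the interface I ultimately need boils down to "RGB-D frame in, 5-dim control command out". A minimal sketch of that policy head (the 84×84 input size, Atari-style CNN, and the (vx, vy, vz, pitch_rate, yaw_rate) action layout are all my assumptions):

```python
import torch
import torch.nn as nn

class RGBDPolicy(nn.Module):
    """Maps a 4-channel RGB-D frame to a 5-dim control command
    (vx, vy, vz, pitch_rate, yaw_rate) -- an assumed action layout."""
    def __init__(self, action_dim=5):
        super().__init__()
        self.encoder = nn.Sequential(            # small Atari-style CNN
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():                    # infer flattened size for 84x84 input
            n_flat = self.encoder(torch.zeros(1, 4, 84, 84)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(n_flat, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # normalized actions in [-1, 1]
        )

    def forward(self, rgbd):                     # rgbd: (batch, 4, 84, 84)
        return self.head(self.encoder(rgbd))

policy = RGBDPolicy()
action = policy(torch.zeros(1, 4, 84, 84))
```

If a simulator can't hand me frames as arrays/tensors with low latency, everything downstream gets painful, hence the "real-time RGB-D access" requirement above.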

Main questions

  • How would you structure the learning problem? (Exploration vs. patrolling, reward design, intermediate representations, etc.)
  • What exactly would you train the model on? Do I need to create several TB of Unreal scenes for training? And how do I validate my model(s) properly?
  • Which simulator would you recommend in 2025/2026 for this kind of project?
  • Do I need ROS/ROS2?
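On the reward-design question, the pattern I keep running into for coverage/patrol tasks is an agent-internal visitation grid (built from the drone's own odometry/depth, so it doesn't violate the "no external maps" constraint) that pays a bonus for newly visited cells, a small step cost to push toward short paths, and a collision penalty. A rough sketch, with all weights and names being my own guesses rather than established values:

```python
import math

class CoverageReward:
    """Rewards visiting new cells of an agent-internal 3D grid.
    cell_size and the weights below are tuning assumptions."""
    def __init__(self, cell_size=1.0, new_cell_bonus=1.0,
                 step_cost=0.01, collision_penalty=10.0):
        self.cell_size = cell_size
        self.new_cell_bonus = new_cell_bonus
        self.step_cost = step_cost
        self.collision_penalty = collision_penalty
        self.visited = set()

    def __call__(self, position, collided=False):
        # Discretize the (x, y, z) position into a grid cell
        cell = tuple(math.floor(c / self.cell_size) for c in position)
        reward = -self.step_cost          # mild pressure toward short paths
        if cell not in self.visited:      # bonus only the first time
            self.visited.add(cell)
            reward += self.new_cell_bonus
        if collided:
            reward -= self.collision_penalty
        return reward
```

For the repeat-patrol variant, one option would be to decay `visited` entries over time so revisiting stale cells becomes rewarding again — but I'd love to hear how people actually structure this.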

Any insights or “don’t do this” advice would be massively appreciated 🙏
Thanks in advance!

submitted by /u/Glittering_Copy6914