From Prompts to Flight Logs: How LLM Agents Can Run a Drone Testing Pipeline

If you’ve ever worked on drones (or any autonomous system), you know the dirty secret:

Simulation testing is where progress goes to die.

Not because simulation is bad — it’s essential. But because the workflow is brutal: you brainstorm scenarios, configure the environment, translate everything into the simulator’s formats, craft mission plans for the flight stack, run the sim, then spend hours digging through logs and plots trying to explain why the system behaved the way it did.

In our ICSE 2025 work, we asked a simple question:

What if we automated the entire simulation testing loop using LLM agents — not just one model, but a small “AI test team” that collaborates like humans do?

That’s the idea behind AUTOSIMTEST: a multi-agent LLM framework that helps developers generate scenarios, configure simulators, produce missions, and analyze flight logs — end to end.

The problem: sUAS testing is manual, slow, and doesn’t scale

Small uncrewed aerial systems (sUAS) need to behave correctly across messy real-world conditions: wind, fog, urban canyons, hilly terrain, diverse missions like tracking, surveillance, delivery, and more.

But today, even with mature simulators, developers still do most steps by hand:

  • Identify scenarios worth testing (often relying on domain experience)
  • Configure the simulator (terrain, weather, obstacles, sensor noise)
  • Translate the mission into the specific input language required by the System-under-Test (SuT)
  • Run simulations
  • Analyze logs with thousands of parameters (and interpret plots manually)

These labor-intensive steps limit how broadly and how frequently teams can actually test.

So we built a framework where LLM agents collaborate to remove that bottleneck.

The key idea: don’t use one LLM — use specialized agents

Instead of asking one model to do everything, AUTOSIMTEST splits the pipeline into three phases and assigns specialist agents to each part.

Figure: Overview of the AUTOSIMTEST framework and its three main phases: scenario blueprint construction with manual validation and feedback (blue), scenario specification, validation, and execution (green), and scenario analysis (yellow).

Phase 1: Scenario blueprint construction

A developer provides a natural-language goal (from very specific to broad), and the Scenario-Gen agent (S-Agent) generates a scenario blueprint: environment, mission objectives, and safety properties to test.

S-Agent can use Retrieval-Augmented Generation (RAG) so it isn’t just “making up” scenarios from generic knowledge — it can pull from a knowledge base of sUAS incidents and domain info to craft more realistic, context-specific tests.
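
To make the RAG grounding concrete, here is a minimal sketch of what an S-Agent-style retrieval-plus-prompt step could look like. The `embed` and `llm_complete` callables and the incident corpus are placeholders for illustration, not AUTOSIMTEST's actual implementation.

```python
# Minimal sketch of RAG-grounded scenario generation (illustrative only).
# `embed`, `llm_complete`, and the incident knowledge base are hypothetical
# stand-ins for whatever embedding model, chat LLM, and corpus you use.
import numpy as np

def retrieve_incidents(goal: str, kb_texts, kb_vectors, embed, k: int = 3):
    """Return the k incident reports most similar to the developer's goal."""
    q = embed(goal)
    sims = kb_vectors @ q / (np.linalg.norm(kb_vectors, axis=1) * np.linalg.norm(q))
    return [kb_texts[i] for i in np.argsort(sims)[::-1][:k]]

def scenario_blueprint(goal: str, kb_texts, kb_vectors, embed, llm_complete) -> str:
    """S-Agent-style prompt: goal + retrieved incidents -> blueprint JSON."""
    context = "\n\n".join(retrieve_incidents(goal, kb_texts, kb_vectors, embed))
    prompt = (
        "You are a UAV test designer. Using the incident reports below, "
        "produce a scenario blueprint as JSON with keys "
        "'environment', 'mission_objectives', 'safety_properties'.\n\n"
        f"Incident reports:\n{context}\n\nTesting goal: {goal}"
    )
    return llm_complete(prompt)
```

Grounding the prompt in retrieved incidents is what pushes the agent toward realistic, context-specific scenarios instead of generic "windy day" boilerplate.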

Phase 2: Scenario specification and execution

Now we need to translate that blueprint into executable artifacts:

  • Env-Agent turns the environmental part into simulator setup/config
  • M-Agent turns the mission part into SuT-compatible mission scripts
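
As a rough illustration of what these translations produce, here is a sketch that maps a blueprint dict into a simulator config file and a simple waypoint mission. All field names (e.g. `wind_speed_mps`, the waypoint schema) are assumptions for illustration; the real Env-Agent and M-Agent emit whatever formats your simulator and SuT actually expect.

```python
# Illustrative sketch: turning a blueprint into executable artifacts.
# Field names and file layouts are assumptions, not AUTOSIMTEST's formats.
import json

def env_agent(blueprint: dict, path: str = "sim_env.json") -> None:
    """Env-Agent role: environment section -> simulator config file."""
    env = blueprint["environment"]
    config = {
        "terrain": env.get("terrain", "urban"),
        "wind_speed_mps": env.get("wind_speed_mps", 0.0),
        "fog_density": env.get("fog_density", 0.0),
        "time_of_day": env.get("time_of_day", "noon"),
    }
    with open(path, "w") as f:
        json.dump(config, f, indent=2)

def m_agent(blueprint: dict, path: str = "mission.json") -> None:
    """M-Agent role: mission objectives -> SuT-compatible mission script."""
    waypoints = [
        {"lat": wp["lat"], "lon": wp["lon"], "alt_m": wp["alt_m"],
         "action": wp.get("action", "fly_to")}
        for wp in blueprint["mission_objectives"]["waypoints"]
    ]
    with open(path, "w") as f:
        json.dump({"cruise_speed_mps": 5.0, "waypoints": waypoints}, f, indent=2)
```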

Because LLMs can hallucinate, we integrate rule-based validators that check the generated artifacts and provide automated feedback loops.
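
Here is a minimal sketch of what such a validator-plus-feedback loop can look like, assuming the mission format from the previous snippet; the specific rules and the `generate` callable are illustrative, not the framework's actual checks.

```python
# Minimal sketch of a rule-based validator with an LLM feedback loop.
# The checks and the `generate` callable are illustrative assumptions.
def validate_mission(mission: dict) -> list[str]:
    """Return a list of rule violations (empty list means the artifact passes)."""
    errors = []
    for i, wp in enumerate(mission.get("waypoints", [])):
        if not (-90 <= wp["lat"] <= 90 and -180 <= wp["lon"] <= 180):
            errors.append(f"waypoint {i}: lat/lon out of range")
        if wp["alt_m"] <= 0 or wp["alt_m"] > 120:  # e.g., a regulatory ceiling
            errors.append(f"waypoint {i}: altitude {wp['alt_m']} m outside (0, 120]")
    if "cruise_speed_mps" not in mission:
        errors.append("missing cruise_speed_mps")
    return errors

def generate_with_feedback(generate, max_rounds: int = 3) -> dict:
    """Ask the agent to regenerate until the validator passes (or give up)."""
    feedback = ""
    for _ in range(max_rounds):
        mission = generate(feedback)   # LLM call; signature is hypothetical
        errors = validate_mission(mission)
        if not errors:
            return mission
        feedback = "Fix these issues: " + "; ".join(errors)
    raise RuntimeError("mission still invalid after retries: " + "; ".join(errors))
```

The point of the loop is that hallucinated or out-of-bounds artifacts never reach the simulator: they bounce back to the agent with concrete, machine-generated feedback.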

Phase 3: Scenario analysis

This is where teams lose days.

So we built an Analytics-Agent that supports two modes:

  • Automated Mode: analyze logs based on test properties generated earlier
  • Interactive Mode: let developers ask new questions and explore logs with AI assistance

It does semantic search over a knowledge base to identify which flight parameters matter (velocity, pitch, GPS, altitude, etc.), auto-plots the corresponding time series, then uses a vision-capable LLM to interpret the plot images and generate a report. We have since improved the Analytics-Agent; I'll cover it in a separate post soon.
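
For intuition, here is a stripped-down sketch of the automated mode: select the log channels relevant to a test property, plot them, and hand the rendered image to a vision-capable model. The CSV log layout, the column names, and the `vision_llm` helper are assumptions for illustration, not the actual Analytics-Agent.

```python
# Sketch of the automated analysis step: pick relevant log channels, plot
# them, and ask a vision-capable LLM to interpret the image. Column names,
# the CSV log format, and `vision_llm` are illustrative assumptions.
import pandas as pd
import matplotlib.pyplot as plt

RELEVANT = {
    "altitude_hold": ["altitude_m", "vertical_speed_mps"],
    "geofence":      ["lat", "lon"],
    "stability":     ["roll_deg", "pitch_deg"],
}

def analyze_property(log_csv: str, test_property: str, vision_llm) -> str:
    df = pd.read_csv(log_csv)                      # flight log exported as CSV
    cols = [c for c in RELEVANT.get(test_property, []) if c in df.columns]
    if not cols:
        return f"No known log channels mapped to '{test_property}'."
    ax = df.plot(x="time_s", y=cols, figsize=(8, 3), title=test_property)
    fig_path = f"{test_property}.png"
    ax.get_figure().savefig(fig_path, dpi=150)
    plt.close(ax.get_figure())
    # Hand the rendered plot to a vision-capable model for a first-pass report.
    return vision_llm(
        image_path=fig_path,
        prompt=f"Did the flight satisfy the '{test_property}' property? "
               "Point to the time ranges where it may have been violated.",
    )
```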

Why we think this matters (beyond drones)

This isn’t “LLMs writing unit tests.” It’s bigger:

We’re turning simulation testing into a closed-loop system:
prompt → scenario → executable config/mission → run → logs → analysis → next test
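
A toy orchestrator makes that loop explicit. It reuses the hypothetical sketches above; `s_agent`, `run_simulation`, and `analytics_agent` are stand-ins for the real components, and the feedback-to-next-scenario step is one simple way to close the loop.

```python
# Closed-loop orchestration sketch (reuses env_agent / m_agent from above).
# `s_agent`, `run_simulation`, and `analytics_agent` are hypothetical callables.
def testing_loop(goal: str, s_agent, run_simulation, analytics_agent, rounds: int = 5):
    reports = []
    for _ in range(rounds):
        blueprint = s_agent(goal)                      # prompt -> scenario (parsed dict)
        env_agent(blueprint, "sim_env.json")           # scenario -> simulator config
        m_agent(blueprint, "mission.json")             # scenario -> mission script
        log_path = run_simulation("sim_env.json", "mission.json")   # run -> logs
        report = analytics_agent(log_path, blueprint["safety_properties"])
        reports.append(report)
        # Feed findings back so the next scenario probes what just failed.
        goal = f"{goal}\nPrior findings: {report}"
    return reports
```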

In practice, that means:

  • More scenario diversity with less human effort
  • Faster iteration cycles (test earlier, more often)
  • A path toward continuous verification for autonomy stacks

And if you work in robotics, AVs, industrial automation, or any cyber-physical domain, you can probably map this architecture to your world immediately.

The deeper takeaway: the real product is the workflow

If you’re an AI enthusiast, here’s the shift to notice:

LLMs aren’t just a “model you call.”

They’re becoming modular operators inside engineered systems — agents with roles, guardrails, validators, memory/knowledge bases, and feedback loops. That’s what makes AUTOSIMTEST interesting as a product direction:

  • A scenario generator grounded in real incidents
  • A compiler layer that turns intent into executable simulation artifacts
  • A log analyst that reduces the expert knowledge barrier for debugging

This is the blueprint for a new class of developer tools:
AI that doesn’t just help engineers write tests — it runs the simulation testing loop and explains what happened.

* * *

LLM agents won’t replace autonomy engineers — but they can replace a lot of the repetitive, fragile work around simulation testing: scenario authoring, config wiring, mission scripting, and first-pass failure triage.

That’s the bet behind AUTOSIMTEST:
prompt → scenario → run → logs → explanation → next test

If you’re working on drones/robotics/autonomy and want to compare notes (or try this approach on your stack), reach out. I’ll share a deeper post soon on the improved Analytics-Agent and the guardrails that kept LLM outputs executable.

Paper: arXiv 2501.11864 (ICSE 2025)


