Disaster Recovery Is Broken And AI Won’t Fix It

Let’s start with an exercise: can you find a common pattern across the disasters below?
In October 2025, AWS suffered a prolonged outage in its US‑EAST‑1 region, impacting core services like DynamoDB, IAM, and networking layers. The blast radius extended far beyond the affected region, bringing down widely used platforms such as Snapchat, Coinbase, and Robinhood. The root cause was not a single server failure, but control‑plane fragility combined with hidden regional dependencies. [bluemantis.com], [forbes.com]
Google Cloud has faced similar events. In June 2025, a faulty quota policy update triggered a global failure in Google Cloud’s Service Control layer, impairing APIs across nearly all regions simultaneously. Critical services, including Gmail, Cloud Storage, BigQuery, and Vertex AI, were unavailable for hours. The irony was stark: workloads designed for regional isolation were defeated by a globally replicated control decision. [digit.in] [statusgator.com]
In October 2025, a faulty configuration change in Azure Front Door caused a massive global disruption, affecting services like Microsoft 365, Teams, and Xbox. The outage lasted over 8 hours and affected major enterprises worldwide. [reuters.com]
Notice the troubling pattern: increasing cloud complexity, coupled with aggressive automation and cost optimization, has reduced the margin for error in resilience planning. [infoworld.com]
The Uncomfortable Truth About Disaster Recovery Today
Cloud computing was supposed to make outages a relic of the past. In many ways, it delivered – hardware failures, power issues, and localized infrastructure problems are handled far better today by providers like Amazon Web Services, Google Cloud Platform, and Microsoft Azure than most organizations ever managed on-premises.
And yet, outages haven’t gone away. If anything, they’ve become more disruptive.
The reason is subtle but fundamental: the nature of failure has changed.
In the past, failures were mostly physical and contained. A server would crash, a storage array would degrade, or a data center would go offline. These were painful, but they were localized. You could isolate them, work around them, and recover.
In the cloud, failures are increasingly logical and shared. They originate in places that are harder to see and harder to reason about — identity systems, control planes, quota managers, networking abstractions, and global configuration pipelines. When these components fail, they don’t fail neatly. They fail broadly, often affecting multiple regions, services, and customers at once.
Over the past few years, well-documented incidents across major cloud providers have revealed a consistent pattern: systems designed for regional isolation breaking down because of hidden global dependencies. Teams discover, often too late, that their failover strategy depends on the same identity service, the same control plane API, or the same quota system that just failed.
This leads to an uncomfortable but important realization:
Most disaster recovery strategies are built for failures that no longer dominate real systems.
We assume regions are independent enough.
We assume control planes are always reachable.
We assume failover is mostly a routing problem.
We assume that, if things go wrong, humans will step in and fix it.
These assumptions are rarely written down. They live quietly inside architecture diagrams, Terraform modules, and runbooks. But when they fail, they take everything else down with them.
This is where AI starts to matter – not in the middle of an outage, but long before it. Because the real problem in disaster recovery is not execution. It’s decision-making under flawed assumptions.
Automation ≠ Intelligence
Over the last decade, the industry has invested heavily in automation. We’ve built systems that can fail over services in seconds, scale infrastructure automatically, and recreate entire environments from code. These are remarkable achievements, and they are essential for operating at scale.
But automation solves only part of the problem.
It answers the question: “How do we recover?”
It does not answer: “Should we recover this way at all?”

In real incidents, the hardest decisions are rarely mechanical. They are contextual.
- Should you fail over immediately, or wait for a partial recovery?
- Is the issue isolated to a region, or is it tied to a shared control plane that will follow you?
- What is the real risk of data loss if you promote a replica right now?
- Will a cross-region move violate compliance constraints or introduce unacceptable latency?
These are not questions that automation can answer. And this is where things get dangerous.
Because automated systems will execute whatever they are told — quickly, reliably, and at scale, even if the decision itself is wrong. In distributed systems, a premature or poorly chosen failover can amplify the impact: shifting load into already degraded systems, triggering cascading failures, or locking in data inconsistencies.
Automation doesn’t eliminate bad decisions. It accelerates them.
This is the gap where AI can provide real value, not by turning more knobs faster, but by helping teams make better decisions in the first place.
AI in SRE Today: Helpful, but Too Late
If you look at how AI is currently used in SRE environments, most of it sits firmly in the operational layer.
Microsoft’s Azure SRE Agent is one of the clearest illustrations of this trend. The agent can automatically acknowledge incidents, correlate metrics, logs, deployments, and prior incidents, and generate investigation summaries within seconds of an alert firing. In some run modes, it can even execute predefined mitigations or propose fixes for human approval. [sre.azure.com]
Similar capabilities exist in platforms such as PagerDuty Advance, which uses generative AI for incident summarization, alert correlation, stakeholder updates, and post‑incident review drafts directly inside Slack or Teams. [cio.com]
Most of these tools can:
- Summarize incidents in real time
- Correlate alerts and reduce noise
- Detect anomalies in logs and metrics
- Generate postmortem drafts
These capabilities are genuinely useful. They reduce cognitive load during incidents, speed up triage, and help teams retain knowledge that would otherwise be lost in chat logs and fragmented documentation.
But they all share a common characteristic: they activate after something has already gone wrong.
They help you understand the failure.
They help you respond more efficiently.
They help you document what happened.
What they don’t do is challenge the assumptions that made the failure possible.
From a disaster recovery perspective, this is a limitation. Improving incident response is valuable, but it does not necessarily improve disaster recovery outcomes, where the cost of a wrong decision can be far greater than the cost of slow triage.
Where AI Can Actually Help: Before the Pager Goes Off
The real leverage of AI in disaster recovery lies upstream, during planning, modeling, and decision preparation.
Decision Support, Not Decision Making
In high-pressure situations, teams rarely hesitate because they lack runbooks. They hesitate because they lack confidence. The consequences of a wrong decision are often severe enough to make any action feel risky – data loss, compliance violations, customer impact, etc.
AI can help reduce that uncertainty by bringing together signals that are otherwise scattered across systems, teams, and historical data.
For example, an AI-assisted system could:
- Compare current symptoms with patterns from past incidents
- Estimate likely recovery timelines based on similar failures
- Model the expected RTO (Recovery Time Objective) and RPO (Recovery Point Objective) for different recovery strategies
- Surface hidden dependencies that may not be obvious in architecture diagrams
- Highlight regulatory or data residency risks tied to specific actions
Instead of triggering actions, the system produces a decision brief – a structured view of options, trade-offs, and likely outcomes.
The goal is not autonomy but clarity.
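As a rough illustration, a decision brief could be modeled as a small data structure rather than a wall of alerts. The sketch below is a minimal Python example under assumed inputs; the option names, scoring weights, and fields are hypothetical, not a product API.

```python
from dataclasses import dataclass, field

@dataclass
class RecoveryOption:
    """One candidate recovery path, scored ahead of the incident."""
    name: str
    estimated_rto_minutes: int       # expected time to restore service
    estimated_rpo_minutes: int       # expected data-loss window
    compliance_risks: list[str] = field(default_factory=list)
    success_likelihood: float = 0.5  # 0..1, derived from similar past incidents

def build_decision_brief(options: list[RecoveryOption]) -> list[RecoveryOption]:
    """Rank options: fewest compliance risks first, then by
    likelihood of success, then by fastest recovery."""
    return sorted(
        options,
        key=lambda o: (len(o.compliance_risks),
                       -o.success_likelihood,
                       o.estimated_rto_minutes),
    )

# Hypothetical options for a regional database outage:
brief = build_decision_brief([
    RecoveryOption("wait-for-regional-recovery", 90, 0, [], 0.6),
    RecoveryOption("promote-cross-region-replica", 25, 10, ["EU data residency"], 0.8),
    RecoveryOption("failover-to-warm-standby", 40, 5, [], 0.75),
])
for opt in brief:
    print(opt.name, opt.estimated_rto_minutes, opt.compliance_risks)
```

The point of the structure is that trade-offs are explicit: the fastest option is not ranked first here, because it carries a compliance risk that a human must consciously accept.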
Runbooks That Evolve With the System
One of the most common failure points in disaster recovery is the runbook itself.
Runbooks are written at a point in time, but the systems they describe evolve continuously. Infrastructure changes, permissions drift, dependencies shift, and over time, the documented recovery path no longer matches reality.
This is why, during real incidents, teams often discover that steps are missing, permissions are insufficient, or critical dependencies were never documented.
AI offers a way to bridge this gap.
By analyzing incident timelines, chat logs, deployment histories, and infrastructure state, AI can help turn static runbooks into living documents – continuously updated representations of how recovery actually works.
Instead of a single, generic procedure, teams can have context-aware guidance:
“If identity services are degraded, skip standard failover and use this alternate path.”
This reduces reliance on tribal knowledge at exactly the moment when people are least able to rely on it.
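In code, that kind of context-aware guidance amounts to selecting a recovery path based on the live health of its dependencies, instead of always following one static procedure. This is a hypothetical sketch; the path names and dependency sets are illustrative.

```python
# Each recovery path declares which shared services it depends on.
# Names are illustrative, not real service identifiers.
RUNBOOK_PATHS = {
    "standard-failover": {"requires": {"identity", "control-plane"}},
    "pre-authorized-dns-failover": {"requires": {"dns"}},
    "manual-replica-promotion": {"requires": set()},
}

def pick_runbook(degraded: set[str]) -> str:
    """Return the first recovery path none of whose dependencies
    are currently degraded."""
    for name, path in RUNBOOK_PATHS.items():
        if not path["requires"] & degraded:
            return name
    return "manual-replica-promotion"  # last resort: no hard dependencies

# If identity services are down, the standard failover is skipped automatically.
print(pick_runbook({"identity"}))  # → "pre-authorized-dns-failover"
```

The value is not the ten lines of logic; it is keeping the dependency sets current, which is exactly the maintenance work AI can help automate from deployment histories and infrastructure state.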
Smarter And More Targeted DR Testing
Disaster recovery testing today tends to fall into two extremes. On one end, there are full-scale failover exercises – expensive, disruptive, and often avoided. On the other, there are tabletop exercises that validate intent but not reality.
AI makes it possible to take a more targeted approach.
By analyzing infrastructure changes, incident history, and dependency graphs, AI can help identify where the system is most likely to fail next. Instead of testing everything, teams can focus on the scenarios that carry the highest risk.
This shifts DR testing from a compliance exercise to a continuous validation process, where assumptions are regularly challenged and updated.
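A minimal version of that prioritization can be sketched as a risk score per scenario. The weights, signal names, and scenarios below are assumptions for illustration; in practice a model would learn them from incident and change data.

```python
# Illustrative risk scoring: rank DR test scenarios by recent change
# volume, incident history, and staleness of the last test.
# The weights are assumed, not a standard formula.
def risk_score(recent_changes: int, past_incidents: int, days_since_test: int) -> float:
    """Higher score = more likely to fail next, so test it sooner."""
    return 2.0 * past_incidents + 1.0 * recent_changes + 0.1 * days_since_test

scenarios = {
    "identity-provider-outage": risk_score(recent_changes=12, past_incidents=3, days_since_test=400),
    "regional-db-failover": risk_score(recent_changes=4, past_incidents=1, days_since_test=90),
    "object-storage-restore": risk_score(recent_changes=1, past_incidents=0, days_since_test=30),
}

# Test the riskiest scenarios first instead of cycling through all of them.
for name, score in sorted(scenarios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.1f}")
```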
A Realistic Approach
Context-Aware Decision Support
An AI-assisted disaster recovery system is not an autonomous “self-healing” platform. In practice, it behaves more like a decision support system built on top of your existing infrastructure.
At its foundation is a continuously updated view of the system – drawn from infrastructure-as-code repositories, dependency mappings, incident records, and compliance constraints. This data feeds into a reasoning layer that combines probabilistic models with language-based synthesis to evaluate different failure scenarios.
On top of that sits a simple but powerful interface: a decision brief presented during incidents. Instead of raw alerts and fragmented logs, responders see a structured set of recovery options, each with clearly articulated trade-offs — latency, data loss, compliance risk, and likelihood of success.
The system does not take action. It does not trigger failovers. It does something far more valuable:
It makes it easier for humans to make the right decision under pressure.
The Risks You Can’t Ignore
Introducing AI into disaster recovery does not eliminate risk; it changes its nature.
AI systems can project confidence even when their underlying data is incomplete. They can be biased toward common failure patterns, missing rare but catastrophic scenarios. They can misinterpret dependencies or rely on outdated information if the underlying data is stale.
Perhaps most importantly, they can produce recommendations that are difficult to explain. And in the middle of an incident, anything that cannot be clearly justified is unlikely to be trusted.
This is why guardrails matter.
AI should remain advisory. Decisions that impact recovery must stay firmly in human hands. Outputs should prioritize transparency over confidence, and systems should be continuously validated against real incidents and test scenarios.
In disaster recovery, speed matters, but correctness matters more.
A Starting Point
Adopting AI in disaster recovery does not require a large platform investment. In fact, the safest approach is to start small and focus on areas where the value is immediate and the risk is low.
- Use AI for dependency discovery, not automation
Let AI help uncover hidden control‑plane, identity, networking, or quota dependencies that DR diagrams routinely omit — but stop short of letting it act on them.
- Generate decision briefs, not actions
Ask AI to produce structured comparisons: recovery options, estimated RTO/RPO, blast radius, and risk trade‑offs. Humans should still choose what to do.
- Apply AI to DR testing prioritization
Use AI to answer what to test next based on recent code churn, infra drift, and incident patterns, rather than running the same expensive DR exercise every quarter.
- Keep humans explicitly accountable for final decisions
AI recommendations should be advisory by design. If no human is accountable for a recovery decision, the system is already too autonomous.
- Continuously validate assumptions against real outages
After every major incident — cloud‑provider or internal — re‑evaluate which assumptions held, which failed, and whether AI‑generated guidance would have helped or misled.
Looking Ahead
Cloud outages are not going away. If anything, increasing system complexity guarantees that they will continue to happen.
What will change is how well we are prepared for them.
For years, the industry has focused on making recovery faster. But speed alone is not enough. In many cases, the most damaging failures are not caused by slow responses, but by incorrect ones.
AI will not fix disaster recovery by automating it further. Its real value lies in something far less visible, but far more important:
Helping us understand our systems well enough to know which decisions not to make.
Because in modern distributed systems, the difference between resilience and failure is often not how quickly you act — but whether you chose the right action at all.

About Me

I’m a Site Reliability Engineer with 15+ years of experience building and operating large-scale systems on cloud platforms. I’ve spent much of my career dealing with real production failures: debugging outages, handling incidents, and learning the hard way where systems actually break. My work focuses on reliability engineering, SLO-driven operations, and building systems that hold up under real-world pressure.
Disaster Recovery Is Broken And AI Won’t Fix It was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.