
Disaster Recovery Architecture Diagrams: RTO, RPO & Multi-Region Failover (2026)

How to create disaster recovery (DR) architecture diagrams. Covers RTO/RPO targets, active-passive vs active-active, multi-region failover, database replication, and chaos engineering — with AI prompt templates.

Ryan · Senior AI Engineer

Disaster recovery (DR) architecture diagrams answer a specific question: when the primary region goes dark, what happens, in what order, and how long does it take? DR diagrams are not purely academic — they are used by SREs during actual incidents, reviewed by enterprise procurement teams checking your BCP documentation, and tested during DR drills where every step and dependency must be visible. A good DR diagram makes the recovery path legible to someone who has never seen your system before and is reading it at 3am. This guide covers how to diagram each DR tier, what RTO and RPO targets drive architecture decisions, and prompt templates for generating complete DR diagrams.

RTO, RPO, and the four DR tiers

Two metrics define your DR architecture more than any others:

  • RTO (Recovery Time Objective): The maximum tolerable time between a disaster and restoration of service. Drives whether you need hot standby (minutes), warm standby (tens of minutes), or cold recovery (hours).
  • RPO (Recovery Point Objective): The maximum tolerable data loss measured in time. Drives your replication strategy — synchronous replication (RPO ≈ 0), asynchronous replication (RPO = replication lag), or periodic backups (RPO = backup frequency).
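The mapping from replication strategy to RPO can be sketched in a few lines. This is a minimal illustration, not a sizing tool: the `DataStore` type, the store names, and the lag and interval figures are all hypothetical, and it assumes the system-wide RPO is bounded by the worst data store.

```python
from dataclasses import dataclass
from enum import Enum


class ReplicationMode(Enum):
    SYNC = "sync"      # RPO is approximately zero
    ASYNC = "async"    # RPO equals the replication lag
    BACKUP = "backup"  # RPO equals the time between backups


@dataclass(frozen=True)
class DataStore:
    name: str
    mode: ReplicationMode
    # Replication lag (ASYNC) or backup interval (BACKUP), in seconds.
    # Ignored for SYNC.
    interval_s: float = 0.0


def worst_case_rpo_s(store: DataStore) -> float:
    """Worst-case data loss window, in seconds, for one data store."""
    if store.mode is ReplicationMode.SYNC:
        return 0.0
    return store.interval_s


def system_rpo_s(stores: list[DataStore]) -> float:
    """Assumption: the system RPO is set by its worst data store."""
    return max(worst_case_rpo_s(s) for s in stores)


# Hypothetical inventory for illustration only.
stores = [
    DataStore("ledger-db", ReplicationMode.SYNC),
    DataStore("orders-db", ReplicationMode.ASYNC, interval_s=30),
    DataStore("audit-logs", ReplicationMode.BACKUP, interval_s=300),
]
print(system_rpo_s(stores))
```

Running the check per data store makes it obvious when one slow backup schedule quietly sets the RPO for the whole system.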

The four DR tiers map to progressively lower RTO/RPO at progressively higher cost:

  • Tier 1: Backup & Restore. RTO: hours to days. RPO: hours (the time since the last backup). All data is backed up to S3 or Glacier. Recovery means provisioning infrastructure from scratch and restoring from the latest backup. Lowest cost, longest recovery time. Suitable for non-critical internal tools.
  • Tier 2: Pilot Light. RTO: 30 minutes to 2 hours. RPO: minutes to 1 hour. A minimal “pilot light” environment runs in the DR region: only the core data tier (replicated DB, S3 buckets). On disaster, you scale up compute (launch EC2/ECS) and update Route 53. The infrastructure exists, but compute is off or minimal until then.
  • Tier 3: Warm Standby. RTO: 15–30 minutes. RPO: seconds to minutes. A reduced-scale but fully functional copy of production runs in the DR region at all times (e.g., 20% of primary capacity). On disaster, scale up to full capacity and cut over DNS. More expensive, but continuously validated because the stack is always running.
  • Tier 4: Active-Active (Multi-Site). RTO: near zero (seconds). RPO: near zero (synchronous replication) or seconds (async). Traffic runs simultaneously across two or more regions. When one region fails, the global routing layer (DNS or a global load balancer) sends all traffic to the surviving region. Highest cost and operational complexity; typically required when the RTO target is under 5 minutes.
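As a rough sketch of how RTO and RPO targets drive tier choice, the function below picks the cheapest tier whose floor fits both targets. The floors are an assumption: they take the low end of each range listed above ("hours" is read as 60 minutes, "minutes" as 60 seconds, "seconds" as 1 second), and the names `Tier`, `TIERS`, and `cheapest_tier` are hypothetical.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    name: str
    min_rto_s: float  # assumed fastest RTO the tier typically reaches
    min_rpo_s: float  # assumed smallest RPO the tier typically reaches


# Ordered cheapest first. Floors are the low end of each range above.
TIERS = [
    Tier("Backup & Restore", 60 * 60, 60 * 60),
    Tier("Pilot Light", 30 * 60, 60),
    Tier("Warm Standby", 15 * 60, 1),
    Tier("Active-Active", 0, 0),
]


def cheapest_tier(target_rto_s: float, target_rpo_s: float) -> Tier:
    """Return the first (cheapest) tier whose floors fit both targets."""
    for tier in TIERS:
        if tier.min_rto_s <= target_rto_s and tier.min_rpo_s <= target_rpo_s:
            return tier
    return TIERS[-1]  # nothing cheaper fits, so fall back to the top tier


# RTO 15 minutes, RPO 30 seconds
print(cheapest_tier(15 * 60, 30).name)
```

Treat the result as a starting point for discussion: real ranges overlap, and a tier that can hit a target on paper still has to prove it in a DR drill.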

Key components to show in a DR architecture diagram

  • Primary and DR regions: Clearly separate the primary region environment from the DR region. Use distinct visual boundaries (dashed boxes, different background colors). Label each region with its cloud region ID.
  • Replication flows: Show database replication streams (RDS cross-region read replicas, Aurora Global Database, Postgres streaming replication), object storage replication (S3 CRR — cross-region replication), and message queue replication (Kafka MirrorMaker 2) with their replication mode (sync/async) and expected lag.
  • DNS routing and health checks: Show your DNS layer (Route 53, Cloudflare) and the health check configuration that triggers failover. For active-passive, show the health check polling interval, the failure threshold, and the DNS record TTL. For active-active, show the weighted or latency-based routing policy.
  • Failover sequence: A numbered list of steps overlaid on the diagram, or a companion runbook flow. Example: (1) Health check detects primary unhealthy, (2) Route 53 updates DNS to DR region, (3) RDS read replica is promoted to primary, (4) ECS desired count scaled to full capacity, (5) Smoke tests run, (6) Incident declared resolved.
  • Infrastructure as Code (IaC) reference: Show that DR infrastructure is defined in Terraform/CDK and deployed from the same Git repo as primary. This confirms DR is code-reviewed and versioned, not maintained manually.
  • Backup locations and schedules: S3 buckets in the DR region (or Glacier for archival), backup schedule (RDS automated daily snapshots, transaction log backups every 5 minutes), and PITR (point-in-time recovery) window.
  • RTO/RPO annotations: Add the target RTO and RPO directly on the diagram — this is the first thing an auditor or enterprise customer will ask about, and it makes the trade-offs visible.
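The numbered failover sequence described in the list above can also be expressed as an ordered runbook that stops at the first failed step. This is a minimal sketch: `run_failover` is a hypothetical helper, and the lambdas are placeholders standing in for real automation (cloud API calls, smoke tests), which are not shown here.

```python
from typing import Callable

# A step is a description plus an action that returns True on success.
Step = tuple[str, Callable[[], bool]]


def run_failover(steps: list[Step]) -> list[str]:
    """Run steps in order, stop at the first failure, return an outcome log."""
    log = []
    for number, (description, action) in enumerate(steps, start=1):
        ok = action()
        log.append(f"({number}) {description}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break  # later steps depend on earlier ones
    return log


# Placeholder actions: each lambda stands in for real automation.
runbook: list[Step] = [
    ("Health check detects primary unhealthy", lambda: True),
    ("DNS updated to DR region", lambda: True),
    ("Read replica promoted to primary", lambda: True),
    ("Compute scaled to full capacity", lambda: True),
    ("Smoke tests run", lambda: True),
]

for line in run_failover(runbook):
    print(line)
```

Keeping the step numbers identical between the diagram overlay and the runbook log makes it easier to see, mid-incident, exactly where a failover stalled.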

Prompt examples for DR architecture diagrams

Warm standby with Aurora Global Database

"Warm standby DR architecture for a B2B SaaS on AWS. Primary region: us-east-1. DR region: us-west-2. RTO target: 15 minutes. RPO target: 30 seconds. Primary: ALB → ECS Fargate (4 tasks) → Aurora PostgreSQL Global Database primary cluster (2 instances). DR region: ALB (pre-provisioned) → ECS service (desired count: 1, scales to 4 on failover) → Aurora Global Database secondary cluster (1 instance, async replication lag <30s). S3 buckets replicated via CRR to us-west-2. Route 53 health check on primary ALB (30-second interval) → on 3 consecutive failures, alias record fails over to DR ALB. Failover runbook: (1) Aurora secondary promoted to primary (typically <1 min), (2) ECS desired count scaled to 4, (3) Route 53 DNS TTL 60s propagation. Show both regions, replication streams with lag annotation, and failover trigger path."

Active-active multi-region with conflict resolution

"Active-active architecture across us-east-1 and eu-west-1 for a global SaaS. Route 53 latency-based routing: US/South American users → us-east-1, European/African users → eu-west-1. Both regions run identical stacks: ALB → ECS Fargate → DynamoDB Global Tables (multi-region active-active, sub-second replication). DynamoDB Global Tables uses last-writer-wins conflict resolution for concurrent writes to the same item. S3 with CRR (bidirectional) for file storage. Elasticache (separate per-region cluster, no cross-region replication — region-local cache only). On full region failure: Route 53 health check detects ALB unhealthy → routes all traffic to surviving region within 60 seconds. Show both regions with traffic routing, DynamoDB replication with conflict resolution annotation, and region-failure detection path."

Pilot light DR for a startup

"Pilot light DR for a cost-conscious startup. Primary: us-east-1 (RDS PostgreSQL Multi-AZ, ECS Fargate, S3, CloudFront). DR region: us-west-2. Pilot light components always running in us-west-2: RDS read replica (receives async replication from us-east-1, lag target <1 min), S3 bucket with CRR from primary bucket, ECR images mirrored to us-west-2. Compute NOT running in us-west-2 in normal state — only the data tier. On disaster declaration: (1) Terraform apply in us-west-2 workspace provisions ALB, ECS cluster, task definitions (pre-tested in DR drills), (2) RDS read replica promoted to standalone primary (break replication), (3) DNS updated to us-west-2 ALB. RTO: 45 minutes. RPO: 1 minute. Show primary region (full), DR region (data-only normal state), and the Terraform provisioning step activated on DR declaration."

DR tier comparison

| Tier | RTO | RPO | Cost multiplier | Best for |
|---|---|---|---|---|
| Backup & Restore | Hours–days | Hours | ~1.05x | Non-critical internal tools, dev environments |
| Pilot Light | 30 min–2 hrs | Minutes | ~1.2x | Startups, internal SaaS, low-traffic products |
| Warm Standby | 15–30 min | Seconds–minutes | ~1.5x | Mid-market SaaS, financial services, healthcare |
| Active-Active | <1 min | Near-zero | ~2x+ | Tier-1 SaaS, global platforms, financial trading |
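To turn the cost multipliers into a budget figure, multiply the primary-region spend by the tier's multiplier. This is a rough sketch: the multipliers are the approximate values from the comparison above ("~2x+" is treated as a lower bound of 2.0), and the $10,000 monthly baseline is a made-up example.

```python
# Approximate multipliers from the tier comparison. Active-Active is a floor.
COST_MULTIPLIER = {
    "Backup & Restore": 1.05,
    "Pilot Light": 1.2,
    "Warm Standby": 1.5,
    "Active-Active": 2.0,
}


def estimated_total_cost(primary_monthly_cost: float, tier: str) -> float:
    """Estimated monthly spend for primary plus DR at the given tier."""
    return primary_monthly_cost * COST_MULTIPLIER[tier]


def dr_overhead(primary_monthly_cost: float, tier: str) -> float:
    """Extra monthly spend attributable to DR."""
    return estimated_total_cost(primary_monthly_cost, tier) - primary_monthly_cost


baseline = 10_000  # hypothetical primary-region spend per month
for tier in COST_MULTIPLIER:
    print(f"{tier}: ~${dr_overhead(baseline, tier):,.0f}/month extra")
```

Actual costs depend on data transfer, replicated storage, and how aggressively the DR region is scaled down, so treat the output as an order-of-magnitude estimate.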

What to annotate on a DR diagram

  • RTO and RPO targets: Label these prominently — they are the north star metrics that justify every architectural decision in the DR diagram. Auditors and enterprise customers check these first.
  • Replication lag: Annotate the expected replication lag for each data store (database, object storage, queue). Synchronous replication adds latency to every write; asynchronous replication risks losing data equal to the lag at the time of failure.
  • Failover trigger: Show what triggers failover: an automated health check, a human decision, or a chaos engineering test. Ambiguity about who initiates failover is one of the most common DR failure modes.
  • DR test schedule: Annotate when the DR plan was last tested (and the result). An untested DR plan is not a DR plan — this annotation communicates maturity.
  • Failback path: Show how you return to the primary region after the incident is resolved. Failback is often harder than failover and is frequently omitted from DR documentation.
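The annotation checklist above can be enforced mechanically if the diagram's metadata is kept alongside it. The sketch below assumes a hypothetical convention in which each diagram has a plain dictionary of annotations; the key names and the example values are illustrative, not part of any tool's format.

```python
# Hypothetical annotation keys, one per item in the checklist above.
REQUIRED_ANNOTATIONS = (
    "rto_target",
    "rpo_target",
    "replication_lag",
    "failover_trigger",
    "last_dr_test",
    "failback_path",
)


def missing_annotations(annotations: dict) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]


# Illustrative metadata for a diagram that omits two annotations.
diagram = {
    "rto_target": "15 min",
    "rpo_target": "30 s",
    "replication_lag": {"aurora": "<30 s", "s3_crr": "minutes"},
    "failover_trigger": "Route 53 health check, 3 consecutive failures",
    "last_dr_test": "",
}

print(missing_annotations(diagram))
```

A check like this could run in CI next to the diagram source, so a diagram without a failback path or a recorded DR test fails review before an auditor sees it.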

Related guides: cloud architecture diagram best practices, multi-tenant architecture, SOC 2 architecture diagrams, and Kafka MirrorMaker replication.
