Backup & Disaster Recovery

Protect your data and ensure business continuity

Critical for Production

Backups and disaster recovery are essential for production workloads. Without them, data loss could be permanent and business-ending. Test your recovery procedures regularly!

Disaster Recovery Overview

Disaster recovery (DR) ensures your application can survive and recover from failures ranging from simple data corruption to complete region outages.

Key Metrics: RTO and RPO

Metric	Definition	Example
RTO	Recovery Time Objective - How long until you are back online	4 hours, 1 day
RPO	Recovery Point Objective - How much data can you afford to lose	1 hour, 5 minutes

Define Your Requirements

Before designing DR, determine your business requirements:

How long can you be offline? (RTO)
How much data loss is acceptable? (RPO)
What is your DR budget?

RDS Automated Backups

RDS automated backups provide point-in-time recovery within the retention period:

HCL

resource "aws_db_instance" "main" {
  identifier = "my-app-db"
  engine     = "postgres"
  # ... other config ...

  # Automated backups
  backup_retention_period = 7          # Days to retain backups (max 35)
  backup_window           = "03:00-04:00"  # UTC - during low traffic
  maintenance_window      = "Mon:04:00-Mon:05:00"

  # Enable deletion protection
  deletion_protection = true

  # Copy tags to snapshots
  copy_tags_to_snapshot = true

  # Enable storage encryption
  storage_encrypted = true

  # Final snapshot before deletion
  skip_final_snapshot       = false
  final_snapshot_identifier = "my-app-db-final-snapshot"
}

Restore to Point-in-Time

Terminal

$ aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier my-app-db \
    --target-db-instance-identifier my-app-db-restored \
    --restore-time 2024-01-15T10:30:00Z

{
  "DBInstance": {
    "DBInstanceIdentifier": "my-app-db-restored",
    "DBInstanceStatus": "creating"
  }
}

Recovery Creates New Instance

Point-in-time recovery always creates a new RDS instance with its own endpoint; it does not overwrite the source instance. Update your application's connection string to point to the restored instance.
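If the database is managed with Terraform, the same restore can be expressed with the `restore_to_point_in_time` block of `aws_db_instance`. The following is a minimal sketch rather than a drop-in configuration: the resource name `restored`, the instance class, and the output name are assumptions, while the identifiers and timestamp mirror the CLI example above.

```hcl
# Sketch: restore my-app-db to a point in time as a NEW instance.
resource "aws_db_instance" "restored" {
  identifier     = "my-app-db-restored"
  instance_class = "db.t3.medium" # assumption: match the source instance class

  restore_to_point_in_time {
    source_db_instance_identifier = "my-app-db"
    restore_time                  = "2024-01-15T10:30:00Z" # UTC, within the retention period
  }

  # Assumption: the restored copy is temporary. Set this to false (and add a
  # final_snapshot_identifier) if it becomes the long-term primary.
  skip_final_snapshot = true
}

# Expose the new endpoint so the application connection string can be updated.
output "restored_db_endpoint" {
  value = aws_db_instance.restored.endpoint
}
```

The output makes the new endpoint visible after `terraform apply`, which is the value the application configuration needs to pick up.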

Manual Snapshots

Terminal

$ aws rds create-db-snapshot \
    --db-instance-identifier my-app-db \
    --db-snapshot-identifier my-app-db-snapshot-20240115

{
  "DBSnapshot": {
    "DBSnapshotIdentifier": "my-app-db-snapshot-20240115",
    "DBInstanceIdentifier": "my-app-db",
    "Status": "creating"
  }
}
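A snapshot is only useful once it has been restored. As a rough sketch, a new instance can be created from the snapshot above with the `snapshot_identifier` argument of `aws_db_instance`; the resource name, instance identifier, and instance class here are assumptions for illustration.

```hcl
# Sketch: create a new instance from the manual snapshot taken above.
resource "aws_db_instance" "from_snapshot" {
  identifier          = "my-app-db-from-snapshot" # hypothetical name
  instance_class      = "db.t3.medium"            # assumption: match the source
  snapshot_identifier = "my-app-db-snapshot-20240115"

  # Assumption: this is a verification copy, not a long-lived instance.
  skip_final_snapshot = true
}
```

Restoring a snapshot into a throwaway instance like this is also a simple way to verify that a backup is actually usable.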

S3 Cross-Region Replication

S3 Cross-Region Replication (CRR) copies objects to a bucket in another region:

HCL

# Source bucket (primary region)
resource "aws_s3_bucket" "source" {
  bucket = "my-app-data-ap-southeast-1"
}

resource "aws_s3_bucket_versioning" "source" {
  bucket = aws_s3_bucket.source.id
  versioning_configuration {
    status = "Enabled"  # Required for replication
  }
}

# Destination bucket (DR region)
resource "aws_s3_bucket" "destination" {
  provider = aws.us_east_1
  bucket   = "my-app-data-us-east-1"
}

resource "aws_s3_bucket_versioning" "destination" {
  provider = aws.us_east_1
  bucket   = aws_s3_bucket.destination.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Replication configuration
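resource "aws_s3_bucket_replication_configuration" "replication" {
  # Versioning must be enabled on both buckets before replication is configured
  depends_on = [
    aws_s3_bucket_versioning.source,
    aws_s3_bucket_versioning.destination,
  ]

  bucket = aws_s3_bucket.source.id
  role   = aws_iam_role.replication.arn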

  rule {
    id     = "replicate-all"
    status = "Enabled"

    destination {
      bucket        = aws_s3_bucket.destination.arn
      storage_class = "STANDARD"
    }
  }
}

# IAM role for replication
resource "aws_iam_role" "replication" {
  name = "s3-replication-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "s3.amazonaws.com"
      }
    }]
  })
}

resource "aws_iam_role_policy" "replication" {
  role = aws_iam_role.replication.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetReplicationConfiguration",
          "s3:ListBucket"
        ]
        Resource = [aws_s3_bucket.source.arn]
      },
      {
        Effect = "Allow"
        Action = [
          "s3:GetObjectVersionForReplication",
          "s3:GetObjectVersionAcl",
          "s3:GetObjectVersionTagging"
        ]
        Resource = ["${aws_s3_bucket.source.arn}/*"]
      },
      {
        Effect = "Allow"
        Action = [
          "s3:ReplicateObject",
          "s3:ReplicateDelete",
          "s3:ReplicateTags"
        ]
        Resource = ["${aws_s3_bucket.destination.arn}/*"]
      }
    ]
  })
}
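The destination resources above reference an aliased provider, `aws.us_east_1`, which has to be declared somewhere in the configuration. A minimal sketch, assuming the primary region is `ap-southeast-1` (inferred from the source bucket name):

```hcl
# Default provider: primary region (assumption, inferred from the bucket name)
provider "aws" {
  region = "ap-southeast-1"
}

# Aliased provider used by the DR-region resources via `provider = aws.us_east_1`
provider "aws" {
  alias  = "us_east_1"
  region = "us-east-1"
}
```

Any resource without an explicit `provider` argument is created in the default region; only resources that set `provider = aws.us_east_1` land in the DR region.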

Replication Costs

S3 CRR incurs charges for:

PUT requests to destination bucket
Data transfer between regions
Storage in destination bucket

Disaster Recovery Procedures

Scenario 1: Data Corruption

Plain Text

1. IDENTIFY the issue
   - Check CloudWatch alarms
   - Review application logs
   - Identify affected data/time

2. STOP writes to affected data
   - Scale down services if needed
   - Communicate to stakeholders

3. RESTORE from backup
   - RDS: Point-in-time restore to new instance
   - S3: Restore object versions

4. VERIFY restored data
   - Run data integrity checks
   - Compare against known good state

5. UPDATE connections
   - Point application to restored instance
   - Verify application functionality

6. DOCUMENT incident
   - Timeline of events
   - Root cause analysis
   - Prevention measures
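Step 3 depends on older S3 object versions still existing when corruption is discovered. One way to make that window explicit is a lifecycle rule on noncurrent versions. The sketch below assumes a 30-day window and applies to the `source` bucket defined earlier; pick a value that is comfortably longer than the time it usually takes to notice bad data.

```hcl
# Sketch: keep noncurrent object versions for 30 days (assumed window),
# then expire them to bound storage cost.
resource "aws_s3_bucket_lifecycle_configuration" "source" {
  bucket = aws_s3_bucket.source.id

  # Noncurrent-version rules only make sense once versioning is enabled.
  depends_on = [aws_s3_bucket_versioning.source]

  rule {
    id     = "expire-noncurrent-versions"
    status = "Enabled"

    filter {} # empty filter: rule applies to every object in the bucket

    noncurrent_version_expiration {
      noncurrent_days = 30
    }
  }
}
```

Versions older than the window cannot be restored from the bucket, so this value effectively caps how far back an S3 restore can reach.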

Scenario 2: Region Failure

Plain Text

1. DETECT failure
   - Route 53 health checks trigger
   - CloudWatch cross-region alarms
   - External monitoring alerts

2. FAILOVER DNS
   - Route 53 automatic failover (if configured)
   - Or manually update DNS records

3. VERIFY DR infrastructure
   - Check DR region resources are healthy
   - Verify database replication is current

4. PROMOTE DR database
   - RDS: Promote read replica to primary
   - Or restore from latest backup

5. UPDATE application config
   - Update secrets to point to DR resources
   - Verify all connections

6. MONITOR and communicate
   - Monitor DR region closely
   - Communicate status to stakeholders

7. PLAN failback
   - Once primary region recovers
   - Plan data sync and failback procedure
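Steps 1 and 2 can be automated with Route 53 failover routing. The following sketch is illustrative only: the hostnames (`app.example.com`, `primary.example.com`, `dr.example.com`), the `/health` path, and the `zone_id` variable are all hypothetical placeholders, not values from this guide.

```hcl
variable "zone_id" {
  description = "Hosted zone ID for the application domain (hypothetical input)"
  type        = string
}

# Health check against the primary region's endpoint
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary.example.com"
  type              = "HTTPS"
  port              = 443
  resource_path     = "/health"
  request_interval  = 30 # seconds between checks
  failure_threshold = 3  # consecutive failures before marked unhealthy
}

# Primary record: served while the health check passes
resource "aws_route53_record" "primary" {
  zone_id         = var.zone_id
  name            = "app.example.com"
  type            = "CNAME"
  ttl             = 60
  records         = ["primary.example.com"]
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

# Secondary record: served when the primary is unhealthy
resource "aws_route53_record" "secondary" {
  zone_id        = var.zone_id
  name           = "app.example.com"
  type           = "CNAME"
  ttl            = 60
  records        = ["dr.example.com"]
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

A short TTL keeps client-side DNS caching from delaying the switch, but DNS failover only redirects traffic; the DR database and application stack still have to be ready to serve it.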

DR Strategies by Cost

Strategy	RTO	RPO	Monthly Cost
Backup & Restore	Hours	Hours	$50-100 (backup storage)
Pilot Light	Minutes-Hours	Minutes	$100-300 (minimal DR infra)
Warm Standby	Minutes	Seconds-Minutes	$300-500 (scaled-down DR)
Active-Active	Seconds	Near zero	2x primary cost

Start Simple

For most applications, Backup & Restore or Pilot Light provides adequate protection at reasonable cost. Upgrade to higher tiers as your business requirements demand.
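When requirements outgrow Backup & Restore, the database tier of a Pilot Light setup is often a cross-region read replica. The sketch below assumes the `aws_db_instance.main` instance and the `aws.us_east_1` provider alias used elsewhere in this guide; the KMS key, replica identifier, and instance class are hypothetical.

```hcl
# KMS key in the DR region: an encrypted source needs a destination-region key
resource "aws_kms_key" "rds_dr" {
  provider                = aws.us_east_1
  description             = "Encrypts the DR read replica (hypothetical key)"
  deletion_window_in_days = 30
}

# Cross-region read replica of the primary database
resource "aws_db_instance" "replica" {
  provider       = aws.us_east_1
  identifier     = "my-app-db-replica" # hypothetical name
  instance_class = "db.t3.medium"      # assumption: can be smaller than primary

  # Cross-region replicas reference the source by ARN, not by identifier
  replicate_source_db = aws_db_instance.main.arn
  kms_key_id          = aws_kms_key.rds_dr.arn

  # Assumption: adjust to your retention policy before relying on this
  skip_final_snapshot = true
}
```

The source instance needs automated backups enabled (a non-zero `backup_retention_period`) for replicas to be created. Replication is asynchronous, so the replica can lag the primary, and that lag is the practical RPO for this strategy.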

Testing Your DR Plan

Document procedures - Write step-by-step runbooks
Test regularly - Quarterly DR drills at minimum
Measure RTO/RPO - Verify you meet objectives
Train your team - Everyone should know the procedures
Update after changes - Review DR plan after infrastructure changes
