Backup & Disaster Recovery
Protect your data and ensure business continuity
Critical for Production
Backups and disaster recovery are essential for production workloads. Without them, data loss could be permanent and business-ending. Test your recovery procedures regularly!
Disaster Recovery Overview
Disaster recovery (DR) ensures your application can survive and recover from failures - from simple data corruption to complete region outages.
Key Metrics: RTO and RPO
| Metric | Definition | Example |
|---|---|---|
| RTO | Recovery Time Objective: the maximum acceptable time between a failure and service being restored | 4 hours, 1 day |
| RPO | Recovery Point Objective: the maximum acceptable data loss, measured as a window of time before the failure | 1 hour, 5 minutes |
Define Your Requirements
Before designing DR, determine your business requirements:
- How long can you be offline? (RTO)
- How much data loss is acceptable? (RPO)
- What is your DR budget?
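As a rough illustration of how the RPO question constrains the design: with periodic backups alone, the worst-case data loss is approximately the interval between backups. The sketch below encodes that check. The function name `meets_rpo` and its arguments are hypothetical helpers for illustration, not part of any AWS tooling.

```shell
# meets_rpo BACKUP_INTERVAL_MIN RPO_MIN   (hypothetical helper)
# Succeeds when the gap between backups does not exceed the RPO.
# Assumption: worst-case loss is about one full backup interval.
meets_rpo() {
  local interval_min="$1" rpo_min="$2"
  [ "$interval_min" -le "$rpo_min" ]
}
```

For example, `meets_rpo 1440 60` fails: daily snapshots alone cannot satisfy a 1-hour RPO, so a finer-grained mechanism such as point-in-time recovery would be needed.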
RDS automated backups combine a daily snapshot with transaction logs, so you can restore to any point in time within the retention period:
HCL
resource "aws_db_instance" "main" {
  identifier = "my-app-db"
  engine     = "postgres"
  # ... other config ...

  # Automated backups
  backup_retention_period = 7                     # Days to retain backups (max 35)
  backup_window           = "03:00-04:00"         # UTC - during low traffic
  maintenance_window      = "Mon:04:00-Mon:05:00"

  # Enable deletion protection
  deletion_protection = true

  # Copy tags to snapshots
  copy_tags_to_snapshot = true

  # Enable storage encryption
  storage_encrypted = true

  # Final snapshot before deletion
  skip_final_snapshot       = false
  final_snapshot_identifier = "my-app-db-final-snapshot"
}
Restore to Point-in-Time
Terminal
$ aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier my-app-db \
    --target-db-instance-identifier my-app-db-restored \
    --restore-time 2024-01-15T10:30:00Z
{
  "DBInstance": {
    "DBInstanceIdentifier": "my-app-db-restored",
    "DBInstanceStatus": "creating"
  }
}
Recovery Creates New Instance
Point-in-time recovery always creates a new RDS instance with its own endpoint; it never overwrites the source instance. Once the restored instance is available, update your application's connection string to point to it.
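Finding the new endpoint can be scripted. The following is a minimal sketch, assuming the AWS CLI is installed and configured with credentials for the account; the function name `restored_endpoint` is hypothetical, while `aws rds wait db-instance-available` and `aws rds describe-db-instances` are standard CLI commands.

```shell
# restored_endpoint INSTANCE_ID   (hypothetical helper)
# Waits for the instance to become available, then prints its hostname.
restored_endpoint() {
  local instance_id="$1"
  # Block until the restored instance reports "available"
  aws rds wait db-instance-available \
    --db-instance-identifier "$instance_id" || return 1
  # Print only the endpoint address as plain text
  aws rds describe-db-instances \
    --db-instance-identifier "$instance_id" \
    --query 'DBInstances[0].Endpoint.Address' \
    --output text
}
```

Calling `restored_endpoint my-app-db-restored` prints the hostname to place in the application's connection string. How that string is updated (environment variable, secrets store, config file) depends on your deployment and is not shown here.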
Manual Snapshots
Terminal
$ aws rds create-db-snapshot \
    --db-instance-identifier my-app-db \
    --db-snapshot-identifier my-app-db-snapshot-20240115
{
  "DBSnapshot": {
    "DBSnapshotIdentifier": "my-app-db-snapshot-20240115",
    "DBInstanceIdentifier": "my-app-db",
    "Status": "creating"
  }
}
S3 Cross-Region Replication (CRR) asynchronously copies new objects to a bucket in another region. Versioning must be enabled on both the source and destination buckets:
HCL
# Source bucket (primary region)
resource "aws_s3_bucket" "source" {
  bucket = "my-app-data-ap-southeast-1"
}

resource "aws_s3_bucket_versioning" "source" {
  bucket = aws_s3_bucket.source.id

  versioning_configuration {
    status = "Enabled" # Required for replication
  }
}

# Destination bucket (DR region)
resource "aws_s3_bucket" "destination" {
  provider = aws.us_east_1
  bucket   = "my-app-data-us-east-1"
}

resource "aws_s3_bucket_versioning" "destination" {
  provider = aws.us_east_1
  bucket   = aws_s3_bucket.destination.id

  versioning_configuration {
    status = "Enabled"
  }
}
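# Replication configuration
resource "aws_s3_bucket_replication_configuration" "replication" {
  # Versioning must be enabled on both buckets before replication is configured
  depends_on = [
    aws_s3_bucket_versioning.source,
    aws_s3_bucket_versioning.destination,
  ]

  bucket = aws_s3_bucket.source.id
  role   = aws_iam_role.replication.arn

  rule {
    id     = "replicate-all"
    status = "Enabled"

    destination {
      bucket        = aws_s3_bucket.destination.arn
      storage_class = "STANDARD"
    }
  }
}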
# IAM role for replication
resource "aws_iam_role" "replication" {
  name = "s3-replication-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "s3.amazonaws.com"
      }
    }]
  })
}

resource "aws_iam_role_policy" "replication" {
  role = aws_iam_role.replication.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetReplicationConfiguration",
          "s3:ListBucket"
        ]
        Resource = [aws_s3_bucket.source.arn]
      },
      {
        Effect = "Allow"
        Action = [
          "s3:GetObjectVersionForReplication",
          "s3:GetObjectVersionAcl",
          "s3:GetObjectVersionTagging"
        ]
        Resource = ["${aws_s3_bucket.source.arn}/*"]
      },
      {
        Effect = "Allow"
        Action = [
          "s3:ReplicateObject",
          "s3:ReplicateDelete",
          "s3:ReplicateTags"
        ]
        Resource = ["${aws_s3_bucket.destination.arn}/*"]
      }
    ]
  })
}
Replication Costs
S3 CRR incurs charges for:
- PUT requests to destination bucket
- Data transfer between regions
- Storage in destination bucket
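Since replication is what you are paying for, it can be worth confirming that objects are actually being replicated. The sketch below reads the replication status that S3 reports on an object in the source bucket. It assumes a configured AWS CLI; the function name `replication_status` is hypothetical, and the bucket and key in the usage note are placeholders.

```shell
# replication_status BUCKET KEY   (hypothetical helper)
# Prints the ReplicationStatus field from the object's metadata.
replication_status() {
  local bucket="$1" key="$2"
  aws s3api head-object \
    --bucket "$bucket" \
    --key "$key" \
    --query 'ReplicationStatus' \
    --output text
}
```

For example, `replication_status my-app-data-ap-southeast-1 path/to/object` would typically print a value such as `PENDING`, `COMPLETED`, or `FAILED` for a source object. An object that no replication rule applies to has no status, which the CLI prints as `None`.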
Scenario 1: Data Corruption
Plain Text
1. IDENTIFY the issue
   - Check CloudWatch alarms
   - Review application logs
   - Identify affected data/time
2. STOP writes to affected data
   - Scale down services if needed
   - Communicate to stakeholders
3. RESTORE from backup
   - RDS: Point-in-time restore to new instance
   - S3: Restore object versions
4. VERIFY restored data
   - Run data integrity checks
   - Compare against known good state
5. UPDATE connections
   - Point application to restored instance
   - Verify application functionality
6. DOCUMENT incident
   - Timeline of events
   - Root cause analysis
   - Prevention measures
Scenario 2: Region Failure
Plain Text
1. DETECT failure
   - Route 53 health checks trigger
   - CloudWatch cross-region alarms
   - External monitoring alerts
2. FAILOVER DNS
   - Route 53 automatic failover (if configured)
   - Or manually update DNS records
3. VERIFY DR infrastructure
   - Check DR region resources are healthy
   - Verify database replication is current
4. PROMOTE DR database
   - RDS: Promote read replica to primary
   - Or restore from latest backup
5. UPDATE application config
   - Secrets point to DR resources
   - Verify all connections
6. MONITOR and communicate
   - Monitor DR region closely
   - Communicate status to stakeholders
7. PLAN failback
   - Once primary region recovers
   - Plan data sync and failback procedure
DR Strategies by Cost
| Strategy | RTO | RPO | Monthly Cost |
|---|---|---|---|
| Backup & Restore | Hours | Hours | $50-100 (backup storage) |
| Pilot Light | Minutes-Hours | Minutes | $100-300 (minimal DR infra) |
| Warm Standby | Minutes | Seconds-Minutes | $300-500 (scaled-down DR) |
| Active-Active | Seconds | Near zero | 2x primary cost |
Start Simple
For most applications, Backup & Restore or Pilot Light provides adequate protection at reasonable cost. Upgrade to higher tiers as your business requirements demand.
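To make the table easier to apply, the sketch below maps a target RTO and RPO (both in seconds) to the cheapest tier that plausibly meets them. The function name `suggest_dr_strategy` and the numeric cut-offs are assumptions chosen to approximate the "Seconds / Minutes / Hours" bands in the table; adjust them to your own requirements.

```shell
# suggest_dr_strategy RTO_SECONDS RPO_SECONDS   (hypothetical helper)
# Thresholds are illustrative assumptions, not AWS guidance.
suggest_dr_strategy() {
  local rto_s="$1" rpo_s="$2"
  if [ "$rto_s" -lt 60 ] || [ "$rpo_s" -eq 0 ]; then
    echo "Active-Active"      # RTO in seconds, or no tolerable data loss
  elif [ "$rto_s" -lt 600 ] || [ "$rpo_s" -lt 60 ]; then
    echo "Warm Standby"       # RTO in minutes, or RPO in seconds
  elif [ "$rto_s" -lt 3600 ] || [ "$rpo_s" -lt 3600 ]; then
    echo "Pilot Light"        # RTO under an hour, or RPO in minutes
  else
    echo "Backup & Restore"   # hours of downtime and data loss are acceptable
  fi
}
```

With these cut-offs, `suggest_dr_strategy 14400 3600` (4-hour RTO, 1-hour RPO) prints `Backup & Restore`, while `suggest_dr_strategy 14400 300` prints `Pilot Light` because the 5-minute RPO rules out restoring from periodic backups alone.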
Testing Your DR Plan
- Document procedures - Write step-by-step runbooks
- Test regularly - Quarterly DR drills at minimum
- Measure RTO/RPO - Verify you meet objectives
- Train your team - Everyone should know the procedures
- Update after changes - Review DR plan after infrastructure changes
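Measuring RTO during a drill can be as simple as timing the recovery procedure and comparing the result to the objective. The following is a minimal sketch; the function name `timed_drill` is hypothetical, and the command it wraps stands in for whatever script or runbook step performs your recovery.

```shell
# timed_drill RTO_SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs the recovery command, reports elapsed time, and succeeds only
# if the command succeeded within the RTO.
timed_drill() {
  local rto_s="$1" start end elapsed
  shift
  start=$(date +%s)
  "$@" || { echo "drill command failed" >&2; return 2; }
  end=$(date +%s)
  elapsed=$((end - start))
  echo "recovery took ${elapsed}s (objective ${rto_s}s)"
  [ "$elapsed" -le "$rto_s" ]
}
```

For example, `timed_drill 14400 ./restore-drill.sh` would check a 4-hour RTO against a recovery script, where `restore-drill.sh` is a placeholder name. Recording the printed durations across quarterly drills gives a trend you can compare against your objectives.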