Reliability and business continuity

Scalability and elasticity #

Scalability - the ability of a system to increase resources to accommodate increased demand. This can be done vartically or horizontally, and is not necessarily automated.

Elasticity - the ability if a system to increase and decrease resources allocated (usually horizontally) to match demand, and implies automation.

AWS autoscaling resources:

  • Auto Scaling groups
  • spot fleet
  • ECS
  • DynamoDB
  • Aurora Read Replicas

Scaling strategies:

  • availability
  • balanced
  • cost

Launch template:

  • operational excellence:
    • versioning for rollback, re-use or even default templates
    • use for regular EC2 launch tasks
  • cost optimization:
    • utilize T* unlimited instances, dedicated hosts
    • on-demand and spot at the same time
  • reliability, performance efficiency:
    • multiple instance types

Autoscaling operations:

  • autoscaling policies
  • monitoring ASG limits
  • deploy launch templates
  • lifecycle hooks
  • cooldown periods
  • warm pool configuration

Autoscaling scenarios:

  • stateless web apps
  • unpredictable traffic
  • steady-state groups
  • message consumer apps

Autoscaling anti-scenarios:

  • monolithic applications (singleton instance)
  • applications with fixed IP address
  • applications with many manual deploy steps
  • applications with short, large, random traffic spikes

Caching solutions #

CloudFront #

CloudFront basics:

  • global scope
  • namaged CDN
  • SSL termination
  • cache reads and writes

CloudFront operations:

  • HTTP redirect to HTTPS
  • edge location selection
  • ACM (Amazon certification manager) cert integration
  • IAM cert integration
  • multiple origins

ElasticCashe #

Elasticcache can help absorb traffic spikes, reduce latence, especially for same-AZ traffic.

We can use Memcached or Redis.

Elasticache basics

  • Memcached:
    • AZ scoped
    • managed in-memory
    • volatile
    • all endpoints writable
  • Redis:
    • AZ scoped
    • managed in-memory
    • persistent (it replicate the same cache for each node in the cluster)
    • 1 write endpoint
    • sharding option

RDS and aurora replicas #

RDS resilience #

RDS may has read replicas in each AZ (but in one region), that can be promoded to primary.

Also possible to use cross-region read replicas (except SQL server).

Multi-AZ deployment as a configuration option allows you to deploy one or more standby node, which can be promote to primary in case of some issues with primary node.

Satndby node will promote to the primary in next cases:

  • resize instance
  • OS Patching
  • initiate reboot with failover
  • AZ-scoped availability issue

RDS read replicas #

  • unique endpoint
  • separate implementation, than Multi-AZ
  • up to 5 RRs per primary

Aurora failover conditions #

Failover with existing replicas flips DNS CNAME, promotes replica - takes ~30s.

If multiple replicas are present, customer can set replicas priority and Promotion tiers.

Failover with Aurora Serverless launches replacement in new AZ.

If no replicas and not running Serverless, Aurora attempts new DB instance in same AZ as primary on best-effort basis.

Aurora read replicas:

  • single endpoint
  • replace multi-AZ
  • up to 15 RRs per primary
  • +5 RDS RRs per primary
  • promotion tier support

Loosely Coupled Architectures #

Route 53 may do the helth checks for multiple ELBs in defferent regions. And in case of failure of primary ELB it will switch DNS ALIAS to secondary ELB automatically.

In addition to helth checks Route 53 may use Latence-based routing. And if all ELBs are unavailable it may return static site from S3 as response.

S3 + Lambda integration. We can trigger Lambda function by uploading a file to the S3.

HA and resilience #

ELB and Route 53 Health Checks #

Route 53 Health cheks basics:

  • monitor an endpoint
    • endpoint
      • IP address
      • host name
      • domain name
    • protocol
      • HTTP/HTTPS
        • path (optional)
      • TCP
    • port (single or range)
  • monitor other health cheks (calculated)
    • up to 256 other HCs
  • monitor CW alarms
    • only one CW alarm may be monitored

Load balancers:

  • Classic
  • Network
  • Application
  • Gateway

ELB HC basics:

  • check an EC2 instance
    • supported by all 4 types of LBs
  • check an IP address
    • supported by all LBs, except Classic
  • check a Lambda function
    • supported only by Application LB

LB protocol support

TCP PingHTTPHTTPS
CLB+++
NLB+
ALB++
GWLB+++

Single AZ and Multiple AZ Architectures #

EC2 Auto scaling:

  • deploy into a single AZ for:
    • performance
    • colocation with other resources
  • deploy into multiple AZ for:
    • resilience
    • cost optimization

ELB:

  • deploy into a single AZ for:
    • performance
    • colocation with other resources
  • deploy into multiple AZ for:
    • resilience

FSx filesystem:

Only supports one AZ for data. But it creates an ENI in a single A which can be reached from multiple AZs.

RDS:

  • deploy into a single AZ for:
    • cost optimization
  • deploy using Multi-AZ:
    • resilience
    • synchronous replication
  • deploy Read Replicas for:
    • further resilience
    • horizontally scaled reads

Fault-tolerant Workloads #

EIP (elastic IP address):

  • region scoped
  • can be ressigned
  • it is static

EFS:

  • resion scoped
  • durable
  • data is replicated around the region
  • you can acces to the FS directly with mount targets:
    • AZ-scoped
    • can be accessed cross-AZ
    • can specify directory and userID
  • can be mounted to:
    • EC2
    • ECS
    • EKS
    • lambda function

Route 53 Routing #

Routing options

  • simple routing
    • traditional DNS - each resord has the same weight
  • weighted routing
    • you can assing weight from 0-255 for each endpoint
    • in case of weight 0, this endpoint won’t be returned
  • latency-based routing
    • DNS response determined by lowest latence from client
  • failover routing
    • you have primary and secondary endpoints
    • secondary will returned only if primary fails health cheks
    • also you can have multiple primary endpoints
    • requires health checks

Alias records - are pointers to some AWS resources:

  • CloudFront Distributions
  • Elastic Beanstalk Environment
  • ELB
  • S3 static site endpoints
  • route 53 record in the same zone
  • VPC endpoint
  • API gateway
  • Global accelerator

Backup and restore #

Automated Snapshots and Backups #

Authomated backups

ServiceEnabled by defaultCan be disabledRetention(days)PITR*
EC2+0-∞
RDS++0-35+
Aurora+1-35+
Redshift+1-35
DynamoDB+35+

* Point in time recovery

EC2 has no native backup feature, you can use AWS Backup or DLM(Data Lifecycle Manager) for this task.

Manual backups

ServiceAWS backupRotateRetention(days)Copy or share
EC2++
RDS++
Aurora++
Redshift+
DynamoDB+

You may implement rotation by your own or use AWS Backup to gain that ability.

EC2 data lifecycle manager (DLM) allows you to create:

  • EBS snapshots
  • EC2 AMIs
  • Cross-account copy
  • Snapshot/AMI rotation
  • No unique features against AWS Backup

AWS Backup resources

  • backup vault
  • backup plan
  • backup job
  • restore point

List of supported services

AWS Backup workflow

  1. Resources assignments for Backup plans require tags to function. So you have to create a tag backup:true
  2. Backup vaults hold restore points and provide resource-level access control
  3. Backup plan provide the integration between resources, jobs abd vaults
  4. Each backup job generates a single restore point in a vault
  5. Vaults can hold multiple restore points from different resource types

Backup plan - is a siries of a tasks that you have to compleate. You need to determine:

  • backup rules
    • ways to creation:
      • template
      • interactive build
      • JSON
    • anatomy
      • backup vault
      • frequency
        • time period
        • cron expression
      • retention
      • copy (optional)
        • remote region
        • rmeote account
  • resource assignments
    • specify tags

Database Restore Tasks #

RDS restore task

# Automated backup
# *
aws rds restore-db-instance-to-point-in-time

# Manual snapshot
# *
aws rds restore-db-instance-from-db-snapshot

# Backup in S3
# *
aws rds restore-db-instance-from-s3

# Promote read replica
aws rds promote-read-replica

* - these commands create a new DB instance (and potentially new endpoint) during restore

Aurora restore tasks

# Continues backup
# *
aws rds restore-db-instance-to-point-in-time

# Manual snapshot
# *
aws rds restore-db-instance-from-db-snapshot

# Backup in S3
# *
aws rds restore-db-instance-from-s3

# Promote read replica
aws rds promote-read-replica

# Backtrack
aws rds backtrack-db-cluster

* - these commands create a new DB instance (and potentially new endpoint) during restore

BynamoDB

# On-demand backup
aws dynamodb restore-table-from-backup

# PITR
aws dynamodb restore-table-to-point-in-time

Both of these create a new table during restore

Redshift

# all snapshots
# *
aws redshift restore-from-cluster-snapshot

# restore table
aws redshift restore-table-from-cluster-snapshot

# relocate a cluster to another AZ in the same region
aws redshift modify-cluster --availability-zone

* this creates a new cluster endpoint during restore

S3 Versioning and Lifecycle Rules #

By default S3 versioning is disabled. When you enable versioning:

  • version ID attached to each version of an object
  • delete operation attaches a delete marker to the object (and doesn’t actually delete the data)

Versioning is a good way to avoid accidental deletion.

Considerations:

  • cost of many versions
  • performance of many versions
  • more complex lifecycle rules

Lifecycle rules - options

  • transition current version
  • transition previous version
  • expire current version
  • delete expired delete markers
  • delete incomplete multipart uploads

Lifecycle rules considirations:

  • object size
  • object age requirements
  • bucket or prefix scope
  • tag scope
  • conflicting rules

Glacier vault lock

  • permission policy
  • use to deny deletes
  • enforce lifecycle
  • locked after 24 hours

S3 replication options:

  • same-region, same account
  • cross-region, same account
  • cross-region, cross-account (ownership can be changed to that of the destination bucket)
  • multiple bucket destination
    • storage class of a destination can be changed for cost optimisation
    • rules are evaluated in priority order
  • multy-way replication between 2+ buckets
    • pre-existing objects can only be replicated by AWS support

All replication requires:

  • versioning enabled at source and destination
  • and IAM Role for permissions