Monitoring, logging and retention

Collecting logs #

CloudWatch #

region scope
fault tolerant
durable
push, not pull

All data in CW Logs is stored in log groups.

A log stream is a sequence of records from a single source, written into a particular log group.

A log record is a single record of some event.

# log group operations
aws logs create-log-group
aws logs delete-log-group
aws logs associate-kms-key

# log stream operations
aws logs create-log-stream
aws logs delete-log-stream

# log record operations
aws logs put-log-events
aws logs get-log-events
aws logs get-log-record
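
As a rough illustration of what a batch for `put-log-events` looks like, the sketch below builds a list of records with a millisecond `timestamp` and a `message`, sorted chronologically as the API expects within one call. The helper name `build_log_events` and the sample messages are hypothetical, and nothing here talks to AWS.

```python
import time


def build_log_events(messages, now_ms=None):
    """Turn (timestamp_ms, message) pairs into a put-log-events style batch.

    A timestamp of None means "use the current time".
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    events = [
        {"timestamp": ts if ts is not None else now_ms, "message": msg}
        for ts, msg in messages
    ]
    # Events in a single call are expected in chronological order.
    return sorted(events, key=lambda e: e["timestamp"])


batch = build_log_events(
    [(2000, "second"), (1000, "first"), (None, "now")], now_ms=3000
)
print([e["message"] for e in batch])
```

The resulting list could be serialized to JSON and passed as the log events argument of the CLI call or an SDK equivalent.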

Integration with other services #

service	type of logs
CloudTrail	Trail logs
API Gateway	Access/Execution logs
ECS	Container logs
Lambda	Execution logs
RDS	DB logs
Systems Manager	Run Command logs
EC2	App/OS logs

CloudTrail #

region scope
audit log
logs both success and failure
works with different types of events:
- management
- data
- insights
organizations support

# trail operations
aws cloudtrail create-trail
aws cloudtrail delete-trail

# logging operations
aws cloudtrail start-logging
aws cloudtrail stop-logging

# validation operations
aws cloudtrail validate-logs

Analyzing logs #

CloudWatch insights #

use existing log group/stream
(sql-like) query engine
alarm integration

# Create query
aws logs put-query-definition
aws logs delete-query-definition

# execute query and get results
aws logs start-query
aws logs stop-query
aws logs get-query-results

CloudTrail insights #

use existing events
ML-driven
no query engine

aws cloudtrail put-insights-selectors
aws cloudtrail get-insights-selectors

Elasticsearch and Kibana #

region scope
analytics engine
reporting engine
CWL integration

Exporting logs #

CWL export #

S3 export
CLI tail
subscription filter - automatically delivers logs from CW to another service:
- kinesis stream
- lambda function
- kinesis data firehose
- logical destination

# create or delete export task
aws logs create-export-task
aws logs cancel-export-task

# live tail of log stream
aws logs tail

# create new subscription filter
aws logs put-subscription-filter
  --log-group-name
  --filter-name
  --filter-pattern
  --destination-arn
  --role-arn
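
When the destination is a Lambda function, the subscription filter delivers log records as a base64-encoded, gzip-compressed JSON document under `awslogs.data`. The sketch below shows one way to unpack it. The function names and the sample payload are made up for illustration.

```python
import base64
import gzip
import json


def decode_subscription_payload(data_b64):
    """Decode the base64 + gzip envelope used for subscription deliveries."""
    raw = gzip.decompress(base64.b64decode(data_b64))
    return json.loads(raw)


def extract_messages(lambda_event):
    """Return the log messages carried by a subscription-filter invocation."""
    payload = decode_subscription_payload(lambda_event["awslogs"]["data"])
    if payload.get("messageType") != "DATA_MESSAGE":
        return []  # e.g. CONTROL_MESSAGE records carry no log data
    return [e["message"] for e in payload["logEvents"]]


# Hypothetical payload, encoded the same way the service would deliver it.
sample = {
    "messageType": "DATA_MESSAGE",
    "logGroup": "/demo/app",
    "logStream": "stream-1",
    "logEvents": [{"id": "1", "timestamp": 1000, "message": "ERROR disk full"}],
}
encoded = base64.b64encode(gzip.compress(json.dumps(sample).encode())).decode()
messages = extract_messages({"awslogs": {"data": encoded}})
print(messages)
```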

Kinesis data firehose #

This is a sort of batch mechanism

region scope
streaming pipeline
buffer and batch
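
The "buffer and batch" idea can be modelled in a few lines: records accumulate until either a size limit or a time limit is reached, and then the whole buffer is delivered as one object. This is only a toy model under assumed limits. The class name `BatchBuffer` and its parameters are hypothetical and are not Firehose settings.

```python
class BatchBuffer:
    """Toy buffer that flushes on a size threshold or an age threshold."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.opened_at = None
        self.flushed = []  # stands in for the delivery destination

    def add(self, record, now):
        if not self.records:
            self.opened_at = now  # the buffer "opens" with its first record
        self.records.append(record)
        self.size += len(record)
        # A real pipeline would also flush on a timer; this toy version
        # only checks the age when a new record arrives.
        if self.size >= self.max_bytes or now - self.opened_at >= self.max_seconds:
            self.flush()

    def flush(self):
        if self.records:
            self.flushed.append(b"".join(self.records))
        self.records, self.size, self.opened_at = [], 0, None


buf = BatchBuffer(max_bytes=10, max_seconds=60)
for payload, t in [(b"aaaa", 0), (b"bbbb", 1), (b"cccc", 2), (b"d", 3), (b"e", 70)]:
    buf.add(payload, now=t)
print(buf.flushed)
```

The first batch is flushed because of size, the second because of age.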

Possible destinations:

S3
Redshift
ES
HTTP endpoint
Datadog
MongoDB
New Relic
Splunk

[Diagram: Kinesis Firehose delivery to S3]

CloudWatch Alarm #

Terms #

Period - length of time over which a metric or expression is evaluated to create an individual data point

Evaluation Period - number of the most recent data points to evaluate when determining alarm state

Datapoints to Alarm - number of data points within the evaluation period that must be breaching to cause the ALARM state. They do not have to be consecutive.

Evaluation range - number of data points retrieved by CW for alarm evaluation. It is greater than the Evaluation Period, but the exact number varies.
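
The relationship between these terms (often called "M out of N") can be sketched as a small function. This is a simplified model that assumes a "greater than threshold" comparison and no missing data. The function name and the sample CPU values are illustrative only.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Evaluate the most recent N data points; M of them must breach."""
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    # Breaching points do not have to be consecutive.
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


cpu = [40, 95, 50, 91, 60]  # one value per period
print(alarm_state(cpu, threshold=90, evaluation_periods=5, datapoints_to_alarm=2))
print(alarm_state(cpu, threshold=90, evaluation_periods=5, datapoints_to_alarm=3))
```

With two breaching points in the window, the first call reaches the ALARM state and the second does not.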

Missing data points #

missing - alarm doesn’t consider missing data points at all
notBreaching - missing data points treated as being within threshold
breaching - missing data points treated as breaching threshold
ignore - current alarm state is maintained
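
The four settings can be compared with a deliberately simplified model: missing points are represented as `None`, and the real service's use of the wider evaluation range is left out. Treat it as a sketch of the idea rather than a description of CloudWatch's exact algorithm. All names are hypothetical.

```python
def evaluate_window(window, threshold, datapoints_to_alarm, treat_missing, current_state):
    """Evaluate one window where None marks a missing data point."""
    present = [v for v in window if v is not None]
    breaching = sum(1 for v in present if v > threshold)
    if treat_missing == "breaching":
        # Every missing point counts as a breach.
        breaching += len(window) - len(present)
    elif treat_missing in ("missing", "ignore") and not present:
        # Nothing to evaluate: keep the state or report insufficient data.
        return current_state if treat_missing == "ignore" else "INSUFFICIENT_DATA"
    # With notBreaching, missing points never add to the breaching count.
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


empty = [None, None, None]
for mode in ("missing", "notBreaching", "breaching", "ignore"):
    print(mode, evaluate_window(empty, 90, 2, mode, current_state="ALARM"))
```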

Other alarms options #

high resolution alarm - period less than 1 minute
math expression alarm - combine up to 10 metrics
percentile alarm - percentile as the monitored statistic
anomaly detection alarm - use standard deviation
composite alarm - combine alarm state of other alarms

alarm actions #

SNS topic:
- email
- sms
- Lambda
- etc
EC2 actions:
- stop
- reboot
- terminate
- recover
Auto Scaling action
Systems Manager (SSM) OpsItem

CloudWatch Metric Filters #

Metric Filters let you create a metric from log lines that match a pattern.

String Metric Filters basics #

match everything - ""
single term - "ERROR"
include/exclude terms - ERROR -permissions // will exclude all strings which contain "permissions"
multiple terms using AND - ERROR memory exception
multiple terms using OR - ?ERROR ?WARN
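
A simplified model of these term rules, with patterns written unquoted, is sketched below. It assumes case-sensitive substring matching and does not try to reproduce how the real service handles quoting or mixed `?` and plain terms. The function name `matches` is hypothetical.

```python
def matches(pattern, line):
    """Approximate term-based filter matching.

    Plain terms must all be present (AND), "-term" must be absent,
    and at least one "?term" must be present when any are given (OR).
    """
    tokens = pattern.split()
    required = [t for t in tokens if not t.startswith(("?", "-"))]
    any_of = [t[1:] for t in tokens if t.startswith("?")]
    excluded = [t[1:] for t in tokens if t.startswith("-")]
    if any(term in line for term in excluded):
        return False
    if not all(term in line for term in required):
        return False
    return not any_of or any(term in line for term in any_of)


print(matches("ERROR -permissions", "ERROR: permissions denied"))
print(matches("?ERROR ?WARN", "WARN: low disk space"))
```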

Space-delimited metric filter basics #

specify all fields with a name, bounded by [] and separated by commas
specify unknown number of fields with "..."
add conditions: =, !=, <, <=, >, >=
utilize * to match partial string or numbers
implement AND with && and OR with ||

# filter examples with Apache log

# match all 4XX response codes
[ip, id, user, timestamp, request, status_code = 4*, size]

# match response size > 1000 bytes
[ip, id, user, timestamp, request, status_code, size > 1000]

# ignore all redirect response codes
[ip, id, user, timestamp, request, status_code != 3*, size]
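
To see what the three filters above select, the sketch below splits a sample Apache access-log line into the same named fields and applies the equivalent conditions in Python. The regex tokenizer and the use of `fnmatch` for `*` are stand-ins chosen for illustration, not the service's parser. The log line is sample data.

```python
import re
from fnmatch import fnmatchcase

FIELDS = ["ip", "id", "user", "timestamp", "request", "status_code", "size"]
# A field is a [bracketed] group, a "quoted" group, or a run of non-spaces.
TOKEN = re.compile(r'\[([^\]]*)\]|"([^"]*)"|(\S+)')


def parse(line):
    """Split a space-delimited log line into named fields."""
    values = [a or b or c for a, b, c in TOKEN.findall(line)]
    return dict(zip(FIELDS, values))


line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 404 2326'
rec = parse(line)

is_4xx = fnmatchcase(rec["status_code"], "4*")            # status_code = 4*
is_large = int(rec["size"]) > 1000                        # size > 1000
not_redirect = not fnmatchcase(rec["status_code"], "3*")  # status_code != 3*
print(rec["request"], is_4xx, is_large, not_redirect)
```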

JSON metric filter basics #

{SELECTOR EQUALITY_OPERATOR STRING}
- EQUALITY_OPERATOR is = or !=
{SELECTOR NUMERIC_OPERATOR NUMBER}
- NUMERIC_OPERATOR can be =, !=, <, >, <=, >=
SELECTOR starts with $, indicating the root of JSON
- SELECTOR supports arrays
- implement AND with && and OR with ||
- publish numerical value using "metricValue:"

# filter example with CloudTrail log

# match all console login failures
{ ($.eventName = ConsoleLogin) && ($.responseElements.ConsoleLogin = "Failure") }

# match all console logins by IAM user john.doe
{ ($.eventName = ConsoleLogin) && ($.userIdentity.userName = "john.doe") }

# match all root user activity
{ $.userIdentity.type = "Root" && $.userIdentity.invokedBy NOT EXISTS && $.eventType != "AwsServiceEvent" }
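
The root-activity filter can be read as three conditions on a CloudTrail record. The sketch below evaluates the same conditions against a Python dict, using a minimal dotted-selector resolver that does not support arrays. Field names follow the CloudTrail record format (`userIdentity.type`, `userIdentity.invokedBy`, `eventType`). The helper names and sample events are hypothetical.

```python
_MISSING = object()  # sentinel for "selector does not exist"


def select(event, selector):
    """Resolve a dotted selector such as '$.userIdentity.type'."""
    node = event
    for key in selector.lstrip("$").strip(".").split("."):
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def is_root_activity(event):
    return (
        select(event, "$.userIdentity.type") == "Root"
        and select(event, "$.userIdentity.invokedBy") is _MISSING  # NOT EXISTS
        and select(event, "$.eventType") != "AwsServiceEvent"
    )


root_event = {"eventType": "AwsApiCall", "userIdentity": {"type": "Root"}}
service_event = {"eventType": "AwsServiceEvent", "userIdentity": {"type": "Root"}}
print(is_root_activity(root_event), is_root_activity(service_event))
```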

CloudWatch Dashboards #

Dashboard sharing:

enable in settings
share with specific IAM user(s)
share publicly
share via SSO
share logs widgets

CloudWatch alarm remediation #

EC2 actions #

The recover action may be used only if the underlying hardware has failed. It moves the instance to different hardware. Further requirements:

certain instance types
VPC only
default or dedicated tenancy
no instance store

Autoscaling actions #

Manual actions #

If we receive a notification from a CW alarm, we can fix the issue manually with the AWS console or CLI.

performance metrics:
- scale horizontally
- scale vertically
availability metrics:
- restore
- recover
- fail over

Automated actions #

An SNS topic can trigger:

custom endpoint (with any kind of script)
Lambda function via API Gateway
Lambda function directly
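
When a Lambda function is subscribed directly to the topic, the alarm arrives as a JSON string in the SNS record's `Message` field. A minimal sketch of pulling out the alarm name and new state is shown below. The sample event is trimmed to the fields used, and `choose_action` with its return values is a hypothetical placeholder for real remediation logic.

```python
import json


def parse_alarm_notifications(sns_event):
    """Return (alarm name, new state) pairs from an SNS-triggered invocation."""
    alarms = []
    for record in sns_event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        alarms.append((message["AlarmName"], message["NewStateValue"]))
    return alarms


def choose_action(state):
    """Hypothetical mapping from alarm state to a remediation step."""
    return {"ALARM": "remediate", "OK": "resolve"}.get(state, "wait")


# Trimmed sample event; a real one carries many more fields.
sample_event = {
    "Records": [
        {"Sns": {"Message": json.dumps(
            {"AlarmName": "high-cpu", "NewStateValue": "ALARM"}
        )}}
    ]
}
alarms = parse_alarm_notifications(sample_event)
print([(name, choose_action(state)) for name, state in alarms])
```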

EC2 status checks #

System reachability - represents host OS and hardware layer
Instance reachability - represents guest OS and processes

EBS volume status checks #

if the volume is NOT Provisioned IOPS:
- ok
- warning
- impaired
- insufficient data
otherwise:
- Normal
- Degraded
- Stalled
- insufficient data

EventBridge rules #

EventBridge basics

region based
default event bus
custom event bus
sources and targets
replay feature
DLQ feature

Sources:

CloudTrail API events
GuardDuty findings
Other service events
Forwarded events
Scheduled events

Rules are written in JSON and match event properties. Matching events are sent to targets such as:

API gateway
CW logs
EC2 actions
Remote event bus
Lambda function
SNS topic

Patterns:

event
schedule
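
For the event pattern type, matching works roughly as follows: every key in the pattern must be present in the event, nested objects are matched recursively, and each leaf is a list of accepted values. The sketch below implements only that core behaviour (no prefix, numeric, or other content filters). The function name is hypothetical, and the sample uses an EC2 state-change event.

```python
def matches_pattern(pattern, event):
    """Simplified event-pattern matching: exact values and nested objects only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object: recurse into the corresponding part of the event.
            if not isinstance(actual, dict) or not matches_pattern(expected, actual):
                return False
        else:
            # Leaf: the pattern lists accepted values; any overlap is a match.
            actual_values = actual if isinstance(actual, list) else [actual]
            if not any(value in expected for value in actual_values):
                return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}
event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
print(matches_pattern(pattern, event))
```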

Config rules and SSM automation documents #

Config basics

region scope
config streams
partial coverage
capture changes
capture config
snapshots

Config rule creation example

We have an EC2 instance and AWS Config. Config captures any changes to the instance’s properties via the config stream.

You can react to these changes via:

AWS managed rules (for specific changes)
custom rules via Lambda function
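
A custom rule's Lambda function receives the changed resource as a JSON string in `invokingEvent` and decides on a compliance result. The sketch below shows only that decision step for an assumed policy ("instance type must be on an allow list"). The allow list and function name are hypothetical, and reporting the result back to Config is not shown.

```python
import json

ALLOWED_TYPES = {"t3.micro", "t3.small"}  # hypothetical policy for illustration


def evaluate_compliance(event):
    """Build an evaluation record for the configuration item in the event."""
    invoking_event = json.loads(event["invokingEvent"])
    item = invoking_event["configurationItem"]
    if item["resourceType"] != "AWS::EC2::Instance":
        compliance = "NOT_APPLICABLE"
    elif item["configuration"]["instanceType"] in ALLOWED_TYPES:
        compliance = "COMPLIANT"
    else:
        compliance = "NON_COMPLIANT"
    return {
        "ComplianceResourceType": item["resourceType"],
        "ComplianceResourceId": item["resourceId"],
        "ComplianceType": compliance,
        "OrderingTimestamp": item["configurationItemCaptureTime"],
    }


# Trimmed sample event containing only the fields used above.
sample_event = {
    "invokingEvent": json.dumps({
        "configurationItem": {
            "resourceType": "AWS::EC2::Instance",
            "resourceId": "i-0123456789abcdef0",
            "configurationItemCaptureTime": "2024-01-01T00:00:00.000Z",
            "configuration": {"instanceType": "m5.large"},
        }
    })
}
result = evaluate_compliance(sample_event)
print(result["ComplianceType"])
```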