Monitoring, logging and retention

Collecting logs #

CloudWatch #

  • region scope
  • fault tolerant
  • durable
  • push, not pull

Everything in CW Logs is stored in a log group.

A log stream is a sequence of log events from a single source, stored within a particular log group.

A log record (log event) is a single record of some activity.

# log group operations
aws logs create-log-group
aws logs delete-log-group
aws logs associate-kms-key

# log stream operations
aws logs create-log-stream
aws logs delete-log-stream

# log record operations
aws logs put-log-events
aws logs get-log-events
aws logs get-log-record

Integration with other services #

Service            Type of logs
CloudTrail         CloudTrail's own logs
API Gateway        Access/Execution logs
ECS                Container logs
Lambda             Execution logs
RDS                DB logs
Systems Manager    Run Command logs
EC2                App/OS logs

CloudTrail #

  • region scope
  • audit log
  • logs both success and failure
  • works with different types of events:
    • management
    • data
    • insights
  • organizations support

# trail operations
aws cloudtrail create-trail
aws cloudtrail delete-trail

# logging operations
aws cloudtrail start-logging
aws cloudtrail stop-logging

# validation operations
aws cloudtrail validate-logs

Analyzing logs #

CloudWatch Logs Insights #

  • use existing log group/stream
  • (sql-like) query engine
  • alarm integration

# create or delete a saved query definition
aws logs put-query-definition
aws logs delete-query-definition

# execute query and get results
aws logs start-query
aws logs stop-query
aws logs get-query-results

CloudTrail Insights #

  • use existing events
  • ML-driven
  • no query engine

aws cloudtrail put-insight-selectors
aws cloudtrail get-insight-selectors

Elasticsearch and Kibana #

  • region scope
  • analytics engine
  • reporting engine
  • CWL integration

Exporting logs #

CWL export #

  • S3 export
  • CLI tail
  • subscription filter - automatically delivers logs from CW to another service:
    • kinesis stream
    • lambda function
    • kinesis data firehose
    • logical destination

# create or cancel export task
aws logs create-export-task
aws logs cancel-export-task

# tail a log group (add --follow to keep streaming new events)
aws logs tail

# create new subscription filter
aws logs put-subscription-filter
  --log-group-name
  --filter-name
  --filter-pattern
  --destination-arn
  --role-arn
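When the destination of a subscription filter is a Lambda function, CloudWatch Logs delivers the matching events as a base64-encoded, gzip-compressed JSON document under `awslogs.data`. The sketch below shows one way such a function might decode it. The names `decode_subscription_event`, `handler` and `make_sample_event`, as well as the log group name, are hypothetical and exist only for illustration.

```python
import base64
import gzip
import json


def decode_subscription_event(event: dict) -> dict:
    """Decode the payload that a subscription filter delivers to Lambda."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def handler(event, context=None):
    """Return the raw message of every delivered log event."""
    payload = decode_subscription_event(event)
    return [log_event["message"] for log_event in payload.get("logEvents", [])]


def make_sample_event(messages):
    """Build a fake delivery for local testing (hypothetical helper)."""
    payload = {
        "logGroup": "/hypothetical/app",
        "logStream": "stream-1",
        "logEvents": [
            {"id": str(index), "timestamp": 0, "message": message}
            for index, message in enumerate(messages)
        ],
    }
    raw = gzip.compress(json.dumps(payload).encode("utf-8"))
    return {"awslogs": {"data": base64.b64encode(raw).decode("ascii")}}
```

Calling `handler(make_sample_event(["ERROR a", "ok"]))` exercises the decode path locally without touching AWS.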

Kinesis Data Firehose #

Firehose is a kind of batching mechanism: it buffers incoming records and delivers them to a destination in batches.

  • region scope
  • streaming pipeline
  • buffer and batch
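The "buffer and batch" idea can be illustrated with a small model: records accumulate until either a size limit or an age limit is reached, and then they are released together. This is only a sketch of the general technique, not Firehose's implementation. The class name `BatchBuffer` and its parameters are assumptions, and the caller passes the current time explicitly to keep the example deterministic.

```python
from typing import List, Optional


class BatchBuffer:
    """Collects records and releases them as one batch on a size or age limit."""

    def __init__(self, max_bytes: int, max_age_seconds: float) -> None:
        self.max_bytes = max_bytes
        self.max_age_seconds = max_age_seconds
        self._records: List[bytes] = []
        self._size = 0
        self._oldest: Optional[float] = None  # arrival time of the first buffered record

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Buffer a record; return the batch if a limit was reached, else None."""
        if self._oldest is None:
            self._oldest = now
        self._records.append(record)
        self._size += len(record)
        too_big = self._size >= self.max_bytes
        too_old = now - self._oldest >= self.max_age_seconds
        if too_big or too_old:
            return self.flush()
        return None

    def flush(self) -> List[bytes]:
        """Release everything buffered so far and reset the buffer."""
        batch, self._records = self._records, []
        self._size = 0
        self._oldest = None
        return batch
```

A larger size limit or a longer age limit means fewer, bigger deliveries at the cost of higher latency before data reaches the destination.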

Possible destinations:

  • S3
  • Redshift
  • ES
  • HTTP endpoint
  • Datadog
  • MongoDB
  • New Relic
  • Splunk

[Figure: Kinesis Firehose delivery to S3]

CloudWatch Alarm #

Terms #

Period - length of time over which the metric or expression is evaluated to create an individual data point

Evaluation Periods - number of most recent data points to evaluate when determining alarm state

Datapoints to Alarm - number of data points within the evaluation periods that must be breaching to cause the ALARM state. They do not have to be consecutive.

Evaluation range - number of data points retrieved by CW for alarm evaluation. It is greater than Evaluation Periods, but the exact number varies.
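The relationship between Evaluation Periods and Datapoints to Alarm ("M out of N") can be sketched in a few lines of Python. This is a simplified model rather than CloudWatch's actual evaluation logic: the function name `is_alarm`, the "greater than threshold" comparison and the sample values are assumptions, and the evaluation range is not modelled.

```python
from typing import Sequence


def is_alarm(datapoints: Sequence[float], threshold: float,
             evaluation_periods: int, datapoints_to_alarm: int) -> bool:
    """Return True when enough of the most recent data points breach the threshold.

    Only the last `evaluation_periods` points are considered, and the breaching
    points do not have to be consecutive.
    """
    if evaluation_periods < 1 or datapoints_to_alarm < 1:
        raise ValueError("both counts must be at least 1")
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return breaching >= datapoints_to_alarm
```

For example, with a threshold of 90 and "2 out of 5", the series `[10, 95, 20, 91, 30]` is in alarm even though the two breaching points are not adjacent.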

Missing data points #

  • missing - the alarm does not consider missing data points at all
  • notBreaching - missing data points treated as being within threshold
  • breaching - missing data points treated as breaching threshold
  • ignore - current alarm state is maintained
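The four treatments can be compared with a small simulation. It is a deliberately simplified sketch: real CloudWatch behaviour also depends on the evaluation range and on how many points are present, which is not modelled here. The function name `evaluate`, the ">" comparison and the state strings used as return values are assumptions for illustration.

```python
from typing import Optional, Sequence

TREATMENTS = {"missing", "notBreaching", "breaching", "ignore"}


def evaluate(datapoints: Sequence[Optional[float]], threshold: float,
             datapoints_to_alarm: int, treat_missing: str,
             current_state: str) -> str:
    """Return the next alarm state; None in `datapoints` marks a missing point."""
    if treat_missing not in TREATMENTS:
        raise ValueError(f"unknown treatment: {treat_missing}")
    present = [value for value in datapoints if value is not None]
    if not present and treat_missing == "missing":
        return "INSUFFICIENT_DATA"  # nothing to evaluate at all
    if not present and treat_missing == "ignore":
        return current_state  # keep whatever state the alarm is already in
    breaching = sum(1 for value in present if value > threshold)
    if treat_missing == "breaching":
        # Every missing point counts as if it had crossed the threshold.
        breaching += len(datapoints) - len(present)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"
```

The same series `[95, None, None]` with a threshold of 90 and two datapoints to alarm ends in ALARM under `breaching` but stays OK under `notBreaching`.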

Other alarm options #

  • high resolution alarm - period less than 1 minute
  • math expression alarm - combine up to 10 metrics
  • percentile alarm - percentile as the monitored statistic
  • anomaly detection alarm - uses standard deviation
  • composite alarm - combine alarm state of other alarms

Alarm actions #

  • SNS topic:
    • email
    • sms
    • Lambda
    • etc
  • EC2 actions:
    • stop
    • reboot
    • terminate
    • recover
  • Auto Scaling action
  • Systems Manager (SSM) OpsItem

CloudWatch Metric Filters #

Metric filters let you create a metric from strings in log events: each log event that matches the filter pattern contributes to the metric.

String Metric Filters basics #

  • match everything - ""
  • single term - "ERROR"
  • include/exclude terms - "ERROR -permissions" // excludes all strings that contain "permissions"
  • multiple terms using AND - "ERROR memory exception"
  • multiple terms using OR - ?ERROR ?WARN
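To make the term rules concrete, here is a rough Python approximation of how such a pattern could be evaluated against a log message. It is a sketch under simplifying assumptions: it uses case-sensitive substring checks, ignores quoting, and does not claim to reproduce CloudWatch's exact semantics when `?` terms are mixed with other terms. The function name `matches` is hypothetical.

```python
def matches(pattern: str, message: str) -> bool:
    """Approximate a term-based filter pattern against one log message."""
    required, optional, excluded = [], [], []
    for token in pattern.split():
        if token.startswith("?"):
            optional.append(token[1:])   # OR term
        elif token.startswith("-"):
            excluded.append(token[1:])   # term that must not appear
        else:
            required.append(token)       # AND term
    if any(term in message for term in excluded):
        return False
    if not all(term in message for term in required):
        return False
    # With no OR terms the message already matches; otherwise one must appear.
    return not optional or any(term in message for term in optional)
```

An empty pattern yields no terms at all, so every message matches, which mirrors the "match everything" case.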

Space-delimited metric filter basics #

  • specify all fields with a name, bounded by [] and separated by commas
  • specify unknown number of fields with "..."
  • add conditions: =, !=, <, <=, >, >=
  • utilize * to match partial string or numbers
  • implement AND with && and OR with ||

# filter examples with Apache log

# match all 4XX response codes
[ip, id, user, timestamp, request, status_code = 4*, size]

# match response size > 1000 bytes
[ip, id, user, timestamp, request, status_code, size > 1000]

# ignore all redirect response codes
[ip, id, user, timestamp, request, status_code != 3*, size]
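The three Apache filters above can be tried out locally with a small Python sketch. It splits a log line into the same seven named fields, treating text in square brackets or double quotes as a single field, and then applies equivalent conditions. The helper names and the sample log line are assumptions for illustration, and `fnmatchcase` merely stands in for the `*` wildcard.

```python
import re
from fnmatch import fnmatchcase

FIELDS = ["ip", "id", "user", "timestamp", "request", "status_code", "size"]
# A field is either [bracketed text], "quoted text", or a run of non-space characters.
TOKEN = re.compile(r'\[([^\]]*)\]|"([^"]*)"|(\S+)')


def parse_line(line: str) -> dict:
    """Split one access-log line into the named fields."""
    values = [
        next(group for group in match.groups() if group is not None)
        for match in TOKEN.finditer(line)
    ]
    return dict(zip(FIELDS, values))


def is_4xx(record: dict) -> bool:
    """status_code = 4*"""
    return fnmatchcase(record["status_code"], "4*")


def is_large(record: dict) -> bool:
    """size > 1000"""
    return int(record["size"]) > 1000


def is_not_redirect(record: dict) -> bool:
    """status_code != 3*"""
    return not fnmatchcase(record["status_code"], "3*")


SAMPLE = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 404 2326'
```

Parsing `SAMPLE` gives a record whose `status_code` is `"404"` and whose `size` is `"2326"`, so all three checks pass for it.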

JSON metric filter basics #

  • {SELECTOR EQUALITY_OPERATOR STRING}
    • EQUALITY_OPERATOR is = or !=
  • {SELECTOR NUMERIC_OPERATOR NUMBER}
    • NUMERIC_OPERATOR can be =, !=, <, >, <=, >=
  • SELECTOR starts with $, indicating the root of JSON
    • SELECTOR supports arrays
    • implement AND with && and OR with ||
    • publish numerical value using "metricValue:"

# filter example with CloudTrail log

# match all console login failures
{ ($.eventName = ConsoleLogin) && ($.responseElements.ConsoleLogin = "Failure") }

# match all console logins by IAM user john.doe
{ ($.eventName = ConsoleLogin) && ($.userIdentity.userName = "john.doe") }

# match all root user activity
{ $.userIdentity.type = "Root" && $.userIdentity.invokedBy NOT EXISTS && $.eventType != "AwsServiceEvent" }
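As a way of checking what the root-activity filter selects, the same three conditions can be applied to an event that has already been parsed into a Python dict. This is a minimal sketch: `select` and `is_root_activity` are hypothetical helpers, only dotted selectors without arrays are supported, and the field names follow the CloudTrail record layout (`userIdentity.type`, `userIdentity.invokedBy`, `eventType`).

```python
from typing import Any

MISSING = object()  # sentinel for "selector does not exist in the event"


def select(event: dict, selector: str) -> Any:
    """Resolve a selector such as "$.userIdentity.type" against a parsed event."""
    if not selector.startswith("$."):
        raise ValueError("selector must start with '$.'")
    node: Any = event
    for key in selector[2:].split("."):
        if not isinstance(node, dict) or key not in node:
            return MISSING
        node = node[key]
    return node


def is_root_activity(event: dict) -> bool:
    """Root identity, not invoked by an AWS service, and not a service event."""
    return (
        select(event, "$.userIdentity.type") == "Root"
        and select(event, "$.userIdentity.invokedBy") is MISSING
        and select(event, "$.eventType") != "AwsServiceEvent"
    )
```

Using a sentinel instead of `None` keeps "field absent" distinct from "field present but null", which matters for the NOT EXISTS condition.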

CloudWatch Dashboards #

Dashboard sharing:

  • enable in settings
  • share with specific IAM user(s)
  • share publicly
  • share via SSO
  • share logs widgets

CloudWatch alarm remediation #

EC2 actions #

The recover action may be used only if the underlying hardware has failed. It moves the instance to different hardware. Additional requirements:

  • certain instance types
  • VPC only
  • default or dedicated tenancy
  • no instance store

Autoscaling actions #

Manual actions #

If we receive a notification from a CW alarm, we can fix the issue manually with the AWS console or CLI.

  • performance metrics:
    • scale horizontally
    • scale vertically
  • availability metrics:
    • restore
    • recover
    • fail over

Automated actions #

An SNS topic can trigger:

  • custom endpoint (with any kind of script)
  • Lambda function via API Gateway
  • Lambda function directly
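For the "Lambda function directly" case, the function receives the alarm notification wrapped in an SNS event, with the alarm details serialized as JSON in `Records[].Sns.Message`. The sketch below only decides which alarms need action. The `remediate:` strings are hypothetical placeholders for whatever remediation you would actually run, and the sample alarm name is made up.

```python
import json


def handler(event, context=None):
    """Return a placeholder action for every alarm that entered the ALARM state."""
    actions = []
    for record in event.get("Records", []):
        alarm = json.loads(record["Sns"]["Message"])
        if alarm.get("NewStateValue") == "ALARM":
            # Hypothetical placeholder: a real function would start remediation here.
            actions.append("remediate:" + alarm["AlarmName"])
    return actions


def make_sns_event(alarm_name: str, new_state: str) -> dict:
    """Build a minimal SNS-shaped event for local testing (hypothetical helper)."""
    message = json.dumps({"AlarmName": alarm_name, "NewStateValue": new_state})
    return {"Records": [{"Sns": {"Message": message}}]}
```

Filtering on `NewStateValue` keeps the function from reacting to OK or INSUFFICIENT_DATA notifications if the alarm is configured to send those as well.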

EC2 status checks #

  • System reachability - represents host OS and hardware layer
  • Instance reachability - represents guest OS and processes

EBS volume status checks #

  • if the volume is not Provisioned IOPS:
    • ok
    • warning
    • impaired
    • insufficient data
  • otherwise:
    • Normal
    • Degraded
    • Stalled
    • insufficient data

EventBridge rules #

EventBridge basics

  • region based
  • default event bus
  • custom event bus
  • sources and targets
  • replay feature
  • DLQ feature

Sources:

  • CloudTrail API events
  • GuardDuty findings
  • Other service events
  • Forwarded events
  • Scheduled events

Rules are JSON patterns that match event properties and route matching events to targets:

  • API gateway
  • CW logs
  • EC2 actions
  • Remote event bus
  • Lambda function
  • SNS topic
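The way a rule's JSON pattern is compared with an event can be approximated as follows: every key in the pattern must be present in the event, nested objects are matched recursively, and each leaf in the pattern is a list of accepted values. This sketch covers exact-value matching only (no prefix, numeric or other advanced operators), and the function name `matches_pattern` is an assumption.

```python
def matches_pattern(pattern: dict, event: dict) -> bool:
    """Return True if the event satisfies every field listed in the pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object: descend into the same key of the event.
            if not isinstance(actual, dict) or not matches_pattern(expected, actual):
                return False
        elif isinstance(actual, list):
            # Array in the event: one shared value is enough.
            if not any(item in expected for item in actual):
                return False
        elif actual not in expected:
            return False
    return True


PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}
```

Fields that appear in the event but not in the pattern are ignored, so a pattern only needs to list the properties it cares about.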

Patterns:

  • event
  • schedule

Config rules and SSM automation documents #

Config basics

  • region scope
  • config streams
  • partial coverage
  • capture changes
  • capture config
  • snapshots

Config rule creation example

Suppose we have an EC2 instance and AWS Config. Config will capture any changes in the instance's properties via the config stream.

You can react to these changes via:

  • AWS managed rules (for specific changes)
  • custom rules via Lambda function
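A custom rule's Lambda function receives the changed resource as a JSON string in `invokingEvent` and decides whether it is compliant. The sketch below covers only that decision and builds the evaluation record. A real function would then report it back to AWS Config (the PutEvaluations API), which is omitted here. The rule itself (an allow-list of instance types) and the `allowedTypes` parameter name are hypothetical.

```python
import json


def build_evaluation(event: dict) -> dict:
    """Evaluate one configuration item and return the evaluation record."""
    item = json.loads(event["invokingEvent"])["configurationItem"]
    params = json.loads(event.get("ruleParameters") or "{}")
    # Hypothetical rule parameter: comma-separated list of allowed instance types.
    allowed = params.get("allowedTypes", "").split(",")

    if item["resourceType"] != "AWS::EC2::Instance":
        compliance = "NOT_APPLICABLE"
    elif item["configuration"]["instanceType"] in allowed:
        compliance = "COMPLIANT"
    else:
        compliance = "NON_COMPLIANT"

    return {
        "ComplianceResourceType": item["resourceType"],
        "ComplianceResourceId": item["resourceId"],
        "ComplianceType": compliance,
        "OrderingTimestamp": item["configurationItemCaptureTime"],
    }
```

Keeping the decision in a pure function like this makes the rule easy to test locally before wiring it to AWS Config.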