A brief introduction to CloudWatch

January 11, 2024

Amazon CloudWatch monitors the performance and health of our AWS resources and applications. It lets us:

Track resource and application performance
Collect and monitor log files
Get notified when an alarm goes off
CloudWatch consists of three primary components: metrics, alarms, and events.

When running applications on Amazon EC2 instances, monitoring workload performance is crucial. This raises two key questions: how do we ensure we have enough EC2 resources to meet fluctuating performance requirements, and how do we automate resource provisioning on demand? CloudWatch handles performance monitoring and log file collection, but it doesn't manage EC2 instances directly. Amazon EC2 Auto Scaling fills that gap by scaling the fleet dynamically to maintain health and availability as demand fluctuates. CloudWatch itself serves as a distributed statistics-gathering system: it collects and tracks metrics, including custom ones, and sends notifications when alarms trigger.

CloudWatch has two different monitoring options:

Basic Monitoring for Amazon EC2 instances: Seven pre-selected metrics at a 5-minute frequency and three status check metrics at a 1-minute frequency, for no additional charge.
Detailed Monitoring for Amazon EC2 instances: All metrics that are available to Basic Monitoring at a 1-minute frequency, for an additional charge. Instances with detailed monitoring enabled provide data aggregation by Amazon EC2, Amazon Machine Image (AMI) ID, and instance type.

The diagram shows an example of CloudWatch monitoring:

EC2 Instance Monitoring:

An EC2 instance on the left has the CloudWatch agent installed, with detailed monitoring enabled.
Two metrics are highlighted:

CPU Utilization: A standard CloudWatch metric collected easily.
Memory Utilization: Monitored for the httpd service using a custom-defined metric since it's not visible at the hypervisor layer.

CloudWatch Alarm Configuration:

A CloudWatch alarm is set up to trigger when CPU utilization exceeds a defined threshold.
Upon triggering, an alert is sent via Amazon SNS (Simple Notification Service), generating an email notification.
The alert is also sent to an Amazon SQS (Simple Queue Service) queue, creating a work item.
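
As a rough sketch of the alarm described above, the snippet below builds the settings for a CPU alarm as a plain dictionary. The key names follow the parameters of boto3's put_metric_alarm call; the instance ID, topic ARN, alarm name, and the 80% threshold are made-up values for illustration, and build_cpu_alarm is a hypothetical helper, not part of any AWS library.

```python
def build_cpu_alarm(instance_id, topic_arn, threshold_percent=80.0):
    """Return keyword arguments shaped like boto3's put_metric_alarm parameters."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",   # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,                 # seconds; 1-minute data assumes detailed monitoring
        "EvaluationPeriods": 3,       # consecutive periods that must breach
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # SNS topic notified when the alarm fires
    }


# Placeholder identifiers, assumed for this example only.
params = build_cpu_alarm(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
)
print(params["AlarmName"])
```

With boto3 installed and credentials configured, these settings could be passed as boto3.client("cloudwatch").put_metric_alarm(**params). An SQS queue subscribed to the same SNS topic would then receive the alert as a work item.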

In the diagram above, we can see the following CloudWatch alarm behavior taking place:

Alarm Configuration:

Threshold is set to 3.
Minimum breach condition is set to 3 consecutive periods.

Alarm State Changes:

Time Periods 1-2: Threshold not breached, alarm state is OK.
Time Periods 3-5: Threshold breached consecutively for three periods, alarm state changes to ALARM.
Time Period 6: Value dips below the threshold, alarm state reverts to OK.
Time Periods 7-8: Threshold not breached, alarm state is OK.
Time Period 9: Threshold breached but not for three consecutive periods, alarm state remains OK.
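
The state changes above can be reproduced with a small simulation. This is a simplified model, not CloudWatch's actual evaluation logic (which also handles missing data and "M out of N" settings), and the nine sample values are invented to match the pattern described, since the diagram's real values aren't given.

```python
from typing import List


def evaluate_alarm(values: List[float], threshold: float, periods_required: int) -> List[str]:
    """Return the alarm state after each period.

    The state is ALARM only while the most recent `periods_required`
    values have all exceeded the threshold; otherwise it is OK.
    """
    states = []
    consecutive_breaches = 0
    for value in values:
        if value > threshold:
            consecutive_breaches += 1
        else:
            consecutive_breaches = 0
        states.append("ALARM" if consecutive_breaches >= periods_required else "OK")
    return states


# Hypothetical metric values for time periods 1 to 9.
sample_values = [1, 2, 4, 5, 4, 2, 1, 3, 5]
states = evaluate_alarm(sample_values, threshold=3, periods_required=3)

for period, (value, state) in enumerate(zip(sample_values, states), start=1):
    print(f"period {period}: value={value} state={state}")
```

In this model the alarm enters ALARM at the end of period 5, once the third consecutive breach is recorded, and returns to OK in period 6. The single breach in period 9 isn't enough to trigger it again.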

In Amazon CloudWatch, metrics are uniquely defined by a name, a namespace, and zero or more dimensions. Each data point includes a timestamp and optionally a unit of measure. Metrics exist only in the Region where they are created.
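
To make the idea of metric identity concrete, here is a minimal sketch that models a metric as a namespace, a name, and a set of dimensions. MetricId and metric_id are hypothetical names invented for this example, and the instance IDs are placeholders.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MetricId:
    """Identity of a metric: namespace + name + zero or more dimensions."""
    namespace: str
    name: str
    dimensions: Tuple[Tuple[str, str], ...] = ()


def metric_id(namespace: str, name: str, **dimensions: str) -> MetricId:
    # Sort the dimensions so their order does not affect identity.
    return MetricId(namespace, name, tuple(sorted(dimensions.items())))


web = metric_id("AWS/EC2", "CPUUtilization", InstanceId="i-0123456789abcdef0")
db = metric_id("AWS/EC2", "CPUUtilization", InstanceId="i-0fedcba9876543210")
no_dimensions = metric_id("AWS/EC2", "CPUUtilization")

print(web == db)                        # same name, different dimension value
print(len({web, db, no_dimensions}))    # number of distinct metrics
```

Even though all three share a namespace and a name, they count as three separate metrics because their dimensions differ.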

Namespace: A container for CloudWatch metrics, with metrics from different namespaces isolated to prevent aggregation errors. AWS namespaces follow the AWS/<service> naming convention (e.g., AWS/EC2).
Dimension: A name-value pair that forms part of a metric's identity, with up to 30 dimensions allowed per metric. Dimensions categorize and describe specific metric characteristics, which helps when structuring a statistics plan and filtering result sets.
Period: The time length associated with a CloudWatch statistic, defined in seconds. Data aggregation can be adjusted by varying the period, which can range from 1 second to 1 day (86,400 seconds).
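
The effect of changing the period can be sketched as follows. The code groups raw data points into fixed-length buckets and averages each bucket; it's a simplified stand-in for how a statistic is computed over a period, with made-up timestamps (in seconds) and values, and aggregate is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def aggregate(datapoints, period_seconds):
    """Average (timestamp, value) pairs within buckets of `period_seconds`."""
    buckets = defaultdict(list)
    for timestamp, value in datapoints:
        bucket_start = timestamp - (timestamp % period_seconds)
        buckets[bucket_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical samples taken once a minute.
samples = [(0, 10.0), (60, 20.0), (120, 30.0), (180, 40.0), (240, 50.0), (300, 60.0)]

one_minute = aggregate(samples, 60)     # one bucket per sample
five_minute = aggregate(samples, 300)   # samples merged into 5-minute buckets

print(one_minute)
print(five_minute)
```

A longer period smooths out short spikes but hides detail, so the right choice depends on how quickly we need to react to a change.
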
Common CloudWatch use cases include monitoring account resources for suspicious activity. For example, a billing alarm on estimated charges can flag unexpected spending that may point to a security violation. The alarm fires when estimated charges exceed a specified threshold.

Additionally:

CloudWatch Events: Trigger services like AWS Lambda based on near-real-time events, enhancing automation capabilities.
CloudWatch Logs: Collect application logs and use metric filters to turn matching log events into metric data points, providing a comprehensive logging solution.

Amazon CloudWatch Events provides a near-real-time stream of events describing changes in AWS resources. Rules that we configure match incoming events and route them to one or more targets for action. Events can represent changes in AWS resources, custom application-level events, or scheduled events. Targets include EC2 instances, Lambda functions, SNS topics, and SQS queues.

For example, a CloudWatch Events rule can trigger an AWS Systems Manager Run Command script each time a new EC2 instance is created.
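
Rule matching can be illustrated with a toy matcher. This is a simplified sketch: it only supports exact-value lists and nested objects, whereas real event patterns offer more operators. The event below follows the general shape of an EC2 state-change notification, but the instance ID is a placeholder and matches is a hypothetical function.

```python
def matches(pattern, event):
    """Return True if every field in `pattern` is satisfied by `event`.

    A pattern value is either a list of accepted values or a nested
    dict that is matched against the corresponding nested event field.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Rule: react when an EC2 instance enters the "running" state.
rule_pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["running"]},
}

started = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "running"},
}
stopped = {**started, "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"}}

print(matches(rule_pattern, started))
print(matches(rule_pattern, stopped))
```

Only the first event satisfies the rule, so only it would be routed to the rule's targets.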

Amazon CloudWatch Logs lets us monitor, store, and access log files from diverse sources such as EC2 instances, CloudTrail, and Route 53. It enables near-real-time analysis of logs for specific patterns. We can set alarms based on log data, visualize metrics, and retain logs indefinitely outside the EC2 instances that produced them, which removes concerns about instance storage limits.
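
The idea of turning log patterns into something we can alarm on can be sketched in a few lines. The code below counts log events containing a search term per minute, which loosely mirrors what a metric filter does. The log lines and timestamps (in seconds) are invented, and count_matching_events is a hypothetical helper rather than a CloudWatch API.

```python
from collections import Counter


def count_matching_events(log_events, term):
    """Count log events containing `term`, grouped by the minute they occurred in."""
    counts = Counter()
    for timestamp, message in log_events:
        if term in message:
            minute_start = timestamp - (timestamp % 60)
            counts[minute_start] += 1
    return dict(counts)


# Hypothetical (timestamp, message) pairs from a web server log.
log_events = [
    (0, "GET /index.html 200"),
    (12, "ERROR upstream timed out"),
    (47, "ERROR upstream timed out"),
    (75, "GET /health 200"),
    (130, "ERROR disk quota exceeded"),
]

error_counts = count_matching_events(log_events, "ERROR")
print(error_counts)
```

The resulting per-minute counts behave like any other metric, so an alarm could be set on them in the same way as on CPU utilization.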

You can think of the process of log analysis as having three distinct phases:

Configure – Decide what information you need to capture in your logs, and where and how it will be stored.
Collect – Instances are provisioned and removed in a cloud environment. You need a strategy for periodically uploading a server’s log files so that this valuable information is not lost when an instance is eventually terminated.
Analyze – After all the data is collected, it is time to analyze it. Using log data gives you greater visibility into the daily health of your systems. It can also provide information on upcoming trends in customer behavior, and insight into how customers currently use your system.

AWS CloudTrail provides continuous logging of AWS account activity, capturing information that can be directed to storage services like Amazon S3 or to business reporting tools. CloudTrail records AWS API calls made through the AWS CLI, the SDKs, and the Management Console. Although it is comprehensive, CloudTrail can't track events that happen inside Amazon EC2 instances, such as a shutdown initiated from an SSH session. Once configured, however, CloudTrail can send audit logs to Amazon S3, enabling governance, compliance, operational auditing, and risk auditing for AWS accounts.

CloudTrail allows storing API usage logs in an S3 bucket for analysis. The default CloudTrail event history only shows the last 90 days and is limited to management events, but we can create a trail to capture a complete record of events. We can also deepen the analysis by using Amazon Athena to query CloudTrail logs for AWS service activity, alongside Application Load Balancer logs and Amazon Virtual Private Cloud Flow Logs, to investigate network traffic patterns and identify threats in our VPC network.
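
As a small illustration of analyzing stored API logs, the sketch below parses a CloudTrail-style JSON document and counts calls per service and action. The "Records", "eventSource", and "eventName" fields follow the CloudTrail log format, but the sample records are fabricated for this example and summarize is a hypothetical helper.

```python
import json
from collections import Counter

# Made-up sample resembling the contents of a CloudTrail log file.
sample_log = json.dumps({
    "Records": [
        {"eventTime": "2024-01-10T09:00:00Z", "eventSource": "ec2.amazonaws.com",
         "eventName": "RunInstances"},
        {"eventTime": "2024-01-10T09:05:00Z", "eventSource": "ec2.amazonaws.com",
         "eventName": "TerminateInstances"},
        {"eventTime": "2024-01-10T09:07:00Z", "eventSource": "ec2.amazonaws.com",
         "eventName": "RunInstances"},
        {"eventTime": "2024-01-10T09:09:00Z", "eventSource": "s3.amazonaws.com",
         "eventName": "CreateBucket"},
    ]
})


def summarize(log_text):
    """Count API calls by (service, action) in a CloudTrail-style log."""
    records = json.loads(log_text)["Records"]
    return Counter((r["eventSource"], r["eventName"]) for r in records)


summary = summarize(sample_log)
for (source, name), count in summary.most_common():
    print(f"{source} {name}: {count}")
```

For logs at real scale, a query service such as Athena is the more practical tool; the code here only shows the kind of question those queries answer.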
