How to Use Amazon CloudWatch to Maximize Productivity and Optimize Cost for Your Tech Stack

Amazon CloudWatch provides full observability of your AWS resources and applications on AWS and on-premises.

Amazon CloudWatch is an application and infrastructure monitoring tool natively compatible with over 70 AWS services (e.g., Amazon EC2, Amazon DynamoDB, Amazon S3, Amazon ECS, Amazon EKS, and AWS Lambda).

It enables DevOps engineers, developers, site reliability engineers (SREs), IT managers, and product managers to perform their day-to-day tasks without worrying too much about security maintenance, thanks to Amazon CloudWatch’s automated detection features.

Technical professionals often have to carry out their operations through an array of applications. Manually keeping track of the performance and health of all of them is time-consuming and can reduce productivity in their main line of work. Amazon CloudWatch offers peace of mind by providing data and actionable insights that help them not only monitor but also optimize their applications.

Amazon CloudWatch is equipped with two key features: Alarms and Metrics Insights. Using data points and machine learning, it detects atypical behavior across the entire AWS infrastructure and allows users to set alarms that are triggered when certain thresholds are breached.

In this article, we will break down the four ways Amazon CloudWatch can be used to maximize the productivity of your AWS stack, so you can focus on other important things.

Tracking the Performance and Health of All Your Systems

Businesses that rely heavily on online infrastructure to carry out their operations need to ensure every component of the tech stack is operating smoothly. If one piece of the infrastructure is running poorly, it can have an adverse impact on the entire system and lead to heavy costs.

Using alarms, metrics, and logs, CloudWatch takes automated actions to troubleshoot issues and relay insights to optimize your applications to keep them running at their best.

Perform Visualizations with ServiceLens

ServiceLens by CloudWatch computes visualizations with CloudWatch metrics and traces from AWS X-Ray for a holistic view of the health, performance, and availability of all applications from a single place. This makes it easy to identify performance bottlenecks, isolate root causes of application issues, and determine the impact on application users.

ServiceLens provides visibility in three main areas:

1. Infrastructure monitoring (using metrics and logs)

2. Transaction monitoring (using traces to understand dependencies)

3. End-user monitoring (using canaries to monitor your endpoints and notify you when the end-user experience has degraded)

It also comes equipped with a Service Map that visualizes the contextual links between all your resources, along with UX-enhanced interfaces, so that users can easily pick up on correlated monitoring data.

Identify Top Contributors Influencing System Performance

The Contributor Insights feature of CloudWatch analyzes time-series data to capture the top contributors that are influencing system performance, positively or negatively. This helps developers and site monitors quickly isolate, diagnose, and solve issues during an operational event. It does this by analyzing log events in real time and displaying reports that show the top contributors and the number of unique contributors in the dataset.

A contributor is an aggregate metric based on dimensions contained as log fields in CloudWatch Logs, such as the account ID or interface ID in VPC Flow Logs, or any other custom set of dimensions.
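To make the idea of "top contributors" and "unique contributors" concrete, here is a minimal plain-Python sketch of the same kind of aggregation. It is not how Contributor Insights is implemented; the record layout and the `interface_id` field name are hypothetical and chosen only for illustration.

```python
from collections import Counter


def top_contributors(events, key, n=3):
    """Return the n most frequent values of `key` plus the unique contributor count."""
    # Count how many log events each contributor (value of `key`) produced.
    counts = Counter(event[key] for event in events if key in event)
    return counts.most_common(n), len(counts)


# Hypothetical flow-log style records; the field names are illustrative only.
sample_events = [
    {"interface_id": "eni-a", "bytes": 120},
    {"interface_id": "eni-b", "bytes": 40},
    {"interface_id": "eni-a", "bytes": 300},
    {"interface_id": "eni-c", "bytes": 10},
    {"interface_id": "eni-a", "bytes": 55},
    {"interface_id": "eni-b", "bytes": 70},
]

top, unique = top_contributors(sample_events, "interface_id", n=2)
print(top)     # the two busiest interfaces with their event counts
print(unique)  # how many distinct interfaces appear in the sample
```

In this sample, `eni-a` appears most often, so it would be the first place to look during an operational event.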

Reduce Resolution Time with Metrics Insights

Amazon CloudWatch recently released Metrics Insights, a fast and reliable SQL-based query engine that helps reduce mean time to resolution (MTTR). Metrics Insights enables users to identify trends and patterns across millions of operational metrics in real time, and to group those metrics by dimension.

The grouping function allows for more rapid resolution because you can narrow down and pinpoint issues faster. For example:

  • If an application is underperforming, you may want to pinpoint the instances that are consuming the most CPU. Use ORDER BY together with LIMIT to run a “top N” query that returns only the top 10 CPU-consuming instances.
  • In the background, CloudWatch will analyze over a million instances within the application you’re targeting.
  • It will then deliver insights on where the issue is coming from and how to troubleshoot it.
  • You can also GROUP BY InstanceId to narrow down the analysis and pinpoint failing instances right away.
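The “top N” query described above could look roughly like the following sketch, which only builds the query text in Python. The helper name `top_n_cpu_query` is hypothetical; the query assumes the standard `AWS/EC2` namespace, the `CPUUtilization` metric, and the `InstanceId` dimension.

```python
def top_n_cpu_query(n=10):
    """Build a Metrics Insights query for the n instances with the highest CPU."""
    if not isinstance(n, int) or n < 1:
        raise ValueError("n must be a positive integer")
    return (
        "SELECT MAX(CPUUtilization) "
        'FROM SCHEMA("AWS/EC2", InstanceId) '
        "GROUP BY InstanceId "   # one time series per instance
        "ORDER BY MAX() DESC "   # highest CPU first
        f"LIMIT {n}"             # keep only the top n
    )


query = top_n_cpu_query(10)
print(query)
```

The resulting text can be pasted into the Metrics Insights query editor; grouping by `InstanceId` is what lets a single query cover every instance in the namespace.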

The main benefit of Metrics Insights is the user experience. It can run powerful visualizations that help users easily identify where a problem is and how to solve it. It uses a standard SQL-style query language, and there is also a builder view that allows users to set up detection quickly by selecting metrics, namespaces, and dimensions from a set of pre-built sample queries.

Users can also leverage the query editor function if they want to customize their own fields. Queries can be deployed at scale, meaning you can analyze multiple metrics simultaneously.

 

Monitoring Endpoints with Synthetics

An endpoint is the point at which an API (the code that allows two pieces of software to communicate with each other) connects with a software program.

Now imagine if something were to block this important bridge.

The purpose of CloudWatch Synthetics is to provide 24/7 monitoring of application endpoints through automated tests. It will send an alert if any of these endpoints are misbehaving.

Tests can be customized according to a number of conditions such as latency (how much time it takes for a data packet to travel from one designated point to another), transactions, broken or dead links, step-by-step task completion, page load errors, load latencies for UI assets, complex wizard flows, and checkout flows.

Users can also isolate alarming endpoints and map them directly back to the infrastructure root cause and reduce MTTR.

Synthetics does this through canaries. Canaries are configurable scripts that run on a schedule, so they continuously monitor the customer experience regardless of whether there is real user traffic or not, and they help surface problems before customers run into them. Synthetics also monitors REST APIs, URLs, and website content, which helps detect unauthorized changes.
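As a rough illustration of what a single canary-style check does, the sketch below runs one endpoint test in plain Python. It is not the CloudWatch Synthetics runtime or its API: `run_check`, the injected `fetch` callable, and the one-second latency budget are all hypothetical, and `fake_fetch` stands in for a real HTTP client so the example stays self-contained.

```python
import time


def run_check(fetch, url, max_latency_s=1.0):
    """Call fetch(url), which returns an HTTP status code, and grade the result."""
    start = time.monotonic()
    try:
        status = fetch(url)
    except Exception as exc:  # any failure to reach the endpoint fails the check
        return {"url": url, "ok": False, "reason": f"error: {exc}"}
    latency = time.monotonic() - start
    if status >= 400:
        return {"url": url, "ok": False, "reason": f"status {status}"}
    if latency > max_latency_s:
        return {"url": url, "ok": False, "reason": "slow"}
    return {"url": url, "ok": True, "reason": "healthy"}


def fake_fetch(url):
    """Stand-in for a real HTTP call; pretends /missing is a dead link."""
    return 404 if url.endswith("/missing") else 200


print(run_check(fake_fetch, "https://example.com/"))
print(run_check(fake_fetch, "https://example.com/missing"))
```

Running a check like this on a schedule, and alarming when it fails, is the basic pattern behind endpoint monitoring.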

Alarms

CloudWatch alarms help detect atypical behavior across all AWS resources. Alarms can have auto-adjusted thresholds based on natural metric patterns (time of day, season, changing trends). They can be added to the dashboard and be set to send out SNS notifications.

There are two types of alarms: metric alarms and composite alarms.

Metric Alarms

A metric alarm watches a single CloudWatch metric or a mathematical expression based on metrics. It performs one or more actions based on the value of the metric or expression relative to a threshold over a certain number of time periods.

Composite Alarms

A composite alarm includes a rule expression that takes into account the states of the other alarms you’ve set (e.g., if X, Y, and Z are all in ALARM, go into ALARM).

For example, you can set a composite alarm to go into ALARM only when all of its underlying metric alarms are triggered.

This way, if there’s a pertinent issue that affects multiple resources, you can have a single alarm for those cases, instead of having an alarm go off for every single one.
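A composite alarm’s rule is written as an expression over other alarms’ states. The sketch below builds an “all of these are in ALARM” rule as text; the helper name and the alarm names are hypothetical, and the `ALARM("name")` terms joined with `AND` follow the composite alarm rule syntax.

```python
def all_in_alarm_rule(alarm_names):
    """Build a rule that is true only when every named alarm is in ALARM."""
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


# Hypothetical metric alarm names, for illustration only.
rule = all_in_alarm_rule(["cpu-high", "errors-high"])
print(rule)
```

Swapping `AND` for `OR` would instead raise the composite alarm when any one of the underlying alarms fires.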

Alarms are color-coded:

  • Grey: insufficient data (alarm has just started, wait at least 30 minutes)
  • Red: ALARM state (outside defined threshold)
  • No color: normal state (inside defined threshold)

Three settings determine when CloudWatch changes the state of an alarm:

  • Period: the length of time over which the metric or expression is evaluated to create each individual data point for the alarm
  • Evaluation Periods: the number of most recent periods, or data points, to evaluate when determining the alarm state
  • Datapoints to Alarm: the number of data points within the Evaluation Periods that must be breaching to trigger the alarm. The breaching data points must fall within the most recent set of data points, equal in number to Evaluation Periods.
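The way these settings interact is often described as an “M out of N” rule. The sketch below is a simplified model of it, not CloudWatch’s actual evaluation logic: it assumes a greater-than comparison, ignores missing data handling, and the function and parameter names are illustrative.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Simplified M-out-of-N check over the most recent data points."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        # Not enough data points yet to evaluate the alarm.
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in recent if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# One data point per period, e.g. average CPU percentage (made-up values).
cpu = [10, 95, 50, 92, 97]
state = alarm_state(cpu, threshold=90, evaluation_periods=3, datapoints_to_alarm=2)
print(state)
```

Here only the last three data points are considered, and two of them exceed the threshold, so the alarm condition is met. Requiring 2 of 3 rather than 3 of 3 makes the alarm less sensitive to a single good reading between two bad ones.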

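Missing data points

You can set CloudWatch to interpret missing data points in one of four ways:

  • Not breaching: missing data points are treated as being within the threshold
  • Breaching: missing data points are treated as crossing the threshold
  • Ignore: the current alarm state is maintained
  • Missing: if all data points in the evaluation range are missing, the alarm goes to INSUFFICIENT_DATA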
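Put together, the alarm settings and the missing-data choice might be expressed as the parameter set below. This sketch only builds a dictionary and does not call AWS. The keys are intended to mirror the CloudWatch PutMetricAlarm parameters, while the helper name, alarm name, instance ID, and the numeric values are assumptions made for illustration.

```python
VALID_MISSING_DATA = {"notBreaching", "breaching", "ignore", "missing"}


def metric_alarm_params(name, instance_id, threshold, treat_missing="missing"):
    """Assemble parameters for a CPU alarm on one EC2 instance (no AWS call)."""
    if treat_missing not in VALID_MISSING_DATA:
        raise ValueError(f"unknown missing-data treatment: {treat_missing}")
    return {
        "AlarmName": name,
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,              # seconds per data point (assumed value)
        "EvaluationPeriods": 3,     # look at the 3 most recent data points
        "DatapointsToAlarm": 2,     # alarm when 2 of those 3 are breaching
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": treat_missing,
    }


# Hypothetical alarm name and instance ID.
params = metric_alarm_params("cpu-high", "i-0123456789abcdef0", 80.0, "notBreaching")
print(params["TreatMissingData"])
```

With the AWS SDK for Python installed and credentials configured, a dictionary like this could be passed to the CloudWatch client’s `put_metric_alarm` call; check the current API reference before relying on any individual field.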

Set Alarm for Auto-Scaling

The auto-scaling function provides users with the flexibility and ease to add or remove capacity based on demand and set targets. When demand spikes, it will automatically increase capacity so you can maintain the same level of performance and output. Conversely, when resources are being underutilized, it will scale back the capacity to conserve costs. 

The dashboard allows you to quickly preview the average utilization of all your resources. Users can also build scaling plans and determine how groups of resources respond to changes in demand.

Users can set a threshold of resource utilization across different groups of resources and have an alarm set to trigger the auto-scaling action.
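The threshold-driven behavior described here can be sketched as a small decision function. This is a toy model rather than the Auto Scaling algorithm: the 70% and 30% thresholds, the one-instance step, and the size limits are all assumed values.

```python
def desired_capacity(current, utilization_pct, high=70.0, low=30.0,
                     min_size=1, max_size=10):
    """Add one instance above `high`, remove one below `low`, within limits."""
    if utilization_pct > high:
        target = current + 1      # demand spike: scale out
    elif utilization_pct < low:
        target = current - 1      # underutilized: scale in to save cost
    else:
        target = current          # inside the comfortable band: no change
    return max(min_size, min(max_size, target))


print(desired_capacity(3, 85.0))  # busy group grows
print(desired_capacity(3, 10.0))  # quiet group shrinks
```

In practice the alarm supplies the “utilization crossed the threshold” signal, and the scaling policy decides how much capacity to add or remove.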

Troubleshooting with Ease

CloudWatch monitors key metrics across your entire infrastructure at any given time and correlates the data to understand and solve the root causes of performance issues.

One way it does this is by producing diagnostic information (e.g., container restart failures) through data aggregation and summarization. The information it collects includes CPU usage, memory, disk, and network data, all of which can be previewed on a single dashboard. At the same time, users can see log errors alongside correlated performance metrics and receive real-time insights to help with resolution.

Amazon CloudWatch Container Insights

Container Insights by CloudWatch gives users a simplified way to monitor and troubleshoot their containerized applications and microservices. Supported platforms include Amazon Elastic Kubernetes Service (EKS), Amazon Elastic Container Service (ECS), AWS Fargate, and standalone Kubernetes (k8s).

How it works

From each container, CloudWatch collects metrics on memory, errors, CPU usage, network, and disk, and records them as performance events. Using this aggregated data, it automatically produces custom metrics that are used for monitoring and alarming.

Because the performance events are stored as digestible logs, the overall troubleshooting process is simpler. Custom metrics are automatically extracted from these logs and can be further analyzed using an advanced query language.

CloudWatch Lambda Insights

Lambda Insights serves the same function as Container Insights, except that it’s specifically used to monitor AWS Lambda metrics and logs.

Full Operational Visibility

With Amazon CloudWatch, users can gain complete visibility across their infrastructure (resources, applications, and services), whether it runs on AWS or on premises.

CloudWatch allows for powerful visualizations, so that teams can see how changes and experiments affect the experience of application end users and adjust accordingly.

It’s also equipped with a CloudWatch dashboard which gives users a single platform for observability, reducing the need to toggle between different applications. The benefit of such is the ability to break down data silos and gain system-wide visibility so users can quickly pinpoint and resolve issues.

Data is automatically collected and published with up to one-second granularity and retained for up to 15 months. The dashboard also allows you to monitor your CloudWatch logs from one place.

Amazon CloudWatch Logs

Amazon CloudWatch Logs are used to monitor, store and access your log files from Amazon Elastic Compute Cloud (EC2) instances, AWS CloudTrail, Route 53, and other sources.

These logs are centralized in CloudWatch from all your systems, applications, and AWS services. Regardless of the source, logs can be viewed together and ordered by time. They can also be queried and grouped by dimension, and users have the option to create customized computations with a query language.

That query language consists of a few simple but powerful commands. CloudWatch is equipped with sample queries, command descriptions, query autocomplete, and log field discovery, allowing users to run queries and solve operational issues more effectively.
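As an illustration of those commands, the sketch below builds the text of a CloudWatch Logs Insights query that counts matching log lines in five-minute buckets. The helper name and the default `ERROR` pattern are hypothetical; the `fields`, `filter`, and `stats` commands and the `@timestamp` and `@message` fields are part of the Logs Insights query language.

```python
def error_count_query(pattern="ERROR", bucket="5m"):
    """Build a Logs Insights query that counts matching log lines per time bucket."""
    return (
        "fields @timestamp, @message\n"
        f"| filter @message like /{pattern}/\n"
        f"| stats count(*) as errorCount by bin({bucket})"
    )


query_text = error_count_query()
print(query_text)
```

A query like this can be pasted into the Logs Insights console and run against one or more log groups to see when error rates spike.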

Logs can be easily viewed and the search function allows users to pinpoint specific error codes or patterns. Users can set filters based on specific fields, or archive them to come back to in the future.

For Amazon EC2 instances, for example, CloudWatch Logs can monitor the number of errors that occur in your application logs and send you a notification whenever the rate of errors exceeds a threshold you specify. By default, logs are kept indefinitely, but you can set them to expire after a certain time (e.g., 5 or 10 years).

There are three types of logs:

1) Vended logs. These are natively published by AWS on a user’s behalf. Currently, Amazon VPC Flow Logs and Amazon Route 53 logs are the two supported types.

2) Logs published by AWS services. Currently, more than 30 AWS services publish logs to CloudWatch. They include Amazon API Gateway, AWS Lambda, AWS CloudTrail, and many others.

3) Custom logs. These are logs from your own application and on-premises resources.

End-User Experience

The success of applications largely falls on the experience of the end-users. If end-users are satisfied with the navigation and usage of applications, they’re less likely to consume time from support teams or be dissatisfied with the product and look for an alternative.

CloudWatch enables developers to have a real-time view of what the end-user experience looks like.

One of the features of CloudWatch is Amazon CloudWatch RUM (real user monitoring), which gives developers visibility into the client-side performance of applications. RUM collects client-side data in real time to produce insights on how to debug issues, thus reducing MTTR. RUM also tracks users’ entire journeys across an app, which helps teams prioritize new features and bug fixes.

Resource Optimization & Automated Capacity Planning

Ideally, we want our resources not only to run smoothly but also to work in synergy with all the other applications we’re using.

Getting maximum use out of our applications while conserving resources and costs would be a win-win scenario. Luckily, CloudWatch helps applications do just that.

Every minute, detailed and custom metrics are released with up to one-second granularity. Users can derive actionable insights from the logs to troubleshoot operational issues with ease.

Once an alarm is triggered, CloudWatch can automatically take an action such as triggering Amazon EC2 Auto Scaling or stopping an instance. For example, if there’s not enough capacity for a certain application to run, the alarm can trigger auto-scaling.

Billing data and usage are gathered on the CloudWatch dashboard from multiple applications. In the VMs section, users can preview two important items:

  • Average Usage 
  • Peak Usage

This helps users easily see whether their resources are being used to their full potential. If resources have very low average and peak usage, it might be an indicator that it’s time for a downgrade. Conversely, if resources have decent average usage and are near 100% peak usage, this suggests the need for an upgrade.

Cost Optimization

CloudWatch works in tandem with services like Amazon EC2 to help users conserve costs on idle or underutilized resources.

A CloudWatch agent will detect idle or underutilized instances so that cost-saving opportunities can be recommended (e.g., terminating an idle instance or downsizing an instance that has more capacity than it needs).

How it works: CloudWatch analyzes the last 14 days of Amazon EC2 usage and existing reservation footprint. It detects two states:

Idle: Max CPU is at less than 1%.

Downsize: Max CPU is between 1% and 40%.
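The two states above amount to a simple classification on maximum CPU utilization over the lookback window. The sketch below encodes the thresholds stated in this article; how the exact boundary values (1% and 40%) are treated, and the "ok" label for everything else, are assumptions.

```python
def classify_instance(max_cpu_pct):
    """Classify an instance by its maximum CPU utilization over the window."""
    if max_cpu_pct < 1.0:
        return "idle"       # candidate for termination
    if max_cpu_pct <= 40.0:
        return "downsize"   # candidate for a smaller instance type
    return "ok"             # no recommendation


# Hypothetical per-instance maximum CPU over the last 14 days.
fleet = {"i-aaa": 0.4, "i-bbb": 22.0, "i-ccc": 76.5}
report = {instance: classify_instance(cpu) for instance, cpu in fleet.items()}
print(report)
```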

To do this, you must have Amazon EC2 Resource Optimization enabled. In the AWS Management Console, open AWS Cost Explorer, select Recommendations in the left menu bar, and click View all. Here you can easily see optimization opportunities, estimated monthly savings, and estimated total savings from using this tool.

Conclusion

In any technical workspace, errors that are not dealt with immediately can be detrimental. Amazon CloudWatch gives technical professionals more security to carry out their work across multiple stacks without taking up too much time or too many resources. Notably, it helps users optimize resource utilization and save costs in their day-to-day work.

Sources

https://aws.amazon.com/about-aws/whats-new/2022/04/amazon-CloudWatch-metrics-insights/

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html

https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html

https://trackit.io/managing-resources-effectively-aws-CloudWatch-trackit/

https://aws.amazon.com/cloudwatch/features/