Reduce CloudWatch alarms by combining metrics to reduce costs or improve auto scaling

Andreas Wittig – 17 Dec 2019

Every part of your AWS infrastructure emits utilization metrics. Amazon CloudWatch collects these metrics and allows you to visualize them as well as to define alarms. AWS recently announced an exciting new feature that allows you to combine multiple metrics: IF/AND/OR statements for metric math.

CloudWatch metric math

Combining CloudWatch metrics has several advantages:

  1. Simplify your monitoring configuration by reducing the number of CloudWatch alarms.
  2. Reduce costs by reducing the number of CloudWatch alarms (each alarm costs around USD 0.10 per month).
  3. Increase or decrease the desired capacity of an Auto Scaling Group according to multiple metrics (e.g., the typical bottlenecks CPU, memory, and network).

All you need to do is define a metric math expression that combines multiple metrics. Doing so results in a calculated metric. Next, you can define a CloudWatch alarm or a visualization based on the calculated metric.

Use metric math to combine multiple metrics

Let’s imagine the following scenario: you are using an Auto Scaling Group to launch EC2 instances. Typical bottlenecks of your virtual machines are CPU, memory, and network.

How do you get notified or scale out automatically when one of these resources gets scarce? And how do you get notified or scale in automatically when the resources are no longer being used?

The following screenshot shows four basic metrics:

  • CPU Utilization
  • Memory Utilization
  • Network In
  • Network Out

Please note that AWS does not provide a memory utilization metric by default. Therefore, I’m using the CloudWatch Agent to collect the data for a memory utilization metric.

Also, as explained in Monitoring EC2 Network Utilization, you need to combine the Network In and Network Out metrics to calculate the total network throughput. The Network Utilization metric expresses this throughput as a percentage of the maximum network throughput of the instance.
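To make the calculation concrete, here is a minimal Python sketch of the network utilization formula that appears in the CloudFormation template later in this article. The function and parameter names are hypothetical and chosen for illustration only. The defaults (a 300-second period and a maximum of about 0.75 Gbit/s for an m5.large) are taken from that template.

```python
def network_utilization_percent(network_in_bytes, network_out_bytes,
                                period_seconds=300, max_gbit_per_second=0.75):
    """Return the network utilization in percent for one period.

    Mirrors the expression ((network_in+network_out)/300/1000/1000/1000*8)/0.75*100.
    """
    total_bytes = network_in_bytes + network_out_bytes
    # Bytes per period -> bytes per second -> gigabytes per second -> Gbit/s
    gbit_per_second = total_bytes / period_seconds / 1000 / 1000 / 1000 * 8
    return gbit_per_second / max_gbit_per_second * 100


# Example input: 11.25 GB transferred in total within 5 minutes.
utilization = network_utilization_percent(5_625_000_000, 5_625_000_000)
print(f"{utilization:.1f} %")
```

Replace the maximum throughput with the value that fits your instance type.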

However, I want to draw your attention to the Summary Utilization metric:

IF(cpu > 70, 1, 0) OR IF(memory > 75, 1, 0) OR IF(network > 80, 1, 0)
  • If the CPU utilization is above 70%, the metric math expression will return 1.
  • If the memory utilization is above 75%, the metric math expression will return 1.
  • If the network utilization is above 80%, the metric math expression will return 1.
  • Otherwise, the metric math expression will return 0.
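The logic of the expression can be emulated in plain Python as a sanity check. This is only a sketch: the function name is hypothetical, and CloudWatch evaluates the expression for every datapoint of a time series, not for single numbers as shown here.

```python
def summary_utilization(cpu, memory, network):
    """Emulate IF(cpu > 70, 1, 0) OR IF(memory > 75, 1, 0) OR IF(network > 80, 1, 0)."""
    cpu_high = 1 if cpu > 70 else 0
    memory_high = 1 if memory > 75 else 0
    network_high = 1 if network > 80 else 0
    # OR yields 1 as soon as at least one operand is non-zero, otherwise 0.
    return 1 if (cpu_high or memory_high or network_high) else 0


print(summary_utilization(cpu=35, memory=60, network=20))  # no bottleneck
print(summary_utilization(cpu=35, memory=90, network=20))  # memory is scarce
```

Note that all three comparisons are strict: a CPU utilization of exactly 70% does not count as high.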

[Screenshot: Metric Math Expression]

Next, define a CloudWatch alarm based on the Summary Utilization metric. Use 1 for the threshold.

[Screenshot: Metric Math Expression]

The alarm will transition into the ALARM state when the CPU utilization is above 70%, or the memory utilization is above 75%, or the network utilization is above 80%. Configure the CloudWatch alarm to send a notification or increase the desired capacity of the Auto Scaling Group.
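The following Python sketch illustrates how such an alarm could decide between OK and ALARM. It is a deliberately simplified model, not the actual CloudWatch implementation: it ignores missing data and the INSUFFICIENT_DATA state, and the function name is hypothetical. The defaults (threshold 1, greater than or equal, 1 out of 1 datapoints) match the CloudFormation template in this article.

```python
def alarm_state(datapoints, threshold=1, evaluation_periods=1, datapoints_to_alarm=1):
    """Simplified 'M out of N' evaluation with GreaterThanOrEqualToThreshold."""
    # Only the most recent N datapoints are considered.
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in recent if value >= threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Datapoints of the Summary Utilization metric, oldest first.
print(alarm_state([0, 0, 1]))
print(alarm_state([1, 1, 0]))
```

Because the calculated metric only takes the values 0 and 1, a threshold of 1 means that a single breaching datapoint is enough to trigger the alarm.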

Do you prefer Infrastructure as Code? The following code snippet shows how to create the CloudWatch alarm with the help of CloudFormation.

Note: The example assumes that you are running an m5.large instance with a maximum network throughput of about 0.75 Gbit/s.

AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  AutoScalingGroupName:
    Type: String
Resources:
  EC2HighUtilization:
    Type: 'AWS::CloudWatch::Alarm'
    Properties:
      AlarmDescription: 'EC2 High Utilization: CPU, memory, or network'
      Metrics:
      - Id: summary
        Label: EC2 Utilization
        Expression: IF(cpu > 70, 1, 0) OR IF(memory > 75, 1, 0) OR IF(network > 80, 1, 0)
        ReturnData: true
      - Id: cpu
        MetricStat:
          Metric:
            Namespace: AWS/EC2
            MetricName: CPUUtilization
            Dimensions:
            - Name: AutoScalingGroupName
              Value: !Ref AutoScalingGroupName
          Stat: Maximum
          Period: 300
        ReturnData: false
      - Id: memory
        MetricStat:
          Metric:
            Namespace: CWAgent
            MetricName: mem_used_percent
            Dimensions:
            - Name: AutoScalingGroupName
              Value: !Ref AutoScalingGroupName
          Stat: Maximum
          Period: 300
        ReturnData: false
      - Id: network
        Label: Network Utilization
        Expression: "((network_in+network_out)/300/1000/1000/1000*8)/0.75*100"
        ReturnData: false
      - Id: network_in
        MetricStat:
          Metric:
            Namespace: AWS/EC2
            MetricName: NetworkIn
            Dimensions:
            - Name: AutoScalingGroupName
              Value: !Ref AutoScalingGroupName
          Stat: Sum
          Period: 300
        ReturnData: false
      - Id: network_out
        MetricStat:
          Metric:
            Namespace: AWS/EC2
            MetricName: NetworkOut
            Dimensions:
            - Name: AutoScalingGroupName
              Value: !Ref AutoScalingGroupName
          Stat: Sum
          Period: 300
        ReturnData: false
      ComparisonOperator: GreaterThanOrEqualToThreshold
      EvaluationPeriods: 1
      DatapointsToAlarm: 1
      Threshold: '1'
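If you prefer an SDK over CloudFormation, the same alarm definition could be expressed as parameters for the put_metric_alarm call of boto3. The following is a sketch under assumptions: the helper names, the alarm name ec2-high-utilization, and the group name my-asg are hypothetical. The block only builds the parameters and does not call AWS.

```python
def metric_query(query_id, namespace, metric_name, stat, asg_name, period=300):
    """Build one raw metric entry scoped to an Auto Scaling Group."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": namespace,
                "MetricName": metric_name,
                "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
            },
            "Period": period,
            "Stat": stat,
        },
        "ReturnData": False,
    }


def build_alarm_params(asg_name, alarm_name="ec2-high-utilization"):
    """Build the keyword arguments for put_metric_alarm (alarm name is hypothetical)."""
    return {
        "AlarmName": alarm_name,
        "AlarmDescription": "EC2 High Utilization: CPU, memory, or network",
        "Metrics": [
            {
                "Id": "summary",
                "Label": "EC2 Utilization",
                "Expression": "IF(cpu > 70, 1, 0) OR IF(memory > 75, 1, 0) "
                              "OR IF(network > 80, 1, 0)",
                "ReturnData": True,  # the alarm watches only this calculated metric
            },
            metric_query("cpu", "AWS/EC2", "CPUUtilization", "Maximum", asg_name),
            metric_query("memory", "CWAgent", "mem_used_percent", "Maximum", asg_name),
            {
                "Id": "network",
                "Label": "Network Utilization",
                "Expression": "((network_in+network_out)/300/1000/1000/1000*8)/0.75*100",
                "ReturnData": False,
            },
            metric_query("network_in", "AWS/EC2", "NetworkIn", "Sum", asg_name),
            metric_query("network_out", "AWS/EC2", "NetworkOut", "Sum", asg_name),
        ],
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "Threshold": 1.0,
    }


params = build_alarm_params("my-asg")
print(len(params["Metrics"]), "metric entries")
```

With valid AWS credentials, the parameters could then be passed on via boto3.client("cloudwatch").put_metric_alarm(**params). As in the template, exactly one entry returns data.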

That’s all. Happy monitoring!

Summary

As CloudWatch metric math supports IF/AND/OR statements, it is possible to aggregate multiple metrics into a single metric. Doing so allows you to scale an Auto Scaling Group based on multiple metrics as well as reduce the number of CloudWatch alarms, which reduces costs.

