Runbook: CloudWatch Alarm observing the AWS/SQS ApproximateAgeOfOldestMessage metric
SQS reports the ApproximateAgeOfOldestMessage metric to CloudWatch. The metric reports the age (in seconds) of the oldest non-deleted message in the queue.
A large number can indicate:
- Errors while processing messages
- Not enough capacity to process messages fast enough
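To look at the metric outside the console, it can be read from CloudWatch directly. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured. The function names, the 3-hour window, and the 5-minute period are illustrative choices, not part of this runbook.

```python
from datetime import datetime, timedelta, timezone


def oldest_message_age_query(queue_name, hours=3, period_seconds=300, now=None):
    """Build keyword arguments for a CloudWatch GetMetricStatistics call."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Maximum"],
    }


def max_age_seconds(datapoints):
    """Return the largest 'Maximum' value among the datapoints, or None if there are none."""
    values = [point["Maximum"] for point in datapoints]
    return max(values) if values else None


def fetch_max_age(queue_name):
    """Query CloudWatch for the worst message age in the window (needs AWS access)."""
    import boto3  # imported lazily so the helpers above work without AWS access

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**oldest_message_age_query(queue_name))
    return max_age_seconds(response["Datapoints"])
```

Calling fetch_max_age with your queue name returns the highest observed age in seconds, or None if CloudWatch has no datapoints for the window.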
A step-by-step guide to reacting to a CloudWatch Alarm observing the ApproximateAgeOfOldestMessage metric:
Are you reading this runbook in Slack? If yes, proceed to step 1. If not:
a. Go to the SQS service in the AWS Management Console
b. Select your message queue
c. Proceed to step 2.
1. Follow the Details Quick Link to access the SQS management console.
2. Select the Monitoring tab.
3. Set the Time Range to Last 3 Days.
4. Is there a sudden drop in the NumberOfMessagesDeleted metric? If not, proceed to step 5. If yes, your message processing worker might have stopped working. Typically, message processing workers run in an Elastic Beanstalk worker environment, on EC2 instances in Auto Scaling Groups (ASG), or as Lambda functions. Check the log files to investigate processing errors:
a. Elastic Beanstalk worker environment: Go to the environment page of the Elastic Beanstalk service in the AWS Management Console, navigate to Logs, choose Request Logs, and then choose Full Logs. Search for errors in the logs.
b. EC2+ASG: Collecting logs from EC2 instances is not standardized. Find out where logs are shipped to and search for errors.
c. Lambda function: Continue with the AWS Lambda Errors metric runbook.
d. End of runbook.
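What counts as a "sudden drop" in NumberOfMessagesDeleted is a judgment call. One way to make it explicit is to compare the most recent datapoints against the earlier baseline. This is a sketch only: the function name, the 3-point window, and the 50% threshold are assumptions and should be tuned to your workload.

```python
from statistics import mean


def has_sudden_drop(deleted_counts, recent_points=3, drop_ratio=0.5):
    """Report whether recent deletions fell well below the earlier baseline.

    deleted_counts is a chronological list of NumberOfMessagesDeleted sums,
    one value per CloudWatch period.
    """
    if len(deleted_counts) <= recent_points:
        return False  # not enough history to compare against
    baseline = mean(deleted_counts[:-recent_points])
    recent = mean(deleted_counts[-recent_points:])
    if baseline == 0:
        return False  # nothing was being deleted before, so there is no drop
    return recent < baseline * drop_ratio
```

A result of True suggests the workers stopped or slowed, which is the case where checking the logs as described above is worthwhile.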
5. Is the NumberOfMessagesSent metric over the last few hours higher than in the past? If not, proceed to step 6. If yes, more messages are being sent to the queue than usual. You might want to increase the capacity of the workers. Keep in mind that your downstream dependencies might also need to be scaled (such as databases or other services):
a. Elastic Beanstalk worker environment: Go to the environment page of the Elastic Beanstalk service in the AWS Management Console, navigate to Configuration, look for the Capacity section and click Modify, ensure that Environment type is set to Load balanced, and increase the Max value. Apply the changes.
b. EC2+ASG: Go to the Auto Scaling Groups page of the EC2 service in the AWS Management Console. Select your group and increase the Desired Capacity. Keep in mind that the desired capacity must be less than or equal to the max size, so you might also need to increase the max size. If your group uses scaling policies, you might also review them to scale faster in the future.
c. Lambda function: A Lambda function scales automatically, but the number of concurrent executions is limited (per region and optionally per function). Go to the Dashboard page of the Lambda service in the AWS Management Console. Is the Throttles metric higher than zero? If yes, check the Concurrency settings of your Lambda function. Is concurrency reserved? If yes, increase it. If not, you are running into the regional limit. You can increase it with a Service Limit Increase via the AWS Support Center.
d. End of runbook.
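For the EC2+ASG case, the constraint that the desired capacity may not exceed the max size can be handled in code. The sketch below assumes boto3 and suitable IAM permissions. The function names and the group name in the usage note are hypothetical.

```python
def plan_asg_capacity(current_desired, current_max, extra_instances):
    """Compute new DesiredCapacity and MaxSize, raising MaxSize only if needed."""
    if extra_instances < 1:
        raise ValueError("extra_instances must be at least 1")
    new_desired = current_desired + extra_instances
    new_max = max(current_max, new_desired)  # desired must stay <= max size
    return {"DesiredCapacity": new_desired, "MaxSize": new_max}


def apply_asg_capacity(group_name, plan):
    """Apply the plan to the Auto Scaling Group (needs AWS access)."""
    import boto3  # imported lazily so the planning helper works without AWS access

    autoscaling = boto3.client("autoscaling")
    autoscaling.update_auto_scaling_group(AutoScalingGroupName=group_name, **plan)
```

For example, apply_asg_capacity("my-worker-asg", plan_asg_capacity(4, 5, 3)) would request 7 instances and raise the max size to 7 in the same call.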
6. You are running into an unknown error. End of runbook.
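The decision flow of steps 4 to 6 can be summarized as a small function. This is only an illustration of the order in which the checks are made. The function name and the returned strings are made up for this sketch.

```python
def triage(deleted_dropped, sent_increased):
    """Map the two metric observations to the runbook step that applies."""
    if deleted_dropped:
        # NumberOfMessagesDeleted dropped suddenly: workers may have stopped.
        return "step 4: check worker logs for processing errors"
    if sent_increased:
        # NumberOfMessagesSent is higher than usual: workers may be too few.
        return "step 5: increase worker capacity"
    return "step 6: unknown error"
```

A drop in deletions is checked first, because adding capacity does not help when the workers fail on every message.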