Scale-up and scale-down CloudWatch alarms

February 8, 2024

I have an ECS service that I want to scale up and down depending on how many items are in an SQS queue.

resource "aws_cloudwatch_metric_alarm" "sqs_scale_up" {
  alarm_name = "scale-up"

  comparison_operator       = "GreaterThanOrEqualToThreshold"
  evaluation_periods        = "1"
  metric_name               = "ApproximateNumberOfMessagesVisible"
  namespace                 = "AWS/SQS"
  period                    = "60"
  threshold                 = "1"
  statistic                 = "Sum"
  alarm_description         = "Increase task count"
  insufficient_data_actions = []
  alarm_actions             = [aws_appautoscaling_policy.scale_up.arn]

  dimensions = {
    QueueName = aws_sqs_queue.this.name
  }
}

resource "aws_cloudwatch_metric_alarm" "sqs_scale_down" {
  alarm_name          = "scale-down"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = "1"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = "60"
  threshold           = "1"
  statistic           = "Sum"
  alarm_description   = "Decrease task count"
  alarm_actions       = [aws_appautoscaling_policy.scale_down.arn]

  dimensions = {
    QueueName = aws_sqs_queue.this.name
  }
}

Since I have one alarm for count >= 1 and one alarm for count < 1, does this mean that one of these alarms will always be in the alarm state?

Is this normal?

>Solution :

Don’t panic over the word ‘ALARM’. Instead, think of it as saying that the condition is TRUE.

If there are any messages in the queue, you presumably want to scale out from a "nothing is running" state. Therefore, you want the scale-out alarm to be TRUE. However, you need to set a limit so that it doesn’t keep scaling out indefinitely; the service might need only one task.
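
As a rough sketch of such a limit, the scaling target can cap the task count with max_capacity. The cluster and service references (aws_ecs_cluster.this, aws_ecs_service.this), the cap of 1, and the step and cooldown values are assumptions for illustration; they are not taken from the question:

```hcl
# Hypothetical scaling target: resource names and capacities are assumptions.
resource "aws_appautoscaling_target" "ecs" {
  service_namespace  = "ecs"
  scalable_dimension = "ecs:service:DesiredCount"
  resource_id        = "service/${aws_ecs_cluster.this.name}/${aws_ecs_service.this.name}"
  min_capacity       = 0 # allow scaling in to "nothing is running"
  max_capacity       = 1 # upper limit, so the alarm cannot keep adding tasks
}

resource "aws_appautoscaling_policy" "scale_up" {
  name               = "scale-up"
  policy_type        = "StepScaling"
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  resource_id        = aws_appautoscaling_target.ecs.resource_id

  step_scaling_policy_configuration {
    adjustment_type         = "ChangeInCapacity"
    cooldown                = 60
    metric_aggregation_type = "Maximum"

    # Add one task whenever the alarm threshold is breached.
    step_adjustment {
      metric_interval_lower_bound = 0
      scaling_adjustment          = 1
    }
  }
}
```

Even while the scale-out alarm stays TRUE, the desired count cannot go above max_capacity.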

When the queue is empty, you want to scale in. However, you don’t want to flip-flop between the two states. The general rule is "scale out quickly, but scale in slowly". Therefore, the scale-in alarm should use a longer evaluation window (e.g. 10 minutes) before deciding to scale in.
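
One possible way to express this is to keep the 60-second period on the scale-down alarm from the question but require ten consecutive periods below the threshold. This is only a sketch; the 10-minute figure is the example value mentioned above, not a recommendation for every workload:

```hcl
resource "aws_cloudwatch_metric_alarm" "sqs_scale_down" {
  alarm_name          = "scale-down"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 10 # ten consecutive 60-second periods, i.e. 10 minutes
  period              = 60
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  threshold           = 1
  statistic           = "Sum"
  alarm_description   = "Decrease task count after the queue has stayed empty"
  alarm_actions       = [aws_appautoscaling_policy.scale_down.arn]

  dimensions = {
    QueueName = aws_sqs_queue.this.name
  }
}
```

The scale-up alarm keeps evaluation_periods at 1, so it still reacts within about a minute.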

Thus, there might not always be an alarm in the TRUE (ALARM) state. If there are no messages in the queue, then the scale-out alarm will be FALSE. Likewise, if the sum of ApproximateNumberOfMessagesVisible over the previous 10 minutes is not zero, then the scale-in alarm won’t be TRUE either. In that case, both alarms will be FALSE, so nothing will change at that time. This is good.