Adding CloudWatch Monitoring to My AWS Auto Scaling Setup
This week, I learned about AWS CloudWatch monitoring. I took my Auto Scaling project and added monitoring to it. Now I can see what’s happening with my infrastructure and get alerts when something goes wrong.
In my previous project, I had Auto Scaling working, but I couldn’t really “see” what was happening. Was my CPU high? Was my website down? I had to check manually. Now with CloudWatch, I get automatic notifications!
What’s the Same and What’s Different?
What stayed the same from Auto Scaling project:
- VPC configuration (same networks)
- Application Load Balancer (same setup)
- Auto Scaling Group with Launch Template
- Target Tracking policies (CPU and request count)
- Scheduled scaling (7 AM and 5 PM)
- SNS notifications for scaling events
- All the same 13 files (c1 through c13)
What’s new in CloudWatch project:
- CloudWatch Alarms – Get alerts when things go wrong
- Synthetics Canary – Robot that checks if my website is up
- CIS Compliance Alarms – Security monitoring
- Simple Scaling Policy – Another way to scale
- Five new files added (c14-01 through c14-05)
Important: The CloudWatch project is just the Auto Scaling project + monitoring. Everything from Auto Scaling is still there; I just added more tools to watch it!
Why I Needed CloudWatch
In my Auto Scaling project, I had these questions:
- Is my website actually working?
- When did my CPU go high?
- How many 4xx errors am I getting?
- Did someone try to access my AWS account from a strange place?
Without CloudWatch, I had to:
- Log into AWS Console
- Check each metric manually
- Look at graphs
- Hope I notice problems
With CloudWatch, AWS watches for me and sends me emails automatically!
The Four New Monitoring Components
Let me explain each new file I created.
1. ASG CPU Alarm with Simple Scaling (c14-02)
This is different from the Target Tracking policy I already had. Let me explain the difference.
What I already had (Target Tracking):
```hcl
# From c13-06-autoscaling-ttsp.tf
resource "aws_autoscaling_policy" "avg_cpu_policy_greater_than_xx" {
  name        = "avg-cpu-policy-greater-than-xx"
  policy_type = "TargetTrackingScaling"
  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 50.0
  }
}
```
What I added (Simple Scaling):
```hcl
# From c14-02-cloudwatch-asg-alarms.tf
resource "aws_autoscaling_policy" "high_cpu" {
  name                   = "high-cpu"
  scaling_adjustment     = 4
  adjustment_type        = "ChangeInCapacity"
  cooldown               = 300
  autoscaling_group_name = aws_autoscaling_group.my_asg.name
}

resource "aws_cloudwatch_metric_alarm" "app1_asg_cwa_cpu" {
  alarm_name          = "App1-ASG-CWA-CPUUtilization"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "2"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = "120"
  statistic           = "Average"
  threshold           = "80"
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.my_asg.name
  }
  alarm_description = "This metric monitors ec2 cpu utilization and triggers the ASG Scaling policy to scale-out if CPU is above 80%"
  ok_actions        = [aws_sns_topic.myasg_sns_topic.arn]
  alarm_actions = [
    aws_autoscaling_policy.high_cpu.arn,
    aws_sns_topic.myasg_sns_topic.arn
  ]
}
```
The difference:
| Feature | Target Tracking (50%) | Simple Scaling (80%) |
| --- | --- | --- |
| When it triggers | CPU goes above/below 50% | CPU goes above 80% |
| How much it scales | AWS decides automatically | I decide: Add 4 instances |
| Cooldown | AWS manages | I set: 300 seconds |
| Purpose | Normal operations | Emergency situations |
Why have both?
- Target Tracking at 50% = gentle, smooth scaling for normal traffic
- Simple Scaling at 80% = fast emergency response for sudden spikes
Example scenario:
- Normal day: CPU at 40% → No action
- Traffic increases: CPU at 55% → Target Tracking slowly adds 1 instance
- Sudden spike: CPU jumps to 85% → Simple Scaling adds 4 instances immediately!
- I get email: “High CPU alarm triggered!”
Important parts explained:
- evaluation_periods = "2": Check 2 times before triggering
- period = "120": Each check is 2 minutes
- threshold = "80": Trigger when CPU >= 80%
- scaling_adjustment = 4: Add 4 instances when alarm triggers (a matching scale-in sketch follows this list)
- cooldown = 300: Wait 5 minutes before scaling again
- ok_actions: Send email when alarm returns to OK
- alarm_actions: Scale out AND send email when alarm triggers
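The project only defines the scale-out side. For illustration, here is a minimal sketch of what a matching scale-in policy and alarm could look like; the resource names (high-level `low_cpu`, `app1_asg_cwa_cpu_low`) and the 30% threshold are my own assumptions, not part of the original code:

```hcl
# Hypothetical scale-in counterpart (not in the original project).
# Removes one instance when average CPU stays at or below 30%.
resource "aws_autoscaling_policy" "low_cpu" {
  name                   = "low-cpu"
  scaling_adjustment     = -1                 # negative value scales in
  adjustment_type        = "ChangeInCapacity"
  cooldown               = 300
  autoscaling_group_name = aws_autoscaling_group.my_asg.name
}

resource "aws_cloudwatch_metric_alarm" "app1_asg_cwa_cpu_low" {
  alarm_name          = "App1-ASG-CWA-CPUUtilization-Low"
  comparison_operator = "LessThanOrEqualToThreshold"
  evaluation_periods  = "2"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = "120"
  statistic           = "Average"
  threshold           = "30"                  # assumed threshold for illustration
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.my_asg.name
  }
  alarm_description = "Scale in when average CPU stays at or below 30%"
  alarm_actions     = [aws_autoscaling_policy.low_cpu.arn]
}
```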
2. ALB 4xx Error Monitoring (c14-03)
This alarm watches my Application Load Balancer for client errors.
```hcl
resource "aws_cloudwatch_metric_alarm" "alb_4xx_errors" {
  alarm_name          = "App1-ALB-HTTP-4xx-errors"
  comparison_operator = "GreaterThanThreshold"
  datapoints_to_alarm = "2"
  evaluation_periods  = "3"
  metric_name         = "HTTPCode_Target_4XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = "120"
  statistic           = "Sum"
  threshold           = "5"
  treat_missing_data  = "missing"
  dimensions = {
    LoadBalancer = module.alb.arn_suffix
  }
  alarm_description = "This metric monitors ALB HTTP 4xx errors and if they are above 100 in specified interval, it is going to send a notification email"
  ok_actions        = [aws_sns_topic.myasg_sns_topic.arn]
  alarm_actions     = [aws_sns_topic.myasg_sns_topic.arn]
}
```
What this monitors:
- 4xx errors = Client-side errors
  - 400 Bad Request
  - 403 Forbidden
  - 404 Not Found
  - etc.
How the alarm works:
- evaluation_periods = "3": Look at last 3 periods (6 minutes total)
- datapoints_to_alarm = "2": Must exceed threshold in 2 out of 3 periods
- threshold = "5": More than 5 errors in 2 minutes
Example scenario:
```
Time 0-2 min: 3 errors → OK
Time 2-4 min: 7 errors → Above threshold (1/3)
Time 4-6 min: 8 errors → Above threshold (2/3)
Result: Alarm triggers! Send email!
```
Why this is useful:
- If users are getting lots of 404 errors, maybe I have broken links
- If getting lots of 403 errors, maybe my security settings are wrong
- I set threshold to 5 for learning, but in production I’d use 100 or 200
Note: The code says “above 100” in the description but threshold is “5”. This is from the demo – I should update the description to match.
Other metrics I can monitor (mentioned in the file's comments; an example alarm sketch follows the list):
Per ALB metrics:
- HTTPCode_ELB_5XX_Count (server errors)
- HTTPCode_ELB_502_Count (bad gateway)
- TargetResponseTime (how fast instances respond)
Per Target Group metrics:
- UnHealthyHostCount (failing instances)
- HealthyHostCount (working instances)
- HTTPCode_Target_4XX_Count (what I'm using)
- TargetResponseTime (response time per target)
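For example, a similar alarm on UnHealthyHostCount would email me whenever an instance fails its health checks. This is only a sketch: per-target-group metrics need both the LoadBalancer and TargetGroup dimensions, and the exact way to get the target group's arn_suffix depends on the ALB module version, so the `module.alb.target_groups["mytg1"].arn_suffix` reference below is an assumption rather than code from the project:

```hcl
# Hypothetical alarm (not in the original project): notify when any target is unhealthy.
resource "aws_cloudwatch_metric_alarm" "alb_unhealthy_hosts" {
  alarm_name          = "App1-ALB-UnHealthyHostCount"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "2"
  metric_name         = "UnHealthyHostCount"
  namespace           = "AWS/ApplicationELB"
  period              = "60"
  statistic           = "Maximum"
  threshold           = "1"
  dimensions = {
    LoadBalancer = module.alb.arn_suffix
    # Assumed reference; adjust to however your ALB module exposes the target group arn_suffix.
    TargetGroup = module.alb.target_groups["mytg1"].arn_suffix
  }
  alarm_description = "At least one target is failing ALB health checks"
  ok_actions        = [aws_sns_topic.myasg_sns_topic.arn]
  alarm_actions     = [aws_sns_topic.myasg_sns_topic.arn]
}
```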
3. CIS Compliance Alarms (c14-04)
CIS stands for “Center for Internet Security”. They publish security best practices. This module creates many alarms to check if someone is doing suspicious things in my AWS account.
```hcl
resource "aws_cloudwatch_log_group" "cis_log_group" {
  name = "cis-log-group-${random_pet.this.id}"
}

module "all_cis_alarms" {
  source  = "terraform-aws-modules/cloudwatch/aws//modules/cis-alarms"
  version = "5.6.0"

  disabled_controls = ["DisableOrDeleteCMK", "VPCChanges"]
  log_group_name    = aws_cloudwatch_log_group.cis_log_group.name
  alarm_actions     = [aws_sns_topic.myasg_sns_topic.arn]
  tags              = local.common_tags
}
```
What this module creates: The CIS alarms module creates about 14-15 different alarms automatically! I don’t have to write each one. Here are some examples:
- UnauthorizedAPICalls – Someone tried to use AWS APIs without permission
- ConsoleSignInWithoutMFA – Someone logged in without two-factor authentication
- RootAccountUsage – Someone used the root account (bad practice!)
- IAMPolicyChanges – Someone changed security permissions
- CloudTrailChanges – Someone tried to disable logging
- ConsoleAuthenticationFailure – Multiple failed login attempts
- S3BucketPolicyChanges – Someone changed S3 bucket permissions
- NetworkACLChanges – Someone changed network rules
- SecurityGroupChanges – Someone changed firewall rules
- RouteTableChanges – Someone changed routing
What I disabled:
```hcl
disabled_controls = ["DisableOrDeleteCMK", "VPCChanges"]
```
- DisableOrDeleteCMK: I disabled this because I'm not using AWS KMS (encryption keys) in this project
- VPCChanges: I disabled this because during learning, I might change my VPC often
How it works:
1. AWS CloudTrail logs all actions in my account (see the wiring sketch after this list)
2. CloudWatch Logs stores these logs
3. CloudWatch Alarms watch for suspicious patterns
4. If something suspicious happens, I get an email
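One thing the c14-04 file alone doesn't show is how CloudTrail events actually reach that log group. Here is a minimal sketch of the wiring, assuming a pre-existing S3 bucket for the trail (`my-cloudtrail-bucket` is a placeholder) and including the IAM role CloudTrail needs to write to CloudWatch Logs; none of these resource names come from the project:

```hcl
# Hypothetical wiring (not shown in the project files): CloudTrail -> CloudWatch Logs.
# Assumes an existing S3 bucket "my-cloudtrail-bucket" that already has a CloudTrail bucket policy.

# Role CloudTrail assumes to write into the CIS log group
resource "aws_iam_role" "cloudtrail_to_cw_logs" {
  name = "cloudtrail-to-cw-logs"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "cloudtrail.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "cloudtrail_to_cw_logs" {
  name = "cloudtrail-to-cw-logs"
  role = aws_iam_role.cloudtrail_to_cw_logs.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["logs:CreateLogStream", "logs:PutLogEvents"]
      Resource = "${aws_cloudwatch_log_group.cis_log_group.arn}:*"
    }]
  })
}

resource "aws_cloudtrail" "cis_trail" {
  name                       = "cis-trail"
  s3_bucket_name             = "my-cloudtrail-bucket" # placeholder bucket name
  is_multi_region_trail      = true
  cloud_watch_logs_group_arn = "${aws_cloudwatch_log_group.cis_log_group.arn}:*"
  cloud_watch_logs_role_arn  = aws_iam_role.cloudtrail_to_cw_logs.arn
}
```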
Example alert:
```
Subject: ALARM: "CIS-UnauthorizedAPICalls" in US East (N. Virginia)
Body: Someone made 5 unauthorized API calls from IP 203.0.113.42
```
Why this is important:
- If someone steals my AWS keys, I get notified
- If someone tries to hack my account, I know immediately
- Meets security compliance requirements
4. CloudWatch Synthetics Canary (c14-05)
This is the most interesting part! A “canary” is like a robot that visits my website every minute to check if it’s working. The name comes from coal mines where they used real canaries to detect dangerous gas.
```hcl
# IAM Policy for Canary
resource "aws_iam_policy" "cw_canary_iam_policy" {
  name        = "cw-canary-iam-policy"
  path        = "/"
  description = "CloudWatch Canary Synthetic IAM Policy"
  policy = jsonencode({
    "Version" : "2012-10-17",
    "Statement" : [
      {
        "Sid" : "VisualEditor0",
        "Effect" : "Allow",
        "Action" : "cloudwatch:PutMetricData",
        "Resource" : "*",
        "Condition" : {
          "StringEquals" : {
            "cloudwatch:namespace" : "CloudWatchSynthetics"
          }
        }
      },
      {
        "Sid" : "VisualEditor1",
        "Effect" : "Allow",
        "Action" : [
          "s3:PutObject",
          "logs:CreateLogStream",
          "s3:ListAllMyBuckets",
          "logs:CreateLogGroup",
          "logs:PutLogEvents",
          "s3:GetBucketLocation",
          "xray:PutTraceSegments"
        ],
        "Resource" : "*"
      }
    ]
  })
}

# IAM Role
resource "aws_iam_role" "cw_canary_iam_role" {
  name                = "cw-canary-iam-role"
  description         = "CloudWatch Synthetics lambda execution role for running canaries"
  path                = "/service-role/"
  assume_role_policy  = "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"lambda.amazonaws.com\"},\"Action\":\"sts:AssumeRole\"}]}"
  managed_policy_arns = [aws_iam_policy.cw_canary_iam_policy.arn]
}

# S3 Bucket for screenshots and artifacts
resource "aws_s3_bucket" "cw_canary_bucket" {
  bucket        = "cw-canary-bucket-${random_pet.this.id}"
  force_destroy = true
  tags = {
    Name        = "My bucket"
    Environment = "Dev"
  }
}

resource "aws_s3_bucket_ownership_controls" "example" {
  bucket = aws_s3_bucket.cw_canary_bucket.id
  rule {
    object_ownership = "BucketOwnerPreferred"
  }
}

resource "aws_s3_bucket_acl" "example" {
  depends_on = [aws_s3_bucket_ownership_controls.example]
  bucket     = aws_s3_bucket.cw_canary_bucket.id
  acl        = "private"
}

# The actual Canary
resource "aws_synthetics_canary" "sswebsite2" {
  name                 = "sswebsite2"
  artifact_s3_location = "s3://${aws_s3_bucket.cw_canary_bucket.id}/sswebsite2"
  execution_role_arn   = aws_iam_role.cw_canary_iam_role.arn
  handler              = "sswebsite2.handler"
  zip_file             = "sswebsite2/sswebsite2v1.zip"
  runtime_version      = "syn-nodejs-puppeteer-6.0"
  start_canary         = true

  run_config {
    active_tracing     = true
    memory_in_mb       = 960
    timeout_in_seconds = 60
  }

  schedule {
    expression = "rate(1 minute)"
  }
}

# Alarm for Canary failures
resource "aws_cloudwatch_metric_alarm" "synthetics_alarm_app1" {
  alarm_name          = "Synthetics-Alarm-App1"
  comparison_operator = "LessThanThreshold"
  datapoints_to_alarm = "1"
  evaluation_periods  = "1"
  metric_name         = "SuccessPercent"
  namespace           = "CloudWatchSynthetics"
  period              = "300"
  statistic           = "Average"
  threshold           = "90"
  treat_missing_data  = "breaching"
  dimensions = {
    CanaryName = aws_synthetics_canary.sswebsite2.id
  }
  alarm_description = "Synthetics alarm metric: SuccessPercent LessThanThreshold 90"
  ok_actions        = [aws_sns_topic.myasg_sns_topic.arn]
  alarm_actions     = [aws_sns_topic.myasg_sns_topic.arn]
}
```
What this does – step by step:
- IAM Role and Policy: Gives the canary permission to:
  - Write metrics to CloudWatch
  - Save screenshots to S3
  - Write logs
  - Use AWS X-Ray for tracing
- S3 Bucket: Stores:
  - Screenshots when pages load
  - Screenshots when pages fail
  - HAR files (network activity logs)
  - Logs of what the canary did
- The Canary itself:
  - name: “sswebsite2” – my canary name
  - runtime: Uses Node.js with Puppeteer (headless Chrome browser)
  - schedule: Runs every 1 minute
  - memory: 960 MB (enough to run a browser)
  - timeout: 60 seconds maximum
  - zip_file: Contains the JavaScript code that visits my website
- Alarm: Triggers if success rate drops below 90%
The Canary Script (sswebsite2.js):
This file is in sswebsite2/nodejs/node_modules/sswebsite2.js. Let me explain what it does:
```javascript
const URL = require('url');
const synthetics = require('Synthetics');
const log = require('SyntheticsLogger');
const syntheticsConfiguration = synthetics.getConfiguration();

const loadBlueprint = async function () {
    // The website to check
    const urls = ['https://rezaops.com'];
    const takeScreenshot = true;

    syntheticsConfiguration.disableStepScreenshots();
    syntheticsConfiguration.setConfig({
        continueOnStepFailure: true
    });

    let page = await synthetics.getPage();
    for (const url of urls) {
        await loadUrl(page, url, takeScreenshot);
    }
};

const loadUrl = async function (page, url, takeScreenshot) {
    let stepName = URL.parse(url).hostname;

    await synthetics.executeStep(stepName, async function () {
        const response = await page.goto(url, {
            waitUntil: ['domcontentloaded'],
            timeout: 30000
        });
        if (response) {
            const status = response.status();
            // If status is not 2xx, throw error
            if (status < 200 || status > 299) {
                throw `Failed to load url: ${url} ${status}`;
            }
        } else {
            throw new Error(`No response returned for url: ${url}`);
        }
    });

    // Wait 15 seconds for the page to fully load, then take a screenshot
    if (takeScreenshot) {
        await new Promise((resolve) => setTimeout(resolve, 15000));
        await synthetics.takeScreenshot(stepName, 'loaded');
    }
};

exports.handler = async () => {
    return await loadBlueprint();
};
```
What the script does:
- Opens a headless Chrome browser
- Visits https://rezaops.com (my domain)
- Waits for the page to load (domcontentloaded event)
- Checks if HTTP status is 200-299 (success)
- Waits 15 seconds for everything to load
- Takes a screenshot
- Saves screenshot to S3
- Reports success/failure to CloudWatch
What I Learned
Technical Lessons
- Target Tracking vs Simple Scaling: Both have their place. Target Tracking for gradual scaling, Simple Scaling for emergencies.
- Alarm evaluation is not instant: If I set evaluation_periods=2 and period=120, it takes 4 minutes minimum to trigger the alarm. This is okay for most things, but for critical alerts I might want faster checking.
- Synthetics are powerful but expensive: Running every minute costs about $52/month. For learning, every 5 minutes is enough.
- CIS alarms are great for security: I get 14-15 security alarms with just one module. This would take hours to set up manually!
- Screenshots are super helpful: When my canary fails, the screenshot shows me exactly what the error page looks like.
- SNS is the bridge: All my alarms send notifications through the same SNS topic. One email subscription gets all alerts (a sketch of that wiring follows this list).
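The topic itself comes from the Auto Scaling project (c13), so none of the c14 files define it. As a reminder of what that wiring looks like, here is a minimal sketch; the topic name and email address are placeholders, not the project's actual values:

```hcl
# Sketch of the shared SNS topic and email subscription (defined in the c13 Auto Scaling project).
resource "aws_sns_topic" "myasg_sns_topic" {
  name = "myasg-sns-topic" # assumed name
}

resource "aws_sns_topic_subscription" "myasg_sns_topic_subscription" {
  topic_arn = aws_sns_topic.myasg_sns_topic.arn
  protocol  = "email"
  endpoint  = "me@example.com" # placeholder; the address must confirm the subscription email
}
```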
Practical Tips
- Don’t set thresholds too sensitive: If I set ALB 4xx threshold to 1, I’ll get alarms all day. Start with 5-10 for learning, then adjust.
- Test your alarms: After deploying, I intentionally broke my site to make sure alarms actually trigger and I get emails.
- Organize alarm names well: I prefix with “App1-” so I know which app the alarm is for. In production with multiple apps, this is very important.
- Check alarm history: CloudWatch keeps history of when alarms triggered. This helps me understand patterns.
- S3 costs: Synthetics saves lots of screenshots. After a few days, I should set up an S3 lifecycle policy to delete old screenshots automatically (a sketch follows this list).
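A minimal sketch of what that lifecycle rule could look like on the canary bucket; the 7-day window matches what I eventually settled on, but the rule name is just an illustration:

```hcl
# Hypothetical lifecycle rule: expire canary artifacts (screenshots, HAR files, logs) after 7 days.
resource "aws_s3_bucket_lifecycle_configuration" "cw_canary_bucket_lifecycle" {
  bucket = aws_s3_bucket.cw_canary_bucket.id

  rule {
    id     = "expire-old-canary-artifacts"
    status = "Enabled"

    filter {} # apply to the whole bucket

    expiration {
      days = 7
    }
  }
}
```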
Mistakes I Made
- Forgot to confirm SNS subscription: Deployed everything, wondered why no emails. Checked spam folder, found confirmation email from 2 days ago!
- Set ALB threshold too low: Set it to 1 error, got 50 emails in first hour. Changed to 5.
- Canary timeout too short: Set timeout to 10 seconds. Page takes 15 seconds to load. Canary always failed! Changed to 60 seconds.
- Wrong URL in canary script: Script checked stacksimplify.com instead of my domain. All checks passed, but it wasn't actually checking my site!
- Didn't disable VPCChanges CIS alarm: During learning, I change my VPC often. Got an alarm every time. Disabled it.
- Forgot about S3 costs: Let canary run for a month. S3 bucket had 43,000 screenshots! Changed to delete after 7 days.