
April 4, 2026 · Tim Fraser, Cloud Operations Lead

Preventing AWS Outages: What to Monitor Before Things Break

Most AWS outages don't happen suddenly. They build. A disk fills up over three weeks. Database connections climb until one Friday afternoon they hit the limit. A TLS certificate expires because auto-renewal failed silently two months ago.

The pattern is almost always the same: a slow-moving problem that was visible in the metrics well before it became an incident. The question is whether anyone was watching.

Here are the warning signs that precede the most common AWS outages, and how to catch them early.

Disk filling up

EBS volumes and root volumes don't grow automatically. Application logs, temp files, database WAL files, and package caches accumulate until the disk hits 100% — and then the service crashes or becomes read-only.

What to watch: Disk utilisation percentage. It requires the CloudWatch agent; it isn't reported by default, so without the agent you're blind to this.
The threshold: Alert at 80%. By 95%, operations that need temporary disk space will start failing even though the disk isn't technically full.
Common triggers: Logs without rotation, /tmp directories that never get cleaned, database transaction logs growing faster than expected, Docker images accumulating without pruning.
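As a minimal sketch, the 80% alarm can be created with boto3's put_metric_alarm against the agent's disk_used_percent metric in the CWAgent namespace. The instance ID and path here are placeholders, and the dimensions must match what your agent actually publishes (by default it also emits device and fstype dimensions, which would need to be included).

```python
def disk_alarm_params(instance_id, path="/", threshold=80.0):
    """Build put_metric_alarm parameters for the CloudWatch agent's
    disk_used_percent metric (CWAgent namespace)."""
    return {
        "AlarmName": f"disk-used-{instance_id}-{path}",
        "Namespace": "CWAgent",
        "MetricName": "disk_used_percent",
        "Dimensions": [
            {"Name": "InstanceId", "Value": instance_id},
            {"Name": "path", "Value": path},
        ],
        "Statistic": "Average",
        "Period": 300,              # 5-minute datapoints
        "EvaluationPeriods": 1,     # fire on the first breach
        "Threshold": threshold,     # 80%, well before the 95% danger zone
        "ComparisonOperator": "GreaterThanThreshold",
    }

# With valid AWS credentials:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**disk_alarm_params("i-0123456789abcdef0"))
```

Add an AlarmActions key pointing at an SNS topic so the alert actually reaches someone.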

Database connection exhaustion

RDS instances have a maximum connection count determined by instance size. A db.t3.micro maxes out at roughly 66. A db.t3.medium allows around 150. Connection leaks or traffic spikes push you to the limit.

What to watch: The DatabaseConnections CloudWatch metric. Track both the current count and the weekly trend.
The threshold: Alert at 80% of max connections. A db.t3.medium at 120 out of 150 is one traffic spike away from cascading "too many connections" errors.
Common triggers: Code that opens connections without closing them, connection pool misconfiguration, cron jobs that open new connections for each run.
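A sketch of that 80% alarm, assuming boto3 and using the article's db.t3.medium figures (the DB identifier is a placeholder; plug in your instance's actual max_connections, which varies by engine and parameter group):

```python
def connection_alarm_threshold(max_connections, ratio=0.8):
    """80% of the instance's max_connections, rounded down."""
    return int(max_connections * ratio)

def rds_connection_alarm(db_instance_id, max_connections):
    """put_metric_alarm parameters for the RDS DatabaseConnections metric."""
    return {
        "AlarmName": f"db-connections-{db_instance_id}",
        "Namespace": "AWS/RDS",
        "MetricName": "DatabaseConnections",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": connection_alarm_threshold(max_connections),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }

# A db.t3.medium with ~150 max connections alerts at 120. With credentials:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**rds_connection_alarm("prod-db", 150))
```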

Certificate expiry

SSL/TLS certificates expire. ACM certificates with DNS validation auto-renew, but only if the validation DNS record still exists. Certificates imported into ACM or stored on instances don't auto-renew at all.

What to watch: ACM provides a DaysToExpiry metric. For certificates outside ACM, query the endpoint directly.
The threshold: Alert at 30 days before expiry. A certificate that expires at 3am on a Saturday will ruin a weekend.
Common triggers: DNS validation records deleted during migration, imported certificates without a renewal calendar, Let's Encrypt certs where the certbot cron was removed.
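Querying the endpoint directly needs nothing beyond the Python standard library. This sketch opens a TLS connection, reads the certificate's notAfter timestamp, and converts it to days remaining; wire the result into whatever alerting you already have.

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until(not_after, now=None):
    """Days until an OpenSSL-style notAfter string, e.g. 'Jun  1 12:00:00 2026 GMT'."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days

def cert_days_remaining(hostname, port=443):
    """Connect to the endpoint and return days until its certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            return days_until(tls.getpeercert()["notAfter"])

# e.g. alert when cert_days_remaining("example.com") < 30
```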

Auto-scaling hitting limits

Auto Scaling Groups have maximum capacity settings. Under a traffic spike, the ASG scales to max and then stops — even if the service needs more. Worse, if you're hitting EC2 service quotas for that instance type, launches fail silently.

What to watch: Compare GroupDesiredCapacity with GroupMaxSize. Watch for Failed launch activities in the ASG history.
The threshold: Alert when desired capacity reaches 80% of max. Also alert on any launch failure.
Common triggers: Service quotas for specific instance types, insufficient capacity in a single AZ, launch templates referencing deregistered AMIs.
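The headroom check can be sketched as a small script, assuming boto3: scan every Auto Scaling Group and flag any whose desired capacity has reached 80% of its maximum.

```python
def near_capacity(desired, max_size, ratio=0.8):
    """True when desired capacity has reached the given fraction of max."""
    return max_size > 0 and desired >= max_size * ratio

def groups_near_limit(autoscaling_client, ratio=0.8):
    """Yield (name, desired, max) for ASGs at or past the headroom threshold."""
    paginator = autoscaling_client.get_paginator("describe_auto_scaling_groups")
    for page in paginator.paginate():
        for group in page["AutoScalingGroups"]:
            if near_capacity(group["DesiredCapacity"], group["MaxSize"], ratio):
                yield (group["AutoScalingGroupName"],
                       group["DesiredCapacity"],
                       group["MaxSize"])

# With credentials:
# import boto3
# for name, desired, max_size in groups_near_limit(boto3.client("autoscaling")):
#     print(f"{name}: {desired}/{max_size}")
```

Pair it with a check of the ASG's scaling activities (describe_scaling_activities) to catch the silent launch failures.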

Memory leaks

EC2 instances and containers with memory leaks don't fail immediately. They slow down as the OS starts swapping, degrade further as swap fills, and eventually the OOM killer terminates processes.

What to watch: Memory utilisation requires the CloudWatch agent. For ECS/Fargate, Container Insights provides memory metrics. For Lambda, watch for increasing function timeouts.
The threshold: Alert at 85% sustained over 15 minutes. Sustained high memory trending upward over days is a leak.
Common triggers: Long-running processes caching without eviction, connection objects accumulating, logging frameworks buffering in memory.
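On EC2, the "85% sustained over 15 minutes" rule maps to an alarm on the agent's mem_used_percent metric with three consecutive 5-minute periods; a sketch, with the instance ID as a placeholder:

```python
def memory_alarm_params(instance_id, threshold=85.0):
    """put_metric_alarm parameters for the CloudWatch agent's mem_used_percent,
    requiring the threshold to be breached for 15 sustained minutes."""
    return {
        "AlarmName": f"memory-used-{instance_id}",
        "Namespace": "CWAgent",
        "MetricName": "mem_used_percent",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,           # 5-minute datapoints...
        "EvaluationPeriods": 3,  # ...three in a row = 15 minutes sustained
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }

# With credentials:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**memory_alarm_params("i-0123456789abcdef0"))
```

Sustained alarms filter out the brief spikes that are normal; the multi-day upward trend that signals a leak is easiest to spot on a dashboard graph of the same metric.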

Catching these automatically

Every one of these warning signs is detectable with the right CloudWatch alarms and agents. The problem is setting it all up correctly and maintaining it as your infrastructure changes.

plainfra's weekly health reports check for these conditions automatically. Disk approaching capacity, connection counts climbing, certificates nearing expiry, scaling groups near their limits — each gets flagged with the specific resource and current state. You can also ask directly: "Which certificates expire in the next 30 days?" or "Show me RDS instances with high connection counts."

Outage prevention isn't about sophisticated tooling. It's about consistently checking the boring metrics that show something is slowly going wrong.

Try plainfra free → 50K tokens, 7 days, no charge. Or see the interactive demo →.