April 4, 2026 · Tim Fraser, Cloud Operations Lead

Reliability Monitoring for SaaS on AWS: What to Watch

When you run a SaaS product on AWS, uptime monitoring is table stakes. But the outages that actually hurt SaaS companies aren't the dramatic failures — they're the slow-burn problems that build over weeks until something breaks at 2 AM on a Thursday.

Here's what to watch beyond basic uptime.

Database storage growth

Every RDS instance has a storage ceiling: the allocated storage, or the configured maximum if storage autoscaling is enabled. Autoscaling buys time, but only up to that maximum. At some point your database hits the cap and writes start failing.

The fix isn't just monitoring free storage — it's monitoring the rate of growth. A database using 60% of storage isn't concerning. A database using 60% that was at 40% three months ago is worth investigating. On Aurora, watch VolumeBytesUsed. On standard RDS, watch FreeStorageSpace. Set an alert at 80%, but more importantly, track the trend.
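A minimal sketch of trend tracking with boto3, assuming read-only CloudWatch access. The `days_until_full` helper is our own illustration (a simple linear projection), not an AWS API:

```python
from datetime import datetime, timedelta, timezone

def days_until_full(used_now_gb, used_then_gb, days_between, capacity_gb):
    """Linear projection of when storage hits capacity; None if not growing."""
    growth_per_day = (used_now_gb - used_then_gb) / days_between
    if growth_per_day <= 0:
        return None  # flat or shrinking usage: no projected exhaustion
    return (capacity_gb - used_now_gb) / growth_per_day

def fetch_free_storage(instance_id, days=90):
    """Daily FreeStorageSpace averages (bytes) for a standard RDS instance."""
    import boto3  # assumes read-only CloudWatch credentials are configured
    cw = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="FreeStorageSpace",  # use VolumeBytesUsed on Aurora
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=now - timedelta(days=days),
        EndTime=now,
        Period=86400,  # one datapoint per day
        Statistics=["Average"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])

# The 40% -> 60% example above, on a 100 GB volume over 90 days:
# days_until_full(60, 40, 90, 100) -> 180.0 days of headroom left
```

Linear projection is deliberately crude; the point is to alert on the trend, not to forecast precisely.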

Connection pool saturation

Every RDS instance has a maximum number of connections, set by default from the instance's memory, so it varies by instance class and engine. A db.t3.micro maxes out around 66 by default; a db.r6g.large supports around 1,000. If your application opens connections without releasing them, or you scale out the application tier without scaling the database, you hit the limit and new requests fail.

Watch DatabaseConnections in CloudWatch. If you're routinely above 80% of max, you have a problem waiting to happen.
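One way to wire that up, sketched with boto3 and assuming credentials that allow `cloudwatch:PutMetricAlarm`; the SNS topic and the 80% threshold helper are our own assumptions:

```python
def connection_alarm_threshold(max_connections, fraction=0.8):
    """Alarm threshold as a connection count, e.g. 80% of the engine max."""
    return int(max_connections * fraction)

def create_connection_alarm(instance_id, max_connections, sns_topic_arn):
    """CloudWatch alarm that fires on sustained connection saturation."""
    import boto3  # assumes write access to CloudWatch alarms
    cw = boto3.client("cloudwatch")
    cw.put_metric_alarm(
        AlarmName=f"{instance_id}-connection-saturation",
        Namespace="AWS/RDS",
        MetricName="DatabaseConnections",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        Statistic="Maximum",
        Period=300,
        EvaluationPeriods=3,  # 15 minutes sustained, not a brief spike
        Threshold=connection_alarm_threshold(max_connections),
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[sns_topic_arn],
    )
```

Using three five-minute evaluation periods filters out momentary spikes from deploys or cron jobs.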

ECS container memory limits

ECS tasks have hard memory limits. When a container exceeds its limit, ECS kills it. If this happens during a traffic spike, you get cascading restarts — each restart drops requests, increasing load on remaining containers, pushing them toward their limits too.

Monitor MemoryUtilization at the task level, not the service level. A service-level average of 60% can hide individual tasks at 95%. Also watch for memory that climbs steadily without dropping back — that's a leak, and restarts are inevitable.
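Per-task memory figures live in the Container Insights performance log group, so one approach is a Logs Insights query plus a small filter over the results. A sketch, assuming Container Insights is enabled on the cluster; `tasks_over_threshold` and its sample shape are our own illustration:

```python
def tasks_over_threshold(task_samples, limit_pct=90.0):
    """Return task IDs whose memory use exceeds limit_pct of the reservation.
    task_samples: [{"task_id": ..., "used_mb": ..., "reserved_mb": ...}, ...]
    """
    return [
        s["task_id"]
        for s in task_samples
        if 100.0 * s["used_mb"] / s["reserved_mb"] > limit_pct
    ]

TASK_MEMORY_QUERY = """
fields @timestamp, TaskId, MemoryUtilized, MemoryReserved
| filter Type = "Task"
| sort @timestamp desc
"""

def query_task_memory(cluster_name):
    """Start a Logs Insights query against the Container Insights log group."""
    import boto3  # assumes read-only CloudWatch Logs access
    from datetime import datetime, timedelta, timezone
    logs = boto3.client("logs")
    now = datetime.now(timezone.utc)
    return logs.start_query(
        logGroupName=f"/aws/ecs/containerinsights/{cluster_name}/performance",
        startTime=int((now - timedelta(hours=1)).timestamp()),
        endTime=int(now.timestamp()),
        queryString=TASK_MEMORY_QUERY,
    )
```

This is exactly the service-average trap: a task at 950 MB of a 1 GB reservation gets flagged even when the service average looks healthy.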

Queue depth

If you use SQS for background processing, watch ApproximateNumberOfMessagesVisible. A growing queue means consumers can't keep up. The dangerous scenario is a queue that looks fine during business hours but grows during batch jobs and never fully drains.

Set alerts on queue depth and on ApproximateAgeOfOldestMessage. If messages sit for more than a few minutes, something is wrong.
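Both metrics come from the AWS/SQS CloudWatch namespace. A sketch of the "never fully drains" check, assuming read-only CloudWatch access; `fully_drains` and its near-zero cutoff are our own heuristic:

```python
from datetime import datetime, timedelta, timezone

def fully_drains(depth_samples, near_zero=10):
    """True if queue depth returned to near zero at some point in the window."""
    if not depth_samples:
        return True
    return min(depth_samples) <= near_zero

def fetch_queue_metric(queue_name, metric_name, hours=24):
    """Hourly maxima for an SQS CloudWatch metric, oldest first."""
    import boto3  # assumes read-only CloudWatch credentials
    cw = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName=metric_name,  # e.g. "ApproximateNumberOfMessagesVisible"
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
        Period=3600,
        Statistics=["Maximum"],
    )
    points = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
    return [p["Maximum"] for p in points]
```

Run the drain check over a full 24-hour window so the overnight batch-job growth is included, not just business hours.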

Certificate expiry

ACM certificates auto-renew, mostly. Auto-renewal fails when DNS validation records get removed, when the certificate is attached to a resource in a region you've forgotten about, or when domain DNS is managed outside Route 53 and someone changed the CNAME.

A failed renewal doesn't cause an outage immediately — it causes one on the expiry date, weeks or months later. Check DaysToExpiry across all regions. Alert at 30 days.
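The cross-region sweep can be done directly against ACM, computing days remaining from each certificate's `NotAfter`. A sketch assuming read-only ACM access in every listed region (pagination omitted for brevity):

```python
from datetime import datetime, timezone

def days_to_expiry(not_after, now=None):
    """Whole days until a certificate's NotAfter timestamp."""
    now = now or datetime.now(timezone.utc)
    return (not_after - now).days

def expiring_certificates(regions, warn_days=30):
    """Scan ACM in every region; flag certificates expiring within warn_days."""
    import boto3  # assumes read-only ACM credentials valid in all regions
    findings = []
    for region in regions:
        acm = boto3.client("acm", region_name=region)
        for summary in acm.list_certificates()["CertificateSummaryList"]:
            cert = acm.describe_certificate(
                CertificateArn=summary["CertificateArn"]
            )["Certificate"]
            remaining = days_to_expiry(cert["NotAfter"])
            if remaining <= warn_days:
                findings.append((region, cert["DomainName"], remaining))
    return findings
```

Iterating every region is the point: the forgotten certificate is almost always in a region your dashboards don't show.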

Single points of failure

The most dangerous reliability problem doesn't show up in metrics: a resource with no redundancy. A single-AZ RDS instance. A single NAT Gateway. An ECS service with desiredCount: 1. An ElastiCache node without a replica.

For each critical resource, ask: what happens if this disappears? If the answer is "outage," it needs redundancy.
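Two of those checks, single-AZ RDS and `desiredCount: 1` ECS services, can be answered with read-only describe calls. A sketch under that assumption; the finding strings are our own format:

```python
def single_points_of_failure(db_instances, ecs_services):
    """Flag resources with no redundancy from already-fetched API responses.
    db_instances: DescribeDBInstances["DBInstances"];
    ecs_services: DescribeServices["services"].
    """
    findings = []
    for db in db_instances:
        if not db.get("MultiAZ", False):
            findings.append(f"single-AZ RDS: {db['DBInstanceIdentifier']}")
    for svc in ecs_services:
        if svc.get("desiredCount", 0) <= 1:
            findings.append(f"ECS desiredCount<=1: {svc['serviceName']}")
    return findings

def scan_account():
    """Fetch the inventory with read-only describe calls, then run the check."""
    import boto3  # assumes read-only RDS and ECS credentials
    rds = boto3.client("rds")
    ecs = boto3.client("ecs")
    dbs = rds.describe_db_instances()["DBInstances"]
    services = []
    for cluster in ecs.list_clusters()["clusterArns"]:
        arns = ecs.list_services(cluster=cluster)["serviceArns"]
        if arns:
            services += ecs.describe_services(
                cluster=cluster, services=arns[:10]  # API caps at 10 per call
            )["services"]
    return single_points_of_failure(dbs, services)
```

NAT Gateways and ElastiCache replicas follow the same pattern with `ec2:DescribeNatGateways` and `elasticache:DescribeReplicationGroups`.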

How plainfra helps

plainfra connects to your AWS account with read-only access and checks all of the above as a weekly report. It identifies single-AZ databases, monitors storage growth trends, flags high connection counts, checks certificate expiry across all regions, and catches queue depth anomalies.

Most SaaS teams intend to check these things regularly but don't, because feature work always wins. A report that arrives without anyone remembering to run it catches slow-burn problems while they're still fixable — not after the 2 AM page.

Try plainfra free → 50K tokens, 7 days, no charge. Or see the interactive demo →.