April 4, 2026 · Tim Fraser, Cloud Operations Lead
Reliability Monitoring for SaaS on AWS: What to Watch
When you run a SaaS product on AWS, uptime monitoring is table stakes. But the outages that actually hurt SaaS companies aren't the dramatic failures — they're the slow-burn problems that build over weeks until something breaks at 2 AM on a Thursday.
Here's what to watch beyond basic uptime.
Database storage growth
Every RDS instance has a storage ceiling. Storage autoscaling buys time, but not infinite time. At some point the database hits the cap and writes start failing.
The fix isn't just monitoring free storage — it's monitoring the rate of growth. A database using 60% of storage isn't concerning. A database using 60% that was at 40% three months ago is worth investigating. On Aurora, watch VolumeBytesUsed. On standard RDS, watch FreeStorageSpace. Set an alert at 80%, but more importantly, track the trend.
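Projecting time-to-full from two usage samples is simple arithmetic worth automating. A minimal sketch (the readings, volume size, and sampling interval below are invented, and the function name is mine):

```python
def days_until_full(used_now: float, used_before: float,
                    capacity: float, interval_days: float):
    """Project days until storage is exhausted from two usage samples.

    Returns None if usage is flat or shrinking.
    """
    growth_per_day = (used_now - used_before) / interval_days
    if growth_per_day <= 0:
        return None
    return (capacity - used_now) / growth_per_day

# 300 GB used today, 200 GB used 90 days ago, on a 500 GB volume:
# growing ~1.1 GB/day, roughly 180 days of headroom left.
print(days_until_full(300, 200, 500, 90))
```

Feed it the same metric on both sides (FreeStorageSpace converted to used bytes on standard RDS, VolumeBytesUsed on Aurora) and alert when the projection drops below your provisioning lead time.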
Connection pool saturation
Every RDS instance has a maximum number of connections, derived by default from the instance's memory and varying by engine. A db.t3.micro allows roughly 80–110 depending on engine; a db.r6g.large supports well over 1,000. If your application opens connections without releasing them, or you scale out the application tier without scaling the database, you hit the limit and new requests fail.
Watch DatabaseConnections in CloudWatch. If you're routinely above 80% of max, you have a problem waiting to happen.
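The check itself is a one-liner once you have the numbers. A sketch, assuming you've already pulled the current DatabaseConnections value and know the instance max (the helper names are mine; the PostgreSQL default formula LEAST(DBInstanceClassMemory / 9531392, 5000) is the documented RDS default):

```python
def connection_headroom(current: int, max_connections: int,
                        warn_ratio: float = 0.8) -> dict:
    """Flag connection usage at or above a warning ratio of the max."""
    ratio = current / max_connections
    return {"ratio": round(ratio, 2), "warn": ratio >= warn_ratio}

def postgres_default_max_connections(memory_bytes: int) -> int:
    """RDS PostgreSQL default: LEAST(DBInstanceClassMemory / 9531392, 5000)."""
    return min(memory_bytes // 9531392, 5000)

print(connection_headroom(850, 1000))  # {'ratio': 0.85, 'warn': True}
print(postgres_default_max_connections(16 * 1024**3))  # 16 GiB instance -> 1802
```

If you've raised max_connections in a parameter group, use that value rather than the formula.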
ECS container memory limits
ECS tasks have hard memory limits. When a container exceeds its limit, ECS kills it. If this happens during a traffic spike, you get cascading restarts — each restart drops requests, increasing load on remaining containers, pushing them toward their limits too.
Monitor MemoryUtilization at the task level, not the service level. A service-level average of 60% can hide individual tasks at 95%. Also watch for memory that climbs steadily without dropping back — that's a leak, and restarts are inevitable.
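A leak heuristic can be as simple as checking that task-level samples only ever climb and never drop back. This is an illustrative sketch with an invented threshold, not a substitute for proper anomaly detection:

```python
def looks_like_leak(samples: list, min_rise: float = 5.0) -> bool:
    """Heuristic: MemoryUtilization that rises between every consecutive
    sample and never drops back suggests a leak, not load-driven usage."""
    if len(samples) < 3:
        return False
    rising = all(b >= a for a, b in zip(samples, samples[1:]))
    return rising and (samples[-1] - samples[0]) >= min_rise

# Task-level MemoryUtilization (%) sampled hourly:
print(looks_like_leak([52, 55, 59, 63, 68]))   # True: monotonic climb
print(looks_like_leak([52, 70, 48, 66, 50]))   # False: load-driven swings
```

Run it per task, not per service, for the reason above: a service average smooths away exactly the signal you need.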
Queue depth
If you use SQS for background processing, watch ApproximateNumberOfMessagesVisible. A growing queue means consumers can't keep up. The dangerous scenario is a queue that looks fine during business hours but grows during batch jobs and never fully drains.
Set alerts on queue depth and on ApproximateAgeOfOldestMessage. If messages sit for more than a few minutes, something is wrong.
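Both checks, plus the never-fully-drains pattern, can be combined into one evaluation over a day of depth samples. A sketch (the sample data and thresholds are invented; the two metric names are real SQS CloudWatch metrics):

```python
def queue_health(depth_samples: list, oldest_age_seconds: int,
                 max_age_seconds: int = 300) -> list:
    """Flag SQS warning signs from a window of
    ApproximateNumberOfMessagesVisible samples plus the current
    ApproximateAgeOfOldestMessage."""
    findings = []
    if min(depth_samples) > 0:
        findings.append("queue never fully drains")
    if depth_samples[-1] > depth_samples[0]:
        findings.append("queue depth trending up")
    if oldest_age_seconds > max_age_seconds:
        findings.append("oldest message exceeds age threshold")
    return findings

# Depth sampled over 24h; queue bottoms out at 12, never at zero:
print(queue_health([120, 400, 80, 35, 12], oldest_age_seconds=900))
```

The min-over-window check is the one that catches the "looks fine during business hours" case that a point-in-time alert misses.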
Certificate expiry
ACM certificates auto-renew, mostly. Auto-renewal fails when DNS validation records get removed, when the certificate is attached to a resource in a region you've forgotten about, or when domain DNS is managed outside Route 53 and someone changed the CNAME.
A failed renewal doesn't cause an outage immediately — it causes one on the expiry date, weeks or months later. Check DaysToExpiry across all regions. Alert at 30 days.
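The arithmetic is trivial but worth scripting so it runs against every region. A sketch, assuming you've already fetched each certificate's NotAfter timestamp (ACM's describe_certificate returns one; the function names here are mine):

```python
from datetime import datetime, timezone

def days_to_expiry(not_after: datetime, now: datetime) -> int:
    return (not_after - now).days

def expiring_soon(not_after: datetime, now: datetime,
                  threshold_days: int = 30) -> bool:
    return days_to_expiry(not_after, now) < threshold_days

now = datetime(2026, 4, 4, tzinfo=timezone.utc)
print(expiring_soon(datetime(2026, 4, 20, tzinfo=timezone.utc), now))  # True
print(expiring_soon(datetime(2026, 9, 1, tzinfo=timezone.utc), now))   # False
```

Remember that CloudFront certificates live in us-east-1 regardless of where the rest of your stack runs, so a single-region check will miss them.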
Single points of failure
The most dangerous reliability problem doesn't show up in metrics: a resource with no redundancy. A single-AZ RDS instance. A single NAT Gateway. An ECS service with desiredCount: 1. An ElastiCache node without a replica.
For each critical resource, ask: what happens if this disappears? If the answer is "outage," it needs redundancy.
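That question can be turned into a mechanical audit. The dicts below stand in for real describe-call output (MultiAZ and desiredCount are real AWS fields; the overall shape and the "replicas" key are simplifications of mine):

```python
def single_points_of_failure(resources: list) -> list:
    """Flag resources whose disappearance would mean an outage."""
    flagged = []
    for r in resources:
        if r["type"] == "rds" and not r.get("MultiAZ", False):
            flagged.append(f"{r['id']}: single-AZ RDS instance")
        elif r["type"] == "ecs" and r.get("desiredCount", 0) < 2:
            flagged.append(f"{r['id']}: ECS service with one task")
        elif r["type"] == "elasticache" and r.get("replicas", 0) == 0:
            flagged.append(f"{r['id']}: cache node without a replica")
    return flagged

print(single_points_of_failure([
    {"type": "rds", "id": "prod-db", "MultiAZ": False},
    {"type": "ecs", "id": "api", "desiredCount": 1},
    {"type": "elasticache", "id": "sessions", "replicas": 1},
]))
```

A real version would walk the describe APIs per service, but the shape of the check is the same: enumerate, test for redundancy, report.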
How to prioritise
- Outage imminent: certificate expiry under 14 days, storage over 90%, connection pools over 90%
- Outage eventually: growing queues, memory leaks, single points of failure
- Performance degradation: oversized or undersized resources, missing autoscaling
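The tiers above reduce to a small triage function (the thresholds are the ones listed; the field names are illustrative):

```python
def triage(finding: dict) -> str:
    """Map a finding to one of the three priority tiers."""
    if finding.get("cert_days_left", 999) < 14:
        return "outage imminent"
    if finding.get("storage_pct", 0) > 90 or finding.get("conn_pct", 0) > 90:
        return "outage imminent"
    if finding.get("kind") in {"growing_queue", "memory_leak", "spof"}:
        return "outage eventually"
    return "performance degradation"

print(triage({"cert_days_left": 10}))   # outage imminent
print(triage({"kind": "memory_leak"}))  # outage eventually
```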
How plainfra helps
plainfra connects to your AWS account with read-only access and checks all of the above as a weekly report. It identifies single-AZ databases, monitors storage growth trends, flags high connection counts, checks certificate expiry across all regions, and catches queue depth anomalies.
Most SaaS teams intend to check these things regularly but don't, because feature work always wins. A report that arrives without anyone remembering to run it catches slow-burn problems while they're still fixable — not after the 2 AM page.
Try plainfra free → 50K tokens, 7 days, no charge. Or see the interactive demo →.