April 4, 2026 · Tim Fraser, Cloud Operations Lead
Finding Single Points of Failure in Your SaaS Infrastructure
A single point of failure is any component that, if it fails, takes your application down. Every SaaS platform has them, usually more than the team realises. SPOFs don't show up during normal operations — they reveal themselves when something breaks, which is the worst time to discover them.
Here are the most common SPOFs on AWS and how to fix each one.
Single-AZ databases
This is the most common and most dangerous SPOF. A standard RDS instance runs in one Availability Zone. If that AZ has a hardware failure or network issue, your database goes offline — and for most SaaS products, that means a complete outage.
How to check: Run aws rds describe-db-instances and look at the MultiAZ field. If it says false, you have a single-AZ database.
How to fix: Enable Multi-AZ. AWS creates a standby replica in a different AZ with automatic failover. This roughly doubles your RDS cost for that instance, but for production databases it's not optional.
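The Multi-AZ check above can be scripted. The following is a minimal sketch, assuming a response shaped like the one boto3's describe_db_instances returns (a DBInstances list whose entries carry DBInstanceIdentifier and MultiAZ). The helper name single_az_instances and the sample identifiers are illustrative only.

```python
def single_az_instances(response):
    """Return identifiers of RDS instances that are not Multi-AZ.

    `response` is assumed to look like the dict returned by
    boto3.client("rds").describe_db_instances().
    """
    return [
        db["DBInstanceIdentifier"]
        for db in response.get("DBInstances", [])
        if not db.get("MultiAZ", False)
    ]


# Hypothetical sample data standing in for a real API response.
sample = {
    "DBInstances": [
        {"DBInstanceIdentifier": "prod-db", "MultiAZ": True},
        {"DBInstanceIdentifier": "reporting-db", "MultiAZ": False},
    ]
}

print(single_az_instances(sample))  # ['reporting-db']
```

Against a real account the response is paginated, so a production version would need to walk every page before drawing conclusions.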
No auto-scaling on the application tier
A fixed-capacity ECS service or EC2 fleet can't absorb traffic spikes. It also can't replace lost capacity under load: the smaller the fleet, the larger the share of capacity each task failure removes. Losing one of two tasks halves what you can serve.
How to check: Does the ECS service have scaling policies attached? A service with desiredCount: 2 and no scaling policy handles one task failure, but not a traffic increase.
How to fix: Add target tracking scaling based on CPU or request count. Set minimum count to at least 2, maximum to protect your budget, target utilisation around 60-70%.
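One way to run this check across many services is to compare the list of ECS services against the scalable targets registered with Application Auto Scaling. This is a sketch under assumptions: the input is shaped like the ScalableTargets list from describe_scalable_targets with ServiceNamespace "ecs", where ResourceId has the form service/<cluster>/<service>. The function name, cluster name, and service names are hypothetical.

```python
def services_without_scaling(services, scalable_targets, min_tasks=2):
    """Return (cluster, service, reason) for ECS services lacking a safe scaling setup.

    `services` is a list of (cluster_name, service_name) tuples.
    `scalable_targets` is assumed to be the "ScalableTargets" list from
    Application Auto Scaling's describe_scalable_targets(ServiceNamespace="ecs").
    """
    by_resource = {t["ResourceId"]: t for t in scalable_targets}
    findings = []
    for cluster, service in services:
        target = by_resource.get(f"service/{cluster}/{service}")
        if target is None:
            findings.append((cluster, service, "no scalable target registered"))
        elif target["MinCapacity"] < min_tasks:
            findings.append((cluster, service, f"minimum capacity below {min_tasks}"))
    return findings


# Hypothetical sample data.
services = [("prod", "api"), ("prod", "worker"), ("prod", "web")]
targets = [
    {"ResourceId": "service/prod/api", "MinCapacity": 2, "MaxCapacity": 10},
    {"ResourceId": "service/prod/web", "MinCapacity": 1, "MaxCapacity": 4},
]

for finding in services_without_scaling(services, targets):
    print(finding)
```

A registered scalable target does not by itself prove that a scaling policy is attached, so a fuller check would also inspect the output of describe_scaling_policies.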
Single NAT Gateway
If your application runs in private subnets, outbound traffic goes through NAT Gateways. A single NAT Gateway in one AZ means all private subnets depend on that AZ for outbound connectivity.
How to check: Look at VPC route tables. If all private subnets route 0.0.0.0/0 to the same NAT Gateway, that's a SPOF.
How to fix: Create a NAT Gateway in each AZ with private subnets. Update route tables to use the local one. This also eliminates cross-AZ data transfer charges for outbound traffic.
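The route table check can be automated by grouping subnets by the NAT Gateway their default route points at. The sketch below assumes input shaped like the RouteTables list from EC2's describe_route_tables, plus a subnet-to-AZ mapping that could be built from describe_subnets. The function name and all IDs are made up for illustration.

```python
from collections import defaultdict


def nat_gateways_spanning_azs(route_tables, subnet_az):
    """Return {nat_gateway_id: [azs]} for NAT Gateways whose default route
    is used by subnets in more than one Availability Zone.

    `route_tables` is assumed to be the "RouteTables" list from
    describe_route_tables; `subnet_az` maps subnet ID -> AZ name.
    """
    azs_by_nat = defaultdict(set)
    for table in route_tables:
        nat_ids = [
            route["NatGatewayId"]
            for route in table.get("Routes", [])
            if route.get("DestinationCidrBlock") == "0.0.0.0/0"
            and "NatGatewayId" in route
        ]
        for assoc in table.get("Associations", []):
            az = subnet_az.get(assoc.get("SubnetId"))
            if az is None:
                continue  # main-table association or unknown subnet
            for nat_id in nat_ids:
                azs_by_nat[nat_id].add(az)
    return {nat: sorted(azs) for nat, azs in azs_by_nat.items() if len(azs) > 1}


# Hypothetical sample data.
route_tables = [
    {
        "Routes": [{"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-aaa"}],
        "Associations": [{"SubnetId": "subnet-1"}, {"SubnetId": "subnet-2"}],
    },
    {
        "Routes": [{"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-bbb"}],
        "Associations": [{"SubnetId": "subnet-3"}],
    },
]
subnet_az = {"subnet-1": "eu-west-1a", "subnet-2": "eu-west-1b", "subnet-3": "eu-west-1c"}

print(nat_gateways_spanning_azs(route_tables, subnet_az))
```

This sketch ignores subnets that fall back to the VPC's main route table without an explicit association, so those would need separate handling.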
Unhealthy load balancer targets
An ALB with unhealthy targets is a warning sign. If 3 out of 4 targets are healthy, you're one failure from 50% capacity. Lenient health checks (30-second intervals, 5-check threshold) mean over two minutes to detect a failure.
How to check: Check Target Group health and the HealthyHostCount CloudWatch metric over the past week. Intermittent unhealthy targets suggest instability.
How to fix: Tighten health checks: 10-second interval, 5-second timeout, 2 consecutive failures for unhealthy. Ensure healthy host count stays above 2.
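The numbers in this section follow from simple arithmetic, worked through below. It is an approximation that treats detection time as interval multiplied by the failure threshold and ignores timeouts and check scheduling. Both function names are illustrative.

```python
def detection_seconds(interval_s, unhealthy_threshold):
    """Approximate time to mark a target unhealthy: the number of
    consecutive failed checks multiplied by the interval between checks."""
    return interval_s * unhealthy_threshold


def capacity_after_one_more_failure(healthy, total):
    """Fraction of total capacity left if one more healthy target fails."""
    return (healthy - 1) / total


lenient = detection_seconds(30, 5)  # 30-second interval, 5-check threshold
tight = detection_seconds(10, 2)    # 10-second interval, 2-check threshold
print(lenient, tight)               # 150 20

print(capacity_after_one_more_failure(3, 4))  # 0.5
```

Under these assumptions, tightening the health check cuts detection from about two and a half minutes to about twenty seconds.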
No failover for Redis or ElastiCache
A single-node ElastiCache cluster is a SPOF. When that node fails, cached data is lost and the application either errors or falls back to the database — which may not be provisioned for that load.
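Node count and failover status can be read from the API rather than the console. The following is a sketch, assuming input shaped like the ReplicationGroups list from ElastiCache's describe_replication_groups, where each entry has ReplicationGroupId, AutomaticFailover, and NodeGroups containing NodeGroupMembers. The function name and group names are hypothetical.

```python
def caches_without_failover(replication_groups):
    """Return (group_id, reason) pairs for replication groups that cannot fail over.

    `replication_groups` is assumed to be the "ReplicationGroups" list from
    ElastiCache's describe_replication_groups.
    """
    findings = []
    for group in replication_groups:
        group_id = group["ReplicationGroupId"]
        # Each node group (shard) needs a primary plus at least one replica.
        shard_sizes = [
            len(shard.get("NodeGroupMembers", []))
            for shard in group.get("NodeGroups", [])
        ]
        if not shard_sizes or min(shard_sizes) < 2:
            findings.append((group_id, "a shard has no replica"))
        elif group.get("AutomaticFailover") != "enabled":
            findings.append((group_id, "automatic failover is not enabled"))
    return findings


# Hypothetical sample data.
groups = [
    {
        "ReplicationGroupId": "sessions",
        "AutomaticFailover": "enabled",
        "NodeGroups": [{"NodeGroupMembers": [{"CacheClusterId": "sessions-001"},
                                             {"CacheClusterId": "sessions-002"}]}],
    },
    {
        "ReplicationGroupId": "page-cache",
        "AutomaticFailover": "disabled",
        "NodeGroups": [{"NodeGroupMembers": [{"CacheClusterId": "page-cache-001"}]}],
    },
    {
        "ReplicationGroupId": "rate-limits",
        "AutomaticFailover": "disabled",
        "NodeGroups": [{"NodeGroupMembers": [{"CacheClusterId": "rate-limits-001"},
                                             {"CacheClusterId": "rate-limits-002"}]}],
    },
]

for finding in caches_without_failover(groups):
    print(finding)
```

The sketch does not verify that replicas sit in different AZs, and standalone nodes created outside a replication group would likely need a separate look at describe_cache_clusters.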
How to check: Look at node count per cluster. One node with no replicas means no failover.
How to fix: Add a replica in a different AZ with automatic failover enabled. Use the cluster's primary endpoint, not a specific node endpoint, so failover is transparent to your application.
Systematic SPOF audit
Walk the request path from browser to database:
- DNS — Route 53 is inherently redundant
- CDN — CloudFront is redundant by design
- Load Balancer — redundant if healthy targets span multiple AZs
- Application tier — redundant if auto-scaled with min >= 2
- Cache layer — redundant if replicated with failover
- Database — redundant if Multi-AZ
- Outbound internet — redundant if NAT Gateway per AZ
Any "no" is a SPOF worth fixing.
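The walk along the request path can be recorded as an ordered checklist, so the result of each audit is easy to compare with the last. This is only a sketch: the True and False values below are invented sample answers, not the output of real checks, and find_spofs is an illustrative name.

```python
def find_spofs(layers):
    """Return the names of layers whose redundancy condition is not met.

    `layers` is an ordered list of (layer_name, is_redundant) pairs,
    following the request path from browser to database.
    """
    return [name for name, is_redundant in layers if not is_redundant]


# Hypothetical audit answers for one environment.
audit = [
    ("DNS", True),                # Route 53
    ("CDN", True),                # CloudFront
    ("Load balancer", True),      # healthy targets span multiple AZs
    ("Application tier", False),  # e.g. no auto-scaling, min below 2
    ("Cache layer", True),        # replicated with failover
    ("Database", False),          # e.g. single-AZ RDS
    ("Outbound internet", True),  # NAT Gateway per AZ
]

print(find_spofs(audit))  # ['Application tier', 'Database']
```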
How plainfra checks automatically
plainfra audits your AWS account for SPOFs as part of its weekly health report — single-AZ databases, services without auto-scaling, single NAT Gateways, unreplicated caches, and unhealthy targets.
You can also ask directly: "Do we have any single points of failure?" plainfra queries the relevant APIs and returns a specific list with recommendations.
SPOFs get introduced gradually — a new service with desiredCount: 1, a database created without Multi-AZ because it was "just for testing." A weekly check catches these before they cause an outage.