Engineers traced the issue to a routine update that went wrong

- The outage affecting Amazon Web Services (AWS) on Monday originated in the company's US-EAST-1 region (Northern Virginia).
- AWS reports the root cause was an internal subsystem failure tied to the health monitoring of network load balancers (which route traffic between servers) within its EC2 internal network.
- A key symptom was a failure of DNS (Domain Name System) resolution for endpoints of AWS's DynamoDB database service in that region, which triggered cascading service errors and widespread disruption of dependent applications.
Engineers at Amazon Web Services think they have pinpointed the cause of Monday's massive service interruption, which caused thousands of websites and apps to go dark.
AWS's status updates show that the trouble began in the US-EAST-1 region and manifested as increased error rates and high latency. AWS later described the issue as stemming from a subsystem that monitors the health of its network load balancers.
The failure started in one of AWS's main data centers in Northern Virginia after a routine update to an API (application programming interface), the software connection through which different programs exchange information. The update was being made to an important database service that stores information for websites.
According to engineers, there was an apparent error in the update that eventually disabled the Domain Name System (DNS), which translates website addresses into a series of numbers, known as IP addresses.
Without a properly operating DNS, apps and websites could no longer be matched to their IP addresses.
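To make that failure mode concrete, here is a minimal sketch (in Python, not anything published by AWS) of what a client sees when name resolution breaks: the lookup that normally turns DynamoDB's regional endpoint name into IP addresses raises an error, so no connection can even be attempted.

```python
import socket

# dynamodb.us-east-1.amazonaws.com is the standard regional endpoint name
# for DynamoDB in US-EAST-1; port 443 is used because clients connect over HTTPS.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    # Resolve the hostname to IP addresses, the step every SDK or HTTP
    # client performs before it can open a connection.
    addresses = {info[4][0] for info in socket.getaddrinfo(ENDPOINT, 443)}
    print(f"{ENDPOINT} resolves to: {sorted(addresses)}")
except socket.gaierror as exc:
    # During the outage, lookups like this one failed, so requests never
    # reached the service at all.
    print(f"DNS resolution failed for {ENDPOINT}: {exc}")
```

Any SDK or HTTP client must complete a lookup like this before it can open a connection to a new host, which is why a DNS failure surfaces as errors across applications that are otherwise healthy.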
After the system failed, other AWS services also began to go dark. Reportedly, as many as 113 services were blocked by the outage. By Monday afternoon ET, AWS said most systems were restored.
Why this was especially disruptive
- US-EAST-1 is one of AWS's largest and most heavily used regions; many services default to it.
- When DNS resolution fails for a core API endpoint (like DynamoDB's), the effect isn't limited to one application: many downstream systems depend on that service.
- The combination of internal routing failures and name-resolution failures creates what appears to be a cascading outage: one part stops, others that rely on it fail, and still more parts go offline. A minimal sketch of that chain reaction appears below.
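The sketch below (illustrative Python with made-up service names, not AWS's actual internal architecture) shows why a single low-level failure fans out: each service that depends on a failed component, directly or through intermediaries, becomes unavailable in turn.

```python
# A toy dependency graph: each service lists what it needs to function.
# The names are illustrative; this is not AWS's real service topology.
DEPENDS_ON = {
    "checkout-app": ["orders-api"],
    "reporting-jobs": ["orders-api"],
    "orders-api": ["dynamodb-endpoint"],
    "dynamodb-endpoint": ["dns-resolution"],
    "dns-resolution": [],
}

def unavailable_services(failed: str) -> set[str]:
    """Return every service that stops working once `failed` goes down."""
    down = {failed}
    changed = True
    while changed:
        changed = False
        for service, deps in DEPENDS_ON.items():
            if service not in down and any(dep in down for dep in deps):
                down.add(service)
                changed = True
    return down

# One low-level failure (name resolution) takes out every layer above it.
print(sorted(unavailable_services("dns-resolution")))
# ['checkout-app', 'dns-resolution', 'dynamodb-endpoint', 'orders-api', 'reporting-jobs']
```

Run against this toy graph, a failure in name resolution alone marks every other service as down, mirroring how one broken subsystem on Monday reportedly took as many as 113 AWS services offline with it.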
What remains unclear
- Exactly why the health-monitoring subsystem failed (design flaw? configuration error? software bug?)
- Whether the load-balancer health-monitoring failure directly triggered the DNS issues, or whether the DNS failure was a separate consequence.
- Whether AWS has taken all the mitigations required to prevent such failure modes in the future.
- The full scope of backlogs, and whether any data or services were permanently affected.
Monday's outage highlights how much the global internet ecosystem depends on a few big cloud providers, and on single regions within them.
Posted: 2025-10-21 12:17:26