ELBs can also be deployed in multiple Availability Zones. In this configuration, each Availability Zone’s end-point will have a separate IP address. A single Domain Name will point to all of the end-points’ IP addresses. When a client, such as a web browser, queries DNS with a Domain Name, it receives the IP address (“A”) records of all of the ELBs in random order. While some clients only process a single IP address, many (such as newer versions of web-browsers) will retry the subsequent IP addresses if they fail to connect to the first. A large number of non-browser clients only operate with a single IP address.
For multi-Availability Zone ELBs, the ELB service maintains ELBs redundantly in the Availability Zones a customer requests them to be in so that failure of a single machine or datacenter won’t take down the end-point. The ELB service avoids impact (even for clients which can only process a single IP address) by detecting failure and eliminating the problematic ELB instance’s IP address from the list returned by DNS. The ELB control plane processes all management events for ELBs including traffic shifts due to failure, size scaling for ELB due to traffic growth, and addition and removal of EC2 instances from association with a given ELB.
During the disruption this past Friday night, the control plane (which encompasses calls to add a new ELB, scale an ELB, add EC2 instances to an ELB, and remove traffic from ELBs) began performing traffic shifts to account for the loss of load balancers in the affected Availability Zone. As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before. The bug caused the ELB control plane to attempt to scale these ELBs to larger ELB instance sizes. This resulted in a sudden flood of requests which began to backlog the control plane. At the same time, customers began launching new EC2 instances to replace capacity lost in the impacted Availability Zone, requesting the instances be added to existing load balancers in the other zones. These requests further increased the ELB control plane backlog. Because the ELB control plane currently manages requests for the US East-1 Region through a shared queue, it fell increasingly behind in processing these requests; and pretty soon, these requests started taking a very long time to complete.„