Load balancer update.

We have determined that block-lb02 suffered a partial failure. Memory
starvation caused the parts of its software stack that load balance
requests to stop functioning, while the parts that respond to heartbeat
requests kept working. As a result, the other unit in this highly available
pair (block-lb01) could not detect the failure and did not take over. This
continued for about 5 minutes until we pulled block-lb02 off the network,
at which point block-lb01 detected the failure and took over.
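
To illustrate the failure mode described above, here is a minimal sketch in
Python of a takeover decision that requires every probe to pass, not only
the heartbeat. It is not the software running on these load balancers. The
function name and the idea of passing probes as callables are assumptions
made for illustration only.

```python
from typing import Callable, Iterable


def peer_needs_takeover(probes: Iterable[Callable[[], bool]]) -> bool:
    """Return True if any probe of the peer fails or raises.

    Each probe is a zero-argument callable that returns True when the
    aspect it checks is working. A heartbeat probe alone would have
    passed during this incident; adding a probe that exercises the
    load-balancing path itself would have failed and triggered takeover.
    """
    for probe in probes:
        try:
            if not probe():
                return True
        except Exception:
            # A probe that errors out is treated the same as a failed probe.
            return True
    return False


# Hypothetical stand-ins for real checks against the peer unit.
heartbeat_ok = lambda: True    # heartbeat responder still answering
data_path_ok = lambda: False   # balanced requests no longer served

print(peer_needs_takeover([heartbeat_ok]))                # False
print(peer_needs_takeover([heartbeat_ok, data_path_ok]))  # True
```

With only the heartbeat probe the standby sees a healthy peer, which
matches what happened here. Adding a probe that sends a real request
through the load-balancing path changes the outcome.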

We will upgrade block-lb02 today and have scheduled a brief window tonight
at 23:00 PDT (07:00 UTC) to put it back in the loop and fail services back
over to it so that we can upgrade block-lb01. During this second failover,
some services may experience up to 20 seconds of downtime as the
load-balanced services move to the other host.

All of our other load balancers are built with more RAM than these two, so
they should not be in danger of memory-starvation-related failures in the
immediate future. In addition, over the next week we will put better
detection in place to catch low-RAM conditions on the load balancers
before they are likely to cause an outage.
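
As a rough sketch of what a low-RAM check can look like, the Python below
parses text in the format of Linux's /proc/meminfo and flags a host when
available memory falls under a threshold. This is not the detection we are
deploying. It assumes a Linux host whose kernel reports MemAvailable, and
the 10% threshold and function names are hypothetical.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into a {field: integer value} dict."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def is_low_memory(meminfo: dict, min_available_fraction: float = 0.10) -> bool:
    """Return True when MemAvailable is below the given fraction of MemTotal."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"] < min_available_fraction


# Sample input; on a real host this would be the contents of /proc/meminfo.
sample = "MemTotal:        8000000 kB\nMemAvailable:     400000 kB\n"
print(is_low_memory(parse_meminfo(sample)))  # True (5% available)
```

A check like this would run on a schedule and alert well before memory is
exhausted, leaving time to act before requests stop being served.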

(Web only post)