Load balancer update.
We’ve determined that block-lb02 experienced a partial failure. The portions
of its software stack that load balance requests stopped functioning because
of memory starvation, but the portions that respond to heartbeat requests
kept working. As a result, the other unit in this highly available pair
(block-lb01) could not detect the failure and did not take over. This
continued for about 5 minutes until we pulled block-lb02 off the network,
which caused block-lb01 to detect the failure and take over.
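To illustrate why a heartbeat alone missed this failure, here is a minimal
sketch of a takeover decision that also considers the request path. The names
ProbeResult and should_take_over are hypothetical and are not taken from our
load balancer software; the sketch only shows the logic.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Hypothetical result of probing the peer unit."""
    heartbeat_ok: bool  # the peer answered the heartbeat request
    data_path_ok: bool  # a test request was actually load balanced by the peer


def should_take_over(peer: ProbeResult) -> bool:
    """Return True when the standby unit should take over from its peer.

    A heartbeat-only check would be `not peer.heartbeat_ok`, which stays
    False during a partial failure where only the request path is broken.
    """
    return not (peer.heartbeat_ok and peer.data_path_ok)
```

With this rule, a peer that still answers heartbeats but no longer balances
requests is treated as failed.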
We will be upgrading block-lb02 today and scheduling a brief window tonight
at 23:00 PDT (07:00 UTC) to put it back in the loop and fail services back
over to it so we can upgrade block-lb01. During this second failover,
some services may experience up to 20 seconds of downtime as the
load-balanced services move to the other host.
All of our other load balancers are built with more RAM than these two, so
they should not be in danger of any memory-starvation-related failures in
the immediate future. In addition, over the next week we will be putting
better detection in place to catch low-RAM scenarios on the load balancers
before they are likely to cause an outage.
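As a rough sketch of what a low-RAM check can look like, the example below
parses /proc/meminfo-style text and flags a host whose available memory drops
below a threshold. It assumes a Linux host that reports MemTotal and
MemAvailable; the 10% threshold and the function names are illustrative
assumptions, not the detection we are deploying.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   12345 kB" style lines.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def is_low_memory(text: str, min_available_fraction: float = 0.10) -> bool:
    """Return True when available memory is below the given fraction of total.

    The default threshold is an assumed value chosen for illustration.
    """
    fields = parse_meminfo(text)
    return fields["MemAvailable"] / fields["MemTotal"] < min_available_fraction
```

On a real host the input would come from reading /proc/meminfo, and a True
result would raise an alert before the load balancer starts starving.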
(Web only post)
Load Balancer outage update.
The load balancer outage referenced in the latest update also affected a
number of customer applications. It appears that the load balancer was
timing out while processing requests. However, since it didn’t go down
completely and was still responding to its health check, it didn’t fail
over to its redundant pair. Now that things are back online, we are
investigating the cause of this issue, as well as ways to ensure that
failover happens more quickly should similar issues arise in the future.
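One way to make failover more timely is to count slow or failed request
probes and trip after a few in a row. The sketch below is only an
illustration under assumed values: FailoverMonitor, the 2-second latency
limit, and the three-failure threshold are hypothetical, and the probe is any
callable that sends a test request and raises on error or timeout.

```python
import time
from typing import Callable


class FailoverMonitor:
    """Trips after several consecutive slow or failed probes (illustrative)."""

    def __init__(self, probe: Callable[[], None],
                 max_latency_s: float = 2.0, failures_to_trip: int = 3):
        self.probe = probe                      # sends one test request
        self.max_latency_s = max_latency_s      # assumed latency limit
        self.failures_to_trip = failures_to_trip
        self.consecutive_failures = 0

    def check(self) -> bool:
        """Run one probe; return True if failover should be triggered."""
        start = time.monotonic()
        try:
            self.probe()
            ok = (time.monotonic() - start) <= self.max_latency_s
        except Exception:
            # Errors and timeouts raised by the probe count as failures.
            ok = False
        # A single success resets the count, so brief blips do not trip it.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return self.consecutive_failures >= self.failures_to_trip
```

The probe is expected to enforce its own timeout (for example a socket
timeout), since check() cannot interrupt a call that never returns.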
We’ll provide an update once we have more information.
Boxpanel unavailable for 10 minutes
Our Box Panel application was offline for 10 minutes today (3:12pm - 3:22pm
PT), during which time the API was also likely offline.
Our load balancer was in a non-responsive state, but after a reboot all
appears normal.
Emergency Networking Upgrade Complete
We have completed the emergency network upgrade.
If you see any abnormal latency, reduction in service, or dropped packets, please open an escalated ticket with our support team or give us a call at 800.613.4305.
Emergency Networking Update: 0200 PST Status
We’re about 65-70% complete with the networking work. There have been some
BGP issues with our upstream link that have caused some short outages for
sites, and some internal traffic between servers has been slower than usual.
We’ll be getting a complete wrap-up out early next week.
Emergency Network upgrade - updated
The network upgrade is proceeding according to plan. There is the
potential for diminished functionality on some hosts when some changes take
effect.
If you are noticing connectivity issues, please let us know at
support@bluebox.net or via customer chat.
Network Issue this morning.
From 10:35 to 10:55, a portion of our network experienced some packet loss
due to a distributed denial-of-service (DDoS) attack directed at a server
on our network. We have addressed the issue, and any network connectivity
issues should now be resolved. If you continue to experience any network
problems, please get in touch with our support team, and we’ll address them
as best we can.
Emergency Network Upgrade
We will be performing an emergency upgrade of our core routers and the distribution layer of our network from GigE to 10-GigE on Friday, February 10th, 2012 starting at 23:00 PST. This network upgrade will address network performance issues that have been raised and alleviate network congestion. No downtime should be incurred as a result of this upgrade. We will have our senior administrators as well as our entire NOC and network engineering team on site during this work to both assist in the upgrade and to address any complications, should they arise. This upgrade is scheduled to take 5 hours, and again no downtime or significant degradation of network service is expected.
If you have any questions or concerns, or if there is anything that we can address, please open a ticket with our support team or give us a call at 800.613.4305.
All Systems Go
All systems are go at Blue Box.
Networking Issue: Updated
From 11:25 to 11:45pm Pacific Time, a small portion of our network experienced packet loss due to a traffic flood generated by a customer’s server. This server has been terminated and all network activity is operating normally at this time. We will continue to monitor the situation closely and our technology and networking teams will be reviewing the event in full detail tomorrow.