Dispatches From The Geeks

News and Announcements from the MCS Systems Group

Outage update: The story so far and when things will come back.

At 2:28 PM today, the lab suffered a partial site-wide power outage. I do not yet have the exact cause, though it was no doubt related to the incredible storm we were seeing. I happened to be recording the storm outside my window right as the power went off, so you can hear my exasperation in the recording: https://www.dropbox.com/s/jkchw3pppkqkg6k/2016-07-28%2019.28.43.mp4?dl=0.

This took out all infrastructure housed in 240 that was not backed up by UPS. The UPS-backed systems stayed up for a short time before the UPS ran out of juice and could no longer sustain the load. At that point, those systems failed as well.

CELS Systems maintains a presence in the 221 data center to house critical servers and services so they can ride out outages of this nature. An overzealous team member, in an attempt to cleanly shut down the systems in 240 that were about to run out of UPS power, inadvertently also shut down the systems in 221.

So, while a team of us ran around 240 trying to ascertain the situation, another team hoofed it over to 221 to bring affected systems back online.

At this point, we were dark on all fronts – supercomputers, HPC clusters, core computing, base IT. Through some magic of UPSes and fairy dust, the wifi and phones in some parts of the building remained operational through this outage (until the batteries in those UPSes also went down).

At 4:00, we were informed that the rest of the site had power back, but building 240 still was not powering up, and that the building’s electrical contractor was en route to determine why. Traffic and weather conspired to delay their arrival until well after 5PM.

During this time, the tenants of the 240 datacenter discussed our options. We’ve had some recent experience with power outages and what it takes to successfully come back from a planned outage (and, sadly, some experience with the unplanned outages as well). We determined that it would be counterproductive to try to bring the systems back online tonight, as we had no estimate of when power would be restored. Once power was restored, the dependency chain of having enough heat load in the room to run the chillers to cool the room would kick in.

Because of the amount of time and effort it takes to get ALCF, LCRC, Magellan, and Beagle back into operation, and the fact that there was no real estimate of when power would return or whether the chillers would be able to return to operation, we made the decision to postpone powering up the room until tomorrow morning, beginning at 6AM. Without the supercomputers and HPC clusters running, there is not enough heat being generated in the room to run the chillers, which causes them to shut down, which in turn causes the few systems that are running to overheat.

At 6AM tomorrow, the building operations staff and the various sysadmins will converge on the data center to bring back the equipment housed there.

We’re still working to fix the niggling items that have not come back from building 221. I’ll send updates as I have them.


Craig


Written by Craig Stacey

July 28, 2016 at 6:13 pm

Posted in Uncategorized
