Planned power work in the data center took out our file server (which should have stayed up due to redundancy, but we think we may have a bad power supply). The team’s working to bring it back as we speak. Thanks for your patience.
vanquish has a long-running job on it, so we will postpone its rebuild until we can work with the owner of the job.
crank has suffered a disk error, so we are accelerating its rebuild to today.
Sorry for any inconvenience, and thanks for your patience.
We’re pushing through on updating the remaining 64-bit compute nodes to Ubuntu 14.04 Trusty. Here’s the schedule:
This week (through Sep 9)
Next week (Sep 12-16)
Week 3 (Sep 19-23)
Week 4 (Sep 26-30)
Week 5 (Oct 3-7)
During each rebuild, the machine will be unavailable for some portion of that day. We’ll announce the shutdown on the machine itself to all logged-in users 30 minutes prior to shutdown. After the machine is rebuilt, you’ll need to recreate any crontabs you had in place. Also note /sandbox is not backed up and data will be lost – never keep data in /sandbox that can’t be easily reproduced.
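Since crontabs don’t survive a rebuild, it’s worth keeping a copy in your home directory ahead of time. A minimal sketch (the filename is illustrative, not a convention we enforce):

```shell
# Hypothetical sketch: keep a copy of your crontab so you can restore it
# after the rebuild. The backup filename here is just an example.

# Before the rebuild, dump the current crontab to your home directory:
crontab -l > "$HOME/crontab.backup" 2>/dev/null || true

# After the machine comes back, reinstall it:
#   crontab "$HOME/crontab.backup"
# and verify it took with:
#   crontab -l
```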
If you notice software packages missing or other oddities, please report them to email@example.com.
We’ll start this week’s batch of machines tomorrow (Wednesday, September 7).
Let us know if this presents any problems.
Confluence has been updated. When you log in to https://collab.cels.anl.gov you may get a popup pointing out some of the changes. Let us know if you have any issues.
Just a reminder the outage announced on Friday is happening at 5PM today. Thanks!
Hi, all. Our Confluence server is a few releases behind the latest version now, so we’d like to take the server down and bring it up to the current version. We expect it won’t take the full window, but we’re allotting 3 hours, from 5PM until 8PM on Tuesday, August 30. During this time, https://collab.cels.anl.gov will be unavailable. No data will be lost, though after the upgrade you may find some of the options for using it have changed (menu items may have moved, etc.)
If this timing poses a problem for you, please let us know so we can reschedule as necessary.
A hard drive failed in thrash, and in rebuilding the machine we moved it to the new trusty build (Ubuntu 14.04). This is functionally a completely fresh install with nothing carried over except the hostname and SSH keys, so any crontabs or other local data are gone. As we move more servers to trusty, we’ll let you know. If you find software packages you need that aren’t installed, let us know and we’ll get them installed across the whole trusty environment.
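If you want to catch missing packages quickly after a machine moves to trusty, one approach is to snapshot the package list beforehand and diff it later. A hedged sketch, assuming a Debian-based system and illustrative filenames:

```shell
# Hypothetical sketch: record installed packages before a rebuild so you can
# diff against the fresh trusty install afterward. Filenames are examples.

# Before the rebuild:
dpkg-query -W -f='${Package}\n' 2>/dev/null | sort > "$HOME/packages-before.txt" || true

# After the rebuild, repeat on the new system:
#   dpkg-query -W -f='${Package}\n' | sort > "$HOME/packages-after.txt"
# Packages present before but missing now:
#   comm -23 "$HOME/packages-before.txt" "$HOME/packages-after.txt"
```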
At 2:28 PM today, the lab suffered a partial site-wide power outage. I do not yet have the exact cause, though it was no doubt related to the incredible storm we were seeing. I happened to be recording the storm outside my window right as the power went off, so you can hear my exasperation in the recording: https://www.dropbox.com/s/jkchw3pppkqkg6k/2016-07-28%2019.28.43.mp4?dl=0.
This took out all infrastructure housed in 240 that was not backed by UPS. The UPS-backed systems stayed up for a short time, but once the UPS ran out of charge and could no longer sustain the load, they too went down.
CELS Systems maintains a presence in the 221 data center to house critical servers and services so they can ride out outages of this nature. An overzealous team member, attempting to cleanly shut down the 240 systems whose UPSes were about to run out, inadvertently also shut down the systems in 221.
So, while a team of us ran around 240 trying to ascertain the situation, another team hoofed it over to 221 to bring affected systems back online.
At this point, we were dark on all fronts – supercomputers, HPC clusters, core computing, base IT. Through some magic of UPSes and fairy dust, the wifi and phones in some parts of the building remained operational through this outage (until the batteries in those UPSes also went down).
At 4:00, we were informed that the rest of the site had power back, but building 240 still was not powering up, and that the building’s electrical contractor was en route to determine why. Traffic and weather conspired to delay their arrival until well after 5PM.
During this time, the tenants of the 240 datacenter discussed our options. We’ve had some recent experience with power outages and what it takes to successfully come back from a planned outage (and, sadly, some experience with unplanned outages as well). We determined that it would be counterproductive to try to bring the systems back online tonight, as we had no estimate of when power would return. Even once power was restored, we would face a dependency chain: the chillers that cool the room need enough heat load in the room to keep running.
Because of the amount of time and effort needed to get ALCF, LCRC, Magellan, and Beagle back into operation, and because there was no real estimate of when power would return or whether the chillers would be able to return to operation, we made the decision to postpone powering up the room until tomorrow morning, beginning at 6AM. Without the supercomputers and HPC clusters running, there is not enough heat being generated in the room to keep the chillers running, which causes them to shut down, which in turn causes the few systems that are running to overheat.
At 6AM tomorrow the building operations staff and the various sysadmins will converge on the data center to bring back the equipment housed in there.
We’re still working to fix the niggling items that have not come back from building 221. I’ll send updates as I have them.
I’m hoping we’ll be back soon, but I still have no info on root cause, actual scope of outage, or an estimate on time to return to operation.