Dispatches From The Geeks

News and Announcements from the MCS Systems Group

Unexpected file server outage

Planned power work in the data center took out our file server (which should have stayed up due to redundancy, but we think we may have a bad power supply). The team’s working to bring it back as we speak. Thanks for your patience.

Written by Craig Stacey

September 20, 2016 at 10:03 am

Posted in Uncategorized

Slight change in upgrade schedule

vanquish has a long-running job on it, so we will postpone its rebuild until we can work with the owner of the job.

crank has suffered a disk error, so its rebuild is being accelerated to this week (today).

Sorry for any inconvenience, and thanks for your patience.

Written by Craig Stacey

September 8, 2016 at 9:07 am

Posted in Uncategorized

thwomp has been upgraded to 14.04. More machines tomorrow (vanquish is delayed).

Written by Craig Stacey

September 7, 2016 at 2:36 pm

Posted in Uncategorized

Compute server upgrades continue

We’re pushing through on updating the remaining 64-bit compute nodes to Ubuntu 14.04 Trusty. Here’s the schedule:

This week (through Sep 9)

Next week (Sep 12-16)

Week 3 (Sep 19-23)

Week 4 (Sep 26-30)

Week 5 (Oct 3-7)

During each rebuild, the machine will be unavailable for some portion of that day. We’ll announce the shutdown on the machine itself to all logged-in users 30 minutes prior to shutdown. After the machine is rebuilt, you’ll need to recreate any crontabs you had in place. Also note /sandbox is not backed up and data will be lost – never keep data in /sandbox that can’t be easily reproduced.
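
If it helps, here’s a rough Python sketch of one way to stash your crontab and any /sandbox data you care about in your home directory (which lives on the backed-up file servers) before your machine’s rebuild day. This is only an illustration, not an official tool: the backup file names and the per-user sandbox path are assumptions, so adjust them to match how you actually use /sandbox.

#!/usr/bin/env python3
# Rough sketch (assumptions noted below): copy your crontab and your /sandbox
# data into your home directory ahead of a rebuild. The backup file names and
# the per-user sandbox path are assumptions; change them to fit your setup.
import getpass
import os
import shutil
import subprocess

home = os.path.expanduser("~")

# 'crontab -l' prints your current crontab; it exits nonzero if you have none.
try:
    listing = subprocess.check_output(["crontab", "-l"])
    with open(os.path.join(home, "crontab.backup"), "wb") as f:
        f.write(listing)
except subprocess.CalledProcessError:
    pass  # no crontab to save

# Copy anything you keep under /sandbox (hypothetical per-user directory shown).
sandbox = os.path.join("/sandbox", getpass.getuser())
dest = os.path.join(home, "sandbox-backup")
if os.path.isdir(sandbox) and not os.path.exists(dest):
    shutil.copytree(sandbox, dest)

After the rebuild, running crontab ~/crontab.backup reinstalls the saved jobs.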

If you notice software packages missing or other oddities, please report them to help@cels.anl.gov.

We’ll start this week’s batch of machines tomorrow (Wednesday, September 7).

Let us know if this presents any problems.

Written by Craig Stacey

September 6, 2016 at 9:58 am

Posted in Uncategorized

Confluence upgrade complete

Confluence has been updated. When you log in to https://collab.cels.anl.gov you may get a popup pointing out some of the changes. Let us know if you have any issues.


Written by Craig Stacey

August 30, 2016 at 9:01 pm

Posted in Uncategorized

Reminder: collab.cels.anl.gov (confluence service) maintenance outage: Aug 30, 2016, 5-8PM

Just a reminder that the outage announced on Friday is happening at 5PM today. Thanks!

Written by Craig Stacey

August 30, 2016 at 4:02 pm

Posted in Uncategorized

collab.cels.anl.gov (confluence service) maintenance outage: Aug 30, 2016, 5-8PM

Hi, all. Our Confluence server is a few releases behind the latest version, so we’d like to take it down and bring it up to current. We’re making the outage window three hours, from 5PM until 8PM on Tuesday, August 30, though we expect the upgrade won’t take that long. During this time, https://collab.cels.anl.gov will be unavailable. No data will be lost, though after the upgrade you may find some of the options for using it have changed (menu items may have moved, etc.).

If this timing poses a problem for you, please let us know so we can reschedule as necessary.


Written by Craig Stacey

August 26, 2016 at 10:16 am

Posted in Uncategorized

thrash.mcs.anl.gov OS update

A hard drive failed in thrash, and in rebuilding it we moved the machine to the new trusty build (Ubuntu 14.04). This is functionally a completely fresh install with no carryover except the hostname and SSH keys, so any crontabs or other local data would not be there. As we move more servers to trusty, we’ll let you know. If you find that software packages you need aren’t installed, let us know and we’ll get them installed across the whole trusty environment.

Written by Craig Stacey

August 25, 2016 at 9:09 am

Posted in Uncategorized

Outage update: The story so far and when things will come back.

At 2:28 PM today, the lab suffered a partial site-wide power outage. I do not yet have the exact cause, though it was no doubt related to the incredible storm we were seeing. I happened to be recording the storm outside my window right as the power went off, so you can hear my exasperation in the recording: https://www.dropbox.com/s/jkchw3pppkqkg6k/2016-07-28%2019.28.43.mp4?dl=0.

This took out all infrastructure housed in 240 that was not backed up by UPS. The rest stayed up for a short time before the UPS ran out of juice and could no longer sustain the load, at which point those systems failed as well.

CELS Systems maintains a presence in the 221 data center so that critical servers and services can ride out outages of this nature. An overzealous team member, in an attempt to cleanly shut down the systems in 240 whose UPS was about to run out, inadvertently also shut down the systems in 221.

So, while a team of us ran around 240 trying to ascertain the situation, another team hoofed it over to 221 to bring affected systems back online.

At this point, we were dark on all fronts – supercomputers, HPC clusters, core computing, base IT. Through some magic of UPSes and fairy dust, the wifi and phones in some parts of the building remained operational through this outage (until the batteries in those UPSes also went down).

At 4:00, we were informed that the rest of the site had power back but building 240 still did not, and that the building’s electrical contractor was en route to determine why. Traffic and weather conspired to delay their arrival until well after 5PM.

During this time, the tenants of the 240 data center discussed our options. We’ve had some recent experience with power outages and what it takes to successfully come back from a planned outage (and, sadly, some experience with unplanned ones as well). We determined that it would be counterproductive to try to bring the systems back online tonight: we had no estimate of when power would return, and once it did, the dependency chain of needing enough heat load in the room to keep the chillers running would kick in.

Because of the amount of time and effort it takes to get ALCF, LCRC, Magellan, and Beagle back into operation, and the fact that there was no real estimate of when power would return or whether the chillers would come back up, we made the decision to postpone powering up the room until tomorrow morning, beginning at 6AM. Without the supercomputers and HPC clusters running, there is not enough heat being generated in the room to keep the chillers going; they shut down, and the few systems that are running overheat.

At 6AM tomorrow, the building operations staff and the various sysadmins will converge on the data center to bring back the equipment housed there.

We’re still working to fix the niggling items that have not come back from building 221. I’ll send updates as I have them.


Written by Craig Stacey

July 28, 2016 at 6:13 pm

Posted in Uncategorized

Some areas have power restored, Building 240 still down

I’m hoping we’ll be back soon, but I still have no info on the root cause, the actual scope of the outage, or an estimate of time to return to operation.

Written by Craig Stacey

July 28, 2016 at 3:32 pm

Posted in Uncategorized