CELS Systems Outage, June 17-19, 2016.

As you may be aware by now, building 240 is undergoing a complete shutdown of power beginning on the morning of Friday, June 17, with an outage window extending into Sunday, June 19. We hope the outage will be shorter than that, but fully expect it will last until the evening of the 18th at the absolute earliest.

This affects all computers in the building 240 data center. Each IT organization is going to be notifying its users of the impact on them, and that’s what I’m writing to you about today. Our served customer base in CELS has grown since the last time we had to endure one of these, so I’m going to break it out a little bit, and I expect I will also send some more detailed mails later that will be targeted only at the BIO division. This message’s goal is to give you a heads-up that this is happening, and to make sure you plan accordingly.

I’ll make some division-specific announcements below, but everyone can expect affected compute systems to start going down by 6AM on June 17, with shutdowns complete by 9AM. Because your network files will not be available, we encourage you to make sure you have the files and data you need stored locally for that day. Getting accustomed to working in Box can help with that. You can find more information at http://inside.anl.gov/services/box.

We’ll send more announcements on this as we get closer to the date. You can also keep up to date via Twitter (@mcssys) or WordPress (https://mcssys.wordpress.com).

MCS/CELS/ALCF:

The effect of this work so closely mirrors the work that happened a year ago that I’m largely going to crib from that announcement. Pardon my unoriginality:

The short answer is that it’s easier to say what won’t be affected. Mail services we provide (forwarding for mcs.anl.gov, alcf.anl.gov, cels.anl.gov, ci.uchicago.edu, etc.) and mailing lists will be unaffected. Most web sites we host (WordPress, Confluence, etc.) will remain up. We’ll notify site owners of any exceptions to this. CIS-provided services (e-mail, web, business systems) are generally unaffected. Externally hosted services (Box, Dayforce, TAMS) are unaffected.

Now for the info you really need: what will be down. All MCS/CELS file and compute servers will be down. This includes SSH logins (login.mcs.anl.gov), Unix and Mac home file servers, Linux compute servers, all desktops, and all networking in building 240. We had planned to move the Subversion server at repo.anl-external.org to the 221 data center, but have not been able to accomplish that in time for this work, and thus it, too, will be down.

It’s outside the scope of this announcement, but I’ll also just remind you that if it’s in the data center, it’s down. So that means LCRC, Mira and friends, Beagle, Magellan, Chameleon… you get the gist.

I will send an update closer to the outage detailing the exact services that will still be up, just like I did last year. (See https://mcssys.wordpress.com/2015/05/27/reminder-tcs-power-outage-june-1-2015/ for historical reference.)

BIO:

I’m not sure what Rocky told you during last year’s power work. I can tell you that the net effect will be the same and we’ll take what steps we can to minimize the effect on you. That being said, all services hosted in 240 will go down, and that’s where all of your back-end services live at the moment, so the effect will be felt. We’ll reduce your dependency on the server that handles giving out network addresses to wired computers so that as long as you don’t reboot, you’ll stay up on the network. Your file server will go down (bioxshared, Y drive, Z drive), and you may have issues logging in to your computer if you log out or reboot.

Unfortunately, the work to move BIO computers off the BIO child domain and remove those dependencies is starting in June, but won’t be complete in time for this outage (it will take some time). We will, however, get what pieces we can in place prior to the outage. I’m personally carving out a chunk of time next week to see if I can’t at least get the DHCP (IP address assignment) component put away and have that functional during the outage. File systems are a bigger fish to fry and will take a lot more time. As such, we’re strongly encouraging you to embrace Box if you haven’t already. I’ll spend a bit of time helping folks out with Box at our session next week.

We will be taking the opportunity this outage brings us to make some improvements in the power layout of your servers and switches to make them more resilient to partial power failures. Alas, nothing short of a big ole generator makes them resilient to a total outage like this.

I will send weekly updates to BIO on our progress in minimizing the effect this outage has on the division. Next week’s update will be summarized at the previously announced tech session in the auditorium.

Thanks, and please understand – if we had any say in this, it wouldn’t be happening. But we’ll power through it as best we can.

Written by Craig Stacey

May 26, 2016 at 5:41 pm