As you may be aware by now, building 240 is undergoing a complete shutdown of power beginning in the morning of Friday, June 17, with an outage window extending into Sunday, June 19. We hope the outage will be shorter than that, but fully expect it will last until the evening of the 18th at the absolute earliest.
This affects all computers in the building 240 data center. Each IT organization is going to be notifying its users of the impact on them, and that’s what I’m writing to you about today. Our served customer base in CELS has grown since the last time we had to endure one of these, so I’m going to break it out a little bit, and I expect I will also have some more detailed mails that will only be targeted at the BIO division after the fact. This message’s goal is to give you a heads up that this is happening, and make sure you plan accordingly.
I’ll make some division-specific announcements below, but everyone can expect affected compute systems to start going down by 6 AM on June 17, and shutdowns will be complete by 9 AM. Because your network files will not be available, we encourage you to make sure you have the files and data you need stored locally for that day. Getting accustomed to working in Box can help with that. You can find more information at http://inside.anl.gov/services/box.
We’ll send more announcements on this as we get closer to the date. You can also keep up to date via twitter (@mcssys) or WordPress (https://mcssys.wordpress.com).
The effect of this work so closely mirrors the work that happened a year ago that I’m largely going to crib from that announcement. Pardon my unoriginality:
The short answer is that it’s easier to say what won’t be affected. Mail services we provide (forwarding for mcs.anl.gov, alcf.anl.gov, cels.anl.gov, ci.uchicago.edu, etc.) and mailing lists will be unaffected. Most web sites we host (WordPress, Confluence, etc.) will remain up. We’ll notify site owners of any exceptions to this. CIS-provided services (e-mail, web, business systems) are generally unaffected. Externally hosted services (Box, Dayforce, TAMS) are unaffected.
Now for the info you really need — what will be down. All MCS/CELS file and compute servers will be down. This includes SSH logins (login.mcs.anl.gov), unix and Mac home file servers, linux compute servers, all desktops, and all networking in building 240. We had planned to move the subversion server at repo.anl-external.org to the 221 data center, but have not been able to accomplish that in time for this work and thus it, too, will be down.
It’s outside the scope of this announcement, but I’ll also just remind you that if it’s in the data center, it’s down. So that means LCRC, Mira and friends, Beagle, Magellan, Chameleon… you get the gist.
I will send an update closer to the outage detailing the exact services that will still be up, just like I did last year. (See https://mcssys.wordpress.com/2015/05/27/reminder-tcs-power-outage-june-1-2015/ for historical reference.)
I’m not sure what Rocky told you during last year’s power work. I can tell you that the net effect will be the same and we’ll take what steps we can to minimize the effect on you. That being said, all services hosted in 240 will go down, and that’s where all of your back-end services live at the moment, so the effect will be felt. We’ll reduce your dependency on the server that handles giving out network addresses to wired computers so that as long as you don’t reboot you’ll stay up on the network. Your file server will go down (bioxshared, Y drive, Z drive), and you may have issues logging in to your computer if you log out or reboot.
Unfortunately, the work to move BIO computers off the BIO child domain and remove those dependencies is starting in June, but won’t be complete in time for this outage (it will take some time). We will, however, get what pieces we can in place prior to the outage. I’m personally carving out a chunk of time next week to see if I can’t at least get the DHCP (IP address assignment) component put away and have that functional during the outage. File systems are a bigger fish to fry and will take a lot more time. As such, we’re strongly encouraging you to embrace Box if you haven’t already. I’ll spend a bit of time helping folks out with Box at our session next week.
We will be taking the opportunity this outage brings us to make some improvements in the power layout of your servers and switches to make them more resilient to partial power failures. Alas, nothing short of a big ole generator makes them resilient to a total outage like this.
I will send weekly updates to BIO on the progress to minimize the effect this outage has on the division. Next week’s update will be summarized at the previously announced tech session in the auditorium.
Thanks, and please understand – if we had any say in this, it wouldn’t be happening. But we’ll power through it as best we can.
On June 1, ANL Cyber will be implementing new rules on the outgoing web proxy (affects all hosts in BIO, and all hosts on wireless at ANL) that block traffic for outdated/vulnerable machines. This is not a new policy, just updating the rules to reflect currently supported versions.
If CELS Systems is the administrator for your machine, we’ll get you up to date. If you maintain your own machine, and things stop working for you on June 1, this will be why.
NOTE: This also applies to mobile devices like iPads, iPhones, Android devices, etc. Current versions will be required on the Auth wireless network.
The list of supported OS and software versions follows:
Windows: Vista or later, with latest service pack and patches. Internet Explorer 10 or better.
Mac OS X: El Capitan (10.11.4 or 10.11.5), Yosemite (10.10.5), Mavericks (10.9.5, latest version)
Java: 1.8.0_76, 1.8.0_77, 1.8.0_91, 1.8.0_92.
Google Chrome: 49.0 or newer. 49 will be removed when 51 is released (50 is current version).
Firefox: 38.8.0 ESR (expires June 7, though), 45.1.1 ESR, and 46.0.1 or better.
Contact email@example.com if resoft or logout/login doesn’t work.
We’ve received notice regarding upcoming power work for building 240. Time frames have not been set, but this is something that will take out the entire building for a weekend, and many servers and services we provide. The current expectation is a weekend in June will be scheduled for the work. This work is to install necessary power for the installation of Theta in ALCF, the newest supercomputer coming on-site.
As we approach the date (once it’s decided), we’ll have a more definite list of what will and won’t be affected, but it’s safe to say all compute resources, file servers, desktops, and anything else that’s housed in building 240 will go down.
Over the years, we’ve moved a lot of critical CELS resources to building 221 (at least, the ones we’re able to fit into half a rack), so things like mailing lists and websites will generally continue to work. We’ll keep you updated as dates are shored up, though the critical scheduling factors will be driven by the Argonne site, ComEd, and ALCF.
First of all, here’s a reminder about the user survey we’re conducting as announced in the last update. You can find the survey at the URL below, and it’s pretty quick. Should only take a few minutes to fill out. I’ll be closing the survey down at the end of the month.
Next up, we’ve got some staffing changes coming up. On April 4th we’ll be welcoming our newest team member, Brad Fritz. He’ll be joining us as a Systems Administrator, filling the role that John Roberts held before his promotion in LCRC. Brad will be joining us from Motorola, where he has a long history maintaining various wireless and telecom installations. But his passion’s been doing unix administration, and he’s very happy to be finally getting to do it for a living instead of as a hobby. He’s excited to join Argonne, and we’re thrilled to have him!
Lastly, some bittersweet news. Many of you surely know this by now, but Ti Leggett is leaving our team to go on to bigger things. He’s not leaving Argonne, thankfully, as he’s joining ALCF as their new Deputy Project Director & Deputy Director of Operations. Our loss is most assuredly ALCF’s gain. I’m sure you’ll all join me in congratulating Ti and wishing him great success in his new role; I’m confident he’s going to be amazing.
As such, we’ve got a new hole to fill on this team, so if you know of people who might be interested in a leadership role dealing with production computing in CELS, let me know.
Hi, folks. We’ve had a few changes in the group since the new year, and I’m overdue in letting you know about them. Let’s dive right in:
At the end of January, BIO’s IT admin Rocky Patel left Argonne for another opportunity. With that, our group took on supporting the BIO division in addition to the divisions we already support. I sent a separate note earlier this month specifically to the BIO division after this happened, and have since had a meeting with the division, but I thought it would be good to have everyone aware of the situation, since we’re going to be spending some effort figuring out BIO’s IT architecture and incorporating them into what we do.
I’ve added the BIO division to this announcement list, so BIO folks, you’re going to see all the announcements we send to this list. It’s not very high traffic, and not everything announced on it is relevant to your division yet. You can also follow our exploits on our blog at https://mcssys.wordpress.com, and @mcssys on Twitter.
Welcome Kat and Jasan:
Martino left our group at the end of October to join CIS and help run their Service Desk, leaving a void for us. We filled that void (and added a little extra effort to help with BIO) since the break. You’ve probably already encountered our newest team members, but I wanted to introduce them all the same.
Jasan Krupka previously spent a number of years with Apple as a Technician/Genius/Product Specialist. He’s also an active member of the National Guard.
Kat Tylka has spent a number of years providing tech support and help desk services in the area, most recently with Porvisur Technologies in Mokena.
We’re thrilled to have both of them on board, so stop by and say “hi” if you haven’t already. (I also hope to be introducing one more team member in the coming month as we’re in the final stages of hiring a new junior sysadmin.)
BIO in-person coverage:
Now that we’ve got a fully staffed service desk, we’ve gone back to full hours again. And with the addition of the BIO division, we’re adding in-person hours in building 446. We’re trying out this schedule to see how it works, and will adjust accordingly as needs dictate, but for the moment we’ll have someone on the desk over there on Tuesday and Thursday mornings, from 8:30 – 11:30 AM. Either Tina, Kat, or Jasan will be located in A128A-1 in building 446 during those hours. Of course, if something comes up that requires an in-person visit outside those hours we’ll do what we’ve been doing, but this gives us a little more regular coverage in the building. You still get help the same way, by e-mailing firstname.lastname@example.org, or calling extension 6813.
We haven’t done a survey in some time, and I think we’re overdue. At the link below you’ll find a Google Form with a rather free-form questionnaire on our team’s services. There aren’t many questions, and it’s really an opportunity for you to let us know where we should be putting our efforts in the coming year. BIO folks, I know you’ve only had a month or so with us, but your input is helpful in this as well, so please dive in.
You can find the survey at http://goo.gl/forms/RA60ciGfB3, and I’ll post a summary in a couple of months. I’d like to keep it open for the month of March, closing it to answers at the end of the month.
Due to the weather, we’re going to be closing the CELS service desk at lunch today and working remotely for the rest of the day. That means no walk-ups, and phone calls will go to voicemail. However, we’ll be monitoring email@example.com and will be handling any issues we can remotely.
Monday’s power work will not generally affect servers run by us. Systems Administrators of affected systems will notify their users directly, but the affected systems are Mira, Beagle, TRACC, and some portions of Magellan.
The power outage in December took out much of the data center, but left most of the office side of the building up. This is the opposite of that situation – most of the data center (with the exception of two Power Distribution Units) will stay up, but the office side of things will generally lose power.
For those of you in B240 who are supported by us in CELS Systems, a couple of notes:
- Your desktop will almost certainly lose power and reboot. Before you leave on Friday, you should cleanly shut down your machine.
- Upon your return on Monday, even if you did cleanly shut it down, it’s possible it will have come back on its own. However, it may have come back before things were ready for it to do so. As such, please reboot your computer if anything seems out of the ordinary before opening a trouble ticket with us – I promise it’s the first thing we’re going to do once we get a ticket and it might get you up and running quicker.
- After you’ve rebooted your computer, if things are still messed up, please let us know at firstname.lastname@example.org (or email@example.com) and we’ll look into it. Please be patient; we’re short-staffed and will be dealing with a building that just rebooted.
Here’s the official list of affected services:
Note that Single Sign On (SSO), which allows external apps (like Box, Workday, etc.) to authenticate against Argonne accounts, is being temporarily migrated offsite so that it will remain up during the network outage. And CIS is confident this outage won’t last more than 15 minutes.