First announcement: https://mcssys.wordpress.com/2016/05/26/cels-systems-outage-june-17-19-2016/
Second announcement: https://mcssys.wordpress.com/2016/06/13/cels-systems-outage-june-17-19-2016-2/
For status updates during the outage, you can follow our Twitter account at https://twitter.com/mcssys
First, updates for BIO (MCS/CELS announcements follow this section):
I’m awaiting word that the switchover is complete, but we should have DHCP (network address assignment) switched over to CIS servers today, which means if a machine reboots during the outage, it will still come up. The BIO authentication server backup is running in the 221 data center and will stay up, so logins will continue to work. Y, Z, and X drives will be down. Linux systems will not be able to log in to BIO or ANL accounts, and Linux shared file systems will be down.
The Canon copiers are now available on the CIS print server (which will stay up during the outage). You can see them here: \\printers.anl.gov. The printers installed are:
q104copier.sbc.anl.gov (iR-ADV 6055)
a102copier.bio.anl.gov (iR-ADV 6055)
If you normally print directly to the copiers, that will work as usual. The Xerox and other printers are not yet on that print server; we’ll get those moved over after the outage and retire “BIOPRINT”.
Kat will be on-site at our desk in 446 to assist tomorrow, and the team will be monitoring the ticket queue.
Updates for MCS/LCF/CELS:
The CELS user-facing Linux environment will be down, as previously announced, including login and home file servers. We will issue a system-wide shutdown command to all Linux workstations tomorrow morning. If you run a Mac or a self-managed machine, please shut down the machine before you leave tonight. Mac file servers will shut down at COB today, as will our tape backup servers.
On Monday, if anything is not working as expected on your computer, please reboot first, then if that doesn’t fix the issue, report it to help or call the help desk at x6813.
The list of user-facing machines that will *not* go down is included below. The ALCF accounts webpages will have a brief downtime tomorrow for maintenance on the server, but will be up by afternoon. Note that repo.anl-external.org is one of the sites that will unfortunately be down during the outage.
The outage window is advertised to be until Sunday, though it’s very likely power will return on Saturday. However, it may take some time to get systems back, so please do not expect anything to be back in operation until we say it is. I will send a notice to this list, which will also post to the previously noted Twitter account and to the blog at https://mcssys.wordpress.com, when we believe things are back in normal operation.
Some systems not part of CELS core IT may not return until Monday – please watch for communications from the teams for those systems. Also note any all-clear I give is purely related to the IT systems – the building will not resume normal business operations until Monday morning.
These servers will not go down as part of the outage:
app001.cels.anl.gov (collab.cels.anl.gov – Confluence)
app003.cels.anl.gov (gitlab.cels.anl.gov – internal gitlab)
app006.cels.anl.gov (dev.esg.anl.gov, dev.esgf.anl.gov)
app007.cels.anl.gov (www.esg.anl.gov, www.esgf.anl.gov)
beehive0.mcs.anl.gov (waggle project)
beehive1.mcs.anl.gov (waggle project)
beehive2.mcs.anl.gov (waggle project)
ca.mcs.anl.gov (certificate authority)
caveat.mcs.anl.gov (MPICH svn/trac)
coredb.mcs.anl.gov (internal database server)
davmail.mcs.anl.gov (Exchange/WebDAV connector)
gust.mcs.anl.gov (WordPress sites)
jenkins.mcs.anl.gov (build test server)
kerdap-2.mcs.anl.gov (MCS/CELS kerberos/LDAP server)
kerdap.jlse.anl.gov (JLSE LDAP/kerberos server)
lic001.cels.anl.gov (License server for some software packages)
mon001.cels.anl.gov (monitoring server)
newnewman-1.mcs.anl.gov (email relay)
newnewman-2.mcs.anl.gov (email relay)
nginx.mcs.anl.gov (web proxy front-end for CELS-hosted websites)
owney.mcs.anl.gov (CELS-hosted mailman lists)
rdp.mcs.anl.gov (Windows terminal server)
rt.mcs.anl.gov (Request Tracker ticketing system)
squall.mcs.anl.gov (ALCF websites)
typhoon.mcs.anl.gov (JLSE wiki)
variant.mcs.anl.gov (svn/trac server)
wilbur.mcs.anl.gov (MCS webpages not hosted by CIS)
wind.mcs.anl.gov (WordPress, MediaWiki sites)
xgitlab.cels.anl.gov (externally available gitlab server)
yubi-221.mcs.anl.gov (CELS One Time Password server)
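If you want to see for yourself whether a service on the list above is answering, a quick connection test is one way to do it. What follows is only a minimal sketch, not something Systems runs: the choice of hosts, the port (443), and the helper names tcp_probe and check_hosts are all assumptions made for illustration, and some of the servers listed may not listen on that port at all.

```python
import socket

# A handful of names taken from the list above. Port 443 is an assumption;
# not every listed server necessarily answers there.
STAY_UP = ["collab.cels.anl.gov", "gitlab.cels.anl.gov", "rt.mcs.anl.gov"]


def tcp_probe(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_hosts(hosts, probe=tcp_probe):
    """Map each host name to the result of probing it (hypothetical helper)."""
    return {host: probe(host) for host in hosts}

# Example use:
#   for host, up in check_hosts(STAY_UP).items():
#       print(host, "reachable" if up else "no response")
```

A failed probe only means that one port did not answer from where you ran the check; the all-clear notice is still the word to wait for.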
(Previous announcement here: https://mcssys.wordpress.com/2016/05/26/cels-systems-outage-june-17-19-2016/)
Here’s an update on the state of things for the outage this Friday. Please read the above for context. Another reminder/announcement will be sent on Thursday.
Aside from what’s noted in the prior announcement, please note we will be turning off the Mac file servers on Thursday evening at close of business. Please make sure you have the files you need moved to Box or your local machine prior to 5PM on June 16.
The morning of June 17 we will issue a building-wide shutdown command to all Linux workstations managed by us. If you self-manage your machine, or if you run a Mac or Windows machine, please shut it down before you leave on Thursday to ensure no data loss on the local disk. Remember, you won’t be coming into building 240 at all on Friday; it will be off limits to everyone (including Systems).
The list of machines that will stay up is largely the same as last time (https://mcssys.wordpress.com/2015/05/27/reminder-tcs-power-outage-june-1-2015/) with the addition of gitlab.cels.anl.gov and xgitlab.cels.anl.gov and some back-end services. I’ll post the complete list later this week.
Work to reduce the impact of this outage on your work is progressing well. Our expectation is that login services for bio.anl.gov workstations and DHCP (the service that provides the network address for your workstation) will remain up and running. BIO File servers will, however, be down. The BIO print server will also be down.
If you need to print something on Friday, you can connect to the Canon copiers from your web browser (ex: http://a102copier.bio.anl.gov) and print PDFs that way. Save the document you want to print as a PDF, go to the printer in your web browser, click “End User Mode”, then log in (no PIN). Click “Direct Print”, then under “Specify File”, choose “Browse” and select the PDF you want to print. Once the options below look like what you want, click “Start Printing”.
Kat will be over there through most of the day to help with issues, and we’ll be monitoring the support queue at help.
In order to apply some security patches to our gitlab servers, we’re going to have a brief outage on Wednesday afternoon (June 8) from noon to 12:30pm. During this time, the servers won’t be reachable, pushes and pulls will fail, etc. No data loss will occur, and everything will be back in operation within 30 minutes.
If this poses an undue hardship, please let me know and we’ll reschedule. Thanks!
As you may be aware by now, building 240 is undergoing a complete shutdown of power beginning in the morning of Friday, June 17, with an outage window extending into Sunday, June 19. We hope the outage will be shorter than that, but fully expect it will last until the evening of the 18th at the absolute earliest.
This affects all computers in the building 240 data center. Each IT organization is going to be notifying its users of the impact on them, and that’s what I’m writing to you about today. Our served customer base in CELS has grown since the last time we had to endure one of these, so I’m going to break it out a little bit, and I expect I will also have some more detailed mails that will only be targeted at the BIO division after the fact. This message’s goal is to give you a heads up that this is happening, and make sure you plan accordingly.
I’ll make some division-specific announcements below, but everyone can expect affected compute systems to start going down beginning by 6AM on June 17, and shutdowns will be complete by 9AM. Because your network files will not be available, we encourage you to make sure you have files and data you need locally for that day. Getting accustomed to working in Box can help with that. You can find more information at http://inside.anl.gov/services/box.
We’ll send more announcements on this as we get closer to the date. You can also keep up to date via Twitter (@mcssys) or WordPress (https://mcssys.wordpress.com).
The effect of this work so closely mirrors the work that happened a year ago that I’m largely going to crib from that announcement. Pardon my unoriginality:
The short answer is that it’s easier to say what won’t be affected. Mail services we provide (forwarding for mcs.anl.gov, alcf.anl.gov, cels.anl.gov, ci.uchicago.edu, etc.) and mailing lists will be unaffected. Most web sites we host (WordPress, Confluence, etc.) will remain up. We’ll notify site owners of any exceptions to this. CIS-provided services (e-mail, web, business systems) are generally unaffected. Externally hosted services (Box, Dayforce, TAMS) are unaffected.
Now for the info you really need: what will be down. All MCS/CELS file and compute servers will be down. This includes SSH logins (login.mcs.anl.gov), Unix and Mac home file servers, Linux compute servers, all desktops, and all networking in building 240. We had planned to move the Subversion server at repo.anl-external.org to the 221 data center, but have not been able to accomplish that in time for this work, and thus it, too, will be down.
It’s outside the scope of this announcement, but I’ll also just remind you if it’s in the data center, it’s down. So that means LCRC, Mira and friends, Beagle, Magellan, Chameleon… you get the gist.
I will send an update closer to the outage detailing the exact services that will still be up, just like I did last year. (See https://mcssys.wordpress.com/2015/05/27/reminder-tcs-power-outage-june-1-2015/ for historical reference.)
I’m not sure what Rocky told you during last year’s power work. I can tell you that the net effect will be the same and we’ll take what steps we can to minimize the effect on you. That being said, all services hosted in 240 will go down, and that’s where all of your back-end services live at the moment, so the effect will be felt. We’ll reduce your dependency on the server that handles giving out network addresses to wired computers so that as long as you don’t reboot you’ll stay up on the network. Your file server will go down (bioxshared, Y drive, Z drive), and you may have issues logging in to your computer if you log out or reboot.
Unfortunately, the work to move BIO computers off the BIO child domain and remove those dependencies is starting in June, but won’t be complete in time for this outage (it will take some time). We will, however, get what pieces we can in place prior to the outage. I’m personally carving out a chunk of time next week to see if I can’t at least get the DHCP (IP address assignment) component put away and have that functional during the outage. File systems are a bigger fish to fry and will take a lot more time. As such, we’re strongly encouraging you to embrace Box if you haven’t already. I’ll spend a bit of time helping folks out with Box at our session next week.
We will be taking the opportunity this outage brings us to make some improvements in the power layout of your servers and switches to make them more resilient to partial power failures. Alas, nothing short of a big ole generator makes them resilient to a total outage like this.
I will send weekly updates to BIO on the progress to minimize the effect this outage has on the division. Next week’s update will be summarized at the previously announced tech session in the auditorium.
Thanks, and please understand – if we had any say in this, it wouldn’t be happening. But we’ll power through it as best we can.
On June 1, ANL Cyber will be implementing new rules on the outgoing web proxy (affects all hosts in BIO, and all hosts on wireless at ANL) that block traffic for outdated/vulnerable machines. This is not a new policy, just updating the rules to reflect currently supported versions.
If CELS Systems is the administrator for your machine, we’ll get you up to date. If you maintain your own machine, and things stop working for you on June 1, this will be why.
NOTE: This also applies to mobile devices like iPads, iPhones, Android devices, etc. Current versions will be required on the Auth wireless network.
The list of supported OS and software versions follows:
Windows: Vista or later, with latest service pack and patches. Internet Explorer 10 or better.
Mac OS X: El Capitan (10.11.4 or 10.11.5), Yosemite (10.10.5), Mavericks (10.9.5, latest version)
Java: 1.8.0_76, 1.8.0_77, 1.8.0_91, 1.8.0_92.
Google Chrome: 49.0 or newer. 49 will be removed when 51 is released (50 is current version).
Firefox: 38.8.0 ESR (expires June 7, though), 45.1.1 ESR, and 46.0.1 or better.
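If you maintain your own machine and want to sanity-check a browser version against the list above, the comparison can be sketched in a few lines. This is purely illustrative and is not how the proxy itself decides: the function names and the table layout are assumptions, and only the version numbers come from the list.

```python
def parse_version(text):
    """Turn '49.0' into (49, 0, 0) so versions compare as tuples."""
    parts = [int(part) for part in text.split(".")]
    # Pad to at least three components so '49' and '49.0.0' compare equal.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


# Minimum regular-release versions from the list above.
MINIMUM = {
    "chrome": parse_version("49.0"),
    "firefox": parse_version("46.0.1"),
}

# Firefox ESR is accepted only at these exact versions
# (38.8.0 ESR is listed as expiring June 7).
FIREFOX_ESR_ALLOWED = {parse_version("38.8.0"), parse_version("45.1.1")}


def is_supported(browser, version, esr=False):
    """Return True if the version meets the listed rule (hypothetical helper)."""
    parsed = parse_version(version)
    if browser == "firefox" and esr:
        return parsed in FIREFOX_ESR_ALLOWED
    return parsed >= MINIMUM[browser]
```

For example, is_supported("chrome", "48.0.2564") is False while is_supported("firefox", "45.1.1", esr=True) is True. The table would need updating whenever the list changes, such as when Chrome 49 is dropped.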
Contact email@example.com if resoft or logout/login doesn’t work.
We’ve received notice regarding upcoming power work for building 240. Time frames have not been set, but this is something that will take out the entire building for a weekend, and many servers and services we provide. The current expectation is a weekend in June will be scheduled for the work. This work is to install necessary power for the installation of Theta in ALCF, the newest supercomputer coming on-site.
As we approach the date (once it’s decided), we’ll have a more definite list of what will and won’t be affected, but it’s safe to say all compute resources, file servers, desktops, and anything else that’s housed in building 240 will go down.
Over the years, we’ve moved a lot of critical CELS resources to building 221 (at least, the ones we’re able to fit into half a rack), so things like mailing lists and websites will generally continue to work. We’ll keep you updated as dates are shored up, though the critical scheduling factors will be driven by the Argonne site, ComEd, and ALCF.
First of all, here’s a reminder about the user survey we’re conducting as announced in the last update. You can find the survey at the URL below, and it’s pretty quick. Should only take a few minutes to fill out. I’ll be closing the survey down at the end of the month.
Next up, we’ve got some staffing changes coming up. On April 4th we’ll be welcoming our newest team member, Brad Fritz. He’ll be joining us as a Systems Administrator, filling the role that John Roberts held before his promotion in LCRC. Brad will be joining us from Motorola, where he has a long history maintaining various wireless and telecom installations. But his passion’s been doing Unix administration, and he’s very happy to be finally getting to do it for a living instead of a hobby. He’s excited to join Argonne, and we’re thrilled to have him!
Lastly, some bittersweet news. Many of you surely know this by now, but Ti Leggett is leaving our team to go on to bigger things. He’s not leaving Argonne, thankfully, as he’s joining ALCF as their new Deputy Project Director & Deputy Director of Operations. Our loss is most assuredly ALCF’s gain. I’m sure you’ll all join me in congratulating Ti and wishing him great success in his new role; I’m confident he’s going to be amazing.
As such, we’ve got a new hole to fill on this team, so if you know of people who might be interested in a leadership role dealing with production computing in CELS, let me know.
Hi, folks. We’ve had a few changes in the group since the new year, and I’m overdue in letting you know about them. Let’s dive right in:
At the end of January, BIO’s IT admin Rocky Patel left Argonne for another opportunity. With that, our group took on supporting the BIO division along with the divisions we’re currently supporting. I sent a separate note earlier this month specifically to the BIO division after this happened, and have since had a meeting with the division, but I thought it would be good to have everyone aware of the situation, since we’re going to be spending some effort figuring out BIO’s IT architecture and incorporating them into what we do.
I’ve added the BIO division to this announcement list, so BIO folks, you’re going to see all the announcements we send to this list. It’s not very high traffic, and not everything announced on it is relevant to your division yet. You can also follow our exploits on our blog at https://mcssys.wordpress.com, and @mcssys on Twitter.
Welcome Kat and Jasan:
Martino left our group at the end of October to join CIS and help run their Service Desk, leaving a void for us. We filled that void (and added a little extra effort to help with BIO) since the break. You’ve probably already encountered our newest team members, but I wanted to introduce them all the same.
Jasan Krupka previously spent a number of years with Apple as a Technician/Genius/Product Specialist. He’s also an active member of the National Guard.
Kat Tylka has spent a number of years providing tech support and help desk services in the area, most recently with Porvisur Technologies in Mokena.
We’re thrilled to have both of them on board, so stop by and say “hi” if you haven’t already. (I also hope to be introducing one more team member in the coming month as we’re in the final stages of hiring a new junior sysadmin.)
BIO in-person coverage:
Now that we’ve got a fully staffed service desk, we’ve gone back to full hours again. And with the addition of the BIO division, we’re adding in-person hours in building 446. We’re trying out this schedule to see how it works, and will adjust accordingly as needs dictate, but for the moment we’ll have someone on the desk over there on Tuesday and Thursday mornings, from 8:30 – 11:30 AM. Either Tina, Kat, or Jasan will be located in A128A-1 in building 446 during those hours. Of course, if something comes up that requires an in-person visit outside those hours we’ll do what we’ve been doing, but this gives us a little more regular coverage in the building. You still get help the same way, by e-mailing firstname.lastname@example.org, or calling extension 6813.
We haven’t done a survey in some time, and I think we’re overdue. At the link below you’ll find a Google Form with a rather free-form questionnaire on our team’s services. There aren’t many questions, and it’s really an opportunity for you to let us know where we should be putting our efforts in the coming year. BIO folks, I know you’ve only had a month or so with us, but your input is helpful in this as well, so please dive in.
You can find the survey at http://goo.gl/forms/RA60ciGfB3, and I’ll post a summary in a couple of months. I’d like to keep it open for the month of March, closing it to answers at the end of the month.
Due to the weather, we’re going to be closing the CELS service desk at lunch today and working remotely for the rest of the day. That means no walk-ups, and phone calls will go to voicemail. However, we’ll be monitoring email@example.com and will be handling any issues we can remotely.