At 2:28 PM today, the lab suffered a partial site-wide power outage. I do not yet have the exact cause, though it was no doubt related to the incredible storm we were seeing. I happened to be recording the storm outside my window right as the power went off, so you can hear my exasperation in the recording: https://www.dropbox.com/s/jkchw3pppkqkg6k/2016-07-28%2019.28.43.mp4?dl=0.
This took out all infrastructure housed in 240 that was not backed up by UPS. The UPS-backed systems stayed up for a short time, until the UPS ran out of juice and could no longer sustain the load. At that point, those systems failed as well.
CELS Systems maintains a presence in the 221 data center to house critical servers and services to be able to ride out outages of this nature. An overzealous team member, in an attempt to cleanly shut down the systems in 240 that were going to run out of UPS, inadvertently also shut down the systems in 221.
So, while a team of us ran around 240 trying to ascertain the situation, another team hoofed it over to 221 to bring affected systems back online.
At this point, we were dark on all fronts – supercomputers, HPC clusters, core computing, base IT. Through some magic of UPSes and fairy dust, the wifi and phones in some parts of the building remained operational through this outage (until the batteries in those UPSes also went down).
At 4:00 PM, we were informed that the rest of the site had power back, but building 240 still was not powering up, and that the building’s electrical contractor was en route to determine why. Traffic and weather conspired to delay their arrival until well after 5PM.
During this time, the tenants of the 240 datacenter discussed our options. We’ve had some recent experience with power outages and what it takes to successfully come back from a planned outage (and, sadly, some experience with unplanned outages as well). We determined that it would be counterproductive to try to bring the systems back online tonight, as we had no estimate of when power would return. And even once power was restored, we would still face a dependency chain: the chillers that cool the room need enough heat load in the room to keep running.
Because of the amount of time and effort it takes to get ALCF, LCRC, Magellan, and Beagle back into operation, and because there was no real estimate of when power would return or whether the chillers would be able to restart, we made the decision to postpone powering up the room until tomorrow morning, beginning at 6AM. Without the supercomputers and HPC clusters running, there is not enough heat being generated in the room to keep the chillers running; the chillers shut down, and the few systems still running overheat.
At 6AM tomorrow the building operations staff and the various sysadmins will converge on the data center to bring back the equipment housed there.
We’re still working to fix the niggling items that have not come back from building 221. I’ll send updates as I have them.
I’m hoping we’ll be back soon, but I still have no info on root cause, the actual scope of the outage, or an estimate of time to return to operation.
More details as we have them.
See the attached note from Cyber. Short story is if you’re not current on MacOS, you’ll start getting blocked by the proxy if you’re on Argonne Auth wifi.
Systems is taking care of machines we manage.
It appears that Apple has fallen victim to a vulnerability very similar to last year’s Stagefright on Android.
Using almost any delivery method (MMS, iMessage, Mail, web browsing, …) to get a trojaned TIFF image onto the device, an attacker can exploit a buffer overflow to run whatever the malware wants on the system.
This was patched in last week’s patch set from Apple. The patch needs to be applied to every iPhone, Mac, Apple TV, and even Apple Watch.
You’ll need to install one of these on your device.
El Capitan 10.11.6
Yosemite 10.10.5 with the latest patch set
Starting Tuesday, July 26, we will be adding MacOS 10.11.5 to the web filter’s outdated-software block list. You should already have this update applied.
For awareness: Apple still supports 10.10.5, but it is harder to tell whether a 10.10.5 machine is fully patched or still running the original release. Please make sure these machines are up to date. Anything below 10.10.4 is going to be blocked, including all of the 10.9 and 10.8 releases. Those upgrades should have been completed some time ago.
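If you want to confirm what a given Mac is running before the block takes effect, the version is one command away. A minimal sketch: `sw_vers` is the standard macOS tool, while the `version_ge` comparison helper below is our own illustration (it relies on `sort -V` for dotted-version ordering), not part of the OS.

```shell
# Print the installed macOS version (sw_vers ships with macOS; on other
# systems this falls through to "0" so the script still runs).
current="$(sw_vers -productVersion 2>/dev/null || echo "0")"
echo "Installed macOS version: $current"

# version_ge A B -- true if version A >= version B.
# Illustrative helper: sorts the two versions and checks B is the smaller.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if version_ge "$current" "10.11.6"; then
  echo "OK: at or above El Capitan 10.11.6"
else
  echo "Update needed before the proxy block takes effect"
fi
```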
On the web browser front,
Chrome is at version 51 on all platforms and headed to version 52. We will be blocking anything reporting version 49 or below.
Firefox is at version 47 for Stable release and 45.2 for Extended support release. Anything below those releases will also be blocked.
Patching should be routine, so this shouldn’t impact many systems.
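If you’d like to check what a machine is running from a terminal, the major browsers all report their version on the command line. A quick sketch, assuming the common Linux binary names (macOS app bundles use different paths); a browser that isn’t installed is simply reported as such:

```shell
# Report installed browser versions; a missing browser is noted, not an error.
for browser in google-chrome chromium-browser firefox; do
  if command -v "$browser" >/dev/null 2>&1; then
    "$browser" --version
  else
    echo "$browser: not installed here"
  fi
done
```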
v8 is now rebuilt. Feel free to test it out and let us know what pieces you think are missing from the new Trusty linux environment.
We’ve got a two-phase linux compute environment upgrade in process this year. Phase one is rolling out an Ubuntu 14.04 build (aka Trusty), and phase two will be rolling out a CentOS 7 build. Phase two won’t start until the end of summer, but phase one is already under way, with some desktops and servers serving as early tests.
The server v8.mcs.anl.gov is running a very old Ubuntu Build (10.04, aka Lucid), so we’re going to target that as the first of the compute servers to get the upgrade.
We’ll be taking the machine down on Monday morning and reinstalling it as Trusty, at which point we’ll announce when it’s back up and encourage you to use it and let us know what’s missing. This will allow us to fine-tune the Trusty environment and make sure all the machines running at that level are best suited to your needs.
What you need to do: If you’ve got data in /sandbox on v8, make sure you’ve copied it elsewhere, since that directory will be erased. If you’ve got cron jobs that run on v8, make a copy of your crontab so you can replicate it on the new v8.
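A quick way to capture both before Monday. The destination paths below are just examples; put the copies anywhere that isn’t v8’s local disk (home directories live on the home file servers, so they survive the reinstall):

```shell
# Save your crontab to your home directory; if you have no crontab,
# this just notes that instead of failing.
crontab -l > "$HOME/v8-crontab.backup" 2>/dev/null \
  || echo "no crontab found for $USER"

# Copy anything you need out of /sandbox -- it will be erased.
# (Example layout: a per-user directory under /sandbox.)
if [ -d "/sandbox/$USER" ]; then
  cp -a "/sandbox/$USER" "$HOME/v8-sandbox-backup"
else
  echo "no /sandbox/$USER directory to back up"
fi
```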
After we’re done getting that machine in a good state for your needs, we’ll announce the schedule of the updates to the rest of the compute servers. At that time we’ll also ask for more volunteers to have their desktops updated.
If this outage poses a significant problem for you, please let us know so we can reschedule it.
Hi, all! I’ve got a couple of quick announcements to knock out here, so let’s get to it.
Jasan left our group a few weeks ago for a fantastic opportunity, and we’ve just now been able to bring on his replacement. Please welcome Steve Verdone to the group! He’ll be manning the Help Desk on and off and handling user support issues with a focus on the Mac side of the world. Steve joins us from his prior gig at Apple as an Apple Expert. He’s getting up to speed on how we do things here, but I’m sure you’ll like him as much as we do – he’s a smart, friendly guy. Stop by and say hi.
Starting in July, we’re also going to be getting some additional help in getting the BIO IT environment inventoried, updated, and migrated into the ANL and CELS infrastructures. Jeff Hinthorn will be lending us some of his time over the next few months, splitting his duties between us and HEP. Jeff’s a seasoned Windows admin, and his time will almost entirely be focused on BIO through the end of this fiscal year. We’re still finalizing the schedule, but he’ll be at our IT desk in building 446 daily. I’ll send a separate note to BIO when the schedule is finalized.
240 conference rooms:
Based on some feedback I’ve been getting, I want to send a reminder out about booking the conference and meeting rooms in the main 240 building. Instructions on doing so can be found at http://tcs.anl.gov/for-tcs-tenants/meeting-and-conference-rooms-in-tcs/ and it’s worth paying special attention to the highlighted section:
When you reserve a room, pay attention to the email response you receive back. Unless you get an e-mail indicating the room has “accepted” the meeting, you have not reserved the room. If in doubt, check the “Web View” links on the calendars below, which update every 5 minutes.
We’ve had people who thought they’d reserved a room, but either ignored the response from the room saying it was already booked, or didn’t actually include the room in the invite, so it was never booked to begin with.
I’m investigating options for putting live schedule views at the entrances to the rooms to help avoid these issues in the future. More on that as it develops.
The core IT services provided by CELS Systems are back online. This includes the BIO infrastructure, the MCS general computing infrastructure, and any virtual machines CELS Systems hosts. Other larger systems (such as the compute clusters) will come back online as previously announced by their admin teams.
We had a few issues during the outage, which I’ll do a full post mortem on later in the week.
When you return to your desk on Monday, if your desktop isn’t working as you’d expect, first try rebooting it. If it still isn’t working, let us know at help or at x6813, or in person at the Help Desk.
Thanks, all. Enjoy the rest of your weekend, I know I will.
First announcement: https://mcssys.wordpress.com/2016/05/26/cels-systems-outage-june-17-19-2016/
Second announcement: https://mcssys.wordpress.com/2016/06/13/cels-systems-outage-june-17-19-2016-2/
For status updates during the outage you can follow our twitter account at https://twitter.com/mcssys
First, updates for BIO (MCS/CELS announcements follow this section):
I’m awaiting word that the switchover is complete, but we should have DHCP (network address assignment) switched over to CIS servers today, which means if a machine reboots during the outage, it will still come up. The BIO authentication server backup is running in the 221 data center and will stay up, so logins will continue to work. Y, Z, and X drives will be down. Linux systems will not be able to log in to BIO or ANL accounts, and linux shared file systems will be down.
The Canon copiers are now available on the CIS print server (which will stay up during the outage). You can see them here: \\printers.anl.gov. The printers installed are:
q104copier.sbc.anl.gov (iR-ADV 6055)
a102copier.bio.anl.gov (iR-ADV 6055)
If you normally print directly to the copiers, that will work as usual. The Xerox and other printers are not yet on that print server; we’ll get those moved over after the outage and retire “BIOPRINT”.
Kat will be on-site at our desk in 446 to assist tomorrow, and the team will be monitoring the ticket queue.
Updates for MCS/LCF/CELS:
The CELS user-facing linux environment will be down, as previously announced, including login and home file servers. We will issue a system-wide shutdown command to all linux workstations tomorrow morning. If you run a mac or a self-managed machine, please shut down the machine before you leave tonight. Mac file servers will shut down at COB today, as will our tape backup servers.
On Monday, if anything is not working as expected on your computer, please reboot first, then if that doesn’t fix the issue, report it to help or call the help desk at x6813.
The list of user-facing machines that will *not* go down is included below. The ALCF accounts webpages will have a brief downtime tomorrow for maintenance on the server, but will be up by afternoon. Note that repo.anl-external.org is one of the sites that will unfortunately be down during the outage.
The outage window is advertised to be until Sunday, though it’s very likely power will return on Saturday. However, it may take some time to get systems back, so please do not expect anything to be back in operation until we say it is. I will send a notice to this list, which will also post to the previously noted Twitter account and to the blog at https://mcssys.wordpress.com, when we believe things are back in normal operation.
Some systems not part of CELS core IT may not return until Monday – please watch for communications from the teams for those systems. Also note any all-clear I give is purely related to the IT systems – the building will not resume normal business operations until Monday morning.
These servers will not go down as part of the outage:
app001.cels.anl.gov (collab.cels.anl.gov – Confluence)
app003.cels.anl.gov (gitlab.cels.anl.gov – internal gitlab)
app006.cels.anl.gov (dev.esg.anl.gov, dev.esgf.anl.gov)
app007.cels.anl.gov (www.esg.anl.gov, www.esgf.anl.gov)
beehive0.mcs.anl.gov (waggle project)
beehive1.mcs.anl.gov (waggle project)
beehive2.mcs.anl.gov (waggle project)
ca.mcs.anl.gov (certificate authority)
caveat.mcs.anl.gov (MPICH svn/trac)
coredb.mcs.anl.gov (internal database server)
davmail.mcs.anl.gov (Exchange/WebDAV connector)
gust.mcs.anl.gov (WordPress sites)
jenkins.mcs.anl.gov (build test server)
kerdap-2.mcs.anl.gov (MCS/CELS kerberos/LDAP server)
kerdap.jlse.anl.gov (JLSE LDAP/kerberos server)
lic001.cels.anl.gov (License server for some software packages)
mon001.cels.anl.gov (monitoring server)
newnewman-1.mcs.anl.gov (email relay)
newnewman-2.mcs.anl.gov (email relay)
nginx.mcs.anl.gov (web proxy front-end for CELS-hosted websites)
owney.mcs.anl.gov (CELS-hosted mailman lists)
rdp.mcs.anl.gov (Windows terminal server)
rt.mcs.anl.gov (Request Tracker ticketing system)
squall.mcs.anl.gov (ALCF websites)
typhoon.mcs.anl.gov (JLSE wiki)
variant.mcs.anl.gov (svn/trac server)
wilbur.mcs.anl.gov (MCS webpages not hosted by CIS)
wind.mcs.anl.gov (wordpress, mediawiki sites)
xgitlab.cels.anl.gov (externally available gitlab server)
yubi-221.mcs.anl.gov (CELS One Time Password server)
(Previous announcement here: https://mcssys.wordpress.com/2016/05/26/cels-systems-outage-june-17-19-2016/)
Here’s an update on the state of things for the outage this Friday. Please read the above for context. Another reminder/announcement will be sent on Thursday.
Aside from what’s noted in the prior announcement, please note we will be turning off the Mac file servers on Thursday evening at close of business. Please make sure you have the files you need moved to Box or your local machine prior to 5PM on June 16.
The morning of June 17 we will issue a building-wide shutdown command to all linux workstations managed by us. If you self-manage your machine, or if you run a Mac or Windows machine, please shut it down before you leave on Thursday to ensure no data loss on the local disk. Remember, you won’t be coming into building 240 at all on Friday; it will be off limits to everyone (including Systems).
The list of machines that will stay up is largely the same as last time (https://mcssys.wordpress.com/2015/05/27/reminder-tcs-power-outage-june-1-2015/) with the addition of gitlab.cels.anl.gov and xgitlab.cels.anl.gov and some back-end services. I’ll post the complete list later this week.
Work to reduce the impact of this outage on your work is progressing well. Our expectation is that login services for bio.anl.gov workstations and DHCP (the service that provides the network address for your workstation) will remain up and running. BIO File servers will, however, be down. The BIO print server will also be down.
If you need to print something on Friday, you can connect to the canons from your web browser (ex: http://a102copier.bio.anl.gov) and print PDFs that way. Save the document you want to print as a PDF, go to the printer in your web browser, and click “End User Mode”, then login (no PIN). Click “Direct Print”, then under “Specify File”, choose “Browse” and select the PDF you want to print. Once the options below look like what you want, click “Start Printing”.
Kat will be over there through most of the day to help with issues, and we’ll be monitoring the support queue at help.