Just a reminder the outage announced on Friday is happening at 5PM today. Thanks!
Hi, all. Our Confluence server is a few releases behind the latest version, so we’d like to take the server down and bring it up to current. We’re not expecting the upgrade to take that long, but we’re making the outage window 3 hours, from 5PM until 8PM on Tuesday, August 30. During this time, https://collab.cels.anl.gov will be unavailable. No data will be lost, though after the upgrade you may find some of the options for using it have changed (menu items may have moved, etc.)
If this timing poses a problem for you, please let us know so we can reschedule as necessary.
A hard drive failed in thrash, and in rebuilding the machine we moved it to the new trusty build (Ubuntu 14.04). This is functionally a completely fresh install, with no carryover except the hostname and SSH keys, so any crontabs or other local data will not be there. As we move more servers to trusty, we’ll let you know. If you find software packages you need that aren’t installed, let us know and we’ll get them installed across the whole trusty environment.
At 2:28 PM today, the lab suffered a partial site-wide power outage. I do not yet have the exact cause, though it was no doubt related to the incredible storm we were seeing. I happened to be recording the storm outside my window right as the power went off, so you can hear my exasperation in the recording: https://www.dropbox.com/s/jkchw3pppkqkg6k/2016-07-28%2019.28.43.mp4?dl=0.
This took out all infrastructure housed in 240 that was not backed up by UPS. The UPS-backed systems stayed up for a short time, until the UPS ran out of juice and could no longer sustain the load; at that point, those systems went down as well.
CELS Systems maintains a presence in the 221 data center to house critical servers and services to be able to ride out outages of this nature. An overzealous team member, in an attempt to cleanly shut down the systems in 240 that were going to run out of UPS, inadvertently also shut down the systems in 221.
So, while a team of us ran around 240 trying to ascertain the situation, another team hoofed it over to 221 to bring affected systems back online.
At this point, we were dark on all fronts – supercomputers, HPC clusters, core computing, base IT. Through some magic of UPSes and fairy dust, the wifi and phones in some parts of the building remained operational through this outage (until the batteries in those UPSes also went down).
At 4:00, we were informed that the rest of the site had power back, but building 240 still was not powering up, and that the building’s electrical contractor was en route to determine why. Traffic and weather conspired to delay their arrival until well after 5PM.
During this time, the tenants of the 240 datacenter discussed our options. We’ve had some recent experience with power outages and what it takes to successfully come back from a planned outage (and, sadly, some experience with unplanned outages as well). We determined that it would be counterproductive to try to bring the systems back online tonight, as we had no estimate of when power would return. And even once power was restored, a dependency chain would kick in: the chillers need enough heat load in the room to keep running, and the room needs the chillers to stay cool.
Because of the amount of time and effort it takes to get ALCF, LCRC, Magellan, and Beagle back into operation, and because there was no real estimate of when power would return or whether the chillers would be able to return to operation, we made the decision to postpone powering up the room until tomorrow morning, beginning at 6AM. Without the supercomputers and HPC clusters running, there is not enough heat being generated in the room to keep the chillers running, which results in them shutting down, which in turn causes the few systems that are running to overheat.
At 6AM tomorrow the building operations staff and the various sysadmins will converge on the data center to bring the equipment housed there back online.
We’re still working to fix the niggling items that have not come back from building 221. I’ll send updates as I have them.
I’m hoping we’ll be back soon, but I still have no info on root cause, actual scope of the outage, or an estimate of time to return to operation.
More details as we have them.
See the attached note from Cyber. Short story is if you’re not current on MacOS, you’ll start getting blocked by the proxy if you’re on Argonne Auth wifi.
Systems is taking care of machines we manage.
It appears that Apple has fallen victim to a vulnerability very similar to StageFright, which was found on Android systems last year.
Using almost any delivery method (MMS, iMessage, Mail, web browsing, …) to get a trojan TIFF image onto the device, a buffer overflow can be exploited to run whatever the malware wants on the system.
This was patched in last week’s patch set from Apple. It needs to be applied to any iPhone, Mac, AppleTV, and even Apple Watch.
You’ll need to install one of these on your device:
El Capitan 10.11.6
The latest patch set for Yosemite (10.10.5)
Starting Tuesday, July 26, we will be adding MacOS 10.11.5 to the web filter’s outdated-software block list. You should already have the 10.11.6 update applied.
So people are aware: Apple still supports 10.10.5, but patch detection there cannot easily tell an up-to-date install from the original release, so please make sure those machines are current. Anything below 10.10.4 is going to be blocked, including all of the 10.9 and 10.8 releases. Those upgrades should have been completed some time ago.
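If you want to check a Mac from the command line, `sw_vers` reports the installed OS version. A rough sketch (the version patterns here are just the minimums from this note, and the non-Mac fallback value is a placeholder for illustration):

```shell
# Report the installed macOS version and compare it against the minimums above.
# sw_vers only exists on a Mac; the fallback value is a placeholder for illustration.
if command -v sw_vers >/dev/null 2>&1; then
  ver=$(sw_vers -productVersion)
else
  ver="10.11.6"   # placeholder when run on a non-Mac
fi

case "$ver" in
  10.11.[6-9]*|10.10.[4-9]*) echo "OK: $ver meets the minimum" ;;
  *) echo "UPDATE NEEDED: $ver will be blocked by the proxy" ;;
esac
```

This is only a crude pattern match against the versions listed above, not an official check; System Preferences → App Store remains the authoritative way to see pending updates.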
On the web browser front,
Chrome is at version 51 across the board and headed to version 52. We will be blocking anything reporting version 49 or below.
Firefox is at version 47 for the Stable release and 45.2 for the Extended Support Release. Anything below those versions will also be blocked.
Patching should be routine, so this shouldn’t impact many systems.
v8 is now rebuilt. Feel free to test it out and let us know what pieces you think are missing from the new Trusty linux environment.
We’re in the middle of a two-phase Linux compute environment upgrade this year. Phase one is rolling out an Ubuntu 14.04 build (aka Trusty), and phase two will be rolling out a CentOS 7 build. Phase two won’t start until the end of summer, but phase one is already under way, with some desktops and servers serving as early tests.
The server v8.mcs.anl.gov is running a very old Ubuntu build (10.04, aka Lucid), so we’re going to target it as the first of the compute servers to get the upgrade.
We’ll be taking the machine down on Monday morning, and reinstalling it as Trusty, at which point we’ll announce when it’s back up and encourage you to use it and let us know what’s missing from it. This will allow us to fine tune the Trusty environment and make sure all the machines running at that level are best suited to your needs.
What you need to do: If you’ve got data in /sandbox on v8, make sure you’ve got it copied elsewhere, since it will be erased. If you’ve got cron jobs that run on v8, make a copy of your crontab so you can replicate it on the new v8.
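A quick sketch of that backup step (the destination filenames and the /sandbox directory name are just example placeholders, not mandated paths):

```shell
# Save a copy of your current crontab on v8, if you have one.
# The shell creates the backup file either way; the fallback message covers
# the case where no crontab is installed for this user.
crontab -l > "$HOME/v8-crontab-backup.txt" 2>/dev/null || echo "no crontab found for $USER"

# Copy anything you need out of /sandbox to your home directory.
# "mydata" is a hypothetical example directory.
# cp -a /sandbox/mydata "$HOME/v8-sandbox-backup/"

echo "backup step finished"
```

After the reinstall, `crontab $HOME/v8-crontab-backup.txt` would load the saved entries back onto the new v8.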
After we’re done getting that machine in a good state for your needs, we’ll announce the schedule of the updates to the rest of the compute servers. At that time we’ll also ask for more volunteers to have their desktops updated.
If this outage poses a significant problem for you, please let us know so we can reschedule it.
Hi, all! I’ve got a couple of quick announcements to knock out here, so let’s get to it.
Jasan left our group a few weeks ago for a fantastic opportunity, and we’ve just now been able to bring on his replacement. Please welcome Steve Verdone to the group! He’ll be manning the Help Desk on and off and handling user support issues, with a focus on the Mac side of the world. Steve joins us from his prior gig at Apple as an Apple Expert. He’s getting up to speed on how we do things here, but I’m sure you’ll like him as much as we do – he’s a smart, friendly guy. Stop by and say hi.
Starting in July, we’re also going to be getting some additional help in getting the BIO IT environment inventoried, updated, and migrated into the ANL and CELS infrastructures. Jeff Hinthorn will be lending us some of his time over the next few months, splitting his duties between us and HEP. Jeff’s a seasoned Windows admin and his time will almost entirely be focused on BIO through the end of this fiscal year. We’re still finalizing the schedule, but he’ll be spending his time in building 446 at our IT desk over there daily. I’ll send a separate note to BIO when the schedule is finalized.
240 conference rooms:
Based on some feedback I’ve been getting, I want to send a reminder out about booking the conference and meeting rooms in the main 240 building. Instructions on doing so can be found at http://tcs.anl.gov/for-tcs-tenants/meeting-and-conference-rooms-in-tcs/ and it’s worth paying special attention to the highlighted section:
When you reserve a room, pay attention to the email response you receive back. Unless you get an e-mail indicating the room has “accepted” the meeting, you have not reserved the room. If in doubt, check the “Web View” links on the calendars below, which update every 5 minutes.
We’ve had people who thought they had reserved a room, but either ignored the response from the room saying it was already booked, or didn’t actually include the room in the invite, so it was never booked in the first place.
I’m investigating options for putting live schedule views at the entrances to the rooms to help avoid these issues in the future. More on that as it develops.