CIS successfully revived our router, and most MCS hosts and services are back online. We’re still looking over the monitoring to identify what are actual issues vs. transient errors that will self-correct (some servers may not be properly syncing their time via NTP, for example), and will be addressing them as we find them.
If your local machine is misbehaving at this point, your first course of action should be to reboot. If that does not fix the issue, you can contact the help desk (email@example.com) and we’ll get you taken care of.
I’ll post a more in-depth summary of what happened, but a blow-by-blow overview is available at http://www.mcs.anl.gov/systems/blog (aka http://mcssys.wordpress.com). Thanks for your patience, and we’ll be sure to let you know of any findings we make on how to prevent this sort of outage in the future, or at least minimize the length of the downtime when there’s a catastrophic hardware failure like we had.
The unexpected power issue behind this outage caused a failure in a critical component of the MCS router. We’re working with CIS on getting it back online, either with a replacement supervisor card or with an ad-hoc temporary solution until a replacement can be installed. We’re currently estimating another couple of hours of downtime.
This outage is affecting all wired hosts on MCS networks, including most servers with a .mcs.anl.gov address.
Hosts on wifi can talk to other Argonne and external hosts. If your machine is wifi capable, that is the recommended workaround for now.
Things are slowly coming back. If you can use wifi instead of the wired network, you should be able to connect to the net.
A UPS failure has taken out various machines and networks. The UPS is back online, but we’re still waiting on other units to recover. More info when we have it.
Late last week, we instituted a change in the network configuration for desktop networks. This particular change came with an unintended side effect that is resulting in some people’s computers not getting an address from the network when plugged in. In the past, if a host was not registered with us and was plugged into the network, it would get a “Visitor Network” address. Due to this configuration, people got accustomed to plugging in any device and having it work (though you would not get trusted access to lab resources such as inside.anl.gov).
For reasons we have yet to determine, this behavior is no longer being followed. We’re working with CIS to get it fixed, but in the interim the workaround is to register any unregistered machine. We should have announced this sooner, but we expected it to be fixed any day now. Alas, “any day now” has been stretching on for far too long.
If you have a computer that stopped being able to use the wired network, please fill out the following form, and we’ll get it registered as quickly as we can.
We have identified a small number of hosts that *are* properly registered, but are not getting an address. We’re also working with CIS on rectifying those issues.
Sorry for the trouble, everyone. We’ll send an update when it’s fixed.
At 3:33 PM today the fileserver that serves unix home directories unexpectedly went offline for about one hour.
The system appears stable and all services should be available now. If your workstation is misbehaving, you should try rebooting it. If it’s still having issues, or if you notice anything else out of order, please alert the CELS/MCS help desk.
No data should have been lost.
I regret and apologize for the inconvenience.
I’m not entirely sure of all the gritty details of what went wrong at this time, but here are a few more details for the curious:
We are in the process of moving the file services to a newer system. This afternoon, as part of preparing for this process, I attempted to create a new RAID array on 4 new hard drives attached to the system. Unfortunately, one of the drives was faulty and something about the creation process caused the system to become unresponsive. This server is pretty long in the tooth and hadn’t been rebooted in a long time. The boot process was not stable, and we needed to reconfigure some settings in the BIOS in order for the system to recognize the correct boot environment.
Again, I’m sorry for the interruption in service.
Last night, shortly before 6:00 pm CT, one of the network routers in building 240 lost connectivity to another in 221. This caused a breakdown in large portions of the lab’s network fabric, both internally and externally, and many, if not all, of our services lost connectivity as a result. CIS tracked down and fixed the problem shortly after 8:00 pm CT. Most services resumed normally after the issue was fixed, but a handful were left in unknown states until this morning. We believe we have now restored all services, but if you find something that isn’t responding or is responding erratically, please let us know and we’ll get it fixed.
CIS is investigating how such a disruption occurred since the network is designed to mitigate these types of disruptions.
Many of you are learning of an exploit to the bash shell that was revealed last week, so I thought it would be worthwhile to post a summary of what’s been happening and what you need to do.
First up, the exploit in question allows an attacker to take advantage of some poor coding in the Bourne Again Shell (bash), to launch processes on any servers or services that are exposed to the internet, such as web servers or poorly configured workstations.
We’ve been patching servers we manage since the announcement, and are confident we’re safe from attackers on the servers that we’ve got externally exposed.
Generally, if you’ve got a machine you’re managing you shouldn’t have a big worry unless you’re running a web server on it and allow that web server to run scripts that call a bash shell.
In any case, patching your machine is important. Linux distributions had patches in the pipeline almost immediately, so if you’re running a current build of Linux you should be able to update via your regular package manager (yum, apt, etc.). If you are running an unsupported distribution, you’ll need to download and compile a new bash to be safe. Contact firstname.lastname@example.org if you require assistance with that.
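If you want a quick sanity check after patching, here is a rough sketch in Python. It assumes the exploit in question is the original "Shellshock" bug (CVE-2014-6271) and uses the widely circulated test of smuggling a function definition with trailing commands into bash through an environment variable. The function names and the marker string are made up for illustration, and a clean result only covers that first bug, not the related ones found afterward, so it is no substitute for installing the updates.

```python
import os
import subprocess

MARKER = "VULNERABLE"  # hypothetical marker string, chosen for this sketch


def output_indicates_vulnerable(stdout):
    """Return True if bash executed the command smuggled in via the environment."""
    return MARKER in stdout


def check_bash(bash_path="bash"):
    """Run the classic CVE-2014-6271 probe against the given bash binary.

    Returns True if the probe fired, False if it did not, and None if
    bash could not be run at all (not installed, or it hung).
    """
    env = dict(os.environ)
    # An unpatched bash imports this variable as a function definition and
    # then runs the trailing "echo" while starting up.
    env["x"] = "() { :;}; echo " + MARKER
    try:
        result = subprocess.run(
            [bash_path, "-c", "echo test"],
            env=env,
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return None
    return output_indicates_vulnerable(result.stdout)


if __name__ == "__main__":
    status = check_bash()
    if status is None:
        print("Could not run bash on this machine.")
    elif status:
        print("bash looks vulnerable to CVE-2014-6271; patch it now.")
    else:
        print("This probe did not fire; keep installing updates anyway.")
```

If you have more than one bash installed (for example, one from the OS and one you compiled yourself), pass each path to check_bash() in turn.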
Apple released some patches for supported OS versions to address some of the vulnerabilities, but there are still some that need addressing, so we expect to see more updates. The updates are not yet in the OS X automatic update package stream, but for those of you who manage your own machines, you can find the updates below. Also, check http://support.apple.com/downloads/#macos for future updates in the next few days.
If we manage your MacOS machine, we’ll take care of these security updates for you.
Specific updates can be found here:
* Mavericks 10.9: http://support.apple.com/downloads/DL1769/en_US/BashUpdateMavericks.dmg
* Mountain Lion 10.8: http://support.apple.com/downloads/DL1768/en_US/BashUpdateMountainLion.dmg
* Lion 10.7: http://support.apple.com/downloads/DL1767/en_US/BashUpdateLion.dmg
Anything older and you’re running an unsupported and unpatched OS. It should be upgraded.
Microsoft Windows users are only affected if they are running Microsoft Unix services or Cygwin. In either case, follow the update procedures for your installation.
CIS is proposing the following dates for maintenance weekends. During these weekends, various lab services can be unavailable, including networking and business systems. Occasionally, the CELS Systems Team will schedule our maintenance activities to coincide with these dates, but that is flexible and should not necessarily affect your decision on whether these dates pose an insurmountable problem.
Please look over the proposed dates and let me know if there are reasons the lab should not schedule maintenance activities during those times. If so, please suggest an alternate timeframe for any to which you object.
- November 7-9, 2014
- January 16-18, 2015 (APS Maintenance period)
- May 15-17, 2015 Network Maintenance (APS Maintenance period)
- August 14-16, 2015