Dispatches From The Geeks

News and Announcements from the MCS Systems Group

Things are mostly back

CIS successfully revived our router, and most MCS hosts and services are back online. We’re still looking over the monitoring to identify what are actual issues vs. transient errors that will self-correct (some servers may not be properly syncing their time via NTP, for example), and will be addressing them as we find them.

If your local machine is misbehaving at this point, your first course of action should be to reboot. If that does not fix the issue, you can contact the help desk (systems@mcs.anl.gov) and we’ll get you taken care of.

I’ll post a more in-depth summary of what happened, but a blow-by-blow overview is available at http://www.mcs.anl.gov/systems/blog (aka http://mcssys.wordpress.com). Thanks for your patience, and we’ll be sure to let you know of any findings we make on how to prevent this sort of outage in the future, or at least minimize the length of the downtime when there’s a catastrophic hardware failure like we had.

Written by Craig Stacey

October 30, 2014 at 2:16 pm

Posted in Uncategorized

MCS Network Outage to continue

The unexpected power issue that caused this outage has caused a failure in a critical component of the MCS router. We’re currently working with CIS on getting this back online, either with a replacement supervisor card, or an ad-hoc temporary solution until a replacement can be installed. We’re currently estimating another couple of hours of downtime.

This outage is affecting all wired hosts on MCS networks, including most servers with a .mcs.anl.gov address.

Hosts on wifi can talk to other Argonne and external hosts. If your machine is wifi capable, that is the recommended workaround for now.

Written by Craig Stacey

October 30, 2014 at 12:05 pm

Posted in Uncategorized

Things are slowly coming back. If you can use wifi instead of a wired connection, you should be able to connect to the net.

Written by Craig Stacey

October 30, 2014 at 11:24 am

Posted in Uncategorized

A UPS failure has taken out various machines and networks. The UPS is back online, but we’re still waiting on other units to recover. More info when we have it.

Written by Craig Stacey

October 30, 2014 at 10:57 am

Posted in Uncategorized

Issue with wired desktop network

Late last week, we instituted a change in the network configuration for desktop networks. This particular change came with an unintended side effect that is resulting in some people’s computers not getting an address from the network when plugged in. In the past, if a host was not registered with us and was plugged into the network, it would get a “Visitor Network” address. Due to this configuration, people got accustomed to plugging in any device and having it work (though you would not get trusted access to lab resources such as inside.anl.gov).

For reasons we have yet to determine, this behavior is no longer being followed. We’re working with CIS to get it fixed, but in the interim the workaround is to register any unregistered machine. We should have announced this sooner, but we expected it to be fixed any day now. Alas, “any day now” has been stretching on for far too long.

If you have a computer that stopped being able to use the wired network, please fill out the following form, and we’ll get it registered as quickly as we can.

http://press3.mcs.anl.gov/virtualhelpdesk/request-ip-address/

We have identified a small number of hosts that *are* properly registered, but are not getting an address. We’re also working with CIS on rectifying those issues.

Sorry for the trouble, everyone. We’ll send an update when it’s fixed.

Written by Craig Stacey

October 30, 2014 at 10:13 am

Posted in Uncategorized

Primary Unix NFS fileserver back in operation

Greetings,

At 3:33 PM today, the fileserver that serves Unix home directories unexpectedly went offline for about one hour.

The system appears stable and all services should be available now. If your workstation is misbehaving, you should try rebooting it. If it’s still having issues, or if you notice anything else out of order, please alert the CELS/MCS help desk.

No data should have been lost.

I regret and apologize for the inconvenience.

I’m not entirely sure of all the gritty details of what went wrong at this time, but here are a few more details for the curious:

We are in the process of moving the file services to a newer system. This afternoon, as part of preparing for this process, I attempted to create a new RAID array on 4 new hard drives attached to the system. Unfortunately, one of the drives was faulty and something about the creation process caused the system to become unresponsive. This server is pretty long in the tooth and hadn’t been rebooted in a long time. The boot process was not stable, and we needed to reconfigure some settings in the BIOS in order for the system to recognize the correct boot environment.

Again, I’m sorry for the interruption in service.

Written by Craig Stacey

October 16, 2014 at 4:58 pm

Posted in Uncategorized

Primary Unix File Server is down. We are working on it. Will send a notice when it’s back.

Written by Craig Stacey

October 16, 2014 at 4:19 pm

Posted in Uncategorized

Unplanned network outage on Oct. 02, 2014

Last night, shortly before 6:00 pm CT, one of the network routers in building 240 lost connectivity to another in 221. This caused a breakdown in large portions of the lab’s network fabric, both internally and externally. CIS tracked down and fixed the problem shortly after 8:00 pm CT. The result was that many, if not all, of our services lost connectivity during this time. Most services resumed normally after the issue was fixed, but a handful were left in unknown states until this morning. We believe we have now restored all services, but if you find something that isn’t responding or is responding erratically, please let us know and we’ll get it fixed.

CIS is investigating how such a disruption occurred since the network is designed to mitigate these types of disruptions.

Written by Craig Stacey

October 3, 2014 at 1:31 pm

Posted in Uncategorized

“Shell Shock” bash exploit updates

Many of you are learning of an exploit to the bash shell that was revealed last week, so I thought it would be worthwhile to post a summary of what’s been happening and what you need to do.

First up, the exploit in question allows an attacker to take advantage of some poor coding in the Bourne Again Shell (bash) to launch processes on any servers or services that are exposed to the internet, such as web servers or poorly configured workstations.

We’ve been patching servers we manage since the announcement, and are confident we’re safe from attackers on the servers that we’ve got externally exposed.

Generally, if you’ve got a machine you’re managing you shouldn’t have a big worry unless you’re running a web server on it and allow that web server to run scripts that call a bash shell.

In any case, patching your machine is important. Linux distributions have had patches in the pipeline almost immediately, so if you’re running a current build of Linux you should be able to update via your regular package manager (yum, apt, etc.). If you are running an unsupported distribution, you’ll need to download and compile a new bash to be safe. Contact systems@mcs.anl.gov if you require assistance with that.
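If you want to confirm that a patch took effect, the widely circulated environment-variable probe for the original bug (CVE-2014-6271) can be wrapped in a few lines of Python. This is only a rough sketch, not an official test: the function names and the marker string are made up for illustration, it assumes Python 3.7 or later, and it checks only the first of the reported issues. Passing it does not prove a bash is safe from the follow-on vulnerabilities.

```python
import shutil
import subprocess

# Marker printed only if bash executes code that trails a function
# definition imported from the environment (the original Shell Shock bug).
# The marker text itself is arbitrary and chosen for this sketch.
MARKER = "vulnerable"


def output_shows_vulnerable(stdout: str) -> bool:
    """Return True if the probe's output contains the marker on its own line."""
    return MARKER in stdout.splitlines()


def bash_is_vulnerable(bash_path: str = "bash") -> bool:
    """Run the probe against the given bash binary (hypothetical helper)."""
    # A patched bash treats x as an ordinary string variable and never
    # runs the trailing 'echo'.
    env = {"PATH": "/usr/bin:/bin", "x": "() { :;}; echo " + MARKER}
    result = subprocess.run(
        [bash_path, "-c", "echo probe finished"],
        env=env,
        capture_output=True,
        text=True,
        timeout=10,
    )
    return output_shows_vulnerable(result.stdout)


if __name__ == "__main__":
    if shutil.which("bash") is None:
        print("bash not found on PATH")
    elif bash_is_vulnerable():
        print("bash looks VULNERABLE to the original exploit; patch it")
    else:
        print("bash passed this probe (later variants are not checked)")
```

Run it once before and once after updating. If it still reports a vulnerable bash after you update, check that the patched binary is the one actually on your PATH.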

Apple released some patches for supported OS versions to address some of the vulnerabilities, but there are still some that need addressing, so we expect to see more updates. The updates are not yet in the OS X automatic update package stream, but for those of you who manage your own machines, you can find the updates below. Also, check http://support.apple.com/downloads/#macos for future updates in the next few days.

If we manage your MacOS machine, we’ll take care of these security updates for you.

Specific updates can be found here:

* Mavericks 10.9: http://support.apple.com/downloads/DL1769/en_US/BashUpdateMavericks.dmg
* Mountain Lion 10.8: http://support.apple.com/downloads/DL1768/en_US/BashUpdateMountainLion.dmg
* Lion 10.7: http://support.apple.com/downloads/DL1767/en_US/BashUpdateLion.dmg

Anything older and you’re running an unsupported and unpatched OS. It should be upgraded.

Microsoft Windows users are only affected if they are running Microsoft Unix services or Cygwin. In either case, follow the update procedures for your installation.

Thanks!

Written by Craig Stacey

September 30, 2014 at 5:03 pm

Posted in Uncategorized

Proposed FY2015 Maintenance Weekend Schedule

CIS is proposing the following dates for maintenance weekends. During these weekends, various lab services can be unavailable, including networking and business systems. Occasionally, the CELS Systems Team will schedule our maintenance activities to coincide with these dates, but that is flexible and should not necessarily affect your decision on whether these dates pose an insurmountable problem.

Please look over the proposed dates and let me know if there are reasons the lab should not schedule maintenance activities during those times. If so, please suggest an alternate timeframe for any to which you object.

Thanks!

Proposed Dates:

  • November 7-9, 2014
  • January 16-18, 2015 (APS Maintenance period)
  • May 15-17, 2015 Network Maintenance (APS Maintenance period)
  • August 14-16, 2015

Written by Craig Stacey

September 9, 2014 at 4:30 pm

Posted in Uncategorized
