We are working to bring it back up. We’re hopeful it will be back online within the hour; I will send updates if that changes.
CIS successfully revived our router, and most MCS hosts and services are back online. We’re still looking over the monitoring to separate actual issues from transient errors that will self-correct (some servers may not be properly syncing their time via NTP, for example), and will address the real ones as we find them.
If your local machine is misbehaving at this point, your first course of action should be to reboot. If that does not fix the issue, you can contact the help desk (firstname.lastname@example.org) and we’ll get you taken care of.
I’ll post a more in-depth summary of what happened, but a blow-by-blow overview is available at http://www.mcs.anl.gov/systems/blog (aka http://mcssys.wordpress.com). Thanks for your patience, and we’ll be sure to let you know of any findings we make on how to prevent this sort of outage in the future, or at least minimize the length of the downtime when there’s a catastrophic hardware failure like we had.
The unexpected power issue behind this outage caused a failure in a critical component of the MCS router. We’re working with CIS on getting this back online, either with a replacement supervisor card or with an ad-hoc temporary solution until a replacement can be installed. We’re currently estimating another couple of hours of downtime.
This outage is affecting all wired hosts on MCS networks, including most servers with a .mcs.anl.gov address.
Hosts on wifi can talk to other Argonne and external hosts. If your machine is wifi capable, that is the recommended workaround for now.
Things are slowly coming back. If you can use wifi instead of a wired connection, you should be able to connect to the network.
A UPS failure has taken out various machines and networks. The UPS is back online, but we’re still waiting on other units to recover. More info when we have it.
Late last week, we instituted a change in the network configuration for desktop networks. This particular change came with an unintended side effect that is resulting in some people’s computers not getting an address from the network when plugged in. In the past, if a host was not registered with us and was plugged into the network, it would get a “Visitor Network” address. Due to this configuration, people got accustomed to plugging in any device and having it work (though you would not get trusted access to lab resources such as inside.anl.gov).
For reasons we have yet to determine, this behavior is no longer being followed. We’re working with CIS to get it fixed, but in the interim the workaround is to register any unregistered machine. We should have announced this sooner, but we expected it to be fixed any day now. Alas, “any day now” has been stretching on for far too long.
If you have a computer that stopped being able to use the wired network, please fill out the following form, and we’ll get it registered as quickly as we can.
We have identified a small number of hosts that *are* properly registered, but are not getting an address. We’re also working with CIS on rectifying those issues.
Sorry for the trouble, everyone. We’ll send an update when it’s fixed.
At 3:33 PM today the fileserver that serves unix home directories unexpectedly went offline for about one hour.
The system appears stable and all services should be available now. If your workstation is misbehaving, you should try rebooting it. If it’s still having issues, or if you notice anything else out of order, please alert the CELS/MCS help desk.
No data should have been lost.
I apologize for the inconvenience.
I’m not entirely sure of all the gritty details of what went wrong at this time, but here are a few more details for the curious:
We are in the process of moving the file services to a newer system. This afternoon, as part of preparing for this process, I attempted to create a new RAID array on 4 new hard drives attached to the system. Unfortunately, one of the drives was faulty and something about the creation process caused the system to become unresponsive. This server is pretty long in the tooth and hadn’t been rebooted in a long time. The boot process was not stable, and we needed to reconfigure some settings in the BIOS in order for the system to recognize the correct boot environment.
Again, I’m sorry for the interruption in service.
Last night, shortly before 6:00 pm CT, one of the network routers in building 240 lost connectivity to another in 221. This caused a breakdown in large portions of the lab’s network fabric, both internally and externally. CIS tracked down and fixed the problem shortly after 8:00 pm CT. As a result, many, if not all, of our services lost connectivity during this time. Most services resumed normally after the issue was fixed, but a handful were left in unknown states until this morning. We believe we have now restored all services, but if you find something that isn’t responding or is responding erratically, please let us know and we’ll get it fixed.
CIS is investigating how such a disruption occurred since the network is designed to mitigate these types of disruptions.