First of all, the actual cause of the problem was a bad network link. It was difficult to find, and we were just about to begin migrating the services out of the 221 data center into 240 when it was finally discovered.
Here’s the link to Networking’s summary of what happened: https://slack-files.com/T025KBW9U-F025WFY2W-0c97c9
Things started throwing alerts and misbehaving around 3:25PM on Tuesday. A few services were slow or unresponsive, and we eventually narrowed the culprit down to the database server for infrastructure services. This db server is set up in a very resilient and highly available environment in the 221 data center (which has generator-backed UPS, unlike 240). We started debugging, and it became evident that the system would become unresponsive whenever network hosts tried to contact it. (It became so unresponsive, in fact, that it wouldn’t even respond locally as long as it was accepting network connections.)
The redundancy and high availability weren’t helping us, since the entire environment over in that data center seemed to be affected. We couldn’t migrate the service to one of the other hypervisors, since they were all behaving the same way.
We called in the networking team after a couple of hours when it started to really smell like a network problem, and we all started working on it together. Unfortunately, as outlined in the PDF above, the actual problem was not presenting itself in any useful manner. By all indications, the network was working, and was not signaling errors indicating any problem.
After debugging into the night, rebooting various servers and hypervisors, we called it a night with a plan to start attacking it early in the morning, including moving the databases from 221 to 240. Just as we were wrapping up the data dump, Networking discovered the bad optics, and things sprang to life. Many services came back right away; others suffered fallout from our troubleshooting steps and reboots amidst the bad networking. Most things came back okay, though we’ve got one redundant RADIUS server that needs rebuilding.
Affected hosts and services were: most websites we host, many login machines (which mount NFS from said websites), trouble tickets (rt.mcs.anl.gov), account management, rdp.mcs.anl.gov (the Windows Terminal Server), and any other service that relies on coredb.mcs.anl.gov.
Thank you all for your patience during the outage, and sorry it took so long to diagnose the actual problem. If you have any questions, by all means let me know.
More details will come later today once we’ve hammered out what happened, but everything is back (or should be back shortly). Thanks for your patience, sorry for the troubles, and stand by for details.
Your humble team of tech monkeys
Still struggling to get things back up. We’re moving the database services over to a different data center in the hope that they’ll be able to start providing service there. There’s a lot of data to move. I’ll provide updates directly to Twitter from this point on, and reserve e-mail/blog posts for more detailed info. If you’re not already following, our Twitter is @mcssys (https://twitter.com/mcssys).
We’re still battling an issue we have yet to diagnose. At this point, we’re not expecting things to come back tonight. We’re working to move the services off the affected infrastructure to bring critical services back, and we’re putting outage pages in place for affected websites.
Once we’re through this, I’ll provide a detailed explanation of how things are set up and what went wrong for those who are interested in that. At the moment, we’re still struggling to figure out what exactly has gone wrong, since there’s no indication as to what’s causing the service failures.
Our primary MySQL database server is not allowing connections (or is only sporadically allowing them), which is causing a number of services to fail, including many websites. We’re working feverishly to get it working again, but we have no ETA at this point, since we’re still not able to get any real diagnostic info after multiple service and machine restarts.
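For the curious: the kind of check we keep running is, at its simplest, “does a TCP connection to the database port succeed at all?” Here’s a minimal sketch of such a probe using only Python’s standard library (the hostname below is illustrative, not our actual configuration):

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, etc.
        return False

# Example (hypothetical host): probe the default MySQL port, 3306.
# if can_connect("db.example.org", 3306):
#     print("server is at least accepting TCP connections")
```

A probe like this distinguishes “connection refused” and “connection hangs” from an application-level failure, which is roughly the sporadic behavior we’re seeing.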
More info as it becomes available.
If you’re running MacOS 10.9 (Mavericks), please run Software Update to get the latest version (10.9.2). Aside from some bugfixes (including improved Exchange interoperability), this contains a fix for a serious security issue in SSL certificate validation.
This update was just released, and should be available via the App Store/Software Update. It can also be found at Apple’s support downloads page here: http://support.apple.com/downloads/
The instructions for reserving equipment at https://wiki.mcs.anl.gov/IT/index.php/Mail#Reserving_Equipment have been updated to reflect the new addresses and information. Thanks!
This work is happening at noon today. Thanks.
As you may have heard, the lab is switching to a new system for time tracking. More announcements on that will be coming from others as we get closer to the switchover date, including training information.
In order to use the new app, you need to have a piece of software (Citrix Receiver) installed. We’ve already installed it on all the Macs we manage. If you self-manage your machine, however, it’s easy to install: the simplest way is to visit https://appgateway.anl.gov with your browser, which will detect whether you have the app and offer to install it if necessary. Right now, this URL is only accessible from inside Argonne; I’m looking into whether that’s intentional, as I was led to believe this was an “access anywhere” solution.
The software is available for Mac, Windows, Linux, Android, and iOS. (To install for Android or iOS, use the Google Play store or Apple App store and search for “Citrix Receiver”).
Documentation related to Dayforce can be found on Inside Argonne at http://inside.anl.gov/tools/applications/dayforce, including links to log in to the system. You won’t be able to log in to Dayforce until the division has been switched over to it, but there’s an “Inside Argonne” app there you can use to test things out.
Specific instructions for the Citrix software can be found at http://inside.anl.gov/tools/applications/citrix-gateway, including instructions for installing on Windows, Mac OS, Debian-based systems, and RPM-based systems. If you’re using a mobile client, after opening the Citrix Receiver software, point it at https://appgateway.anl.gov for app and login information.
This work has been postponed to Wednesday. Sorry for the short notice.