Dispatches From The Geeks

News and Announcements from the MCS Systems Group

Network issues fixed

The cause of our issues was an external Denial of Service attack based around NTP (Network Time Protocol). It was targeted at a server here that used to provide that service to the outside world. We haven’t run an externally-facing NTP server in years, certainly not since we moved from 221. Blocking all access to that host resolved the issue, but created new ones, since we have hosts outside our firewall that rely on being able to talk to that host.

Eventually, we settled on a combination of blocking and allowed access that stopped the DoS without overtaxing the CPUs on the routers to the point that they dropped packets.
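For the curious, the idea of "block NTP from the world, but still let a few known outside hosts through" can be sketched as below. This is only an illustration of the allowlist logic, not our actual router configuration: the function name and the addresses (taken from the reserved documentation ranges) are made up for the example.

```python
import ipaddress

# Hypothetical allowlist. These are documentation-range addresses,
# not real MCS hosts; the real rules live on the routers.
ALLOWED_NTP_CLIENTS = [
    ipaddress.ip_network("192.0.2.0/28"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def allow_ntp_request(src_ip, allowed=ALLOWED_NTP_CLIENTS):
    """Return True if an NTP request from src_ip should reach the host.

    Anything not explicitly listed is dropped, which is what stops
    traffic from arbitrary external sources.
    """
    addr = ipaddress.ip_address(src_ip)
    return any(addr in net for net in allowed)
```

With these example values, `allow_ntp_request("192.0.2.5")` passes because the address falls inside the first listed network, while an unlisted source such as `203.0.113.9` is dropped.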

Some services may need a kick, and we’re hitting them as our monitoring tells us, but if you see something out of the ordinary, please let us know at systems@mcs.anl.gov.


Written by Craig Stacey

March 10, 2014 at 2:55 pm

Posted in Uncategorized

We’re troubleshooting a networking issue that’s affecting connections between anything inside the MCS network and anything outside it. More details as they emerge.

Written by Craig Stacey

March 10, 2014 at 1:07 pm

Posted in Uncategorized

Summary of the service outage from March 4th and 5th

First of all, the actual cause of the problem was a bad network link.  It was difficult to find, and we were just about to begin migrating the services out of the 221 data center into 240 when it was finally discovered.

Here’s the link to Networking’s summary of what happened: https://slack-files.com/T025KBW9U-F025WFY2W-0c97c9

We discovered things were starting to throw alerts and misbehave around 3:25 PM on Tuesday. There were a few services that were being slow or unresponsive, and we eventually narrowed down the culprit to the database server for infrastructure services.  This db server is set up in a very resilient and highly available environment in the 221 data center (which has generator-backed UPS, unlike 240).  We started debugging this, and it became evident that the system became unresponsive whenever network hosts tried to contact it.  (It became so unresponsive, in fact, that it wouldn’t even respond locally as long as it was accepting network connections.)

The redundancy and high availability weren’t helping us, since the entire environment over in the data center seemed to be affected.  We couldn’t migrate the service to one of the other hypervisors, since they were all behaving the same way.

We called in the networking team after a couple of hours when it started to really smell like a network problem, and we all started working on it together.  Unfortunately, as outlined in the PDF above, the actual problem was not presenting itself in any useful manner.  By all indications, the network was working, and was not signaling errors indicating any problem.

After debugging into the night and rebooting various servers and hypervisors, we called it a night with the plan to start attacking it early in the morning, including moving the databases from 221 to 240.  Just as we were wrapping up the data dump, Networking discovered the bad optics, and things sprang to life.  Many services came back right away; for others, we had to deal with the fallout from our troubleshooting steps and reboots amidst bad networking.  Most things came back okay, though we’ve got one redundant RADIUS server that needs rebuilding.

Affected hosts and services were: most websites we host, many login machines (which mount NFS from said websites), trouble tickets (rt.mcs.anl.gov), account management, rdp.mcs.anl.gov (Windows Terminal Server), and any other service that relies on coredb.mcs.anl.gov.

Thank you all for your patience during the outage, and sorry it took so long to diagnose the actual problem.  If you have any questions, by all means let me know.

Written by Craig Stacey

March 5, 2014 at 5:55 pm

Posted in Uncategorized

And we’re back!

More details will come later today once we’ve hammered out what happened, but everything is back (or should be back shortly). Thanks for your patience, sorry for the troubles, and stand by for details.

Your humble team of tech monkeys

Written by Craig Stacey

March 5, 2014 at 10:07 am

Posted in Uncategorized

Status update

Still struggling to get things back up. We’re moving the database services over to a different datacenter in the hopes it will be able to start providing service there. There’s a lot of data to move. I’ll provide updates directly to Twitter from this point on, and reserve e-mail/blog posts for more detailed info. If you’re not already following, our Twitter account is @mcssys (https://twitter.com/mcssys).

Written by Craig Stacey

March 5, 2014 at 8:49 am

Posted in Uncategorized

MySQL and related services still down

We’re still battling an issue we have yet to diagnose. At this point, we’re not expecting things to come back tonight. We’re working to move the services off the affected infrastructure to bring the critical services back. We’re getting outage pages into place for affected websites.

Once we’re through this, I’ll provide a detailed explanation of how things are set up and what went wrong for those who are interested in that. At the moment, we’re still struggling to figure out what exactly has gone wrong, since there’s no indication as to what’s causing the service failures.

Written by Craig Stacey

March 4, 2014 at 9:25 pm

Posted in Uncategorized

Major outage affecting MCS systems

Our primary MySQL database server is not allowing connections (or only sporadically allowing them), which is causing a number of services to fail, including many websites. We’re working feverishly on getting it working again, but we have no ETA at this point since we’re still not able to get any real diagnostic info after multiple service and machine restarts.
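If you want to see what "sporadically allowing connections" looks like from a client, a rough sketch is to try a handful of plain TCP connections and count how many succeed. This is not our monitoring tooling: the `probe` function is made up for illustration, and 3306 is just MySQL's default port, so treat both as assumptions.

```python
import socket


def probe(host, port=3306, attempts=5, timeout=2.0,
          connect=socket.create_connection):
    """Count how many of `attempts` TCP connections to host:port succeed.

    `connect` is injectable so the logic can be exercised without a
    real network; by default it uses socket.create_connection.
    """
    ok = 0
    for _ in range(attempts):
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # Covers refused connections, timeouts, and unreachable hosts.
            continue
        conn.close()
        ok += 1
    return ok
```

A result somewhere between zero and `attempts` matches the sporadic behavior described above. Note that this only checks that a TCP connection opens, not that the database answers queries.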

More info as it becomes available.

Written by Craig Stacey

March 4, 2014 at 4:57 pm

Posted in Uncategorized

Apple update available for Mavericks (10.9.2)

If you’re running MacOS 10.9 (Mavericks), please run software update to get the latest version (10.9.2). Aside from some bug fixes (including improved Exchange interoperability), this contains a fix for a serious security issue with SSL certificate validation.
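If you’d rather check from a script whether a machine is on a 10.9 release older than 10.9.2, something like the following sketch works. The function names are made up for this example, and it only looks at the 10.9 series since that’s what this update covers.

```python
FIXED = (10, 9, 2)  # first Mavericks release containing the fix


def parse_version(text):
    """Turn a string like '10.9.1' into a tuple of ints, (10, 9, 1)."""
    return tuple(int(part) for part in text.split("."))


def needs_update(version_text):
    """Return True for a 10.9.x version older than 10.9.2."""
    version = parse_version(version_text)
    # Pad short versions so '10.9' compares as (10, 9, 0).
    version = version + (0,) * (3 - len(version))
    return version[:2] == (10, 9) and version < FIXED
```

On a Mac, Python’s `platform.mac_ver()[0]` returns the version string you can pass to `needs_update`.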

This update was just released, and should be available via the App Store/Software Update. It can also be found at Apple’s support downloads page here: http://support.apple.com/downloads/

Thank you.

Written by Craig Stacey

February 25, 2014 at 12:55 pm

Posted in Uncategorized

Resource Calendar Migration Complete

The instructions for reserving equipment at https://wiki.mcs.anl.gov/IT/index.php/Mail#Reserving_Equipment have been updated to reflect the new addresses and information. Thanks!

Written by Craig Stacey

February 19, 2014 at 6:16 pm

Posted in Uncategorized

Resource Calendars being migrated (Delayed until Feb 19)

This work is happening at noon today. Thanks.

Written by Craig Stacey

February 19, 2014 at 11:45 am

Posted in Uncategorized

