Dispatches From The Geeks

News and Announcements from the MCS Systems Group

Upcoming IT Maintenance Weekend

CIS is planning their next maintenance weekend from Friday, May 16 through Sunday, May 18. The actual scope of the work to be performed is not yet set, but please let us know if there are any significant impacts any sort of outage would have during these dates so we can plan accordingly.

Thanks!

Written by Craig Stacey

April 15, 2014 at 1:30 pm

Posted in Uncategorized

Be diligent in protecting your account, data, and equipment.

The lab is going through a cyber-security audit this month, which will have a few phases. They’re going to be trying any number of things to test our defenses and our ability to recognize and deal with cyber threats. Social engineering/phishing is almost certain to be part of their toolkit, and at some point they’ll even have physical access to the building.

I’ll send another reminder as we get closer, but I wanted to make sure everyone was in the right mindset. The recent phishing attack simulation demonstrated there were some holes and that people can be fooled. So, let’s be on our toes. Lock down or lock up any equipment that’s going to be unattended. Don’t fall for password reset mails. Or phone calls.

Remember: nobody ever needs to know your password. As a system administrator, if I need to know your password, I can change it to something I know and tell you what it is. You should never need to reveal your password to anyone.

Thanks!

Written by Craig Stacey

April 3, 2014 at 11:57 am


Reminder: Windows XP (and Mac OS 10.6 Snow Leopard) End of Life.

Below is the notice I sent out last month on this. This is just a reminder that it’s happening today. I neglected to mention in the prior notice that older Mac OS installations would also fall under the End of Life rule, and thus anything running Mac OS 10.6 or earlier will not be able to access the web from the lab’s wireless or VPN infrastructure. Sorry about that, it was an oversight on my part. If you have a machine that’s affected by this, let us know and we’ll work to get you upgraded or find a solution. Thanks!

Prior announcement:

Hi, folks. For those who don’t know, at the end of the month, Windows XP is finally going to be deemed “End of Life” by Microsoft. This means no more security updates, and it’s officially an unsupported OS. As part of security mitigations for this, the lab will be clamping down on Windows XP being exposed to the outside world.

The first step in this is that come April, any web traffic that’s detected to be coming from Windows XP will be blocked by the lab’s web proxy. This will affect anyone on the VPN or using the lab’s wireless infrastructure, all of which is behind the web proxy.

We also need to document any Windows XP systems that need to continue running. We have no systems that meet these criteria in our officially supported infrastructure, but if you have any that you’re using and rely on for your research, you should let us know if you haven’t already been approached about it. Any systems that need to remain running Windows XP need to be protected by a firewall, and we need to make sure that is indeed the case.

Please report any Windows XP systems you need running to systems@mcs.anl.gov. Likewise, you can send any questions to that address, or to me. Thanks!
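For the curious, here’s a rough sketch of how a block like this could decide what counts as an end-of-life system. This is not the lab proxy’s actual rule set. It assumes detection is based on the browser’s User-Agent header, where Windows XP normally identifies itself as “Windows NT 5.1” and Mac OS X reports its version as “Mac OS X 10_6_8” or “Mac OS X 10.6”. The function name and regular expressions are made up for illustration, and 64-bit XP (which reports “Windows NT 5.2”, the same as Server 2003) is deliberately left out.

```python
import re

# Assumption: end-of-life systems are recognized from the User-Agent header.
# Windows XP (32-bit) reports itself as "Windows NT 5.1".
_WINDOWS_XP = re.compile(r"Windows NT 5\.1\b")
# Mac OS X reports "Mac OS X 10_6_8" (Safari/Chrome) or "Mac OS X 10.6" (Firefox).
_MAC_OS_X = re.compile(r"Mac OS X (\d+)[._](\d+)")


def is_end_of_life(user_agent: str) -> bool:
    """Return True if the User-Agent looks like Windows XP or Mac OS X 10.6 or earlier."""
    if _WINDOWS_XP.search(user_agent):
        return True
    match = _MAC_OS_X.search(user_agent)
    if match:
        version = (int(match.group(1)), int(match.group(2)))
        return version <= (10, 6)
    return False


if __name__ == "__main__":
    samples = [
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.59.10",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0",
    ]
    for ua in samples:
        print(is_end_of_life(ua), ua)
```

User-Agent strings are easy to change, so a check like this only catches machines that identify themselves honestly. That’s part of why the firewall requirement above still matters.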

Written by Craig Stacey

April 2, 2014 at 10:52 am


Mailman list cleanup

Just a heads up that over the next couple of weeks, you may get some mail from me regarding mailman list cleanup. The messages are legit, there should be no embedded links, and they will merely inform you that unless you take action, the list mentioned in the subject will be retired.

As a first pass, I’m sending them to the list owners of any lists that haven’t seen any action since Dec 31 2011. More details will be in the message. If you manage multiple lists, you will get multiple messages. Sorry about that, but it’s the most efficient way to make sure everyone is notified.
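If you’re wondering why multiple lists means multiple messages, the selection boils down to something like the sketch below. It is only an illustration, not the script actually being run. The list names, owner addresses, and the two dictionaries are hypothetical stand-ins for whatever Mailman reports, and “no action since Dec 31 2011” is interpreted as a last-activity date on or before that day.

```python
from datetime import date

# Cutoff from the announcement: no list activity since Dec 31 2011.
CUTOFF = date(2011, 12, 31)


def stale_lists(last_activity):
    """Return the names of lists whose last activity is on or before the cutoff."""
    return sorted(name for name, last in last_activity.items() if last <= CUTOFF)


def notices(last_activity, owners):
    """Return one (owner, list) pair per stale list per owner.

    An owner of several stale lists appears once for each of them, which is
    why someone who manages multiple lists gets multiple messages.
    """
    return [
        (owner, name)
        for name in stale_lists(last_activity)
        for owner in owners.get(name, [])
    ]


if __name__ == "__main__":
    # Hypothetical data for illustration only.
    last_activity = {
        "old-project": date(2010, 5, 1),
        "active-users": date(2014, 3, 1),
    }
    owners = {
        "old-project": ["owner@example.org"],
        "active-users": ["owner@example.org"],
    }
    for owner, name in notices(last_activity, owners):
        print(f"To: {owner}  Subject: list '{name}' will be retired unless you take action")
```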

Just want to make sure we don’t set off any phishing alarms. 🙂

Thanks!

Written by Craig Stacey

March 13, 2014 at 4:34 pm


Phishing attack on the lab

A number of people are getting mails from security@anl-cis.org telling them they have weak passwords and sending them to a site to reset them. It looks very legit, but it is most definitely not. It’s a phishing attack.

If you visited the site in question and entered any information, let us (and cyber@anl.gov) know immediately.

Written by Craig Stacey

March 11, 2014 at 2:09 pm


Network issues fixed

The cause of our issues was an external Denial of Service attack based around NTP (Network Time Protocol). It was targeted at a server here that used to provide that service to the outside world. We haven’t run an externally facing NTP server in years, certainly not since we moved from 221. Blocking all access to that host resolved the issue, but created new ones, since we have hosts outside our firewall that rely on being able to talk to that host.

Eventually, the right combination of blocking and access was put in place, stopping the DoS without overtaxing the CPUs on the routers to the point where they were dropping packets.

Some services may need a kick, and we’re hitting them as our monitoring tells us, but if you see something out of the ordinary, please let us know at systems@mcs.anl.gov.

Thanks!

Written by Craig Stacey

March 10, 2014 at 2:55 pm


We’re troubleshooting a networking issue that’s affecting connections between anything inside the MCS network and anything outside it. More details as they emerge.

Written by Craig Stacey

March 10, 2014 at 1:07 pm


Summary of the service outage from March 4th and 5th

First of all, the actual cause of the problem was a bad network link.  It was difficult to find, and we were just about to begin migrating the services out of the 221 data center into 240 when it was finally discovered.

Here’s the link to Networking’s summary of what happened: https://slack-files.com/T025KBW9U-F025WFY2W-0c97c9

We discovered things were starting to throw alerts and misbehave around 3:25 PM on Tuesday. There were a few services that were being slow or unresponsive, and we eventually narrowed down the culprit to the database server for infrastructure services. This db server is set up in a very resilient and highly available environment, and it lives in the 221 data center (which has generator-backed UPS, unlike 240). We started debugging this, and it became evident that the system became unresponsive whenever network hosts tried to contact it. (It became so unresponsive, in fact, that it wouldn’t even respond locally as long as it was accepting network connections.)

The redundancy and high availability weren’t helping us, since the entire environment over in the data center seemed to be affected. We couldn’t migrate the service to one of the other hypervisors, since they were all behaving the same way.

We called in the networking team after a couple of hours when it started to really smell like a network problem, and we all started working on it together.  Unfortunately, as outlined in the PDF above, the actual problem was not presenting itself in any useful manner.  By all indications, the network was working, and was not signaling errors indicating any problem.

After debugging into the night and rebooting various servers and hypervisors, we called it a night with the plan to start attacking it early in the morning, including moving the databases from 221 to 240. Just as we were wrapping up the data dump, Networking discovered the bad optics, and things sprang to life. Many services came back right away; for others, we had to deal with the fallout from our troubleshooting steps and reboots amidst bad networking. Most things came back okay, though we’ve got one redundant RADIUS server that needs rebuilding.

Affected hosts were: most websites we host, many login machines (which mount NFS from said websites), Trouble tickets (rt.mcs.anl.gov), account management, rdp.mcs.anl.gov (Windows Terminal Server), and any other service that relies on coredb.mcs.anl.gov.

Thank you all for your patience during the outage, sorry it took so long to diagnose the actual problem.  If you have any questions, by all means let me know.

Written by Craig Stacey

March 5, 2014 at 5:55 pm


And we’re back!

More details will come later today once we’ve hammered out what happened, but everything is back (or should be back shortly). Thanks for your patience, sorry for the troubles, and stand by for details.

Love,
Your humble team of tech monkeys

Written by Craig Stacey

March 5, 2014 at 10:07 am


Status update

Still struggling to get things back up. We’re moving the database services over to a different datacenter in the hopes they will be able to start providing service there. There’s a lot of data to move. I’ll provide updates directly to Twitter from this point on, and reserve e-mail/blog posts for more detailed info. If you’re not already following, our Twitter is @mcssys (https://twitter.com/mcssys).

Written by Craig Stacey

March 5, 2014 at 8:49 am
