Dispatches From The Geeks

News and Announcements from the MCS Systems Group

Upcoming Wireless Network Changes

Please see the following notice from CIS.

The short version of the story is that next week, you will notice a change in the wireless network names used onsite. Specifically, instead of the many options you currently see, there will primarily be only two: Argonne-auth and Argonne-guest. As with the current setup, connections to the “auth” network require authenticating with your Argonne credentials and gain you access to a trusted VPN network. Connections to the “guest” network require no authentication, but do require filling out a web registration form, which you will see on your first browser connection.

CELS Systems staff will be visiting the known Apple TVs in the conference rooms in 3178, 4172, and 4313 to make sure they’re on the new network, as well as updating the building kiosks.

WHAT ARE WE DOING?

On Sunday, November 9, 2014, CIS will be reducing the number of published Service Set Identifiers (SSIDs) on both the “Authenticated” and “Guest” wireless networks. We are doing this to make it easier for Argonne employees and visitors to know which WiFi SSID to use.

Certain special-purpose WiFi SSIDs at Argonne-Lodging, the DC office, and certain areas of the Advanced Photon Source will not be affected or changed. However, the “Authenticated” network SSID will change everywhere.

WHEN WILL THIS OCCUR?

Sunday, November 9, 2014, from 7:00 a.m. to 1:30 p.m.

WHAT DO YOU NEED TO DO?

After Sunday, November 9, 2014, employees and visitors will need to remove, delete, or ignore (“Forget This Network”) the following SSID names from the Wi-Fi Network Preferences or Managed Wireless Networks Profile (depending on operating system):

• ArgonneG-auth
• ArgonneA-Auth
• ArgonneG-guest
• ArgonneA-guest
• ANLTCSG-guest
• ANLTCSA-guest
• In the Argonne RAP office: RAP-OfficeG-guest

They will then need to connect to and set up the new SSIDs: Argonne-auth (Argonne Employee) and/or Argonne-guest (Argonne Visitor). These new SSIDs automatically support both the 5GHz and 2.4GHz frequencies (also known as 802.11a/n and 802.11b/g/n), allowing devices to choose the best and least congested frequency on the chosen SSID.
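
If you manage a number of Macs and want to script the cleanup of the retired SSIDs, a minimal sketch along the following lines may help. This is purely illustrative and not a CIS-supported tool; it assumes macOS’s networksetup utility and a Wi-Fi interface named en0 (check yours with networksetup -listallhardwareports).

    # Illustrative only: remove the retired Argonne SSIDs from a Mac's
    # preferred-networks list so it stops auto-joining them.
    import subprocess

    OLD_SSIDS = [
        "ArgonneG-auth", "ArgonneA-Auth", "ArgonneG-guest", "ArgonneA-guest",
        "ANLTCSG-guest", "ANLTCSA-guest", "RAP-OfficeG-guest",
    ]
    WIFI_INTERFACE = "en0"  # assumption; varies by machine

    for ssid in OLD_SSIDS:
        # networksetup exits non-zero if the SSID isn't in the list; that's fine.
        subprocess.run(
            ["networksetup", "-removepreferredwirelessnetwork", WIFI_INTERFACE, ssid],
            check=False,
        )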

If you have Apple TVs or other digital displays that depend on a wireless connection, those devices will also need to be reconnected to the appropriate network. CIS will be re-establishing wireless connectivity on these devices starting Monday, November 10th. If an employee needs assistance with a critical meeting that may use some of these devices (Mondo Pad, Visix digital displays or Apple TVs), please contact the Service Desk at ext. 2-9999, option 2.

WHAT IS THE EFFECT ON YOU?

Choosing a WiFi network will be easier for Argonne employees and visitors while on site.

FOR MORE INFORMATION

Report issues with services after the maintenance is complete to the CIS Service Desk at ext. 2-9999, option 2.

Written by Craig Stacey

November 3, 2014 at 2:10 pm

Summary of repo.anl-external.org downtime

Executive summary

A VM hypervisor showed signs of going bad on Friday night, and mitigating steps were taken.  The server failed again on Sunday night, too late for anyone to work on it until Monday.  The server was restored at 11:30 AM.  The hypervisor is now stable; however, critical services hosted on it will be moved to more resilient hardware to reduce the likelihood of future downtime.

What happened

An MCS virtual server hypervisor (hereafter referred to as vserver8) had a system disk go into a bad state, taking down vserver8 and all virtual machines hosted on it.  Affected VMs were:

  • login1.mcs.anl.gov
  • login2.mcs.anl.gov
  • buildbot.mcs.anl.gov
  • pwca.alcf.anl.gov
  • horde.alcf.anl.gov
  • repo.anl-external.org

The short-term fix

We noticed instability with the server on Friday night, as the hypervisor had gone offline and rebooted a couple of times.  A reboot seemed to clear the disk issue, though there did appear to be corruption in some previously retired VMs.  As a first step, we made sure we had a current backup of the data stored on repo.  We moved the IP addresses of login1 and login2 over to login3 and shut those VMs down.  buildbot, pwca, and horde remained down to minimize load on the hypervisor, in the hopes of increasing the likelihood of it staying up through the weekend.  Once we had the data on repo.anl-external.org confidently duplicated, we brought it back up and kept an eye on the service, with the goal of migrating it to a new hypervisor on Monday.
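
For the curious, “confidently duplicated” meant verifying the copy rather than trusting it.  A rough sketch of that style of check, in Python, might look like the following; the paths are placeholders, and the actual procedure was not this script.

    # Sketch: confirm a backup tree matches the original by comparing
    # per-file SHA-256 digests.  Paths below are hypothetical.
    import hashlib
    from pathlib import Path

    def tree_digests(root):
        """Map each file's path (relative to root) to its SHA-256 hex digest."""
        return {
            str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(root.rglob("*")) if p.is_file()
        }

    original = tree_digests(Path("/repo/svn"))       # hypothetical source
    backup = tree_digests(Path("/backup/repo/svn"))  # hypothetical copy

    if original == backup:
        print("backup matches original")
    else:
        diff = set(original.items()) ^ set(backup.items())
        print(f"{len(diff)} entries differ or are missing")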

The second failure

On Sunday night, the disk on vserver8 failed in a different manner than before.  Unfortunately, nobody was available to handle the situation, so it had to wait until morning.  First thing Monday morning, we attempted to bring the machine back.  Due to the configuration of that machine, we were unable to recover from a bad system disk using our usual methods.  Ultimately, we had to burn a bootable Linux LiveCD to boot the machine and initiate the data transfer to a new disk.

The initial progress on that transfer suggested it would take about an hour to copy the data to the new disk, at which point the machine could be brought back as it was before the crash.  Had it looked like it would take longer, our fallback was to migrate the data from the repocafe backup made on Friday to a new disk pool and set up a new VM on that.  The data duplication looked to be the fastest path to recovery, so we continued on that route; it also preserved weekend changes that Friday’s backup would not have included.

The copy slowed down some and ultimately finished just before 11:30 AM, about 1.5 hours after the initial estimate.  Once the copy finished, the server was able to reboot and operate as normal with no loss of data.  It currently appears healthy.

Next steps

login1 and login2 are still directed at login3.  Later today, we will send a notice to users who may be affected when we move the IP addresses back to their original hosts.  If you logged into login1 or login2 since Friday evening and are still logged in, you’ll be among the affected users.

repo.anl-external.org is currently up and stable, and we will begin the process of moving that to a more resilient VM infrastructure.  When we first deployed it, it was intended to be a “best effort” self-service SVN repo to ease collaborations with external users.  Because we never made that explicitly clear, and because it is *so* much easier to self-serve these sorts of things, many users gravitated towards using it as their primary SVN repo over the more “production level” svn.mcs.anl.gov.

Bearing this in mind, we’re reclassifying repo.anl-external.org as a critical service.  We’re going to move the VM to hardware that’s better designed to weather these sorts of failures and that can move trivially from hypervisor to hypervisor as needed, as we currently do with other critical servers.  There will be an announced outage on this service as we migrate the last of the repository data to the new server.  We’ll do the bulk of this work in the background, so the outage itself should only be as long as needed to copy last-minute data changes and move the VM.  There will be further updates as we progress, and we’ll coordinate to ensure the migration does not happen at a critical time.

Written by Craig Stacey

November 3, 2014 at 1:50 pm

repo.anl-external.org is now back online. Will send detailed explanation of downtime later for those interested.

Written by Craig Stacey

November 3, 2014 at 11:48 am

Progress slowed slightly. At the current transfer rate, we’re expecting recovery around 11:30.

Written by Craig Stacey

November 3, 2014 at 11:08 am

Progress is being made on repo.anl-external.org. Current estimate is 11AM.

Written by Craig Stacey

November 3, 2014 at 10:29 am

repo.anl-external.org is down

We are working to bring it back up. We’re hopeful it will be back online within the hour; I will send updates if that changes.

Written by Craig Stacey

November 3, 2014 at 9:05 am

Things are mostly back

CIS successfully revived our router, and most MCS hosts and services are back online. We’re still looking over the monitoring to separate actual issues from transient errors that will self-correct (some servers may not be properly syncing their time via NTP, for example), and we will address them as we find them.
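
As an aside, if you want to sanity-check a host’s clock yourself, a quick sketch like the one below compares the local clock against an NTP server. It uses the third-party ntplib Python package, and pool.ntp.org is just a stand-in for whatever source your host actually syncs against.

    # Quick clock-drift check using the third-party ntplib package
    # (pip install ntplib).  pool.ntp.org is a placeholder NTP source.
    import ntplib

    client = ntplib.NTPClient()
    response = client.request("pool.ntp.org", version=3)

    # response.offset is the estimated local-clock error in seconds;
    # more than a second or two of drift suggests NTP isn't syncing.
    print(f"local clock offset: {response.offset:+.3f} seconds")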

If your local machine is misbehaving at this point, your first course of action should be to reboot. If that does not fix the issue, you can contact the help desk (systems@mcs.anl.gov) and we’ll get you taken care of.

I’ll post a more in-depth summary of what happened, but a blow-by-blow overview is available at http://www.mcs.anl.gov/systems/blog (aka https://mcssys.wordpress.com). Thanks for your patience, and we’ll be sure to let you know what we find on how to prevent this sort of outage in the future, or at least minimize the length of the downtime when there’s a catastrophic hardware failure like this one.

Written by Craig Stacey

October 30, 2014 at 2:16 pm

MCS Network Outage to continue

The unexpected power issue that caused this outage has also damaged a critical component of the MCS router. We’re currently working with CIS to get it back online, either with a replacement supervisor card or an ad-hoc temporary solution until a replacement can be installed. We’re currently estimating another couple of hours of downtime.

This outage is affecting all wired hosts on MCS networks, including most servers with a .mcs.anl.gov address.

Hosts on wifi can talk to other Argonne and external hosts. If your machine is wifi-capable, that is the recommended workaround for now.

Written by Craig Stacey

October 30, 2014 at 12:05 pm

Things are slowly coming back. If you can use wifi instead of a wired connection, you should be able to connect to the network.

Written by Craig Stacey

October 30, 2014 at 11:24 am

A UPS failure has taken out various machines and networks.  The UPS is back online, but we’re still waiting on other units to recover.  More info when we have it.

Written by Craig Stacey

October 30, 2014 at 10:57 am
