Dispatches From The Geeks

News and Announcements from the MCS Systems Group

Power work generally complete

Consider this a "mostly all clear" message. The work was completed, and we’ve got most systems up and running. There are a handful of systems giving us trouble at the moment (including some that didn’t go down during the outage, but started acting up when things came back), and we’re working on them.

Those of you who sit in 240 may need to reboot your Linux desktop in the morning. If things aren’t working as expected, that’s your first step.

Keep an eye on @mcssys on Twitter (http://twitter.com/mcssys) for specific updates as we bring back the straggler services. Also, stand by for announcements relating to specific larger systems – they are not back yet. This announcement only pertains to the general computing environment in CELS/MCS.

Please report any issues to help@cels.anl.gov. Thanks.

Written by Craig Stacey

June 1, 2015 at 5:02 pm

Posted in Uncategorized

Power Outage work has begun

I will send an all-clear when things are back to normal. Updates will be posted to our blog (https://mcssys.wordpress.com) and twitter (@mcssys). Thanks!

Written by Craig Stacey

June 1, 2015 at 6:18 am

Posted in Uncategorized

Reminder: TCS Power Outage, June 1, 2015

(See http://wp.me/p3jwfN-7G for the first announcement on this.)

The important information is in the prior announcement, but I thought it might be helpful to enumerate the machines and services that are provided by CELS Systems that will remain available during the power outage. And, remember, almost all CIS-provided services (including e-mail) will be up.

* kerdap-2.mcs.anl.gov (MCS account authentication server)
* breeze-2.mcs.anl.gov (ALCF Userbase/Accounts)
* embassy.mcs.anl.gov (CELS Userbase)
* coredb-3.mcs.anl.gov (Internal MySQL)
* wind.mcs.anl.gov (web – wordpress, mediawiki)
* caveat.mcs.anl.gov (MPICH svn/trac)
* yubi-221.mcs.anl.gov (CELS One Time Password service)
* git.mcs.anl.gov
* owney.mcs.anl.gov (mailman mailing lists)
* newman.mcs.anl.gov (mail relay)
* kerdap.jlse.anl.gov (JLSE auth server)
* cyclone (xcollab, xjira)
* nginx.mcs.anl.gov (web proxy)
* typhoon.mcs.anl.gov (JLSE wiki)
* rdp.mcs.anl.gov (Windows terminal server)
* wilbur.mcs.anl.gov (MCS webpages not served by CIS Drupal)
* beehive0.mcs.anl.gov (waggle)
* app001.cels.anl.gov (collab.cels.anl.gov)
* davmail.mcs.anl.gov (CalDAV gateway to Exchange)
* squall.mcs.anl.gov (ALCF websites)
* variant.mcs.anl.gov (svn/trac)
* hub-221.mcs.anl.gov (CELS Radius server)
* rt.mcs.anl.gov (Ticketing system)
* gust.mcs.anl.gov (some wordpress sites)

If you don’t see it in the above list, it’s living here in 240 and will lose power.
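As a convenience, the rule above can be sketched as a small lookup. This is only an illustration: the hostnames are copied from the list in this post, while the function name `stays_up` and the normalization it applies are assumptions made for the example.

```python
# Hosts copied verbatim from the list above. Per the post, anything not
# listed here lives in building 240 and will lose power.
STAYING_UP = {
    "kerdap-2.mcs.anl.gov", "breeze-2.mcs.anl.gov", "embassy.mcs.anl.gov",
    "coredb-3.mcs.anl.gov", "wind.mcs.anl.gov", "caveat.mcs.anl.gov",
    "yubi-221.mcs.anl.gov", "git.mcs.anl.gov", "owney.mcs.anl.gov",
    "newman.mcs.anl.gov", "kerdap.jlse.anl.gov", "cyclone",
    "nginx.mcs.anl.gov", "typhoon.mcs.anl.gov", "rdp.mcs.anl.gov",
    "wilbur.mcs.anl.gov", "beehive0.mcs.anl.gov", "app001.cels.anl.gov",
    "davmail.mcs.anl.gov", "squall.mcs.anl.gov", "variant.mcs.anl.gov",
    "hub-221.mcs.anl.gov", "rt.mcs.anl.gov", "gust.mcs.anl.gov",
}


def stays_up(hostname: str) -> bool:
    """Return True if the host is on the list of services that remain available.

    Hypothetical helper: it only compares names against the list above and
    does not contact any machine.
    """
    return hostname.strip().lower() in STAYING_UP


if __name__ == "__main__":
    for host in ("git.mcs.anl.gov", "login.mcs.anl.gov"):
        print(host, "stays up" if stays_up(host) else "will lose power")
```

The lookup matches on the names exactly as listed, so a service alias (for example collab.cels.anl.gov, which is served by app001.cels.anl.gov) has to be checked by its host name.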

We’ll send an "all-clear" notice to this list when things are back in operation generally. I’ll also be on-site and will post updates to our twitter feed (@mcssys).

Thanks!


Previous announcement:

As building 240 tenants know, there is major power work scheduled for June 1st that will take out power to the entire building. This, unfortunately, includes the data center in building 240, home to a number of computers. The administrators of specific systems (LCRC, Beagle, Magellan, etc.) will be notifying their users of what this means for them, but this announcement covers the general MCS and CELS computing infrastructure and what will be affected.

The short answer is that it’s easier to say what won’t be affected. Mail services we provide (forwarding for mcs.anl.gov, alcf.anl.gov, cels.anl.gov, ci.uchicago.edu, etc.) and mailing lists will be unaffected. Most web sites we host (WordPress, Confluence, etc.) will remain up. We’ll notify site owners of any exceptions to this. CIS-provided services (e-mail, web, business systems) are generally unaffected. Externally hosted services (Box, Dayforce, TAMS) are unaffected.

Now for the info you really need — what will be down. All MCS/CELS file and compute servers will be down. This includes SSH logins (login.mcs.anl.gov), unix and Mac home file servers, linux compute servers, all desktops, and all networking in building 240.

The outage window for the power work is slated to be 7AM to 7PM on Monday, June 1. As such, we will begin taking systems down prior to that so they shut down cleanly. You can expect MCS/CELS compute systems to be down beginning by 6AM on that day. Once the power comes back on, most of our systems should come back fairly quickly, and we hope to be back to normal well within that outage window, though we are beholden to the power work being completed before we can start bringing things back.

Because your network files will not be available, we encourage you to make sure you have files and data you need locally for that day. Getting accustomed to working in Box can help with that. We’ll be contacting administrative users in the lead-up to this outage to assist in getting you comfortable working with Box and getting your files moved there. You can find more information at http://inside.anl.gov/services/box.

We’ll send more announcements on this as we get closer to the date. You can also keep up to date via twitter (@mcssys) or WordPress (https://mcssys.wordpress.com).

Thanks!

Written by Craig Stacey

May 27, 2015 at 5:58 pm

Posted in Uncategorized

repo.anl-external.org downtime post-mortem

What happened on Monday:

During the overnight and early morning hours of Monday, May 11, the system disk for the server "repo.anl-external.org" detected major corruption and marked itself as read-only. This was not a physical disk in a physical machine, but a virtual disk being used by the virtual machine running the repocafe service. The corruption was not hardware-related, but appeared to be filesystem-level corruption.

Efforts to fix the corruption on Monday morning were not successful, and it was evident we would need to build a new VM to host the service. Luckily, the virtual disk containing the actual data from the repositories was unaffected. We took advantage of the downtime to move the VM to a different virtual machine host running a more modern build that would provide higher reliability for the short term (with a longer term fix in mind, detailed below).

A new VM was built, repocafe software installed, and the configuration restored from backups. We learned that despite the backups of the configuration being performed as expected, the database the service uses to track user accounts was not being backed up properly. (We also learned it was hosted locally on the VM rather than on our DB server.)
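A gap like this is the kind of thing a periodic freshness check on backup files can surface. The following is a minimal sketch for illustration only, not a description of the repocafe setup: the function name `stale_backups`, the 24-hour threshold, and the idea that the database is dumped to files on disk are all assumptions.

```python
import os
import time
from typing import Iterable, List, Optional


def stale_backups(
    paths: Iterable[str],
    max_age_hours: float = 24.0,
    now: Optional[float] = None,
) -> List[str]:
    """Return the backup paths that are missing, empty, or older than max_age_hours.

    Hypothetical check: it inspects file metadata only and does not verify
    that a dump can actually be restored.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    problems = []
    for path in paths:
        try:
            info = os.stat(path)
        except FileNotFoundError:
            # The backup job never produced this file.
            problems.append(path)
            continue
        if info.st_size == 0 or info.st_mtime < cutoff:
            # An empty or out-of-date dump is treated as a failed backup.
            problems.append(path)
    return problems
```

A check like this only shows that a recent, non-empty file exists. Confirming that a backup is usable still requires a periodic test restore.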

We were eventually able to restore the database from tape and get it functional again in relatively short order. After internal testing showed it to be functioning as expected, we announced the service was back. A user reported commit e-mails were not working, which we quickly rectified by installing a missing perl module. Diagnostics indicated this was the only missing module, and svn check-ins were working properly once this was fixed.

What’s going to happen longer term:

You may recall we had a different failure involving this service in November, and at that time announced we would move it to a more resilient architecture. We were (and are) still working on that, though design and implementation decisions made when the service was initially stood up made the move problematic. We had rectified the issue that caused the November failure, but this most recent failure indicates that we really need to design this system better.

As such, we’re going to undo those hampering design decisions and fully roll the system into our top tier infrastructure. We’ll be working with repo owners as we get closer to doing this, but it should be complete by mid-summer. This move may involve a name change out of the anl-external.org namespace; however, the self-service nature of repository creation and management will be maintained. As we finalize the details of the move, we’ll have more information for you on how it will look.

For now, continue to use the service as usual, and report any oddities to help@cels.anl.gov.

Thanks!

Written by Craig Stacey

May 13, 2015 at 9:34 am

Posted in Uncategorized

CIS Maintenance Weekend May 15-17, 2015

Please see the below announcement from CIS for maintenance work happening this coming weekend. Note the sporadic outages they expect, including e-mail, on Saturday morning and afternoon.

WHAT ARE WE DOING?

Argonne’s quarterly IT maintenance weekend is scheduled for Friday, May 15th, thru Sunday, May 17th.

We are replacing key elements of our network infrastructure on Saturday, May 16th from 7am-3pm as part of the Network Lifecycle project.

During this time connections to core IT services including the intranet and internet will be sporadic until work is complete.

Specifically:

· Email Services including personal, calendaring and lists

· Cloud services such as Box and Blue Jeans (due to use of on-site authentication services)

· http://www.anl.gov and inside.anl.gov

· Dash

· VPN

· Wireless

· VOIP Phones

· Pager

· Public Address Services

In addition:

· Voice mail will be unavailable from 5 to 7 p.m., Friday, May 15th. During this time, voice mail messages will not be received nor will they be retrievable.

· All business systems and web applications will be unavailable throughout the duration of the maintenance weekend.

Note:

Not Affected/Minimally Affected:

· Networking that is not located behind the Tier 1 Firewall will NOT be affected.

· Metasys and SCADA systems may experience a short 3-5 minute outage window.

Networking services will be restored on Saturday afternoon and CIS will perform verification of IT services thru Sunday evening May 17th to ensure all services are functioning for business hours on Monday, May 18th.

WHY ARE WE DOING IT?

The replacement of the core switches will allow CIS to build better redundancy into the core network to provide better throughput, reduce risks of spanning tree issues and to reduce the network maintenance impact to the site in the future.

WHEN WILL THIS OCCUR?

May 15th, 2015, 5:00 p.m. thru May 17th, 2015, ~5:00 p.m.

We expect the network maintenance to be complete Saturday afternoon and will then be followed by a verification process after which an “all clear” message will be sent.

We expect application verification to be complete by Sunday at 5p.m. and an “all clear” message will then be sent.

WHAT IS THE EFFECT ON YOU?

Unless there is an unforeseen issue with the maintenance activities, you should not be affected outside of the maintenance window.

FOR MORE INFORMATION

Networking Issues:

If you do experience network issues after the “all clear” message on Saturday please send an email message to help outlining the specific issue along with contact information in case we need to follow up for more details. We will be managing the maintenance weekend within service now and monitoring the incoming message queue.

In the case of unforeseen issues regarding sending email to help you can also use the alternate address argonne

Core Services:

Report issues with core services after the maintenance is complete (Monday morning- May 18th) to the Argonne Service Desk at ext. 2-9999 option 2.

Respectfully Submitted,

Argonne Service Desk

Written by Craig Stacey

May 12, 2015 at 4:18 pm

Posted in Uncategorized

repo.anl-external.org is back

Please report any oddities you find, but we believe everything is back with no evidence of any data loss. I’ll post a full post mortem tomorrow.

Written by Craig Stacey

May 11, 2015 at 6:50 pm

Posted in Uncategorized

repo.anl-external.org still down, progress

We’re almost back, but have hit a snag with the database. Hopefully, the next update I send will be a restore announcement. Stand by.

Written by Craig Stacey

May 11, 2015 at 4:04 pm

Posted in Uncategorized

repo.anl-external.org down. Working on it, stand by for updates.

Written by Craig Stacey

May 11, 2015 at 10:08 am

Posted in Uncategorized

TCS Power Outage, June 1, 2015

As building 240 tenants know, there is major power work scheduled for June 1st that will take out power to the entire building. This, unfortunately, includes the data center in building 240, home to a number of computers. The administrators of specific systems (LCRC, Beagle, Magellan, etc.) will be notifying their users of what this means for them, but this announcement covers the general MCS and CELS computing infrastructure and what will be affected.

The short answer is that it’s easier to say what won’t be affected. Mail services we provide (forwarding for mcs.anl.gov, alcf.anl.gov, cels.anl.gov, ci.uchicago.edu, etc.) and their mailing lists will be unaffected. Most web sites we host (WordPress, Confluence, etc.) will remain up. We’ll notify site owners of any exceptions to this. CIS-provided services (e-mail, web, business systems) are generally unaffected. Externally hosted services (Box, Dayforce, TAMS) are unaffected.

Now for the info you really need — what will be down. All MCS/CELS file and compute servers will be down. This includes SSH logins (login.mcs.anl.gov), unix and Mac home file servers, linux compute servers, all desktops, and all networking in building 240.

The outage window for the power work is slated to be 7AM to 7PM on Monday, June 1. As such, we will begin taking systems down prior to that so they shut down cleanly. You can expect MCS/CELS compute systems to be down beginning by 6AM on that day. Once the power comes back on, most of our systems should come back fairly quickly, and we hope to be back to normal well within that outage window, though we are beholden to the power work being completed before we can start bringing things back.

Because your network files will not be available, we encourage you to make sure you have files and data you need locally for that day. Getting accustomed to working in Box can help with that. We’ll be contacting administrative users in the lead-up to this outage to assist in getting you comfortable working with Box and getting your files moved there. You can find more information at http://inside.anl.gov/services/box.

We’ll send more announcements on this as we get closer to the date. You can also keep up to date via twitter (@mcssys) or WordPress (https://mcssys.wordpress.com).

Thanks!

Written by Craig Stacey

May 8, 2015 at 10:55 am

Posted in Uncategorized

Unexpected outage of computer servers

Due to power work being performed in the data center, the following computer servers were unexpectedly taken down and will remain down for the duration of the work, which is expected to last approximately 3.5 hours:

octopus.mcs.anl.gov
octagon.mcs.anl.gov
cg.mcs.anl.gov

There was also a brief outage of login1 and login2.mcs.anl.gov due to the work. We apologize for any inconvenience or disruption this may cause.

Written by Craig Stacey

May 6, 2015 at 9:53 am

Posted in Uncategorized
