What happened on Monday:
Overnight into early Monday morning, May 11, the filesystem on the system disk of the server "repo.anl-external.org" detected major corruption and remounted itself read-only. This was not a physical disk in a physical machine, but a virtual disk used by the virtual machine running the repocafe service. The corruption was not hardware-related; it appeared to be filesystem-level corruption.
Efforts to fix the corruption on Monday morning were not successful, and it was evident we would need to build a new VM to host the service. Luckily, the virtual disk containing the actual data from the repositories was unaffected. We took advantage of the downtime to move the VM to a different virtual machine host running a more modern build that would provide higher reliability for the short term (with a longer term fix in mind, detailed below).
A new VM was built, the repocafe software installed, and the configuration restored from backups. We learned that while the configuration backups were being performed as expected, the database the service uses to track user accounts was not being backed up properly. (We also learned it was hosted locally on the VM rather than on our DB server.)
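To keep this kind of gap from going unnoticed again, a simple freshness check alongside the backup job can complain loudly when no recent dump exists. This is only a sketch; the directory and filename pattern below are hypothetical, not our actual configuration:

```shell
#!/bin/sh
# Sketch: alert if no recent dump of the accounts database exists.
# BACKUP_DIR and the 'accounts-*.sql' pattern are hypothetical examples.
BACKUP_DIR=${BACKUP_DIR:-/var/backups/repocafe}

# Look for any dump modified within the last 24 hours (1440 minutes).
if find "$BACKUP_DIR" -name 'accounts-*.sql' -mmin -1440 2>/dev/null | grep -q .; then
    echo "backup ok"
else
    echo "WARNING: no accounts backup in the last 24 hours"
fi
```

Wired into cron with its output mailed to the admins, a check like this turns a silently failing backup into a daily nag.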
We were eventually able to restore the database from tape and get it functional again in relatively short order. After internal testing showed it to be functioning as expected, we announced the service was back. A user reported commit e-mails were not working, which we quickly rectified by installing a missing Perl module. Diagnostics indicated this was the only missing module, and svn check-ins were working properly once this was fixed.
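For anyone running their own hook scripts, a quick loop can confirm the interpreter can actually load every module a hook depends on before the hook fires in production. The module names below are stand-ins, not the specific module we were missing:

```shell
#!/bin/sh
# Verify Perl can load each module a hook script depends on.
# These module names are examples; list the ones your hooks actually use.
for mod in Getopt::Long MIME::Base64; do
    if perl -M"$mod" -e 1 2>/dev/null; then
        echo "$mod: ok"
    else
        echo "$mod: MISSING -- install it before re-enabling the hook"
    fi
done
```

Running this on a freshly rebuilt host would have caught the missing module before the first commit e-mail failed.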
What’s going to happen longer term:
You may recall we had a different failure involving this service in November, and at that time announced we would move it to a more resilient architecture. We were (and are) still working on that, though design and implementation decisions made when the service was initially stood up made the move problematic. We had rectified the issue that caused the November failure, but this most recent failure indicates that we really need to design this system better.
As such, we’re going to undo those hampering design decisions and fully roll the system into our top-tier infrastructure. We’ll be working with repo owners as we get closer to doing this, but it should be complete by mid-summer. This move may involve a name change out of the anl-external.org namespace; however, the self-service nature of repository creation and management will be maintained. As we finalize the details of the move, we’ll have more information for you on how it will look.
For now, continue to use the service as usual, and report any oddities to firstname.lastname@example.org.
Please see the below announcement from CIS for maintenance work happening this coming weekend. Note the sporadic outages they expect, including e-mail, on Saturday morning and afternoon.
WHAT ARE WE DOING?
Argonne’s quarterly IT maintenance weekend is scheduled for Friday, May 15th, thru Sunday, May 17th.
We are replacing key elements of our network infrastructure on Saturday, May 16th from 7am-3pm as part of the Network Lifecycle project.
During this time connections to core IT services including the intranet and internet will be sporadic until work is complete.
· Email Services including personal, calendaring and lists
· Cloud services such as Box and Blue Jeans (due to use of on-site authentication services)
· http://www.anl.gov and inside.anl.gov
· VOIP Phones
· Public Address Services
· Voice mail will be unavailable from 5 to 7 p.m., Friday, May 15th. During this time, voice mail messages will not be received nor will they be retrievable.
· All business systems and web applications will be unavailable throughout the duration of the maintenance weekend.
Not Affected/Minimally Affected:
· Networking that is not located behind the Tier 1 Firewall will NOT be affected.
· Metasys and SCADA systems may experience a short 3-5 minute outage window.
Networking services will be restored on Saturday afternoon and CIS will perform verification of IT services thru Sunday evening May 17th to ensure all services are functioning for business hours on Monday, May 18th.
WHY ARE WE DOING IT?
The replacement of the core switches will allow CIS to build better redundancy into the core network, provide better throughput, reduce the risk of spanning-tree issues, and reduce the impact of future network maintenance on the site.
WHEN WILL THIS OCCUR?
May 15th, 2015, 5:00 p.m. thru May 17th, 2015, ~5:00 p.m.
We expect the network maintenance to be complete Saturday afternoon, followed by a verification process, after which an “all clear” message will be sent.
We expect application verification to be complete by Sunday at 5 p.m., when a final “all clear” message will be sent.
WHAT IS THE EFFECT ON YOU?
Unless there is an unforeseen issue with the maintenance activities, you should not be affected outside of the maintenance window.
FOR MORE INFORMATION
If you do experience network issues after the “all clear” message on Saturday, please send an email message to help outlining the specific issue, along with contact information in case we need to follow up for more details. We will be managing the maintenance weekend within ServiceNow and monitoring the incoming message queue.
In the case of unforeseen issues sending email to help, you can also use the alternate address argonne
Report issues with core services after the maintenance is complete (Monday morning- May 18th) to the Argonne Service Desk at ext. 2-9999 option 2.
Argonne Service Desk
Please report any oddities you find, but we believe everything is back with no evidence of any data loss. I’ll post a full post-mortem tomorrow.
We’re almost back, but have hit a snag with the database. Hopefully, the next update I send will be a restore announcement. Stand by.
As building 240 tenants know, there is major power work scheduled for June 1st that will take out power to the entire building. This, unfortunately, includes the data center in building 240, home to a number of computers. The administrators of specific systems (LCRC, Beagle, Magellan, etc.) will be notifying their users of what this means for them, but this is a broader announcement of the general MCS and CELS computing infrastructure and what will be affected.
The short answer is that it’s easier to say what won’t be affected. Mail services we provide (forwarding for mcs.anl.gov, alcf.anl.gov, cels.anl.gov, ci.uchicago.edu, etc.) and their mailing lists will be unaffected. Most web sites we host (WordPress, Confluence, etc.) will remain up. We’ll notify site owners of any exceptions to this. CIS-provided services (e-mail, web, business systems) are generally unaffected. Externally hosted services (Box, Dayforce, TAMS) are unaffected.
Now for the info you really need — what will be down. All MCS/CELS file and compute servers will be down. This includes SSH logins (login.mcs.anl.gov), Unix and Mac home file servers, Linux compute servers, all desktops, and all networking in building 240.
The outage window for the power work is slated to be 7AM to 7PM on Monday, June 1. As such, we will begin taking systems down prior to that so they shut down cleanly. You can expect MCS/CELS compute systems to be down beginning by 6AM on that day. Once the power comes back on, most of our systems should come back fairly quickly, and we hope to be back to normal well within that outage window, though we are beholden to the power work being completed before we can start bringing things back.
Because your network files will not be available, we encourage you to make sure you have files and data you need locally for that day. Getting accustomed to working in Box can help with that. We’ll be contacting administrative users in the lead-up to this outage to assist in getting you comfortable working with Box and getting your files moved there. You can find more information at http://inside.anl.gov/services/box.
We’ll send more announcements on this as we get closer to the date. You can also keep up to date via twitter (@mcssys) or WordPress (https://mcssys.wordpress.com).
Due to power work being performed in the data center, the following computer servers were unexpectedly taken down and will remain down for the duration of the work, which is expected to last approximately 3.5 hours:
There was also a brief outage of login1 and login2.mcs.anl.gov due to the work. We apologize for any inconvenience or disruption this may cause.
We have fixed the problem with attachments and images, as well as the firewall issues. Access to https://collab.cels.anl.gov should work as expected. I’ve updated the banner on collab.mcs.anl.gov to point to the new site, and we will later replace the old site with a pointer to the new URL.
If you find any issues with your Confluence spaces, please let us know at email@example.com.
Thanks for your patience.
We discovered attachments and images did not migrate with the site. As such, we’re bringing collab.mcs.anl.gov back for now, and turning off collab.cels.anl.gov. For the time being, please consider collab.mcs.anl.gov read only. If we don’t have a fix in place within the next couple of hours, we’ll fully release collab.mcs.anl.gov back to read/write and take another stab at a full migration at a later date.
An announcement will be made to this list when this is finalized. For minor updates, if you’re not already you can keep an eye on @mcssys on Twitter (https://twitter.com/mcssys) for further developments.
Some issues remain on the upgrade from collab.mcs.anl.gov (Confluence v4) to collab.cels.anl.gov (Confluence v5). Primarily, the firewall exceptions to allow access from offsite are not yet in place. I’ve put in a request to have that done as soon as possible, and with luck it should be in place before too long.
Secondly, any custom images you’ve uploaded into your space may not have transferred properly. We still have the old database and server, so please take a look at your spaces in Confluence and let us know if there’s data, images, or info you need recovered.
We’ll send another announcement once the firewall issue is fixed, but in the interim it should be accessible from any internal network, including the VPN.