Please note that this will affect all machines in the datacenter. Primary websites will stay up, as will CRYPTOCard, the database servers, mailing lists, and RT, because those servers are hosted in another building. Compute servers, file servers, and clusters hosted in Building 240 will be down. Power will not be restored until 5 PM on Sunday, at which point we can begin turning things back on.
It’s time to conduct our User Feedback survey. (Well, past time, really, since it’s been over a year.) You can find the survey here: http://www.surveymonkey.com/s/KVTHY6Y. It’s also linked from our blog (systemsblog.mcs.anl.gov) and the IT Wiki (mcs.anl.gov/help). The survey is open until 10/31, and all answers are completely anonymous. I’ve simplified the questions somewhat and made the answers a little more free-form. There are six questions in total. Thanks.
(Earlier, blog only:) I’ve identified and replaced the failed server, and authentication is working again. Some lingering NFS issues are preventing people from logging into the login servers, and I’m tracking those down. I’ll post another update once that’s taken care of.
Update: NFS has been restored. Some machines may need a reboot. login2.mcs.anl.gov was rebooted while debugging this. login1.mcs.anl.gov went down due to a combination of the service outage and a user job on the login node that ran the machine out of memory. login3 and login4 remained up.
Zimbra issues seem to have cleared up as soon as the authentication service was restored. This outage affected:
CELS Zimbra users
MCS LDAP/Kerberos authentications (including MCS Unix workstations)
MCS license servers

We will investigate why the backup authentication methods did not help, as well as why CI Zimbra users were affected. Thanks for your patience.
We’re tracking down a problem in MCS that’s affecting all logins for MCS users, as well as Zimbra authentication. Zimbra may have other issues; at this point it’s not clear whether they are caused by the outage on the MCS side. I have reports that CI users are unable to authenticate for mail as well. We’re working on this. See http://systemsblog.mcs.anl.gov for updates, and thanks for your patience.
Begin forwarded message:

> Subject: Emergency VPN Maintenance
>
> Impact: All VPN Users
> Begin: August 4th, 5:15 PM
> End: August 4th, 5:30 PM
>
> We are experiencing hardware failure on our VPN, it is currently up and running but in a very unstable condition. We have received the new hardware and will be swapping it out this evening at 5:15 in order to minimize impact to users. When this work begins all users will be disconnected from the VPN, and VPN will be unavailable until this work is completed. Another email will be sent when all work is completed.
>
> Please always submit new requests for CIS assistance to email@example.com or
> call the CIS Helpdesk @ 2-9999.
>
> For assistance regarding this communication and project, please contact:
> Brandon Siegel
>
> Thank you for your patience in this matter.
I’m getting more questions about upgrading to Mac OS 10.7 (OS X Lion). Here’s the situation, as I know it:

* The lab has a list of over 700 clients interested in upgrading to Lion (this includes everyone MCS Systems reported as an OS X user).
* The lab does not yet have a method to purchase Lion individually, let alone as a group purchase.
* The lab is trying to light a fire under Apple, but Apple has not yet addressed how it will handle Lion upgrades for the various DOE labs.

By the terms of Lion’s licensing and the App Store, if you buy Lion for a home machine, it can be installed on all the computers you use with that App Store account (Section 1.B.ii). So if you feel you can’t wait, you have that option. But even if you jump the gun and place a requisition for a copy, it won’t go anywhere until the lab and Apple figure this out. I’ll send an update when I have more information.

BTW: if you’re interested in Mac OS and don’t already subscribe, you should subscribe to http://lists.mcs.anl.gov/mailman/listinfo/mac-users; lots of good tips and discussions show up there.
Power has been restored and we are working to restore services.
A good portion of the ANL site has lost power, including TCS.
The mailing list and RT server has been brought back. It suffered a network failure, which has been rectified.