Well, that didn’t work. The server went down again. We’re going to move the VMs to another hypervisor for now. This will result in some degraded performance, but it should at least stay up.
(I’m sending this while the list server is down, but also posting it to http://systemsblog.mcs.anl.gov, which is hosted offsite.)
This weekend is a planned maintenance weekend for CIS, and some work being done will affect us. Specifically:
– Short network interruptions (less than 60 seconds per interruption) (Saturday)
— These short rolling outages will affect mail, calendar, web, etc. Any servers we have in the 221 data center.
– Wireless network interruptions (Saturday)
– All business systems (most of Sat.)
MCS is also performing maintenance on some of its systems this weekend. Specifically:
– MCS-hosted mailing list server upgrade (5PM on Friday through 7PM on Saturday).
— During this outage, mail sent to mailing lists will queue up and be held until the new server is operating. This covers all mailing lists hosted in the following domains:
lists.kbase-group.org
lists.igsb.anl.gov
There will be a brief interruption of the RT service and the Macintosh file server, silver.mcs.anl.gov, on Saturday morning. We do not anticipate these systems being offline for more than a few minutes.
Between 8AM and noon there will be intermittent outages that could prevent logging in to mail, workstations, login.mcs.anl.gov, and compute nodes. The actual outage should be shorter than that window.
If these outages pose an extreme hardship for you, please let us know ASAP.
Also, our Confluence instance is now in production. The new hostname is https://collab.mcs.anl.gov. You can request an account on it via the accounts interface at https://accounts.mcs.anl.gov. Thanks!
A critical server crashed in the middle of the night, taking down CI authentication. Normally, this should not cause too big a problem, as we have backup authentication servers. However, there appears to be a misconfiguration on the Zimbra servers that caused authentication against the backup servers to fail. This caused a cascading problem which made the Zimbra servers unresponsive. Mail was still coming in, but nobody could log in to check it.
The initial failure has been fixed (the authentication server is now back up), and we’re digging through the mess to make sure we fully understand why the other servers didn’t work as expected. We’ll have this bolted down such that the next failure will result in a proper fallback to redundant servers. Sorry for the inconvenience.
We’re aware of the problem with authentication to e-mail and are working on the solution. It will be fixed for some users very soon. CI users will have a slightly longer delay while we fix the authentication server. Sorry for the trouble.
RT (trouble ticket system) is currently down. We’re working to bring it back. It should be back shortly, after which we’ll send more details.
We identified the issue and turned on the mail queue again. New tickets are now being created. Any tickets or correspondence sent during the outage from someone without an account on RT would not have generated a ticket. This issue is fixed, and we’ll be pushing through the missing messages over the next 30 minutes or so. If you maintain or read an RT queue on our system, you should see this backlog of messages start shortly.
Sorry about this.
We’re tracking down a problem in RT that has prevented new tickets from being created. We’ll send an update when the issue is fixed. We have all the messages, so nothing will be lost, only delayed.
The MCS NFS fileservers will be undergoing maintenance from 6AM CST December 27 to 6PM CST December 28.
We are moving home directories and our softenv environment to a new server during this window. This work
is being performed to provide performance improvements and more available space.
During this period you may not be able to log on to Linux workstations, and many of our filesystems will
be marked read-only. The Linux workstations, login and compute servers will be restarted at the completion
of this maintenance.
Mathematics & Computer Science Division
Argonne National Laboratory
The new copier/printers are being installed today. We’re in the process of getting them set up. We’ll send an announcement with a link to instructions when everything’s all set and ready to go. You should be able to use the copy function now, but the rest will require a little time before it’s all online.
Thanks!
This outage and upgrade is going ahead. The time frame is from noon until 5PM. Happy Thanksgiving!