Dispatches From The Geeks

News and Announcements from the MCS Systems Group

Maintenance window for GitLab services 2017-07-27 12:00 – 13:00 CDT

A maintenance window for our GitLab services has been planned for Thursday 2017-07-27 from 12:00 to 13:00 CDT in order to update the software.

GitLab services provided by gitlab.cels.anl.gov and xgitlab.cels.anl.gov will be unavailable during maintenance.

Service is expected to be restored by 13:00 CDT, and may be available sooner than that.

If this window poses undue inconvenience, please let us know; we can reschedule if needed.

Written by Craig Stacey

July 27, 2017 at 8:59 am

Posted in Uncategorized

ALL-CLEAR: CELS NFS File server work is Done

Greetings,
All services should be back online. If you experience any trouble with your Linux workstation, please reboot it. If you are still experiencing trouble after that, please contact the CELS Help Desk in the morning. Thank you for your patience.

Written by Adam Max Trefonides

July 11, 2017 at 7:21 pm

Posted in Uncategorized

NOTICE: CELS NFS File Server Outage Tonight

Reminder:

Tonight, July 11, 2017, shortly after 5 PM, we will be taking our main home directory file server offline so that we can replace the system board. We expect the outage to last about 2 hours.

Which users will this affect?
Anyone using the NFS file services in the CELS/MCS environment.
If you ssh in to login.mcs.anl.gov or log in to a Linux workstation at the console in bldg. 240, it’s likely you will be affected.

Which users will this not affect?
This should not affect most of our Mac users, such as Administrative Assistants. This should not affect most of the users in Bio.

Which systems will this affect?
Any system that NFS mounts file systems from sto10.mcs.anl.gov. This includes MCS Linux workstations, compute servers, and the login.mcs.anl.gov systems.

Which services will this affect?
NFS mounted home directories will be unavailable. Users will not be able to initiate new ssh sessions to any of the affected systems, and any open sessions will likely suspend or fail. Most NFS shared project directories will also be inaccessible.

What should you do?
Before 5 PM tonight you should save any open work and log out of any open console or ssh sessions. After the outage it may be necessary to reboot your Linux workstation.
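If you are not sure whether a particular Linux machine is one of the affected systems, you can look at its mount table for NFS mounts served by sto10.mcs.anl.gov. The following is only a rough sketch, not an official tool: the function names are made up for illustration, and it assumes the mount source appears as the full host name followed by a colon (a mount recorded by IP address or short name, or one that autofs has not mounted yet, would not be detected).

```python
def nfs_mounts_from(server, mounts_text):
    """Return mount points whose NFS source is hosted on `server`.

    `mounts_text` is expected to be in /proc/mounts format:
    source, mount point, file system type, options, ...
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        source, mount_point, fs_type = fields[0], fields[1], fields[2]
        # NFS sources look like "host:/exported/path"
        host = source.split(":", 1)[0]
        if fs_type in ("nfs", "nfs4") and host == server:
            hits.append(mount_point)
    return hits


def local_nfs_mounts_from(server, path="/proc/mounts"):
    """Same check against this machine's own mount table (Linux only)."""
    with open(path) as f:
        return nfs_mounts_from(server, f.read())
```

Calling `local_nfs_mounts_from("sto10.mcs.anl.gov")` on a workstation returns the list of matching mount points. An empty list suggests the machine does not currently have anything mounted from that server, subject to the caveats above.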

We will send an “All-Clear” alert via e-mail to this list and post to the systems blog and Twitter once the work has been completed.

Systems Blog:
https://mcssys.wordpress.com

Twitter:
@mcssys
https://twitter.com/mcssys

Written by Adam Max Trefonides

July 11, 2017 at 11:07 am

Posted in Uncategorized

Reduced Help Desk coverage this week

Due to a combination of illness and vacations, we won’t have someone physically at the CELS help desk this week (or at least not today; we’ll see as the week goes on).

Please send e-mail about any problems you may have. If you have issues with using e-mail or other Argonne-provided services (Dayforce, Workday, ANL passwords, wifi, etc.), you can also visit the BIS Help Desk next to the west entrance of TCS, or call x9999.

Thanks.

Written by Craig Stacey

July 3, 2017 at 9:33 am

Posted in Uncategorized

Final Update: BIO server back online

We were building the replacement disk array overnight (it was still building when I got in this morning), and I decided to see if I couldn’t get the original array into a useful state. Lo and behold, it had become responsive again overnight, and was back online. I did a few tests, and all looks good, so I’ve brought the server back online. Y and Z drives should be fully functional.

I’m starting a project to get the Y and Z drives moved from that system. It’s quite old and I’m concerned for its reliability. I’ll be talking with BIS about what’s appropriate to be hosted there vs. in our storage systems here.

Written by Craig Stacey

June 28, 2017 at 9:18 am

Posted in Uncategorized

Update 2: BIO server outage

The disk array housing the Y and Z drives appears to have failed in a catastrophic way. We’re reconfiguring an unused array tonight and will start to restore files from tape backup tomorrow. Once the restore starts, we should be able to provide an ETA.

Written by Craig Stacey

June 27, 2017 at 4:40 pm

Posted in Uncategorized

Update: BIO server outage

All servers but one (BIOWINFS, the file server for Y and Z drives) are back. The storage array for BIOWINFS is in a bad state. We’re still trying to bring it back. No ETA yet. If we can’t get it back, we will need to restore from tape backups to new storage. I will post updates as we have them.

Written by Craig Stacey

June 27, 2017 at 2:14 pm

Posted in Uncategorized

BIO server outage

Power work in TCS has unexpectedly affected the servers hosted in TCS. We’ve restored power to the affected servers and disks and are in the process of recovering the machines. We will send an all-clear when this is complete. This affects the shared drives Y and Z and the Linux file servers. Any services not functioning as expected are likely caused by this. I hope to have things back within the hour, if not sooner.

Written by Craig Stacey

June 27, 2017 at 11:38 am

Posted in Uncategorized

Maintenance window for CELS GitLab services 2017-06-28 12:00 – 13:00

A maintenance window for our GitLab services has been planned for 2017-06-28 (Wednesday) from 12:00 to 13:00 (CDT) in order to update the software.

GitLab services provided by gitlab.cels.anl.gov and xgitlab.cels.anl.gov will be unavailable during maintenance.

Service is expected to be restored by 13:00, and may be available sooner than that.

If this window poses undue inconvenience, please let us know; we can reschedule if needed.

Chris

Written by Craig Stacey

June 26, 2017 at 1:16 pm

Posted in Uncategorized

MCS compute nodes

The scheduled reboot of the compute nodes has been completed.

Brad

From: "Fritz, Brad" <fritz@anl.gov>
Date: Thursday, June 22, 2017 at 12:14 PM
To: "cels-systems-announce@lists.anl.gov" <cels-systems-announce@lists.anl.gov>
Subject: MCS compute nodes

We are planning to reboot the MCS compute nodes on June 23rd, starting at 9am CST, to apply updates.

Servers will be rebooted in the following order, one at a time.

  • compute001.mcs.anl.gov
  • thwomp.mcs.anl.gov
  • stomp.mcs.anl.gov
  • crush.mcs.anl.gov
  • crank.mcs.anl.gov
  • steamroller.mcs.anl.gov
  • grind.mcs.anl.gov
  • churn.mcs.anl.gov
  • trounce.mcs.anl.gov
  • thrash.mcs.anl.gov
  • vanquish.mcs.anl.gov
  • octagon.mcs.anl.gov
  • octopus.mcs.anl.gov

Only one compute node should be down at any given time.

Brad
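The one-at-a-time ordering described in the message above can be sketched roughly as follows. This is an illustration only, not the procedure the Systems Group actually ran: `reboot` and `is_up` are hypothetical callables you would supply yourself, and `is_up` is assumed to return True only once the host has finished rebooting (for example, by checking that its boot time has changed).

```python
import time


def rolling_reboot(hosts, reboot, is_up,
                   poll_seconds=30, timeout_seconds=1800, sleep=time.sleep):
    """Reboot hosts in order, waiting for each to return before the next.

    Stops with TimeoutError if a host does not come back in time, so that
    no more than one host is ever taken down by this loop.
    """
    done = []
    for host in hosts:
        reboot(host)  # hypothetical: trigger the reboot of one host
        waited = 0
        while not is_up(host):  # hypothetical: True once reboot has finished
            if waited >= timeout_seconds:
                raise TimeoutError(
                    f"{host} did not come back; completed so far: {done}"
                )
            sleep(poll_seconds)
            waited += poll_seconds
        done.append(host)
    return done
```

Halting on the first host that fails to return is the design choice that keeps the "only one node down" guarantee: the remaining hosts in the list are left untouched until someone investigates.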

Written by Craig Stacey

June 23, 2017 at 11:20 am

Posted in Uncategorized