Dispatches From The Geeks

News and Announcements from the MCS Systems Group

Dispatch 2008.4: The Aftermath

(There’s a beer theme going here, bear with me.  It’s American Craft Beer Week: http://www.americancraftbeerweek.org.)

Well, folks, I’m past due for my general update on All Things Systems.  I’d like to make this a longer note, but I’m pressed for time.  Also, I’m at a bar right now (thank you, gods of wireless!) and the beer is awful good, which means the more I write the less comprehensible I become.

So, a brief rundown on the topics, and then we’ll dive in to the update-y goodness!

* Mail migration post-mortem
* Print server upgrade post-mortem
* SSH key problem
* HPC Sysadmin Search, the Home Game

Mail migration post-mortem:

I will mince no words.  This didn’t go near as well as I’d hoped.  But it went better than it could have.  I’d like to take a minute to explain what happened, and why it happened.

Back when we first started planning the migration to the new server, I did a little reading to find out how other organizations handled such a move.  The last time *we* did a move like this, it was back in 1998/99, and we had not nearly as much mail as we do now.  The largest mailbox was no more than 2GB of space, for instance (winner in the largest mailbox post-migration this time: 15GB).

It turns out the common method for doing this is the following process:

* Set up new mail server.
* On a predetermined date, switch delivery to the new mail server.
* Give users instructions on how to move mail from old to new server.

Simple.  However, since the server we were moving to was Zimbra, we had other options.  People had gone through this before, and had written tools to take care of the migration on behalf of the users.  Marvelous!  We could take on all of the “heavy lifting” in the process, and migrate everyone’s mail to the new server, and be heroes to all!

Alas.

Alas, alas.

Okay, I’m ordering another beer.

Mmmm.

Well, the people who wrote these tools wrote them to solve their particular problems.  They are presented without warranty, and rightly so.
Turns out some mailboxes on our system are a little beefier than these folks encountered.  And the inevitable scalability problem ensues.  Couple this with a number of false starts as the attempt to actually move the huge pile of data to the new server goes awry at various stages and you wind up where we ended up: what was planned to be a painless 12-hour migration turns into a multi-week affair with even more false starts and your humble narrator turning into a hopeless alcoholic.

But at least the beer is good.

I digress.

The sad irony is that, in the end, we ended up doing just what we hoped to avoid: switching the delivery location and leaving it up to the users to move their own mail, but with the added baggage of a less-than-complete migration for some users, and a lot of duplicated data on the server.

It’s not pretty.  We know.

So where are we now?  Well, aside from one user who I’m in the process of helping, and another who I’ll help once their laptop is rebuilt, we feel like we’re done with the migration of actual mail.  If this statement comes as a surprise to you and you’d like some help migrating from the old server to the new server, send a note to systems@mcs.anl.gov and we’ll help you out.

I apologize completely and sincerely, but there were some communications problems in the beginning of this process, and you may have had the impression we were magically going to fix your mail without any interaction from you.  That’s my fault: I had assumed everyone on the team was as aware of things as I was, and that was incorrect.  When it gets late at night and things are going crazy, you can forget that decisions you make aren’t magically made obvious to everyone else.  So, needless to say, but I’ll say it anyway, we learned a number of lessons on intra-group communications with this one.

So, modulo any users we need to help finalize the migration, we’ve pretty much moved all user mailbox delivery onto the new server.  This is good!
We’re ironing out some issues on the new server, and there’s an upgrade coming in the next couple of weeks to clear up some residual bugs, but in general things seem to be going okay.

We have to migrate the trouble ticket system and the mailing lists onto the new server, and then we’ll be done with the old server once and for all.  As I promised earlier, we’ll archive all that data should it turn out we need it again.  I’ll send announcements on the mailing list upgrade and trouble ticket system upgrade as we approach them in the coming weeks.  I can say that they won’t be near as troublesome as the actual mail move, since there’s no substantial amount of data to be moved.  Mailing list archives (as of the switch) will stay where they are, with new archives being integrated into the new mailman server.  As for trouble tickets, the old Req archives will continue to be available, but new tickets will be put into RT.

I really can’t overstate how much we all appreciate the support you’ve all given and the patience you’ve shown as we muddle through this.  I’ve had my share of very bad days throughout this whole process, and without fail someone would either stop by and say something supportive, or send a note with kind words, and it would completely turn the day around.

Without exaggeration, you are a truly fantastic bunch of people to support.  So, thank you.  Sincerely.

And that’s not the beer talking.

Now, another beer before moving on to the next section…

Print server upgrade post-mortem:

Of the upgrades we did on that fateful weekend, the print server upgrade was the least invasive, but not without its problems.  Ken is working hard on the solution to these problems, and we hope to have things stabilized soon.  Generally, things went as expected, but some networks (such as the super secret not-yet-production authenticated wireless) are having trouble connecting.  Also, printing from Preview or Microsoft Office on the Mac in landscape mode seems to have issues.
We’re pursuing parallel tracks on solving this, and hope to have a solution soon.

SSH key problem:

By now you’ve seen the mail from Rick Bradshaw on the SSH key regeneration we’ve had to do.  A bad decision by a Debian developer some time back resulted in weak cryptography in SSH key generation, and the keys are what make the encryption work and be useful.  If you have issues, please let us know at systems@mcs.anl.gov and we’ll help you out.

HPC Sysadmin Search:

David Ressman, current administrator of a couple of HPC (High Performance Computing) clusters, is moving to ALCF to do some even higher performance computing systems administration.  This means we’re looking for an awesome HPC sysadmin to take over his duties.  If you know of someone who you think is a worthwhile candidate, please let me know!

That’s it for the all-too-brief Dispatch this month.  Another one will come in June.  I’m taking off for a little R & R and will be back in the office on Tuesday, May 20.

Hope to see (and/or meet) you at the CELS picnic!


Written by Craig Stacey

May 16, 2008 at 7:35 am
