Mike Rios has been working tirelessly on this issue from the start, losing sleep and, I suspect, some wits (kidding). Thanks, Mike, for all your work!

He has just sent out a note on the current state of affairs vis-à-vis Zimbra, along with a link to a page that they will be using to post updates:

All.

We are still experiencing an issue with our Zimbra mail system. At present, Zimbra runs normally for a short while, then response time to the users will begin to degrade after about 15-30 minutes until the system is virtually unusable. Mail delivery is still happening; the front-end is the only thing being impacted by this issue.

We have taken to restarting the module that is responsible for the user response and mail interface every 15 minutes, at :00, :15, :30, and :45 on the clock. The restart process takes about 90 seconds, during which time reading and sending mail, along with the Zimbra web interface, will be affected. This strategy has allowed us to “limp” along while we work with Zimbra in finding a solution for the problem we have.

Zimbra engineers are working with us, examining our log files and going through their code. There is at present no estimated time to repair for this issue. Zimbra understands that this is a critical issue for us, and has a number of people working this issue and keeping us informed of their progress. What information we receive will be communicated on this list. In addition, we will be keeping a wiki page up-to-date with information as we have it:

https://wiki.inside.anl.gov/inside/Zimbra/Current_Issues

If there are any questions regarding this outage or any other issues related to this outage, please don’t hesitate to direct them to this list or any of the Argonne people involved.

Thank you for your support and patience!

mike rios.
As many of you know, the CIS Zimbra service has been “mis-behaving” since roughly 10pm on Tuesday, October 12th.

The symptoms of this behavior: poor responsiveness leading to an unresponsive system – typically within 20-30 minutes of the last restart.

We have a temporary “work-around” in place until Zimbra has a fix for us – we restart the affected process once every 15 minutes. The restart takes around 90 seconds, during which time IMAP and web-based access are unavailable.

After doing our own investigation into the issue, we have been working around the clock with Zimbra support since around 2AM this morning. Zimbra has several developers working on this, and they are following several leads. It is at their highest level of priority and we have stressed to them how important it is that we get this solved quickly.

As new information becomes available, we will pass it along. Please feel free to contact me directly to address any questions and concerns.

Thank you for your continued patience.
At the moment, we’re still waiting for some fix from Zimbra. We’re giving them until Sunday evening before we start trying more drastic measures. We felt this time frame was acceptable given that mail is generally working, except for a 90 second IMAP outage every 15 minutes on the quarter hour.
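For anyone who wants to work out when the next interruption is due, here is a minimal sketch of the schedule described above. It assumes only the figures from these posts (a restart on every quarter hour, lasting roughly 90 seconds); the function names and the sample timestamps are hypothetical, and this is not the script actually used to do the restarts.

```python
from datetime import datetime, timedelta

RESTART_INTERVAL_MIN = 15  # restarts at :00, :15, :30, :45 (from the posts)
RESTART_DURATION_SEC = 90  # approximate restart time (from the posts)


def next_restart(now: datetime) -> datetime:
    """Return the next quarter-hour boundary strictly after `now`."""
    base = now.replace(second=0, microsecond=0)
    minutes_to_add = RESTART_INTERVAL_MIN - (base.minute % RESTART_INTERVAL_MIN)
    return base + timedelta(minutes=minutes_to_add)


def in_restart_window(now: datetime) -> bool:
    """True if `now` falls within ~90 seconds after a quarter-hour boundary."""
    seconds_into_interval = (now.minute % RESTART_INTERVAL_MIN) * 60 + now.second
    return seconds_into_interval < RESTART_DURATION_SEC


def unavailable_fraction() -> float:
    """Fraction of each 15-minute interval spent restarting."""
    return RESTART_DURATION_SEC / (RESTART_INTERVAL_MIN * 60)


if __name__ == "__main__":
    sample = datetime(2010, 10, 14, 10, 7, 30)  # hypothetical timestamp
    print("Next restart:", next_restart(sample))
    print("Restarting right now?", in_restart_window(sample))
    print("Unavailable fraction:", unavailable_fraction())
```

Under those assumptions, IMAP and the web interface are interrupted for about a tenth of each quarter hour, which is the trade-off being weighed against more drastic measures.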
The Zimbra mail and calendar service is having serious service issues. Work has been going on all night, and engineers at Zimbra are helping. At this point we have no ETA on when things will be stable. More details as they emerge.
CIS will be performing an update to the Zimbra service between 9AM and 5PM on Saturday, July 17. During this window, you will not be able to receive new mail, send mail through Zimbra, or check for mail on the server. Calendars will also be unavailable during this window.

Any mail sent during this window will queue up and be delivered once the server is back online. We do not expect any loss of mail or bounced mail.

This upgrade is migrating the server from a 32-bit version to the 64-bit version, which will allow us to enhance performance. Also, the 32-bit versions will be end-of-life soon, and will no longer receive support.

In August, we are planning to upgrade the server from Zimbra 5.0.23 to Zimbra 6.x. We expect this to fix a number of bugs, so that’s something to look forward to.

Sorry for any inconvenience.
Here’s the talk I gave in early June on the state of systems. Alas, the video didn’t quite make it: the audio was too quiet and it cut out before the end. But if you want to talk about anything you see in there, please come see me!

Systems Talk 2010
Let me tell a couple of true stories of how social networking can be used to cause you and your friends harm.

The first happened to a friend of mine. He’s sitting at his computer, browsing facebook, and he gets a chat request from a friend. This friend claimed to be in London, and needed money desperately as he had been robbed. My friend was smart enough to recognize this might not be legit, and started asking questions. Of course, because this guy had access to all the data in facebook, he could be fairly convincing in his answers (minus, of course, the lag time in looking up the information). As you may guess, my friend did not wire any money. Turns out this is a scam that’s getting more and more common.

The second story happened to an acquaintance of this same friend. However, in her case, it was her account that was compromised: a Yahoo mail account, which was used to send mail to her friends asking for money. We don’t quite know how successful this one was. We do know the malicious user deleted all the e-mails from her account.

Neither of these incidents happened at Argonne, just so you know.

I tell you these stories to remind you that you need to be on your toes. In this day and age of social networking and information sharing, we’re putting a lot of information out there that can be used against us in many ways. I was startled when I visited pipl.com and searched for myself: all this information is out there, scraped off of webpages, social networking sites, Usenet… you name it. Someone armed with that information might be able to pull off a convincing job of pretending to be me. Convincing enough to scam someone else out of money or information they shouldn’t have.

So be careful what you put out there. Keep your passwords strong, lengthy, diverse, and private. Don’t reuse them.

Here are a couple of links that were passed on to me today from ANL’s Cyber Security Program Office.
The first is available on-site only, and is written by Mike Skwarek, the Cyber Security Program Manager and Deputy CIO for the lab. I recommend reading them, as there’s good advice in there.
Hey, folks.

Just a friendly reminder that we’ll be upgrading the Zimbra mail and calendar server on Saturday. Your inbox, calendar, and webmail will not be accessible between 9AM and 2PM. No mail will be lost, only delayed during the outage window.

This upgrade will address some locked mailbox issues we’ve seen sporadically, as well as allow Android Exchange e-mail syncing to work correctly.

Thanks!
Hey, gang.

Communication is an important thing. Without it, we’d all just be a bunch of meat sticks making random noises and gestures at each other. A part of communication that’s important to any service organization is feedback. Sometimes, that feedback is immediate and candid, sometimes it’s given after the fact, and sometimes it’s solicited. While the focus of this post is on soliciting feedback, I really do hope everyone knows you do not need to wait for some survey or visit to offer input and ideas. We’re your Systems Group. When my door is open, which is generally the case, I’m an available set of ears. If you’ve got ideas for improvements, if you’ve got praise for someone, if you’ve got a complaint about someone, or you just want to talk about what we’re doing, I’m always willing and happy to talk. So, please, consider the lines of communication always open. I was going to make some joke about TCP ports and stateful connections, but that’s a sad kind of geeky.

Anyway, it’s a new calendar year, we’re in our new building, and John Tesh hasn’t turned into a giant lizard and started terrorizing Tokyo, so it seems a good time to solicit some feedback. Over the coming months, I’m going to be visiting with many of you looking for your opinions and ideas, but I’d like to get a general sense of things, too. Many years back, we did a user satisfaction survey with the division. It proved to be useful information to have, so I’d like to do it again.

At your leisure, please visit the survey (hosted at surveymonkey.com): http://www.surveymonkey.com/s/6CWNRWJ. It’s just 10 questions, and the survey is open until COB on Friday, January 29. Your answers are as anonymous as you want them to be – no IP addresses are collected, and no identifying information is asked. Unfiltered feedback is what I want.

It’s been a while since we’ve done one of these things, and it’s something we should do more often. At least annually, I think. In any case, I’ll talk about the results in February.

Thanks!
Wow, what a day. Most of us got here over 15 hours ago, some of us are still here. I’m a little punch drunk, so I’ll give a quick summary and perhaps a better post mortem later in the week.
- All equipment was moved without incident. Boyer-Rosene (the folks who moved all of us into our offices) were simply fantastic, very professional, and a joy to deal with!
- The bulk of the day was spent putting the machines back together, plugging in all the whosits and whatsits and things that go *ping*
- We had a bit of a network scare around 7, just when we thought we were back. Corby and Linda trudged over to 221 and got things working again after some fighting with some very old networking hardware.
- We *believe* no mail was bounced, and things seem to be trickling through now and will continue to do so overnight
- We also believe just about all critical infrastructure is back up. We’ll catch what we missed either tomorrow or Monday
- We finally retired the following machines and services:
  - Our old Windows 2000 domain
  - Our old mail servers, some of which date back to the 90’s
  - Our old tape library, that’s been handling our tape backups since… well, since before I’ve been employed here.
- Your “windows password” is no more. You now have a single MCS password, used for logging into everything we run except Zimbra. And, soon, Zimbra will also use that password
The Core really looks like a datacenter now. It’s filled with racks of machines, all chugging along. We have some work to do, still, and some cleanup in there (along with a hefty amount of cleanup in 221). Once it’s all done, we’ll have a little “open house” over lunch some day when you can stop by and we can show you all the cool things about the room and what’s running in there, plus what will be running in there in the years to come.

A big, heartfelt “thank you” to everyone who came in to help today, and throughout the weeks since we started this move: Corby Schmitz, Linda Winkler, Max Trefonides, Hunter Matthews, John Valdes, Jason Hedden, John Roberts, Rick Bradshaw, Ti Leggett, Ken Raffenetti, Dave Goodell, Darius Buntinas, Rob Latham, Jared Wilkenning, Narayan Desai, Pavan Balaji, Rinku Gupta, David Ressman. If I’ve forgotten someone, please let me know! This whole move was a huge thing, and praise is deserved.

I don’t want to minimize anyone’s effort in this, but I feel I must call out two people who have put in a really extraordinary effort in making this all happen. First is Hunter, who spent almost every day since we moved to 240 back over in the old machine room getting gear ready to move, all while learning about and rebuilding a new piece of the infrastructure that recently fell under his responsibility. I also want to call out Rick, who really took the lead on the huge organizational burden of this move and did a fantastic job, all while fighting the NFS server issues that plagued us (and coming up with a top-notch and zippy service improvement).

Lastly, some entertainment for you. I’ve posted a couple of videos to facebook, but for those of you who aren’t on there, check out the following videos. This is what happens during a long day in the datacenter:
It’s here! It’s underway! Things are moving!

Move II: October 16-18
- Full Disclosure (already down)
- Breadboard (already down)
- kbT compute cluster (already down)
- I2U2 resources (already down)
- LCRC DDN storage system (already down)
- MCS Core Computing infrastructure (going down at around 4-4:30 PM today)
First thing Saturday morning, the movers will move the remaining equipment into the Core, at which point we’ll start hooking things back up and have things operational as soon as possible. You should generally not expect the resource to be available until the Monday following the move (though we’ll obviously strive for getting things up as quickly as possible).

MCS Computing resources:
- October 16
  - 4:00-4:30 PM: general core computing resources go down. At this point, all login nodes, compute machines, file servers, etc., will be offline as we pull cables and prepare the machines for the move.
  - Note: This will affect mail service. Reading and sending mail will be unaffected; receiving of new mail will be delayed, but we’re working to keep the downtime as short as possible.
- October 17
  - Very early in the morning, movers start moving gear to the Core.
  - If all goes well with no snags, we should be operational before 5:00 PM. However, the outage window still runs until Monday in case things go other than smoothly.
There you have it! We’re shooting to keep downtimes to a minimum and make this weekend go as smoothly for you (and *us*) as possible. Sorry for any inconvenience this may cause you.

See you at the TCS reception at 3!