| the lady with the hole in her stocking ( @ 2008-02-08 19:04:00 |
| Entry tags: | teh unix, work |
IBM SAN: a caveat
I tend to be vague about the technical details of my job. Exclusive to this blog post: things to know if you are considering dropping a couple hundred grand on IBM servers and SAN stuff!
The equipment I run was bought before I was hired. I came from an office whose work groups were very strictly segmented. Merely glancing askance at a database or a screwdriver could get you a stern talking-to. So it was very nice to come to a small facility where I got my hands on everything. However, I hadn't dealt with hardware, much less IBM stuff, for a long time, or ever. It's taken a long time to determine what difficulties were due to my lack of familiarity and what was just dumb (bad) luck, but it's just been hassle after hassle, and our uptime sucks. I shudder at the thought of the TCO on this stuff. It might be that my magnetic personality, so delightful in conversation, is just too much for electronic equipment. I might be terribly unlucky. I really don't know if my experience is typical. But the failures I've seen are of a distinct pattern, which I shall describe forthwith.
The work stuff consists of about 2 racks of 1-3U servers, some blades, some switches, some trays of disk, and a SAN controller. We've got your average app servers, web server, firewall, VPN, backups, the usual suspects. Some servers have been just fine, while others have required firmware upgrade after part replacement after firmware upgrade. Our most expensive single server, a $22k x366, has never actually been in production for real because it crashes and burns every 2-6 months. Sometimes it acts hacked, sometimes like there's a bad CPU, sometimes like the SCSI bus is busted. The upshot of this is that it took about 2 YEARS for the firmware on this box to become stable. I say "stable" with a smirk...the last big crash-n-update we had was probably about a month ago, so we're due for another failure before too long.
Now onto the SAN. We have a DS4300 controller, some EXP710 disk enclosures, and an EXP100 SATA enclosure. The disk is arranged mostly in RAID 5 arrays with some hot spares, and I have to say before going further that the actual data has been remarkably safe. I really haven't seen corruption or disappearing data, except for this one time, but it might have been my fault for letting the NFS server run on the dying x366 and not getting it off sooner.
However, availability has been a disaster. Every so often, let's say every 3-6 weeks, production falls on its face b/c the NFS server starts serving out read-only mounts. The logs contain buckets of spurious SCSI errors and the SAN will often report media scrubs, diagnostic runs, and path redundancy errors. For a while, I thought it had to be something on the server...bad version of NFS? Bad disk drivers? Bug in ext3? Bad kernel level? I banged my head against that wall for a while. It's clear now this behavior comes from the SAN. Precisely what causes it, I have no idea. IBM support is usually vague in the style of, "omg that firmware level is like super dangerous and backrev and you need to upgrade NOW or your shit will fail forever". Which is kind of awesome, considering, um, they publish the firmware. Granted, IBM doesn't actually MAKE very much stuff anymore, they just buy parts from other vendors, rename them so you can't google them or find downloads for them easily, build them into really fugly cases (though parts are really swappable, props on the component placement design), and mark the price up a bajillion percent. So some of this blame goes to upstream manufacturers.
However!
I do NOT expect to pay a bajillion dollars to have a SAN that is such a delicate Princess and her pea, that faeries, sunspots, bad karma, or a slight breeze in the data center can cause my production NFS server to flop.
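For anyone who wants to catch the read-only flip before the users do, here is a minimal sketch of the kind of check I mean. It assumes a Linux box where the mount table is readable in the usual `/proc/mounts` layout (device, mount point, filesystem type, options). The function name `read_only_mounts` and the default list of filesystem types are made up for illustration, not anything IBM or the kernel ships.

```python
def read_only_mounts(mounts_text, fstypes=("ext3", "nfs", "nfs4")):
    """Return mount points of the given filesystem types that are mounted 'ro'.

    mounts_text is the raw text of a mount table in /proc/mounts format.
    The fstypes default is an assumption; adjust it to your environment.
    """
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        # Compare whole options so "errors=remount-ro" is not mistaken for "ro".
        if fstype in fstypes and "ro" in options.split(","):
            flagged.append(mount_point)
    return flagged


# Hypothetical sample: /export has gone read-only, / is still healthy.
SAMPLE = (
    "/dev/sda1 / ext3 rw,errors=remount-ro 0 0\n"
    "/dev/sdb1 /export ext3 ro,data=ordered 0 0\n"
)
print(read_only_mounts(SAMPLE))
```

On a real server you would feed it `open("/proc/mounts").read()` from cron and page someone when the list is non-empty. It only tells you that a mount went read-only, not why.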
It gets better. Often when this happens, one of the logical volumes will come back WITH NO PARTITION TABLE. This problem kept me up all night one time, after which my bolder and wiser co-worker just fdisked in a new table defining all the space on the volume as partition 1, which was correct. But when I found that out I was like HOLY SHIT WHAT THE HELL DID YOU DO YOU HAVE BALLZ THE SIZE OF TEXAS OH YOU MEAN IT WORKED? And poof, the partition mounts as good ol' ext3 and all the data is healthy. However, I doubt I have to point out that a supposedly fully redundant SAN should not turn its head and cough, resulting in dead mounts and eaten partition tables. DO NOT WANT.
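If you want to confirm "no partition table" without touching anything, a read-only look at the first sector is enough. This is a rough sketch that assumes a classic DOS/MBR label (the kind fdisk writes): a 512-byte first sector, four 16-byte entries starting at byte 446, and the 0x55 0xAA signature in the last two bytes. The function name `parse_mbr` is my own invention, and the sample sector is built in memory rather than read from a real device.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446   # first of four 16-byte partition entries
SIGNATURE = b"\x55\xaa"


def parse_mbr(sector):
    """Return a list of (number, type, start_lba, sectors) for used entries,
    or None if the MBR signature is missing (i.e. no partition table)."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need at least one 512-byte sector")
    if sector[510:512] != SIGNATURE:
        return None
    entries = []
    for i in range(4):
        raw = sector[TABLE_OFFSET + 16 * i:TABLE_OFFSET + 16 * (i + 1)]
        ptype = raw[4]  # partition type byte, 0 means unused
        start_lba, num_sectors = struct.unpack_from("<II", raw, 8)
        if ptype != 0:
            entries.append((i + 1, ptype, start_lba, num_sectors))
    return entries


# Hypothetical example: one Linux (0x83) partition starting at sector 63.
demo = bytearray(SECTOR_SIZE)
demo[TABLE_OFFSET + 4] = 0x83
struct.pack_into("<II", demo, TABLE_OFFSET + 8, 63, 1000)
demo[510:512] = SIGNATURE

print(parse_mbr(bytes(demo)))          # one used entry
print(parse_mbr(bytes(SECTOR_SIZE)))   # all zeros: no table at all
```

To point it at real hardware you would read the first 512 bytes of the block device as root, opened read-only. Recreating the table is still a job for fdisk and someone with nerves of steel.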
This last time, the mysterious errant voltages caused all these symptoms (yes, we've actually entertained the idea of a power conditioner in the rack...another thing you shouldn't need in a data center with decent PDUs) and I sighed out yet another hardware service call, yawned through the requisite firmware upgrade advice, and managed to barely lift one eyelid at the recommendation to reseat some modules. What caught my attention was that the enclosure ID selector (like an old-fashioned scsi-id selector...remember those?) is supposed to go from 00 to 77, indicating the loop id in the tens column, and place in the loop in the ones column. We have 5 enclosures, so we needed numbering like 01, 02, 03, 04, 05. But the dial on the back of this thing only went up to 72, or in our case, 02. That's right, the ones column just stopped advancing at 2. I thought I was crazy, thought maybe it was reversed for this model, that maybe there was something about these trays, that maybe the limitation was there because of some requirement about how to chain devices or....you can see where this is going. I had all sorts of crazy hypotheses. Turned out it was just BROKEN. Called to have that replaced. We pulled the fibre cables to remove the device from the loop, because yanking this part, or changing the id, could damage the SAN.
But yanking the cables caused yet another path redundancy incident, SCSI errors, NFS r/o, production down. Anyone get the joke? A path redundancy error after pulling a path? HAHA? At this point, no one in the office actually expects services to be up, so what the hell ever. We replaced the part, checked the cabling, checked the firmware levels, I recreated 2 partition tables, and we're back up today. We also tested pulling cables after the fix, and we got errors (as expected) but no failures, which was a pleasant surprise. Yawn, 1 all-day outage and a few smaller outages this week, all unplanned. Awesome. Maybe it will be stable now? Now that our gear is starting to age and glancing toward end-of-life, maybe the firmware is finally mature. I'll find out over the coming weeks, I guess.
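Back to that enclosure ID dial for a second, since the numbering scheme is easy to get wrong. Going only by what I described above (two digits, each 0 through 7, tens column is the loop, ones column is the place in the loop), a sanity check might look like the sketch below. The function names are hypothetical and this is my reading of the dial, not IBM's documented rules.

```python
from collections import Counter


def decode_enclosure_id(enclosure_id):
    """Split a two-digit enclosure ID into (loop, position).

    Assumes each digit runs 0-7, so valid IDs are 00 through 77.
    """
    tens, ones = divmod(enclosure_id, 10)
    if not (0 <= tens <= 7 and 0 <= ones <= 7):
        raise ValueError("enclosure ID out of range: %r" % (enclosure_id,))
    return tens, ones


def duplicate_ids(enclosure_ids):
    """Return the sorted IDs that appear on more than one enclosure."""
    counts = Counter(enclosure_ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Five enclosures numbered the way we wanted them.
print([decode_enclosure_id(i) for i in (1, 2, 3, 4, 5)])

# A dial whose ones column sticks at 2 cannot reach 03, 04, or 05 and
# ends up colliding with the tray that is already set to 02.
print(duplicate_ids([1, 2, 2, 4, 5]))
```

The point of the duplicate check is simply that a dial that can't reach its intended number has to sit on one that may already be taken.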
Then there is IBM linguistics. Just in my environment, we have 2 programs called Storage Manager, one to manage devices in the SAN, and one for backups, a Tivoli program. As mentioned, they maddeningly rebrand and rename, so it took me a long time to figure out that a FastT card was just a QLogic HBA. Instead of LUN masking, they say "storage partitioning", which to me is a COMPLETELY meaningless phrase. It could apply to, like, anything in computing, and unless you're down that particular rabbit hole, that wording gives no indication of what it's actually about. IBM loves the vague.
My experience with IBM is that the only way to make it work is to drink the Kool-Aid and try not to breathe too hard. Anyone have history with IBM hardware that confirms or contradicts this experience?