the lady with the hole in her stocking (steph99) wrote,
@ 2008-02-08 19:04:00
Entry tags:teh unix, work

IBM SAN: a caveat
I tend to be vague about the technical details of my job. Exclusive to this blog post: things to know if you are considering dropping a couple hundred grand on IBM servers and SAN stuff!




The equipment I run was bought before I was hired. I came from an office whose work groups were very strictly segmented. Merely glancing askance at a database or a screwdriver could get you a stern talking-to. So it was very nice to come to a small facility where I got my hands on everything. However, I hadn't dealt with hardware, much less IBM stuff, for a long time, or ever. It's taken a long time to determine what difficulties were due to my lack of familiarity and what was just dumb (bad) luck, but it's just been hassle after hassle, and our uptime sucks. I shudder at the thought of the TCO on this stuff. It might be that my magnetic personality, so delightful in conversation, is just too much for electronic equipment. I might be terribly unlucky. I really don't know if my experience is typical. But the failures I've seen are of a distinct pattern, which I shall describe forthwith.

The work stuff consists of about 2 racks of 1-3U servers, some blades, some switches, some trays of disk, and a SAN controller. We've got your average app servers, web server, firewall, vpn, backups, the usual suspects. Some servers have been just fine, while others have required firmware upgrade after part replacement after firmware upgrade. Our most expensive single server, a $22k x366, has never actually been in production for real because it crashes and burns every 2-6 months. Sometimes it acts hacked, sometimes like there's a bad cpu, sometimes like the scsi bus is busted. The upshot of this is that it took about 2 YEARS for the firmware on this box to become stable. I say "stable" with a smirk...the last big crash-n-update we had was probably about a month ago, so we're due for another failure before too long.

Now onto the SAN. We have a DS4300 controller, some EXP710 disk enclosures, and an EXP100 SATA enclosure. The disk is arranged in mostly raid 5 arrays with some hot spares, and I have to say before going further, that the actual data has been remarkably safe. I really haven't seen corruption or disappearing data, except for this one time, but it might have been my fault for letting the nfs server run on the dying x366 and not getting it off sooner.

However, availability has been a disaster. Every so often, let's say every 3-6 weeks, production falls on its face b/c the NFS server starts serving out read-only mounts. The logs contain buckets of spurious scsi errors and the SAN will often report media scrubs, diagnostic runs, and path redundancy errors. For a while, I thought it had to be something on the server...bad version of NFS? Bad disk drivers? Bug in ext3? Bad kernel level? I banged my head against that wall for a while. It's clear now this behavior comes from the SAN. Precisely what causes it, I have no idea. IBM support is usually vague in the style of, "omg that firmware level is like super dangerous and backrev and you need to upgrade NOW or your shit will fail forever". Which is kind of awesome, considering, um, they publish the firmware. Granted, IBM doesn't actually MAKE very much stuff anymore, they just buy parts from other vendors, rename them so you can't google them or find downloads for them easily, build them into really fugly cases (though parts are really swappable, props on the component placement design), and mark the price up a bajillion percent. So some of this blame goes to upstream manufacturers.


However!

I do NOT expect to pay a bajillion dollars to have a SAN that is such a delicate Princess and her pea, that faeries, sunspots, bad karma, or a slight breeze in the data center can cause my production NFS server to flop.
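
For anyone who wants to catch the read-only flip before the users do, here is a minimal sketch. It assumes (the post doesn't confirm this) that the symptom shows up as the ext3 filesystem under the NFS export getting remounted read-only, which on Linux would appear as an "ro" option in /proc/mounts. The function name and the sample text are made up for illustration.

```python
# Sketch: find filesystems mounted read-only, given text in /proc/mounts format.
# Each line is: device mount_point fs_type options dump pass

def read_only_mounts(mounts_text, fs_types=("ext3",)):
    """Return the mount points of the given filesystem types that carry the 'ro' option."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type in fs_types and "ro" in options.split(","):
            hits.append(mount_point)
    return hits


# Hypothetical sample; on a real box you would pass in the contents of /proc/mounts.
SAMPLE = (
    "/dev/sda1 / ext3 rw,data=ordered 0 0\n"
    "/dev/sdb1 /export/data ext3 ro,data=ordered 0 0\n"
)

for mount_point in read_only_mounts(SAMPLE):
    print("read-only:", mount_point)
```

Run from cron, something like this would at least turn "production fell on its face" into an early page. It says nothing about why the SAN hiccuped.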



It gets better. Often when this happens, one of the logical volumes will come back WITH NO PARTITION TABLE. This problem kept me up all night one time, after which my bolder and wiser co-worker just fdisked in a new table defining all the space on the volume as partition 1, which was correct. But when I found that out I was like HOLY SHIT WHAT THE HELL DID YOU DO YOU HAVE BALLZ THE SIZE OF TEXAS OH YOU MEAN IT WORKED? And poof, the partition mounts as good ol' ext3 and all the data is healthy. However, I doubt I have to point out that a supposedly fully redundant SAN should not turn its head and cough, resulting in dead mounts and eaten partition tables. DO NOT WANT.
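
If you'd rather look before you fdisk, a read-only check of the first sector is cheap. This is only a sketch, and it assumes the volume carried a classic DOS/MBR partition table (the post doesn't say which kind it was): four 16-byte entries starting at byte 446, and a 0x55 0xAA signature in the last two bytes of the sector. The function name is hypothetical, and nothing here writes to the disk.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446   # first partition entry in an MBR
ENTRY_SIZE = 16      # four entries of 16 bytes each


def parse_mbr(sector):
    """Return [(type, start_lba, sector_count), ...] for non-empty entries,
    or None if the sector has no MBR boot signature at all."""
    if len(sector) < SECTOR_SIZE or sector[510:512] != b"\x55\xaa":
        return None
    entries = []
    for i in range(4):
        start = TABLE_OFFSET + i * ENTRY_SIZE
        raw = sector[start:start + ENTRY_SIZE]
        # Layout: status(1) CHS-start(3) type(1) CHS-end(3) LBA-start(4) sectors(4)
        _status, ptype, start_lba, count = struct.unpack("<B3xB3xII", raw)
        if ptype != 0:  # type 0 marks an unused slot
            entries.append((ptype, start_lba, count))
    return entries


# Example with an all-zero sector, i.e. what a wiped table would look like.
print(parse_mbr(bytes(SECTOR_SIZE)))
```

A result of None means the signature is gone; an empty list means the signature survived but every slot is blank. Either way it tells you what you're dealing with before anyone reaches for fdisk.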

This last time, the mysterious errant voltages caused all these symptoms (yes, we've actually entertained the idea of a power conditioner in the rack...another thing you shouldn't need in a data center with decent PDUs) and I sighed out yet another hardware service call, yawned through the requisite firmware upgrade advice, and managed to barely lift one eyelid at the recommendation to reseat some modules. What caught my attention was that the enclosure ID selector (like an old-fashioned scsi-id selector...remember those?) is supposed to go from 00 to 77, indicating the loop id in the tens column, and place in the loop in the ones column. We have 5 enclosures, so we needed numbering like 01, 02, 03, 04, 05. But the dial on the back of this thing only went up to 72, or in our case, 02. That's right, the ones column just stopped advancing at 2. I thought I was crazy, thought maybe it was reversed for this model, that maybe there was something about these trays, that maybe the limitation was there because of some requirement about how to chain devices or....you can see where this is going. I had all sorts of crazy hypotheses. Turned out it was just BROKEN. Called to have that replaced. We pulled the fibre cables to remove the device from the loop, because yanking this part, or changing the id, could damage the SAN.

But yanking the cables caused yet another path redundancy incident, scsi errors, NFS r/o, production down. Anyone get the joke? A path redundancy error after pulling a path? HAHA? At this point, no one in the office actually expects services to be up, so what the hell ever. We replaced the part, checked the cabling, checked the firmware levels, I recreated 2 partition tables, and we're back up today. We also tested pulling cables after the fix, and we got errors (as expected) but no failures, which was a pleasant surprise. Yawn, 1 all-day outage and a few smaller outages this week, all unplanned. Awesome. Maybe it will be stable now? Now that our gear is starting to age and glancing toward end-of-life, maybe the firmware is finally mature. I'll find out over the coming weeks, I guess.
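
The enclosure ID scheme described above (tens column is the loop, ones column is the place in the loop, dial runs 00 to 77) is easy to sanity-check in a few lines. This sketch assumes each digit runs 0 through 7, which is my reading of the "00 to 77" range and not something taken from IBM documentation. The function names are hypothetical.

```python
from collections import Counter


def decode_enclosure_id(enclosure_id):
    """Split a two-digit enclosure ID into (loop, position).
    Assumes each digit must be in 0..7, per the 00-77 dial range."""
    tens, ones = divmod(enclosure_id, 10)
    if not (0 <= tens <= 7 and 0 <= ones <= 7):
        raise ValueError("enclosure ID out of range: %r" % (enclosure_id,))
    return tens, ones


def duplicate_ids(enclosure_ids):
    """Return the IDs that more than one enclosure is set to."""
    counts = Counter(enclosure_ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Five enclosures on loop 0, numbered the way we wanted them (01 through 05).
planned = [1, 2, 3, 4, 5]
for eid in planned:
    loop, position = decode_enclosure_id(eid)
    print("ID %02d -> loop %d, position %d" % (eid, loop, position))
print("duplicates:", duplicate_ids(planned))
```

With a dial whose ones column stops at 2, five enclosures on one loop can't all get distinct IDs, and duplicate_ids would flag the collision straight away.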

Then there is IBM linguistics. Just in my environment, we have 2 programs called Storage Manager, one to manage devices in the SAN, and one for backups, a Tivoli program. As mentioned, they maddeningly rebrand and rename, so it took me a long time to figure out that a FAStT card was just a QLogic HBA. Instead of LUN masking, they say "storage partitioning", which to me is a COMPLETELY meaningless phrase. It could apply to, like, anything in computing, and unless you're down that particular rabbit hole, that wording gives no indication of what it's actually about. IBM loves the vague.



My experience with IBM is that the only way to make it work is to drink the Kool-Aid and try not to breathe too hard. Anyone have history with IBM hardware that confirms or contradicts this experience?





approachmdnight
2008-02-09 02:52 am UTC
I don't have experience with IBM's SAN stuff aside from being contacted by some recruiter to work with it. They were offering big money, so I assumed it would be a pain in the ass. It sounds like I was right.

IBM does seem to love making up their own terms for stuff. I think that dates back to the mainframe days. My mother works for an investment firm that uses a lot of mainframe apps and I never have a clue WTF she's talking about.

My former employer used an EMC SAN for hosting a bunch of VM images and storing databases. It seemed to work well for them. They also used IBM's rack-mount servers without issue, so I have no clue what's up with yours. This was a few years ago, though; their quality might have gone downhill.

Oh, and the practice of rebranding off-the-shelf components to sell at 5x the cost for "enterprise" systems is nothing new. Someone used to sell a 10/100 Ethernet card for SGI systems which was nothing but a 3Com EISA card with modified firmware and a holy shit price tag. Luckily, some kind soul hacked the drivers so you can just use a $10 3Com card from eBay -- if you still care about having Fast Ethernet in your Indigo2, that is.

Rewriting the partition table on the volume was a ballsy move indeed.

Does your employer have a good support contract with IBM? I thought the point of buying this overpriced gear was so you could call them up at 3AM and be like "fix this shit. now." I wouldn't place the blame on anyone but IBM. At the prices they charge, they should do extensive testing to make sure it actually works. :)

This post reminded me why I'd much rather write code than do IT work. At least then your hands aren't (as) tied by incompetent vendors.



(Anonymous)
2008-02-09 05:07 pm UTC
wow. I'm going to hug my netapps next time I see them.


