So, I haven’t blogged in a while. I guess I should make all of the usual excuses about being busy (which is true), etc. But the fact of the matter is that I really haven’t had a whole heck of a lot that I thought would be of interest. Certainly there wasn’t a lot that interested me!
But now, I have something that really gets my juices flowing: the new IBM XIV. I don’t know if you’ve heard about this wonderful new storage platform from the folks at IBM, but I’m starting to bump into a lot of folks that are either looking seriously at one, or have one, or more, on the floor now. It’s got some great pluses:
- It’s dirt cheap. On top of that, I heard that IBM is willing to do whatever it takes on price to get you to buy one of these boxes, to the point that they are practically giving them away. And, as someone I know and love once said “what part of free, isn’t free”?
- Fibre Channel performance from a SATA box. I guess that’s one of the ways that they keep the price so low.
- Tier 1 performance and reliability at a significantly lower price point.
So, that’s the deal, but like with everything in this world, there’s no free lunch. Yes, that’s right, I hate to break it to you folks, but you really can’t get something for nothing. The question to ask yourself is, is the XIV really too good to be true? The answer is yes, it is.
But the title of this blog is pretty harsh, don’t you think? Well, I think that once you understand that the real price you are paying for the “almost free” XIV could be your career, or at least your job, then you might start to understand where I’m coming from. How can that be? Well, I think that in most shops, if you are the person who brought in a storage array that eventually causes a multi-day outage in your most critical systems, your job is going to be in jeopardy. And that’s what could happen to you if you buy into all of the above from IBM regarding the XIV.
What are you talking about, Joerg?!? IBM says that the XIV is “self healing”, and that it can rebuild the lost data on a failed drive in 30 minutes or less. So how can what you said be true? Well folks, here’s the dirty little secret that IBM doesn’t want you to know about the XIV. Due to its architecture, if you ever lose two drives in the entire box (not a shelf, not a RAID group, the whole box, all 180 drives) within 30 minutes of each other, you lose all of the data on the entire array. Yup, that’s right, all your Tier 1 applications are now down, and you will be reloading them from tape. This is a process that could take you quite some time; I’m betting days if not weeks to complete. That’s right, SAP down for a week, Exchange down for 3 days, etc. Again, do you think that if you brought that box in, your career at this company wouldn’t be limited after something like that?
So, IBM will tell you that the likelihood of that happening is very small, almost infinitesimal. And they are right, but it’s not zero, so you are the one taking on that risk. Here’s another thing to keep in mind. Studies done at large data centers have shown that disk drives don’t fail in a completely random way. They actually fail in clusters, so the chances of a second drive failing within the 30 minute window after that first drive failed are actually a lot higher than IBM would like you to believe. But, hey, let’s keep in mind that we play the risk game all the time with RAID protected arrays, right? The big difference here is that the scope of the data loss is so much greater. If I have a double failure in a 4+1 RAID-5 group, I’m going to lose some LUNs, and I’m going to have to reload that data from tape. However, it’s not the entire array! So I’ve had a much smaller impact across my Tier 1 applications, and the recovery from that should be much quicker. With the XIV, all my Tier 1 applications are down, and they all have to be reloaded from tape.
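To put a rough number on "very small, but not zero", here is a back-of-the-envelope sketch in Python. Everything in it is an assumption on my part rather than anything from IBM: the 3% annualized failure rate, the constant (exponential) failure model, and the `clustering_factor` knob, which is just a hypothetical multiplier standing in for the clustered failures mentioned above. Only the 180 drives and the 30 minute window come from the discussion here.

```python
import math

HOURS_PER_YEAR = 24 * 365


def hourly_failure_rate(afr):
    """Convert an annualized failure rate (0..1) into a constant hourly hazard rate."""
    return -math.log(1.0 - afr) / HOURS_PER_YEAR


def p_second_failure(n_drives, afr, window_hours, clustering_factor=1.0):
    """Chance that at least one surviving drive fails inside the rebuild window.

    clustering_factor is a hypothetical multiplier (>1) on the hazard rate,
    a crude stand-in for correlated, clustered drive failures.
    """
    rate = hourly_failure_rate(afr) * clustering_factor
    survivors = n_drives - 1
    return 1.0 - math.exp(-rate * survivors * window_hours)


def expected_double_failures_per_year(n_drives, afr, window_hours, clustering_factor=1.0):
    """Expected first failures per year times the chance each is followed by a second."""
    first_failures_per_year = n_drives * hourly_failure_rate(afr) * HOURS_PER_YEAR
    return first_failures_per_year * p_second_failure(
        n_drives, afr, window_hours, clustering_factor
    )


if __name__ == "__main__":
    ASSUMED_AFR = 0.03  # assumption: 3% of drives fail per year
    for factor in (1.0, 10.0):
        per_event = p_second_failure(180, ASSUMED_AFR, 0.5, factor)
        per_year = expected_double_failures_per_year(180, ASSUMED_AFR, 0.5, factor)
        print(f"clustering x{factor:g}: per failure {per_event:.2e}, per year {per_year:.2e}")
```

With these assumed inputs and no clustering, the chance of a second failure inside the window works out to roughly 3 in 10,000 per drive failure. Keep in mind that this only models the odds. It says nothing about the scope of the loss, which is the part that really worries me.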
Just so you don’t think that I’m entirely negative about the XIV, let me say that what I really object to here is the use of an XIV with Tier 1 applications or even Tier 2 applications. If you want to use one for Tier 3 applications (i.e. archive data) I think that makes a lot of sense. Having your archive down for a week or two won’t have much in the way of a negative impact on your business, unlike having your Tier 1 or Tier 2 applications down. The one exception to that I can think of is VTL. I would never use an XIV as the disks behind a VTL. Can you imagine what would happen if you lost all of the data in your VTL? Let’s hope that you have second copies of the data!
Finally, one of the responses from IBM to all of this is “just replicate the XIV if you’re that worried”. They’re right, but that doubles the cost of storage, right?
And what are the odds of double cache failures on traditional arrays? The only truly risk-free solution is a replicated one.
Have you ever used or implemented XIV? Is your blog a true testimonial based on first-hand experience, or is this just your personal opinion on a technology that you haven't really tried and tested yet? The reason for my queries is that we are currently using XIV in our datacenter, and none of the things you claim are true, to be honest. We are happy users; in fact, my team got fat bonuses when we had XIV in place, and we are currently seeing a 30% improvement in performance compared to our legacy storage. I would like to request documented proof of your claims about this particular technology; I hope you can furnish it for my personal reference.
https://www.ibm.com/developerworks/mydeveloperw…
You do not lose the entire array with a double disk failure.
I am an EMC customer that runs a Symmetrix DMX2-1000 with 144 10K drives, with data wide striped, all RAID-5 3+1, in 1 storage pool. This configuration was validated by EMC and a large VAR 5 years ago. Any one disk might have hypers from 20+ servers. If we lose 2 disks in < 4 hours, we are most likely going to have major data loss, which will warrant failing over the entire array to the identically configured 1 million $ DR Symmetrix array that EMC sold us…
I guess someone I work with should be worried about the RGE (Resume Generating Event) as well.
It's unfortunate that this kind of unsubstantiated, flatly inaccurate and misleading FUD remains posted for all to see to this day. The good news is that the many, many users all over the world who have made the switch to XIV technology are smart enough to do their homework and learn that the storage industry is changing… for the better.
If you use meta devices or LUSE, you are using data from across multiple array groups anyway. This statement, “Studies done at large data centers have shown that disk drives don’t fail in a completely random way”, is also inaccurate. The studies were done on conventional arrays, where several known entities come into play in the second ‘synthetic disk failure’ scenario. The studies were not done on XIV, which is completely different.
We’ve lost entire modules. Not only did we not lose data, but we maintained data redundancy.