Monday, November 16, 2009

How to protect yourself from RAID-related Unrecoverable Read Errors (UREs)

If you are ever rebuilding a RAID system, Unrecoverable Read Error (URE) is one term you don't want to learn about the hard way. As the name implies, a URE makes for a really bad day: it can stop a RAID rebuild in its tracks, essentially making the entire RAID volume unusable.


I won't go into a lot of detail about the "why" behind what causes a URE because many other very smart people have already done a good job of explaining it. (Admittedly, some of the warnings might be sensationalist, and there seems to be some confusion on terminology, but do your own math to see if you might have a serious problem.) What I will do is provide you with tips on how to make sure that you don't fall victim to these errors.

1. Don't use RAID 5 if you plan to use large non-enterprise-grade SATA disks.

When it comes to reliability, enterprise grade disks are generally at least one order of magnitude more reliable than their non-enterprise counterparts. If you believe what has been written about UREs, as the size of disks increases and as more disks are added to RAID 5 arrays, the likelihood of total data loss across the entire RAID volume begins to get into dangerous territory.

If you're trying to build a 14-disk RAID array using 1.5TB or 2TB SATA disks you bought at Best Buy, consider using RAID 6 or RAID 10 instead of RAID 5. RAID 6's dual parity mechanism provides an additional cushion in the event of drive failures (at the cost of performance), while RAID 10 setups can lose up to half the disks before incurring data loss, as long as no two failed disks belong to the same mirrored pair.


If you're really intent on creating a large array and want minimal RAID overhead, you could even consider RAID 50, which is a RAID 0 of RAID 5 arrays. If you were going to use RAID 50 for that 14-disk array, you'd create three four-disk RAID 5 arrays, stripe them together with RAID 0, and keep the remaining two disks as spares.
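To compare what each of these layouts costs in capacity, here is a minimal Python sketch. The function name `usable_disks` and its parameters are made up for illustration, and it only counts disks' worth of data capacity; it ignores controller overhead and formatting.

```python
def usable_disks(level, total_disks, group_size=None, spares=0):
    """Return how many disks' worth of capacity a layout leaves for data.

    group_size applies only to RAID 50 (disks per RAID 5 group).
    spares are hot spares held outside the array.
    """
    active = total_disks - spares
    if level == "raid5":
        return active - 1          # one disk's worth of parity
    if level == "raid6":
        return active - 2          # two disks' worth of parity
    if level == "raid10":
        if active % 2:
            raise ValueError("RAID 10 needs an even number of disks")
        return active // 2         # every disk is mirrored
    if level == "raid50":
        if not group_size or active % group_size:
            raise ValueError("disks must divide evenly into RAID 5 groups")
        groups = active // group_size
        return groups * (group_size - 1)  # one parity disk per group
    raise ValueError(f"unknown RAID level: {level}")


if __name__ == "__main__":
    DISK_TB = 2  # assumed disk size, matching the 2TB example above
    layouts = [
        ("raid5", {}),
        ("raid6", {}),
        ("raid10", {}),
        ("raid50", {"group_size": 4, "spares": 2}),
    ]
    for level, opts in layouts:
        data_disks = usable_disks(level, 14, **opts)
        print(f"{level}: {data_disks} data disks, {data_disks * DISK_TB}TB usable")
```

For the 14-disk example, RAID 5 leaves 13 disks for data, RAID 6 leaves 12, the RAID 50 layout described above leaves 9, and RAID 10 leaves 7. The extra protection is paid for in capacity.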

2. Don't use cheap hardware if you need to push limits.

Yes, budgets are tight, but don't risk your data. Buy enterprise-grade disks such as SAS or Fibre Channel disks, or at the very least high-grade SATA disks with a higher mean time between failures (MTBF) rating.


If you're using disks that have a bit error rate of 1 in 10^14, you're using a cheap disk. That rating means one unrecoverable error is expected for roughly every 12.5TB read. Enterprise-class disks will have a bit error rate of 1 in 10^15 or, better yet, 1 in 10^16, which makes this class of disk much less susceptible to UREs.
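If you want to do your own math, the following Python sketch estimates the chance of hitting at least one URE while reading a given amount of data. It uses a deliberately simple model: every bit read fails independently at exactly the rated error rate. Real drives may behave better or worse than their spec sheet, so treat the numbers as a rough guide. The function name `ure_probability` is invented for this example.

```python
import math


def ure_probability(bytes_read, bits_per_error):
    """Chance of at least one URE while reading bytes_read bytes.

    Assumes independent bit errors at a rate of 1 in bits_per_error,
    which is a simplification of how real disks fail.
    """
    bits_read = bytes_read * 8
    # 1 - (1 - 1/N)^bits, computed with log1p/expm1 to avoid rounding loss
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_error))


if __name__ == "__main__":
    # Rebuilding a 14-disk RAID 5 of 2TB disks means reading the
    # 13 surviving disks in full.
    bytes_to_read = 13 * 2 * 10**12
    for rate in (10**14, 10**15, 10**16):
        p = ure_probability(bytes_to_read, rate)
        print(f"1 in {rate:.0e}: {p:.1%} chance of a URE during rebuild")
```

Under these assumptions, the 14-disk array of 2TB disks comes out at roughly an 88% chance of a URE during a rebuild at 1 in 10^14, about 19% at 1 in 10^15, and about 2% at 1 in 10^16.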

Note: I'm not necessarily saying that you should buy SAS disks rather than SATA disks, but do look for disks with a reasonable bit error rate.

3. If you need a lot of disks for spindle performance, use smaller capacities.

There are times when quantity outweighs capacity. For example, when your disk system is constrained by IOPS, throwing more disk spindles at the solution can fix the problem, so in some cases, you might want to have a bunch of disks in a single RAID array. If you do, use smaller capacity disks.


One of the main concerns around the possibility of encountering a URE lies in the sheer amount of time it takes to rebuild after the loss of a massive disk, such as a 1TB or 2TB disk. When disk sizes were smaller, the window of opportunity for data loss was a lot smaller since smaller disks rebuild more quickly. As disk sizes continue to grow, this window of opportunity for data loss grows, and the problem will become more serious unless manufacturers can manage to produce drives with lower bit error rates.
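To get a feel for how the rebuild window grows with disk size, here is a small Python sketch. The sustained rebuild rate of 100MB/s is an assumption chosen for illustration, not a measured figure, and the function name `rebuild_hours` is hypothetical. The result is a best case: an array that is also serving production I/O will take longer.

```python
def rebuild_hours(disk_tb, mb_per_second):
    """Best-case hours to rewrite one disk of disk_tb terabytes.

    mb_per_second is the assumed sustained rebuild rate.
    """
    disk_bytes = disk_tb * 1e12
    seconds = disk_bytes / (mb_per_second * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    ASSUMED_RATE_MB_S = 100  # illustrative assumption only
    for size_tb in (0.25, 0.5, 1, 2):
        hours = rebuild_hours(size_tb, ASSUMED_RATE_MB_S)
        print(f"{size_tb}TB disk: at least {hours:.1f} hours to rebuild")
```

At a fixed rebuild rate, the window scales linearly with capacity, so a 2TB disk stays exposed twice as long as a 1TB disk.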

4. Backups, backups, backups

Someone actually asked me once why we still do backups when we use RAID on all of our servers. No matter how reliable disks get, and no matter how far away you are from even the potential of a URE, nothing replaces reliable backups. If you ever run into a URE, you'll need reliable backups.

5. Be patient

Storage needs constantly increase, and RAID 6 is simply not the answer for everyone who becomes uncomfortable with RAID 5. We have a market need and someone, somewhere will come along and fill that need with a new way to provide robust redundancy. We've also seen newer data protection methods in use in products such as Windows Home Server and Drobo's line of devices, and it's only a matter of time before similar methods - or completely new methods - become available for enterprise gear.


REFERENCES

How to protect yourself from RAID-related Unrecoverable Read Errors (UREs), by Scott Lowe, November 16th, 2009, www.Techrepublic.com



