
Contents:
- An Introduction to RAID
- The Need for RAID
- Data Striping & Redundancy
- Different Types of RAID
- Cost & Performance Issues
- Reliability Issues in RAID
- Implementations
- Problems

RAID (redundant array of independent disks, originally redundant array of inexpensive disks[1][2]) is a storage technology that combines multiple disk drive components into a logical unit. Data is distributed across the drives in one of several ways called "RAID levels", depending on what level of redundancy and performance (via parallel communication) is required. RAID is an example of storage virtualization and was first defined by David Patterson, Garth A. Gibson, and Randy Katz at the University of California, Berkeley in 1987.[3] Marketers representing industry RAID manufacturers later attempted to reinvent the term to describe a redundant array of independent disks as a means of dissociating a low-cost expectation from RAID technology.[4] RAID is now used as an umbrella term for computer data storage schemes that can divide and replicate data among multiple physical drives. The physical drives are said to be in a RAID array,[5] which is accessed by the operating system as a single drive. The different schemes or architectures are named by the word RAID followed by a number (e.g., RAID 0, RAID 1). Each scheme provides a different balance between two key goals: increasing data reliability and increasing input/output performance.

There are a number of different RAID levels:

- Level 0 -- Striped Disk Array without Fault Tolerance: Provides data striping (spreading out blocks of each file across multiple disk drives) but no redundancy. This improves performance but does not deliver fault tolerance. If one drive fails, all data in the array is lost.
- Level 1 -- Mirroring and Duplexing: Provides disk mirroring. Level 1 provides twice the read transaction rate of single disks and the same write transaction rate as single disks.
- Level 2 -- Error-Correcting Coding: Not a typical implementation and rarely used; Level 2 stripes data at the bit level rather than the block level.
- Level 3 -- Bit-Interleaved Parity: Provides byte-level striping with a dedicated parity disk. Level 3, which cannot service simultaneous multiple requests, is also rarely used.
- Level 4 -- Dedicated Parity Drive: Provides block-level striping (like Level 0) with a dedicated parity disk. If a data disk fails, the parity data is used to reconstruct its contents onto a replacement disk. A disadvantage of Level 4 is that the parity disk can become a write bottleneck.
- Level 5 -- Block-Interleaved Distributed Parity: Provides block-level data striping with parity information distributed across all disks. This results in excellent performance and good fault tolerance. Level 5 is one of the most popular implementations of RAID.
- Level 6 -- Independent Data Disks with Double Parity: Provides block-level striping with two sets of parity data distributed across all disks, allowing the array to survive two simultaneous drive failures.
- Level 0+1 -- A Mirror of Stripes: Not one of the original RAID levels; two RAID 0 stripes are created, and a RAID 1 mirror is created over them. Used for both replicating and sharing data among disks.
- Level 10 -- A Stripe of Mirrors: Not one of the original RAID levels; multiple RAID 1 mirrors are created, and a RAID 0 stripe is created over these.
- Level 7: A trademark of Storage Computer Corporation that adds caching to Levels 3 or 4.
- RAID S (also called Parity RAID): EMC Corporation's proprietary striped parity RAID system used in its Symmetrix storage systems.

New RAID classification


In 1996, the RAID Advisory Board introduced an improved classification of RAID systems. It divides RAID into three types:

- Failure-resistant: systems that protect against loss of data due to drive failure.
- Failure-tolerant: systems that protect against loss of data access due to failure of any single component.
- Disaster-tolerant: systems that consist of two or more independent zones, either of which provides access to stored data.

The original "Berkeley" RAID classifications are still kept as an important historical reference point, and also to recognize that RAID levels 0 through 6 successfully define all known data mapping and protection schemes for disk-based storage systems. Unfortunately, the original classification caused some confusion due to the assumption that higher RAID levels imply higher redundancy and performance; this confusion has been exploited by RAID system manufacturers, and it has given birth to products with such names as RAID-7, RAID-10, RAID-30, RAID-S, etc. Consequently, the new classification describes the data availability characteristics of a RAID system, leaving the details of its implementation to system manufacturers.

Failure-resistant disk systems (FRDS) (meets a minimum of criteria 1-6)
1. Protection against data loss and loss of access to data due to drive failure
2. Reconstruction of failed drive content to a replacement drive
3. Protection against data loss due to a "write hole"
4. Protection against data loss due to host and host I/O bus failure
5. Protection against data loss due to replaceable unit failure
6. Replaceable unit monitoring and failure indication

Failure-tolerant disk systems (FTDS) (meets a minimum of criteria 1-15)
7. Disk automatic swap and hot swap
8. Protection against data loss due to cache failure
9. Protection against data loss due to external power failure
10. Protection against data loss due to a temperature out of operating range
11. Replaceable unit and environmental failure warning
12. Protection against loss of access to data due to device channel failure
13. Protection against loss of access to data due to controller module failure
14. Protection against loss of access to data due to cache failure
15. Protection against loss of access to data due to power supply failure

Disaster-tolerant disk systems (DTDS) (meets a minimum of criteria 1-21)
16. Protection against loss of access to data due to host and host I/O bus failure
17. Protection against loss of access to data due to external power failure
18. Protection against loss of access to data due to component replacement
19. Protection against loss of data and loss of access to data due to multiple drive failures
20. Protection against loss of access to data due to zone failure
21. Long-distance protection against loss of data due to zone failure

NEED
The need for RAID can be summarized in two points given below. The two keywords are Redundant and Array.

- An array of multiple disks accessed in parallel will give greater throughput than a single disk.
- Redundant data on multiple disks provides fault tolerance.

Provided that the RAID hardware and software perform true parallel accesses on multiple drives, there will be a performance improvement over a single disk. With a single hard disk, you cannot protect yourself against the costs of a disk failure: the time required to obtain and install a replacement disk, reinstall the operating system, restore files from backup tapes, and repeat all the data entry performed since the last backup was made. With multiple disks and a suitable redundancy scheme, your system can stay up and running when a disk fails, and even while the replacement disk is being installed and its data restored.

To create an optimal cost-effective RAID configuration, we need to simultaneously achieve the following goals:

- Maximize the number of disks being accessed in parallel.
- Minimize the amount of disk space being used for redundant data.
- Minimize the overhead required to achieve the above goals.

Basic RAID Organizations

There are many types of RAID and some of the important ones are introduced below:

Non-Redundant (RAID Level 0)


A non-redundant disk array, or RAID level 0, has the lowest cost of any RAID organization because it does not employ redundancy at all. This scheme offers the best write performance since it never needs to update redundant information. Surprisingly, it does not have the best read performance. Redundancy schemes that duplicate data, such as mirroring, can perform better on reads by selectively scheduling requests on the disk with the shortest expected seek and rotational delays. Without redundancy, any single disk failure will result in data loss. Non-redundant disk arrays are widely used in supercomputing environments where performance and capacity, rather than reliability, are the primary concerns. Sequential blocks of data are written across multiple disks in stripes, as follows:

source: Reference 2

The size of a data block, which is known as the "stripe width" (also called the striping unit), varies with the implementation, but is always at least as large as a disk's sector size. When it comes time to read back this sequential data, all disks can be read in parallel. In a multi-tasking operating system, there is a high probability that even non-sequential disk accesses will keep all of the disks working in parallel.
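To make the striping idea concrete, here is a minimal sketch (not part of the original tutorial) of how a RAID 0 implementation might map a logical block number to a member disk and an offset on that disk; the disk count and stripe-unit size are arbitrary example values.

```python
# Minimal RAID 0 address-mapping sketch: stripe units are laid out
# round-robin across the member disks.
NUM_DISKS = 4            # example values only
BLOCKS_PER_STRIPE_UNIT = 1

def raid0_location(logical_block):
    stripe_unit, block_in_unit = divmod(logical_block, BLOCKS_PER_STRIPE_UNIT)
    disk = stripe_unit % NUM_DISKS        # which drive holds this stripe unit
    offset = stripe_unit // NUM_DISKS     # stripe-unit offset on that drive
    return disk, offset, block_in_unit

for block in range(8):
    disk, offset, _ = raid0_location(block)
    print(f"logical block {block} -> disk {disk}, stripe unit {offset}")
```

Reading a long sequential run therefore touches every disk, which is why all members can work in parallel on large requests.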

Mirrored (RAID Level 1)


The traditional solution, called mirroring or shadowing, uses twice as many disks as a non-redundant disk array. Whenever data is written to a disk, the same data is also written to a redundant disk, so that there are always two copies of the information. When data is read, it can be retrieved from the disk with the shorter queuing, seek and rotational delays. If a disk fails, the other copy is used to service requests. Mirroring is frequently used in database applications where availability and transaction time are more important than storage efficiency.

source: Reference 2

Memory-Style ECC (RAID Level 2)
Memory systems have provided recovery from failed components with much less cost than mirroring by using Hamming codes. Hamming codes contain parity for distinct overlapping subsets of components. In one version of this scheme, four disks require three redundant disks, one less than mirroring. Since the number of redundant disks is proportional to the log of the total number of disks in the system, storage efficiency increases as the number of data disks increases. If a single component fails, several of the parity components will have inconsistent values, and the failed component is the one held in common by each incorrect subset. The lost information is recovered by reading the other components in a subset, including the parity component, and setting the missing bit to 0 or 1 to create the proper parity value for that subset. Thus, multiple redundant disks are needed to identify the failed disk, but only one is needed to recover the lost information. If you are unaware of parity, you can think of the redundant disk as having the sum of all data in the other disks. When a disk fails, you can subtract all the data on the good disks from the parity disk; the remaining information must be the missing information. Parity is simply this sum modulo 2.

A RAID 2 system would normally have as many data disks as the word size of the computer, typically 32. In addition, RAID 2 requires the use of extra disks to store an error-correcting code for redundancy. With 32 data disks, a RAID 2 system would require 7 additional disks for a Hamming-code ECC. Such an array of 39 disks was the subject of a U.S. patent granted to Unisys Corporation in 1988, but no commercial product was ever released. For a number of reasons, including the fact that modern disk drives contain their own internal ECC, RAID 2 is not a practical disk array scheme.

source: Reference 2

Bit-Interleaved Parity (RAID Level 3)


One can improve upon memory-style ECC disk arrays by noting that, unlike memory component failures, disk controllers can easily identify which disk has failed. Thus, one can use a single parity disk rather than a set of parity disks to recover lost information. In a bit-interleaved parity disk array, data is conceptually interleaved bit-wise over the data disks, and a single parity disk is added to tolerate any single disk failure. Each read request accesses all data disks and each write request accesses all data disks and the parity disk. Thus, only one request can be serviced at a time. Because the parity disk contains only parity and no data, the parity disk cannot participate in reads, resulting in slightly lower read performance than for redundancy schemes that distribute the parity and data over all disks. Bit-interleaved parity disk arrays are frequently used in applications that require high bandwidth but not high I/O rates. They are also simpler to implement than RAID levels 4, 5, and 6. Here, the parity disk is written in the same way as the parity bit in normal Random Access Memory (RAM), where it is the Exclusive Or of the 8, 16 or 32 data bits. In RAM, parity is used to detect single-bit data errors, but it cannot correct them because there is no information available to determine which bit is incorrect. With disk drives, however, we rely on the disk controller to report a data read error. Knowing which disk's data is missing, we can reconstruct it as the Exclusive Or (XOR) of all remaining data disks plus the parity disk.

source: Reference 2

As a simple example, suppose we have 4 data disks and one parity disk. The sample bits are:

Disk 0   Disk 1   Disk 2   Disk 3   Parity
  0        1        1        1        1

The parity bit is the XOR of these four data bits, which can be calculated by adding them up and writing a 0 if the sum is even and a 1 if it is odd. Here the sum of Disk 0 through Disk 3 is "3", so the parity is 1. Now if we attempt to read back this data, and find that Disk 2 gives a read error, we can reconstruct Disk 2 as the XOR of all the other disks, including the parity. In the example, the sum of Disk 0, 1, 3 and Parity is "3", so the data on Disk 2 must be 1.
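The same arithmetic can be written directly as XORs. The short sketch below (an illustration added here, not part of the original tutorial) reproduces the worked example: it computes the parity of the four sample bits and then reconstructs the failed Disk 2 from the survivors plus the parity.

```python
# Parity example from the text: four data bits and one parity bit.
data = {"Disk 0": 0, "Disk 1": 1, "Disk 2": 1, "Disk 3": 1}

parity = 0
for bit in data.values():
    parity ^= bit                      # XOR of all data bits -> 1 here

# Simulate losing Disk 2 and rebuilding it from the surviving disks plus parity.
lost = "Disk 2"
recovered = parity
for disk, bit in data.items():
    if disk != lost:
        recovered ^= bit               # XOR of the remaining disks and the parity

print(f"parity = {parity}, recovered {lost} = {recovered}")   # parity = 1, recovered = 1
```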

Block-Interleaved Parity (RAID Level 4)


The block-interleaved parity disk array is similar to the bit-interleaved parity disk array except that data is interleaved across disks in blocks of arbitrary size rather than in bits. The size of these blocks is called the striping unit. Read requests smaller than the striping unit access only a single data disk. Write requests must update the requested data blocks and must also compute and update the parity block. For large writes that touch blocks on all disks, parity is easily computed by exclusive-or'ing the new data for each disk. For small write requests that update only one data disk, parity is computed by noting how the new data differs from the old data and applying those differences to the parity block. Small write requests thus require four disk I/Os: one to write the new data, two to read the old data and old parity for computing the new parity, and one to write the new parity. This is referred to as a read-modify-write procedure. Because a block-interleaved parity disk array has only one parity disk, which must be updated on all write operations, the parity disk can easily become a bottleneck. Because of this limitation, the block-interleaved distributed-parity disk array is universally preferred over the block-interleaved parity disk array.

source: Reference 2
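The small-write parity update described above reduces to a single XOR identity. The toy sketch below (added for illustration, using small integers as stand-ins for whole blocks) shows that identity together with the four I/Os involved.

```python
# Small-write parity update for a block-interleaved parity array:
#   new_parity = old_parity XOR old_data XOR new_data
# Four disk I/Os: read old data, read old parity, write new data, write new parity.

def small_write(old_data, old_parity, new_data):
    new_parity = old_parity ^ old_data ^ new_data
    return new_data, new_parity

old_data, old_parity = 0b1010, 0b0110      # toy block contents
new_data = 0b1111
print(small_write(old_data, old_parity, new_data))   # prints (15, 3): new data block and recomputed parity
```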

Block-Interleaved Distributed-Parity (RAID Level 5)

The block-interleaved distributed-parity disk array eliminates the parity disk bottleneck present in the block-interleaved parity disk array by distributing the parity uniformly over all of the disks. An additional, frequently overlooked advantage to distributing the parity is that it also distributes data over all of the disks rather than over all but one. This allows all disks to participate in servicing read operations, in contrast to redundancy schemes with dedicated parity disks, in which the parity disk cannot participate in servicing read requests. Block-interleaved distributed-parity disk arrays have the best small read, large read, and large write performance of any redundant disk array. Small write requests are somewhat inefficient compared with redundancy schemes such as mirroring, however, due to the need to perform read-modify-write operations to update parity. This is the major performance weakness of RAID level 5 disk arrays. The exact method used to distribute parity in block-interleaved distributed-parity disk arrays can affect performance. The following figure illustrates left-symmetric parity distribution.

Each square corresponds to a stripe unit. Each column of squares corresponds to a disk. P0 computes the parity over stripe units 0, 1, 2 and 3; P1 computes parity over stripe units 4, 5, 6, and 7; and so on. (source: Reference 1) A useful property of the left-symmetric parity distribution is that whenever you traverse the striping units sequentially, you will access each disk once before accessing any disk twice. This property reduces disk conflicts when servicing large requests.

source: Reference 2
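The left-symmetric placement can be written down as a small mapping rule. The sketch below is an illustration added here (a five-disk array chosen to match the figure described above); it computes which disk holds the parity for each stripe and which disks hold that stripe's data units.

```python
# Left-symmetric parity placement for a RAID 5 array with NUM_DISKS drives.
NUM_DISKS = 5     # example size matching the figure

def parity_disk(stripe):
    # Parity rotates backwards one disk per stripe: stripe 0 -> last disk, stripe 1 -> next-to-last, ...
    return (NUM_DISKS - 1 - stripe) % NUM_DISKS

def data_disk(stripe_unit):
    # Data units of a stripe start on the disk just after the parity disk and wrap around,
    # so walking stripe units sequentially touches every disk once before any disk twice.
    stripe, pos = divmod(stripe_unit, NUM_DISKS - 1)
    return (parity_disk(stripe) + 1 + pos) % NUM_DISKS

for stripe in range(NUM_DISKS):
    data = [data_disk(stripe * (NUM_DISKS - 1) + pos) for pos in range(NUM_DISKS - 1)]
    print(f"stripe {stripe}: parity on disk {parity_disk(stripe)}, data units on disks {data}")
```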

P+Q redundancy (RAID Level 6)


Parity is a redundancy code capable of correcting any single, self-identifying failure. As larger disk arrays are considered, multiple failures are possible and stronger codes are needed. Moreover, when a disk fails in a parity-protected disk array, recovering the contents of the failed disk requires successfully reading the contents of all non-failed disks. The probability of encountering an uncorrectable read error during recovery can be significant. Thus, applications with more stringent reliability requirements require stronger error-correcting codes. One such scheme, called P+Q redundancy, uses Reed-Solomon codes to protect against up to two disk failures using the bare minimum of two redundant disks. P+Q redundant disk arrays are structurally very similar to block-interleaved distributed-parity disk arrays and operate in much the same manner. In particular, P+Q redundant disk arrays also perform small write operations using a read-modify-write procedure, except that instead of four disk accesses per write request, P+Q redundant disk arrays require six disk accesses due to the need to update both the 'P' and 'Q' information.
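The text does not spell out the arithmetic behind 'Q', and real implementations differ. As one illustration only, the sketch below computes P as plain XOR and Q as a Galois-field weighted sum over GF(2^8) with generator 2 and reduction polynomial 0x11D, which is one common convention (used, for example, by Linux md). It is a toy model operating on single bytes, not a production erasure coder.

```python
# Illustrative P+Q syndrome computation over GF(2^8), polynomial 0x11D, generator 2.
# P is the plain XOR parity; Q weights each data byte by a distinct power of the
# generator, which is what lets any two erasures be solved for.

def gf_mul(a, b):
    # Carry-less "Russian peasant" multiplication in GF(2^8).
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D      # reduce modulo x^8 + x^4 + x^3 + x^2 + 1
    return p

def pq_syndromes(data_bytes):
    P = Q = 0
    g_i = 1                         # generator^i, starting at g^0 = 1
    for d in data_bytes:
        P ^= d
        Q ^= gf_mul(g_i, d)
        g_i = gf_mul(g_i, 2)
    return P, Q

print(pq_syndromes([0x11, 0x22, 0x33, 0x44]))   # toy 4-disk stripe, one byte per disk
```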

Striped Mirrors (RAID Level 10)


RAID 10 was not mentioned in the original 1988 article that defined RAID 1 through RAID 5. The term is now used to mean the combination of RAID 0 (striping) and RAID 1 (mirroring). Disks are mirrored in pairs for redundancy and improved performance, then data is striped across multiple disks for maximum performance. In the diagram below, Disks 0 & 2 and Disks 1 & 3 are mirrored pairs. Obviously, RAID 10 uses more disk space to provide redundant data than RAID 5. However, it also provides a performance advantage by reading from all disks in parallel while eliminating the write penalty of RAID 5. In addition, RAID 10 gives better performance than RAID 5 while a failed drive remains unreplaced. Under RAID 5, each attempted read of the failed drive can be performed only by reading all of the other disks. On RAID 10, a failed disk can be recovered by a single read of its mirrored pair.

source: Reference 2


RAID Systems Need Tape Backups


It is worth remembering an important point about RAID systems. Even when you use a redundancy scheme like mirroring or RAID 5 or RAID 10, you must still do regular tape backups of your system. There are several reasons for insisting on this, among them:

- RAID does not protect you from multiple disk failures.
- While one disk is off line for any reason, your disk array is not fully redundant.
- Regular tape backups allow you to recover from data loss that is not related to a disk failure. This includes human errors, hardware errors, and software errors.



There are three important considerations when selecting the RAID level to be used for a system, namely cost, performance and reliability. There are many different ways to measure these parameters; for example, performance could be measured as I/Os per second per dollar, bytes per second, or response time. We could also compare systems at the same cost, the same total user capacity, the same performance or the same reliability. The method used largely depends on the application and the reason for the comparison. For example, in transaction processing applications the primary basis for comparison would be I/Os per second per dollar, while in scientific applications we would be more interested in bytes per second per dollar. In some heterogeneous systems, like file servers, both I/Os per second and bytes per second may be important. Sometimes it is important to consider reliability as the basis for comparison. Taking a closer look at the RAID levels, we observe that most of the levels are similar to each other. RAID level 1 and RAID level 3 disk arrays can be viewed as subclasses of RAID level 5 disk arrays, and RAID level 2 and RAID level 4 disk arrays are generally found to be inferior to RAID level 5 disk arrays. Hence the problem of selecting among RAID levels 1 through 5 is a subset of the more general problem of choosing an appropriate parity group size and striping unit for RAID level 5 disk arrays.

Some Comparisons
Given below is a table that compares the throughput of various redundancy schemes for four types of I/O requests. The I/O requests are basically reads and writes, which are divided into small and large ones. Remembering the fact that our data has been spread over multiple disks (data striping), a small request refers to an I/O request of one striping unit, while a large I/O request refers to a request of one full stripe (one stripe unit from each disk in an error-correcting group).

RAID Type      Small Read     Small Write       Large Read    Large Write    Storage Efficiency
RAID Level 0   1              1                 1             1              1
RAID Level 1   1              1/2               1             1/2            1/2
RAID Level 3   1/G            1/G               (G-1)/G       (G-1)/G        (G-1)/G
RAID Level 5   1              max(1/G, 1/4)     1             (G-1)/G        (G-1)/G
RAID Level 6   1              max(1/G, 1/6)     1             (G-2)/G        (G-2)/G

G : The number of disks in an error correction group.

The table above tabulates the maximum throughput per dollar relative to RAID level 0 for RAID levels 0, 1, 3, 5 and 6. For practical purposes we consider RAID levels 2 and 4 inferior to RAID level 5 disk arrays, so they are not shown in the comparison. The cost of a system is directly proportional to the number of disks it uses in the disk array. Thus the table shows us that, given equivalent-cost RAID level 0 and RAID level 1 systems, the RAID level 1 system can sustain half the number of small writes per second that the RAID level 0 system can sustain. Equivalently, the cost of small writes is twice as expensive in a RAID level 1 system as in a RAID level 0 system. The table also shows the storage efficiency of each RAID level. The storage efficiency is approximately the inverse of the cost of each unit of user capacity relative to a RAID level 0 system. The storage efficiency is equal to the performance/cost metric for large writes.
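The table's entries are simple closed-form expressions in G. The sketch below (a restatement of the table above, not additional data) evaluates them for a chosen parity group size.

```python
# Throughput per dollar relative to RAID level 0 (values from the table above).
# G is the number of disks in an error-correction group.

def relative_performance(G):
    return {
        "RAID 0": {"small read": 1,     "small write": 1,
                   "large read": 1,     "large write": 1,
                   "storage efficiency": 1},
        "RAID 1": {"small read": 1,     "small write": 1 / 2,
                   "large read": 1,     "large write": 1 / 2,
                   "storage efficiency": 1 / 2},
        "RAID 3": {"small read": 1 / G, "small write": 1 / G,
                   "large read": (G - 1) / G, "large write": (G - 1) / G,
                   "storage efficiency": (G - 1) / G},
        "RAID 5": {"small read": 1,     "small write": max(1 / G, 1 / 4),
                   "large read": 1,     "large write": (G - 1) / G,
                   "storage efficiency": (G - 1) / G},
        "RAID 6": {"small read": 1,     "small write": max(1 / G, 1 / 6),
                   "large read": 1,     "large write": (G - 2) / G,
                   "storage efficiency": (G - 2) / G},
    }

for level, metrics in relative_performance(G=5).items():
    print(level, {k: round(v, 2) for k, v in metrics.items()})
```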

source: Reference 1

The figures above graph the performance/cost metrics from the table above for RAID levels 1, 3, 5 and 6 over a range of parity group sizes. The performance/cost of RAID level 1 systems is equivalent to the performance/cost of RAID level 5 systems when the parity group size is equal to 2. The performance/cost of RAID level 3 systems is always less than or equal to the performance/cost of RAID level 5 systems. This is expected, given that a RAID level 3 system is a subclass of RAID level 5 systems derived by restricting the striping unit size such that all requests access exactly a parity stripe of data. Since the configuration of RAID level 5 systems is not subject to such a restriction, the performance/cost of RAID level 5 systems can never be less than that of an equivalent RAID level 3 system. Of course, such generalizations are specific to the models of disk arrays used in the above experiments. In reality, a specific implementation of a RAID level 3 system can have better performance/cost than a specific implementation of a RAID level 5 system. The question of which RAID level to use is better expressed as more general configuration questions concerning the size of the parity group and striping unit. For a parity group size of 2, mirroring is desirable, while for very small striping units RAID level 3 would be well suited. The figure below plots the performance/cost metrics from the table above for RAID levels 3, 5 and 6.

source: Reference 1

Reliability
Reliability of any I/O system has become as important as its performance and cost. This part of the tutorial:

- Reviews the basic reliability provided by a block-interleaved parity disk array
- Lists and discusses three factors that can determine the potential reliability of disk arrays

Redundancy in disk arrays is motivated by the need to fight disk failures. Two key factors, MTTF (Mean Time To Failure) and MTTR (Mean Time To Repair), are of primary concern in estimating the reliability of any disk. Following are some formulae for the mean time between failures:

RAID level 5:

    MTTF(disk)^2
    ----------------------
    N * (G-1) * MTTR(disk)

Disk array with two redundant disks per parity group (e.g. P+Q redundancy):

    MTTF(disk)^3
    --------------------------------
    N * (G-1) * (G-2) * MTTR(disk)^2

N - total number of disks in the system
G - number of disks in the parity group

Factors affecting Reliability

Three factors that can dramatically affect the reliability of disk arrays are:

- System crashes
- Uncorrectable bit-errors
- Correlated disk failures

System Crashes
A system crash refers to any event, such as a power failure, operator error, hardware breakdown, or software crash, that can interrupt an I/O operation to a disk array. Such crashes can interrupt write operations, resulting in states where the data is updated but the parity is not, or vice versa. In either case, the parity is inconsistent and cannot be used in the event of a disk failure. Techniques such as redundant hardware and power supplies can be applied to make such crashes less frequent. System crashes can cause parity inconsistencies in both bit-interleaved and block-interleaved disk arrays, but the problem is of practical concern only in block-interleaved disk arrays. For reliability purposes, system crashes in block-interleaved disk arrays are similar to disk failures in that they may result in the loss of the correct parity for stripes that were modified during the crash.

Uncorrectable bit-errors
Most uncorrectable bit-errors are generated because data is incorrectly written or gradually damaged as the magnetic media ages. These errors are detected only when we attempt to read the data. Our interpretation of uncorrectable bit error rates is that they represent the rate at which errors are detected during reads from the disk during the normal operation of the disk drive. One approach that can be used with or without redundancy is to try to protect against bit errors by predicting when a disk is about to fail. VAXsimPLUS, a product from DEC, monitors the warnings issued by disks and notifies an operator when it feels the disk is about to fail.

Correlated disk failures


Causes: common environmental and manufacturing factors. For example, an accident might sharply increase the failure rate for all disks in a disk array for a short period of time. In general, power surges, power failures and simply switching the disks on and off can place stress on the electrical components of all affected disks. Disks also share common support hardware; when this hardware fails, it can lead to multiple, simultaneous disk failures. Disks are generally more likely to fail either very early or very late in their lifetimes. Early failures are frequently caused by transient defects which may not have been detected during the manufacturer's burn-in process. Late failures occur when a disk wears out. Correlated disk failures greatly reduce the reliability of disk arrays by making it much more likely that an initial disk failure will be closely followed by additional disk failures before the failed disk can be reconstructed.

Mean-Time-To-Data-Loss (MTTDL)
Following are some formulae to calculate the mean time to data loss (MTTDL). In a block-interleaved parity-protected disk array, data loss is possible through the following three common ways:

- double disk failures
- system crash followed by a disk failure
- disk failure followed by an uncorrectable bit error during reconstruction

The above three failure modes are the hardest failure combinations, in that we currently don't have any techniques to protect against them without sacrificing performance.

RAID Level 5

    Double disk failure:
        MTTF(disk)^2
        ----------------------
        N * (G-1) * MTTR(disk)

    System crash + disk failure:
        MTTF(system) * MTTF(disk)
        -------------------------
        N * MTTR(system)

    Disk failure + bit error:
        MTTF(disk)
        ------------------------
        N * (1 - p(disk)^(G-1))

    Software RAID: harmonic sum of the above
    Hardware RAID: harmonic sum of the above, excluding system crash + disk failure

Failure Characteristics for RAID Level 5 Disk Arrays (source: Reference 1)

P+Q disk Array

    Triple disk failure:
        MTTF(disk)^3
        --------------------------------
        N * (G-1) * (G-2) * MTTR(disk)^2

    System crash + disk failure:
        MTTF(system) * MTTF(disk)
        -------------------------
        N * MTTR(system)

    Double disk failure + bit error:
        MTTF(disk)^2
        ---------------------------------------------
        N * (G-1) * (1 - p(disk)^(G-2)) * MTTR(disk)

    Software RAID: harmonic sum of the above
    Hardware RAID: harmonic sum of the above, excluding system crash + disk failure

Failure characteristics for a P+Q disk array (source: Reference 1)

p(disk) = The probability of reading all sectors on a disk (derived from disk size, sector size, and BER)
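As a rough illustration (not from the original tutorial), the sketch below plugs arbitrary example numbers into the RAID level 5 failure modes listed above and combines them with a harmonic sum; p(disk) is derived from an assumed drive capacity and bit error rate.

```python
# Rough MTTDL estimate for a RAID level 5 array using the failure modes above.
# All parameter values are arbitrary examples, not measurements.

HOURS_MTTF_DISK   = 1_000_000      # MTTF of one drive, hours
HOURS_MTTR_DISK   = 24             # time to replace and rebuild a drive
HOURS_MTTF_SYSTEM = 10_000         # mean time between system crashes
HOURS_MTTR_SYSTEM = 1
N, G = 10, 5                       # total disks, disks per parity group
DISK_BITS = 500e9 * 8              # a 500 GB drive
BER = 1e-14                        # unrecoverable bit error rate

p_disk = (1 - BER) ** DISK_BITS    # probability of reading an entire drive without error

double_disk         = HOURS_MTTF_DISK ** 2 / (N * (G - 1) * HOURS_MTTR_DISK)
crash_plus_disk     = HOURS_MTTF_SYSTEM * HOURS_MTTF_DISK / (N * HOURS_MTTR_SYSTEM)
disk_plus_bit_error = HOURS_MTTF_DISK / (N * (1 - p_disk ** (G - 1)))

def harmonic_sum(terms):
    # Combined MTTDL when independent failure modes compete.
    return 1 / sum(1 / t for t in terms)

software_raid = harmonic_sum([double_disk, crash_plus_disk, disk_plus_bit_error])
hardware_raid = harmonic_sum([double_disk, disk_plus_bit_error])   # NVRAM masks the crash mode
print(f"software RAID MTTDL ~ {software_raid:,.0f} h, hardware RAID MTTDL ~ {hardware_raid:,.0f} h")
```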


Implementations
The distribution of data across multiple drives can be managed either by dedicated computer hardware or by software. A software solution may be part of the operating system, or it may be part of the firmware and drivers supplied with a hardware RAID controller.

Software-based RAID


Software RAID implementations are now provided by many operating systems. Software RAID can be implemented as:

- a layer that abstracts multiple devices, thereby providing a single virtual device (e.g. Linux's md)
- a more generic logical volume manager (provided with most server-class operating systems, e.g. Veritas or LVM)
- a component of the file system (e.g. ZFS or Btrfs)

Volume manager support

Server-class operating systems typically provide logical volume management, which allows a system to use logical volumes which can be resized or moved. Often, features like RAID or snapshots are also supported.

- Vinum is a logical volume manager supporting RAID-0, RAID-1, and RAID-5. Vinum is part of the base distribution of the FreeBSD operating system, and versions exist for NetBSD, OpenBSD, and DragonFly BSD.
- Solaris SVM supports RAID 1 for the boot filesystem, and adds RAID 0 and RAID 5 support (and various nested combinations) for data drives.
- Linux LVM supports RAID 0 and RAID 1.
- HP's OpenVMS provides a form of RAID 1 called "Volume shadowing", giving the possibility to mirror data locally and at remote cluster systems.

File system support

Some advanced file systems are designed to organize data across multiple storage devices directly (without needing the help of a third-party logical volume manager).

- ZFS supports equivalents of RAID 0, RAID 1, RAID 5 (RAID Z), RAID 6 (RAID Z2), a triple-parity version (RAID Z3), and any nested combination of those, like 1+0. ZFS is the native file system on Solaris, and is also available on FreeBSD.
- Btrfs supports RAID 0, RAID 1, and RAID 10 (RAID 5 and 6 are under development).

Other support

Many operating systems provide basic RAID functionality independently of volume management.

- Apple's Mac OS X Server[19] and Mac OS X[20] support RAID 0, RAID 1, and RAID 1+0.
- FreeBSD supports RAID 0, RAID 1, RAID 3, and RAID 5, and all nestings, via GEOM modules[21][22] and ccd.[23]
- Linux's md supports RAID 0, RAID 1, RAID 4, RAID 5, RAID 6, and all nestings.[24][25] Certain reshaping/resizing/expanding operations are also supported.[26]
- Microsoft's server operating systems support RAID 0, RAID 1, and RAID 5. Some of the Microsoft desktop operating systems support RAID as well; for example, Windows XP Professional supports RAID level 0, in addition to spanning multiple drives, but only when using dynamic disks and volumes. Windows XP can be modified to support RAID 0, 1, and 5.[27]
- NetBSD supports RAID 0, RAID 1, RAID 4, and RAID 5, and all nestings, via its software implementation, named RAIDframe.
- OpenBSD aims to support RAID 0, RAID 1, RAID 4, and RAID 5 via its software implementation, softraid.
- FlexRAID (for Linux and Windows) is a snapshot RAID implementation.

Software RAID has advantages and disadvantages compared to hardware RAID. The software must run on a host server attached to storage, and the server's processor must dedicate processing time to run the RAID software; the additional processing capacity required for RAID 0 and RAID 1 is low, but parity-based arrays require more complex data processing during write or integrity-checking operations. As the rate of data processing increases with the number of drives in the array, so does the processing requirement. Furthermore, all the buses between the processor and the drive controller must carry the extra data required by RAID, which may cause congestion. Fortunately, over time, the increase in commodity CPU speed has been consistently greater than the increase in drive throughput;[28] the percentage of host CPU time required to saturate a given number of drives has decreased. For instance, under 100% usage of a single core on a 2.1 GHz Intel "Core2" CPU, the Linux software RAID subsystem (md) as of version 2.6.26 is capable of calculating parity information at 6 GB/s; however, a three-drive RAID 5 array using drives capable of sustaining a write operation at 100 MB/s only requires parity to be calculated at the rate of 200 MB/s, which requires the resources of just over 3% of a single CPU core. Furthermore, software RAID implementations may employ more sophisticated algorithms than hardware RAID implementations (e.g. drive scheduling and command queueing), and thus may be capable of better performance. Another concern with software implementations is the process of booting the associated operating system. For instance, consider a computer being booted from a RAID 1 (mirrored drives); if the first drive in the RAID 1 fails, then a first-stage boot loader might not be sophisticated enough to attempt loading the second-stage boot loader from the second drive as a fallback. In contrast, a RAID 1 hardware controller typically has explicit programming to decide that a drive has malfunctioned and that the next drive should be used. At least the following second-stage boot loaders are capable of loading a kernel from a RAID 1:

- LILO (for Linux)
- Some configurations of GRUB
- The boot loader for FreeBSD[29]
- The boot loader for NetBSD

For data safety, the write-back cache of an operating system or individual drive might need to be turned off in order to ensure that as much data as possible is actually written to secondary storage before some failure (such as a loss of power); unfortunately, turning off the write-back cache has a performance penalty that can be significant depending on the workload and command queuing support. In contrast, a hardware RAID controller may carry a dedicated battery-powered write-back cache of its own, thereby allowing for efficient operation that is also relatively safe. Fortunately, it is possible to avoid such problems with a software controller by constructing a RAID with safer components; for instance, each drive could have its own battery or capacitor on its own write-back cache, the drive could implement atomicity in various ways, and the entire RAID or computing system could be powered by a UPS, etc. Finally, a software RAID controller that is built into an operating system usually uses proprietary data formats and RAID levels, so an associated RAID usually cannot be shared between operating systems as part of a multi-boot setup. However, such a RAID may be moved between computers that share the same operating system; in contrast, such mobility is more difficult when using a hardware RAID controller, because both computers must provide compatible hardware controllers. Also, if the hardware controller fails, data could become unrecoverable unless a hardware controller of the same type is obtained. Most software implementations allow a RAID to be created from partitions rather than entire physical drives. For instance, an administrator could divide each drive of an odd number of drives into two partitions, and then mirror partitions across drives and stripe a volume across the mirrored partitions to emulate IBM's RAID 1E configuration. Using partitions in this way also allows for constructing multiple RAIDs in various RAID levels from the same set of drives. For example, one could have a very robust RAID 1 for important files, and a less robust RAID 5 or RAID 0 for less important data, all using the same set of underlying drives. (Some BIOS-based controllers offer similar features, e.g. Intel Matrix RAID.) Using two partitions from the same drive in the same RAID puts data at risk if the drive fails; for instance:

- A RAID 1 across partitions from the same drive makes all the data inaccessible if the single drive fails.
- Consider a RAID 5 composed of 4 drives, 3 of which are 250 GB and one of which is 500 GB; the 500 GB drive is split into 2 partitions, each of which is 250 GB. Then, a failure of the 500 GB drive would remove 2 underlying 'drives' from the array, causing a failure of the entire array.

Hardware-based RAID


Hardware RAID controllers use proprietary data layouts, so it is not usually possible to span controllers from different manufacturers. They do not require processor resources, the BIOS can boot from them, and tighter integration with the device driver may offer better error handling. On a desktop system, a hardware RAID controller may be an expansion card connected to a bus (e.g., PCI or PCIe) or a component integrated into the motherboard; there are controllers supporting most types of drive technology, such as IDE/ATA, SATA, SCSI, SSA, Fibre Channel, and sometimes even a combination. The controller and drives may be in a stand-alone enclosure, rather than inside a computer, and the enclosure may be directly attached to a computer, or connected via a SAN. Most hardware implementations provide a read/write cache, which, depending on the I/O workload, improves performance. In most systems, the write cache is non-volatile (i.e. battery-protected), so pending writes are not lost in the event of a power failure. Hardware implementations provide guaranteed performance, add no computational overhead to the host computer, and can support many operating systems; the controller simply presents the RAID as another logical drive.

Firmware/driver-based RAID


A RAID implemented at the level of an operating system is not always compatible with the system's boot process, and it is generally impractical for desktop versions of Windows (as described above). However, hardware RAID controllers are expensive and proprietary. To fill this gap, cheap "RAID controllers" were introduced that do not contain a dedicated RAID controller chip, but simply a standard drive controller chip with special firmware and drivers; during early-stage bootup, the RAID is implemented by the firmware, and once the operating system has been more completely loaded, the drivers take over control. Consequently, such controllers may not work when driver support is not available for the host operating system.[30] Initially, the term "RAID controller" implied that the controller does the processing. However, while a controller without a dedicated RAID chip is often described by a manufacturer as a "RAID controller", it is rarely made clear that the burden of RAID processing is borne by a host computer's central processing unit rather than the RAID controller itself. Thus, this new type is sometimes called "fake" RAID; Adaptec calls it "HostRAID". Moreover, a firmware controller can often only support certain types of hard drive to form the RAID that it manages (e.g. SATA for Intel Matrix RAID, as there is neither SCSI nor PATA support in modern Intel ICH southbridges; however, motherboard makers implement RAID controllers outside of the southbridge on some motherboards).

Hot spares


Both hardware and software RAIDs with redundancy may support the use of a hot spare drive; this is a drive physically installed in the array which is inactive until an active drive fails, when the system automatically replaces the failed drive with the spare, rebuilding the array with the spare drive included. This reduces the mean time to recovery (MTTR), but does not completely eliminate it. As with non-hot-spare systems, subsequent additional failure(s) in the same RAID redundancy group before the array is fully rebuilt can cause data loss. Rebuilding can take several hours, especially on busy systems. It is sometimes considered that if drives are procured and installed at the same time, several drives are more likely to fail at about the same time than for unrelated drives, so rapid replacement of a failed drive is important. RAID 6 without a spare uses the same number of drives as RAID 5 with a hot spare and protects data against failure of up to two drives, but requires a more advanced RAID controller. Further, a hot spare can be shared by multiple RAID sets.

Data scrubbing / Patrol read


Data scrubbing is periodic reading and checking by the RAID controller of all the blocks in a RAID, including those not otherwise accessed. This allows bad blocks to be detected before they are used.[31] An alternate name for this is patrol read. This is defined as a check for bad blocks on each storage device in an array, but which also uses the redundancy of the array to recover bad blocks on a single drive and reassign the recovered data to spare blocks elsewhere on the drive.[32]

Reliability terms


Failure rate
Two different kinds of failure rates are applicable to RAID systems. Logical failure is defined as the loss of a single drive, and its rate is equal to the sum of the individual drives' failure rates. System failure is defined as loss of data, and its rate will depend on the type of RAID. For RAID 0 this is equal to the logical failure rate, as there is no redundancy. For other types of RAID, it will be less than the logical failure rate, potentially very small, and its exact value will depend on the type of RAID, the number of drives employed, the vigilance and alacrity of its human administrators, and chance (improbable events do occur, though infrequently).

Mean time to data loss (MTTDL)
In this context, the average time before a loss of data in a given array.[33] Mean time to data loss of a given RAID may be higher or lower than that of its constituent hard drives, depending upon what type of RAID is employed. The referenced report assumes times to data loss are exponentially distributed, so that 63.2% of all data loss will occur between time 0 and the MTTDL.

Mean time to recovery (MTTR)
In arrays that include redundancy for reliability, this is the time following a failure to restore an array to its normal failure-tolerant mode of operation. This includes time to replace a failed drive mechanism and time to re-build the array (to replicate data for redundancy).

Unrecoverable bit error rate (UBE)
This is the rate at which a drive will be unable to recover data after application of cyclic redundancy check (CRC) codes and multiple retries.

Write cache reliability
Some RAID systems use RAM write cache to increase performance. A power failure can result in data loss unless this sort of drive buffer has a supplementary battery to ensure that the buffer has time to write from RAM to secondary storage before the drive powers down.

Atomic write failure
Also known by various terms such as torn writes, torn pages, incomplete writes, interrupted writes, non-transactional, etc.

Problems with RAID

Correlated failures


The theory behind the error correction in RAID assumes that failures of drives are independent. Given these assumptions, it is possible to calculate how often they can fail and to arrange the array to make data loss arbitrarily improbable. There is also an assumption that motherboard failures won't damage the hard drive and that hard drive failures occur more often than motherboard failures. In practice, the drives are often the same age (with similar wear) and subject to the same environment. Since many drive failures are due to mechanical issues (which are more likely on older drives), this violates those assumptions; failures are in fact statistically correlated. In practice, the chances of a second failure before the first has been recovered (causing data loss) are not as unlikely as for random failures. In a study of about 100,000 drives, the probability of two drives in the same cluster failing within one hour was observed to be four times larger than predicted by the exponential statistical distribution, which characterizes processes in which events occur continuously and independently at a constant average rate. The probability of two failures within the same 10-hour period was twice as large as that predicted by an exponential distribution.[34] A common assumption is that "server-grade" drives fail less frequently than consumer-grade drives. Two independent studies (one by Carnegie Mellon University and the other by Google) have shown that the "grade" of a drive does not relate to the drive's failure rate.[35][36] In addition, there is no protection circuitry between the motherboard and hard drive electronics, so a catastrophic failure of the motherboard can cause the hard drive electronics to fail. Therefore, taking elaborate precautions via RAID setups ignores the equal risk of electronics failures elsewhere which can cascade to a hard drive failure. For a robust critical data system, no risk can outweigh another, as the consequence of any data loss is unacceptable.

Atomicity
This is a little-understood and rarely mentioned failure mode for redundant storage systems that do not utilize transactional features. Database researcher Jim Gray wrote "Update in Place is a Poison Apple"[37] during the early days of relational database commercialization. However, this warning largely went unheeded and fell by the wayside upon the advent of RAID, which many software engineers mistook as solving all data storage integrity and reliability problems. Many software programs update a storage object "in-place"; that is, they write a new version of the object on to the same secondary storage addresses as the old version of the object. While the software may also log some delta information elsewhere, it expects the storage to present "atomic write semantics," meaning that the write of the data either occurred in its entirety or did not occur at all. However, very few storage systems provide support for atomic writes, and even fewer specify their rate of failure in providing this semantic. Note that during the act of writing an object, a RAID storage device will usually be writing all redundant copies of the object in parallel, although overlapped or staggered writes are more common when a single RAID processor is responsible for multiple drives. Hence an error that occurs during the process of writing may leave the redundant copies in different states, and furthermore may leave the copies in neither the old nor the new state. The little-known failure mode is that delta logging relies on the original data being either in the old or the new state so as to enable backing out the logical change, yet few storage systems provide an atomic write semantic for a RAID. While a battery-backed write cache may partially solve the problem, it is applicable only to a power failure scenario. Since transactional support is not universally present in hardware RAID, many operating systems include transactional support to protect against data loss during an interrupted write. Novell NetWare, starting with version 3.x, included a transaction tracking system. Microsoft introduced transaction tracking via the journaling feature in NTFS. ext4 has journaling with checksums; ext3 has journaling without checksums but an "append-only" option, or ext3cow (copy-on-write). If the journal itself in a filesystem is corrupted, though, this can be problematic. The journaling in NetApp's WAFL file system gives atomicity by never updating the data in place, as does ZFS. An alternative method to journaling is soft updates, which are used in some BSD-derived systems' implementation of UFS.

Unrecoverable data

An unrecoverable sector can present as a sector read failure. Some RAID implementations protect against this failure mode by remapping the bad sector, using the redundant data to retrieve a good copy of the data, and rewriting that good data to the newly mapped replacement sector. The UBE (unrecoverable bit error) rate is typically specified at 1 bit in 10^15 for enterprise-class drives (SCSI, FC, SAS), and 1 bit in 10^14 for desktop-class drives (IDE/ATA/PATA, SATA). Increasing drive capacities and large RAID 5 redundancy groups have led to an increasing inability to successfully rebuild a RAID group after a drive failure because an unrecoverable sector is found on the remaining drives. Double protection schemes such as RAID 6 are attempting to address this issue, but suffer from a very high write penalty.
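To see why this matters, the following sketch (example figures only, not taken from the source) estimates the probability of hitting at least one unrecoverable read error while reading every surviving drive of a RAID 5 group during a rebuild.

```python
# Probability that a RAID 5 rebuild hits at least one unrecoverable read error.
# Example figures only; real drives and controllers vary.

def rebuild_ure_probability(surviving_drives, capacity_tb, bits_per_ure):
    bits_read = surviving_drives * capacity_tb * 1e12 * 8
    p_no_error = (1 - 1 / bits_per_ure) ** bits_read
    return 1 - p_no_error

# Seven surviving 2 TB desktop-class drives (1 error per 1e14 bits read):
print(round(rebuild_ure_probability(7, 2.0, 1e14), 2))   # roughly 0.67
# The same array built from enterprise-class drives (1 error per 1e15 bits):
print(round(rebuild_ure_probability(7, 2.0, 1e15), 2))   # roughly 0.11
```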

Write cache reliability


The drive system can acknowledge the write operation as soon as the data is in the cache, without waiting for the data to be physically written. This typically occurs in old, non-journaled systems such as FAT32, or if the Linux/Unix "writeback" option is chosen without any protections like the "soft updates" option (to promote I/O speed while trading away data reliability). A power outage or system hang such as a BSOD can mean a significant loss of any data queued in such a cache. Often a battery is protecting the write cache, mostly solving the problem. If a write fails because of power failure, the controller may complete the pending writes as soon as it is restarted. This solution still has potential failure cases: the battery may have worn out, the power may be off for too long, the drives could be moved to another controller, or the controller itself could fail. Some systems provide the capability of testing the battery periodically; however, this leaves the system without a fully charged battery for several hours. An additional concern about write cache reliability exists specifically regarding devices equipped with a write-back cache: a caching system which reports the data as written as soon as it is written to cache, as opposed to the non-volatile medium.[38] The safer cache technique is write-through, which reports transactions as written when they are written to the non-volatile medium.

Equipment compatibility


The methods used to store data by various RAID controllers are not necessarily compatible, so that it may not be possible to read a RAID on different hardware, with the exception of RAID 1, which is typically represented as plain identical copies of the original data on each drive. Consequently a non-drive hardware failure may require the use of identical hardware to recover the data, and furthermore an identical configuration has to be reassembled without triggering a rebuild and overwriting the data. Software RAID, however, such as that implemented in the Linux kernel, alleviates this concern, as the setup is not hardware dependent, but runs on ordinary drive controllers, and allows the reassembly of an array. Additionally, individual drives of a RAID 1 (software and most hardware implementations) can be read like normal drives when removed from the array, so no RAID system is required to retrieve the data. Inexperienced data recovery firms typically have a difficult time recovering data from RAID drives, with the exception of RAID 1 drives with a conventional data structure.

Data recovery in the event of a failed array


With larger drive capacities the odds of a drive failure during rebuild are not negligible. In that event, the difficulty of extracting data from a failed array must be considered. Only a RAID 1 (mirror) stores all data on each drive in the array. Although it may depend on the controller, some individual drives in a RAID 1 can be read as a single conventional drive; this means a damaged RAID 1 can often be easily recovered if at least one component drive is in working condition. If the damage is more severe, some or all data can often be recovered by professional data recovery specialists. However, other RAID levels (like RAID level 5) present much more formidable obstacles to data recovery.

Drive error recovery algorithms


Many modern drives have internal error recovery algorithms that can take upwards of a minute to recover and re-map data that the drive fails to read easily. Frequently, a RAID controller is configured to drop a component drive (that is, to assume a component drive has failed) if the drive has been unresponsive for 8 seconds or so; this might cause the array controller to drop a good drive because that drive has not been given enough time to complete its internal error recovery procedure. Consequently, desktop drives can be quite risky when used in a RAID, and so-called enterprise-class drives limit this error recovery time in order to obviate the problem. A fix specific to Western Digital's desktop drives used to be known: a utility called WDTLER.exe could limit a drive's error recovery time; the utility enabled TLER (time-limited error recovery), which limits the error recovery time to 7 seconds. Around September 2009, Western Digital disabled this feature in their desktop drives (e.g., the Caviar Black line), making such drives unsuitable for use in a RAID.[39] However, Western Digital enterprise-class drives are shipped from the factory with TLER enabled. Similar technologies are used by Seagate, Samsung, and Hitachi. Of course, for non-RAID usage, an enterprise-class drive with a short error recovery timeout that cannot be changed is therefore less suitable than a desktop drive.[39] In late 2010, the Smartmontools program began supporting the configuration of ATA Error Recovery Control, allowing the tool to configure many desktop-class hard drives for use in a RAID.[39]

Recovery time is increasing

Drive capacity has grown at a much faster rate than transfer speed, and error rates have only fallen a little in comparison. Therefore, larger-capacity drives may take hours, if not days, to rebuild. The rebuild is further slowed if the entire array is still in operation at reduced capacity.[40] Given a RAID with only one drive of redundancy (RAIDs 3, 4, and 5), a second failure would cause complete failure of the array. Even though individual drives' mean time between failures (MTBF) has increased over time, this increase has not kept pace with the increased storage capacity of the drives. The time to rebuild the array after a single drive failure, as well as the chance of a second failure during a rebuild, have increased over time.[41]
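As a back-of-the-envelope illustration (assumed numbers, and a simple exponential failure model that the correlated-failures discussion above already cautions is optimistic), the sketch below estimates a rebuild window for a large drive and the chance of a second drive in the group failing inside that window.

```python
# Rebuild time for a large drive, and the chance of a second failure during that window.
# Example numbers only.
import math

CAPACITY_TB = 8
REBUILD_MB_PER_S = 100       # sustained rebuild rate while the array stays in service
REMAINING_DRIVES = 5         # other members of the redundancy group
MTTF_HOURS = 1_000_000

rebuild_hours = CAPACITY_TB * 1e12 / (REBUILD_MB_PER_S * 1e6) / 3600
p_second_failure = 1 - math.exp(-REMAINING_DRIVES * rebuild_hours / MTTF_HOURS)
print(f"rebuild ~ {rebuild_hours:.0f} h, P(second failure during rebuild) ~ {p_second_failure:.4f}")
```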

Operator skills, correct operation


In order to provide the desired protection against physical drive failure, a RAID must be properly set up and maintained by an operator with sufficient knowledge of the chosen RAID configuration, array controller (hardware or software), failure detection and recovery. Unskilled handling of the array at any stage may exacerbate the consequences of a failure, and result in downtime and full or partial loss of data that might otherwise be recoverable. In particular, the array must be monitored, and any failures detected and dealt with promptly. Failure to do so will result in the array continuing to run in a degraded state, vulnerable to further failures. Ultimately more failures may occur, until the entire array becomes inoperable, resulting in data loss and downtime; in this case, any protection the array may provide merely delays the inevitable. The operator must know how to detect failures or verify the healthy state of the array, identify which drive failed, have replacement drives available, and know how to replace a drive and initiate a rebuild of the array. In order to protect against such issues and reduce the need for direct on-site monitoring, some server hardware includes remote management and monitoring capabilities referred to as Baseboard Management, using the Intelligent Platform Management Interface. A server at a remote site which is not monitored by an on-site technician can instead be remotely managed and monitored, using a separate standalone communications channel that does not require the managed device to be operating. The Baseboard Management Controller in the server functions independently of the installed operating system, and may include the ability to manage and monitor a server even when it is in its "powered off / standby" state.
Hardware labeling issues

The hardware itself can contribute to RAID array management challenges, depending on how the array drives are arranged and identified. If there is no clear indication of which drive is failed, an operator not familiar with the hardware might remove a non-failed drive in a running server, and destroy an already degraded array.

A controller may refer to drives by an internal numbering scheme such as 0, 1, 2... while an external drive mounting frame may be labeled 1, 2, 3...; in this situation drive #2 as identified by the controller is actually in mounting frame position #3.

For large arrays spanning several external drive frames, each separate frame may restart the numbering at 1, 2, 3..., but if the drive frames are cabled together, then the second row of a 12-drive frame may actually be drives 13, 14, 15... SCSI IDs can be assigned directly on the drive rather than through the interface connector. For direct-cabled drives, it is possible for the drive IDs to be arranged in any order on the SCSI cable, and for cabled drives to swap position while keeping their individually assigned IDs, even if the server's external chassis labeling indicates otherwise. Someone unfamiliar with a server's management challenges could swap drives around while the power is off without causing immediate damage to the RAID array, but this misleads other technicians at a later time who assume that failed drives are in the original locations.

Other problems


While RAID may protect against physical drive failure, the data is still exposed to operator, software, hardware and virus destruction. Many studies[42] cite operator fault as the most common source of malfunction, such as a server operator replacing the incorrect drive in a faulty RAID, and disabling the system (even temporarily) in the process.[43] Most well-designed systems include separate backup systems that hold copies of the data, but do not allow much interaction with it. Most copy the data and remove the copy from the computer for safe storage. Hardware RAID controllers are really just small computers running specialized software. Although RAID controllers tend to be very thoroughly tested for reliability, the controller software may still contain bugs that cause damage to data in certain unforeseen situations. The controller software may also have time-dependent bugs that don't manifest until a system has been operating continuously for longer than is a feasible time-frame for testing before the controller product goes to market.

History
Norman Ken Ouchi at IBM was awarded a 1978 U.S. patent, 4,092,732,[44] titled "System for recovering data stored in failed memory unit." The claims for this patent describe what would later be termed RAID 5 with full stripe writes. This 1978 patent also mentions that drive mirroring or duplexing (what would later be termed RAID 1) and protection with dedicated parity (what would later be termed RAID 4) were prior art at that time. The term RAID was first defined by David A. Patterson, Garth A. Gibson and Randy Katz at the University of California, Berkeley, in 1987. They studied the possibility of using two or more drives to appear as a single device to the host system and published a paper, "A Case for Redundant Arrays of Inexpensive Disks (RAID)", in June 1988 at the SIGMOD conference.[3] This specification suggested a number of prototype RAID levels, or combinations of drives. Each had theoretical advantages and disadvantages. Over the years, different implementations of the RAID concept have appeared. Most differ substantially from the original idealized RAID levels, but the numbered names have remained. This can be confusing, since one implementation of RAID 5, for example, can differ substantially from another. RAID 3 and RAID 4 are often confused and even used interchangeably. One of the early uses of RAID 0 and 1 was the Crosfield Electronics Studio 9500 page layout system based on the Python workstation. The Python workstation was a Crosfield-managed international development using PERQ3B electronics, benchMark Technology's Viper display system and Crosfield's own RAID and fibre-optic network controllers. RAID 0 was particularly important to these workstations as it dramatically sped up image manipulation for the pre-press markets. Volume production started in Peterborough, England in early 1987.

Non-RAID drive architectures


Main article: Non-RAID drive architectures

Non-RAID drive architectures also exist, and are often referred to, similarly to RAID, by standard acronyms, several of them tongue-in-cheek. A single drive is referred to as a SLED (Single Large Expensive Disk/Drive), by contrast with RAID, while an array of drives without any additional control (accessed simply as independent drives) is referred to, even in a formal context such as an equipment specification, as a JBOD (Just a Bunch Of Disks). Simple concatenation is referred to as a "span".
