Factors Influencing the Rebuild Time of a Degraded RAID
Whitepaper
Abstract
Table of Contents

1. Product families covered by this document
2. What happens if one or more disks in a RAID array fail?
2.1 Disk failure in a RAID 1 or RAID 1+0
2.2 Disk failure in a RAID 3 or RAID 3+0 (Dedicated Parity RAID)
2.3 Disk failure in a RAID 5 or RAID 6 (Distributed Parity RAID)
2.4 Disk failure in a RAID 5 with hot-spare disk
3. The key factors influencing the rebuild time
3.1 The computing power of the RAID controller
3.1.1 The computing power bottleneck
3.1.2 Recommended configurations for EonStor DS G6 and G7 storage systems
3.2 The write performance of a single drive in a RAID array
3.2.1 NL-SAS vs. SATA III hard drive
3.2.2 SATA vs. SATA III hard drive with Native Command Queuing (NCQ) enabled
3.2.3 The drive bottleneck
3.3 The capacity of a failed hard drive
4. Estimating the rebuild time
4.1 Rebuild time, if write throughput is the bottleneck
4.2 Rebuild time, if computing power is the bottleneck
5. Appendix
5.1 Infortrend web links
5.2 Drive vendor links

1. Product families covered by this document
This document applies to the following Infortrend storage system families:
EonStor DS G6 Family
EonStor DS G7 Family
2. What happens if one or more disks in a RAID array fail?
Single disks and single-level striped arrays (RAID 0) lose all data if a disk fails. You have to replace the failed disk and then restore the lost data from a backup before you can continue operation.
Figure 1: RAID 0, disk failure
Opting for a mirrored array (RAID 1 or RAID 1+0) or a parity array (RAID 5 / 6 and RAID 5+0 / 6+0) solves this problem, at the cost of lower performance and a higher TCA / TCO.
There is an extremely small chance that two disks fail simultaneously. If data protection and integrity are the highest priority and cost and performance are secondary, consider using a RAID 6 or RAID 6+0 to be prepared for that eventuality.
2.1 Disk failure in a RAID 1 or RAID 1+0
The failed disk has to be replaced by a new disk first. The RAID controller then writes all data from the mirror to the newly installed disk (the rebuild), which can be thought of as a disk copy from the mirror to the new disk. There is no need to restore a previously made backup with possibly outdated data. The RAID controller copies the data in the background, so operation continues.
2.2 Disk failure in a RAID 3 or RAID 3+0 (Dedicated Parity RAID)
If one disk fails, the data can be rebuilt by reading all remaining disks (all but the failed one) and writing the reconstructed data to the replacement disk; writing to this single new disk is enough to rebuild the array. Two cases have to be distinguished when rebuilding a degraded RAID (a RAID is "degraded" if one or more of its disks fail). If the dedicated parity disk fails, the rebuild is simply a matter of recalculating the parity information by reading all remaining data and writing the parity to the new dedicated disk. If a data disk fails, the data has to be reconstructed from the remaining data and the parity; this is the most time-consuming part of rebuilding a degraded RAID.
Figure 3: RAID 3, disk failure
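The XOR-based reconstruction described above can be illustrated with a minimal Python sketch. The block contents and the helper function are purely illustrative assumptions (a real controller operates on whole stripes in hardware, not single bytes):

```python
# Minimal sketch of XOR parity reconstruction in a dedicated-parity RAID.
# Disk contents and helper names are illustrative, not an Infortrend API.

def xor_blocks(blocks):
    """XOR a list of equally sized byte blocks into one block."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# One stripe across a 4-disk RAID 3: three data disks + one parity disk.
data = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]
parity = xor_blocks(data)                  # written to the dedicated parity disk

# Case 1: the parity disk fails -> recalculate parity from all data disks.
assert xor_blocks(data) == parity

# Case 2: a data disk fails -> rebuild it from the surviving data and parity.
surviving = [data[0], data[2], parity]     # disk 1 is lost
assert xor_blocks(surviving) == data[1]    # the lost data is recovered
```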
2.3 Disk failure in a RAID 5 or RAID 6 (Distributed Parity RAID)
If a disk fails, the data can be rebuilt by reading all remaining disks, reconstructing the data, recalculating the parity information, and writing both to the new disk. This is time-consuming. The rebuild time depends on the drive capacity, on the number of drives in the RAID array or RAID sub-arrays (RAID 5+0 or RAID 6+0), and on the computing power of the controller.
2.4 Disk failure in a RAID 5 with hot-spare disk
When a RAID array is not protected by a hot-spare disk, the failed disk has to be removed and replaced by a new one. The controller then detects the new disk and starts rebuilding the RAID array.
Using a hot-spare disk eliminates this manual replacement step: in case of a disk failure, the hot-spare disk is automatically incorporated into the RAID array and takes over for the failed disk.
3. The key factors influencing the rebuild time
There are three main factors that affect the rebuild time:
The computing power of the RAID controller
The write performance of a drive in a RAID array
The capacity of the failed drive
3.1 The computing power of the RAID controller
The computing power of a RAID controller is the maximum XOR throughput per second it can sustain. As shown in the table below, an EonStor DS G7 storage system reduces the rebuild time to 62% of that of an EonStor DS G6 storage system in a RAID 5 configuration with 15 hard drives.
Model   Computing power   Disk (write performance *1)      Configuration (RAID 5)   Rebuild time    Time saving in %
G6      850 MB/s          NL-SAS Toshiba 1TB (122 MB/s)    7 HDDs                   2 hr. 35 min.
                                                           15 HDDs                  4 hr. 54 min.
G7      1350 MB/s         NL-SAS Toshiba 1TB (122 MB/s)    7 HDDs                   2 hr. 35 min.
                                                           15 HDDs                  3 hr. 5 min.    38%
Table 1: Computing power of the RAID controller
3.1.1 The computing power bottleneck
The computing power of a RAID controller is limited (see Table 1). The workload of the RAID controller increases as more drives are added to a RAID array / RAID sub-array, and once the controller becomes the bottleneck, the rebuild time increases significantly.
Example:
1. Configuration:
EonStor DS G6 computing power = 850 MB/s
EonStor DS G7 computing power = 1350 MB/s
RAID 5 array = 6 x SAS 6G 400 GB SSD with 253.04 MB/s max. seq. WRITE throughput
2. Calculation:
Max. total READ throughput = (number of drives - number of failed drives) * max. WRITE throughput
Max. total READ throughput = (6 - 1) * 253.04 MB/s = 1265.20 MB/s
During the rebuild the controller has to process the total READ throughput with its XOR engine, so the required computing power equals the total READ throughput.
3. Result:
EonStor DS G6: 850 MB/s < 1265.20 MB/s (computing power is the bottleneck)
EonStor DS G7: 1350 MB/s > 1265.20 MB/s (the drives are the bottleneck)
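The comparison above can be generalized into a small Python sketch. The throughput figures come from the example; the function itself is our own illustration, not an Infortrend tool:

```python
# Sketch: decide whether the controller or the drives limit a rebuild.
# Computing-power figures are taken from Table 1; names are our own.

def rebuild_bottleneck(computing_power_mb_s, drive_count, drive_write_mb_s,
                       failed_drives=1):
    """Compare controller XOR throughput with the total READ throughput
    the surviving drives can deliver during a rebuild."""
    total_read = (drive_count - failed_drives) * drive_write_mb_s
    if computing_power_mb_s < total_read:
        return f"controller is the bottleneck ({computing_power_mb_s} MB/s < {total_read:.2f} MB/s)"
    return f"drives are the bottleneck ({computing_power_mb_s} MB/s > {total_read:.2f} MB/s)"

print("G6:", rebuild_bottleneck(850, 6, 253.04))   # controller is the bottleneck
print("G7:", rebuild_bottleneck(1350, 6, 253.04))  # drives are the bottleneck
```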
3.1.2 Recommended configurations for EonStor DS G6 and G7 storage systems
1. EonStor DS G6:
We recommend using 8 or fewer hard drives per RAID array / sub-array to achieve a suitable rebuild time. If you want to use more than 8 hard drives in a RAID 5 or RAID 6, we recommend using RAID 5+0 or RAID 6+0 and balancing the number of hard drives among the RAID sub-arrays (≤ 8 hard drives per sub-array).
Figure 6: ESDS G6 Number of HDD recommendation
2. EonStor DS G7:
A suitable rebuild time is achieved with 16 or fewer hard drives per RAID array or RAID sub-array. If you want to use a RAID 5 with more than 10 hard drives, we recommend using RAID 5+0. Even though the difference in rebuild time between 10 and 16 hard drives is quite small, we suggest using fewer than 10 drives in a RAID 5 array or RAID 5+0 sub-array to minimize the risk of a second hard drive failure. With RAID 6, we recommend using 16 or fewer hard drives, as the RAID protects against two drive failures. Consider RAID 6+0 when using more than 16 hard drives.
Figure 7: ESDS G7 Number of HDD recommendation
3.2 The write performance of a single drive in a RAID array
During a rebuild, all reconstructed data is written to a single disk, so the maximum write performance of a single drive determines the maximum throughput per second that can be written to the replacement or hot-spare disk.
3.2.1 NL-SAS vs. SATA III hard drive
The table below shows the time saved by using the same EonStor DS G7 storage system in RAID 5, once with NL-SAS hard drives and once with SATA III hard drives. Using NL-SAS hard drives reduces the rebuild time to 43% of the time needed with SATA III hard drives.
Model   Computing power   Write performance (*1)               Configuration (RAID 5)   Rebuild time       Time saving in %
G7      1350 MB/s         SATA Hitachi 2TB (43 MB/s w/o NCQ)   7 HDDs                   13 hrs. 30 mins.
G7      1350 MB/s         NL-SAS Toshiba 2TB (122 MB/s)        7 HDDs                   5 hrs. 50 mins.    57%
Table 2: Write performance of NL-SAS vs. SATA hard drive
3.2.2 SATA vs. SATA III hard drive with Native Command Queuing (NCQ) enabled
A RAID controller that supports Native Command Queuing (NCQ) for SATA hard drives can reduce the rebuild time significantly compared to a RAID controller that does not support NCQ. The table shows an EonStor DS G6 storage system without NCQ support compared to an EonStor DS G7 storage system with NCQ enabled. The computing power is not relevant in this case, as the rebuild time with 7 hard drives is the same on EonStor DS G6 and G7 (see 3.1).
Model   Computing power   Write performance (*1)               Configuration (RAID 5)   Rebuild time       Time saving in %
G6      850 MB/s          SATA Hitachi 2TB (43 MB/s w/o NCQ)   7 HDDs                   13 hrs. 30 mins.
G7      1350 MB/s         SATA Hitachi 2TB (130 MB/s w/ NCQ)   7 HDDs                   4 hrs. 54 mins.    64%
Table 3: Write performance of SATA hard drive w/o NCQ vs. w/ NCQ
3.2.3 The drive bottleneck
In general, the maximum READ throughput of a single hard drive is higher than its maximum WRITE throughput.
Drive                           Max. Seq. READ Throughput   Max. Seq. WRITE Throughput (*1)
NL-SAS Toshiba 1TB MK1001TRKB   147 MB/s                    122 MB/s
Table 4: READ throughput vs. WRITE throughput
During the rebuild process the write throughput equals the read throughput: if one block is read from each single disk per second, one block is written to the hot-spare disk per second; if two blocks are read per second, two blocks are written per second. The rebuild can therefore never proceed faster than the hot-spare disk can write.
Example:
1. Configuration:
A.) EonStor DS G7 w/ 7 x NL-SAS 6G HDD (1TB), max. seq. WRITE throughput 122 MB/s
B.) EonStor DS G7 w/ 7 x SATA III HDD (1TB), max. seq. WRITE throughput 43 MB/s
2. Calculation:
Max. total READ throughput = (number of drives - number of failed drives) * max. WRITE throughput
A.) Max. total READ throughput = (7 - 1) * 122 MB/s = 732 MB/s
B.) Max. total READ throughput = (7 - 1) * 43 MB/s = 258 MB/s
3. Result:
A.) 1350 MB/s > 732 MB/s
B.) 1350 MB/s > 258 MB/s
In both cases the computing power exceeds the total READ throughput, so the write throughput of the single drive is the bottleneck.
Configuration   Drive                   Max. Write Throughput (*1)   Rebuild Time         Test Environment
A.)             NL-SAS 6G Toshiba 1TB   122 MB/s                     2 hours 09 minutes   EonStor DS G7, RAID 5, LD: 1, hard drives: 7, hot-spare: 1
B.)             SATA III Hitachi 1TB    43 MB/s                      6 hours 09 minutes
Table 5: WRITE throughput SATA III hard drive vs. NL-SAS hard drive
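The measured rebuild times in Table 5 are consistent with a simple capacity-over-write-throughput estimate, anticipating the formula in section 4.1. The following Python sketch is our own sanity check using the table's values:

```python
# Sanity check: rebuild time ~ drive capacity / max. write throughput.
# Capacity conversion follows section 4 (#GB * 1000^3 / 1024^2 MB).

capacity_mb = 1000 * 1000**3 / 1024**2      # 1 TB drive ~ 953,674 MB

for drive, write_mb_s in [("NL-SAS Toshiba 1TB", 122),
                          ("SATA III Hitachi 1TB", 43)]:
    minutes = capacity_mb / write_mb_s / 60
    print(f"{drive}: ~{minutes:.0f} min")
# NL-SAS: ~130 min (measured: 2 h 09 min)
# SATA:   ~370 min (measured: 6 h 09 min)
```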
3.3 The capacity of a failed hard drive
The capacity of the failed hard drive is another key factor affecting the rebuild time: the higher the capacity, the longer the rebuild. Using a hard drive with smaller capacity (e.g., a 1 TB instead of a 2 TB hard drive) reduces the rebuild time to 44% of the 2 TB rebuild time (a 56% saving, see Table 6).
Drive Type   Model         Configuration (RAID 5)   Rebuild time      Time saving in %
NL-SAS       Toshiba 2TB   G7, 7 HDDs               5 hrs. 49 mins.
NL-SAS       Toshiba 1TB   G7, 7 HDDs               2 hrs. 35 mins.   56%
Table 6: Capacity of a failed hard drive
4. Estimating the Rebuild Time
The rebuild time can be estimated using the following formulas.
4.1 Rebuild time, if write throughput is the bottleneck
The following formula is suitable if one of the following configurations is used, i.e. the drives rather than the controller limit the rebuild (see 3.1.1):
Drives per RAID array / sub-array   EonStor DS G7   EonStor DS G6
SAS 6G SSD                          ≤ 6             ≤ 3
NL-SAS HDD                          ≤ 11            ≤ 7
SATA II / III HDD                   ≤ 11            ≤ 7
Table 7: Suitable configuration
Rebuild time (in seconds) = DC ÷ DW
DC = drive capacity in MB (#GB * 1000³ ÷ 1024²)
DW = drive WRITE throughput in MB/s (*1)
Example:
1. Configuration:
Drive                   Max. Write Throughput   Test Environment
NL-SAS 6G Toshiba 1TB   122 MB/s                EonStor DS G7, RAID 5, LD: 1, hard drives: 7, hot-spare: 1
Table 8: Rebuild Time example configuration
2. Calculation:
Rebuild time in minutes = (953,674 ÷ 122) ÷ 60
3. Result:
EonStor DS G7, rebuild time = 130 min.
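A worked Python sketch of this estimate follows; the GB-to-MB conversion implements the definition of DC above, and the function names are our own:

```python
# Sketch of the section 4.1 estimate: write throughput is the bottleneck.

def drive_capacity_mb(capacity_gb):
    """DC: drive capacity in MB (#GB * 1000^3 / 1024^2)."""
    return capacity_gb * 1000**3 / 1024**2

def rebuild_time_minutes(capacity_gb, drive_write_mb_s):
    """Rebuild time = DC / DW, converted from seconds to minutes."""
    return drive_capacity_mb(capacity_gb) / drive_write_mb_s / 60

# Example from Table 8: NL-SAS Toshiba 1TB at 122 MB/s on an EonStor DS G7.
print(f"{rebuild_time_minutes(1000, 122):.0f} min")  # ~130 min
```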
4.2 Rebuild time, if computing power is the bottleneck
In all other cases (more drives per RAID array / sub-array than listed in Table 7) the rebuild time can be estimated using the following formula.
Rebuild time (in seconds) = DC ÷ (CP ÷ (DN - DF))
DC = drive capacity in MB
DN = number of drives
DF = number of failed drives
CP = computing power (EonStor DS G6 = 850 MB/s, EonStor DS G7 = 1,350 MB/s)
Example:
1. Configuration:
Configuration   EonStor Model   Max. Computing Power   Test Environment
A.)             G7              1350 MB/s              Hitachi SSD 400GB, RAID 5, LD: 1, drives: 7, hot-spare: 1
B.)             G6              850 MB/s               (same as A.)
Table 9: Rebuild Time example configuration
2. Calculation:
A.) Rebuild time in minutes = (381,470 ÷ (1350 ÷ (7 - 1))) ÷ 60
B.) Rebuild time in minutes = (381,470 ÷ (850 ÷ (7 - 1))) ÷ 60
3. Result:
A.) EonStor DS G7, rebuild time = 28 min.
B.) EonStor DS G6, rebuild time = 45 min.
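The same estimate as a Python sketch; the formula follows the calculation above, and the function name is our own:

```python
# Sketch of the section 4.2 estimate: the controller's computing power is
# the bottleneck and is shared among the surviving drives.

def rebuild_time_minutes_cp(capacity_gb, computing_power_mb_s,
                            drive_count, failed_drives=1):
    """Rebuild time = DC / (CP / (DN - DF)), in minutes."""
    dc = capacity_gb * 1000**3 / 1024**2            # DC: capacity in MB
    per_drive = computing_power_mb_s / (drive_count - failed_drives)
    return dc / per_drive / 60

# Examples from Table 9: 400 GB SSDs, RAID 5 with 7 drives.
print(f"G7: {rebuild_time_minutes_cp(400, 1350, 7):.0f} min")  # ~28 min
print(f"G6: {rebuild_time_minutes_cp(400, 850, 7):.0f} min")   # ~45 min
```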
5. Appendix
5.1 Infortrend Web links
Infortrend Home:
http://www.infortrend.com
Infortrend Support and Resources:
http://www.infortrend.com/global/Support/Support
EonStor DS overview:
http://www.infortrend.com/global/products/families/ESDS
EonStor DS G7 overview:
http://www.infortrend.com/Event2011/2011_Global/201112_ESDS_G7/ESDS_G7.html
5.2 Drive Vendor links
Links to the drive vendors whose drives were used for testing: