Chapter 7. Disk subsystem


Ultimately, all data must be retrieved from and stored to disk. Disk accesses are usually measured in milliseconds, whereas memory and PCI bus operations are measured in nanoseconds or microseconds. Disk operations are typically thousands of times slower than PCI transfers, memory accesses, and LAN transfers. For this reason the disk subsystem can easily become the major bottleneck for any server configuration.

Disk subsystems are also important because the physical orientation of data stored on disk has a dramatic influence on overall server performance. A detailed understanding of disk subsystem operation is critical for effectively solving many server performance bottlenecks.

A disk subsystem consists of the physical hard disk and the controller. A disk is made up of multiple platters coated with magnetic material to store data. The entire platter assembly mounted on a spindle revolves around the central axis. A head assembly mounted on an arm moves to and fro (linear motion) to read the data stored on the magnetic coating of the platter.

The linear movement of the head is referred to as the seek. The time it takes to move to the exact track where the data is stored is called seek time. The time it takes for the platter to rotate the correct sector under the head is called latency. The rate at which the disk can transfer the requested data is called the data transfer rate.

The most widely used drive technology today in servers is SCSI (Small Computer System Interface). IBM’s flagship SCSI controller is the ServeRAID-4H adapter. Besides SCSI, other storage technologies are available, such as:

• SSA (Serial Storage Architecture)
• FC-AL (Fibre Channel Arbitrated Loop)
• EIDE (Enhanced Integrated Drive Electronics)

For performance reasons, do not use EIDE disks in your server. The EIDE interface does not handle multiple simultaneous I/O requests very efficiently and so is not suited to a server environment. The EIDE interface uses more server CPU capacity than SCSI.

We recommend you limit EIDE use to CD-ROM and tape devices.


In this redbook we will focus only on SCSI and Fibre Channel.

7.1 SCSI bus overview

The SCSI bus has evolved into the predominant server disk connection technology. Several different versions of SCSI exist. The table below contains all versions covered by the current SCSI specification.

Table 11. SCSI specifications

SCSI standard   Bus clock speed   50-pin narrow (8-bit)   68-pin wide (16-bit)   Maximum cable length
SCSI            5 MHz             5 MBps                  -                      6 meters
SCSI-2 Fast     10 MHz            10 MBps                 20 MBps                3 meters
Ultra SCSI      20 MHz            20 MBps                 40 MBps                1.5 meters
Ultra2 SCSI     40 MHz            40 MBps                 80 MBps                12 meters (LVD)

7.1.1 SCSI

First implemented as an ANSI standard in 1986, the Small Computer System Interface defines an 8-bit interface with a burst-transfer rate of 5 MBps with a 5 MHz clock (that is, 1 byte transferred per clock cycle). SCSI cable lengths are limited to 6 meters.

7.1.2 SCSI-2

The SCSI-2 standard was released by ANSI in 1994 and allowed for better performance than the original SCSI interface. It defines extensions that allow for 16-bit transfers and twice the data transfer rate due to a 10 MHz clock. The 8-bit interface is called SCSI-2 Fast and the 16-bit interface is called SCSI-2 Fast/Wide.

In addition to the faster speed, SCSI-2 also introduced new command sets to improve performance when multiple requests are issued from the server. The trade-off with increased speed was shorter cable length. The 10 MHz SCSI-2 interface supported cable lengths of up to 3 meters.


7.1.3 Ultra SCSI

Ultra SCSI is an update to the SCSI-2 interface offering faster data transfer rates and was introduced in 1996. It is a subset of the SCSI-3 parallel interface (SPI) standard currently under development within the X3T10 SCSI committee.

The clock speed was doubled again to 20 MHz, providing a data transfer rate of up to 40 MBps with a 16-bit data width while maintaining backward compatibility with SCSI and SCSI-2. Although data can be transferred at 20 MHz (that is, 40 MBps wide), all SCSI commands are issued at 10 MHz to maintain compatibility. This means that the maximum bandwidth is less than 31 MBps, even with 64 KB blocks.

Once again, with the increased speed, cable lengths were halved to 1.5 meters maximum.

7.1.4 Ultra2 SCSI

Ultra2 SCSI uses Low Voltage Differential (LVD) signalling, which is designed to improve SCSI bus signal quality, enabling faster transfer rates and longer cable lengths. Ultra2 SCSI doubles the clock speed to 40 MHz. It employs the same concept as the older Differential SCSI specification where two signal lines are used to transmit each of the 8 or 16 bits, one signal the negative of the other. See Figure 34. At the receiver, one signal, A+, is subtracted from the other, A- (that is, the differential is taken), which effectively removes spikes and other noise from the original signal. The result is A± as shown in Figure 34.

Figure 34. Differential SCSI
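The noise-cancelling effect of taking the differential can be illustrated with a small numeric sketch. The sample values and the function name differential_receive are made up for illustration; the only assumption is that the same noise spike appears on both lines of the pair.

```python
def differential_receive(a_plus, a_minus):
    """Return the difference between the two lines for each sample."""
    return [p - m for p, m in zip(a_plus, a_minus)]

# Hypothetical waveform (arbitrary units) and a noise pattern that
# couples equally onto both wires of the pair.
signal = [0, 1, 0, -1, 0, 1]
noise = [0.0, 0.5, -0.25, 0.0, 0.75, 0.0]

a_plus = [s + n for s, n in zip(signal, noise)]    # signal plus noise
a_minus = [-s + n for s, n in zip(signal, noise)]  # inverted signal plus the same noise

recovered = differential_receive(a_plus, a_minus)
print(recovered)  # noise terms cancel; each sample is twice the original signal
```

Because the noise is common to both lines it drops out of the subtraction, while the wanted signal, which has opposite sign on the two lines, is reinforced.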

Differential components tend to be more expensive than similar single-ended SCSI components, and differential termination requires a lot of power, generating significant heat levels. Because of the large voltage swings (20 Volts) and high power requirements, current differential transceivers cannot be integrated onto the SCSI chip, but must be additional external components.

LVD has differential's advantages of long cables and noise immunity without the power and integration problems. Because LVD uses a small (1.1 Volts) voltage swing, LVD transceivers can be implemented in CMOS, allowing them to be built into the SCSI chip, reducing cost, board area, power requirements, and heat.

The use of LVD allows cable lengths to be up to 12 meters.

7.1.5 Ultra3 SCSI

The maximum theoretical throughput of Ultra3 160/m SCSI can reach 160 MBps on each SCSI channel. Ultra3 160/m uses the same clock frequency as Ultra2 SCSI, but data transfers occur on both rising and falling edges of the clock signal, effectively doubling the throughput. This feature is called double transition clocking.
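The burst rates quoted for each SCSI generation follow from clock speed, bus width, and the number of transfers per clock cycle. The following is a minimal sketch of that arithmetic; the function name burst_rate_mbps is hypothetical, and the figures are the theoretical maximums given in this chapter, not measured throughput.

```python
def burst_rate_mbps(clock_mhz, bus_width_bits, transfers_per_clock=1):
    """Theoretical burst rate: clock x bytes per transfer x transfers per clock."""
    return clock_mhz * (bus_width_bits // 8) * transfers_per_clock

# (name, clock in MHz, bus width in bits, transfers per clock cycle)
versions = [
    ("SCSI", 5, 8, 1),
    ("SCSI-2 Fast", 10, 8, 1),
    ("SCSI-2 Fast/Wide", 10, 16, 1),
    ("Ultra SCSI (wide)", 20, 16, 1),
    ("Ultra2 SCSI (wide)", 40, 16, 1),
    ("Ultra3 160/m", 40, 16, 2),  # double transition clocking
]

for name, clock, width, transfers in versions:
    print(f"{name}: {burst_rate_mbps(clock, width, transfers)} MBps")
```

Ultra3 160/m keeps the 40 MHz clock of Ultra2 SCSI; only the transfers-per-clock factor changes, which is why its rate doubles.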

Note: Double transition clocking requires LVD signalling. On a single-ended SCSI bus, clocking will revert to single transition mode. If you use a mixture of Ultra3 and Ultra2 devices on an LVD-enabled SCSI bus, there is no need for all devices to run at Ultra2 speed: the Ultra3 SCSI devices will still operate at the Ultra3 (160 MBps) speed.

Additionally, Ultra3 160/m SCSI can use CRC to ensure data integrity and is therefore far more reliable than older SCSI implementations which only support parity control.

Domain validation is another feature of Ultra3 160/m SCSI. It is performed during the SCSI bus initialization and the intent is to ensure devices on the SCSI bus (the domain) can reliably transfer data at the negotiated speed. Only Ultra3 capable devices can use domain validation.

Note: Ultra3 160/m is a subset of Ultra3 SCSI. It supports double transition clocking, CRC and domain validation, but does not include all Ultra3 SCSI features, like packetization or quick arbitration.

7.1.6 SCSI controllers and devices

There are two basic types of SCSI controller designs — array and non-array. A standard non-array SCSI controller allows connection of SCSI disk drives to the PCI bus. Each drive is presented to the operating system as an individual, physical drive.


Figure 35 shows a typical non-array controller. The SCSI bus (an internal cable typically) is terminated on both ends. The SCSI controller (or host adapter) normally has one of the end terminators integrated within its electronics, so only one physical terminator is required.

The SCSI bus can contain different device types, such as disk, CD-ROM and tape all on the same bus. However, most non-disk devices conform to the slower SCSI and SCSI-2 Fast standards. So, if I/O to a CD-ROM or tape drive is required, the entire SCSI bus would have to switch to the slower speed during that access, which dramatically affects performance.

This would not be much of a problem if the CD-ROM is not used for production purposes (that is, the CD-ROM is not a LAN resource available to users) and the tape drive is only accessed after hours, when performance is not critical.

If at all possible, we recommend you do not attach CD-ROMs or tape drives to the same SCSI bus as disk devices. Fortunately, most Netfinity servers have the standard CD-ROM on the EIDE bus.

Figure 35. Non-array SCSI configuration

The array controller, a more advanced design, contains hardware designed to combine multiple SCSI disk drives into a larger single logical drive.

Combining multiple SCSI drives into a larger logical drive greatly improves I/O performance compared to single-drive performance. Most array controllers employ fault-tolerant techniques to protect valuable data in the event of a disk drive failure. Array controllers are installed in almost all servers because of these advantages.

Note: Although there are many array controller technologies, each possessing unique characteristics, this redbook includes details and tuning information specific to the IBM ServeRAID array controller.


7.2 SCSI IDs

Figure 36. SCSI ID priority

With the introduction of SCSI-2, a total of 16 devices can be connected to a single SCSI bus. To uniquely identify each device, each is assigned a SCSI ID from 0 to 15. One of these is the SCSI controller itself and it is assigned ID 7.

Because the 16 devices share a single data channel, only one device can use the bus at a time. When two SCSI devices attempt to control the bus, the SCSI IDs determine who wins according to a priority scheme, as shown in Figure 36.

The highest priority ID is that of the controller. Next are the low order IDs from 6 to 0 and then the high order IDs from 15 to 8.
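The priority order described above can be written down as a simple lookup. This is a minimal sketch of the ordering only, not of the electrical arbitration protocol; the names PRIORITY_ORDER and arbitration_winner are hypothetical.

```python
# SCSI IDs from highest to lowest arbitration priority, as described above:
# the controller (7) first, then 6 down to 0, then 15 down to 8.
PRIORITY_ORDER = [7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8]

def arbitration_winner(contending_ids):
    """Return the ID that wins the bus among the devices contending for it."""
    return min(contending_ids, key=PRIORITY_ORDER.index)

print(arbitration_winner([3, 12]))   # a narrow device at ID 3 beats a wide device at ID 12
print(arbitration_winner([15, 8]))   # among the high-order IDs, 15 beats 8
```

The first call shows the situation the priority scheme makes possible: a device on a low-order ID always wins over a device on a high-order ID.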

Although this priority scheme allows backward compatibility, it can result in negative system performance if your devices are configured incorrectly. Narrow (8-bit) devices with lower IDs will automatically preempt use of the bus by the faster F/W devices with addresses greater than 7. This is especially important when CD-ROMs and tape drives are placed on the same SCSI bus as F/W disk drives.

Note: With the use of hot-swap drives, the SCSI ID is automatically set by the hot-swap backplane. Typically, the only change is whether the backplane assigns high-order IDs or low-order IDs.

7.3 Disk array controller architecture

Almost all server disk controllers implement the SCSI communication between the disk controller and disk drives. SCSI is an intelligent interface that allows simultaneous processing of multiple I/O requests. This is the single most important advantage for using SCSI controllers on servers. Servers must process multiple independent requests for I/O. SCSI’s ability to concurrently process many different I/O operations makes it the optimal choice for servers.

SCSI array controllers consist of the following primary components:

• PCI bus interface/controller
• SCSI bus controller(s) and SCSI bus(es)
• Microprocessor
• Memory (microprocessor code and cache buffers)
• Internal bus (connects PCI interface, microprocessor, and SCSI controllers)

Figure 37. Architecture of a disk array controller

7.4 Disk array controller operation

The SCSI-based disk array controller is a PCI busmaster initiator with capability to master the PCI bus to gain direct access to server main memory. The following sequence outlines the fundamental operations that occur when a disk-read operation is performed:

1. The server operating system generates a disk I/O read operation by building an I/O control block command in memory. The I/O control block contains the READ command, a disk address called a Logical Block Address (LBA), a block count or length, and the main memory address where the read data from disk is to be placed (destination address).

2. The operating system generates an interrupt to tell the disk array controller that it has an I/O operation to perform. This interrupt initiates execution of the disk device driver. The disk device driver (executing on the server’s CPU) addresses the disk array controller and sends it the address of the I/O control block and a command instructing the disk array controller to fetch the I/O control block from memory.

3. The disk array controller initiates a PCI bus transfer to copy the I/O control block from server memory into its local adapter memory. The on-board microprocessor executes instructions to decode the I/O control block command, to allocate buffer space in adapter memory to temporarily store the read data, and to program the SCSI controller chip to initiate access to the SCSI disks containing the read data. The SCSI controller chip is also given the address of the adapter memory buffer that will be used to temporarily store the read data.

4. At this point, the SCSI controller arbitrates for the SCSI bus, and when bus access is granted, a read command, along with the length of data to be read, is sent to the SCSI drives that contain the read data. The SCSI controller disconnects from the SCSI bus and waits for the next request.

5. The target SCSI drive begins processing the read command by initiating the disk head to move to the track containing the read data (called a seek operation). The average seek time for current high-performance SCSI drives is about 5 to 7 milliseconds.

This time is derived by measuring the average amount of time it takes to position the head randomly from any track to any other track on the drive. The actual seek time for each operation can be significantly longer or shorter than the average. In practice, the seek time depends upon the distance the disk head must move to reach the track containing the read data.

6. After the seek time elapses, and the head reaches its destination track, the head begins to read a servo track (adjacent to the data track). A servo track is used to direct the disk head to accurately follow the minute variations of the data signal encoded within the disk surface. The disk head also begins to read the sector address information to identify the rotational position of the disk surface. This allows the head to know when the requested data is about to rotate underneath the head. The time that elapses between the point when the head settles and is able to read the data track, and the point when the read data arrives is called the rotational latency. Most disk drives have a specified average rotational latency, which is half the time it takes to traverse one complete revolution. It is half the rotational time because on average, the head will have to wait a half revolution to access any block of data on a track.

The average rotational latency of a 7200 RPM drive is about 4 milliseconds, whereas the average rotational latency of a 10,000 RPM drive is about 3 milliseconds. The actual latency depends upon the angular distance to the read data when the seek operation completes and the head can begin reading the requested data track.

7. When the read data becomes available to the read head, it is transferred from the head into a buffer contained on the disk drive. Usually this buffer is large enough to contain a complete track of data.


8. The disk drive has the ability to be a SCSI bus initiator or SCSI bus target (similar terminology used for PCI). Now the controller logic in the disk drive arbitrates to gain access to the SCSI bus, as an initiator. When the bus becomes available, the disk drive begins to burst the read data into buffers on the adapter SCSI controller chip. The adapter SCSI controller chip then initiates a DMA (direct memory access) operation to move the read data into a cache buffer in array controller memory.

9. When the transfer of read data into disk array cache memory is complete, the disk controller becomes an initiator and arbitrates to gain access to the PCI bus. Using the destination address that was supplied in the original I/O control block as the target address, the disk array controller performs a PCI data transfer (memory write operation) of the read data into server main memory.

10. When the entire read transfer to server memory has completed, the disk array controller generates an interrupt to communicate completion status to the disk device driver. This interrupt informs the operating system that the read operation has completed.
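The average rotational latency figures quoted in step 6 follow directly from the spindle speed, and together with the seek time from step 5 they dominate the cost of a single read. The sketch below works through that arithmetic. The function names, the 8 KB request size, and the 20 MBps media transfer rate are illustrative assumptions, not values taken from a drive specification, and bus and controller overheads are ignored.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency: half the time of one revolution."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2

def estimated_access_ms(seek_ms, rpm, transfer_kb, media_rate_mbps):
    """Rough per-request time: seek + average rotational latency + media transfer."""
    transfer_ms = transfer_kb / 1024 / media_rate_mbps * 1000
    return seek_ms + avg_rotational_latency_ms(rpm) + transfer_ms

for rpm in (7200, 10_000):
    print(f"{rpm} RPM: {avg_rotational_latency_ms(rpm):.2f} ms average rotational latency")

# Hypothetical request: 6 ms seek, 10,000 RPM, 8 KB read at an assumed 20 MBps media rate.
print(f"{estimated_access_ms(6, 10_000, 8, 20):.2f} ms estimated access time")
```

For small requests the transfer term is a fraction of a millisecond, so the mechanical delays (seek and rotation) account for almost all of the access time.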

7.5 RAID summary

Most of us have heard of RAID (redundant array of independent disks) technology. Unfortunately, there is still significant confusion about how RAID actually works and the performance implications of each RAID strategy. Therefore, this section presents a brief overview of RAID and the performance issues as they relate to commercial server environments.

RAID was created by computer scientists at the University of California at Berkeley, to address the huge gap between computer I/O requirements and single disk drive latency and throughput. RAID is a collection of techniques that treat multiple, inexpensive disk drives as a unit, with the object of improving performance and/or reliability. IBM and the IT industry have also introduced more RAID levels to meet industry demand. The following RAID strategies are defined by the Berkeley scientists, IBM and the IT industry:

Table 12. RAID summary

RAID level   Fault tolerant?   Description
RAID-0       No                All data evenly distributed to all drives.
RAID-1       Yes               A mirrored copy of one drive to another drive (2 disks).
RAID-1E      Yes               Mirrored copies of each drive.
RAID-3       Yes               Single checksum drive. Bits of data are striped across N-1 drives.
RAID-4       Yes               Single checksum drive. Blocks of data are striped across N-1 drives.
RAID-5       Yes               Distributed checksum. Both data and parity are striped across all drives.
RAID-5E      Yes               Distributed checksum and hot-spare. Data, parity and hot-spare are striped across all drives.

RAID-3 is useful for scientific applications that require increased byte throughput. It has very poor random access characteristics, and is not generally used in commercial applications.

RAID-4 uses a single checksum drive that becomes a significant bottleneck in random commercial applications. It is not likely to be used by a significant number of customers because of its slow performance.

RAID strategies that are supported by the IBM ServeRAID adapter are:

• RAID-0
• RAID-1
• RAID-1E
• RAID-5
• RAID-5E
• Composite RAID levels, such as RAID-10 and RAID-50

7.5.1 RAID-0

RAID-0 is a technique that stripes data evenly across all disk drives in the array. Strictly, it is not a RAID level, as no redundancy is provided. On average, accesses will be random, thus keeping each drive equally busy.

SCSI has the ability to process multiple, simultaneous I/O requests, and I/O performance is improved because all drives can contribute to system I/O throughput. Since RAID-0 has no fault tolerance, when a single drive fails, the entire array becomes unavailable.

RAID-0 offers the fastest performance of any RAID strategy for random commercial workloads. RAID-0 also has the lowest cost of implementation because no redundant drives are required.

Figure 38. RAID-0: All data evenly distributed across all drives but there is no fault tolerance
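The even distribution of data in RAID-0 amounts to a round-robin mapping from logical stripe units to drives. The following is a minimal sketch of that mapping, assuming zero-based numbering; the function name raid0_locate is hypothetical and the sketch ignores the stripe unit size and any controller-specific metadata.

```python
def raid0_locate(stripe_unit, n_drives):
    """Map a zero-based logical stripe unit to (drive index, row on that drive)."""
    return stripe_unit % n_drives, stripe_unit // n_drives

# Nine logical stripe units spread over a three-drive array.
for unit in range(9):
    drive, row = raid0_locate(unit, 3)
    print(f"logical unit {unit + 1} -> drive {drive + 1}, row {row + 1}")
```

Consecutive logical units land on different drives, which is why random accesses tend to keep all drives in the array busy.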

7.5.2 RAID-1

RAID-1 provides fault tolerance by mirroring one drive to another drive. The mirror drive ensures access to data should a drive fail. RAID-1 also has good I/O throughput performance compared to single-drive configurations because read operations can be performed on any data record on any drive contained within the array.

Most array controllers (including the ServeRAID family) do not attempt to optimize read latency by issuing the same read request to both drives in the mirrored pair. The drive in the pair that is least busy is issued the read command, leaving the other drive to perform another read operation. This technique ensures maximum read throughput.

Write performance is somewhat reduced because both drives in the mirrored pair must complete the write operation. That is, two physical write operations must occur for each write command generated by the operating system.


RAID-1 offers significantly better I/O throughput performance than RAID-5. However, RAID-1 is somewhat slower than RAID-0.

Figure 39. RAID-1: Fault-tolerant. A mirrored copy of one drive to another drive.

7.5.3 RAID-1E

RAID-1 Enhanced (which will be referred to as RAID-1E throughout the rest of this document), is only implemented by the IBM ServeRAID adapter and allows a RAID-1 array to consist of three or more disk drives. “Regular” RAID-1 consists of exactly two drives.

The data stripe is spread across all disks in the array to maximize the number of spindles that are involved in an I/O request to achieve maximum performance. RAID-1E is also called mirrored stripe, as a complete stripe of data is mirrored to another stripe within the set of disks. Like RAID-1, only half of the total disk space is usable; the other half is used by the mirror.

Figure 40. RAID-1E: Mirrored copies of each drive
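One way to picture a mirrored stripe is as alternating rows: a row of data followed by a row holding the same data shifted by one drive, so that each copy sits on a different spindle from its original. The sketch below encodes that idea. It is an illustrative layout consistent with the description above, not a statement of how the ServeRAID firmware actually places data, and the function name raid1e_locate is hypothetical.

```python
def raid1e_locate(stripe_unit, n_drives):
    """Return ((drive, row) of the primary copy, (drive, row) of the mirror copy).

    Assumed layout: even rows hold primary data, and the following odd row
    holds the mirror of that stripe rotated by one drive.
    """
    drive = stripe_unit % n_drives
    data_row = 2 * (stripe_unit // n_drives)
    primary = (drive, data_row)
    mirror = ((drive + 1) % n_drives, data_row + 1)
    return primary, mirror

for unit in range(6):
    primary, mirror = raid1e_locate(unit, 3)
    print(f"unit {unit + 1}: primary on drive {primary[0] + 1}, mirror on drive {mirror[0] + 1}")
```

Because the mirror row is rotated, a single drive failure never removes both copies of a stripe unit, and the layout works with an odd number of drives.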

Because you can have more than two drives (up to 16), RAID-1E will outperform RAID-1. The only situation where RAID-1 will perform better than RAID-1E is the reading of sequential data. The reason is that RAID-1E stripes data across multiple drives, so a sequential read is interleaved across different drives and seek operations occur more frequently. In RAID-1, data is not interleaved, so fewer seek operations occur for sequential I/O.


7.5.4 RAID-5

RAID-5 offers an optimal balance between price and performance for most commercial server workloads. RAID-5 provides single-drive fault tolerance by implementing a technique called single equation single unknown. This technique says that if any single term in an equation is unknown, the equation can be solved to exactly one solution.

The RAID-5 controller calculates a checksum (parity stripe in Figure 41) using a logic function known as an exclusive-or (XOR) operation. The checksum is the XOR of all data elements in a row. The XOR operation can be performed quickly by the RAID controller hardware, and its result is used to solve for the unknown data element.

In Figure 41, addition is used instead of XOR to illustrate the technique: stripe 1 + stripe 2 + stripe 3 = parity stripe 1-3. Should drive one fail, stripe 1 becomes unknown and the equation becomes X + stripe 2 + stripe 3 = parity stripe 1-3. The controller solves for X and returns stripe 1 as the result.

A significant benefit of RAID-5 is the low cost of implementation, especially for configurations requiring a large number of disk drives. To achieve fault tolerance, only one additional disk is required. The checksum information is evenly distributed over all drives, and checksum update operations are evenly balanced within the array.

Figure 41. RAID-5: Both data and parity are striped across all drives
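The single equation, single unknown idea can be shown with XOR directly. The following is a minimal sketch using made-up two-byte stripes; the helper name xor_blocks is hypothetical, and a real controller performs the same operation in hardware over whole stripe units.

```python
from functools import reduce

def xor_blocks(*blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical contents of three data stripes in one row.
stripe1, stripe2, stripe3 = b"\x0f\xa0", b"\x33\x55", b"\xc1\x7e"

# The checksum is the XOR of all data elements in the row.
parity = xor_blocks(stripe1, stripe2, stripe3)

# If the drive holding stripe 1 fails, XOR the survivors with the parity
# to solve for the unknown element.
rebuilt = xor_blocks(parity, stripe2, stripe3)
print(rebuilt == stripe1)
```

XOR is its own inverse, so the same routine both generates the checksum and recovers any one missing element.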

However, RAID-5 yields lower I/O throughput than RAID-0 and RAID-1. This is due to the additional checksum calculation and write operations required. In general, I/O throughput with RAID-5 is 30-50% lower than with RAID-1. (The actual result depends upon the percentage of write operations.) A workload with a greater percentage of write requests generally has a lower RAID-5 throughput. RAID-5 will provide I/O throughput performance similar to RAID-0 when the workload does not require write operations (read only). For more information on RAID-5 performance, see 7.6, “ServeRAID RAID-5 algorithms”.


7.5.5 RAID-5E

Figure 42. RAID-5E: The hot spare is integrated into all disks, instead of a separate disk

RAID-5E was invented by IBM research and is a technique that distributes the hot-spare drive space over the N+1 drives comprising the RAID-5 array plus standard hot-spare drive. It was first implemented in ServeRAID firmware V3.5.

Adding a hot-spare drive to a server protects data by reducing the time spent in the critical state. This technique does not make maximum use of the hot-spare drive because it sits idle until a failure occurs. Often many years can elapse before the hot-spare drive is ever used. IBM invented a method to utilize the hot-spare drive to increase performance of the RAID-5 array during typical processing and preserve the hot-spare recovery technique. This method of incorporating the hot spare into the RAID array is called RAID-5E. RAID-5E is designed to increase the normal operating performance of a RAID-5 array in two ways:

• The hot-spare drive contains data that can be accessed during normal operation. The RAID-5 array now has an extra drive to contribute to the throughput of read and write operations. Standard 10,000 RPM drives can perform more than 100 I/O operations per second so the RAID-5 array throughput is increased with this extra I/O capability.

• The data in RAID-5E is distributed over N+1 drives instead of N as is done for RAID-5. As a result, the data occupies fewer tracks on each drive. This has the effect of physically utilizing less space on each drive, keeping the head movement more localized and reducing seek times.

Together, these improvements yield a typical system-level performance gain of about 10-20%.

Another benefit of RAID-5E is the faster rebuild times needed to reconstruct a failed drive. In a standard RAID-5 hot-spare configuration the rebuild of a failed drive requires serialized write operations to the single hot-spare drive. Using RAID-5E the hot spare drive space is evenly distributed across all drives, so the rebuild operations are evenly distributed to all remaining drives in the array. Rebuild times with RAID-5E can be dramatically faster than rebuild times using a standard hot-spare configuration.

The only downside of RAID-5E is that the hot-spare drive cannot be shared across multiple physical arrays as can be done with standard RAID-5 plus hot-spare. This RAID-5 technique is more cost efficient for multiple arrays because it allows a single hot-spare drive to provide coverage for multiple physical arrays. This reduces the cost of using a hot-spare drive but the sacrifice is the inability to handle separate drive failures within different arrays. IBM ServeRAID adapters offer increased flexibility by providing the choice to use either standard RAID-5 with hot-spare or the newer integrated hot-spare provided with RAID-5E.

While RAID-5E provides a performance improvement for most operating environments there is a special case where its performance can be slower than RAID-5. Consider a three-drive RAID-5 with hot-spare configuration as shown in Figure 43. This configuration employs a total of four drives but the hot-spare drive is idle so for a performance comparison it can be ignored. A four-drive RAID-5E configuration would have data and checksum on four separate drives.

Figure 43. Writing a 16 KB block to a RAID-5 array with an 8 KB stripe size

Referring to Figure 43, whenever a write operation is issued to the controller that is two times the stripe size (for example, a 16 KB I/O request to an array with an 8 KB stripe size), a three-drive RAID-5 configuration would not require any reads because the write operation would contain all the data needed for each of the two drives. The checksum would be generated by the array controller (step 2) and immediately written to the corresponding drive (step 4) without the need to read any existing data or checksum. This entire series of events would require two data writes, one to each of the drives storing the data stripe (step 3), and one write to the drive storing the checksum (step 4), for a total of three write operations.

Contrast these events to the operation of a comparable RAID-5E array which contains four drives as shown in Figure 44. In this case, in order to calculate the checksum, a read must be performed of the data stripe on the extra drive (step 2). This extra read was not performed with the three-drive RAID-5 configuration and it slows the RAID-5E array for write operations that are twice the stripe size.
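The difference between the two cases can be counted with a small model. The sketch below assumes the write starts on a row boundary, fits within one row, and that the controller computes the checksum by reading whatever data stripe units in the row are not being overwritten. The function name write_ops and these assumptions are illustrative only, not a description of the ServeRAID firmware.

```python
def write_ops(n_drives, stripe_kb, write_kb):
    """Return (reads, writes) for an aligned write to one row of the array.

    Each row holds n_drives - 1 data stripe units plus one checksum unit.
    """
    data_units_per_row = n_drives - 1
    if write_kb % stripe_kb != 0:
        raise ValueError("sketch only handles whole stripe units")
    units_written = write_kb // stripe_kb
    if not 1 <= units_written <= data_units_per_row:
        raise ValueError("sketch only handles writes within a single row")
    reads = data_units_per_row - units_written  # untouched data needed for the checksum
    writes = units_written + 1                  # new data plus the new checksum
    return reads, writes

print(write_ops(3, 8, 16))  # three-drive RAID-5: (0, 3), no reads needed
print(write_ops(4, 8, 16))  # four-drive RAID-5E: (1, 3), one extra read
```

With a larger stripe size the same 16 KB request would touch a single stripe unit in either configuration, so the three-drive array would lose its advantage for this request size.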

Figure 44. Writing a 16 KB block to a RAID-5E array with an 8 KB stripe size

This problem with RAID-5E can be avoided with proper stripe size selection. By monitoring the average I/O size in bytes, or knowing the I/O size generated by the application, a large enough stripe size can be selected so that this performance degradation rarely occurs.

7.5.6 Composite RAID levels

The ServeRAID-4 adapter family supports composite RAID levels. This means that it supports RAID arrays that are joined together to form larger RAID arrays.

For example, RAID-10 is the result of forming a RAID-0 array from two or more RAID-1 arrays. With four SCSI channels each supporting 15 drives, this means you can theoretically have up to 60 drives in one array. With the EXP200, the limit is 40 disks and with the EXP300, the limit is 56 disks. A ServeRAID RAID-10 array is shown in Figure 45:

Figure 45. RAID-10: A striped set of RAID-1 arrays

Likewise, a striped set of RAID-5 arrays is shown in Figure 46.

Figure 46. RAID-50: A striped set of RAID-5 arrays


The ServeRAID-4 family supports the following combinations:

Table 13. Composite RAID levels supported by ServeRAID-4 adapters

RAID level   The sub-logical array is   and the spanned array is
RAID-00      RAID-0                     RAID-0
RAID-1E0     RAID-1E                    RAID-0

Table 14 shows a summary of the performance characteristics of the three RAID levels commonly used in array controllers. A comparison is also made between small and large I/O data transfers.

Table 14. Summary of RAID performance characteristics

RAID level    Data capacity (1)   Sequential I/O performance (2)   Random I/O performance (2)   Data availability (2)
                                  Read      Write                  Read      Write              With hot spare   Without hot spare
Single Disk   n                   6         6                      4         4                  0                Not applicable
RAID-0        n                   10        10                     10        10                 0                Not applicable
RAID-1        n/2                 7         5                      6         3                  7                8
RAID-1E       n/2                 5         4                      7         6                  8                9
RAID-5        n-1                 7         7 (3)                  7         4                  7                8
RAID-5E       n-2                 8         8 (3)                  8         5                  8                N/A
RAID-10       n/2                 10        9                      7         6                  9                9

Notes:

(1) In the data capacity column, n refers to the number of equally sized disks in the array.

(2) 10 = best, 1 = worst. You should only compare values within each column. Comparisons between columns are not valid for this table.

7.6 ServeRAID RAID-5 algorithms

The IBM ServeRAID adapter uses one of two algorithms for the calculation of RAID-5 parity. These algorithms ensure the best performance of RAID-5 write operations in array configurations, regardless of the number of drives in the array:

• Use “read/modify write” for RAID-5 arrays of five drives or more.
• Use “full XOR” for RAID-5 arrays of three or four drives.

This section compares these two algorithms.

7.6.1 Read/modify write algorithm

The read/modify write algorithm is optimized for configurations that use more than four drives. The RAID-5 read/modify write algorithm is described in Figure 47. This algorithm always requires four disk operations for each write command, regardless of the number of drives in the RAID-5 array. As per Figure 47, the steps that occur are:

1. Read old data (data1)
2. Read old checksum (CS6)
3. Calculate the new checksum from old data, new data and old checksum
4. Write new data (data4)
5. Write new checksum (CS9)

Figure 47. Read/modify write algorithm: four I/O operations for every write command

Regardless of the number of drives, with the read/modify write algorithm, the write command will always require four I/O operations: two reads and two writes.

The algorithm is called “read/modify write” because it reads the checksum, modifies the checksum, then writes the checksum.
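The checksum arithmetic in step 3 can be sketched in a few lines of Python. This is a minimal illustration, assuming the checksum is the byte-wise XOR parity of the data blocks in the stripe. The helper names and the 4-byte block values are invented for the example and are not taken from the ServeRAID firmware.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same size")
    return bytes(x ^ y for x, y in zip(a, b))


def read_modify_write_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """Step 3: derive the new checksum from the old data, the old checksum
    and the new data, without reading the other data blocks in the stripe."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


# Tiny 4-byte "stripe units" named after Figure 47 (values are made up).
data1 = bytes([0x11, 0x22, 0x33, 0x44])
data2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
data3 = bytes([0x0F, 0x0E, 0x0D, 0x0C])
cs6 = xor_bytes(xor_bytes(data1, data2), data3)    # old checksum

data4 = bytes([0xDE, 0xAD, 0xBE, 0xEF])            # replaces data1
cs9 = read_modify_write_parity(data1, cs6, data4)  # new checksum
```

Under the XOR assumption, the result equals the parity recomputed from data4, data2 and data3, even though only data1 and CS6 were read from disk.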

[Figure 47 diagram: Read/modify write algorithm. Command: update data1 to data4. The ServeRAID adapter cache holds data1, CS6, data4 and the new CS9. Step 1 (Read), Step 2 (Read), Step 3 (Calc), Step 4 (Write), Step 5 (Write).]


7.6.2 Full XOR algorithm

A different method can be used to generate RAID-5 checksum information for a write operation that modifies data1 to be data4. This method is called the “full exclusive or” algorithm (full XOR algorithm).

It involves disk read operations of data2 and data3. The full XOR algorithm then creates a new checksum from data4 + data2 + data3, writes the modified data (data4), and overwrites the old checksum with the new checksum (CS9). In this case, four disk operations are performed.

The following operations (as per Figure 48) show the steps involved in the full XOR algorithm:

1. Read data2
2. Read data3
3. Calculate new checksum (CS9) from new data (data4), data2 and data3
4. Write data4
5. Write checksum (CS9)

Figure 48. Full XOR algorithm

[Figure 48 diagram: Full XOR algorithm. Command: update data1 to data4. The ServeRAID adapter cache holds data2, data3, data4 and the new CS9. Step 1 (Read), Step 2 (Read), Step 3 (Calc), Step 4 (Write), Step 5 (Write).]

In this case, four disk operations are performed: two reads and two writes. If the number of disks in the array increases, then the number of read operations also increases:

• Five disks: five I/O operations (three reads and two writes)
• Six disks: six I/O operations (four reads and two writes)
• n disks: n I/O operations (n-2 reads and two writes)

The extra read operations required by this algorithm cause the performance of write commands to degrade as the number of drives increases.

The algorithm is called “full XOR” due to the way the checksum is calculated. The checksum is calculated from all the data and then the calculated checksum is written to disk. The original checksum is not used in the calculation.

However, for three disks, only three I/O operations are required: one read and two writes. Thus the following conclusions can be reached:

• For 3-drive RAID-5 arrays, full XOR is faster.
• For 4-drive RAID-5 arrays, the algorithms are the same.
• For 5+ drive RAID-5 arrays, read/modify write is faster.

Thus, for a four-drive configuration, the full XOR algorithm requires the same number of disk operations as the read/modify write algorithm. A RAID-5 configuration using five drives would require four disk operations for the read/modify write algorithm, but five disk operations for the full XOR algorithm. Consequently, the number of disk operations for the full XOR algorithm increases with the number of drives configured in the RAID-5 array.

To take advantage of this, Version 2.3 of the ServeRAID firmware introduced a technique which used the better of these two algorithms depending on the number of drives in the array. It uses full XOR when the adapter is configured with three or four drives in a RAID-5 array, and read/modify write when the adapter is configured with five or more drives.
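The operation counts behind this selection rule can be tabulated with a short Python sketch. The function names are hypothetical; the counts and the drive-count threshold are the ones stated in the text above, not code from the ServeRAID firmware.

```python
def read_modify_write_ops() -> tuple[int, int]:
    """(reads, writes) for one small write; independent of array size."""
    return (2, 2)


def full_xor_ops(n_drives: int) -> tuple[int, int]:
    """(reads, writes) for one small write with full XOR.

    Every data block except the one being replaced must be read:
    n_drives - 2 reads (one drive holds the checksum), plus two writes.
    """
    if n_drives < 3:
        raise ValueError("RAID-5 needs at least three drives")
    return (n_drives - 2, 2)


def choose_algorithm(n_drives: int) -> str:
    """Selection rule described in the text: full XOR for three or four
    drives, read/modify write for five or more."""
    if n_drives < 3:
        raise ValueError("RAID-5 needs at least three drives")
    return "full XOR" if n_drives <= 4 else "read/modify write"


for n in (3, 4, 5, 6, 8):
    print(n, sum(full_xor_ops(n)), sum(read_modify_write_ops()), choose_algorithm(n))
```

The loop prints, for each array size, the total disk operations for full XOR, the total for read/modify write, and the algorithm the rule selects.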

7.6.3 Sequential write commands

The choice between these two algorithms also affects sequential write commands.

When the ServeRAID adapter is configured for RAID-5 and the server I/O is sequential write operations (for example, when copying files to the server or when building a database), additional performance benefits can be achieved using a full XOR algorithm and using a write-back cache policy. (The benefits of write-back cache are discussed in 7.7.7, “Disk cache write-back versus write-through” on page 153.)


The ServeRAID firmware V2.7 has intelligence to detect this type of I/O, and switches to full XOR. This would cause each data element, data1, data2, data3, and checksum to be stored in the ServeRAID adapter cache after the first operation to update data1 to data4. In write-back mode, the updates to data2, data3 and the successive updates to the checksum could all be accomplished in cache memory. After the entire group of stripe elements is sequentially updated in cache memory, only three disk operations are required to store the updated data2, data3, and checksum information on disk.

This feature of the ServeRAID can improve database load times in RAID-5 mode by up to eight times over earlier ServeRAID firmware levels.

7.7 Factors affecting disk array controller performance

Many factors affect array controller performance. The most important considerations (in order of importance) for configuring the IBM ServeRAID adapter are:

• RAID strategy
• Number of drives
• Drive performance
• Logical drive configuration
• Stripe size
• SCSI bus organization and speed
• Disk cache write-back versus write-through
• RAID adapter cache size
• Device drivers
• Firmware

7.7.1 RAID strategy

Your RAID strategy should be carefully selected because it significantly affects disk subsystem performance. Figure 49 illustrates the performance differences between RAID-0, RAID-1E and RAID-5 for a server configured with 10,000 RPM Fast/Wide SCSI-2 drives and the IBM ServeRAID-3HB adapter with v3.6 code. The chart shows the RAID-0 configuration delivering about 97% greater throughput than RAID-5 and 35% greater throughput than RAID-1E.

RAID-0 has no fault tolerance and is, therefore, best utilized for read-only environments when downtime for possible backup recovery is acceptable. RAID-1E or RAID-5 should be selected for applications requiring fault tolerance. RAID-1E is usually selected when the number of drives is low (less than six) and the price for purchasing additional drives is acceptable.

RAID-1E offers about 45% more throughput than RAID-5. These performance considerations should be understood before selecting a fault-tolerant RAID strategy.

Figure 49. Comparing RAID levels

[Figure 49 data: I/O operations per second. RAID-0: 3575; RAID-1E: 2641; RAID-5: 1816. Configuration: Windows NT Server 4.0; ServeRAID-3HB; firmware/driver v3.6; maximum number of drives; 10,000 RPM; 8 KB I/O size; random I/O mix 67/33 R/W.]

In many cases, RAID-5 is the best choice because it provides the best price and performance combination for configurations requiring the capacity of five or more disk drives. RAID-5 performance approaches RAID-0 performance for workloads where the frequency of write operations is low. Servers executing applications that require fast read access to data and high availability in the event of a drive failure should employ RAID-5.

For more information about RAID-5 performance, see 7.6, “ServeRAID RAID-5 algorithms” on page 136.

7.7.2 Number of drives

The number of disk drives significantly affects performance because each drive contributes to total system throughput. Capacity requirements are often the only consideration used to determine the number of disk drives configured in a server. Throughput requirements are usually not well understood or are completely ignored. Capacity is used because it is easily estimated and is often the only information available.

The result is a server configured with sufficient disk space, but insufficient disk performance to keep users working efficiently. High-capacity drives have the lowest price per byte of available storage and are usually selected to reduce total system price. This often results in disappointing performance, particularly if the total number of drives is insufficient.

It is difficult to accurately specify server application throughput requirements when attempting to determine the disk subsystem configuration. Disk subsystem throughput measurements are complex. To express a user requirement in terms of “bytes per second” would be meaningless because the disk subsystem’s byte throughput changes as the database grows and becomes fragmented, and as new applications are added.

The best way to understand disk I/O and users’ throughput requirements is to monitor an existing server. Tools such as the Windows 2000 Performance console can be used to examine the logical drive queue depth and disk transfer rate (as described in Chapter 11, “Windows 2000 Performance console” on page 221). Logical drives that have an average queue depth much greater than the number of drives in the array are very busy. This indicates that performance would be improved by adding drives to the array.

Measurements show that server throughput for most server application workloads increases as the number of drives configured in the server is increased. As the number of drives is increased, performance is usually improved for all RAID strategies. Server throughput continues to increase each time drives are added to the server. This can be seen in Figure 50.

In general, adding drives is one of the most effective changes that can be made to improve server performance.
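As a rough planning aid, the Figure 50 measurements (183, 221 and 283 Tps with 4, 6 and 8 drives) fit a rule of thumb of about 50% more throughput for each doubling of the drive count. The Python sketch below applies that rule. The function name is hypothetical, and the geometric interpolation for drive counts that are not exact doublings is an assumption made here, not something the measurements establish. The projection also ignores the point at which another component becomes the bottleneck.

```python
import math


def projected_throughput(base_tps: float, base_drives: int, new_drives: int,
                         gain_per_doubling: float = 0.5) -> float:
    """Project throughput after changing the number of drives.

    Assumes each doubling of drives multiplies throughput by
    (1 + gain_per_doubling); intermediate counts are interpolated
    geometrically (an assumption of this sketch).
    """
    if base_drives <= 0 or new_drives <= 0:
        raise ValueError("drive counts must be positive")
    doublings = math.log2(new_drives / base_drives)
    return base_tps * (1 + gain_per_doubling) ** doublings


# Starting from the measured 4-drive result in Figure 50 (183 Tps):
print(round(projected_throughput(183, 4, 6)))  # measured value was 221
print(round(projected_throughput(183, 4, 8)))  # measured value was 283
```

The projections land within a few percent of the measured 6-drive and 8-drive results, which is as much precision as a rule of thumb of this kind can be expected to give.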

Figure 50. Improving performance by adding drives to arrays

[Figure 50 data: transactions per second (Tps). 4 drives RAID-0: 183; 6 drives RAID-0: 221; 8 drives RAID-0: 283. Configuration: Windows NT 4.0; SQL Server 6.5; ServeRAID II; 4.5 GB 7200 RPM drives.]

For most server workloads, when the number of drives in the active logical array is doubled, server throughput will improve by about 50% until other bottlenecks occur.

This trend will continue until another server component becomes the bottleneck. In general, most servers are configured with an insufficient number of disk drives. Therefore, performance increases as drives are added. Similar gains can be expected for all I/O-intensive server applications such as office-application file serving, Lotus Notes, Oracle, DB2 and Microsoft SQL Server.

If you are using one of the IBM ServeRAID family of RAID adapters, you can use the logical drive migration feature to add drives to existing arrays without disrupting users or losing data.

7.7.3 Drive performance

Drive performance contributes to overall server throughput because faster drives perform disk I/O in less time. There are four major components to the time it takes a disk drive to execute and complete a user request:


• Command overhead

This is the time it takes for the drive’s electronics to process the I/O request. The time depends on whether it is a read or write request and whether the command can be satisfied from the drive’s buffer. This value ranges from about 0.1 ms for a buffer hit to 0.5 ms for a buffer miss.

• Seek time

This is the time it takes to move the drive head from its current cylinder location to the target cylinder. As the radius of the drives has been decreasing, and drive components have become smaller and lighter, so too has the seek time been decreasing. Average seek time is usually 5-7 ms for most current SCSI-2 drives used in servers today.

• Rotational latency

Once the head is at the target cylinder, the time it takes for the target sector to rotate under the head is called the rotational latency. Average latency is half the time it takes the drive to complete one rotation so it is inversely proportional to the RPM value of the drive:

- 5400 RPM drives have a 5.6 ms latency
- 7200 RPM drives have a 4.2 ms latency
- 10,000 RPM drives have a 3.0 ms latency

• Data transfer time

This value depends on the media data rate, which is how fast data can be transferred from the magnetic recording media, and the interface data rate, which is how fast data can be transferred between the disk drive and disk controller (that is, the SCSI transfer rate).

The media data rate improves as a result of greater recording density and faster rotational speeds. A typical media transfer time is 0.8 ms. The interface data rate for SCSI-2 F/W is 20 MBps. With 4 KB I/O transfers (which are typical for Windows NT Server and Windows 2000), the interface data transfer time is 0.2 ms. Hence the data transfer time is approximately 1 ms.
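The four components above can be combined into a simple estimate of the time for one random I/O. The Python sketch below reproduces the latency and interface-transfer figures quoted in the list. The function names are hypothetical, 1 MBps is assumed to mean 1,000,000 bytes per second, and the 6.0 ms seek time is just a mid-range value picked from the 5-7 ms quoted above.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average latency is half of one rotation: 0.5 * (60,000 ms / RPM)."""
    return 0.5 * 60_000 / rpm


def interface_transfer_ms(io_bytes: int, bus_mbps: float) -> float:
    """Time to move io_bytes over the interface (1 MBps = 1e6 bytes/s assumed)."""
    return io_bytes / (bus_mbps * 1_000_000) * 1000


def access_time_ms(command_overhead_ms: float, seek_ms: float, rpm: float,
                   media_transfer_ms: float, io_bytes: int, bus_mbps: float) -> float:
    """Sum of the four components: overhead, seek, latency, data transfer."""
    return (command_overhead_ms
            + seek_ms
            + avg_rotational_latency_ms(rpm)
            + media_transfer_ms
            + interface_transfer_ms(io_bytes, bus_mbps))


for rpm in (5400, 7200, 10_000):
    print(rpm, round(avg_rotational_latency_ms(rpm), 1))

# One 4 KB random read on a 7200 RPM SCSI-2 F/W drive (buffer miss).
total = access_time_ms(0.5, 6.0, 7200, 0.8, 4096, 20)
print(round(total, 1))
```

With these inputs, seek and rotational latency together account for most of the total, which is the point made in the paragraph that follows.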

As you can see, the significant values that affect performance are the seek time and the rotational latency. This is true for random I/O (which is normal for a multi-user server). Seek times will continue to improve as the physical dimensions of drives shrink.

For sequential I/O (such as with servers with small numbers of users requesting large amounts of data) or for I/O requests of large block sizes (for example 64 KB), the data transfer time does become important when compared to seek and latency, so the use of Ultra SCSI, Ultra2 SCSI or Ultra3 SCSI can have a significant positive effect on overall subsystem performance.


Likewise, when caching and read-ahead are employed on the drives themselves, the time taken to perform the seek and rotation is eliminated, so the data transfer time becomes very significant.

The easiest way to improve disk performance is to increase the number of accesses that can be made simultaneously. This is achieved by using many drives in a RAID array and spreading the data requests across all drives as described in 7.7.2, “Number of drives” on page 141.

Table 15 shows the seek and latency values and buffer sizes for three of IBM’s high-end drives.

Table 15. Comparing 10000 and 7200 RPM drives

Disk             Capacity   RPM    Seek     Latency   Buffer size   Media data transfer rate (MBps)
Ultrastar 36LP   18.3 GB    7200   6.8 ms   4.17 ms   4 MB          248-400
Ultrastar 36LZ   18.3 GB    10K    4.9 ms   2.99 ms   4 MB          280-452
Ultrastar 72ZX   73.4 GB    10K    5.3 ms   2.99 ms   16 MB         280-473

7.7.4 Logical drive configuration

Using multiple logical drives on a single physical array is convenient for managing the location of different file types. However, depending on the configuration, it can significantly reduce server performance.

When you use multiple logical drives, you are physically spreading the data across different sections of the array disks. If I/O is performed to each of the logical drives, the disk heads have to seek further across the disk surface than when the data is stored on one logical drive. Using multiple logical drives greatly increases seek time and can slow performance by as much as 25%. An example of this is creating two logical drives in one RAID array and putting a database on one logical drive and the transaction log on the other. Because heavy I/O is being performed on both, the performance will be poor. If the two logical drives are configured with the operating system on one and data on the other, then there should be little I/O to the operating system code once the server has booted, so this type of configuration is acceptable.

It is best to put the page file on the same drive as the data when using one large physical array. This is counterintuitive: most people think the page file should be put on the operating system drive since the operating system will not see much I/O during runtime. However, this causes long seek operations as the head swings over the two partitions. Putting the data and page file on the data array keeps the I/O localized and reduces seek time.

Of course, this is not the optimal case, especially for applications with heavy paging. Ideally, the page drive will be a separate device that can be formatted to the correct stripe size to match paging. In general, most applications will not page when given sufficient RAM, so this is usually not a problem.

The fastest configuration is a single logical drive for each physical RAID array. Instead of using logical drives to manage files, you should create directories and store each type of file in a different directory. This will significantly improve disk performance by reducing seek times because the data will be as physically close together as possible.

If you really want or need to partition your data and you have a sufficient number of disks, you should configure multiple RAID arrays instead of configuring multiple logical drives in one RAID array. This will improve disk performance; seek time will be reduced because the data will be physically closer together on each drive.

Note: If you plan to use RAID-5E arrays, you can only have one logical drive per array.

7.7.5 Stripe size

With RAID technology, data is striped across an array of hard disk drives. Striping is the process of storing data across all the disk drives that are grouped in an array.

The granularity at which data from one file is stored on one drive of the array before subsequent data is stored on the next drive of the array is called the stripe unit (also referred to as interleave depth). For the ServeRAID adapter family, the stripe unit size can be set to 8 KB, 16 KB, 32 KB, or 64 KB. With Netfinity Fibre Channel, a stripe unit is called a segment, and segment sizes can also be 8 KB, 16 KB, 32 KB, or 64 KB.

The collection of these stripe units, from the first drive of the array to the last drive of the array, is called a stripe.


Figure 51. RAID stripes and stripe units

Note: The term stripe size should really be called stripe unit size since it refers to the length of the stripe unit (the piece of space on each drive in the array).

Using stripes of data balances the I/O requests within the “logical drive”. On average, each disk will perform an equal number of I/O operations, thereby contributing to overall server throughput. Stripe size has no effect on the total capacity of the logical disk drive.
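The relationship between a byte offset on the logical drive, the stripe unit and the stripe can be sketched as follows. This is a simplified illustration that assumes a plain RAID-0 style layout with no parity rotation, as in Figure 51, where SU1 to SU3 form the first stripe and SU4 to SU6 the second. The function name and the zero-based numbering are choices made for this example.

```python
def locate(offset_bytes: int, stripe_unit_kb: int, n_drives: int) -> tuple[int, int, int]:
    """Map a logical byte offset to (stripe unit index, drive index, stripe index).

    All indexes are zero-based, so SU1 in Figure 51 is stripe unit 0.
    """
    if offset_bytes < 0 or stripe_unit_kb <= 0 or n_drives <= 0:
        raise ValueError("invalid argument")
    unit_index = offset_bytes // (stripe_unit_kb * 1024)
    drive = unit_index % n_drives    # units are dealt out across the drives in turn
    stripe = unit_index // n_drives  # one stripe = one unit on every drive
    return unit_index, drive, stripe


# Three drives with an 8 KB stripe unit, as drawn in Figure 51.
print(locate(0, 8, 3))              # SU1: first drive, first stripe
print(locate(3 * 8 * 1024, 8, 3))   # SU4: first drive again, second stripe
```

Because consecutive stripe units land on consecutive drives, a stream of requests spread over the logical drive tends to touch every disk about equally often, which is the balancing effect described above.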

7.7.5.1 Selecting the correct stripe size

The selection of stripe size affects performance. In general, the stripe size should be at least as large as the median disk I/O request size generated by server applications.

• Selecting too small a stripe size can reduce performance. In this environment, the server application requests data that is larger than the stripe size, which results in two or more drives being accessed for each I/O request. Ideally, only a single disk I/O occurs for each I/O request.
• Selecting too large a stripe size can reduce performance because a larger than necessary disk operation might constantly slow each request. This is a problem particularly with RAID-5 where the complete stripe must be read from disk to calculate a checksum. Use too large a stripe, and extra data must be read each time the checksum is updated.

Selecting the correct stripe size is a matter of understanding the predominant request size performed by a particular application. Few applications use a single request size for each and every I/O request. Therefore, it is not possible to always have the ideal stripe size. However, there is always a best-compromise stripe size that will result in optimal I/O performance.

[Figure 51 diagram: stripe units SU1 to SU6 laid out across the drives of the array, with one stripe and one stripe unit labeled.]


There are two ways to determine the best stripe size:

• Use a rule of thumb as per Table 16.
• Monitor the I/O characteristics of an existing server.

The first and simplest way to choose a stripe size is to use Table 16. This table is based on tests performed by the Netfinity Performance Lab.

Table 16. Stripe size setting for various applications

Applications                                        Stripe size
Groupware (Lotus Domino, Exchange, etc.)            16 KB
Database server (Oracle, SQL Server, DB2, etc.)     16 KB
File server (Windows 2000, Windows NT)              16 KB
Web server                                          8 KB
Video file server                                   64 KB

Notes about Table 16:

• SQL Server 7.0 uses 8 KB I/O blocks but experiments have shown that performance can usually be improved by using double the I/O block size (that is, 16 KB).

• Oracle uses multiple block sizes: 2 KB, 4 KB or 8 KB. While a 16 KB stripe size is not the optimum for all cases, it is not significantly slower either. Further I/O analysis on specific customer data may determine that 8 KB or 16 KB block sizes produce better performance.

In general, stripe size only needs to be at least as large as the I/O size. Having a smaller stripe size implies multiple physical I/O operations for each logical I/O, which will cause a drop in performance. Using a larger stripe size implies a read-ahead function which may or may not improve performance. Table 16 offers rule-of-thumb settings. There is no way to offer the precise stripe size that will always give the best performance for every environment without doing extensive analysis on the specific workload.

The second way to determine the correct stripe size involves observing the application while it is running using the Windows 2000 Performance console. The key is to determine the average data transfer size being requested by the application and select a stripe size that best matches. Unfortunately, this method requires the system to be running, so it either requires another system running the same application or the reconfiguring of the existing disk array once the measurement has been made (and therefore backup, reformat and restore operations).

The Windows 2000 Performance console or Windows NT 4.0 Performance Monitor can help you determine the proper stripe size. Select:

• Object: PhysicalDisk

• Counter: Avg. Disk Bytes/Transfer

• Instance: the drive that is receiving the majority of the disk I/O

Monitor this value. As an example, the trend value for this counter is shown as the thick line in Figure 52. The running average is shown as indicated.

Figure 52. Average I/O size

Figure 52 represents an actual server application. It can be seen that the application request size (represented by Avg. Disk Bytes/Transfer) varies from a peak of 64 KB to about 20 KB for the two run periods.

As we said at the beginning of this section, in general, the stripe size should be at least as large as the median disk I/O request size generated by the server application.

[Figure 52 annotations: data drive average disk bytes per transfer; range of 20 KB to 64 KB; maximum 64 KB.]


This particular server was configured with an 8 KB stripe size, which produced very poor performance. Increasing the stripe size to 16 KB would improve performance and increasing the stripe size to 32 KB would increase performance even more. The simplest technique would be to place the time window around the run period and select a stripe size that is at least as large as the average size shown in the running average counter.
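The selection rule can be expressed as a small Python helper: take the median of the sampled Avg. Disk Bytes/Transfer values and pick the smallest supported stripe unit that is at least that large. The function name and the sample values are hypothetical (the samples are merely chosen to fall in the 20 KB to 64 KB range seen in Figure 52), and the supported sizes are the ServeRAID settings listed earlier in this section.

```python
import statistics

# Stripe unit sizes the ServeRAID family supports, in KB.
SUPPORTED_STRIPE_UNITS_KB = (8, 16, 32, 64)


def pick_stripe_unit_kb(samples_bytes: list[int]) -> int:
    """Smallest supported stripe unit >= the median request size.

    samples_bytes: readings of Avg. Disk Bytes/Transfer taken during the
    run period. Falls back to the largest size if no setting is big enough.
    """
    if not samples_bytes:
        raise ValueError("need at least one sample")
    median_bytes = statistics.median(samples_bytes)
    for size_kb in SUPPORTED_STRIPE_UNITS_KB:
        if size_kb * 1024 >= median_bytes:
            return size_kb
    return SUPPORTED_STRIPE_UNITS_KB[-1]


# Hypothetical readings between 20 KB and 64 KB.
samples = [20 * 1024, 24 * 1024, 64 * 1024, 32 * 1024, 28 * 1024]
print(pick_stripe_unit_kb(samples))
```

This is only a starting point. As noted above, few applications use a single request size, so the result is a best-compromise setting rather than a guaranteed optimum.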

7.7.5.2 Page file drive

Windows NT and Windows 2000 perform page transfers at up to 64 KB per operation, so the paging drive stripe size can be as large as 64 KB. However, in practice, it is usually closer to 32 KB because the application might not make demands for large blocks of memory which limits the size of the paging I/O.

Monitor average bytes per transfer as described in 7.7.5.1, “Selecting the correct stripe size” on page 147. Setting the stripe size to this average size can make a significant increase in performance by reducing the amount of physical disk I/O that occurs due to paging.

For example, if the stripe size is 8 KB and the page manager is doing 32 KB I/O transfers, then four physical disk reads or writes must occur for each page/sec you see in the Performance console. If the system is paging 10 pages/sec, then the disk will actually be doing 40 disk transfers/second.
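The arithmetic in this example generalizes to any I/O size and stripe size. The Python sketch below is a simplified model that assumes each paging request is aligned to a stripe unit boundary. The function names are hypothetical.

```python
import math


def physical_ios_per_request(io_kb: int, stripe_kb: int) -> int:
    """Number of stripe units (and so physical disk I/Os) one request touches,
    assuming the request starts on a stripe unit boundary."""
    if io_kb <= 0 or stripe_kb <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(io_kb / stripe_kb)


def disk_transfers_per_sec(requests_per_sec: float, io_kb: int, stripe_kb: int) -> float:
    """Physical disk transfers per second generated by the paging load."""
    return requests_per_sec * physical_ios_per_request(io_kb, stripe_kb)


print(physical_ios_per_request(32, 8))    # 32 KB page I/O on an 8 KB stripe
print(disk_transfers_per_sec(10, 32, 8))  # 10 pages/sec on the same array
print(disk_transfers_per_sec(10, 32, 32)) # same load with a matching stripe size
```

With a 32 KB stripe size, the same paging load produces one physical transfer per request instead of four.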

Note: If you wish to monitor disk activity, you need to enable the physical disk counters. In Windows NT 4.0, physical disk counters are disabled by default. To enable them, issue the command DISKPERF -Y then restart the computer. In Windows 2000, physical disk counters are enabled by default. Keeping this setting on all the time draws about 2-3% CPU, but if your CPU is not a bottleneck, this is irrelevant and can be ignored.

Type DISKPERF /? for more help on the DISKPERF command.

7.7.6 SCSI bus organization and speed

Concern often exists over the performance effects caused by the number of drives on the SCSI bus, or the speed at which the SCSI bus runs. Yet, in almost all modern server configurations, the SCSI bus is rarely the bottleneck. In most cases, optimal performance can be obtained by simply configuring 10 drives per Ultra SCSI bus. If the application is byte-I/O-intensive (as is the case with video or audio), five drives on one Ultra2 SCSI bus can be used for a moderate (10-20%) increase in system performance.

In general, it is rare that SCSI bus configuration or increasing SCSI bus speed can significantly improve overall server system performance. Consider that servers must access data stored on disk for each of the attached users. Each user is requesting access to different data stored in a unique location on the disk drives. Disk accesses are almost always random because the server must multiplex access to disk data for each user. This means that most server disk accesses require a seek and rotational latency before data is transferred across the SCSI bus.

7.7.6.1 SCSI bus speed

As described in 7.7.3, “Drive performance” on page 143, total disk seek and latency times average about 8-12 ms, depending on the speed of the drive. Transferring a 2 KB block over a 40 MBps SCSI bus takes about 0.5 ms, or approximately 1/20th of the total disk access time. It is easy to see that increasing the bus speed to 80 MBps will only improve the 1/20 portion of time to roughly 1/40 of the total time, resulting in a small fractional gain in overall performance.
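The size of that fractional gain can be worked out directly. The Python sketch below takes the fraction of each disk access spent on the SCSI bus (1/20 in the text above) and the factor by which the bus is sped up, and returns the overall speedup of the whole access. The function name is hypothetical, and the model assumes the rest of the access time is unchanged.

```python
def overall_speedup(bus_fraction: float, bus_speedup: float) -> float:
    """Overall speedup when only the bus portion of each access gets faster.

    bus_fraction: share of total access time spent on the bus (0 to 1).
    bus_speedup:  factor by which the bus transfer is accelerated.
    """
    if not 0 <= bus_fraction <= 1 or bus_speedup <= 0:
        raise ValueError("invalid argument")
    new_time = (1 - bus_fraction) + bus_fraction / bus_speedup
    return 1 / new_time


# Bus time is 1/20 of each access; going from 40 MBps to 80 MBps halves it.
gain = overall_speedup(1 / 20, 2)
print(f"{(gain - 1) * 100:.1f}% faster overall")
```

Doubling the bus speed in this model shortens each access by only 1/40 of its original time, which is why the text describes the gain as small.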

Tests have shown that for random I/O, drive throughput usually does not approach the limits of the SCSI bus.

In some cases, Ultra2 SCSI (80 MBps) can be shown to offer measurable performance improvements over Ultra SCSI (40 MBps). This usually occurs when measurements are made with a few drives (four to eight) running server benchmarks that transfer large blocks of data to and from the disk drives. System performance gains can be in the range of 5-10%. Typical examples are file-serving and e-mail benchmarks, where data transfer time becomes a larger component of the total disk access time.

File-serving and e-mail benchmarks (and applications) transfer relatively large blocks of data (12 KB to 64 KB), which increases SCSI bus utilization. More importantly, however, these benchmarks usually build a relatively small set of data files, resulting in artificially reduced disk seek times. In production environments, disk drives are usually filled to at least 30-50% of their capacity, causing longer seek times compared to the benchmark files that might only use 2-3% of the disk capacity. After all, building a 2 GB database for a benchmark might seem like a large data set, but on a disk array containing five 9 GB drives, that database utilizes less than 1/20th of the total space. This greatly reduces seek times, thereby inflating the performance contribution of the SCSI bus.

Most IBM SCSI drive enclosures offer the ability to split the backplane to offer dual SCSI bus capability. This effectively provides the same performance of Ultra2 by using dual Ultra SCSI buses in one drive package. In the case of the EXP200, which supports Ultra2, the backplane can be split to provide throughput benefits similar to Ultra3 SCSI, provided that the other components in the system can handle this level of throughput.

7.7.6.2 PCI bus

Don't forget that all this data must travel through the PCI bus. The peak data transfer rate for 32-bit 33 MHz PCI is 132 MBps, but the maximum sustained rate is only 80-100 MBps. The ServeRAID-2 adapter used three 40 MBps Ultra SCSI buses, which had the potential to sustain peak rates of 120 MBps. This transfer rate was perfectly matched with 32-bit 33 MHz PCI. Using Ultra2 SCSI only provides the possibility of improving maximum sustainable performance when both the adapter and the system support faster data transfer rates.

The ServeRAID-3HB adapter offers Ultra2 SCSI transfer performance and can transfer data over 64-bit PCI, which is better matched to the transfer requirements of its three Ultra2 SCSI buses. The adapter must be plugged into a 64-bit slot to take maximum advantage of its Ultra2 SCSI capabilities.

Ultra3 SCSI's 160 MBps rate has similar issues: even faster PCI-to-memory performance is required before maximum throughput can be achieved. A three-channel Ultra3 SCSI RAID adapter can potentially deliver peak rates of 480 MBps. For the moment, no PCI interface can offer such throughput performance. In fact, most memory subsystems cannot offer that much bandwidth for all of the PCI slots combined.
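The bandwidth matching discussed in this section is simple arithmetic, sketched below in Python. The function names are hypothetical, and the PCI figure is the theoretical peak (bus width times clock rate), not the lower sustained rate quoted above.

```python
def pci_peak_mbps(bus_width_bits: int, clock_mhz: float) -> float:
    """Theoretical PCI peak rate: bytes per clock times clocks per second."""
    return bus_width_bits / 8 * clock_mhz


def scsi_aggregate_mbps(channels: int, per_channel_mbps: float) -> float:
    """Combined peak rate of all SCSI channels on one adapter."""
    return channels * per_channel_mbps


pci_32_33 = pci_peak_mbps(32, 33)           # 32-bit 33 MHz PCI
ultra_x3 = scsi_aggregate_mbps(3, 40)       # three Ultra SCSI buses
ultra3_x3 = scsi_aggregate_mbps(3, 160)     # three Ultra3 SCSI buses

print(pci_32_33, ultra_x3, ultra3_x3)
print("Ultra x3 fits 32-bit PCI peak:", ultra_x3 <= pci_32_33)
print("Ultra3 x3 fits 32-bit PCI peak:", ultra3_x3 <= pci_32_33)
```

The comparison uses peak figures on both sides. Against the 80-100 MBps sustained PCI rate mentioned above, the margin is smaller still.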

All this is not to say that Ultra2 and Ultra3 SCSI performance have no place on servers. Several issues must be addressed. You should simply remember that these are important considerations when configuring a balanced server. Spec-driven technologies, such as SCSI, are often motivated by desktop environments. In the desktop environment, applications tend to be more sequential, and the system usually has a single SCSI adapter that can monopolize much of the PCI to memory bandwidth. In the desktop environment, SCSI-3 provides significant performance gains. Because of the more random nature of server applications, these benefits often do not translate to server environments. The entire delivery path from memory through the PCI bus, over the adapter, and out to the drive must be optimized before faster SCSI bus speeds will realize any appreciable system performance gains for any workload.

7.7.6.3 Multiple SCSI buses

The SCSI bus organization of drives on a multi-bus controller (such as ServeRAID) does not significantly affect performance for most server workloads.

For example, in a four-drive configuration, it doesn’t matter whether you attach all drives to a single SCSI bus or if you attach two drives each to two different SCSI buses. Both configurations will usually have identical disk subsystem performance. This applies to applications such as database transaction processing, which generate random disk operations of 2 KB or 4 KB. The SCSI bus does not contribute significantly to the total time required for each I/O operation. Each I/O operation usually requires drive seek and latency times; therefore, the sustainable number of operations per second is reduced, causing SCSI bus utilization to be low.

For a configuration which runs applications that access image data or large sequential files, performance improvement can be achieved by using a balanced distribution of drives on the three SCSI buses of the ServeRAID.

7.7.7 Disk cache write-back versus write-through

Most people think that write-back mode is always faster because it allows data to be written to the disk controller cache without waiting for disk I/O to complete.

This is usually the case when the server is lightly loaded. However, as the server becomes busy, the cache fills completely, causing data writes to wait for space in the cache before being written to the disk. When this happens, data write operations slow to the speed at which the disk drives empty the cache. If the server remains busy, the cache is flooded by write requests, resulting in a bottleneck. This happens regardless of the size of the adapter’s cache.

In write-through mode, write operations do not wait in cache memory that must be managed by the processor on the RAID adapter. When the server is lightly loaded (the green zone on the left in Figure 53), write operations take longer because they cannot be quickly stored in the cache. Instead, they must wait for the actual disk operation to complete. Thus, when the server is lightly loaded, throughput in write-through mode is generally lower than in write-back mode.

Figure 53. Comparing write-through and write-back modes under increasing load

However, when the server becomes very busy (the pink zone on the right in Figure 53), I/O operations do not have to wait for available cache memory. They go straight to disk, and throughput is usually greater for write-through than in write-back mode.

Write-through is also faster when battery-backup cache is installed, due partly to the fact that the cache is mirrored. Data in the primary cache has to be copied to the memory on the battery-backup cache card. This copy operation eliminates a single point of failure, thereby increasing the reliability of the controller in write-back mode, but it takes time and slows writes, especially when the workload floods the adapter with write operations.

[Figure 53 chart: IBM ServeRAID-3HB, 8 KB random I/O, RAID-5. Vertical axis: I/Os per second (200 to 1600). Horizontal axis: increasing load. Series: write-through and write-back.]

Based on Figure 53, the following rule of thumb is appropriate:

• If the disk subsystem is very busy, use write-through mode.
• If the disks are configured correctly, and the server is not heavily loaded, use write-back mode.


7.7.8 RAID adapter cache size

IBM performance tests show that the ServeRAID-3H adapter with 32 MB of cache typically outperforms other RAID adapters with 64 MB for most real-world application workloads. Once the cache size is above the minimum required for the job, the extra cache usually offers little additional performance benefit.

The cache increases performance by providing data that would otherwise be accessed from disk. However, in real-world applications, total data space is so much larger than disk cache size that, for random operations, there is very little statistical chance of finding the requested data in the cache. For example, a 50 GB database would not be considered very large by today's standards. A typical database of this size might be placed on an array consisting of seven or more 9 GB drives. For random accesses to such a database, the probability of finding a record in the cache would be the ratio of 32 MB/50 GB, or approximately 1 in 1,600 operations. Double the cache size, and the odds improve only to about 1 in 800; still a very discouraging hit-rate. You can easily see that it would take a very large cache to increase the cache hit-rate to the point where caching becomes advantageous for random accesses.

In RAID-5 mode, significant performance gains from write-back mode are derived from the ability of the disk controller to merge multiple write commands into a single disk write operation. In RAID-5 mode, the controller must update the checksum information for each data update. Write-back mode allows the disk controller to keep the checksum data in adapter cache and perform multiple updates before completing the update to the checksum information contained on the disk. In addition, this does not require a large amount of RAM.
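The random-access hit-rate estimate for the 50 GB database given earlier in this section can be reproduced with a few lines of Python. The sketch assumes uniformly random accesses, a cache that holds only distinct data from the database, and 1 GB = 1024 MB; `random_hit_probability` is a hypothetical helper name.

```python
def random_hit_probability(cache_mb, data_gb):
    """Chance that a uniformly random record is already in the cache."""
    return cache_mb / (data_gb * 1024)


for cache_mb in (32, 64, 128):
    p = random_hit_probability(cache_mb, data_gb=50)
    print(f"{cache_mb:>3} MB cache: about 1 hit in {round(1 / p)} operations")
```

Each doubling of the cache only halves the number of operations per hit, so even 128 MB leaves the hit rate for random access far below 1 percent under these assumptions.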

In most cases, disk array caches can provide high hit rates only when I/O requests are sequential. In this case, the controller can pre-fetch data into the cache so that on the next sequential I/O request, a cache hit occurs. Pre-fetching for sequential I/O requires only enough buffer space or cache memory to stay a few steps ahead of the sequential I/O requests. This can be done with a small circular buffer.
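The read-ahead behavior described above can be sketched as a toy simulation. It is not the algorithm of any particular controller: the buffer depth of four blocks, the block numbering, and the function name `hit_rate` are all assumptions made for illustration.

```python
import random
from collections import deque


def hit_rate(requests, read_ahead=4):
    """Hit rate for a list of block numbers with a small read-ahead buffer."""
    buffer = deque(maxlen=read_ahead)  # circular: oldest entry is dropped
    hits = 0
    for block in requests:
        if block in buffer:
            hits += 1
            # Stay ahead of the stream by fetching one more block.
            buffer.append(buffer[-1] + 1)
        else:
            # Miss: read the block from disk and pre-fetch its successors.
            buffer.clear()
            buffer.extend(range(block + 1, block + 1 + read_ahead))
    return hits / len(requests)


sequential = hit_rate(list(range(1000)))
rng = random.Random(0)  # fixed seed so the run is repeatable
scattered = hit_rate([rng.randrange(1_000_000) for _ in range(1000)])
print(f"sequential: {sequential:.1%}   random: {scattered:.1%}")
```

With a buffer of only four blocks, every sequential request after the first is a hit, while random requests across a large block range almost never find their data in the buffer.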

The cache size needs to increase in proportion to the number of concurrent I/O streams supported by the array controller. The earlier ServeRAID adapters supported up to 32 concurrent I/O streams, so 32 MB of cache was deemed enough to provide a high-performance hit rate for sequential I/O. For newer RAID adapters, the number of outstanding I/O requests can be as high as 128; thus, these adapters will have proportionally larger caches. (Note: it is a coincidence that the number of I/O streams matches the size of the cache in MB.)

A large cache means more memory to manage when the workload is heavy, while during light loads very little cache memory is required. Most people don't invest the time to think about how cache works. Without much thought, it's easy to reach the conclusion that “bigger is always better.” The drawback is that larger caches take longer to search and manage. This can slow I/O performance, especially for random operations, since there is a very low probability of finding data in the cache.

Benchmarks often do not reflect a customer production environment. In general, most “retail” benchmarks are run with very small amounts of data stored on the disk drives. In these environments, a very large cache will have a hit-rate that is artificially inflated compared to the hit-rate from a production workload.

In a production environment, an overly large cache can actually slow performance as the adapter continuously searches the cache for data that is never found, before it starts the required disk I/O. This is the reason that many array controllers turn off the cache when the hit-rates fall below an acceptable threshold.

In identical hardware configurations it will take more CPU overhead to manage 64 MB of cache compared to 32 MB, and even more for 128 MB. The point is that bigger caches do not always translate to better performance. The ServeRAID-4H, however, has a 266 MHz PowerPC 750 with 1 MB of L2 cache, a CPU that is approximately 5 to 7 times faster than the 80 MHz CPU used on ServeRAID-3H. Therefore ServeRAID-4H can manage its larger cache without running slower than ServeRAID-3H.

Furthermore, the amount of cache must be proportional to the number of drives attached. Typically cache hits are generated from sequential read ahead. You do not need to read ahead very much to have 100% hits. More drives have more I/O streams to prefetch. ServeRAID-4H has 4 SCSI buses that support up to 40 drives compared to 30 for ServeRAID-3H.

7.7.9 Device drivers

Device drivers play a major role in performance of the subsystem with which the driver is associated. A device driver is software written to recognize a specific device. Most of the device drivers are vendor specific. These drivers