Enterprise Storage Performance: It’s All About The Interface.

A DIABLO WHITE PAPER | APRIL 2014

Architecture


Your PCIe SSD has an ugly secret.

If you’ve ever tested holistic SSD performance (both IOPS and latency under load), you may already know what that secret is. Though PCIe interface bandwidth is aggressively touted by SSD manufacturers, end-to-end enterprise storage performance is truly governed by the presence or lack of bottlenecks within the system.

Until recently, the performance and scalability bottlenecks inherent to PCIe-based SSDs have been largely ignored. Those SSDs represented the highest-performing storage available, so customers had no alternatives to consider. However, with the introduction of Memory Channel Storage™ (MCS™), a superior solution now exists. Memory Channel Storage leverages an architecture that solves the issues faced by PCIe SSDs, thereby unlocking the true potential of flash storage in the enterprise.

It’s Not All About The Interface

With the advent of enterprise PCIe-based SSDs, interface speed moved to the forefront of most performance-related conversations. When compared to SATA and SAS SSDs, the bandwidth available to PCIe drives was clearly superior. As a result, PCIe SSD vendors have been vocal in promoting PCIe bandwidth as a key technology differentiator.

In practice, however, there is a world of difference between what an interface can support and what a solution using that interface can deliver. The focus on theoretical PCIe bandwidth, while compelling, has served to obscure a critical shortcoming: a pervasive bottleneck that limits both the performance and scalability of PCIe-based storage devices.

Flash Management Overload

It is well known that sophisticated media management is required to make economical flash (i.e. commodity MLC) usable in enterprise applications. Wear-leveling, garbage collection, and error correction are among the activities that must be constantly managed for each flash IC.


To optimize performance, solid state drives simultaneously access multiple flash ICs in parallel. The highest-performing PCIe-based SSDs employ a “big ASIC” architecture, in which many parallel flash devices are managed by a single, monolithic flash controller (see Figure 1).

However, due to the large amount of media management required, the “big ASIC” approach creates a bottleneck under heavy I/O load. This results in degraded latency as the controller ASIC is unable to keep pace with the increased computational burden.

A Telling Comparison

The effect of media management can be observed by examining the performance of leading PCIe solutions. For example, the performance of an MLC-based “big ASIC” solution will be dramatically worse than the performance of an analogous SLC-based solution (same controller, same number of flash placements). This phenomenon is demonstrated in the figures below, which plot IOPS versus I/O latency.

In Figure 2, we compare a leading MLC-based PCIe SSD to its SLC-based counterpart (again, same controller, same number of flash ICs) in a 100% Random Read scenario. Though the MLC solution does exhibit reduced performance, both solutions are able to effectively leverage the PCIe bandwidth (>90% utilization) and, therefore, achieve comparably high throughput. In this case, the discrepancy between the SLC and MLC solutions is minimized because Read requests trigger much less media management activity than Write requests (e.g. no garbage collection or wear-leveling is required).

The latency profiles of the individual solutions are also worth noting. As Figure 2 shows, latency increases significantly for both products as the I/O load intensifies. This demonstrates another infrequently discussed reality concerning PCIe-based solutions: the low latencies quoted for those SSDs only apply under low I/O loads. When supporting peak performance, the PCIe SSD latencies increase dramatically.
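The relationship between load and latency described above can be sketched with Little's Law (mean latency = outstanding I/Os / throughput). The following is a minimal illustration, not a model of any measured device: the saturation point of 400,000 IOPS and the 0.05 ms unloaded service time are hypothetical values chosen only to make the arithmetic visible.

```python
def mean_latency_ms(outstanding_ios: int, iops: float) -> float:
    """Little's Law: mean latency = outstanding I/Os / throughput."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return outstanding_ios / iops * 1000.0


# Hypothetical device limit (illustrative only, not taken from the figures).
SATURATION_IOPS = 400_000


def sweep(queue_depths, per_io_service_ms=0.05):
    """Return (queue depth, IOPS, mean latency in ms) for each queue depth.

    Below saturation, throughput grows with queue depth and latency stays
    flat. Once throughput is capped, extra outstanding I/Os only add latency.
    """
    rows = []
    for qd in queue_depths:
        uncapped_iops = qd / (per_io_service_ms / 1000.0)
        iops = min(uncapped_iops, SATURATION_IOPS)
        rows.append((qd, iops, mean_latency_ms(qd, iops)))
    return rows


if __name__ == "__main__":
    for qd, iops, latency in sweep([1, 8, 20, 64, 256]):
        print(f"QD={qd:4d}  IOPS={iops:10.0f}  latency={latency:.3f} ms")
```

In this sketch, latency stays at the unloaded value until the device saturates, then rises in proportion to queue depth, which is the general shape of an IOPS-versus-latency curve under increasing load.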

Figure 1 - “Big ASIC” SSD implementation

Figure 2 - 100% Read Comparison (IOPS vs. Latency: 100% Read, 4K Random; axes: IOPS and latency in ms; series: MLC-based PCIe SSD, SLC-based PCIe SSD)


Figure 3 demonstrates the performance comparison in a 100% Random Write scenario. Here, the results clearly show that the SLC-based solution can scale to support much higher throughput than the MLC-based version. In this case, write bandwidth is halved as the solution moves to MLC.

Though SLC flash does have better read and write performance than MLC, that advantage does not account for the huge performance discrepancy shown above. Instead, this discrepancy is caused by the increased media management overhead necessary for supporting MLC flash. Compared to SLC flash, MLC has inherently lower endurance and also requires more error correction. Therefore wear-leveling, garbage collection, and error correction all become more prevalent and computationally intensive. The resulting media management creates a bottleneck in the MLC-based solution. Due to this bottleneck, references to high PCIe interface speeds can be misleading.

Despite their access to a wide pipe, PCIe-based SSDs are not able to leverage those speeds under load. In practice, as demonstrated by Figure 3, less than 15% of the available PCIe bandwidth is being utilized. (Note: The MLC SSD depicted in Figures 2-4 utilizes a write cache, hence the flat performance at 100K IOPS for the 100% Write workload.)
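The utilization figure can be sanity-checked with simple arithmetic: payload throughput is IOPS multiplied by transfer size, divided by the link's usable bandwidth. The sketch below assumes a PCIe 2.0 x8 link (roughly 500 MB/s per lane per direction after 8b/10b encoding); the paper does not state the link generation or width, so that value is an assumption for illustration. The 100K IOPS and 4K transfer size come from the write-workload note above.

```python
def bandwidth_utilization(iops: float, block_bytes: int, link_bytes_per_s: float) -> float:
    """Fraction of link bandwidth consumed by payload at a given IOPS rate."""
    return iops * block_bytes / link_bytes_per_s


# Assumption (not stated in the paper): PCIe 2.0 x8,
# about 500 MB/s usable per lane per direction.
ASSUMED_LINK_BYTES_PER_S = 8 * 500e6

# 100K IOPS of 4 KiB writes, as in the 100% Write workload.
util = bandwidth_utilization(100_000, 4096, ASSUMED_LINK_BYTES_PER_S)

if __name__ == "__main__":
    print(f"Payload throughput: {100_000 * 4096 / 1e6:.1f} MB/s")
    print(f"Link utilization:   {util:.1%}")
```

Under that assumed link, 100K IOPS of 4K writes moves about 410 MB/s, which is consistent with the "less than 15%" statement; a different link width or generation would change the percentage.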

In Figure 4, we compare the two solutions using a mixed workload consisting of 70% Read transactions and 30% Write transactions. This is a typical read/write mix in real-world Online Transaction Processing (OLTP) applications. Here again, the performance delta is dramatic due to the intensive media management required for the MLC drive. With Reads and Writes interleaved, the associated bottleneck affects all request submissions. PCIe interface capability becomes moot as, once again, only a small portion (less than 30%) of the PCIe bandwidth is actually utilized.

Figure 3 - 100% Write Comparison (IOPS vs. Latency: 100% Write, 4K Random; axes: IOPS and latency in ms)

Figure 4 (IOPS vs. Latency: 70/30, 8K Random; axes: IOPS and latency in ms)


It’s All About The Architecture

To leverage the full potential of flash in the enterprise, Diablo Technologies has pioneered a storage technology that bypasses the architectural bottlenecks faced by pre-existing solutions. Diablo’s Memory Channel Storage (MCS) architecture achieves end-to-end parallelism (i.e. no bottleneck) by leveraging the server’s natively parallel memory subsystem. Each MCS module plugs into a standard DDR3 DIMM slot and is directly available to the CPU’s memory controller. Memory controllers were designed to effectively manage massively parallel, time-sensitive, high-speed data access (e.g. to DRAM). MCS DIMMs are populated on multiple memory controller channels and media management tasks are dispersed across those channels in a distributed fashion. By dividing the media management overhead into manageable chunks and co-processing those chunks in parallel, Diablo employs a “divide and conquer” strategy similar to those popular in distributed computing architectures. The result is a high-performance, efficient persistence layer that can service heavy I/O loads with low, deterministic latency (see Figure 5).
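The intuition behind distributing media management can be illustrated with a toy queueing model. This is a sketch under stated assumptions, not a description of Diablo's implementation: each controller is treated as an M/M/1 queue (mean response time 1 / (service rate − arrival rate)), load is assumed to split evenly across modules, and all capacity numbers are hypothetical.

```python
def mm1_latency_ms(arrival_iops: float, service_iops: float) -> float:
    """Mean response time of an M/M/1 queue, in milliseconds.

    Returns infinity when the offered load meets or exceeds capacity.
    """
    if arrival_iops >= service_iops:
        return float("inf")
    return 1000.0 / (service_iops - arrival_iops)


def distributed_latency_ms(arrival_iops: float,
                           service_iops_per_module: float,
                           modules: int) -> float:
    """Mean response time when load is split evenly across independent modules."""
    return mm1_latency_ms(arrival_iops / modules, service_iops_per_module)


if __name__ == "__main__":
    # Hypothetical capacities, chosen only for illustration.
    single_controller_iops = 120_000   # one monolithic controller
    per_module_iops = 40_000           # each of 8 smaller controllers
    for load in (50_000, 100_000, 119_000, 200_000):
        single = mm1_latency_ms(load, single_controller_iops)
        spread = distributed_latency_ms(load, per_module_iops, modules=8)
        print(f"load={load:7d} IOPS  single={single:8.3f} ms  8 modules={spread:.3f} ms")
```

In this toy model, the single controller's latency climbs sharply as load approaches its capacity, while the distributed configuration keeps each module far from saturation because aggregate capacity grows with the number of modules.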

In addition, by taking advantage of MCS’s distributed nature, solution performance can be efficiently scaled to match Quality of Service (QoS) requirements. MCS provides system designers with granular control over the desired performance and capacity. This enables customers to pay only for what they truly need.

Problem Solved

In Figures 6 through 8, we show how a Memory Channel Storage solution (comprising 8x SanDisk ULLtraDIMM™ devices) compares to the MLC-based PCIe SSD. Both solutions use commodity MLC flash and have similar total capacity (1.6TB vs. 1.4TB).

Figure 6 - The Read performance of an MCS solution offers a dramatic improvement over a PCIe-based SSD. Most real-world workloads have a significant Write mix, however, so this is less interesting than the next two comparisons that we will examine.

Figure 5 - Memory Channel Storage Architecture

Figure 6 - 100% Read Comparison [with MCS] (IOPS vs. Latency: 100% Read, 4K Random; axes: IOPS and latency in ms; series: MLC-based PCIe SSD, 8x ULLtraDIMMs)


Figure 7 - The MCS-based solution also offers vastly superior Write performance. This is a critical benefit in write-centric applications like high-frequency trading, and in cases where system memory must be persisted.

Figure 8 - The MCS-based solution also dominates a mixed-workload comparison. This “70% Read, 30% Write” mix is highly common for popular applications like virtualization and database transaction processing.

Figure 7 - 100% Write Comparison [with MCS] (IOPS vs. Latency: 100% Write, 4K Random; axes: IOPS and latency in ms)

Figure 8 - 70 / 30 OLTP Comparison [with MCS] (IOPS vs. Latency: 70/30, 8K Random; axes: IOPS and latency in ms)


So What Have We Learned?

It’s really not all about the interface. As we’ve explained and demonstrated, interface bandwidth is not equivalent to storage performance. Though the PCIe interface can support high bandwidth, architectural bottlenecks restrict PCIe SSD performance in practice, thereby making theoretical PCIe bandwidth irrelevant. However, by leveraging a uniquely distributed architecture, Memory Channel Storage avoids such bottlenecks and offers a superior solution for real-world applications. Diablo’s approach has, for the first time, unlocked the true