We are also working on a user-assisted IO-Engine, discussed in Section 3.7, which requires some user input to facilitate more intelligent file striping. We define a new MPI file open API, MPI_File_open_s(), which takes extra parameters describing some basic characteristics of future I/O accesses to the file. One of the characteristics under consideration is read/write frequency, because we observe that read performance tends to be more affected by stripe size while write performance tends to be more affected by striping count.
Our experiments in Figure 3.16 and Figure 3.17 show the impact of different striping counts and stripe sizes on collective I/O performance. In both cases, we use the default two-phase I/O parameters.
In Figure 3.16, we fix the stripe size to 1MB and 4MB, and vary the striping count from 1 to 16. Write performance generally improves as the striping count increases, for two reasons. The first reason is that the average write lock overhead on a single OST becomes smaller. As a simple example, if 20 stripes are accessed by 20 aggregator processes (each aggregate request is aligned with the stripe boundary) during a write I/O cycle and are evenly distributed across 10 OSTs, then each OST has to perform a lock relinquish. However, if the 20 stripes are evenly distributed across 20 OSTs, then no lock relinquish is needed, which implies less lock overhead. The second reason is that more OSTs provide larger aggregate I/O throughput. Read performance is less impacted by striping count because reads can be serviced from the read-ahead cache in Lustre. As long as the bandwidth of the OSTs is not saturated, read throughput will be less impacted by striping count.
However, read performance can benefit from a relatively larger stripe size because of the read-ahead scheme in Lustre, as shown in Figure 3.17, where we fix the striping count to 4 and 8, and vary the stripe size from 1MB to 16MB. Large stripe sizes can negatively impact writes because, statistically, more processes will contend on the same stripes, causing more locking overhead.
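The 20-stripe example above can be reproduced with a small counting sketch. It rests on assumptions that are not stated in the text: stripes are placed round-robin across OSTs, each stripe is written by a distinct aggregator, and an OST needs one lock hand-off for every aggregator beyond the first that writes to it. The function name `lock_handoffs_per_ost` is hypothetical.

```python
from collections import Counter
from typing import Dict


def lock_handoffs_per_ost(num_stripes: int, num_osts: int) -> Dict[int, int]:
    """Count assumed lock hand-offs per OST during one write I/O cycle.

    Assumptions (illustrative only): round-robin stripe placement, one
    aggregator per stripe, and one hand-off per additional aggregator
    that writes to the same OST.
    """
    if num_stripes < 0 or num_osts <= 0:
        raise ValueError("num_stripes must be >= 0 and num_osts must be > 0")
    # Number of stripes (and hence aggregators) that land on each OST.
    stripes_on_ost = Counter(stripe % num_osts for stripe in range(num_stripes))
    # The first writer on an OST takes the lock; every later one needs a hand-off.
    return {ost: max(stripes_on_ost[ost] - 1, 0) for ost in range(num_osts)}


if __name__ == "__main__":
    for osts in (10, 20):
        handoffs = lock_handoffs_per_ost(20, osts)
        print(osts, "OSTs:", sum(handoffs.values()), "hand-offs in total")
```

Under these assumptions, 20 stripes on 10 OSTs give one hand-off per OST, while 20 stripes on 20 OSTs give none, which matches the example in the text.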
[Figure 3.16 plots: throughput in MB/s (write and read) versus striping count (1, 4, 8, 16); panels (a) Stripe Size = 1MB and (b) Stripe Size = 4MB.]
Figure 3.16: Striping Count Impact
As a result, a reasonable heuristic is that write-intensive files should use a larger striping count and the default stripe size. For example, checkpointing files created by HPC applications are mostly written once and thus would prefer a larger striping count. For read-intensive files, a relatively larger stripe size and the default striping count should produce better performance. Of course, more comprehensive experiments and further technical study are needed to validate this conclusion, which is part of our near-future work.
In addition, another characteristic, the region size (the amount of contiguous data belonging to the same process), can also be used to help choose a stripe size such that the RS < SS < RE situation is formed, which theoretically should produce the best result, as discussed above.
Modifications must be made to the MPI API syntax in order to pass this information. Therefore, we do not include this feature in the current non-intrusive IO-Engine, but will integrate it in the user-assisted IO-Engine as part of step 1 in the heuristic map.
Moreover, this IO-Engine work can also be extended to optimize parallel I/O performance for other parallel file systems. We believe some heuristics can be shared among them, while the rest are file system dependent.
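The heuristic described above can be sketched as a small decision function. This is only an illustration under assumptions: the text gives no threshold or concrete default values, so the 0.5 cut-off, the default striping (4 OSTs, 1 MiB), and the "large" values (16 OSTs, 16 MiB, taken from the upper end of the ranges used in the experiments) are placeholders, and the names `choose_striping` and `as_mpi_hints` are hypothetical. The hint keys `striping_factor` and `striping_unit` are the reserved MPI-IO info keys for striping count and stripe size.

```python
from dataclasses import dataclass
from typing import Dict

MIB = 1024 * 1024


@dataclass(frozen=True)
class StripingChoice:
    striping_count: int  # number of OSTs the file is striped over
    stripe_size: int     # stripe size in bytes


def choose_striping(read_fraction: float,
                    default_count: int = 4,       # assumed default
                    default_size: int = 1 * MIB,  # assumed default
                    large_count: int = 16,        # assumed "large" count
                    large_size: int = 16 * MIB,   # assumed "large" size
                    threshold: float = 0.5) -> StripingChoice:
    """Pick a striping layout from the expected share of read accesses."""
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be within [0, 1]")
    if read_fraction > threshold:
        # Read-intensive: larger stripe size, default striping count.
        return StripingChoice(default_count, large_size)
    # Write-intensive (e.g. checkpoint files): larger count, default size.
    return StripingChoice(large_count, default_size)


def as_mpi_hints(choice: StripingChoice) -> Dict[str, str]:
    """Express the choice as MPI-IO info key/value pairs (values are strings)."""
    return {
        "striping_factor": str(choice.striping_count),
        "striping_unit": str(choice.stripe_size),
    }


if __name__ == "__main__":
    checkpoint = choose_striping(read_fraction=0.0)
    print(checkpoint, as_mpi_hints(checkpoint))
```

A write-once checkpoint file (read fraction 0) is mapped to the larger striping count, while a mostly-read file is mapped to the larger stripe size. In a real deployment the thresholds and values would need to come from the kind of validation experiments the text calls for.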
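One way to turn the RS < SS < RE condition into a concrete choice is sketched below. It assumes that RS and RE are available as byte counts computed as defined earlier in the chapter, and it restricts candidates to power-of-two multiples of 64 KiB, which is the granularity Lustre requires for stripe sizes. Picking the smallest such candidate is an arbitrary choice for illustration, and the name `pick_stripe_size` is hypothetical.

```python
from typing import Optional

LUSTRE_STRIPE_GRANULE = 64 * 1024  # Lustre stripe sizes are multiples of 64 KiB


def pick_stripe_size(rs_bytes: int, re_bytes: int) -> Optional[int]:
    """Return a stripe size SS with RS < SS < RE, or None if none fits.

    Candidates are power-of-two multiples of 64 KiB; the smallest one
    strictly greater than RS is returned (an illustrative policy only).
    """
    if rs_bytes < 0 or re_bytes < 0:
        raise ValueError("sizes must be non-negative")
    ss = LUSTRE_STRIPE_GRANULE
    while ss <= rs_bytes:
        ss *= 2
    return ss if ss < re_bytes else None


if __name__ == "__main__":
    print(pick_stripe_size(1_500_000, 8 * 1024 * 1024))
```

When no candidate falls strictly between the two bounds, the function returns `None`, and a caller would fall back to the default stripe size.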
3.8 Conclusions
[Figure 3.17 plots: throughput in MB/s (write and read) versus stripe size (1MB, 4MB, 8MB, 16MB); panels (a) Striping Count = 4 and (b) Striping Count = 8.]
Figure 3.17: Stripe Size Impact
Parallel I/O performance has been a challenge for HPC systems because of the many factors along the parallel I/O path. In this chapter, we begin by investigating the parallel I/O stack and exploring the correlations among factors such as file access pattern, parallel I/O modes, and specific system parameters. Based on this knowledge, we propose IO-Engine, an intelligent I/O middleware module instrumented into the existing MPI-IO library that can transparently optimize HPC I/O workloads on Lustre systems.
Chapter 4
In-place Update SWDs
Besides parallel I/O performance, storage capacity is another major I/O requirement in converged HPC systems. Traditional hard drives, which store over 80% of the data in most of today's data centers and HPC systems [71], are reaching their areal data density limit. New technology must be developed to keep pace with data growth.
Shingled Write Disks (SWDs), one of the most promising new technologies, increase storage density by writing data in overlapping tracks. Consequently, data cannot be updated freely in place without overwriting the valid data, if any, in subsequent tracks. A write operation may therefore incur several extra read and write operations, which creates a write amplification problem. In this chapter, we propose several novel static Logical Block Address (LBA) to Physical Block Address (PBA) mapping schemes for in-place update SWDs which significantly reduce the write amplification. Experiments with four traces demonstrate that our scheme can provide performance comparable to that of regular Hard Disk Drives (HDDs) when the SWD space usage is no more than 75%.
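The write amplification problem can be illustrated with a deliberately simplified model. This is not one of the mapping schemes proposed in this chapter. It assumes that writing a track clobbers only the single next track in the same band, so an in-place update must first read the next track if it holds valid data, then rewrite it, which in turn clobbers the track after it, and so on until a free track or the end of the band is reached. The function name `in_place_update_cost` is hypothetical.

```python
from typing import List, Tuple


def in_place_update_cost(valid: List[bool], track: int) -> Tuple[int, int]:
    """Return (extra_reads, total_writes) for updating `track` in place.

    `valid[i]` says whether track i of the band holds valid data.
    Assumption: each write destroys exactly the next track (overlap of one).
    """
    if not 0 <= track < len(valid):
        raise IndexError("track is outside the band")
    extra_reads = 0
    total_writes = 1  # the requested update itself
    nxt = track + 1
    # Cascade through consecutive valid tracks following the updated one.
    while nxt < len(valid) and valid[nxt]:
        extra_reads += 1   # save the data the previous write will clobber
        total_writes += 1  # write it back, clobbering the following track
        nxt += 1
    return extra_reads, total_writes


if __name__ == "__main__":
    full_band = [True, True, True, True]
    sparse_band = [True, False, True, False]
    print(in_place_update_cost(full_band, 0))    # cascades to the band end
    print(in_place_update_cost(sparse_band, 0))  # stops at the free track
```

In this model, updating the first track of a fully used four-track band costs three extra reads and four writes in total, whereas a free track directly after the updated one ends the cascade immediately.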