2.8 Conclusions
3.2.4 File Allocations
Files in a parallel le system are striped over multiple I/O servers (or OSTs in Lustre terminology) to achieve better aggregate I/O throughput. There are two striping policies in Lustre le system. The default is Round-Robin allocation. Once the le stripe count (SC) and stripe size (SS) are set, a le will be striped across SC consecutive OSTs starting at a particular OST which is chosen according the OST space capacity usage. By default, an OST with lower usage is often chosen as the start OST so that all OSTs are guaranteed to have similar space usage. However, users are allowed to set customized striping policy for their working directories and les (including starting OST, stripe count and stripe size), which will make OSTs space capacities unevenly utilized. When the space usage dierence is bigger than a threshold (20% by default), the second striping policy will be used which is called Weighted Random Allocation. The weighted allocation scheme will give higher priorities to OSTs with lower space utilizations to gradually even the space utilizations on all OSTs.
Parameters striping_factor, striping_unit and romio_lustre_start_iodevice can be set to desired values through MPI hints to control the striping count, stripe size and start OST respectively. Note that newly created les will inherit the striping scheme of the parent directory if these parameters are left unset.
3.3 Related Work
Process collaborations have long been considered as an eective method of improving I/O performance. Seamons et al. introduced the server-directed I/O scheme [64] where the I/O nodes collect requests from computing nodes to explore the I/O sequentiality in favor of disk performance. The most widely used optimization for MPI collective I/O is the two-phase I/O [40] algorithm. Two-phase I/O makes performance improvements by selecting a subset of the application processes to collect requests from individual processes and organize them into bigger contiguous requests. Two-phase I/O has been the foundation for many other approaches including ParColl [52], LACI/O [53] and view-based collective I/O [54], etc. Liao and Choudhary proposed several le domain partitioning algorithms [55] to improve two-phase I/O performance by reducing locking overhead in the underlying le system.
There are many parameters along the parallel I/O path (mostly in MPI-IO layer and parallel le systems) that will lead to undermined I/O performance when left to their default values. There are some work focusing on improving I/O performance by tuning these parameters include [56, 57, 58, 59].
For example, Chaarawi and Gabriel introduced a model [57] to choose the opti- mal number of aggregators at runtime based on factors including the le view, process topology, the per-process write saturation point, and the actual amount of data written in a collective write operation. However, their model does not take into account the importance of physical le layout.
Worringen implemented an approach in NEC's MPI implementation [59] to auto- matically determine the optimal setting for le hints related to collective MPI-IO op- erations. Their approach considers ve parameters (cb_pros, cb_cong_list, cb_read and cb_write and cb_buer_size) and is specic to NEC's MPI-2 implementation (i.e., NEC/SX) and NEC's GFS le system. Their approach is inexible in determining the parameters' optimal values. For example, their approach replies on phase change detec- tion to adapt the buer size.
Liu et al. evaluated several collective I/O and non-collective I/O related MPI param- eters including romio_ds_read, romio_ds_write, and envisioned an automatic MPI-IO tuning tool based on Periscope Tuning Framework [57]. The shorting coming of this work is that they used a simplied strategy which explores next tuning parameter based on the best setting for previously explored parameters. This reduces the testing time but does not visit all the parameter permutations, which will result in failure of recovering correlations among I/O factors. Furthermore, this work only presented some parameter space exploration results without theoretical analysis.
McLay et al. developed a model [56] for understanding MPI collective write on Lustre and oered suggestions on selecting striping count to avoid chasm problem for Lustre le system. However, they did not study MPI collective read and non-collective MPI-IO workloads.
Behzad et al. proposed an empirical parameter training based approach [65, 66] to automatic parameter tuning for HDF5 applications. This approach searches for best parameter values by repeatedly running the application with dierent parameter value combinations which is termed as parameter space search before the actual running. The
set of best parameter values are stored as an XML le and will be loaded at runtime by the application. The downside of this approach is that some trainings can take up to 12 hours without a prediction model or 2 hours when using a prediction model proposed in [66]. This approach is also specic to HDF5 applications.
In contrast, the set of parameters crafted in our work is more complete than existing studies which including both MPI-IO and Lustre le system parameter. Besides, we provide more ecient ways for calculating the optimal parameter values based on a par- allel I/O model that leads to a thorough understanding the relationships between these parameters and the workload patterns. We also implemented our approach, i.e., IO- Engine, in the MPI-IO library by extending the existing APIs to make it non-intrusive. IO-Engine can be used by any applications using MPI-IO library and/or higher level libraries built on top of MPI-IO library without modifying source code.