3.2 CSeq Application Enhancement
3.2.3 GeneSIS Cluster Optimizations
The initial CSeq implementation did not scale properly on the GeneSIS cluster; in fact, it ran slower on GeneSIS than on the conventional workstation. The application therefore required cluster-specific optimizations to harness the computational power provided by the system hardware. In this section, we document the space and run-time enhancements that we added to CSeq.
General Methodology
To attack this problem, we divided CSeq into three distinct function categories: (1) disk reading, (2) data processing, and (3) disk writing. Next, we started a development branch of CSeq named CSeqSpeed. We then manually copied code from CSeq into the development branch so that we could test the performance of each function category independently.
In all test cases, we used data sets of uniform size (approximately 1 GB each). We created four distinct genome data sets: (1) an uncompressed FASTA, (2) a Gzip-compressed FASTA, (3) an uncompressed Genbank, and (4) a Gzip-compressed Genbank. For each data set, we observed the performance of the compiled CSeqSpeed executable and compared it with the expected (theoretical) target I/O performance of the Parallel Virtual File System (PVFS) disk subsystem. The 7.0TB PVFS partition of the 24.8TB total usable space on the GeneSIS Cluster housed the NCBI database snapshots used in the sequence processing tests.
Tuning: Disk Reading
First, we constructed the initial base version of CSeqSpeed: a skeleton program that simply opened an NCBI database file, sequentially read each block in the file using a circular buffer, and terminated. Using the Linux “time” command, we established the expected PVFS read performance in MB/sec by testing CSeqSpeed on the uncompressed FASTA data set; we found that the optimal CSeqSpeed read-buffer size for GeneSIS was between 1MB and 2MB. Next, we tested CSeq on the compressed FASTA data set and found that use of the Gzip compression library degraded performance slightly, which was consistent with our expectations. Once we found the optimal read block size, we used Cachegrind to verify that CSeqSpeed contained zero memory leaks. At this point, we had completed our base version of CSeqSpeed, which represented the expected target performance.
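The read-only baseline described above can be approximated by a sketch like the following. It is illustrative only and is not the CSeqSpeed source: the names `ReadResult` and `time_sequential_read` are hypothetical, a plain reused block buffer stands in for the circular buffer, and the timing is taken in-process with `std::chrono` rather than with the “time” command.

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Outcome of one sequential pass over a file (hypothetical type).
struct ReadResult {
    std::size_t bytes = 0;  // total bytes read
    double seconds = 0.0;   // wall-clock duration of the pass
};

// Read `path` front to back in blocks of `block_size` bytes and discard the
// data, so that only raw read cost is measured. Hypothetical helper.
ReadResult time_sequential_read(const std::string& path, std::size_t block_size) {
    ReadResult result;
    if (block_size == 0) return result;  // nothing sensible to measure

    std::ifstream in(path, std::ios::binary);
    if (!in) return result;  // unreadable file: report zero bytes

    std::vector<char> buffer(block_size);  // allocated once, reused per block
    const auto start = std::chrono::steady_clock::now();
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        result.bytes += static_cast<std::size_t>(in.gcount());  // last block may be short
    }
    const auto stop = std::chrono::steady_clock::now();
    result.seconds = std::chrono::duration<double>(stop - start).count();
    return result;
}
```

Calling such a helper with several block sizes (for example 64KB, 1MB, and 2MB) and dividing `bytes` by `seconds` gives a rough throughput figure for each size, which is one way to search for the best read-buffer size on a given file system.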
Next, we copied the data structures and algorithms from CSeq to CSeqSpeed one by one to iteratively test the disk read performance on the uncompressed FASTA data set. For each case, we compared the observed performance with the expected performance. If the observed performance was less than the expected performance, then we used Cachegrind and GProf to identify software bottlenecks and eliminate them. Based on our analysis, we found that the following enhancements to CSeq significantly improved space and run-time performance in the disk reading domain:
1. Character Iterator Class Removal: We found that invoking the Character Iterator’s increment operator for each character significantly degraded disk read run-time performance, so we completely removed this class and replaced it with a circular buffer in the File Iterator class.
2. String/C-String Concatenation Minimization: We reduced the total number of string buffer concatenations in the disk reading process by setting the File Iterator’s circular buffer to a static size of 2MB.
3. Pointers and References: We used pointers and references to efficiently pass the read buffer between various objects to eliminate unnecessary and redundant memory allocation and copying.
4. Function In-Lining: We in-lined the most frequently used disk read functions (based on GProf statistics) in the Sequence, Sequence Iterator, FASTA Sequence Iterator, and Genbank Iterator classes.
Tuning: Data Processing and Filtering
Once the disk reading enhancements were complete, we turned to the data processing and filtering enhancements; here, the CSeqSpeed copy-and-paste, Cachegrind, GProf, and code adjustment methodology for tuning the data processing algorithms was similar to that of the disk reading. Based on our analysis, we found that the following enhancements to CSeq significantly improved space and run-time performance in the data processing and filtering domain:
1. Disk Swapping and Memory Usage Reduction: We reduced the RAM requirements for the Sequence, Sequence Iterator, Genbank Sequence Iterator, FASTA Sequence Iterator, Genbank Feature, Filter Set, and Filter classes used by the core processing and filtering algorithms to reduce the disk swap potential.
2. Data Globalization: We globalized the Genbank Lookup Table and certain Filtering enumerations to provide faster access to the processing and filtering algorithms.
3. Pointers and References: We used pointers and references to efficiently pass the Sequence objects to the various processing and filtering algorithms.
4. Function In-Lining: We in-lined the most frequently used data processing and filtering functions (based on GProf statistics) in the Filter Factory, Filtering Sequence Iterator, Filter Set, Filter, Sequence, Sequence Iterator, FASTA Sequence Iterator, and Genbank Iterator classes.
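The pointer/reference and in-lining items above might look like the following minimal sketch. The `Sequence` struct, the minimum-length predicate, and both function names are hypothetical stand-ins and do not reflect the actual CSeq Sequence or Filter classes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a sequence record.
struct Sequence {
    std::string id;
    std::string residues;
};

// Small predicate called once per sequence. It takes the sequence by const
// reference, so no residue data is copied on each call.
inline bool passes_min_length(const Sequence& seq, std::size_t min_length) {
    return seq.residues.size() >= min_length;
}

// Return pointers to the sequences that pass the filter instead of copies.
// The pointers are valid only while `input` is alive and unmodified.
std::vector<const Sequence*> filter_by_length(const std::vector<Sequence>& input,
                                              std::size_t min_length) {
    std::vector<const Sequence*> kept;
    kept.reserve(input.size());  // avoid repeated reallocation
    for (const Sequence& seq : input) {
        if (passes_min_length(seq, min_length)) kept.push_back(&seq);
    }
    return kept;
}
```

Returning pointers keeps memory use proportional to the number of kept records rather than to their total residue length, at the cost of tying the result's lifetime to the input container.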
Tuning: Disk Writing
Once the data processing enhancements were complete, we turned to the disk writing enhancements; here, the CSeqSpeed copy-and-paste, Cachegrind, GProf, and code adjustment methodology for tuning the disk writing algorithms was similar to that of the data processing. Based on our analysis, we found that the following enhancement to CSeq significantly improved space and run-time performance in the disk writing domain:
1. Output Buffering: We adjusted the disk writing algorithms to export 2MB blocks to the PVFS file system (during data processing), so that the total number of output file access requests was reduced.
3.3 Statistical Analysis Applications
In this section, we discuss the applications used to conduct k-mer-based experiments on NCBI genome and proteome sequences.
3.3.1 Rankseq: k-mer Arrangement and Classification