CHAPTER 2: Current Tools & Workflows Employed for Analysis of Large-Scale Shotgun Proteomics Experiments
2.2 Challenges of Integrating and Developing Workflows for Analyzing Large Experimental Datasets
2.2.1. Standardization of Data Formats
The goal of proteomics is to identify and quantify proteins, but as data complexity increases, accurately distinguishing high-confidence from low-confidence identifications and high-abundance from low-abundance proteins becomes non-trivial. For the past decade, MS analysis has been processed centrally in a workflow whose primary emphasis is converting raw instrument data (*.RAW files) into summarized reports (DTASelect files1) with basic spectral, peptide, and protein information. Because new instruments collect approximately 28,000 spectra per fraction and a typical run consists of 12 fractions, over 308,000 scans are collected for every MS run, which translates into approximately 10 GB of raw data. For a single experiment on the newer instruments, between 120,000 and 180,000 spectra are assigned. Some experiments have been known to incorporate up to 60 MS runs, resulting in a deluge of high-dimensional, high-mass-accuracy raw data that requires matching to peptide sequences, mapping to protein sequences, filtering, quantification, and normalization before biological interpretation can begin.
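To make the scale concrete, the figures quoted above can be combined in a short back-of-envelope calculation. This is only a sketch that takes the approximate values in the text at face value (28,000 spectra per fraction, 12 fractions per run, roughly 10 GB per run, up to 60 runs per experiment); the function and constant names are illustrative and do not come from any existing tool.

```python
from typing import Tuple

# Approximate figures quoted in the text; real values vary by instrument and method.
SPECTRA_PER_FRACTION = 28_000
FRACTIONS_PER_RUN = 12
GB_PER_RUN = 10
MAX_RUNS_PER_EXPERIMENT = 60


def experiment_volume(runs: int) -> Tuple[int, int]:
    """Return (total scans, total raw size in GB) for an experiment with `runs` MS runs."""
    scans_per_run = SPECTRA_PER_FRACTION * FRACTIONS_PER_RUN
    return scans_per_run * runs, GB_PER_RUN * runs


if __name__ == "__main__":
    scans, gigabytes = experiment_volume(MAX_RUNS_PER_EXPERIMENT)
    print(f"{scans:,} scans, ~{gigabytes} GB of raw data")
```

Under these assumptions, a 60-run experiment amounts to roughly 20 million scans and about 600 GB of raw data, all of which must pass through every downstream processing step.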
The Human Proteome Organization's (HUPO) Proteomics Standards Initiative (PSI) is a group of researchers dedicated to standardizing data formats in order to improve cross-platform analyses, set standards for high-quality data, and foster collaborations between institutions.96 The organization is divided into three primary working groups: molecular interactions, mass spectrometry and proteomics informatics, and protein separations. Each year, every working group hosts a meeting to discuss the emerging needs of current technologies and experimental designs and to evaluate how the existing data formats, controlled vocabularies, and responsibilities are faring. Currently, most of the recommended data formats are XML-compliant, ensuring a hierarchical structure that can not only be enforced with a strictly defined schema but also be compressed effectively into manageable file sizes.
Until 2008, PSI suggested that the mzData and mzXML formats be used to capture raw data generated by the instruments. mzData was intended to be more of an index file that aggregates and points to the numerous raw file formats from instrument vendors, not a replacement for the original files; mzXML files, in contrast, were created as open-source substitutes for the information stored in vendor files, which were locked in proprietary formats. While the mzData format is now deprecated, mzXML is still in common use. However, PSI currently recommends the mzML or TraML formats instead. mzML files are expected to be ubiquitous and applicable to all mass spectrometry instrument configurations and experimental designs, although no vendor has yet released software supporting the format. TraML is a more narrowly designed format that targets selected reaction monitoring (SRM) experiments.
Both of these data formats contain scan information to be used as input to search algorithms, which are, in turn, recommended to output mzIdentML files. mzIdentML files are expected to report MS scans, their MS/MS scans, the peptide sequences matching the MS/MS scans, and peptide-spectrum match (PSM) scores. These files do not contain all of the original peak data, but they have the structure in place to report matched fragment peaks that supplement each PSM score. For software that does not generate mzIdentML files by default, free software is available to convert other formats, such as DTASelect (from SEQUEST) or pepXML (from MyriMatch or Mascot). Notably, the mzIdentML file format is not the last stage of the post-processing analysis: PSMs must still be filtered, peptides assembled into proteins, and abundances quantified. For such processed information, PSI is currently finishing mzQuantML. However, mzQuantML has yet to be widely adopted, most likely because there is no consensus on a standard procedure for filtering, assembling, and quantifying proteins.
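The filtering step that follows PSM reporting can be illustrated with a minimal sketch. The XML fragment below is a deliberately simplified, mzIdentML-like structure: the element names, attribute names, peptide sequences, and scores are all invented for illustration and do not reproduce the actual PSI schema, and the score threshold is an arbitrary assumption rather than a recommended cutoff.

```python
import xml.etree.ElementTree as ET

# Simplified, mzIdentML-like fragment. Element names, attributes, and values
# are illustrative assumptions, not the real PSI schema or real search results.
SAMPLE = """\
<IdentificationResults>
  <SpectrumResult spectrumID="scan=1021">
    <PSM peptide="LVNELTEFAK" charge="2" score="3.41"/>
    <PSM peptide="LVNEITEFAK" charge="2" score="1.07"/>
  </SpectrumResult>
  <SpectrumResult spectrumID="scan=1022">
    <PSM peptide="YLYEIAR" charge="2" score="2.65"/>
  </SpectrumResult>
</IdentificationResults>
"""


def best_psms(xml_text, min_score):
    """Return (spectrum ID, peptide, score) for the top-scoring PSM of each
    spectrum, keeping only those whose score is at least `min_score`."""
    root = ET.fromstring(xml_text)
    kept = []
    for result in root.iter("SpectrumResult"):
        psms = result.findall("PSM")
        if not psms:
            continue  # spectrum with no candidate peptide
        top = max(psms, key=lambda psm: float(psm.get("score")))
        score = float(top.get("score"))
        if score >= min_score:
            kept.append((result.get("spectrumID"), top.get("peptide"), score))
    return kept


if __name__ == "__main__":
    for spectrum_id, peptide, score in best_psms(SAMPLE, min_score=2.0):
        print(spectrum_id, peptide, score)
```

Because the input is structured XML, the same traversal works regardless of which search engine produced the file, which is the practical benefit of a shared identification format. Real pipelines differ mainly in how the threshold is chosen and how the surviving peptides are assembled into proteins, which is precisely where standard procedures diverge.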
2.2.2. Impediments to Integrating Systems Biology Data
As technologies continue to improve in quality, throughput, and specialization, propelling the pursuit of scientific knowledge into new frontiers, it is not surprising that newly acquired information does not immediately suggest clear-cut mathematical models, agree completely with all existing theories, or self-organize in a way that can easily be documented, stored, and accessed. In fact, there are three main impediments to integrating systems biology data. First, with each new discovery, accumulation of information, assimilation of theories, and unexpected breakthrough, the meaning and context of scientific ideas keep changing. Some of these revelations are truly revolutionary, while others take time to be refined, accepted, and incorporated into our understanding of how things work. Second, the progression of science requires work to provide context and details. A single discovery cannot stand on its own; once it has been recognized and adopted, the scientific community has to work together to understand its meaning and implications. Even from a purely technical point of view, considerable effort goes into integrating multiple analyses and services and into storing information in a way that not only makes sense within the current architecture but is also amenable to extraction and adaptation in anticipation of future expansion. The instrumentation, file formats, and data recorded keep changing faster than people can accommodate. Third, people are motivated to further scientific research by a number of different drivers, including funding opportunities, reward systems, popular science topics, and personal interests. Most of this work requires extensive collaboration across disciplines, institutions, and cultures, and figuring out how to coordinate seamlessly with others can be a challenge in itself. With the rise of the internet and cloud infrastructures improving data accessibility, computing resources, and methods of communication, some of these hurdles are easier to overcome than others. As we continue to move forward, it is becoming ever more important for computational biologists to keep analyzing data in its proper, albeit dynamic, context and to implement modular, scalable software solutions.