3.5 Databases and database design
3.5.2 Queries and Views
To access and update the data in a table, queries have to be written. The ANSI (American National Standards Institute) and the ISO (International Organization for Standardization) have defined a language called Structured Query Language (SQL), which is used by many different DBMS (e.g, MySQL and SQLite). SQL itself is descriptive, which means that the user describes what he wants to retrieve and does not have to write the actual code that the DBMS
uses to calculate the result82. A simple query to retrieve the
Name
of all proteins with theProtein_ID
equal to "P27361", would beSELECT Name FROM Protein WHERE Protein_ID == "P27361"
Databases and database design
The
SELECT
defines the attribute (Name
) to query. TheFROM
describes from which table (Protein
), and theWHERE
clause defines criteria, which can be used to filter the result.To describe more complex structures, like peptide-protein relationships, associations are needed (Table 3.1). These associations are defined by shared attributes. In our example, a
peptide table could have the attribute
Protein_ID
used as a foreign key, which would connectthe peptide table with the protein table. A query to select attributes from both tables would be:
SELECT Protein.Name, Peptide.Sequence FROM Protein
INNER JOIN Peptide ON Protein.Protein_ID == Peptide.Protein_Protein_ID
WHERE Protein_ID == "P27361"
The
INNER JOIN
describes the relationship between the peptide and the protein table andconnects the two tables using the
Protein_ID
.Table 3.1: Table to describe peptides in the database. The foreign keyProtein_IDcan
be used to express the association to the protein table.
Peptide
Primary Key Sequence Protein ID
1 EALAHPYL P27361
2 AAANFRRL P27453
... ... ...
A design following the normal forms leads to queries with many joins. Writing these large queries can be inconvenient and often is redundant. Therefore, most DBMS allow creating views. They are virtual or logical tables defined by a select statement. A view may for ex- ample contain all peptides found in a specific sample. Instead of having to write the full
query with all joins, the view allows a selection in a table-wise fashion (e.g.
SELECT * FROM
sample_has_peptides
). This could be also achieved by creating a materialized table with the content of the select. However, if the database is updated, the table would have to be updated as well. In contrast, views do not store the data but select them directly from the original tables. Therefore, they do not need to be updated.Chapter 4
The HLA Ligand Atlas
4.1
Introduction and motivation
The analysis of HLA ligands helps to understand the immune system and the gained knowl-
edge can be used to find new treatment targets69. These targets can then be used to develop
new therapies against diseases like cancer. Unfortunately, the identification of such new tar- gets needs a large number of samples for which the immunopeptidome has to be obtained, analyzed, and stored. Because one HLA immunopeptidome experiment for one sample can result in the identification of up to 5,000 different peptides with many parameters such as identification software specific scores or the number of Peptide-Spectrum Matches (PSMs),
the storage and the analysis can be laborious15. Traditionally, scientists analyze and search
this immunopeptidome data using Excel or other easy-to-use spreadsheet programs. However, this method is not feasible if, as in this presented study, a large number of different samples is measured. Therefore, we developed a user-friendly web interface, which allows fast and simple access to the provided data, to support biologists and biochemists in their analysis of such large datasets. This interface allows wet-lab scientist to search and perform frequent meta-analyses on the contained data using only their web browser and thus provides access to the immunopeptidome dataset for scientists regardless of computer science skills.
The data is stored in a database, in our case, a MySQL database, to allow fast queries of the data. The standard way to access such a database is via SQL queries. However, these queries require knowledge of the underlying database schemata and their usage can be difficult for the
average wet-lab scientist. Databases, such as the human protein atlas126or ProteomicsDB143,
demonstrate that such large databases can be searched by everyone via a web interface. There- fore, we implemented the HLA Ligand Atlas, a web interface to the collected immunopeptidome data.
Other websites and databases like the SysteMHC Atlas114, The Cancer Immunome Atlas
4. The HLA Ligand Atlas
ternational Cancer Genome Consortium (ICGC
https://icgc.org/
) already provide accessto cancer specific data. TCGA and ICGC present information about the genomic, transcrip- tomic and epigenomic changes in cancer. This information can be obtained either as already analyzed data, like somatic mutations and overexpression of genes, or raw data, like FASTQ or BAM files. TCIA contains information on immune-related gene sets, cellular composition of immune infiltrates, HLA types, neoantigens and cancer-germline antigens, as well as tumor heterogeneity. These information are obtained using computational genomic methods and
are based on data provided by TCGA, Van Allen et al.127, and Hugo et al.54. Similar to our
HLA Ligand Atlas, the SysteMHC Atlas provides access to peptides presented by MHC/HLA.
It allows to query proteins and peptides, and to filter these results for the binding top allele and MHC class. In addition, it allows to download spectral libraries for peptides sorted by their top binding allele. SysteMHC Atlas contains 16 published human immunopeptidomics
projects/datasets and 7 unpublished datasets. All 23 datasets are reprocessed using the raw
data and a standardized pipeline (Comet39, X!Tandem31, PeptideProphet79, and iProphet116).
The aim of the SysteMHC Atlas is to gather and provide information and data across multiple heterogeneous projects. In contrast, the HLA Ligand Atlas is focused on one homogeneous project, in which all samples are prepared using the same protocol and are analyzed on one mass spectrometer. Furthermore, the HLA Ligand Atlas provides additional meta data, like the tissue or HLA type for each sample. This information can be queried, which allows answering more detailed questions on the immunopeptidome. On the web interface level, the SysteMHC Atlas only provides basic queries for proteins and peptides, whereas we developed an interface which provides more query options and more statistics.
This chapter first describes the design and architecture of the database and the web inter- face. In the next part, its implementation is described, followed by a description of the data available trough the database and how the data was processed. The last section summarizes the results and provides an outlook on the topic.