• No results found

3.5 Databases and database design

3.5.2 Queries and Views

To access and update the data in a table, queries have to be written. The ANSI (American National Standards Institute) and the ISO (International Organization for Standardization) have defined a language called Structured Query Language (SQL), which is used by many different DBMS (e.g, MySQL and SQLite). SQL itself is descriptive, which means that the user describes what he wants to retrieve and does not have to write the actual code that the DBMS

uses to calculate the result82. A simple query to retrieve the

Name

of all proteins with the

Protein_ID

equal to "P27361", would be

SELECT Name FROM Protein WHERE Protein_ID == "P27361"

Databases and database design

The

SELECT

defines the attribute (

Name

) to query. The

FROM

describes from which table (

Protein

), and the

WHERE

clause defines criteria, which can be used to filter the result.

To describe more complex structures, like peptide-protein relationships, associations are needed (Table 3.1). These associations are defined by shared attributes. In our example, a

peptide table could have the attribute

Protein_ID

used as a foreign key, which would connect

the peptide table with the protein table. A query to select attributes from both tables would be:

SELECT Protein.Name, Peptide.Sequence FROM Protein

INNER JOIN Peptide ON Protein.Protein_ID == Peptide.Protein_Protein_ID

WHERE Protein_ID == "P27361"

The

INNER JOIN

describes the relationship between the peptide and the protein table and

connects the two tables using the

Protein_ID

.

Table 3.1: Table to describe peptides in the database. The foreign keyProtein_IDcan

be used to express the association to the protein table.

Peptide

Primary Key Sequence Protein ID

1 EALAHPYL P27361

2 AAANFRRL P27453

... ... ...

A design following the normal forms leads to queries with many joins. Writing these large queries can be inconvenient and often is redundant. Therefore, most DBMS allow creating views. They are virtual or logical tables defined by a select statement. A view may for ex- ample contain all peptides found in a specific sample. Instead of having to write the full

query with all joins, the view allows a selection in a table-wise fashion (e.g.

SELECT * FROM

sample_has_peptides

). This could be also achieved by creating a materialized table with the content of the select. However, if the database is updated, the table would have to be updated as well. In contrast, views do not store the data but select them directly from the original tables. Therefore, they do not need to be updated.

Chapter 4

The HLA Ligand Atlas

4.1

Introduction and motivation

The analysis of HLA ligands helps to understand the immune system and the gained knowl-

edge can be used to find new treatment targets69. These targets can then be used to develop

new therapies against diseases like cancer. Unfortunately, the identification of such new tar- gets needs a large number of samples for which the immunopeptidome has to be obtained, analyzed, and stored. Because one HLA immunopeptidome experiment for one sample can result in the identification of up to 5,000 different peptides with many parameters such as identification software specific scores or the number of Peptide-Spectrum Matches (PSMs),

the storage and the analysis can be laborious15. Traditionally, scientists analyze and search

this immunopeptidome data using Excel or other easy-to-use spreadsheet programs. However, this method is not feasible if, as in this presented study, a large number of different samples is measured. Therefore, we developed a user-friendly web interface, which allows fast and simple access to the provided data, to support biologists and biochemists in their analysis of such large datasets. This interface allows wet-lab scientist to search and perform frequent meta-analyses on the contained data using only their web browser and thus provides access to the immunopeptidome dataset for scientists regardless of computer science skills.

The data is stored in a database, in our case, a MySQL database, to allow fast queries of the data. The standard way to access such a database is via SQL queries. However, these queries require knowledge of the underlying database schemata and their usage can be difficult for the

average wet-lab scientist. Databases, such as the human protein atlas126or ProteomicsDB143,

demonstrate that such large databases can be searched by everyone via a web interface. There- fore, we implemented the HLA Ligand Atlas, a web interface to the collected immunopeptidome data.

Other websites and databases like the SysteMHC Atlas114, The Cancer Immunome Atlas

4. The HLA Ligand Atlas

ternational Cancer Genome Consortium (ICGC

https://icgc.org/

) already provide access

to cancer specific data. TCGA and ICGC present information about the genomic, transcrip- tomic and epigenomic changes in cancer. This information can be obtained either as already analyzed data, like somatic mutations and overexpression of genes, or raw data, like FASTQ or BAM files. TCIA contains information on immune-related gene sets, cellular composition of immune infiltrates, HLA types, neoantigens and cancer-germline antigens, as well as tumor heterogeneity. These information are obtained using computational genomic methods and

are based on data provided by TCGA, Van Allen et al.127, and Hugo et al.54. Similar to our

HLA Ligand Atlas, the SysteMHC Atlas provides access to peptides presented by MHC/HLA.

It allows to query proteins and peptides, and to filter these results for the binding top allele and MHC class. In addition, it allows to download spectral libraries for peptides sorted by their top binding allele. SysteMHC Atlas contains 16 published human immunopeptidomics

projects/datasets and 7 unpublished datasets. All 23 datasets are reprocessed using the raw

data and a standardized pipeline (Comet39, X!Tandem31, PeptideProphet79, and iProphet116).

The aim of the SysteMHC Atlas is to gather and provide information and data across multiple heterogeneous projects. In contrast, the HLA Ligand Atlas is focused on one homogeneous project, in which all samples are prepared using the same protocol and are analyzed on one mass spectrometer. Furthermore, the HLA Ligand Atlas provides additional meta data, like the tissue or HLA type for each sample. This information can be queried, which allows answering more detailed questions on the immunopeptidome. On the web interface level, the SysteMHC Atlas only provides basic queries for proteins and peptides, whereas we developed an interface which provides more query options and more statistics.

This chapter first describes the design and architecture of the database and the web inter- face. In the next part, its implementation is described, followed by a description of the data available trough the database and how the data was processed. The last section summarizes the results and provides an outlook on the topic.