PHOSIDA – Phosphorylation Site Database
MAPU 2.0: Max-Planck Unified Proteome Database
5.2 Implementation of MAPU 2
The initial content and the original format of MAPU have been described by Zhang et al. (2007). The general format of the database has changed drastically: the previous version was divided into several sub-databases, each containing a discrete proteomic dataset. The new version (MAPU 2.0; Gnad et al., in press) unifies all sub-databases by reassigning the identified peptides of each experiment, along with their corresponding data, to protein entries of an updated database version. This allows the organism-specific retrieval of proteomic data associated with various cell types and organelles:
The user can query the database organism-specifically by protein name, protein description, gene symbol, accession number in the database used for identification (such as the International Protein Index (IPI)), SwissProt accession identifier, protein sequence or peptide sequence (Figure 5.1, left panel).
If more than one protein entry matches the submitted query string, MAPU 2.0 lists all relevant proteins and marks in red those for which peptides were identified in the specified sub-proteomes (Figure 5.1, middle panel). Clicking on one of the red-highlighted entries leads to the result page (Figure 5.1, right panel). If there is only one match to the query, the user is guided directly to the result page of the protein. The left panel of the result page displays all cell types and tissues that have been investigated. If the given protein was detected in a specific project, the corresponding button is highlighted (Figure 5.1, right panel). Otherwise, the image of the given tissue or cell type is shown in very light colors, indicating the absence of the protein of interest.
Figure 5.1: The Max-Planck Unified Proteome Database 2.0
The web user can search for any protein of interest via accession numbers, gene symbols, gene name, protein description, peptide sequence, or protein name. The final result page illustrates the occurrence of the specified protein in certain tissues or organelles along with general annotations.
Clicking on one of the buttons on the left panel results in the complete listing of all peptides that have been measured in the selected cell type along with associated data such as Mascot scores or PTM scores (Figure 5.2).
Figure 5.2: Listing of peptides that were identified in a given tissue in MAPU 2.0
The peptide-to-protein assignment represents one of the main problems of MS data, since a peptide might occur in several proteins, usually isoforms or truncated versions of the same gene product. Multiple occurrences of a peptide sequence can lead to ambiguous protein assignments. This can partially be resolved by noting that a given peptide sequence more likely corresponds to the candidate protein that shows the highest total number of identified peptides. MAPU addresses this issue by color-highlighting the listed peptides: green indicates that the peptide sequence is found exclusively in the selected protein of interest, whereas blue indicates that there is another protein entry that contains the peptide and shows the same total number of identified peptides. Red points to another protein that shows a higher total number of detected peptides and thus represents the protein more likely to be present in the sample. If the user moves the mouse pointer over one of the corresponding 'occurrences' buttons, a blue box pops up showing all protein entries that contain the given peptide, along with the total number of peptides identified for each of them (Figure 5.2). This fundamental principle of visualizing the ambiguity of protein assignments is also used in PHOSIDA, the phosphorylation site database (Chapter 4).
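The color scheme described above can be sketched in a few lines of Python. This is only an illustration and not the MAPU implementation, which is written in C#: the function name `classify_peptides`, the input layout (a dictionary from protein identifier to the set of peptides identified for it) and the label `shared_lower` are assumptions. The last label covers the case in which every other candidate protein has fewer identified peptides, a case the text does not assign a color to.

```python
from collections import defaultdict


def classify_peptides(protein_peptides, protein_id):
    """Return {peptide: label} for the peptides identified in protein_id.

    protein_peptides maps a protein identifier to the set of identified
    peptide sequences contained in that protein (hypothetical layout).
    """
    # Total number of identified peptides per protein entry.
    totals = {pid: len(peps) for pid, peps in protein_peptides.items()}

    # Reverse index: peptide sequence -> all protein entries containing it.
    owners = defaultdict(set)
    for pid, peps in protein_peptides.items():
        for pep in peps:
            owners[pep].add(pid)

    own_total = totals[protein_id]
    labels = {}
    for pep in protein_peptides[protein_id]:
        other_totals = [totals[o] for o in owners[pep] if o != protein_id]
        if not other_totals:
            labels[pep] = "green"         # unique to the selected protein
        elif max(other_totals) > own_total:
            labels[pep] = "red"           # a better-supported candidate exists
        elif max(other_totals) == own_total:
            labels[pep] = "blue"          # an equally supported candidate exists
        else:
            labels[pep] = "shared_lower"  # assumption: case not covered in the text
    return labels


# Toy data with made-up identifiers and sequences.
example = {
    "P1": {"AAK", "CCK", "DDK"},
    "P2": {"AAK", "EEK", "FFK"},
    "P3": {"CCK", "GGK", "HHK", "IIK"},
}
print(classify_peptides(example, "P1"))
```

For the toy data, the peptide shared with the equally supported entry `P2` is labeled blue, the one shared with the better-supported entry `P3` is labeled red, and the remaining peptide is labeled green.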
If the experimental design of a given project also focused on the organellar localization of proteins, all organelles in which the protein of interest was detected are listed.
In addition to the illustration of associated cell types and organelles along with the measured peptides, general information about the protein is provided. Besides the protein description and the full protein sequence, the corresponding GO identifiers are listed; they link to the Gene Ontology web site, which reports the full description of the selected annotation. Furthermore, the annotations of each entry include the PubMed references and general features such as active sites, motifs, domains, or signaling sites derived from SwissProt (Figure 5.3). Since several entries covering various isoforms or splice variants may correspond to one SwissProt entry, we aligned the protein sequence of each SwissProt entry with that of the corresponding entry of the database that was used for identification, which is usually the IPI database. We used BLASTP to align the protein sequences. The main purpose of this extensive alignment approach is to derive the exact sequence positions, within the protein sequences of the other database, of the relevant protein features that are annotated in SwissProt.
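How a feature position can be transferred through a pairwise alignment is sketched below. The sketch assumes that the two gapped alignment strings and the start coordinates of the aligned region are already available, for example parsed from a BLASTP report. The function name `map_position` and its parameters are hypothetical, and the actual procedure used for MAPU 2.0 may differ.

```python
def map_position(aligned_query, aligned_subject, query_start, subject_start, query_pos):
    """Map a 1-based residue position of the query onto the subject.

    aligned_query / aligned_subject: gapped alignment strings of equal length
        ('-' marks a gap), e.g. the SwissProt and IPI sequence, respectively.
    query_start / subject_start: 1-based position of the first aligned residue.
    Returns None if query_pos lies outside the aligned region or faces a gap.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned strings must have equal length")

    q = query_start - 1    # position of the last query residue consumed
    s = subject_start - 1  # position of the last subject residue consumed
    for q_char, s_char in zip(aligned_query, aligned_subject):
        if q_char != "-":
            q += 1
        if s_char != "-":
            s += 1
        if q_char != "-" and q == query_pos:
            # The feature residue is aligned either to a residue or to a gap.
            return s if s_char != "-" else None
    return None


# Toy alignment: the query starts at residue 1, the subject at residue 11.
print(map_position("MKT-AYIA", "MKTLAY-A", 1, 11, 4))
```

In the toy alignment, query residue 4 lies behind an insertion in the subject and is therefore mapped to subject residue 15, whereas query residue 6 faces a gap and cannot be mapped.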
Figure 5.3: Protein annotations in MAPU 2.0 based on SwissProt cross-references
If the experiment is quantitative, the median of the quantitative values of all measured peptides assigned to a protein is taken to describe the quantitation of that protein (as provided by the MaxQuant output). A further essential difference from the previous database version is the underlying programming language. The new release is based exclusively on C# and the ASP.NET technology, so that it shares a class library with the implementation of PHOSIDA (Chapter 4).
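The median-based protein quantitation mentioned above amounts to the following calculation. In MAPU 2.0 the values are taken from the MaxQuant output, so this Python sketch only illustrates the principle; the function name `protein_ratio` and the convention that missing measurements are passed as `None` are assumptions.

```python
from statistics import median


def protein_ratio(peptide_ratios):
    """Median of the peptide-level values assigned to one protein.

    Missing measurements (None) are ignored; None is returned if no
    peptide of the protein was quantified.
    """
    values = [r for r in peptide_ratios if r is not None]
    return median(values) if values else None


# The single outlier (5.0) does not shift the protein-level value.
print(protein_ratio([0.8, 1.2, 5.0, None]))
```

Compared with the mean, the median is less sensitive to individual outlier peptides, which is a common reason for preferring it in this setting.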
Furthermore, the concepts and web applications of MAPU 2.0 and PHOSIDA are very similar. This is a great advantage for researchers who use both our in-house proteomic database (MAPU) and the phosphorylation site database (PHOSIDA). The similar web design also supports the idea of a corporate design for our group.
Additionally, each displayed web page includes a question mark button that leads to the help section of MAPU 2.0, which describes the format of the current page or explains how to use the web application. These help sections are also available via the 'background' section of MAPU 2.0, which contains, for instance, general descriptions of the experimental designs of the various projects. To allow the retrieval of sub-databases that could not be integrated into the new concept, a link to the old database version is provided. This is the case for the organellar database as well as the red blood cell database, as both datasets are exclusively protein-based and therefore cannot be mapped to MAPU 2.0 due to the lack of peptide information.
Next, we wished to use the proteomic data to annotate the genome. We extracted all measured peptides of each proteomic dataset and reassigned the given peptide sequences to gene transcripts that are annotated in the EnsEMBL database. We linked our in-house proteomic databases with the genome database in an efficient manner via the DAS/Proserver System (Finn et al., 2007). The basic concepts are explained in Chapter 8.
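As a rough illustration of the reassignment step, the sketch below locates each peptide in the protein translations of a set of transcripts by exact substring matching. This matching strategy, the function name `assign_peptides` and the transcript identifiers are assumptions made for the example only. The procedure actually used, including the DAS/Proserver integration, is described in Chapter 8.

```python
def assign_peptides(peptides, translations):
    """Locate peptides in transcript translations by exact matching.

    peptides: iterable of peptide sequences.
    translations: dict mapping a transcript identifier to its protein sequence.
    Returns {peptide: [(transcript_id, start, end), ...]} with 1-based,
    inclusive coordinates on the protein sequence.
    """
    hits = {}
    for pep in peptides:
        found = []
        for tid, protein in translations.items():
            start = protein.find(pep)
            while start != -1:
                found.append((tid, start + 1, start + len(pep)))
                # Continue the search to catch repeated occurrences.
                start = protein.find(pep, start + 1)
        hits[pep] = found
    return hits


# Made-up transcript identifiers and sequences.
translations = {"T1": "MKTAYIAKQR", "T2": "MKTAYLL"}
print(assign_peptides(["TAYIAK", "MKTAY"], translations))
```

A peptide that maps to several transcripts, such as the second one in the toy data, raises the same ambiguity at the transcript level that the color coding in Figure 5.2 visualizes at the protein level.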