
Chapter 3: Development of data-analysis workflows for glycoproteomics and glycomics

3.2 Methods

3.2.7 Glycoproteomics data analysis

A naïve glycopeptide hypothesis is a database of glycopeptides built from an in-silico enzymatic digest, with up to 𝑘 missed cleavages, of one or more reference sequences of putative glycoproteins. Every combination of the selected post-translational modifications is generated for each peptide and then combined with one or more N-glycan compositions. We used pyteomics [182] to provide our dictionary of proteolytic enzymes. Post-translational modifications may be either fixed (guaranteed to be present) or variable (occurring or not at each viable site), as defined by a subset of rules selected from a database such as Unimod [103] or ProteinProspector’s MS-Digest [32]. If a glycopeptide has more than one glycosylation site, combinations of glycans are considered, up to a limit of 𝑔 glycosylations on a single peptide. A hypothesis is only constructed for MS1 compositions, so the exact placement of variable modifications and glycosylations need not be determined at this stage. Only unique glycopeptides are considered for each reference protein.
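The construction described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's implementation: it hard-codes a trypsin-like cleavage rule instead of using the pyteomics enzyme dictionary, ignores post-translational modifications other than glycosylation, and only looks for sequons inside each peptide. The function names `digest`, `sequon_sites` and `naive_hypothesis`, the example protein and the glycan labels are all hypothetical.

```python
import re
from itertools import combinations_with_replacement


def digest(sequence, missed_cleavages=0):
    """Trypsin-like digest (assumed rule): cleave after K or R unless followed by P."""
    sites = [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]
    bounds = [0] + [s for s in sites if s < len(sequence)] + [len(sequence)]
    peptides = set()
    for i in range(len(bounds) - 1):
        # Join up to `missed_cleavages` additional adjacent pieces.
        for j in range(i + 1, min(i + 2 + missed_cleavages, len(bounds))):
            peptides.add(sequence[bounds[i]:bounds[j]])
    return peptides


def sequon_sites(peptide):
    """Start indices of N-X-S/T sequons (X is not P), overlaps included."""
    return [m.start() for m in re.finditer(r"(?=N[^P][ST])", peptide)]


def naive_hypothesis(proteins, glycans, k=1, g=2):
    """Unique (peptide, glycan multiset) pairs; attachment sites are not localized."""
    entries = set()
    for protein in proteins:
        for peptide in digest(protein, k):
            n_sites = len(sequon_sites(peptide))
            # Attach between 1 and min(g, n_sites) glycans to the peptide.
            for n in range(1, min(g, n_sites) + 1):
                for combo in combinations_with_replacement(sorted(glycans), n):
                    entries.add((peptide, combo))
    return entries


# Hypothetical input: one short protein and two glycan compositions.
hypothesis = naive_hypothesis(["AKNVTLRPGNGSK"], ["Hex5HexNAc2", "Hex5HexNAc4"], k=0)
for peptide, combo in sorted(hypothesis):
    print(peptide, combo)
```

Storing each entry as a peptide plus an unordered multiset of glycans mirrors the point that an MS1 hypothesis records compositions only, so two placements of the same glycans on the same peptide collapse into one entry.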

After construction, the hypothesis is searched against clustered LC-MS data as for glycomics. The results are used to construct an MS2 hypothesis.

3.2.7.1 MS1 Informed Hypothesis Building

An informed hypothesis begins with an mzIdentML file [278] and a library of glycan compositions, such as a hypothesis or a set of database search results from a glycomics experiment. Each protein and its associated peptidoforms are extracted from the mzIdentML file and filtered for just those which contain glycosylation sites according to biosynthetic rules, user specifications or other experimental information. All combinations of peptidoforms and glycans are then generated as in the naïve case. Because post-translational modifications and their locations have already been identified, no position information is removed. This may cause the informed hypothesis to be larger than the naïve hypothesis at the MS1 level, but each match at this level will translate to exactly one sequence searched at MS2 instead of one or more. The informed MS1 hypothesis is evaluated in the same way as the naïve one.

3.2.7.2 MS2 Naïve Hypothesis Building

After the MS1 database searches have been performed, an MS2 hypothesis can be constructed. Each hit glycopeptide composition with present but un-localized PTMs is expanded into a set of glycopeptide sequences in which PTM locations are exactly specified. This results in a combinatorial expansion of glycopeptide sequences, many of which may be indistinguishable without high coverage. This process includes positioning the glycosylation attachment site(s) for each isoform. We generated theoretical peptide b and y ion series for each sequence, as well as precursor stub ions and oxonium ions. A precursor stub ion is composed of the mass of the intact peptide backbone with portions of the N-glycan core attached. Oxonium ions are low-mass diagnostic saccharide ions generated by fragmentation of the glycan units on glycoconjugates during collisional dissociation, analogous to immonium ions generated by peptide dissociation.
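The combinatorial expansion of an un-localized composition into fully specified sequences can be illustrated as follows. This is a sketch only: the helper `localize`, its parameters and the example peptide are hypothetical, and it places a single generic glycan label rather than distinct glycans or other PTMs.

```python
from itertools import combinations


def localize(peptide, candidate_sites, n_glycans, label="HexNAc"):
    """Enumerate every placement of n_glycans on the candidate site indices."""
    sequences = []
    for chosen in combinations(candidate_sites, n_glycans):
        parts = []
        for index, residue in enumerate(peptide):
            parts.append(residue)
            if index in chosen:
                # Annotate the modified residue in place.
                parts.append("(" + label + ")")
        sequences.append("".join(parts))
    return sequences


# One glycan on a peptide with two candidate sites gives two sequences.
print(localize("NVTLRPGNGSK", [0, 7], 1))
```

With 𝑠 candidate sites and 𝑛 glycans the number of sequences is the binomial coefficient C(𝑠, 𝑛), which is why a single MS1 hit can translate into several MS2 candidates.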

3.2.7.3 MS2 Informed Hypothesis Building

Each glycopeptide matched at the MS1 level from an informed hypothesis is translated to exactly one glycopeptide sequence. Informed databases for integrated omics were generated as described previously [291], except that the only glycans included from the glycomics analysis were those with an aggregated abundance greater than twice the mean abundance of all glycan matches. Tandem mass spectra were identified by first recalculating the precursor ion monoisotopic mass and charge, followed by a database search procedure with a precursor mass tolerance of 10 ppm. For each glycopeptide which fell within the acceptable mass range, theoretical product ions were assigned for b, y, b plus HexNAc, y plus HexNAc, and the intact peptide plus incremental losses of saccharide units (known as “stub ions”). The tandem spectra were deconvolved and peaks were matched to each theoretical product ion with an error tolerance of 10 ppm, constructing a glycopeptide-spectrum match (GSM).
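The abundance filter applied to the glycomics results amounts to a simple threshold. A minimal sketch, assuming the aggregated abundances are available as a mapping from glycan composition to a number (the function name and the example values are hypothetical):

```python
from statistics import mean


def filter_glycans(abundances):
    """Keep glycans whose abundance exceeds twice the mean of all matches."""
    threshold = 2 * mean(abundances.values())
    return {name: value for name, value in abundances.items() if value > threshold}


# Hypothetical aggregated abundances from a glycomics search.
observed = {"Hex5HexNAc2": 10, "Hex5HexNAc4": 1, "Hex6HexNAc5": 1}
print(filter_glycans(observed))
```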

3.2.7.4 MS2 Database Search

For each sequence in the hypothesis, we compute fragment matches against each tandem spectrum whose precursor neutral mass is within a 𝑡₁ ppm mass error window of the sequence's mass. Theoretical fragment masses are compared against the neutral masses for each tandem spectrum peak with an error tolerance of 𝑡₂ ppm. For our datasets, 𝑡₁ = 10 and 𝑡₂ = 20. We first search for oxonium ions, which are evidence that the scan included a glycopeptide, and if none are found, no further matching is performed. Multiple glycopeptides regularly match the same spectrum, as in proteomics, so spectral evidence is assigned to the best matching cases. Under the simple scoring model, this is just the set of glycopeptides which match the most peaks in the spectrum.
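The search procedure can be sketched as a precursor filter, an oxonium-ion gate and a peak-count ranking. The code below is an illustration under stated assumptions rather than the pipeline's implementation: candidates are given as hypothetical `(name, neutral mass, fragment masses)` tuples, all masses are neutral masses, and the function names and the example values are made up for the demonstration.

```python
def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_peaks(theoretical_masses, peak_masses, tolerance_ppm):
    """Theoretical masses that have at least one peak within tolerance."""
    return [t for t in theoretical_masses
            if any(abs(ppm_error(p, t)) <= tolerance_ppm for p in peak_masses)]


def search_spectrum(candidates, precursor_mass, peak_masses, oxonium_masses,
                    t1=10.0, t2=20.0):
    """Return the candidate names that match the most peaks in one spectrum."""
    # Gate: without any oxonium ion the scan is not treated as a glycopeptide.
    if not match_peaks(oxonium_masses, peak_masses, t2):
        return []
    hits = []
    for name, mass, fragments in candidates:
        if abs(ppm_error(precursor_mass, mass)) <= t1:
            hits.append((name, len(match_peaks(fragments, peak_masses, t2))))
    if not hits:
        return []
    best = max(count for _, count in hits)
    return [name for name, count in hits if count == best]


# Illustrative values only; 203.0794 is the HexNAc residue mass.
oxonium = [203.0794]
peaks = [203.0795, 500.0, 700.0]
candidates = [("A", 2000.0, [500.0, 700.0]),
              ("B", 2000.01, [500.0, 900.0]),
              ("C", 2100.0, [500.0, 700.0])]
print(search_spectrum(candidates, 2000.0, peaks, oxonium))
```

Returning every candidate tied for the highest count, rather than a single winner, follows the statement that evidence is assigned to the set of best-matching glycopeptides.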

3.2.7.5 MS2 Scoring

Under a simple scoring regime, which does not assume any instrument-specific information, we built the score for each glycopeptide match from the set of all spectrum matches assigned to it. We computed three scores and combined them into a single summary 𝑀𝑆2𝑆𝑐𝑜𝑟𝑒 using user-parameterized weights that depend upon the amount of backbone information that was expected, as shown in Method 2.

For each GSM, a score based upon peptide backbone coverage and presence of stub ions was computed and scaled to be between 0.0 and 1.0. For each spectrum where multiple glycopeptides could be assigned, all GSMs tied for the highest score were added to a results set.

For each experiment, a forward-sequence (target) and a reverse-sequence-with-valid-sequon (decoy) database were searched, and a q-value without PIT, as described by Käll et al. [281], was computed for the paired result sets. Target spectra which had a q-value < 0.05 were selected for inclusion in the final reported results.
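One common way to compute such q-values from paired target and decoy scores is sketched below. It is an assumption-laden illustration, not the code used for the reported results: the FDR at a score threshold is taken as the number of decoys at or above the threshold divided by the number of targets at or above it, with no PIT correction, and the q-value of a score is the smallest FDR among all thresholds that still accept it. The function name and the example scores are hypothetical.

```python
def q_values(target_scores, decoy_scores):
    """Map each distinct target score to its q-value (no PIT correction)."""
    thresholds = sorted(set(target_scores))  # most permissive first
    q = {}
    running_min = float("inf")
    for t in thresholds:
        n_targets = sum(1 for s in target_scores if s >= t)
        n_decoys = sum(1 for s in decoy_scores if s >= t)
        fdr = n_decoys / n_targets
        # A score is accepted by every threshold at or below it, so its
        # q-value is the minimum FDR seen so far in ascending order.
        running_min = min(running_min, fdr)
        q[t] = running_min
    return q


targets = [0.9, 0.8, 0.6, 0.4]
decoys = [0.7, 0.3]
q = q_values(targets, decoys)
accepted = sorted(s for s in targets if q[s] < 0.05)
print(q, accepted)
```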

All of the glycomics and glycoproteomics data analysis, including database generation, matching and FDR estimation, was performed using components of a prototype GlycReSoft data analysis pipeline, developed by Joshua A. Klein.

MeanCoverage(Glycopeptide, MatchedFragments):
    bIonSeries := {0 for 𝑖 in [1, Length(Glycopeptide)]}
    yIonSeries := {0 for 𝑖 in [1, Length(Glycopeptide)]}
    For each Fragment in MatchedFragments:
        If IonSeries(Fragment) == 𝑏:
            bIonSeries[SequenceIndex(Fragment)] = 1
        Else If IonSeries(Fragment) == 𝑦:
            yIonSeries[SequenceIndex(Fragment)] = 1
    Coverage := bIonSeries + Reversed(yIonSeries)
    MeanCoverage := Sum({log₂(1 + Coverage[𝑖]) / (1.5 ∗ Length(Glycopeptide)) for 𝑖 in [1, Length(Glycopeptide)]})
    Return MeanCoverage

MeanHexNAcCoverage(Glycopeptide, MatchedFragments):
    bSeriesObserved := 0
    ySeriesObserved := 0
    For each Fragment in MatchedFragments:
        If IonSeries(Fragment) == 𝑏 && Glycosylated(Fragment):
            bSeriesObserved += 1
        Else If IonSeries(Fragment) == 𝑦 && Glycosylated(Fragment):
            ySeriesObserved += 1
    bSeriesEnumerated := Length(TheoreticalGlycopeptideSeries(Glycopeptide, 𝑏))
    ySeriesEnumerated := Length(TheoreticalGlycopeptideSeries(Glycopeptide, 𝑦))
    MeanHexNAcCoverage := (bSeriesObserved + ySeriesObserved) / (bSeriesEnumerated + ySeriesEnumerated)
    Return MeanHexNAcCoverage

MS2Score(Glycopeptide, MatchedFragments):
    MeanCoverage := MeanCoverage(Glycopeptide, MatchedFragments)
    MeanHexNAcCoverage := MeanHexNAcCoverage(Glycopeptide, MatchedFragments)
    BackboneScore := (MeanCoverage * 𝐵𝑎𝑐𝑘𝐵𝑜𝑛𝑒𝑊𝑒𝑖𝑔ℎ𝑡) + (MeanHexNAcCoverage * 𝐺𝑙𝑦𝑐𝑜𝑠𝑦𝑙𝑎𝑡𝑒𝑑𝑊𝑒𝑖𝑔ℎ𝑡)
    BackboneFactor := 1 − 𝑆𝑡𝑢𝑏𝐼𝑜𝑛𝑊𝑒𝑖𝑔ℎ𝑡
    StubIonScore := Min(CountStubIonsObserved(MatchedFragments) / 3.0, 3)
    MS2Score := (BackboneScore * BackboneFactor) + (StubIonScore * 𝑆𝑡𝑢𝑏𝐼𝑜𝑛𝑊𝑒𝑖𝑔ℎ𝑡)
    Return MS2Score

Method 2: Glycopeptide MS2 scoring scheme
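For readers who prefer executable code, the scheme in Method 2 can be transcribed into Python roughly as follows. This is a hedged sketch, not the GlycReSoft source: the `Fragment` record, the function signatures and the default weights are hypothetical, the theoretical series sizes are passed in as plain counts, the 1.5 × Length normalization is an assumed reading of the coverage formula, and the stub-ion term is transcribed as printed.

```python
from dataclasses import dataclass
from math import log2


@dataclass(frozen=True)
class Fragment:
    series: str                # "b", "y" or "stub"
    index: int = 0             # 1-based position along the peptide backbone
    glycosylated: bool = False


def mean_coverage(length, fragments):
    b = [0] * length
    y = [0] * length
    for f in fragments:
        if f.series == "b":
            b[f.index - 1] = 1
        elif f.series == "y":
            y[f.index - 1] = 1
    # Reverse the y series so both series index the same backbone positions.
    coverage = [bi + yi for bi, yi in zip(b, reversed(y))]
    return sum(log2(1 + c) / (1.5 * length) for c in coverage)


def mean_hexnac_coverage(fragments, n_theoretical_b, n_theoretical_y):
    observed = sum(1 for f in fragments
                   if f.series in ("b", "y") and f.glycosylated)
    enumerated = n_theoretical_b + n_theoretical_y
    return observed / enumerated if enumerated else 0.0


def ms2_score(length, fragments, n_theoretical_b, n_theoretical_y,
              backbone_weight=0.5, glycosylated_weight=0.5, stub_weight=0.2):
    # Default weights are illustrative; the text leaves them to the user.
    backbone = (mean_coverage(length, fragments) * backbone_weight
                + mean_hexnac_coverage(fragments, n_theoretical_b,
                                       n_theoretical_y) * glycosylated_weight)
    stubs = sum(1 for f in fragments if f.series == "stub")
    stub_score = min(stubs / 3.0, 3)
    return backbone * (1 - stub_weight) + stub_score * stub_weight


matched = [Fragment("b", 1), Fragment("b", 2), Fragment("y", 1),
           Fragment("y", 3, glycosylated=True),
           Fragment("stub"), Fragment("stub")]
print(ms2_score(4, matched, n_theoretical_b=2, n_theoretical_y=2))
```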

We implemented a false discovery rate (FDR) estimate using the target-decoy method [139,635] with a reversed target sequence decoy database, comparing 𝑀𝑆2𝑆𝑐𝑜𝑟𝑒 between target and decoy. In order to preserve as much biological context as possible, the N-glycan sequon in each decoy sequence retained its original orientation.

For example, a target sequence YPVLN(HexNAc)VTMPNNGK becomes GNNPMN(HexNAc)VTLVPYK. This produces deterministic, if reduced-entropy, decoys, which is desirable for reproducibility [197]. To accommodate the assumption that the majority of the sequences in even a small combinatorial database are unlikely to be present, we used a “percent incorrect target” adjustment to the target-decoy ratio calculation and also reported a q-value for each glycopeptide match as described by Käll and colleagues [281].
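The decoy construction can be reproduced for the example sequence with a short script. The rules below are inferred from that single example and are therefore assumptions: the C-terminal residue stays in place, the remaining residues are reversed together with their modifications, and each non-overlapping N-X-S/T sequon found in the target is flipped back to its original orientation. Sequons that include the C-terminal residue are not handled, and the function names are hypothetical.

```python
import re

# One residue letter optionally followed by a parenthesized modification.
TOKEN = re.compile(r"([A-Z])(\([^)]*\))?")


def parse(sequence):
    """Split an annotated sequence into (residue, modification) pairs."""
    return [(residue, mod) for residue, mod in TOKEN.findall(sequence)]


def make_decoy(sequence):
    tokens = parse(sequence)
    body, c_term = tokens[:-1], tokens[-1:]
    residues = "".join(residue for residue, _ in body)
    # Non-overlapping sequon start positions in the target body.
    starts = [m.start() for m in re.finditer(r"N[^P][ST]", residues)]
    reversed_body = body[::-1]
    n = len(body)
    for start in starts:
        # Position of the reversed sequon; flip it back to N-X-S/T order.
        lo = n - 1 - (start + 2)
        reversed_body[lo:lo + 3] = reversed_body[lo:lo + 3][::-1]
    return "".join(residue + mod for residue, mod in reversed_body + c_term)


print(make_decoy("YPVLN(HexNAc)VTMPNNGK"))
```

Because the transformation uses no random shuffling, the same target always yields the same decoy, which is the reproducibility property noted in the text.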

In document Integrating glycomics, proteomics and glycoproteomics to understand the structural basis for influenza a virus evolution and glycan mediated immune interactions (Page 153-158)