Chapter 3: Development of data-analysis workflows for glycoproteomics and glycomics
3.2 Methods
3.2.7 Glycoproteomics data analysis
A naïve glycopeptide hypothesis is a database of glycopeptides built from an in-silico enzymatic digest, with up to a specified number of missed cleavages, of one or more reference sequences of putative glycoproteins, with all combinations of selected post-translational modifications, combined with one or more N-glycan compositions. We used pyteomics [182] to provide our dictionary of proteolytic enzymes. Post-translational modifications may be either fixed (guaranteed to be present) or variable (occurring or not at each viable site), as defined by a subset of rules selected from databases such as Unimod [103] or ProteinProspector’s MS-Digest [32]. If a glycopeptide has more than one glycosylation site, combinations of glycans are considered, up to a specified limit on the number of glycosylations on a single peptide. A hypothesis is only constructed for MS1 compositions, so the placement of variable modifications and glycosylations need not be determined. Only unique glycopeptides are considered for each reference protein.
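The construction above can be sketched in a few lines. This is a minimal illustration only, not the pipeline's implementation: it assumes a trypsin-like rule written as a regular expression instead of the pyteomics enzyme dictionary, ignores post-translational modifications other than glycosylation, and only counts sequons that lie entirely within a peptide. The function names `digest` and `naive_hypothesis`, the toy sequence, and the glycan labels are hypothetical.

```python
import re
from itertools import combinations_with_replacement

# Assumed trypsin-like rule: cut C-terminal to K or R unless followed by P.
CLEAVAGE = re.compile(r"[KR](?!P)")
# N-glycosylation sequon N-X-S/T with X != P (lookahead so overlaps are counted).
SEQUON = re.compile(r"(?=N[^P][ST])")


def digest(sequence, max_missed_cleavages):
    """Return the unique peptides with up to `max_missed_cleavages` missed cleavages."""
    cuts = sorted({0, len(sequence)} | {m.end() for m in CLEAVAGE.finditer(sequence)})
    pieces = [sequence[a:b] for a, b in zip(cuts, cuts[1:])]
    peptides = set()
    for start in range(len(pieces)):
        for missed in range(max_missed_cleavages + 1):
            if start + missed < len(pieces):
                peptides.add("".join(pieces[start:start + missed + 1]))
    return peptides


def naive_hypothesis(sequence, glycans, max_missed_cleavages=2, max_glycosylations=1):
    """Pair each sequon-bearing peptide with unordered glycan combinations.

    Glycan placement is not recorded, matching the MS1 composition level.
    """
    hypothesis = set()
    for peptide in digest(sequence, max_missed_cleavages):
        n_sites = len(SEQUON.findall(peptide))
        for k in range(1, min(n_sites, max_glycosylations) + 1):
            for combo in combinations_with_replacement(sorted(glycans), k):
                hypothesis.add((peptide, combo))
    return hypothesis


if __name__ == "__main__":
    for entry in sorted(naive_hypothesis("MKNVTAKPEPR", {"HexNAc2Hex5", "HexNAc4Hex5"}, 1)):
        print(entry)
```

Storing the result as a set keeps only unique glycopeptides per reference protein, as described above.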
After construction, the hypothesis is searched against clustered LC-MS data as for glycomics. The results are used to construct an MS2 hypothesis.
3.2.7.1 MS1 Informed Hypothesis Building
An informed hypothesis begins with an mzIdentML file [278] and a library of glycan compositions, such as a hypothesis or a set of database search results from a glycomics experiment. Each protein and its associated peptidoforms are extracted from the mzIdentML file and filtered for just those which contain glycosylation sites according to biosynthetic rules, user specifications, or other experimental information. All combinations of peptidoforms and glycans are then generated as in the naïve case. Because post-translational modifications and their locations have already been identified, no position information is removed. This may cause the informed hypothesis to be larger than the naïve hypothesis at the MS1 level, but each match at this level will translate to exactly one sequence searched at MS2 instead of one or more. The informed MS1 hypothesis is evaluated in the same way as the naïve one.
3.2.7.2 MS2 Naïve Hypothesis Building
After the MS1 database searches have been performed, an MS2 hypothesis can be constructed. Each hit glycopeptide composition with present but un-localized PTMs is expanded into a set of glycopeptide sequences in which PTM locations are exactly specified. This results in a combinatorial expansion of glycopeptide sequences, many of which may be indistinguishable without high coverage. This process includes positioning the glycosylation attachment site(s) for each isoform. We generated theoretical peptide b and y ion series for each sequence, as well as precursor stub ions and oxonium ions. A precursor stub ion is composed of the mass of the intact peptide backbone with portions of the N-glycan core attached. Oxonium ions are low-mass diagnostic saccharide ions generated by fragmentation of the glycan units on glycoconjugates during collisional dissociation, analogous to immonium ions generated by peptide dissociation.
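As an illustration of the theoretical ions described above, the following sketch computes singly protonated b and y ions, stub ions, and a few oxonium ions for an unmodified peptide sequence. It is a simplified example under stated assumptions: HexNAc-bearing backbone fragments and other modifications are omitted, the stub series is taken to be the stepwise build-up of the N-glycan core (HexNAc2Hex3), and the names `b_y_ions`, `stub_ions` and `OXONIUM_IONS` are hypothetical.

```python
# Monoisotopic masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276
HEXNAC = 203.079373  # N-acetylhexosamine residue
HEX = 162.052824     # hexose residue

# Singly charged oxonium ions used as glycan diagnostics.
OXONIUM_IONS = {
    "Hex": HEX + PROTON,
    "HexNAc": HEXNAC + PROTON,
    "HexNAc-Hex": HEXNAC + HEX + PROTON,
}


def b_y_ions(peptide):
    """Return (b, y) lists of singly protonated m/z values, b1..b(n-1) and y1..y(n-1)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_series, y_series = [], []
    for i in range(1, len(masses)):
        b_series.append(sum(masses[:i]) + PROTON)
        y_series.append(sum(masses[-i:]) + WATER + PROTON)
    return b_series, y_series


def stub_ions(peptide):
    """Intact peptide plus successive portions of the N-glycan core (assumed series)."""
    peptide_mz = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON
    core_steps = [(1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]  # (HexNAc, Hex) counts
    return [peptide_mz + n * HEXNAC + h * HEX for n, h in core_steps]


if __name__ == "__main__":
    b, y = b_y_ions("PEPTIDE")
    print([round(m, 4) for m in b])
    print([round(m, 4) for m in y])
    print([round(m, 4) for m in stub_ions("PEPTIDE")])
```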
3.2.7.3 MS2 Informed Hypothesis Building
Each glycopeptide matched at the MS1 level from an informed hypothesis is translated to exactly one glycopeptide sequence. Informed databases for integrated omics were generated as described previously [291], except that only glycans from the glycomics analysis with an aggregated abundance greater than twice the mean abundance of all glycan matches were included. Tandem mass spectra were identified by first recalculating the precursor ion monoisotopic mass and charge, followed by a database search procedure with a precursor mass tolerance of 10 ppm. For each glycopeptide which fell within the acceptable mass range, theoretical product ions for b, y, b plus HexNAc, y plus HexNAc, and intact peptide plus incremental losses of saccharide units (known as “stub ions”) were assigned. The tandem spectra were deconvolved and peaks were matched for each theoretical product ion with an error tolerance of 10 ppm, constructing a glycopeptide-spectrum match (GSM).
3.2.7.4 MS2 Database Search
For each sequence in the hypothesis, and for each tandem spectrum whose precursor neutral mass is within a mass error window of t1 ppm, we compute fragment matches. Theoretical fragment masses are compared against the neutral masses for each tandem spectrum peak with an error tolerance of t2 ppm. For our datasets, t1 = 10 and t2 = 20. We first search for oxonium ions, which are evidence that the scan included a glycopeptide, and if none are found, no further matching is performed. Multiple glycopeptides regularly match the same spectrum, as in proteomics, so spectral evidence is assigned to the best matching cases. Under the simple scoring model, this is just the set of glycopeptides which match the most peaks in the spectrum.
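The search logic of this section can be sketched as follows. This is an illustrative outline, not the pipeline's code: candidates are passed in as a plain dictionary, all masses are assumed to be expressed in the same convention as the peak list, and the function names and the `candidates` layout are hypothetical. The defaults mirror the tolerances quoted above.

```python
def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def matched_peaks(theoretical_masses, peaks, tolerance_ppm):
    """Return the set of observed peaks within tolerance of any theoretical mass."""
    return {
        peak
        for peak in peaks
        for mass in theoretical_masses
        if abs(ppm_error(peak, mass)) <= tolerance_ppm
    }


def search_spectrum(precursor_mass, peaks, candidates, oxonium_masses,
                    t1=10.0, t2=20.0):
    """Return the names of the best-matching candidates for one spectrum.

    `candidates` maps a name to (precursor mass, list of fragment masses).
    """
    # Gate: without any oxonium ion the scan is not treated as a glycopeptide.
    if not matched_peaks(oxonium_masses, peaks, t2):
        return []
    counts = {}
    for name, (mass, fragments) in candidates.items():
        if abs(ppm_error(precursor_mass, mass)) <= t1:
            counts[name] = len(matched_peaks(fragments, peaks, t2))
    if not counts:
        return []
    best = max(counts.values())
    # Ties are all kept, since several glycopeptides may explain one spectrum.
    return sorted(name for name, n in counts.items() if n == best)


if __name__ == "__main__":
    peaks = [204.0867, 300.0, 400.0, 500.0]
    candidates = {
        "A": (1000.0, [300.0, 400.0]),
        "B": (1000.005, [300.0, 600.0]),
        "C": (1001.0, [300.0, 400.0, 500.0]),  # outside the precursor window
    }
    print(search_spectrum(1000.0, peaks, candidates, [204.086649]))
```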
3.2.7.5 MS2 Scoring
Under a simple scoring regime, which does not assume any instrument-specific information, we built the score for each glycopeptide match from the set of all spectrum matches assigned to it. We computed three scores and combined them into a single summary MS2 Score following user-parameterized weights depending upon the amount of backbone information that was expected, as shown in Method 2.
For each GSM, a score based upon peptide backbone coverage and presence of stub ions was computed and scaled to be between 0.0 and 1.0. For each spectrum where multiple glycopeptides could be assigned, all GSMs tied for the highest score were added to a results set.
For each experiment, a forward-sequence (target) and a reverse-sequence-with-valid-sequon (decoy) database were searched, and a q-value without PIT, as described by Käll et al. [281], was computed for the paired result sets. Target spectra which had a q-value < 0.05 were selected for inclusion in the final reported results.
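A q-value of this kind can be sketched as below. This is a generic illustration of a target-decoy q-value without a PIT correction, not the exact procedure of [281] or of the pipeline: the FDR at a score threshold is estimated as the number of decoy matches at or above the threshold divided by the number of target matches at or above it, and the q-value of a score is the smallest FDR over all thresholds that would still accept it. The function name `q_values` and the toy scores are hypothetical.

```python
def q_values(target_scores, decoy_scores):
    """Map each distinct target score to its q-value (no PIT correction)."""
    q_by_score = {}
    running_min = float("inf")
    # Walk thresholds from most permissive to most stringent so that the
    # running minimum covers every threshold that accepts the current score.
    for threshold in sorted(set(target_scores)):
        n_targets = sum(1 for s in target_scores if s >= threshold)
        n_decoys = sum(1 for s in decoy_scores if s >= threshold)
        running_min = min(running_min, n_decoys / n_targets)
        q_by_score[threshold] = min(running_min, 1.0)
    return q_by_score


if __name__ == "__main__":
    targets = [0.9, 0.8, 0.6, 0.4]
    decoys = [0.7, 0.3]
    q = q_values(targets, decoys)
    accepted = [s for s in targets if q[s] < 0.05]
    print(q)
    print(accepted)
```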
All glycomics and glycoproteomics data analysis, including database generation, matching, and FDR estimation, was performed using components of a prototype GlycReSoft data analysis pipeline, developed by Joshua A. Klein.
MeanCoverage(Glycopeptide, MatchedFragments):
    bIonSeries := {0 for i in [1, Length(Glycopeptide)]}
    yIonSeries := {0 for i in [1, Length(Glycopeptide)]}
    For each Fragment in MatchedFragments:
        If IonSeries(Fragment) == b:
            bIonSeries[SequenceIndex(Fragment)] = 1
        Else If IonSeries(Fragment) == y:
            yIonSeries[SequenceIndex(Fragment)] = 1
    Coverage := bIonSeries + Reversed(yIonSeries)
    MeanCoverage := Sum({log2(1 + Coverage[i]) / (log2(1.5) * Size(Glycopeptide)) for i in [1, Length(Glycopeptide)]})
    Return MeanCoverage
MeanHexNAcCoverage(Glycopeptide, MatchedFragments):
    bSeriesObserved := 0
    ySeriesObserved := 0
    For each Fragment in MatchedFragments:
        If IonSeries(Fragment) == b && Glycosylated(Fragment):
            bSeriesObserved += 1
        Else If IonSeries(Fragment) == y && Glycosylated(Fragment):
            ySeriesObserved += 1
    bSeriesEnumerated := Length(TheoreticalGlycopeptideSeries(Glycopeptide, b))
    ySeriesEnumerated := Length(TheoreticalGlycopeptideSeries(Glycopeptide, y))
    MeanHexNAcCoverage := (bSeriesObserved + ySeriesObserved) / (bSeriesEnumerated + ySeriesEnumerated)
    Return MeanHexNAcCoverage
MS2Score(Glycopeptide, MatchedFragments):
    MeanCoverage := MeanCoverage(Glycopeptide, MatchedFragments)
    MeanHexNAcCoverage := MeanHexNAcCoverage(Glycopeptide, MatchedFragments)
    BackboneScore := (MeanCoverage * BackBoneWeight) + (MeanHexNAcCoverage * GlycosylatedWeight)
    BackboneFactor := 1 - StubIonWeight
    StubIonScore := Min(CountStubIonsObserved(MatchedFragments), 3) / 3.0
    MS2Score := (BackboneScore * BackboneFactor) + (StubIonScore * StubIonWeight)
    Return MS2Score
Method 2: Glycopeptide MS2 scoring scheme
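The way the component scores in Method 2 are combined can be illustrated with a short sketch. It is written under several assumptions that are not taken from the text: the log-scaled mean coverage is passed in as an already-computed value in [0, 1] and is not recomputed here, the stub-ion term is capped as min(count, 3) / 3 so that it stays within [0, 1], and the default weights are placeholders chosen only for the example. All function names are hypothetical.

```python
def coverage_vector(length, b_indices, y_indices):
    """Per-position count (0, 1 or 2) of b and y evidence, using 1-based indices.

    The y series is reversed so both series are indexed from the N-terminus.
    """
    b_series = [0] * length
    y_series = [0] * length
    for i in b_indices:
        b_series[i - 1] = 1
    for i in y_indices:
        y_series[i - 1] = 1
    return [b + y for b, y in zip(b_series, reversed(y_series))]


def hexnac_coverage(observed_b, observed_y, enumerated_b, enumerated_y):
    """Fraction of the enumerated HexNAc-bearing b/y fragments that were observed."""
    enumerated = enumerated_b + enumerated_y
    return (observed_b + observed_y) / enumerated if enumerated else 0.0


def ms2_score(mean_coverage, mean_hexnac_coverage, stub_ions_observed,
              backbone_weight=0.5, glycosylated_weight=0.5, stub_ion_weight=0.2):
    """Weighted combination of backbone, glycosylated-fragment and stub-ion evidence."""
    backbone_score = (mean_coverage * backbone_weight
                      + mean_hexnac_coverage * glycosylated_weight)
    backbone_factor = 1.0 - stub_ion_weight
    stub_ion_score = min(stub_ions_observed, 3) / 3.0  # assumed cap at three stub ions
    return backbone_score * backbone_factor + stub_ion_score * stub_ion_weight


if __name__ == "__main__":
    print(coverage_vector(5, {1, 2}, {1}))
    print(ms2_score(0.6, hexnac_coverage(2, 1, 3, 3), 2))
```

With these placeholder weights, a match with mean coverage 0.6, HexNAc coverage 0.5 and two stub ions scores (0.3 + 0.25) × 0.8 + (2/3) × 0.2 ≈ 0.573.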
We implemented a false discovery rate (FDR) estimate using the target-decoy method [139,635] with a reversed-target-sequence decoy database, comparing MS2Score between target and decoy. In order to preserve as much biological context as possible, the N-glycan sequon in each decoy sequence retained its original orientation.
For example, a target sequence YPVLN(HexNAc)VTMPNNGK becomes GNNPMN(HexNAc)VTLVPYK. This produces deterministic, if reduced-entropy, decoys, which is desirable for reproducibility [197]. To account for the assumption that the majority of the sequences in even a small combinatorial database are unlikely to be present, we used a “percent incorrect target” adjustment to the target-decoy ratio calculation and also reported a q-value for each glycopeptide match as described by Käll and colleagues [281].
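The decoy construction in the example above can be reproduced with a small sketch. The rule used here is inferred from that single example and is therefore an assumption: every residue except the C-terminal one is reversed, and each N-X-S/T sequon (X not P) is then flipped back to its forward orientation. Modification annotations such as (HexNAc) are left out, overlapping sequons are not handled, and the name `make_decoy` is hypothetical.

```python
import re

# After reversal, a forward sequon N-X-S/T reads as S/T-X-N.
REVERSED_SEQUON = re.compile(r"[ST][^P]N")


def make_decoy(peptide):
    """Reverse a peptide, keeping the C-terminal residue and sequon orientation."""
    body, c_terminal = peptide[:-1], peptide[-1]
    reversed_body = body[::-1]
    # Flip each reversed sequon back so the glycosylation motif stays valid.
    restored = REVERSED_SEQUON.sub(lambda m: m.group(0)[::-1], reversed_body)
    return restored + c_terminal


if __name__ == "__main__":
    print(make_decoy("YPVLNVTMPNNGK"))
```

Because the transformation has no random component, running it twice on the same target database yields the same decoy database.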