Alignment and Preprocessing for Data Analysis
• Preprocessing tools for chromatography
• Basics of alignment
• GC‐FID (1D) data and issues
– PCA – F‐Ratios
• GC‐MS (2D) data and issues
– PCA – F‐Ratios – PARAFAC
• Piecewise Alignment GUI (available online)
– synoveclab.chem.washington.edu/Downloads.htm – Email for username/password
Tools for Analysis: Classification
• Principal Component Analysis (PCA)
• Degree of Class Separation (DCS)
, 2 2
A B
A B
DCS D
s s
2 2
, ( ) ( )
A B XA XB YA YB
D
DCS
Score PC2
Score PC1
DCS PCA Loadings`
Retention Time Retention Time
Signal Score PC2
Score PC1 Scores
X + Error
Why Align?
• Reduction in classification
– PCA
• Increase in uncertainty for quantification
– PARAFAC
• Misalignment occurs frequently
– Daily instrument variation causes misalignment – Correction is necessary to apply these methods
Retention Time Precision & PCA
DCS=32
DCS=3 PCA
PCA 10 Replicates with
limited shifting
10 Replicates with large shifting
Loadings
Retention Time
Retention Time Retention Time
Retention Time
SignalSignal __ Score PC2
Score PC1 Score PC1
Score PC2
Loadings
Scores
Scores
Basics of Alignment
• Types of alignment algorithms
– Cross correlation coefficient
– Correlation Optimized Warping (COW) – *Piecewise alignment
• Alignment Parameters
– Window Size – Shift
• Target Selection
– PCA
– Correlation Coefficient – Windowed Target
Alignment Algorithms
• Cross correlation coefficient
– Move the entire chromatogram to maximize correlation
• Correlation Optimized Warping (COW)
– Separate the chromatogram into windows
– Warp and move the windows to optimize the correlation – Find the best alignment path to correct the data
• *Piecewise alignment
– Separate the chromatogram into windows – Shift the windows to optimize the correlation – Find the best alignment path to correct the data
Alignment Parameters
Parameter
Window Size, W
Shift, L
Description Effect
Too Small Relative movement high, difficult to determine
quality of alignment
Insufficient movement of
segments
Effect Too Large
Insufficient flexibility to correct peak
to peak shifting
Increases time
Window Size ~1 pk width
0<shift<=maximum shift
Determining correct
values
Alignment Metric
Alignment Metric
L tR
Target Selection
• *Global Approach
– All chromatograms are initially collected – *User chosen target
– PCA optimized target
• Scores are produced for every sample
• The sample with the minimum distance from the center is the target – *Maximum correlation target
• Calculate the product of each chromatograms correlation to the others
• The maximum correlation is the target
• Online Approach
– An initial target is set
– Chromatograms are aligned as they are collected
– The target changes as new chromatograms are collected
Target Selection (Maximum Correlation)
0 2 4 6 8 10 12
0 10 20 30 40 50 60
8.05 8.1 8.15 8.2
0 1 2 3 4 5 6 7 8 9 10
0 5 10 15 20
0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55
0.6 Sample # 9 is the target
8.05 8.1 8.15 8.2
0 2 4 6 8 10
Type 1 Type 2 Type 3
Target Type 1 Type 2 Type 3
Retention Time, min
Signal
Retention Time, min
Signal
Sample #
Product Correlation
Alignment of GC‐FID (1D) Data and Issues
• Removal of artifacts and solvent peaks
• Baseline correction and normalization
• Alignment
• Improving PCA
– F‐Ratio
Preprocessing Tools for Chromatography
• Noise filtering
– Median filter
• Baseline correction
• Normalization
• Alignment
Experimental
• GC‐FID Separation
• 4 diesel sample types
• 9 minute separation
• 17 replicate injections over 5 days
7 7.2 7.4 7.6 7.8 0.4
0.6 0.8 1 1.2 1.4 1.6 1.8 2
Noise Filtering (Solvent, Spikes, and Outliers)
2 4 6 8 10
0 2 4 6 8 10 12 14
7 7.2 7.4 7.6 7.8
0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
Median Filtered Data (minimize spike)
Retention Time, min
Signal
GC-FID Separation
SignalSignal
Retention Time, min
Noise Spike
Noise Spike
Noise Filtering (Solvent, Spikes, and Outliers)
0 2 4 6 8
0 2 4 6 8 10 12
0.8 1 1.2 1.4 1.6 1.8 2 2.2
x 109 -2.5
-2 -1.5 -1 -0.5 0 0.5 1 1.5x 108
Outlier (low signal)
Solvent Dominated Info
Retention Time, min
Signal
GC-FID Separation
PC1
PC2
Scores Plot Solvent
Noise Filtering (Solvent, Spikes, and Outliers)
0 2 4 6 8 10
0 2 4 6 8 10 12 14
1.5 2 2.5 3 3.5 4 4.5 5
x 107 -3
-2 -1 0 1 2 3 4 5x 106
Outlier (Low Signal)
Outlier Low Signal
Retention Time, min
Signal
GC-FID Separation
PC1
PC2
Scores Plot
0 2 4 6 8 10 0
2 4 6 8 10 12 14
Noise Filtering (Solvent, Spikes, and Outliers)
No Outlier
3.4 3.6 3.8 4 4.2 4.4 4.6
x 107 -3
-2 -1 0 1 2 3 4 5x 106
No Outlier (Baseline / Normalization on PC1)
Retention Time, min
Signal
GC-FID Separation
PC1
PC2
Scores Plot
Noise Filtering (Solvent, Spikes, and Outliers)
6.2 6.3 6.4 6.5 6.6 6.7 6.8
x 10-3 -4
-2 0 2 4 6 8x 10-4
0 2 4 6 8 10
-0.5 0 0.5 1 1.5 2 2.5x 10-9
Tight Clustering in PCA
Baseline Corrected and Normalized Data
Subtract offset line from data Then divide by total signal
Retention Time, min
Normalized Signal
PC1
PC2
Scores Plot
Experimental
• GC Separation
• 3 gasoline sample types
• 15 minute separation
• 6 replicate injections over 2 days
• Misalignment is due to day to day instrument variation
Alignment GUI
Initially blank to guide the user through alignment
Load from workspace
Load from file
Alignment GUI
Zoom & Pan
Choose or Estimate Target Choose or Estimate Parameters
Load & View Data Sizes
Align Data
Alignment GUI
Now save data to workspace or .mat file
Data after alignment
Data before alignment Estimated
target and parameters
Alignment GUI
PCA Classification of Gasoline
0.038-5 0.0385 0.039 0.0395 0.04 0.0405 0.041 -4
-3 -2 -1 0 1 2 3x 10-3
8.05 8.1 8.15 8.2 8.25 8.3 8.35
0 0.5 1 1.5 2 2.5 3
x 10-8
We want better separation of groups
Retention Time, min
Signal
PC1
PC2
Scores Plot
Sample 1
Sample 2
Col 1
Sample 3
GC‐MS Separations
Fisher Ratio Method (F‐Ratio)
Works well for samples that have large amounts of within class variance
Works best when comparing a small number of sample classes
For each mass channel calculate Fisher Ratio at each point in 2D space,
Fisher Ratio =
2 2
b t w c l a s s w i t h i n c l a s s
Fisher Ratios
Col 1
0 5 10 15 0
20 40 60 80 100 120 140 160 180 200
0 5 10 15
0 100 200 300 400 500 600 700 800
Fisher Ratio Method (F‐Ratio)
8.05 8.1 8.15 8.2 8.25 8.3 8.35
0 0.5 1 1.5 2 2.5 3
x 10-8 No alignment
After alignment
Select Top Ten
Retention Time, min
Signal
Selected Chromatographic Region
Retention Time, min
F-Ratio
Retention Time, min
F-Ratio
0 5 10 15 -1
0 1 2 3 4 5 6 7x 10-8
Fisher Ratio Method (F‐Ratio)
5 6 7 8 9 10
x 10-3 -1
-0.5 0 0.5 1 1.5 2 2.5x 10-3
Retention Time, min
Signal
PC1
PC2
Scores Plot Regions of interest
Tight clustering of sample types
Summary
• Removal of solvents or artifacts is essential
• Baseline correction is an important step
• Alignment is essential for improving classification
• The F‐Ratio algorithm can further improve classification
Alignment of GC‐MS (2D) Data and Issues
• Removal of artifacts and solvent peaks
• Baseline correction and normalization
• Alignment
• Improving PCA
– F‐Ratio
• PARAFAC
Experimental
• GC‐MS Separation
• 3 gasoline sample types
• 15 minute separation
• 6 replicate injections over 2 days
• Misalignment is due to day to day instrument variation
Alignment GUI
Initially blank to guide the user through alignment
Load from workspace
Load from file
Alignment GUI
Zoom & Pan
Choose or Estimate Target Choose or Estimate Parameters
Load & View Data Sizes
Align Data
Alignment GUI
Now save data to workspace or .mat file
Data after alignment
Data before alignment Estimated
target and parameters
Alignment GUI
0 2 4 5
10 15 20
Alignment
TIC
Retention Time
Retention Time
Shift
Alignment
Total Ion Current Shift Function (TIC‐SF)
Sample Target
0 2 4 2
4 6 8 10 12
Retention Time
Shift
Total Ion Current Shift Function (TIC‐SF)
GC‐MS Data
Retention Time
Apply alignment
0 2 4
2 4 6 8 10 12
Retention Time
Sample Target
GC‐MS Data
Alignment
7.1 7.15 7.2 7.25 7.3 0
0.5 1 1.5 2 2.5
x 10-8
7.1 7.15 7.2 7.25 7.3
0 0.5 1 1.5 2 2.5
x 10-8
0 5 10 15
-1 0 1 2 3 4 5 6 7x 10-8
Align
Alignment
Retention Time, min
TIC Signal TIC Signal
Retention Time, min
TIC Signal
7.05 7.1 7.15 7.2 7.25 0
0.5 1 1.5 2 2.5 3
x 10-8
0.013-2 0.0135 0.014 0.0145 0.015
-1.5 -1 -0.5 0 0.5 1 1.5x 10-3
PCA w/ full MS
Classification of Full MS Data
Retention Time, min
TIC Signal
PC1
PC2
Scores Plot
Sample 1 m/z Col 1
m/z Col 1
Sample 2
m/z Col 1
Sample 3
Fisher Ratios for Samples
m/z Col 1
GC‐MS Separations
Fisher Ratio Method
Works well for samples that have large amounts of within class variance
Works best when comparing a small number of sample classes
For each mass channel calculate Fisher Ratio at each point in 2D space,
Fisher Ratio =
2 2
b t w c l a s s w i t h i n c l a s s
Sum Fisher Ratios along mass channel dimension
Col 1
Simulated Example Using F‐Ratios for PCA
PCA
W = 400
L = 1000 Align
Retention Time
m/z
Scores 2D F‐Ratios
Shows differences between all samples
TIC TIC
F-Ratios GC-MS Data
4.8 5 5.2 5.4 5.6 40
50 60 70 80 90 100
100 200 300 400 500 600 700
0 5 10 15
40 60 80 100 120 140
100 200 300 400 500 600 700
0 5 10 15
-1 0 1 2 3 4 5 6 7x 10-8
F‐Ratios
Top Fisher Ratios
GC‐MS Fisher Ratios GC‐MS Fisher Ratios GC‐MS Separation
Retention Time, min
TIC Signal
Procedure
• Select masses using mass spectral information
• Select time regions using F‐Ratios
• Combine to reduce the data set
• Improve results of PCA
0 5 10 15 -1
0 1 2 3 4 5 6 7x 10-8
Classification of Full MS Data
1.8 1.9 2 2.1 2.2 2.3 2.4
x 10-3 -2.5
-2 -1.5 -1 -0.5 0 0.5 1 1.5x 10-4
Clear Separation on 1st PC
PC1
PC2
Scores Plot of Nearest Samples
Retention Time, min
Signal
Regions of interest (with MS values)
Quantification of GC‐MS Data
• Use aligned, baseline corrected and normalized data
• Use PARAFAC of small regions for analysis
– Match values to mass spectra – Peak Sums
– Peak Profiles
Quantification
• Target analyte Parallel Factor Analysis (PARAFAC) isolates the pure component peak and mass spectral information from overlapping peaks and background for both identification and quantification.
Data Matrix = Analyte 1 + Analyte 2 + ... Noise
x x
+ +…+
=
1 2
y1 y2
z1 z2
R E
Compound : malate, TMS
# of factors: 3 Chosen comp.: #3 Match value : 948 Peak sum : 4413882 Peak height : 376188
PARAFAC Results
Col. 1 Time Col. 2 Time
Intensity Intensity
1.0 1.2 1.4 1.6 0
2 4 6 8
PARAFAC for GC‐MS Data
0 2 4
2 4 6 8 10 12 14
0 2 4
2 4 6 8 10 12
Stack Samples
Retention Time
Sample
Run PARAFAC
Peak Profile
Mass Spectrum
Analyte Concentration
Sample 1
Sample 2
Sample 1
Sample 2
Experimental
• GC‐MS Separation
• 3 gasoline sample types
• 15 minute separation
• 6 replicate injections over 2 days
• Misalignment is due to day to day instrument variation
• Looking for isobutyl benzene in the gasoline
Isobutyl Benzene Spectrum
20 40 60 80 100 120 140 160
0 10 20 30 40 50 60 70 80 90 100
Mass
Signal
7.040 7.06 7.08 7.1 7.12 7.14 7.16 7.18 1
2 3 4 5 6x 10-9
0 5 10 15 20 25 30 35
0 1 2 3 4 5 6 7 8 9x 10-3
0 5 10 15 20
0 0.5 1 1.5 2 2.5 3 3.5x 10-3
20 40 60 80 100 120 140 160
0 1 2 3 4 5 6 7x 10-5
PARAFAC Results for GC‐MS Data
Retention Time, min
TIC Signal
Quantitative Info
Qualitative Info
Qualitative Info
Mass Spectrum
Peak Profile
Peak Sum
Selected Analysis Region for PARAFAC 1
2 3
1
2 3
1
20 40 60 80 100 120 140 160 0
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
20 40 60 80 100 120 140 160
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
20 40 60 80 100 120 140 160
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
MVIBB=403 MVIBB=788 MVIBB=910
PARAFAC Results for GC‐MS Data
Component 1 Component 2 Component 3
Best match to isobutyl benzene
Summary
• Mass spectral chromatographic data should be aligned
• An available alignment algorithm is able to align 1‐D and 2‐D data
• F‐Ratio methods can be used to improve classification
• PARAFAC can be effectively used to separate overlapped analytes