Alignment and Preprocessing for Data Analysis


Academic year: 2022


(1)


Preprocessing tools for chromatography

Basics of alignment

GC‐FID (1D) data and issues

PCA, F‐Ratios

GC‐MS (2D) data and issues

PCA, F‐Ratios, PARAFAC

Piecewise Alignment GUI (available online)

synoveclab.chem.washington.edu/Downloads.htm (email for username/password)

(2)

Tools for Analysis: Classification

Principal Component Analysis (PCA)

Degree of Class Separation (DCS)

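DCS_{A,B} = \frac{D_{A,B}}{\sqrt{s_A^2 + s_B^2}}

D_{A,B} = \sqrt{(X_A - X_B)^2 + (Y_A - Y_B)^2}

[Figure: PCA decomposes the data X (signal vs. retention time) into scores (Score PC2 vs. Score PC1) and loadings (vs. retention time), X = Scores × Loadings' + Error. The DCS is measured between classes A and B on the scores plot.]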
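As a rough illustration of the degree of class separation, the sketch below computes a DCS value from the PC1/PC2 scores of two classes. It assumes that D is the Euclidean distance between the class centroids on the scores plot and that each s² is the sample variance of a class summed over PC1 and PC2; the slide does not spell out either choice, and the function name `dcs` is hypothetical.

```python
import math
from statistics import mean, variance

def dcs(scores_a, scores_b):
    """Degree of class separation for two classes of (PC1, PC2) scores."""
    xa, ya = zip(*scores_a)
    xb, yb = zip(*scores_b)
    # Distance between the class centroids on the scores plot
    d = math.hypot(mean(xa) - mean(xb), mean(ya) - mean(yb))
    # Assumed spread: sample variance along PC1 plus PC2 for each class
    var_a = variance(xa) + variance(ya)
    var_b = variance(xb) + variance(yb)
    return d / math.sqrt(var_a + var_b)

# Two tight, well separated clusters of replicate scores (made-up numbers)
class_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
class_b = [(10.0, 0.0), (11.0, 0.0), (10.0, 1.0), (11.0, 1.0)]
print(round(dcs(class_a, class_b), 2))
```

A larger value means the classes sit further apart relative to their scatter, which is how the DCS values on the later slides are read.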

(3)

Why Align?

• Misalignment reduces classification performance

PCA

• Misalignment increases uncertainty in quantification

PARAFAC

• Misalignment occurs frequently

Daily instrument variation causes misalignment

Correction is necessary to apply these methods

(4)

Retention Time Precision & PCA

[Figure: two sets of overlaid replicate chromatograms (signal vs. retention time), each with its PCA loadings (vs. retention time) and scores plot (Score PC2 vs. Score PC1).]

10 replicates with limited shifting: DCS = 32

10 replicates with large shifting: DCS = 3

(5)

Basics of Alignment

Types of alignment algorithms

Cross correlation coefficient

Correlation Optimized Warping (COW)

*Piecewise alignment

Alignment Parameters

Window Size

Shift

Target Selection

PCA

Correlation Coefficient

Windowed Target

(6)

Alignment Algorithms

Cross correlation coefficient

Move the entire chromatogram to maximize correlation

Correlation Optimized Warping (COW)

Separate the chromatogram into windows

Warp and move the windows to optimize the correlation

Find the best alignment path to correct the data

*Piecewise alignment

Separate the chromatogram into windows

Shift the windows to optimize the correlation

Find the best alignment path to correct the data
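The difference between moving the entire chromatogram and shifting individual windows can be sketched in a few lines of Python. This is a simplified illustration, not the lab's piecewise alignment code: each window is moved by its own best integer shift, and the step that finds a smooth alignment path between windows is left out. All function names are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0 if either is flat)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def moved(signal, start, stop, shift):
    """Points start..stop-1 after moving the signal `shift` points later (edges clamped)."""
    last = len(signal) - 1
    return [signal[min(max(i - shift, 0), last)] for i in range(start, stop)]

def best_shift(sample, target, start, stop, max_shift):
    """Shift in [-max_shift, max_shift] that maximizes correlation with the target window."""
    ref = target[start:stop]
    # Ties are broken in favor of the smallest movement
    return max(range(-max_shift, max_shift + 1),
               key=lambda s: (pearson(moved(sample, start, stop, s), ref), -abs(s)))

def cross_correlation_align(sample, target, max_shift):
    """Move the entire chromatogram by one shift."""
    s = best_shift(sample, target, 0, len(sample), max_shift)
    return moved(sample, 0, len(sample), s), s

def piecewise_align(sample, target, window_size, max_shift):
    """Shift each window independently (simplification: no path smoothing)."""
    aligned, shifts = [], []
    for start in range(0, len(sample), window_size):
        stop = min(start + window_size, len(sample))
        s = best_shift(sample, target, start, stop, max_shift)
        shifts.append(s)
        aligned.extend(moved(sample, start, stop, s))
    return aligned, shifts

def peaks(centers, n=100, sigma=3.0):
    """Synthetic chromatogram: one Gaussian peak per center."""
    return [sum(math.exp(-0.5 * ((i - c) / sigma) ** 2) for c in centers) for i in range(n)]

target = peaks([30, 70])
sample = peaks([32, 67])  # first peak elutes late, second elutes early
aligned, shifts = piecewise_align(sample, target, window_size=50, max_shift=5)
print(shifts)
```

Because the two peaks in the synthetic sample move in opposite directions, no single shift of the whole chromatogram can correct both, while the windowed version can.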

(7)

Alignment Parameters

Parameter: Window Size, W

Description: window size ~1 peak width

Effect if too small: relative movement is high, making the quality of alignment difficult to determine

Effect if too large: insufficient flexibility to correct peak-to-peak shifting

Determining correct values: alignment metric

Parameter: Shift, L

Description: 0 < shift <= maximum shift

Effect if too small: insufficient movement of segments

Effect if too large: increases computation time

Determining correct values: alignment metric; L chosen relative to the retention time, tR

(8)

Target Selection

*Global Approach

All chromatograms are initially collected

*User chosen target

PCA optimized target

Scores are produced for every sample

The sample with the minimum distance from the center is the target

*Maximum correlation target

Calculate the product of each chromatogram's correlations to the others

The chromatogram with the maximum product is the target

Online Approach

An initial target is set

Chromatograms are aligned as they are collected

The target changes as new chromatograms are collected
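The maximum correlation target from the global approach can be sketched as follows. The example assumes Pearson correlation between whole chromatograms and uses made-up single-peak data; `max_correlation_target` is a hypothetical name, not a function from the alignment GUI.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def max_correlation_target(chromatograms):
    """Index of the chromatogram with the largest product of correlations to all others."""
    products = []
    for i, ci in enumerate(chromatograms):
        product = 1.0
        for j, cj in enumerate(chromatograms):
            if i != j:
                product *= pearson(ci, cj)
        products.append(product)
    return products.index(max(products)), products

def peak(center, n=100, sigma=3.0):
    """Synthetic chromatogram with a single Gaussian peak."""
    return [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(n)]

# Three injections whose peak drifts from point 48 to point 52
chromatograms = [peak(48), peak(50), peak(52)]
target_index, products = max_correlation_target(chromatograms)
print(target_index)
```

The chromatogram in the middle of the drift correlates well with both neighbors, so it has the largest product and becomes the target.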

(9)

Target Selection (Maximum Correlation)

[Figure: overlaid chromatograms of Type 1, Type 2, and Type 3 samples with the target marked (signal vs. retention time, min), and the product correlation for each sample #.]

Sample # 9 is the target

(10)

Alignment of GC‐FID (1D) Data and Issues

• Removal of artifacts and solvent peaks

• Baseline correction and normalization

• Alignment

• Improving PCA

F‐Ratio

(11)

Preprocessing Tools for Chromatography

• Noise filtering

Median filter

• Baseline correction

• Normalization

• Alignment
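The median filter listed under noise filtering can be illustrated with a short sketch. The window width of 3 points and the truncated windows at the edges are choices made for this example only.

```python
from statistics import median

def median_filter(signal, width=3):
    """Replace each point by the median of a centered window (truncated at the edges)."""
    half = width // 2
    return [median(signal[max(i - half, 0): i + half + 1]) for i in range(len(signal))]

noisy = [1.0, 1.1, 1.0, 9.0, 1.2, 1.1, 1.0]  # single-point noise spike at index 3
filtered = median_filter(noisy)
print(filtered)
```

A single-point spike never becomes the median of its window, so it is removed, while a real peak that spans several points is largely preserved as long as the window is narrower than the peak.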

(12)

Experimental

• GC‐FID Separation

• 4 diesel sample types

• 9 minute separation

• 17 replicate injections over 5 days

(13)

Noise Filtering (Solvent, Spikes, and Outliers)

[Figure: GC-FID separation (signal vs. retention time, min) with a noise spike, and the median filtered data, in which the spike is minimized.]

(14)

Noise Filtering (Solvent, Spikes, and Outliers)

[Figure: GC-FID separation including the solvent (signal vs. retention time, min), and PCA scores plot (PC2 vs. PC1). The information is solvent dominated, and one sample is an outlier (low signal).]

(15)

Noise Filtering (Solvent, Spikes, and Outliers)

[Figure: GC-FID separation (signal vs. retention time, min), and PCA scores plot (PC2 vs. PC1) with the low-signal outlier marked.]

(16)

Noise Filtering (Solvent, Spikes, and Outliers)

[Figure: GC-FID separation with no outlier (signal vs. retention time, min), and PCA scores plot (PC2 vs. PC1).]

No outlier (baseline / normalization on PC1)

(17)

Noise Filtering (Solvent, Spikes, and Outliers)

[Figure: baseline corrected and normalized data (normalized signal vs. retention time, min), and PCA scores plot (PC2 vs. PC1) showing tight clustering.]

Baseline Corrected and Normalized Data

Subtract offset line from data, then divide by total signal
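The two steps named on this slide, subtracting an offset line and then dividing by the total signal, might look like this in Python. The slide does not say how the offset line is chosen, so this sketch assumes a straight line drawn from the first to the last data point; the function name is hypothetical.

```python
def baseline_correct_and_normalize(signal):
    """Subtract a straight offset line, then divide by the total signal."""
    n = len(signal)
    first, last = signal[0], signal[-1]
    # Assumed offset line: straight line from the first point to the last point
    corrected = [y - (first + (last - first) * i / (n - 1)) for i, y in enumerate(signal)]
    total = sum(corrected)
    return [y / total for y in corrected]

# One peak sitting on a baseline that drifts from 2.0 to 4.0
raw = [2.0, 2.5, 6.0, 3.5, 4.0]
normalized = baseline_correct_and_normalize(raw)
print(normalized)
```

After this step every chromatogram sums to 1, so differences in injected amount or overall detector response no longer dominate the first principal component.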

(18)

Experimental

• GC Separation

• 3 gasoline sample types

• 15 minute separation

• 6 replicate injections over 2 days

• Misalignment is due to day-to-day instrument variation

(19)

Alignment GUI

Initially blank to guide the user through alignment

(20)

Alignment GUI

Load from workspace

Load from file

(21)

Alignment GUI

Zoom & Pan

Choose or Estimate Target

Choose or Estimate Parameters

Load & View Data Sizes

Align Data

(22)

Alignment GUI

Now save data to workspace or .mat file

Data after alignment

Data before alignment

Estimated target and parameters

(23)

PCA Classification of Gasoline

[Figure: gasoline chromatograms (signal vs. retention time, min), and PCA scores plot (PC2 vs. PC1).]

We want better separation of groups

(24)

Fisher Ratio Method (F‐Ratio)

Works well for samples that have large amounts of within-class variance

Works best when comparing a small number of sample classes

For each mass channel, calculate the Fisher Ratio at each point in 2D space:

Fisher Ratio = \frac{\sigma^2_{\text{btw class}}}{\sigma^2_{\text{within class}}}

[Figure: GC‐MS separations of Sample 1, Sample 2, and Sample 3 (along Col 1) are combined into a plot of Fisher Ratios (along Col 1).]
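A point-by-point Fisher ratio for 1D chromatograms could be computed as in the sketch below. It assumes the usual one-way ANOVA form of the two variances (between-class mean square over within-class mean square); the slide only gives the ratio itself, and the function names are hypothetical.

```python
def fisher_ratio(values_by_class):
    """Between-class variance over within-class variance for one data point."""
    k = len(values_by_class)
    n = sum(len(v) for v in values_by_class)
    grand = sum(sum(v) for v in values_by_class) / n
    means = [sum(v) / len(v) for v in values_by_class]
    between = sum(len(v) * (m - grand) ** 2
                  for v, m in zip(values_by_class, means)) / (k - 1)
    within = sum((x - m) ** 2
                 for v, m in zip(values_by_class, means) for x in v) / (n - k)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within

def fisher_ratio_profile(classes):
    """classes[c][r][t]: class c, replicate r, time point t -> F-ratio per time point."""
    n_points = len(classes[0][0])
    return [fisher_ratio([[rep[t] for rep in cls] for cls in classes])
            for t in range(n_points)]

# Two classes, three replicates each, three retention-time points (made-up data);
# only the middle point differs between the classes
class_a = [[1.0, 5.0, 2.0], [1.1, 5.2, 2.1], [0.9, 4.8, 1.9]]
class_b = [[1.0, 9.0, 2.0], [1.1, 9.2, 2.1], [0.9, 8.8, 1.9]]
profile = fisher_ratio_profile([class_a, class_b])
print(profile.index(max(profile)))
```

Points where the classes differ by much more than the replicates scatter get a large F-ratio, which is why misalignment (extra within-class variance) lowers the values.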

(25)

Fisher Ratio Method (F‐Ratio)

[Figure: selected chromatographic region (signal vs. retention time, min), and F-ratio vs. retention time (min) with no alignment and after alignment.]

Select Top Ten

(26)

Fisher Ratio Method (F‐Ratio)

[Figure: chromatogram with the regions of interest marked (signal vs. retention time, min), and PCA scores plot (PC2 vs. PC1).]

Tight clustering of sample types

(27)

Summary

Removal of solvents or artifacts is essential

Baseline correction is an important step

Alignment is essential for improving classification

The F‐Ratio algorithm can further improve classification

(28)

Alignment of GC‐MS (2D) Data and Issues

• Removal of artifacts and solvent peaks

• Baseline correction and normalization

• Alignment

• Improving PCA

F‐Ratio

• PARAFAC

(29)

Experimental

• GC‐MS Separation

• 3 gasoline sample types

• 15 minute separation

• 6 replicate injections over 2 days

• Misalignment is due to day-to-day instrument variation

(30)

Alignment GUI

Initially blank to guide the user through alignment

(31)

Alignment GUI

Load from workspace

Load from file

(32)

Alignment GUI

Zoom & Pan

Choose or Estimate Target

Choose or Estimate Parameters

Load & View Data Sizes

Align Data

(33)

Alignment GUI

Now save data to workspace or .mat file

Data after alignment

Data before alignment

Estimated target and parameters

(34)

Total Ion Current Shift Function (TIC‐SF)

[Figure: TIC of the sample and the target vs. retention time; aligning the two gives the shift as a function of retention time.]

(35)

Total Ion Current Shift Function (TIC‐SF)

[Figure: the shift function (shift vs. retention time) found from the TIC is applied to the sample GC‐MS data, aligning it to the target GC‐MS data.]
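The idea behind the TIC shift function is to estimate the shifts once, on the total ion current, and then apply the same shifts to every mass channel. A minimal sketch, assuming integer shifts per window and Pearson correlation as the alignment criterion (the function names and the synthetic two-channel data are illustrative only):

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0 if either is flat)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def tic(data):
    """Total ion current: sum over all m/z channels at each time point."""
    return [sum(row) for row in data]

def shift_function(sample_tic, target_tic, window_size, max_shift):
    """Best shift per window (in points), estimated from the TIC only."""
    n, last = len(sample_tic), len(sample_tic) - 1
    shifts = []
    for start in range(0, n, window_size):
        stop = min(start + window_size, n)
        ref = target_tic[start:stop]

        def score(s):
            moved = [sample_tic[min(max(i - s, 0), last)] for i in range(start, stop)]
            return (pearson(moved, ref), -abs(s))

        shifts.append(max(range(-max_shift, max_shift + 1), key=score))
    return shifts

def apply_shift_function(data, shifts, window_size):
    """Move every m/z channel by the shifts that were found on the TIC."""
    last = len(data) - 1
    return [list(data[min(max(i - shifts[i // window_size], 0), last)])
            for i in range(len(data))]

def gc_ms(center, n_time=40, spectrum=(1.0, 0.5), sigma=2.0):
    """Synthetic data[t][m]: one Gaussian peak with a fixed two-channel spectrum."""
    return [[h * math.exp(-0.5 * ((t - center) / sigma) ** 2) for h in spectrum]
            for t in range(n_time)]

target = gc_ms(20)
sample = gc_ms(23)  # same analyte, eluting 3 points late
shifts = shift_function(tic(sample), tic(target), window_size=40, max_shift=5)
aligned = apply_shift_function(sample, shifts, window_size=40)
print(shifts)
```

Estimating the shift on the TIC keeps the mass spectrum at each time point intact, since all channels at that time point move together.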

(36)

Alignment

[Figure: TIC signal vs. retention time (min) for the full separation, with an expanded region shown before and after alignment.]

(37)

Classification of Full MS Data

[Figure: TIC signal vs. retention time (min), and scores plot (PC2 vs. PC1) from PCA with the full MS data.]

(38)

Fisher Ratio Method

Works well for samples that have large amounts of within-class variance

Works best when comparing a small number of sample classes

For each mass channel, calculate the Fisher Ratio at each point in 2D space:

Fisher Ratio = \frac{\sigma^2_{\text{btw class}}}{\sigma^2_{\text{within class}}}

Sum Fisher Ratios along mass channel dimension

[Figure: GC‐MS separations (m/z vs. Col 1) of Sample 1, Sample 2, and Sample 3 give the Fisher Ratios for samples (m/z vs. Col 1), which are then summed along the mass channel dimension to give a single trace along Col 1.]
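For GC‐MS data, the same ratio is computed at every (retention time, m/z) point and then summed along the mass channel dimension. The sketch below assumes the one-way ANOVA form of the variances and uses tiny made-up data; `summed_fisher_ratios` and `top_time_points` are hypothetical helper names.

```python
def fisher_ratio(values_by_class):
    """Between-class variance over within-class variance for one data point."""
    k = len(values_by_class)
    n = sum(len(v) for v in values_by_class)
    grand = sum(sum(v) for v in values_by_class) / n
    means = [sum(v) / len(v) for v in values_by_class]
    between = sum(len(v) * (m - grand) ** 2
                  for v, m in zip(values_by_class, means)) / (k - 1)
    within = sum((x - m) ** 2
                 for v, m in zip(values_by_class, means) for x in v) / (n - k)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within

def summed_fisher_ratios(classes):
    """classes[c][r][t][m] -> F-ratios summed over m/z at each retention-time point."""
    n_time = len(classes[0][0])
    n_mz = len(classes[0][0][0])
    return [sum(fisher_ratio([[rep[t][m] for rep in cls] for cls in classes])
                for m in range(n_mz))
            for t in range(n_time)]

def top_time_points(f_ratios, n_top):
    """Indices of the n_top largest summed F-ratios, in retention-time order."""
    ranked = sorted(range(len(f_ratios)), key=lambda t: f_ratios[t], reverse=True)
    return sorted(ranked[:n_top])

# Two classes, two replicates each, three time points, two m/z channels (made-up data)
class_a = [[[1.0, 2.0], [3.0, 1.0], [2.0, 2.0]],
           [[1.2, 2.1], [3.1, 1.1], [2.1, 1.9]]]
class_b = [[[1.1, 2.1], [6.0, 1.0], [2.0, 2.1]],
           [[1.0, 2.0], [6.2, 1.2], [2.1, 2.0]]]
summed = summed_fisher_ratios([class_a, class_b])
print(top_time_points(summed, 1))
```

Keeping only the time points with the largest summed F-ratios is one way to reduce the data set before PCA.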

(39)

Simulated Example Using F‐Ratios for PCA

Align: W = 400, L = 1000

[Figure: simulated GC‐MS data (m/z vs. retention time) with TIC traces, the 2D F‐Ratios computed after alignment, and the PCA scores.]

Shows differences between all samples

(40)

F‐Ratios

[Figure: GC‐MS separation (TIC signal vs. retention time, min), GC‐MS Fisher Ratios over the full separation, and an expanded view of the top Fisher Ratios.]

(41)

Procedure

• Select masses using mass spectral information

• Select time regions using F‐Ratios

• Combine to reduce the data set

• Improve results of PCA

(42)

Classification of Full MS Data

[Figure: chromatogram with the regions of interest (with MS values) marked (signal vs. retention time, min), and scores plot of nearest samples (PC2 vs. PC1).]

Clear separation on 1st PC

(43)

Quantification of GC‐MS Data

• Use aligned, baseline corrected, and normalized data

• Use PARAFAC of small regions for analysis

Match values to mass spectra

Peak Sums

Peak Profiles

(44)

Quantification

Target analyte Parallel Factor Analysis (PARAFAC) isolates the pure component peak and mass spectral information from overlapping peaks and background for both identification and quantification.

Data Matrix = Analyte 1 + Analyte 2 + ... + Noise

R = x_1 \otimes y_1 \otimes z_1 + x_2 \otimes y_2 \otimes z_2 + \dots + E

PARAFAC Results

Compound: malate, TMS

# of factors: 3

Chosen comp.: #3

Match value: 948

Peak sum: 4413882

Peak height: 376188

[Figure: PARAFAC peak profiles, intensity vs. Col. 1 time and intensity vs. Col. 2 time.]
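To make the trilinear model concrete, here is a small alternating least squares (ALS) sketch of PARAFAC in plain Python. It is an illustration under simplifying assumptions (random start, fixed number of iterations, no constraints or convergence test), not the implementation used for the results in these slides. The data are synthetic: two overlapping peaks with different spectra and different amounts in three samples.

```python
import math
import random

def solve(g, rhs):
    """Solve g x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(g)
    aug = [list(g[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def update(get, n_rows, p_fac, q_fac, rank):
    """Least-squares update of one factor matrix with the other two held fixed."""
    gram = [[sum(p[r] * p[s] for p in p_fac) * sum(q[r] * q[s] for q in q_fac)
             for s in range(rank)] for r in range(rank)]
    new = []
    for i in range(n_rows):
        rhs = [sum(get(i, a, b) * p_fac[a][r] * q_fac[b][r]
                   for a in range(len(p_fac)) for b in range(len(q_fac)))
               for r in range(rank)]
        new.append(solve(gram, rhs))
    return new

def parafac(x, rank, n_iter=100, seed=0):
    """PARAFAC of x[i][j][k] by ALS; returns factor matrices for the three modes."""
    rng = random.Random(seed)
    ni, nj, nk = len(x), len(x[0]), len(x[0][0])
    b = [[rng.random() for _ in range(rank)] for _ in range(nj)]
    c = [[rng.random() for _ in range(rank)] for _ in range(nk)]
    for _ in range(n_iter):
        a = update(lambda i, j, k: x[i][j][k], ni, b, c, rank)
        b = update(lambda j, i, k: x[i][j][k], nj, a, c, rank)
        c = update(lambda k, i, j: x[i][j][k], nk, a, b, rank)
    return a, b, c

def reconstruct(a, b, c):
    """Rebuild the data from the factors (sum of outer products)."""
    rank = len(a[0])
    return [[[sum(ai[r] * bj[r] * ck[r] for r in range(rank)) for ck in c]
             for bj in b] for ai in a]

def outer3(p, s, q):
    """One analyte: peak profile x mass spectrum x amount per sample."""
    return [[[pi * sj * qk for qk in q] for sj in s] for pi in p]

profile_1 = [math.exp(-0.5 * ((t - 6) / 1.5) ** 2) for t in range(15)]
profile_2 = [math.exp(-0.5 * ((t - 8) / 1.5) ** 2) for t in range(15)]
t1 = outer3(profile_1, [1.0, 0.2, 0.0, 0.5], [1.0, 2.0, 3.0])
t2 = outer3(profile_2, [0.1, 1.0, 0.6, 0.0], [2.0, 1.0, 0.5])
x = [[[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(m1, m2)]
     for m1, m2 in zip(t1, t2)]

# Modes: retention time (peak profiles), m/z (mass spectra), sample (amounts)
profiles, spectra, amounts = parafac(x, rank=2)
```

Each column of `profiles`, `spectra`, and `amounts` describes one component; in practice the recovered spectra would be matched against a library to decide which component is the analyte of interest.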

(45)

PARAFAC for GC‐MS Data

[Figure: GC‐MS data for Sample 1 and Sample 2 are stacked along the sample dimension. Running PARAFAC on the stack gives the peak profile (vs. retention time), the mass spectrum, and the analyte concentration in Sample 1 and Sample 2.]

(46)

Experimental

• GC‐MS Separation

• 3 gasoline sample types

• 15 minute separation

• 6 replicate injections over 2 days

• Misalignment is due to day-to-day instrument variation

• Looking for isobutyl benzene in the gasoline

(47)

Isobutyl Benzene Spectrum

[Figure: mass spectrum of isobutyl benzene, signal vs. mass.]

(48)

PARAFAC Results for GC‐MS Data

[Figure: selected analysis region for PARAFAC (TIC signal vs. retention time, min), with the results for components 1, 2, and 3: peak profile and mass spectrum (qualitative info) and peak sum (quantitative info).]

(49)

PARAFAC Results for GC‐MS Data

[Figure: mass spectra of the three PARAFAC components.]

Component 1: MVIBB = 403

Component 2: MVIBB = 788

Component 3: MVIBB = 910 (best match to isobutyl benzene)

(50)

Summary

• Mass spectral chromatographic data should be aligned

• An available alignment algorithm is able to align 1‐D and 2‐D data

• F‐Ratio methods can be used to improve classification

• PARAFAC can be effectively used to separate overlapped analytes
