Accurate deep learning off-target prediction with novel sgRNA-DNA sequence encoding in CRISPR-Cas9 gene editing

Robert Nadon

McGill University Montreal, QC, Canada

Jeremy Charlier

National Bank of Canada Montreal, QC, Canada

Vladimir Makarenkov

Context and Motivations

What is CRISPR-Cas9?

• CRISPR-Cas9 is a gene editing technique

• CRISPR = Clustered Regularly Interspaced Short Palindromic Repeats

• Cas9 is a protein capable of cutting DNA at specific locations

• Target sequence (20 bases long)

• PAM (Protospacer Adjacent Motif) sequence (3 bases long)

What is the current challenge in genome editing, and how is CRISPR-Cas9 useful?

• Predicting potential off-target mutations is crucial for clinical application

• CRISPOR database available at http://crispor.tefor.net/

• The database is expanding rapidly, requiring advanced analytics

• Off-target prediction can be addressed as a binary classification problem

• Validated off-targets classified as 1

• Non-validated off-targets classified as 0

Cleavage with CRISPR-Cas9 for gene editing (source: https://en.wikipedia.org/wiki/CRISPR_gene_editing)

Context and Motivations

Current state-of-the-art at the time of our submission [1]

• Take the categorical data A, G, C and T and convert them into 4 one-hot vectors:

• [1,0,0,0] for A

• [0,1,0,0] for G

• [0,0,1,0] for C

• [0,0,0,1] for T

• Encoding target DNA and guide RNA in a 4x23 matrix

• Use the encoded matrix to perform binary classification with DL models:

• 1 = validated off-targets

• 0 = non-validated off-targets

4x23 encoding a sgRNA-DNA sequence pair
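The 4x23 encoding listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the helper names, the two example sequences, and the element-wise OR used to merge the two one-hot matrices are assumptions made for this sketch.

```python
import numpy as np

BASES = "AGCT"  # channel order used on the slide: A, G, C, T


def one_hot(seq):
    """Return a 4 x len(seq) one-hot matrix (rows follow BASES)."""
    mat = np.zeros((4, len(seq)), dtype=np.int8)
    for col, base in enumerate(seq):
        mat[BASES.index(base), col] = 1
    return mat


def encode_4x23(guide, target):
    """Merge the two one-hot matrices with an element-wise OR (assumed rule)."""
    return one_hot(guide) | one_hot(target)


# Hypothetical 23-base sequences (20-base target site + 3-base PAM).
guide = "GAGTCCGAGCAGAAGAAGAAGGG"
target = "AGGTCCGAGCAGAAGAAGAAGGG"  # differs from the guide in the first 2 bases

enc = encode_4x23(guide, target)
print(enc.shape)   # (4, 23)
print(enc[:, :3])  # first three columns of the encoded pair
```

Under this assumed OR rule, `encode_4x23(guide, target)` and `encode_4x23(target, guide)` give the same matrix, so the matrix alone does not reveal which sequence carried which base at a mismatch.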

Context and Motivations

Problem: Target DNA and complementary guide RNA are encoded with 1 for identical nucleobases

Let us suppose we only have the encoded matrix

Surjective-only mapping leads to a loss of information

[Figure: the first 2 columns of an encoded 4x23 matrix; the target DNA and complementary guide RNA bases that produced them cannot be recovered unambiguously (marked "?")]

Our Proposed Contribution: A Novel 8x23 Encoding

Contribution:

• Replace the surjective-only mapping by a bijective mapping

• New encoded matrix is 8x23 instead of 4x23

• Use the 8x23 encoding matrix for training DL models to perform binary classification of off-targets

• Demonstrate the higher classification performance of the 8x23 encoding vs. the 4x23 encoding

Bijective mapping between X and Y

Encoded 8x23 sequence
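One way to obtain a bijective 8x23 encoding is to stack the two one-hot matrices instead of merging them. The sketch below assumes that layout (guide in rows 0 to 3, target in rows 4 to 7); the slides do not state the row order, and the helper names and sequences are hypothetical.

```python
import numpy as np

BASES = "AGCT"  # channel order used on the slides: A, G, C, T


def one_hot(seq):
    """Return a 4 x len(seq) one-hot matrix (rows follow BASES)."""
    mat = np.zeros((4, len(seq)), dtype=np.int8)
    for col, base in enumerate(seq):
        mat[BASES.index(base), col] = 1
    return mat


def encode_8x23(guide, target):
    """Stack the guide block on top of the target block (assumed row order)."""
    return np.vstack([one_hot(guide), one_hot(target)])


def decode_8x23(mat):
    """Invert the encoding: read each 4-row block back into a sequence."""
    guide = "".join(BASES[i] for i in mat[:4].argmax(axis=0))
    target = "".join(BASES[i] for i in mat[4:].argmax(axis=0))
    return guide, target


# Hypothetical 23-base sequences (20-base target site + 3-base PAM).
guide = "GAGTCCGAGCAGAAGAAGAAGGG"
target = "AGGTCCGAGCAGAAGAAGAAGGG"

enc8 = encode_8x23(guide, target)
print(enc8.shape)                               # (8, 23)
print(decode_8x23(enc8) == (guide, target))     # round trip recovers both sequences
```

Because every pair can be decoded back, swapping the two sequences now produces a different matrix, which is the property the 4x23 encoding lacks.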

Deep Learning Models Used to Demonstrate the Performance of our Encoding

First type of DL model: Feed-forward Neural Networks (FNNs)

• Input: encoded matrix

• Combination of dense layers, batch normalization layers and dropout layers to make the predictions

• A dense layer is a regular fully connected neural network layer

• Batch normalization applies a transformation that keeps the mean output close to 0 and the output standard deviation close to 1

• The dropout layer randomly sets input units to 0 during training to prevent overfitting (a gap in model performance between the training set and the test set)

Feed-forward Neural Network Applied to Off-Target Predictions
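The three layer types listed above can be illustrated with a plain NumPy forward pass. This is a sketch under stated assumptions, not the architecture used in the study: the layer sizes, the ReLU and sigmoid activations, the dropout rate, and the random stand-in data are all hypothetical, and the learned scale and shift of batch normalization are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(x, w, b):
    """Fully connected layer: every input unit feeds every output unit."""
    return x @ w + b


def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch (learned scale/shift omitted)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)


def dropout(x, rate, training):
    """Zero a random fraction of units during training, rescale the rest."""
    if not training:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical batch of 32 flattened 8x23 encodings (random bits as stand-in data).
x = rng.integers(0, 2, size=(32, 8 * 23)).astype(float)
w1, b1 = rng.normal(0, 0.1, (8 * 23, 16)), np.zeros(16)
w2, b2 = rng.normal(0, 0.1, (16, 1)), np.zeros(1)

h = np.maximum(dense(x, w1, b1), 0.0)         # dense layer + ReLU
h = batch_norm(h)                             # per-feature mean ~0 over the batch
h_drop = dropout(h, rate=0.5, training=True)  # active only while training
prob = sigmoid(dense(h_drop, w2, b2))         # probability of class 1
print(prob.shape)                             # (32, 1)
```

At prediction time `dropout(..., training=False)` returns its input unchanged, so the random masking only affects training.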

Deep Learning Models Used to Demonstrate the Performance of our Encoding

Second type of DL model: Convolutional Neural Networks (CNNs)

• Combination of Conv2D layers, a MaxPooling2D layer, a Flatten layer and a dropout layer to make the predictions

• Conv2D: a convolution kernel is convolved with the layer input to produce a tensor of outputs

• MaxPooling2D: downsamples the input representation by taking the maximum value over a defined window

• Flatten: flattens the input

Convolutional Neural Network Applied to Off-Target Predictions
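To make the Conv2D, MaxPooling2D and Flatten steps concrete, here is a small NumPy sketch of what each operation computes on a single 8x23 input. The kernel size, pooling window and random stand-in data are assumptions chosen for illustration, and the functions are simplified single-channel versions rather than the layers of any particular library.

```python
import numpy as np


def conv2d_valid(x, kernel):
    """Slide the kernel over the input (no padding, stride 1).

    Like most deep learning layers, this computes a cross-correlation:
    the kernel is not flipped.
    """
    kh, kw = kernel.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out


def max_pool2d(x, ph, pw):
    """Take the maximum over non-overlapping ph x pw windows."""
    h, w = x.shape[0] // ph, x.shape[1] // pw
    trimmed = x[:h * ph, :w * pw]  # drop rows/columns that do not fill a window
    return trimmed.reshape(h, ph, w, pw).max(axis=(1, 3))


rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(8, 23)).astype(float)  # stand-in 8x23 encoding
kernel = rng.normal(size=(8, 3))                    # hypothetical 8x3 kernel

feat = conv2d_valid(x, kernel)   # shape (1, 21)
pooled = max_pool2d(feat, 1, 2)  # shape (1, 10)
flat = pooled.reshape(-1)        # shape (10,)
print(feat.shape, pooled.shape, flat.shape)
```

A kernel spanning all 8 rows looks at both sequences of the pair at once for each window of 3 adjacent positions, which is one plausible reason to use a 2D convolution on this encoding.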

Deep Learning Models Used to Demonstrate the Performance of our Encoding

Third type of DL model: Recurrent Neural Networks (RNNs)

• Combination of RNN layers, a batch normalization layer and a dropout layer to make the predictions

• Two types of RNNs

• LSTM for Long Short-Term Memory

• GRU for Gated Recurrent Unit

• Flatten: flattens the input

Recurrent Neural Networks

Training Neural Networks: A Challenging Task

CRISPOR data set split into two subsets

• A training set and a test set

• With a test ratio of 0.3 and equal stratification of the classes

• Leading to 18,236 samples in the training set and 7,816 samples in the test set
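A stratified split with a 0.3 test ratio, as described above, can be sketched as follows. The function name, the seed and the toy labels are hypothetical; the slides do not say which tool was used to perform the split.

```python
import numpy as np


def stratified_split(labels, test_ratio, seed=0):
    """Return (train_idx, test_idx) keeping class proportions in both subsets."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then cut off the test share.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_test = int(round(test_ratio * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)


# Hypothetical imbalanced labels: 900 non-validated (0) and 100 validated (1).
labels = np.array([0] * 900 + [1] * 100)
train_idx, test_idx = stratified_split(labels, test_ratio=0.3)

print(len(train_idx), len(test_idx))                    # 700 300
print(labels[train_idx].mean(), labels[test_idx].mean())  # same class-1 share
```

Splitting each class separately keeps the rare validated off-targets represented in both subsets, which a purely random split does not guarantee.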

Definition of overfitting

• Overfitting occurs when a model iterates over the training samples too many times and learns patterns only present in the training set

How to limit DL overfitting?

• Callbacks: a set of functions used during the training

• ReduceLROnPlateau: reduces the learning rate of the optimizer when the monitored metric stops improving

• EarlyStopping: the training is stopped when the model performance stagnates

Overfitting: the model performs well on the training set, but its performance drops significantly on the test set (source: https://www.analyticsvidhya.com/blog/2020/02/underfitting-overfitting-best-fitting-machine-learning/)
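The logic behind the two callbacks can be illustrated with a small pure-Python loop that replays a list of validation losses. This is a simplified stand-in written for this example, not the implementation of any library: the function name, the patience values, the reduction factor and the loss values are all hypothetical.

```python
def train_with_callbacks(val_losses, lr=1e-3, factor=0.5,
                         lr_patience=2, stop_patience=4):
    """Replay per-epoch validation losses through both stagnation rules."""
    best = float("inf")
    since_best = 0  # number of epochs since the last improvement
    history = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            # ReduceLROnPlateau-style rule: shrink the learning rate
            # every lr_patience epochs without improvement.
            if since_best % lr_patience == 0:
                lr *= factor
        history.append((epoch, loss, lr))
        # EarlyStopping-style rule: give up after stop_patience stagnant epochs.
        if since_best >= stop_patience:
            break
    return history


# Hypothetical validation losses: improvement stalls after epoch 2.
val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.63, 0.64, 0.5]
history = train_with_callbacks(val_losses)

for epoch, loss, lr in history:
    print(epoch, loss, lr)
```

In this toy run the learning rate is halved twice and training stops at epoch 6, so the last two epochs are never executed.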

Experiments: CRISPR Training and Testing

Receiver Operating Characteristic (ROC) curve

• Reminder: a graphical plot that illustrates the diagnostic ability of a binary classifier

• Relies on TPRs and FPRs

• TPR: True Positive Rate = TPs / (TPs + FNs), the fraction of actual positives (often class 1 in binary classification) that the model predicts correctly

• FPR: False Positive Rate = FPs / (FPs + TNs), the fraction of actual negatives that the model wrongly predicts as positive

• Use of the Area Under the Curve (AUC) to estimate the predictive performance of a model

• Best theoretical value is 1.0, worst theoretical value is 0.0 (0.5 corresponds to random guessing)

• The 8x23 encoding leads to higher predictive performance for all models, all other parameters being equal
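The ROC curve and its AUC can be computed directly from labels and predicted scores. The sketch below is a minimal NumPy version written for illustration; the function names and the four-sample toy data are hypothetical, and it assumes there are no tied scores and that both classes are present.

```python
import numpy as np


def roc_curve_points(y_true, scores):
    """Return FPR and TPR arrays obtained by sweeping the decision threshold.

    Assumes no tied scores and at least one sample of each class.
    """
    order = np.argsort(-scores, kind="stable")  # highest score first
    y = y_true[order]
    tps = np.cumsum(y)      # true positives when the top-k samples are predicted 1
    fps = np.cumsum(1 - y)  # false positives for the same cut
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr


def auc(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Toy example: 2 negatives, 2 positives, one positive ranked below a negative.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr = roc_curve_points(y_true, scores)
roc_auc = auc(fpr, tpr)
print(roc_auc)  # 0.75
```

A model that ranks every positive above every negative reaches 1.0 with the same functions.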

Experiments: CRISPR Training and Testing

Quantitative metrics to estimate the performance of the models

• F1 score = 2 * (precision * recall) / (precision + recall)

• With precision = TPs / (TPs + FPs)

• And recall = TPs / (TPs + FNs)

• AUC PR 1

• Area Under the Curve for precision and recall of class 1

• Both F1 scores and AUC PR 1 are higher for the 8x23 encoding

→ superior performance of the encoding, all other parameters being equal
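The formulas above translate directly into code. The confusion counts used here are hypothetical and only serve to show the arithmetic; the function assumes at least one true positive so that no denominator is zero.

```python
def f1_from_counts(tp, fp, fn):
    """Compute precision, recall and F1 for class 1 from confusion counts.

    Assumes tp > 0 so that none of the denominators is zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
precision, recall, f1 = f1_from_counts(tp=30, fp=10, fn=20)
print(precision, recall, round(f1, 4))  # 0.75 0.6 0.6667
```

Because true negatives do not appear in either formula, these metrics stay informative when validated off-targets are much rarer than non-validated ones.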

Experiments: GUIDE-Seq Transfer Learning

Transfer learning

• Definition: use the DL models trained on CRISPOR data to perform predictions on GUIDE-seq data

• GUIDE-seq data only contains 430 nucleobase sequence pairs

• Population size is too small to train a robust model

• Use of the Area Under the Curve (AUC) to estimate the predictive performance of a model

• Best theoretical value is 1.0, worst theoretical value is 0.0

• The 8x23 encoding leads to higher predictive performance for all models, all other parameters being equal

Experiments: GUIDE-Seq Transfer Learning

Quantitative metrics to estimate the performance of the models

• Superior AUC ROC and superior AUC PR 1 for all models

• The model with the best F1 score (FNN3) has a higher F1 score for the 8x23 encoding

→ Transfer learning has a huge impact on the predictive performance of the model

→ superior performance of the encoding, all other parameters being equal

Conclusion

In this work,

• We propose to encode the nucleobase sequence pairs as a matrix of size 8x23

• We demonstrated the superior performance of the 8x23 encoding w.r.t. the 4x23 encoding

• For different DL models

• On 2 data sets, the CRISPOR and the GUIDE-seq data sets

Future work

• Will address the insertions and deletions (indels) in the 8x23 encoding

• Use persistent homology for binary and sequence classification of off-targets

Top 20 predictions using 4x23 encoding on CRISPOR data

Thank you for your attention

Jeremy Charlier

[email protected]
