Accurate deep learning off-target prediction
with novel sgRNA-DNA sequence encoding
in CRISPR-Cas9 gene editing
Robert Nadon
McGill University Montreal, QC, Canada
Jeremy Charlier
National Bank of Canada Montreal, QC, Canada
Vladimir Makarenkov
Table of Contents
1. Context and Motivations
2. Our Proposed Contribution: A Novel 8x23 Encoding
3. Deep Learning Models Used to Demonstrate the Performance of our Encoding
4. Training Neural Networks: A Challenging Task
5. Experiments
Context and Motivations
What is CRISPR-Cas9?
• CRISPR-Cas9 is a gene editing technique
• CRISPR = Clustered Regularly Interspaced Short Palindromic Repeats
• Cas9 is a protein capable of cutting DNA at specific locations
• Target sequence (20 bases long)
• PAM (Protospacer Adjacent Motif) sequence (3 bases long)
What is the current challenge in genome editing and how is CRISPR-Cas9 useful?
• Predicting potential off-target mutations is crucial for clinical application
• CRISPOR database available at http://crispor.tefor.net/
• The database is expanding rapidly, requiring advanced analytics
• Off-target prediction can be addressed as a binary classification problem
• Validated off-targets classified as 1
• Non-validated off-targets classified as 0
Cleavage with CRISPR-Cas9 for gene editing
*https://en.wikipedia.org/wiki/CRISPR_gene_editing
Context and Motivations
Current state-of-the-art at the time of our submission [1]
• Take the categorical data A, G, C and T and convert them into 4 one-hot vectors:
• [1,0,0,0] for A
• [0,1,0,0] for G
• [0,0,1,0] for C
• [0,0,0,1] for T
• Encoding target DNA and guide RNA in a 4x23 matrix
• Use the encoded matrix to perform binary classification with DL models:
• 1 = validated off-targets
• 0 = non-validated off-targets
4x23 encoding of a sgRNA-DNA sequence pair
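A minimal sketch of how such a 4x23 matrix could be built, assuming the two one-hot matrices are merged with an element-wise OR (the slide does not name the operator). The sequences and the helper names `one_hot` and `encode_4x23` are hypothetical and chosen for illustration only.

```python
import numpy as np

# Row order follows the slide: A, G, C, T.
BASES = "AGCT"

def one_hot(seq):
    """Return a 4 x len(seq) one-hot matrix (rows: A, G, C, T)."""
    m = np.zeros((4, len(seq)), dtype=np.uint8)
    for j, base in enumerate(seq):
        m[BASES.index(base), j] = 1
    return m

def encode_4x23(guide, target):
    """Assumed merge rule: element-wise OR of the two one-hot matrices."""
    return one_hot(guide) | one_hot(target)

# Hypothetical 23-base sequences (20-base target + 3-base PAM), not taken from CRISPOR.
guide = "GAGTCCGAGCAGAAGAAGAAGGG"
target = "GAGTTAGAGCAGAAGAAGAAAGG"

x = encode_4x23(guide, target)
print(x.shape)        # (4, 23)
print(x.sum(axis=0))  # 1 where the bases match, 2 where they differ
```

Under this assumption a matching position produces a single 1 in its column and a mismatch produces two, which is why swapping the guide and the target yields the same matrix.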
Context and Motivations
Problem: Target DNA and complementary guide RNA are encoded with 1 for identical nucleobases
Let us suppose we only have the encoded matrix
Surjective-only mapping leads to a loss of information
Consider the first 2 columns of the encoded matrix
[Figure: the first 2 columns of the encoded matrix, shown next to candidate "Target DNA" and "Comp. guide RNA" bases (A or G); the question marks indicate that the original bases cannot be read back from the encoded columns.]
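The loss of information can be checked by enumeration. The sketch below assumes, as an illustration, that each column is the element-wise OR of the two one-hot vectors; the helper name `or_code` is hypothetical.

```python
from itertools import product

BASES = "AGCT"  # row order used on the slide

def or_code(dna_base, rna_base):
    """4-bit column obtained by OR-ing the two one-hot vectors (assumed merge rule)."""
    return tuple(int(b in (dna_base, rna_base)) for b in BASES)

pairs = list(product(BASES, repeat=2))  # 16 ordered (DNA, RNA) base pairs
codes = {}
for dna_base, rna_base in pairs:
    codes.setdefault(or_code(dna_base, rna_base), []).append((dna_base, rna_base))

print(len(pairs), "ordered pairs ->", len(codes), "distinct columns")
# 16 ordered pairs -> 10 distinct columns
print(codes[(1, 1, 0, 0)])
# [('A', 'G'), ('G', 'A')]
```

Sixteen possible pairs collapse onto ten columns, so a column such as [1,1,0,0] does not say whether the A sits on the DNA or on the guide RNA.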
Our Proposed Contribution: A Novel 8x23 Encoding
Contribution:
• Replace the surjective-only mapping by a bijective mapping
• New encoded matrix is 8x23 instead of 4x23
• Use the 8x23 encoded matrix for training DL models to perform binary classification of off-targets
• Demonstrate the higher classification performance of the 8x23 encoding vs. the 4x23 encoding
Bijective mapping between X and Y
Encoded 8x23 sequence
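One way to obtain a bijective 8x23 encoding is to keep the two one-hot matrices separate and stack them. The slides do not spell out the row layout, so the ordering below (guide in rows 0-3, target in rows 4-7) and the helper names are assumptions made for illustration.

```python
import numpy as np

BASES = "AGCT"

def one_hot(seq):
    """Return a 4 x len(seq) one-hot matrix (rows: A, G, C, T)."""
    m = np.zeros((4, len(seq)), dtype=np.uint8)
    for j, base in enumerate(seq):
        m[BASES.index(base), j] = 1
    return m

def encode_8x23(guide, target):
    """Assumed layout: guide one-hot in rows 0-3, target one-hot in rows 4-7."""
    return np.vstack([one_hot(guide), one_hot(target)])

def decode_8x23(x):
    """Invert encode_8x23 and recover both sequences."""
    guide = "".join(BASES[i] for i in x[:4].argmax(axis=0))
    target = "".join(BASES[i] for i in x[4:].argmax(axis=0))
    return guide, target

# Hypothetical 23-base sequences, not taken from CRISPOR.
guide = "GAGTCCGAGCAGAAGAAGAAGGG"
target = "GAGTTAGAGCAGAAGAAGAAAGG"

x = encode_8x23(guide, target)
print(x.shape)                             # (8, 23)
print(decode_8x23(x) == (guide, target))   # True
```

Because the pair can be decoded exactly, no two distinct pairs share an encoded matrix, which is the property the 4x23 encoding lacks.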
Deep Learning Models Used to Demonstrate the Performance of our Encoding
First type of DL model: Feed-forward Neural Networks (FNNs)
• Input: encoded matrix
• Combination of dense layers, batch normalization layers and dropout layers to make the predictions
• A dense layer is a regular, densely connected neural network layer
• Batch normalization applies a transformation that keeps the mean output close to 0 and the output standard deviation close to 1
• The dropout layer randomly sets input units to 0 during training to limit overfitting (the gap in model performance between the training set and the test set)
Feed-forward Neural Network Applied to Off-Targets Predictions
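The three layer types can be sketched as plain NumPy forward passes. This is an illustrative toy, not the authors' FNN: the layer sizes, the layer order, the dropout rate and the random input batch are all assumptions, and batch normalization is shown without its learned scale and shift.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    """Fully connected layer followed by a ReLU activation."""
    return np.maximum(x @ w + b, 0.0)

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch (training-mode statistics only)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def dropout(x, rate, rng):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

# Stand-in for 32 flattened 8x23 encodings (random bits, not real data).
batch = rng.integers(0, 2, size=(32, 8 * 23)).astype(float)
w1, b1 = rng.normal(0, 0.1, (8 * 23, 16)), np.zeros(16)
w2, b2 = rng.normal(0, 0.1, (16, 1)), np.zeros(1)

h = dropout(batch_norm(dense(batch, w1, b1)), rate=0.2, rng=rng)
p = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))  # sigmoid: score for class 1
print(p.shape)  # (32, 1)
```

At prediction time dropout is switched off and batch normalization uses running statistics; both details are omitted here to keep the sketch short.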
Deep Learning Models Used to Demonstrate the Performance of our Encoding
Second type of DL model: Convolutional Neural Networks (CNNs)
• Input: encoded matrix
• Combination of Conv2D layers, a MaxPooling2D layer, a Flatten layer and a dropout layer to make the predictions
• Conv2D: a convolution kernel is convolved with the layer input to produce a tensor of outputs
• MaxPooling2D: downsamples the input representation by taking the maximum value over a defined window
• Flatten: flattens the input
Convolutional Neural Network Applied to Off-Targets Predictions
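To make the shape changes concrete, the sketch below applies a single-channel convolution, max pooling and flattening to an 8x23 input in NumPy. The kernel size (8x3), the pooling window (1x2) and the random input are assumptions made for illustration, not the settings used in the experiments.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Single-channel 2-D cross-correlation without padding
    (the operation DL libraries call convolution)."""
    kh, kw = kernel.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(x, ph, pw):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h, w = x.shape[0] // ph, x.shape[1] // pw
    return x[:h * ph, :w * pw].reshape(h, ph, w, pw).max(axis=(1, 3))

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(8, 23)).astype(float)  # stand-in for an encoded pair

feature_map = conv2d_valid(x, np.ones((8, 3)))  # hypothetical 8x3 kernel
pooled = max_pool2d(feature_map, 1, 2)
flat = pooled.ravel()                           # Flatten
print(feature_map.shape, pooled.shape, flat.shape)
# (1, 21) (1, 10) (10,)
```

A kernel spanning all 8 rows looks at both sequences of a pair over 3 adjacent positions at once; a real Conv2D layer would learn many such kernels instead of the fixed all-ones kernel used here.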
Deep Learning Models Used to Demonstrate the Performance of our Encoding
Third type of DL model: Recurrent Neural Networks (RNNs)
• Input: encoded matrix
• Combination of RNN layers, a batch normalization layer and a dropout layer to make the predictions
• Two types of RNNs
• LSTM for Long Short-Term Memory
• GRU for Gated Recurrent Unit
• Flatten: flattens the input
Recurrent Neural Networks
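A recurrent layer can read the encoded matrix as a sequence of 23 positions with 8 features each. The sketch below runs one common formulation of a GRU cell in NumPy; the hidden size, the random weights, the omission of biases and the random input are assumptions made for illustration, and library implementations may arrange the gates slightly differently.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h, p):
    """One GRU update (biases omitted for brevity)."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h)             # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h)             # reset gate
    h_cand = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h))  # candidate state
    return z * h + (1.0 - z) * h_cand

rng = np.random.default_rng(0)
n_features, n_hidden = 8, 16  # 8 rows per position; the hidden size is arbitrary

# W* matrices act on the input, U* matrices act on the previous hidden state.
p = {name: rng.normal(0, 0.1, (n_hidden, n_features if name[0] == "W" else n_hidden))
     for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}

x = rng.integers(0, 2, size=(8, 23)).astype(float)  # stand-in for an encoded pair
h = np.zeros(n_hidden)
for t in range(x.shape[1]):  # 23 time steps, one per sequence position
    h = gru_step(x[:, t], h, p)
print(h.shape)  # (16,)
```

The final hidden state `h` summarizes the whole pair and would be passed on to the classification layers.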
Training Neural Networks: A Challenging Task
CRISPOR data set split into two subsets
• A training set and a test set
• With a test ratio of 0.3 and equal stratification of the classes
• Leading to 18,236 samples in the training set and 7,816 samples in the test set
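A stratified split keeps the share of each class the same in both subsets. The sketch below is a hand-written NumPy version run on synthetic labels; the function name `stratified_split` is hypothetical, and in practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument would typically be used instead.

```python
import numpy as np

def stratified_split(y, test_ratio, seed=0):
    """Return (train_idx, test_idx) with the class proportions of y preserved in both."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(test_ratio * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Synthetic, imbalanced labels; NOT the CRISPOR data.
y = np.array([1] * 100 + [0] * 900)
train_idx, test_idx = stratified_split(y, test_ratio=0.3)

print(len(train_idx), len(test_idx))             # 700 300
print(y[train_idx].mean(), y[test_idx].mean())   # 0.1 0.1
```

Stratification matters when validated off-targets are rare, because a purely random split could leave the test set with too few positive samples to evaluate on.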
Definition of overfitting
• Overfitting occurs when a model trains for too long on the same samples and learns patterns only present in the training set
How to limit DL overfitting?
• Callbacks: a set of functions called during training
• ReduceLROnPlateau: reduces the learning rate of the optimizer when the monitored metric stops improving
• EarlyStopping: training is stopped when the model performance stagnates
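The logic of the two callbacks can be sketched by replaying a list of validation losses. This is a simplified illustration, not the Keras implementation: the function name, the patience values, the reduction factor and the loss values are all assumptions, and a single shared counter stands in for the separate counters the real callbacks keep.

```python
def train_with_callbacks(val_losses, lr=1e-3, lr_factor=0.5, lr_patience=2, stop_patience=4):
    """Replay per-epoch validation losses through two plateau rules.

    Returns (epochs_run, final_lr).
    """
    best = float("inf")
    stale = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale % lr_patience == 0:
                lr *= lr_factor       # ReduceLROnPlateau-style step
            if stale >= stop_patience:
                return epoch, lr      # EarlyStopping-style exit
    return len(val_losses), lr

# Made-up validation losses: improvement stops after epoch 3.
losses = [0.70, 0.55, 0.50, 0.51, 0.52, 0.50, 0.53, 0.54]
epochs_run, final_lr = train_with_callbacks(losses)
print(epochs_run, final_lr)  # 7 0.00025
```

The learning rate is halved after 2 and 4 stagnant epochs, and training stops at epoch 7 instead of running through every scheduled epoch.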
*https://www.analyticsvidhya.com/blog/2020/02/underfitting-overfitting-best-fitting-machine-learning/
Overfitting: the model is performing well on the training set but the performance drops significantly over the test set
Experiments: CRISPR Training and Testing
Receiver Operating Characteristic (ROC) curve
• Reminder: graphical plot that illustrates the diagnostic ability of a binary classifier
• Relies on TPRs and FPRs
• TPR: True Positive Rate = TPs / (TPs + FNs), the share of actual positives (class 1) predicted as positive
• FPR: False Positive Rate = FPs / (FPs + TNs), the share of actual negatives (class 0) predicted as positive
• Use of the Area Under the Curve (AUC) to estimate the predictive performance of a model
• Best theoretical value is 1.0, worst theoretical value is 0.0 (0.5 corresponds to random guessing)
• The 8x23 encoding leads to higher predictive performance for all models, all other parameters being equal
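As a worked illustration of how a ROC curve and its AUC are obtained, the sketch below sweeps the decision threshold over a handful of made-up scores. The labels, the scores and the helper names `roc_points` and `auc` are hypothetical.

```python
def roc_points(y_true, scores):
    """Sweep the decision threshold from high to low and collect (FPR, TPR) points."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points, tp, fp = [(0.0, 0.0)], 0, 0
    ranked = sorted(zip(scores, y_true), reverse=True)
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last sample sharing this score (handles ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # made-up classifier outputs
print(auc(roc_points(y_true, scores)))   # close to 8/9
```

The value 8/9 also has a direct reading: of the 9 possible (positive, negative) pairs, the classifier scores the positive sample higher in 8.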
Experiments: CRISPR Training and Testing
Quantitative metrics to estimate the performance of the models
• F1 score = 2 * (precision * recall) / (precision + recall)
• With precision = TPs / (TPs + FPs)
• And recall = TPs / (TPs + FNs)
• AUC PR 1
• Area Under the Curve for precision and recall of class 1
• Both F1 scores and AUC PR 1 are higher for the 8x23 encoding
→ superior performance of the encoding, all other parameters being equal
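The formulas above translate directly into code. The sketch below computes them for class 1 on made-up labels and predictions; the function name and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for class 1, computed from the confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up example: 3 validated off-targets among 8 candidates.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # each close to 2/3 (TP = 2, FP = 1, FN = 1)
```

Unlike accuracy, none of these three quantities counts true negatives, which makes them informative when class 0 dominates the data.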
Experiments: GUIDE-Seq Transfer Learning
Transfer learning
• Definition: use the DL models trained on CRISPOR data to perform predictions on GUIDE-seq data
• GUIDE-seq data only contains 430 nucleobase sequence pairs
• Population size is too small to train a robust model
• Use of the Area Under the Curve (AUC) to estimate the predictive performance of a model
• Best theoretical value is 1.0, worst theoretical value is 0.0
• The 8x23 encoding leads to higher predictive performance for all models, all other parameters being equal
Experiments: GUIDE-Seq Transfer Learning
Quantitative metrics to estimate the performance of the models
• Superior AUC ROC and superior AUC PR 1 for all models
• Best F1-score performing model (FNN3) has a higher F1 score for the 8x23 encoding
→ Transfer learning has a huge impact on the predictive performance of the model
→ superior performance of the encoding, all other parameters being equal
Conclusion
In this work,
• We proposed to encode the nucleobase sequence pairs as a matrix of size 8x23
• We demonstrated the superior performance of the 8x23 encoding w.r.t. the 4x23 encoding
• For different DL models
• On 2 data sets, the CRISPOR and the GUIDE-seq data sets
Future work
• Address the insertions and deletions (indels) in the 8x23 encoding
• Use persistent homology for binary and sequence classification of off-targets
Top 20 predictions using 4x23 encoding on CRISPOR data