A study of the dynamics and genetics of COVID-19 through machine learning


A STUDY OF THE DYNAMICS AND GENETICS OF COVID-19 THROUGH MACHINE LEARNING

BY

SAYANTANI BASU

THESIS

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

in the Graduate College of the

University of Illinois at Urbana-Champaign, 2020

Urbana, Illinois

Adviser: Professor Roy H. Campbell


ABSTRACT

COrona VIrus Disease (COVID-19), a disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has resulted in over 12 million infections and over 560,000 deaths in a worldwide pandemic. Countries all over the world have carried out mitigation measures to curb the pandemic in the form of lockdowns, disinfection measures, and social distancing. We study the dynamics of this disease using a machine learning approach based on Long Short-Term Memory (LSTM) networks in order to evaluate the degree to which the virus has spread. Our model is trained on cumulative COVID-19 cases and deaths. Parameters in our model can be adjusted to obtain predictions as required. Results have been obtained at both the country and county levels for the United States of America and for some globally affected areas. We show predictions up to three different points in time: May 11, June 10, and June 30. We have also carried out a quantitative evaluation of mitigation measures in eight different counties in the United States, based on the rate of difference between a short and a long window parameter in the proposed LSTM model. The proposed LSTM model provides useful insights and can aid places that are planning strategies for mitigation and reopening. We also study the genetics of COVID-19 using a Multi-Layer Perceptron (MLP) model to recognize and classify COVID-19 genetic sequences at various taxonomic levels. Our model is an alignment-free method based on machine learning and natural language processing techniques that achieves reasonable performance in terms of cross-validation accuracy and time compared to the baseline model. The results from this work could potentially contribute to society during the current global concern.


To my parents and brother, for their love and support.


ACKNOWLEDGMENTS

I would like to specially thank my advisor, Professor Roy H. Campbell, as well as the Department of Computer Science at the University of Illinois at Urbana-Champaign (UIUC) for allowing me to work on a topic as relevant and current as COVID-19. I am also thankful to the Coordinated Science Laboratory (CSL) at UIUC for funding my work over the Summer. This project has been funded by the Jump ARCHES endowment through the Health Care Engineering Systems Center.

Professor Campbell was always willing to brainstorm new ideas and help me introduce new dimensions as part of this research. I have gained a lot from our research meetings and discussions. I was able to learn so much by working on this new research topic. Most of the work in this thesis is based on our recently accepted research paper in the Science Citation Index (SCI) Elsevier Journal Chaos, Solitons & Fractals [1, 2] as well as some other manuscripts in progress.

A substantial part of this thesis was written during the COVID-19 stay-at-home and social distancing period, which came with its own set of challenges. Perhaps the most crucial part of working on this thesis was organizing the deluge of information and updating it from time to time. This also included experiments and test results, since most of the data used is dynamic and gets updated on a regular basis! However, I have had tremendous support and encouragement from the Urbana-Champaign community, family, and friends during these trying times, for which I am really thankful.


TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION

CHAPTER 2 RELATED WORK

2.1 Modeling-Based Approaches for COVID-19

CHAPTER 3 BACKGROUND

3.1 Disease, Epidemics and Pandemics

3.2 Approaches for Fighting COVID-19

CHAPTER 4 PROPOSED METHODS

4.1 Approaches on Modeling Diseases

4.2 Proposed Method on COVID-19 Dynamics

4.3 Proposed Method on COVID-19 Genetics

CHAPTER 5 EXPERIMENTS AND RESULTS

5.1 Results on COVID-19 Dynamics

5.2 Results on COVID-19 Genetics

CHAPTER 6 CONCLUSION


CHAPTER 1: INTRODUCTION

A novel coronavirus outbreak that spread from China in December 2019 turned into a pandemic and led to infections and fatalities worldwide. At present, it is estimated that over 12 million people have been infected with COVID-19 and over 560,000 people have died after contracting the virus [3]. The virus was termed the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and the corresponding disease was termed the COrona VIrus Disease (COVID-19). The most commonly observed symptoms are cough and fever, with several patients in severe condition showing pneumonia-like symptoms and dyspnea [4]. Studies have established that COVID-19 spreads through infected respiratory droplets that are transmitted from an infected person to a healthy person [5]. A range of related symptoms has also been observed in humans, including links to skin rashes [6] and gastrointestinal symptoms [7]. Several recent studies have examined probable zoonotic pathways, as well as other species that are capable of contracting COVID-19 [8].

There is no specific cure for COVID-19, although efforts all around the globe are currently in progress for developing a vaccine or drugs targeted at treating COVID-19. Present treatment methods have made use of antiviral medicines used for treating other viral diseases [9, 10, 11, 12].

In order to take steps for treating this viral and contagious disease, healthcare professionals were provided with special equipment to treat patients infected with COVID-19. This led to an increase in demand for PPE (Personal Protective Equipment) for frontline healthcare workers in places with high COVID-19 hospitalizations, and for ventilators required in hospitals [13, 14].

Globally, countries have enforced mitigation measures for COVID-19, which commonly included total lockdowns, social distancing, and advising people to take precautions like washing hands, sanitizing surfaces, and wearing face masks. Such decisions were carried out based on statistical models and projections showcasing the effects of various mitigation measures [15, 16, 17].

Individuals showing mild COVID-19 symptoms were generally asked to self-isolate for a period of two weeks and obtain medical help in the event that their health conditions worsened [18, 19, 20]. In several cohort-based studies, COVID-19 was shown to affect particular groups of people in various populations [21, 22], as well as people having pre-existing medical conditions [23, 24].

Asymptomatic transmission was considered to be another possible route of infection. Asymptomatic COVID-19 individuals are those who test positive for COVID-19 without exhibiting any visible symptoms [25]. This may have accelerated the spread of COVID-19 [26], as asymptomatic people may not have been aware of their infection and may have spread the disease through social interactions with healthy individuals.

In this thesis, we propose machine learning approaches to study the dynamics and genetics of COVID-19. The rest of this thesis is organized as follows: Chapter 2 discusses related work on machine learning approaches for COVID-19; Chapter 3 gives some background on diseases, epidemics, pandemics, and mitigation measures; Chapter 4 elaborates on our proposed methods; Chapter 5 discusses the results of our models on the dynamics and genetics of COVID-19; and Chapter 6 concludes this thesis.

The work in this thesis has led to the following papers:

[1] S. Basu and R. H. Campbell, “Going by the numbers: learning and modeling COVID-19 disease dynamics”, Chaos, Solitons & Fractals, July 2020, doi: 10.1016/j.chaos.2020.110140.

[2] S. Basu and R. H. Campbell, “Going by the numbers: learning and modeling COVID-19 disease dynamics”, medRxiv, 2020.


CHAPTER 2: RELATED WORK

2.1 MODELING-BASED APPROACHES FOR COVID-19

The increased incidence of COVID-19 prompted research efforts for studying different aspects of the disease. In this section, we discuss work related to machine learning and other similar approaches for COVID-19.

Many previously proposed methods have modeled how COVID-19 infections could spread based on statistical approaches. Neher et al. [27] have obtained an SIR-based seasonal transmissibility model [28], a category of epidemiological models that can be used to predict the course of future infections caused by the virus. They have considered infection, population turnover rates, and emigration in their proposed approach. A more refined version of their method [29] visualizes the hospital capacity needed to accommodate patients during COVID-19.

Similarly, another hospital impact model, CHIME (COVID-19 Hospital Impact Model for Epidemics), proposed by Penn Medicine [30], projects the number of hospitalizations, patients requiring ICU (Intensive Care Unit) care, and ventilators for Penn hospitals. They have formulated a statistics-based approach that depends on parameters like hospitalization rate, social contact, and detection probability.

Another category of projection models for studying COVID-19 involves machine learning based approaches by different research teams in a global effort to study the various facets of the disease.

Hu et al. [31] have proposed an auto-encoder based learning model to forecast cumulative cases of COVID-19 in China until April 2020 by training on cumulative data from previous months. They have also carried out a cluster study by grouping the cities and provinces based on extracted features. Other LSTM approaches have also considered patient data [32] in order to forecast epidemic curves in China [33].

So far, all proposed COVID-19 projection and forecast models have been centered on graphical analyses and epidemic trajectory prediction. However, in our proposed approach, we obtain useful insights based on model predictions and also seek to quantitatively evaluate the degree to which mitigation measures are working depending on the projected trajectories for COVID-19.

Several other learning models have incorporated image and prescription data to carry out automated detection of COVID-19 in patients. Wang et al. [34] have studied the “Traditional Chinese Medicine (TCM)” model using a deep learning based “Ontology-based Side-effect Prediction Framework (OSPF)”. Their model comprised an Artificial Neural Network (ANN) with three hidden layers that was applied to TCM prescriptions for treating patients diagnosed with COVID-19. They also validated their proposed OSPF model in order to measure the safety and effectiveness of taking prescriptions used for other flu-like illnesses and applying them to treat COVID-19 patients. Any TCM prescription is classified as ‘Safe’ or ‘Unsafe’ by their proposed model. Additionally, they recognized seven TCM prescriptions that are more important than others for the treatment of patients diagnosed with COVID-19.

Wang et al. [35] have proposed BI-AT-GRU, a bidirectional GRU (Gated Recurrent Unit) model with attention that is able to classify six types of respiratory patterns in COVID-19 patients. They have noted that breathing problems in COVID-19 patients arise predominantly as a result of ‘Tachypnea’, which corresponds to irregular, shallow, and rapid breathing. Their proposed model can distinguish among the respiratory patterns ‘Eupnea’, ‘Tachypnea’, ‘Cheyne-Stokes’, ‘Central-Apnea’, ‘Biots’, and ‘Bradypnea’ with high F1 scores.

Wang et al. [36] have proposed the ‘Inception Migration Neuro Network’, a CNN (Convolutional Neural Network) model that obtains radiography features from volumetric chest CT (Computed Tomography) scans of patients from two medical institutes in China. Their proposed model identifies the presence or absence of COVID-19 in a chest CT scan. They have studied the F1 and AUC (Area Under Curve) scores for their proposed model. Moreover, they claim their proposed method is low cost and non-invasive in nature.

Another similar approach, DeepPneumonia by Song et al. [37], used deep learning for studying CT images of patients with COVID-19. They collected lung CT images from two hospitals in China for patients with bacterial pneumonia, patients with COVID-19, and healthy individuals. Their proposed method has a sensitivity of 0.93 and an AUC score of 0.99 for classifying COVID-19. Li et al. [38] have proposed ‘COVNet’ (‘COVID-19 detection neural network’), a deep learning approach capable of distinguishing COVID-19 from ‘CAP (Community Acquired Pneumonia)’ and other non-pneumonia diseases based on CT scans of patients. The architecture of their proposed model included max-pooling, ResNet50, and a fully connected layer. Their main aim was to distinguish COVID-19 features from the images.

Sethy and Behera [39] have proposed a similar machine learning model for X-Ray images based on ResNet50 and SVM (Support Vector Machines) in order to identify patients with COVID-19. They have evaluated the efficiency of their proposed model using False Positive Rate (FPR), F1, Kappa, and MCC (Matthews correlation coefficient) metrics.

We now shift our discussion to genetically classifying COVID-19 based on its genomic signature. The classic approach to finding genetic similarities among sequences would be to use alignment techniques. However, while such methods provide high accuracy, they are demanding in terms of computational resources [40]. Several alignment-free methods have been previously proposed [41, 42] in order to overcome such limitations. One such approach uses machine learning to classify genes at various taxonomic levels with reasonable accuracy and does not require alignment. This recently proposed approach by Randhawa et al. [43] combined digital signal processing with machine learning and a decision tree classifier to obtain a model for classifying COVID-19 sequences. They have evaluated their model in terms of the cross-validation accuracy on 29 COVID-19 sequences (available as of January 27, 2020) and time (in seconds).

In this thesis, we explore an LSTM-based approach for learning the epidemiological dynamics of COVID-19. Statistical learning models need rigorous settings of data and tuning of coefficients to be capable of epidemiological predictions. On the other hand, machine learning methods enable learning complex patterns automatically from data depending on the proposed model and its corresponding tuned hyperparameters. This is immensely applicable to time-series epidemiological data in the case of COVID-19, where the cumulative values of infections and deaths change with regard to location and time. We then show results on a neural network model for classifying COVID-19 genetic sequences at various taxonomic levels. We compare our performance with a recently proposed method [43] as the baseline and show that our model improves on cross-validation accuracy and reduces time for several test cases on various taxonomic levels, while performing on par on the original test dataset of COVID-19 sequences. We additionally provide results on a similar sample from recently available sequences (available as of June 30, 2020) for comparison purposes.


CHAPTER 3: BACKGROUND

3.1 DISEASE, EPIDEMICS AND PANDEMICS

3.1.1 Disease

A disease is defined as a disruption in the ordinary biological functioning of an organism. Diseases can be caused by a variety of infectious agents like viruses, bacteria, fungi, multicellular organisms like mosquitoes and wasps, or by genetic factors like inheritance and mutation. In the context of this thesis, we discuss COVID-19 (COrona VIrus Disease), which is a disease caused by the virus severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The original source of transmission of the disease was observed to be zoonotic, which means animal-to-human transmission. However, human-to-human transmission ultimately led to propagation of the disease worldwide. COVID-19 is believed to spread through droplet infection or direct contact, as indicated by clinical studies [44, 45, 46, 47, 48].

3.1.2 Epidemics and Pandemics

An epidemic is an occurrence of a disease in a widespread fashion at a given point in time that can affect several individuals of a population.

A pandemic is a kind of epidemic which spreads widely over several countries and affects a significant number of people.

Historically, humans have been subjected to several epidemic outbreaks due to diseases which have often led to a significant number of global infections and deaths. A summary is provided in Table 3.1 for the number of deaths worldwide, type, time period, place of origin, cause, and spread for each disease.

For comparison purposes, we also consider epidemics from the family of coronaviruses like MERS and SARS and include the COVID-19 outbreak for reference. It is noteworthy that most recent disease outbreaks have been viral in nature.


Table 3.1: Epidemics and Related Disease Outbreaks in History

| S.No. | Name | Worldwide Deaths | Type | Time Period | Place of Origin | Cause | Spread |
|---|---|---|---|---|---|---|---|
| 1 | Antonine Plague | 5 million | Viral | 165-180 AD | Egypt | 1. Trade; 2. Soldiers returning from war; 3. Possibly smallpox outbreak (Symptoms: black skin rash, diarrhea, fever, black stool) | Roman Empire (Persia to Spain, Britain to Egypt). Also caused death of Roman emperor Marcus Aurelius |
| 2 | Plague of Justinian | 30-50 million | Bacterial | 541-543 AD | Egypt | Bacteria Yersinia pestis that infected Rattus rattus (black rat) on grains shipped to Constantinople | Byzantine (East Roman) empire, Constantinople, Sasanian (Neo-Persian) empire, Mediterranean port cities |
| 3 | Japanese Smallpox | 2 million | Viral | 735-737 AD | Kyushu, Japan | 1. Airborne; 2. Possibly by fisherman who had the illness from Korean peninsula | Five islands of Japan |
| 4 | Bubonic Plague | 50 million | Bacterial | 1346-1353 AD | Central Asia or East Asia | 1. Mongol armies; 2. Traders; 3. Fleas as carriers from infected animals; 4. Exposure to infected animals | Europe, Asia, Africa. Estimated deaths of 30-60% of Europe |
| 5 | Smallpox | 5-8 million | Viral | 1520-1521 | Coast of Mexico | Variola virus that entered Mexico from Cuba by traveling army | Aztec Empire (presently Central and Southern Mexico). Theories surrounding fall of the Aztec Empire in the Battle of Tenochtitlan by Hernan Cortes versus weakening of population due to smallpox |
| 6 | Great Plagues (Seville, London and Vienna) | 1 million | Bacterial | 1647-1679 | St. Giles-in-the-Fields, London | Netherlands, possibly via Dutch ships carrying cotton | Amsterdam, Russia, Naples, Austria, London, Malta, South Spain |
| 7 | Cholera | 40 million | Bacterial | 1817-1923 | Jessore in present-day Bangladesh | Consuming food and water contaminated with Vibrio cholerae bacterium | Worldwide |
| 8 | The Third Plague | 12 million | Bacterial | 1855-1859 | Yunnan, China | 1. Water-traffic spread bacteria from China to Hong Kong; 2. Spread to British India from Hong Kong | Singapore, Taiwan, Japan, Indian subcontinent |
| 9 | Russian Flu | 1 million | Viral | 1889-1890 | St. Petersburg, Russia | An Influenza virus spread due to modern transport | Worldwide |
| 10 | Spanish Flu | 50 million | Viral | 1918-1920 | France | Virus H1N1 Influenza A | Worldwide. The term “Spanish” is a misnomer since Spain was one of the first countries to report the flu |
| 11 | Asian Flu | 1 million | Viral | 1957-1958 | China | Virus H2N2 Influenza A | China to Singapore, followed by Hong Kong and US. Maurice Hilleman, an American microbiologist, invented the vaccine for Asian Flu which helped curb deaths in the US |
| 12 | Hong Kong Flu | 1 million | Viral | 1968-1970 | China | Virus H3N2 Influenza A | Worldwide |
| 13 | SARS (Severe Acute Respiratory Syndrome) | 774 | Viral | 2002-2003 (no cases after 2004) | Southern China | 1. SARS-CoV (SARS coronavirus) | Mainland China, Hong Kong, affected 17 countries worldwide |
| 14 | MERS (Middle East Respiratory Syndrome) or Camel Flu | 862 | Viral | 2015-2018 (Fewer cases from 2018 to 2020) | Saudi Arabia | 1. MERS-CoV (MERS coronavirus); 2. Betacoronavirus from bats; 3. Camels have antibodies similar to MERS-CoV | Saudi Arabia, Arabian Peninsula, South Korea |
| 15 | COVID-19 (COrona VIrus Disease) | 560,000 (Cases: 12 million) | Viral | December 2019-Present | Wuhan, China | 1. Severe acute respiratory syndrome coronavirus 2; 2. Possible origin in horseshoe bats; 3. Also linked to Huanan Seafood Wholesale market (live animals) | Worldwide |


3.2 APPROACHES FOR FIGHTING COVID-19

The following subsections describe general methods for fighting epidemics of this kind; the discussion here is carried out specifically with regard to COVID-19 [49].

3.2.1 Do Nothing

If no action is taken, more and more people get infected, leading to an increase in the death rate. Moreover, this can also place additional burdens on hospitals and other healthcare facilities that are suddenly faced with the challenge of accommodating large numbers of infected people. There is also a belief that doing nothing will lead to the population becoming immune due to the presence of antibodies obtained by contracting the virus or getting vaccinated, which in turn leads to the phenomenon of “herd immunity” [50, 51, 52] and the consequent diminishing of such epidemics.

Additionally, there arises the phenomenon of “collateral damage” [53], where people suffering from other ailments cannot receive proper and timely treatment due to increased attention towards COVID-19 patients in such a scenario.

3.2.2 Mitigation

In this scenario, damage reduction for COVID-19 is carried out in terms of social distancing and frequent sanitizing so that the infections do not reach their peak and place considerable strain on the healthcare system. The end goal in this scenario is to just “flatten the curve” [54, 55, 56].

3.2.3 Suppression

This is an approach to control the spread of the COVID-19 epidemic by following strict social distancing measures and lockdowns, followed by reopening measures to enable people to get back to their normal life. While this is able to curb the infections and ease the burden on the healthcare systems, it puts pressure on people by ordering them to follow strict measures including face coverings, frequent sanitizing, and so on [57, 58]. Extended periods of lockdowns are also harmful for the economy of any country [59, 60].


3.2.4 The Hammer and the Dance

The ‘Hammer and the Dance’ [49, 61, 62] strategy involves acting quickly in the first few weeks to ‘hammer’ the COVID-19 virus by enforcing strict lockdowns and other strong mitigation measures. Once this controls surges in the virus, it can be followed by the ‘dance’ for a prolonged time, which involves relaxing mitigation measures to a certain extent and conducting phased reopening in an effort to control the virus until medications or vaccines are formulated.

3.2.5 Rapid Testing and Contact Tracing

This is a control measure supported widely by its proponents [63, 64, 65]. Even if social distancing and lockdowns are implemented, testing is equally important in order to obtain an idea of the proportion of the population that is affected by the virus. Due to the contagious nature of COVID-19, rapid testing ensures that people who test positive for the virus can quickly be isolated to prevent the infection from spreading to other people. Moreover, contact tracing should also be carried out to identify individuals who have come in close contact with an infected person and possibly isolate them, as they are also potential carriers of COVID-19.


CHAPTER 4: PROPOSED METHODS

4.1 APPROACHES ON MODELING DISEASES

In this section, we discuss a set of approaches for modeling disease propagation, which is usually represented as time-series data. We also discuss the Multilayer Perceptron (MLP) that we use for classifying COVID-19 sequences.

Long Short-Term Memory Networks (LSTMs)

Originally proposed by Hochreiter and Schmidhuber [66], Long Short-Term Memory Networks (LSTMs) are a category of Recurrent Neural Networks (RNNs) capable of capturing and ‘remembering’ long-term dependencies [67]. These advantages motivated us to use this approach for our proposed model on COVID-19 dynamics.

Figure 4.1 shows the structure of an LSTM cell with connections for forming an entire network used for learning, where $x_t$ indicates the input at time $t$, $f_t$ indicates the forget gate, $i_t$ indicates the input gate, $o_t$ indicates the output gate, $c_{t-1}$ and $c_t$ indicate the states at time $t-1$ and $t$ respectively, $h_{t-1}$ denotes the output from the previous LSTM cell, $h_t$ denotes the output to the next LSTM cell, and $\tilde{c}_t$ denotes the candidate value of the cell state.


The equations below summarize the operations occurring in an LSTM. Equations are based on the original LSTM model [66, 67]. The $b$ terms indicate the biases for the corresponding LSTM gates. The $w$ and $u$ terms indicate the weights for the input and recurrent layers respectively.

$$f_t = \sigma(w_f x_t + u_f h_{t-1} + b_f) \tag{4.1}$$

$$i_t = \sigma(w_i x_t + u_i h_{t-1} + b_i) \tag{4.2}$$

$$o_t = \sigma(w_o x_t + u_o h_{t-1} + b_o) \tag{4.3}$$

$$\tilde{c}_t = \tanh(w_c x_t + u_c h_{t-1} + b_c) \tag{4.4}$$

$$c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde{c}_t \tag{4.5}$$

$$h_t = o_t \cdot \tanh(c_t) \tag{4.6}$$
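As a minimal illustrative sketch (not the implementation used for the experiments in this thesis), Equations 4.1 to 4.6 can be traced for a single scalar LSTM cell. The function name lstm_step, the parameter dictionary layout, and the weight values are assumptions made only for this example.

```python
import math


def sigmoid(z):
    """Logistic function used by the forget, input, and output gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step; p maps hypothetical names such as 'w_f' to floats."""
    f_t = sigmoid(p["w_f"] * x_t + p["u_f"] * h_prev + p["b_f"])  # Eq. 4.1
    i_t = sigmoid(p["w_i"] * x_t + p["u_i"] * h_prev + p["b_i"])  # Eq. 4.2
    o_t = sigmoid(p["w_o"] * x_t + p["u_o"] * h_prev + p["b_o"])  # Eq. 4.3
    c_tilde = math.tanh(p["w_c"] * x_t + p["u_c"] * h_prev + p["b_c"])  # Eq. 4.4
    c_t = f_t * c_prev + i_t * c_tilde  # Eq. 4.5
    h_t = o_t * math.tanh(c_t)  # Eq. 4.6
    return h_t, c_t


# Arbitrary illustrative weights, unrolled over a short made-up input sequence.
params = {name: 0.5 for name in
          ("w_f", "u_f", "b_f", "w_i", "u_i", "b_i",
           "w_o", "u_o", "b_o", "w_c", "u_c", "b_c")}
h, c = 0.0, 0.0
for x in [0.1, 0.2, 0.3]:
    h, c = lstm_step(x, h, c, params)
```

The state c carries information across steps through the forget gate, while h is the value passed on to the next cell. In practice the weights are matrices and are learned during training rather than fixed by hand.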

Bayesian Inference on SIR

Bayesian Inference can be applied to an SIR model for modeling disease spread. The SIR model for disease propagation includes Susceptible ($S$), Infected ($I$), and Recovered ($R$) sections of the population. $N$ represents the total population size, $\gamma$ and $\beta$ represent the recovery and infection rates respectively, $\Lambda$ represents the cumulative deaths, $\rho$ represents the fraction of the population at risk, $\eta$ represents the probability of death, and $R_{t-\psi}$ represents the population removed with delay. In this context, $R_0$ represents the ratio of the frequency of contacts to that of recovery. The equations for the model, adopted from recently proposed work [68, 69], are stated below.

$$\frac{dS}{dt} = -\frac{\beta I S}{N} \tag{4.7}$$

$$\frac{dI}{dt} = \frac{\beta I S}{N} - \gamma I \tag{4.8}$$

$$\frac{dR}{dt} = \gamma I \tag{4.9}$$

$$N = S + I + R \tag{4.10}$$

$$\Lambda_t = \rho \eta R_{t-\psi} \tag{4.11}$$
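To make the SIR equations concrete, the sketch below integrates Equations 4.7 to 4.9 with a simple forward Euler scheme and derives cumulative deaths as in Equation 4.11. The function names, the Euler discretization, and all parameter values are assumptions chosen for illustration; they are not taken from [68, 69] or from the experiments in this thesis.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, steps_per_day=10):
    """Forward Euler integration of the SIR model, returning daily (S, I, R)."""
    n = s0 + i0 + r0  # Eq. 4.10: total population
    dt = 1.0 / steps_per_day
    s, i, r = float(s0), float(i0), float(r0)
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * i * s / n * dt  # flow from S to I
            new_recoveries = gamma * i * dt         # flow from I to R
            s -= new_infections                     # Eq. 4.7
            i += new_infections - new_recoveries    # Eq. 4.8
            r += new_recoveries                     # Eq. 4.9
        history.append((s, i, r))
    return history


def cumulative_deaths(history, rho, eta, psi):
    """Eq. 4.11: deaths on day t from the removed population psi days earlier."""
    return [rho * eta * history[t - psi][2] if t >= psi else 0.0
            for t in range(len(history))]


# Made-up parameters: 1000 people, 10 initially infected, 60 simulated days.
history = simulate_sir(beta=0.3, gamma=0.1, s0=990, i0=10, r0=0, days=60)
deaths = cumulative_deaths(history, rho=1.0, eta=0.01, psi=7)
```

Because every person leaving one compartment enters another, the sum S + I + R stays equal to N throughout the simulation, which is a quick sanity check on any discretization of these equations.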

The equations below from recently proposed work [68, 69] represent Bayesian inference on the SIR model. α indicates the probability of accepting the parameters.


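$$\log(\theta^\star) \sim \mathcal{N}(\log \theta^\star \mid \log \theta, \sigma I_d) \tag{4.12}$$

$$\log p(\Lambda_{1:T}, \theta^\star \mid y_{1:T}) = \log p(\theta) + \sum_{t=1}^{T} \log\left(\mathrm{Poisson}(y_t \mid \Lambda_t)\right) \tag{4.13}$$

$$\alpha = \min\left(1, \frac{p(\Lambda_{1:T}, \theta^\star \mid y_{1:T})}{p(\Lambda_{1:T}, \theta \mid y_{1:T})}\right) \tag{4.14}$$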
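The propose-and-accept loop described by Equations 4.12 to 4.14 can be sketched as a random-walk Metropolis sampler in log space. This is a deliberately simplified toy, not the procedure of [68, 69]: it assumes a single parameter, a constant expected count in place of the SIR-derived value, a flat prior on the log of the parameter, and made-up observations. All function names are hypothetical.

```python
import math
import random


def log_poisson(y, lam):
    """Log of the Poisson probability of observing count y with mean lam."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)


def log_posterior(log_theta, counts):
    """Eq. 4.13 under the toy assumptions: flat prior, constant expected count."""
    lam = math.exp(log_theta)
    return sum(log_poisson(y, lam) for y in counts)


def metropolis(counts, n_steps=6000, burn_in=1000, step_sd=0.1, seed=0):
    rng = random.Random(seed)
    log_theta = 0.0
    current = log_posterior(log_theta, counts)
    samples = []
    for step in range(n_steps):
        proposal = rng.gauss(log_theta, step_sd)          # Eq. 4.12: Gaussian step in log space
        candidate = log_posterior(proposal, counts)       # Eq. 4.13
        alpha = math.exp(min(0.0, candidate - current))   # Eq. 4.14: acceptance probability
        if rng.random() < alpha:
            log_theta, current = proposal, candidate
        if step >= burn_in:
            samples.append(math.exp(log_theta))
    return samples


# Made-up daily counts with a sample mean of 21.
observed = [18, 22, 19, 25, 21, 17, 23, 20, 24, 21]
samples = metropolis(observed)
posterior_mean = sum(samples) / len(samples)
```

Working with the logarithm of the ratio avoids numerical underflow when the likelihood is a product of many Poisson terms. In the full model, the expected count for each day would come from the SIR trajectory implied by the proposed parameters.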

Time Series Regression (TSR)

The traditional Time Series Regression (TSR) model for modeling infectious diseases is represented by the equations below [70]. $x_t$ represents the variable that varies according to time, $\beta$, $\beta_0$, and $\beta_p$ represent the regression coefficients, $z_{p,t}$ represents the risk factors, and $f(t)$ represents the time function.

$$Y_t \sim \mathrm{Poisson}(\mu_t) \tag{4.15}$$

$$\log(\mu_t) = \beta_0 + \beta x_t + \sum_{p} \beta_p z_{p,t} + f(t) \tag{4.16}$$

Multilayer Perceptron (MLP)

We now discuss the Multilayer Perceptron (MLP), a type of feedforward neural network that we use for our proposed model on classifying COVID-19 genetic sequences at various taxonomic levels. Figure 4.2 shows an example of a network with two hidden layers drawn using NN-SVG [71]. Based on a set of inputs, the task of the MLP is to learn to predict the output. The MLP consists of several units called ‘neurons’ which perform weighted summations followed by an activation function (for example, sigmoid or tanh). It uses a process called backpropagation [72] for training and performs optimization by minimizing a loss function like the cross-entropy loss function.


Figure 4.2: Example of an MLP with two hidden layers

The equations for classification performed by an MLP trained using backpropagation based on softmax and cross-entropy error are summarized below [72, 73]. $y_i$ indicates the softmax activation at output $i$, $E$ denotes the cross-entropy error, $\frac{\partial E}{\partial w_{ji}}$ indicates the gradient for the top layer weights, and $\frac{\partial E}{\partial s^1_j}$ indicates the gradient for the hidden layer units.

$$y_i = \frac{e^{s_i}}{\sum_{c}^{n_{class}} e^{s_c}} \tag{4.17}$$

$$E = -\sum_{i}^{n_{class}} t_i \log(y_i) \tag{4.18}$$

$$\frac{\partial E}{\partial w_{ji}} = (y_i - t_i) h_j \tag{4.19}$$

$$\frac{\partial E}{\partial s^1_j} = \sum_{i}^{n_{class}} (y_i - t_i) w_{ji} h_j (1 - h_j) \tag{4.20}$$
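The gradients in Equations 4.19 and 4.20 can be checked numerically on a tiny network. The sketch below assumes one hidden layer with sigmoid units, no bias terms, a one-hot target t, and top-layer weights indexed as w[j][i] from hidden unit j to output i. These choices, the function names, and the numbers are assumptions for illustration, not the configuration used in this thesis.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(s):
    """Eq. 4.17, with the maximum subtracted for numerical stability."""
    m = max(s)
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y, t):
    """Eq. 4.18 for predicted probabilities y and targets t."""
    return -sum(ti * math.log(yi) for yi, ti in zip(y, t))


def forward(s1, w):
    """Hidden pre-activations s1 -> hidden outputs h -> softmax outputs y."""
    h = [sigmoid(v) for v in s1]
    n_class = len(w[0])
    s = [sum(w[j][i] * h[j] for j in range(len(h))) for i in range(n_class)]
    return h, softmax(s)


def gradients(s1, w, t):
    h, y = forward(s1, w)
    n_class = len(t)
    # Eq. 4.19: gradient for the top-layer weights.
    grad_w = [[(y[i] - t[i]) * h[j] for i in range(n_class)]
              for j in range(len(h))]
    # Eq. 4.20: gradient for the hidden pre-activations (sigmoid derivative).
    grad_s1 = [sum((y[i] - t[i]) * w[j][i] for i in range(n_class))
               * h[j] * (1.0 - h[j]) for j in range(len(h))]
    return grad_w, grad_s1


# Made-up example: two hidden units, three classes, true class is the second.
s1 = [0.2, -0.4]
w = [[0.1, -0.3, 0.5], [0.7, 0.2, -0.6]]
t = [0.0, 1.0, 0.0]
grad_w, grad_s1 = gradients(s1, w, t)
```

Comparing these analytic gradients against finite differences of the loss is a common way to confirm that a backpropagation derivation has been transcribed correctly.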

4.2 PROPOSED METHOD ON COVID-19 DYNAMICS

In our proposed method, we use LSTMs to observe the disease dynamics and apply it to the forecast of cases and deaths of COVID-19. The innate nature of LSTMs enables us to gain an insight into future predictions of COVID-19 and how it can aid precautionary mitigation measures in healthcare.


4.2.1 Dataset

In order to train, validate and test our proposed LSTM model, we obtain publicly available data from the Johns Hopkins University repository [74]. The repository provides time-series COVID-19 data for cumulative confirmed cases and deaths for the United States of America (USA) and other global locations. For our experiments, we show results of COVID-19 predictions on country-level data as well as fine-grained county-level data. We conduct these experiments due to the variability of the data at different places that provides a rich environment for study. Additionally, for eight locations in the United States, we evaluate the effectiveness of mitigation measures based on our proposed LSTM model.

For experimental purposes, we consider data over three periods of time, namely from January 22 to May 11, from January 22 to June 10, and from January 22 to June 30. These periods of time are particularly significant in the USA as they indicate the first COVID-19 surge, reopenings, and the second COVID-19 surge respectively. For the sake of uniformity, we have conducted experiments on all three time intervals for all locations considered as part of this thesis.

When time series data is separated into train, validation, and test sets, caution has to be exercised in order to maintain temporal relations. For infectious diseases, the disease propagates over time, resulting in time-series data of the cumulative numbers of cases and deaths. As a result, in order to predict the cumulative infections or deaths at a certain point in time, the model would have to be trained on the preceding time steps.

Simultaneously, tasks like monitoring generalization and performing hyperparameter selection have to be carried out for the model. To handle this situation, the data is partitioned on time as shown in Figure 4.3 for training and validation.

Separately, the model is trained on three consecutive splits of train and validation data for various hyperparameters to obtain the minimum total root mean squared error (RMSE) on the validation sets. We focus especially on tuning the lookback and lookahead parameters, or windows. The train and validation sets are split in an approximately 80:20 ratio, depending on the initial lookback and lookahead parameters of 1 and 1 respectively. However, the model can handle other splits depending on the data as well. Smaller splits of the train and validation data have been considered at an approximately 60:20 ratio. For the purpose of our experiments, the lookback parameter is tested between 1 and 7 and the lookahead parameter is tested between 5 and 7 in order to obtain reasonable future predictions. All parameters have been tested while maintaining constraints on the indices of the train, validation, and test sets.
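The idea of partitioning on time and scoring with RMSE can be sketched as follows. This is a simplified single split that keeps chronological order; it is not the exact three-split scheme of Figure 4.3, and the function names and the series values are made up for illustration.

```python
import math


def chronological_split(series, train_fraction=0.8):
    """Split a time series so that every training point precedes validation."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up cumulative case counts for ten consecutive days.
cumulative_cases = [1, 2, 4, 7, 11, 16, 22, 29, 37, 46]
train, validation = chronological_split(cumulative_cases)
```

Unlike a random shuffle, this split never lets the model see values from dates later than those it is asked to predict, which is the temporal constraint described above.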


Figure 4.3: Splitting data for the proposed LSTM model

4.2.2 Model Design

The Python framework Keras [75], with TensorFlow as the backend, was used for the proposed LSTM model.

For all experiments, the LSTM has 10 layers and is trained for 10,000 epochs with batch size 10. Normalization is carried out on the logarithmic scale and Adam [76] is used as the optimizer with learning rate 0.001. These parameters have been fixed based on validation performance. The parameters tuned in particular are the window parameters lookback and lookahead, depending on the specific location data. Here, lookback indicates the window size, that is, the number of consecutive days the LSTM is required to “look back” before providing a prediction. The lookahead parameter denotes the number of forthcoming days the model has to “look ahead” when making a prediction for a date in the future. Both parameters are applied to the respective data (training, validation, and test) by index slicing: data[:-(lookahead)], divided into sets of lookback days according to time, forms the input of the model, and data[(lookback+lookahead):] forms the output of the model.
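One possible reading of this index slicing is sketched below in plain Python. The helper name make_windows and the pairing of each window with a target are assumptions; the original code may arrange the arrays differently, and log-scale normalization is omitted here.

```python
def make_windows(data, lookback, lookahead):
    """Pair each window of `lookback` days with a target further ahead.

    Assumes lookahead >= 1, since data[:-0] would be an empty slice.
    """
    inputs_source = data[:-lookahead]           # days available as model input
    targets = data[lookback + lookahead:]       # values the model should predict
    windows = [inputs_source[i:i + lookback]
               for i in range(len(inputs_source) - lookback + 1)]
    # zip stops at the shorter sequence, so surplus windows are dropped.
    return list(zip(windows, targets))


# Made-up series: day indices stand in for cumulative counts.
series = list(range(10))
pairs = make_windows(series, lookback=3, lookahead=2)
```

With lookback 3 and lookahead 2 on this series, the first pair is the window [0, 1, 2] with target 5 and the last pair is the window [4, 5, 6] with target 9, so larger lookahead values push each target further into the future relative to its window.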

The model is generally able to project future predictions of cumulative COVID-19 confirmed cases or deaths. For research purposes, we plot and compare the predicted values after training on the tuned hyperparameters and the actual values of cumulative confirmed cases or deaths for those particular dates in order to obtain suitable interpretations. All our analyses of the results and figures for various locations are discussed in depth in Section 5.1.


4.3 PROPOSED METHOD ON COVID-19 GENETICS

A recently proposed approach for identifying COVID-19 sequences with a machine learning based alignment-free method [43] used machine learning models in conjunction with a decision tree classifier to yield results. However, although its test results are very accurate, the approach is time-consuming for some test cases. We propose a machine learning approach using neural networks to train and test on COVID-19 genetic sequences. As with the sequences used in the baseline method [43], we obtain our data from SourceForge [77], NCBI, and GISAID.

In our proposed method, we use k = 7 for the k-mers and 10-fold cross-validation in order to compare our results with those of the baseline method [43]. In the processing steps, we remove any ‘N’ bases, retaining only the nucleotides ‘A’, ‘C’, ‘T’, and ‘G’. We then use a natural language processing approach to convert the sequences into a ‘genetic language’, using CountVectorizer to extract unigrams of 7-mers, i.e., sets of seven contiguous DNA nucleotides, from each genetic sequence. Each of these vectors is then normalized with StandardScaler, and principal component analysis (PCA) is applied to reduce each vector to a length of 100 or the size of the training set, whichever is smaller, in order to accommodate finer taxonomic classification. Our neural network model is a Multilayer Perceptron with 4 hidden layers of size 80, trained for 1000 iterations. We use the Python library scikit-learn [78] for the proposed neural network model. All parameters have been tuned based on experimentation. We now describe each of the tests carried out in our experiments, which are adopted from the baseline method [43]:

1. Test-1 : Model trained and cross-validated on Adenoviridae, Anelloviridae, Caudovirales, Geminiviridae, Genomoviridae, Microviridae, Ortervirales, Papillomaviridae, Parvoviridae, Polydnaviridae, Polyomaviridae, and Riboviria sequences. For testing, COVID-19 sequences should be classified as Riboviria.

2. Test-2 : Model trained and cross-validated on Betaflexiviridae, Bromoviridae, Caliciviridae, Coronaviridae, Flaviviridae, Peribunyaviridae, Phenuiviridae, Picornaviridae, Potyviridae, Reoviridae, Rhabdoviridae, and Secoviridae sequences. For testing, COVID-19 sequences should be classified as Coronaviridae.

3. Test-3a : Model trained and cross-validated on Alphacoronavirus, Betacoronavirus, Deltacoronavirus, and Gammacoronavirus sequences. For testing, COVID-19 sequences should be classified as Betacoronavirus.


4. Test-3b : Model trained and cross-validated on a smaller subset with same proportions of Alphacoronavirus, Betacoronavirus, and Deltacoronavirus sequences. For testing, COVID-19 sequences should be classified as Betacoronavirus.

5. Test-4 : Model trained and cross-validated on Embecovirus, Merbecovirus, Nobecovirus, and Sarbecovirus. For testing, COVID-19 sequences should be classified as Sarbecovirus.

6. Test-5 : Model trained and cross-validated on Embecovirus, Merbecovirus, Nobecovirus, Sarbecovirus, and COVID-19 virus.

7. Test-6 : Model trained and cross-validated on Sarbecovirus and COVID-19 virus.

All our results are discussed in Section 5.2.
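The sequence processing steps of this section (removing ‘N’ bases, extracting 7-mers, and choosing the reduced vector length) can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the thesis pipeline: collections.Counter stands in for CountVectorizer, the helper names (kmer_counts, to_vector, pca_length) are hypothetical, the two toy sequences are made up, and the StandardScaler, PCA, and Multilayer Perceptron stages are omitted.

```python
from collections import Counter


def kmer_counts(sequence, k=7):
    """Count overlapping k-mers after removing 'N' bases."""
    cleaned = sequence.upper().replace("N", "")
    return Counter(cleaned[i:i + k] for i in range(len(cleaned) - k + 1))


def to_vector(counts, vocabulary):
    """Lay the counts out in a fixed k-mer order shared by all sequences."""
    return [counts.get(kmer, 0) for kmer in vocabulary]


def pca_length(n_training_sequences, target=100):
    """Reduced vector length: 100 or the training-set size, whichever is smaller."""
    return min(target, n_training_sequences)


toy_sequences = ["ACGTNACGTACG", "ACGTACGTTTT"]  # made-up fragments, not real genomes
counts = [kmer_counts(s) for s in toy_sequences]
vocabulary = sorted(set().union(*counts))          # every 7-mer seen in any sequence
vectors = [to_vector(c, vocabulary) for c in counts]
```

Each entry of vectors is a count vector over the shared 7-mer vocabulary, which is the form that would then be scaled, reduced, and passed to the classifier.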


CHAPTER 5: EXPERIMENTS AND RESULTS

5.1 RESULTS ON COVID-19 DYNAMICS

Our results based on our proposed LSTM model are discussed in two main parts: (i) the prediction models showing predictions of COVID-19 cumulative confirmed cases and deaths for the United States (at the country and county levels) and other globally affected areas and (ii) comparative evaluations on mitigation measures for several affected US counties in the current situation. For all our graphs, dates in January, February, March, April, May, and June have been prefixed with ‘J’, ‘F’, ‘M’, ‘A’, ‘MA’, and ‘JU’ respectively followed by the day of that particular month.

For all the graphs, we plot the predicted values and actual values, as well as a baseline computed as the moving average of the actual values over the lookback window. In general, we observed that predicting the cumulative confirmed cases resulted in higher RMSE values than predicting the cumulative deaths. We also observed that shorter values of lookback and lookahead brought about lower values for both train RMSE and test RMSE. For our proposed LSTM model, it is of more interest to interpret the LSTM predictions than to study the RMSE values alone. Our experimental results show the predictability of COVID-19 using the LSTM model with differing window sizes. In this context, we base our discussion on mitigation measures: if the actual values exceed those of the LSTM model, we infer that mitigation measures have not worked well, and if the actual values are identical to or less than those of the LSTM model, we infer that the mitigation measures have worked well. Our experiments indicate that mitigation factors introduce a perturbation relative to what has been learnt. However, several other factors, such as the effects of the weather, holidays, and new tests for the COVID-19 virus, may also contribute to an increase in infections.

5.1.1 Prediction Models

Prediction Models for the United States of America and Other Globally Affected Areas

We begin by showing results obtained for the United States of America and other locations affected globally, namely (in order from current higher incidence of COVID-19 infections to flatter curves), (i) United States, (ii) India, (iii) Japan, (iv) Italy, and (v) Hubei, China. The COVID-19 cumulative confirmed cases as of June 30, 2020 for these locations have been plotted on the map shown in Figure 5.1. All maps have been generated using the Python plotly library [79].

Figure 5.1: World map showing COVID-19 cases for US and other countries being studied

The plots of training and predictions for cumulative COVID-19 confirmed cases and deaths in the United States of America are shown in Figure 5.2. The United States differed in lockdown and phased reopening strategies depending on the various states [80]. COVID-19 predictions on cumulative cases suggest the cases are still rising. Actual COVID-19 cases in the nation are also increasing. The LSTM predictions for deaths show a slight flattening trend compared to the COVID-19 cases. All the curves suggest that stronger measures may need to be continued in order to curb infections.


Figure 5.2: Predictions for the United States of America

The plots of training and predictions for cumulative COVID-19 confirmed cases and deaths in India are shown in Figure 5.3. Nationwide lockdowns were carried out in India in many phases based on the order of March 24 [81]. The predicted cumulative infections show that the mitigation measures are working, though stronger enforcement may be needed to lower the rate of incidence of cases. More measures are also needed for deaths, which show an increasing trend compared to the predictions.

Figure 5.3: Predictions for India

The plots of training and predictions for cumulative COVID-19 confirmed cases and deaths in Japan are shown in Figure 5.4. Several mitigation measures were implemented in Japan in March 2020 to curb COVID-19 propagation [82]. The small increase in actual cases, in contrast to the flattening shown by the LSTM predictions, denotes that the mitigation measures have not yet fully worked. The LSTM predictions on deaths have a trend similar to the confirmed cases.

Figure 5.4: Predictions for Japan

The plots of training and predictions for cumulative COVID-19 confirmed cases and deaths in Italy are shown in Figure 5.5. Italy carried out strict lockdowns in various areas beginning February 23 [83]. The LSTM predictions show the same flattening trend as the actual cumulative COVID-19 cases, which implies that the mitigation measures have been successful. The LSTM model is capable of predicting the flattened curve as achieved by Italy. The deaths in Italy show flattening similar to that of the cases. The slight deviation for the predictions indicates that some mitigation measures may still be needed.


The plots of training and predictions for cumulative COVID-19 confirmed cases and deaths in Hubei, China, where COVID-19 was initially observed in Wuhan City [27], are shown in Figure 5.6. Mitigation measures were implemented there in January [84]. The flattened curves suggest that the measures have worked. Similar trends occur for deaths, but some enforcement may still be needed.


Prediction Models for Counties Within the United States

We now discuss results obtained for counties within the United States of America that were affected during COVID-19, namely (in order from current higher incidence of COVID-19 infections to flatter curves), (i) Los Angeles, California, (ii) Dallas, Texas, (iii) Hillsborough, Florida, (iv) Maricopa, Arizona, (v) King, Washington, (vi) Fulton, Georgia, (vii) Cook, Illinois, and (viii) New York City, New York. The COVID-19 cumulative confirmed cases as of June 30, 2020 for these locations have been plotted on the respective state maps. All state maps have been generated using the Python plotly library [79]. We discuss mitigation measures in light of the varying lockdown and reopening strategies in different states [85].

Figure 5.8 shows the plots of predictions for COVID-19 confirmed cases and deaths in Los Angeles, California shown in the map in Figure 5.7. The cases and deaths continually show a rise which is captured by the LSTM predictions as more data is introduced. However, the predictions show a slight divergence from the actual curve, which indicates that more mitigation measures may be necessary.


Figure 5.10 shows the plots of predictions for COVID-19 confirmed cases and deaths in Dallas, Texas shown in the map in Figure 5.9. In this case too, the LSTM learns the trend better with introduction of more data. Cases and deaths are shown to be rising by the end of June, which is indicated by both the actual and predicted curves. However, stronger mitigation measures may need to be introduced in this case as well since the actual values are well above the LSTM predictions by the end of June.


Figure 5.12 shows the plots of predictions for COVID-19 confirmed cases and deaths in Hillsborough, Florida shown in the map in Figure 5.11. The cases show a rising trend that is captured by both the actual and predicted LSTM curves. However, for deaths, there is considerable divergence between the actual and predicted curves. This indicates further need of implementing stronger mitigation measures.


Figure 5.14 shows the plots of predictions for COVID-19 confirmed cases and deaths in Maricopa, Arizona shown in the map in Figure 5.13. Arizona showed a sudden rise in cases in June. As indicated by the LSTM predictions and actual cases, the mitigation measures were working until June 10, though there was a steady rise in cases. However, by June 30, the divergence between the actual and predicted curves for both cases and deaths shows that mitigation measures did not work in the later part of June.


Figure 5.16 shows the plots of predictions for COVID-19 confirmed cases and deaths in King, Washington shown in the map in Figure 5.15. The plots indicate that the mitigation measures were working well until the last week of June, when a slight divergence appears between the LSTM and actual curves for both infections and deaths.


Figure 5.18 shows the plots of predictions for COVID-19 confirmed cases and deaths in Fulton, Georgia shown in the map in Figure 5.17. Based on the graphs, the mitigation measures are working in terms of the deaths but not for infections as indicated by the divergence in late June.


Figure 5.20 shows the plots of predictions for COVID-19 confirmed cases and deaths in Cook, Illinois shown in the map in Figure 5.19. The predictions are able to capture the trend of flattening for both cases and deaths. The mitigation measures are working for both cases and deaths based on the graphs. However, the slight increasing trend for both cases and deaths indicates that prolonged mitigation measures may be required.


Figure 5.22 shows the plots of predictions for COVID-19 confirmed cases and deaths in New York City, New York shown in the map in Figure 5.21. New York was one of the most severely affected areas of the United States. Counties in New York implemented lockdowns in March [80]. The actual and predicted curves indicate a flattening of the curve for both infections and deaths. However, while mitigation measures are working based on the infections, the slight divergence between the LSTM predictions and actual values for deaths indicates that more measures may need to be enforced before completely lifting restrictions.


5.1.2 How Well are Social Distancing and Other Mitigation Measures Working?

We now evaluate the present mitigation measures in various counties in the United States. The aim is to evaluate whether the current mitigation measures have worked for the better or worse. The models discussed are the results generated from our proposed LSTM model. To begin, we consider two different window sizes – a shorter and a longer window size. For this purpose, the lookback parameter is fixed as 1 and the line of best fit is computed on the rate of change of the difference between a smaller lookahead parameter of 1 and a larger lookahead parameter of 5. These analyses can additionally be performed on individual parameters. In this context, for experimental purposes the rate on the difference is computed so that a smoother curve is obtained. For these analyses too, the model can be extended for the near future. This model is valuable for specific locations to make decisions on whether to tighten or loosen their lockdown policies based on the degree to which the mitigation measures are working.

Figure 5.23 and Figure 5.24 show the various plots measuring the extent to which mitigation measures are working on rates of COVID-19 infections and deaths in eight different counties in the United States: (i) Los Angeles, California, (ii) Dallas, Texas, (iii) Hillsborough, Florida, (iv) Maricopa, Arizona, (v) King, Washington, (vi) Fulton, Georgia, (vii) Cook, Illinois, and (viii) New York City, New York. The best fit plots denote the line of best fit in each figure whereas the LSTM rate plots denote the computed rate of difference between the two LSTM models as mentioned earlier. The line of best fit provides important insights on the degree to which mitigation measures are currently working based on data till June 30.

The slope denotes the direction in which mitigation measures are trending – a positive slope means current social distancing and mitigation measures are not working well while a negative slope means the current measures are working as expected. A positive slope denotes a rise in the rate of infections or deaths while a negative slope denotes a fall in the rate of infections or deaths. The intercept provides an approximate estimate of the increase or decrease in the rate of infections or the rate of deaths based on a positive or negative value. The RMSE value provides an approximation of whether the rates of infections or deaths have stabilized.
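The quantities discussed above (slope, intercept, and RMSE of the line of best fit) can be computed as in the following sketch. It assumes one reading of the procedure: the gap between the longer-lookahead and shorter-lookahead prediction curves is differenced day over day, and an ordinary least-squares line is fitted against the day index. The function names, the order of subtraction, and the toy prediction values are assumptions for illustration only.

```python
import math


def rate_of_difference(short_predictions, long_predictions):
    """Day-over-day change in the gap between the two prediction curves."""
    gap = [long_value - short_value
           for short_value, long_value in zip(short_predictions, long_predictions)]
    return [later - earlier for earlier, later in zip(gap, gap[1:])]


def best_fit_line(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns slope, intercept, RMSE."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, values)]
    fit_rmse = math.sqrt(sum(r * r for r in residuals) / n)
    return slope, intercept, fit_rmse


# Toy prediction curves (not thesis data): lookahead 1 versus lookahead 5.
short_curve = [0, 0, 0, 0, 0, 0]
long_curve = [0, 1, 4, 9, 16, 25]
rates = rate_of_difference(short_curve, long_curve)
slope, intercept, fit_rmse = best_fit_line(rates)
```

In this toy case the rates grow linearly, so the fitted slope is positive and the fit RMSE is zero; under the interpretation given above, a positive slope would indicate that measures are not working well.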

Figure 5.23: Evaluating effects of mitigation measures in various counties in the United States. Panels: (a) rate of COVID-19 infections and (b) rate of COVID-19 deaths in Los Angeles, California; (c) infections and (d) deaths in Dallas, Texas; (e) infections and (f) deaths in Hillsborough, Florida; (g) infections and (h) deaths in Maricopa, Arizona.

Figure 5.24: (Continued) Evaluating effects of mitigation measures in various counties in the United States. Panels: (a) rate of COVID-19 infections and (b) rate of COVID-19 deaths in King, Washington; (c) infections and (d) deaths in Fulton, Georgia; (e) infections and (f) deaths in Cook, Illinois; (g) infections and (h) deaths in New York City, New York.

In order to provide a quantified evaluation of the degree to which mitigation measures and social distancing are working, we compare the slope, intercept, and RMSE values of the line of best fit for the counties as shown in Table 5.1.


Table 5.1: Comparing effects of mitigation measures in various counties in the United States

                           Mitigation Measures for          Mitigation Measures for
                           COVID-19 Infections              COVID-19 Deaths
County                     Slope     Intercept  RMSE        Slope     Intercept  RMSE
Los Angeles, California    16.31     -22.57     617.21      -0.25     7.00       15.47
Dallas, Texas              -6.07     10.01      100.17      0.08      -0.38      3.73
Hillsborough, Florida      4.98      8.58       134.57      -0.03     0.31       1.97
Maricopa, Arizona          -10.73    165.94     409.28      0.08      2.44       5.97
King, Washington           1.52      -1.80      24.73       -0.00     0.27       2.19
Fulton, Georgia            1.94      1.65       44.21       -0.02     0.25       4.36
Cook, Illinois             10.78     -160.34    97.76       -0.37     1.80       21.24
New York City, New York    1.14      35.57      65.40       -0.24     6.99       5.89

Mitigation measures have not worked in Los Angeles, California, as indicated by the positive slope for the rate of infections, even though the rate of deaths is slowing slightly. Similarly, mitigation measures have not worked in Hillsborough, Florida and Fulton, Georgia, where the slope for the rate of infections is positive even though the rate of deaths is slowing slightly. The same trends are also observed in Cook, Illinois and New York City, New York, which continue to show rising infection rate trends during the second COVID-19 surge. King, Washington shows a rising trend in infection rates, though the death rates seem to be stable. Though Dallas, Texas and Maricopa, Arizona show decreasing trends for infection rates, the increasing trends for death rates show that the mitigation measures are not yet working as expected. A noteworthy trend is that COVID-19 death rates are declining or close to stabilization in the eight counties in the United States being considered. This may be attributable to the rapid increase in testing and better treatment provided at hospitals, which brought down the mortality rate and increased the recovery rate.

High fluctuations in infections or deaths, as indicated by the high RMSE values for infection rates in Los Angeles, California and Maricopa, Arizona, denote that reopening measures may need to be considered carefully, as the infection or death rates have not yet stabilized at such locations.


Table 5.2: Evaluation of mitigation measures in various counties in the United States

                           Success of Mitigation Measures    Success of Mitigation Measures
County                     based on COVID-19 Infections      based on COVID-19 Deaths
Los Angeles, California    ✗                                 ✓
Dallas, Texas              ✓                                 ✗
Hillsborough, Florida      ✗                                 ✓
Maricopa, Arizona          ✓                                 ✗
King, Washington           ✗                                 ✓
Fulton, Georgia            ✗                                 ✓
Cook, Illinois             ✗                                 ✓
New York City, New York    ✗                                 ✓

Table 5.2 compares the success of mitigation measures for COVID-19 infections and deaths, where ✓ denotes mitigation measures being successful and ✗ denotes mitigation measures not being successful. Table 5.2 gives an overall view of the success of mitigation measures depending on the analyses carried out in Table 5.1.
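The relationship between the two tables can be checked with a short sketch that applies the slope rule to the slopes reported in Table 5.1. The rule used here, that a positive slope means the measures are not successful and any other slope means they are, is an assumed formalization of the interpretation given earlier; treating the rounded value -0.00 as non-positive is likewise an assumption. The names SLOPES and measures_successful are hypothetical.

```python
# Slopes of the line of best fit copied from Table 5.1: (infections, deaths).
SLOPES = {
    "Los Angeles, California": (16.31, -0.25),
    "Dallas, Texas": (-6.07, 0.08),
    "Hillsborough, Florida": (4.98, -0.03),
    "Maricopa, Arizona": (-10.73, 0.08),
    "King, Washington": (1.52, -0.00),
    "Fulton, Georgia": (1.94, -0.02),
    "Cook, Illinois": (10.78, -0.37),
    "New York City, New York": (1.14, -0.24),
}


def measures_successful(slope):
    """Assumed rule: a positive slope means the measures are not working."""
    return not slope > 0


# Map each county to (success for infections, success for deaths).
evaluation = {
    county: (measures_successful(infections), measures_successful(deaths))
    for county, (infections, deaths) in SLOPES.items()
}
```

Under this rule, only Dallas, Texas and Maricopa, Arizona are marked successful for infections, and the same two counties are the only ones marked unsuccessful for deaths.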

Therefore, different effects and evaluations of mitigation measures have been studied in various parts of the United States. Mitigation measures in various places may still need contact tracing, phased reopening, and rapid testing before a regular lifestyle can be resumed.

5.2 RESULTS ON COVID-19 GENETICS

We now discuss our results on classification of COVID-19 genetic sequences at various taxonomic levels. Table 5.3 and Table 5.4 show our results compared with the baseline model [43] on their original set of 29 COVID-19 sequences and our randomly sampled set of 29 COVID-19 sequences respectively. We consider only full-length gene sequences for all our experiments. All cross-validation (CV) accuracies are given in percentages and times in seconds. Our proposed model gives better cross-validation accuracy than the baseline for Test-1, Test-2, and Test-4 as shown in Table 5.3 and for Test-1, Test-2, Test-4, and Test-5 as shown in Table 5.4. Our proposed method also takes less time than the baseline for Test-1, Test-2, Test-3a, and Test-3b as shown in Table 5.3 and for Test-1, Test-2, and Test-3b as shown in Table 5.4. For all tests, our model performs on par with the baseline in terms of test results.


Table 5.3: Comparison of Proposed Model with the Baseline Model [43] on 29 COVID-19 sequences as of January 27, 2020

Train Data  Test Data  CV Baseline  CV Proposed  Test Baseline  Test Proposed  Time Baseline  Time Proposed
Test-1      COVID-19   95.0         98.3         100            100            323.33         138.43
Test-2      COVID-19   93.1         97.8         100            100            222.86         109.74
Test-3a     COVID-19   98.1         96.1         100            100            11.95          10.62
Test-3b     COVID-19   100          96.7         100            100            5.56           3.96
Test-4      COVID-19   98.4         100.0        100            100            5.48           7.06
Test-5      COVID-19   98.7         98.0         -              -              2.86           8.34
Test-6      COVID-19   100.0        100.0        -              -              2.14           4.89

Table 5.4: Comparison of Proposed Model with the Baseline Model [43] on 29 sampled COVID-19 sequences as of June 30, 2020

Train Data  Test Data  CV Baseline  CV Proposed  Test Baseline  Test Proposed  Time Baseline  Time Proposed
Test-1      COVID-19   95.0         98.35        100            100            323.33         127.48
Test-2      COVID-19   93.1         97.94        100            100            222.86         97.72
Test-3a     COVID-19   98.1         98.07        100            100            11.95          13.38
Test-3b     COVID-19   100          98.33        100            100            5.56           4.14
Test-4      COVID-19   98.4         100.0        100            100            5.48           8.41
Test-5      COVID-19   98.7         99.33        -              -              2.86           10.78
Test-6      COVID-19   100.0        98.75        -              -              2.14           4.71

Code and files from our experiments are available at https://github.com/sayantanibasu/covid19-models.


CHAPTER 6: CONCLUSION

The COVID-19 pandemic had an adverse effect worldwide and led to a massive number of infections and deaths. All countries took important mitigation measures to fight the pandemic by enforcing social distancing, lockdowns, mandatory face coverings, frequent sanitizing, and a variety of other mitigation measures. In this thesis, we study the dynamics and genetics of COVID-19. In order to study the dynamics of COVID-19, we propose an LSTM based model that can be trained on cumulative numbers of COVID-19 confirmed cases and deaths. Our proposed model predicts and provides useful insights at both the country and county levels. We have obtained results at three points of time – May 11, June 10, and June 30 for the United States of America and some other globally affected regions. We also quantitatively evaluate the degree to which mitigation measures are working based on the rate of infections and deaths. This can aid countries and counties while deciding on phased reopening and mitigation strategies. In order to study the genetics of COVID-19, we propose an alignment-free MLP based model that is trained at various taxonomic levels to study the ‘genetic language’ and classify genetic sequences of COVID-19 at each taxonomic level. Our proposed approach shows higher cross-validated accuracy and reduced time on several tests and performs at par with the baseline model in identifying COVID-19 genetic sequences. We hope that our proposed work and results on COVID-19 will help contribute to society.


REFERENCES

[1] S. Basu and R. H. Campbell, “Going by the numbers: learning and modeling COVID-19 disease dynamics,” Chaos, Solitons & Fractals, July 2020, doi: 10.1016/j.chaos.2020.110140.

[2] S. Basu and R. H. Campbell, “Going by the numbers: learning and modeling COVID-19 disease dynamics,” medRxiv, 2020.

[3] “COVID-19 Statistics,” https://news.google.com/covid19/map?hl=en-US&mid=/m/02j71&gl=US&ceid=US:en, 2020.

[4] V. Surveillances, “The epidemiological characteristics of an outbreak of 2019 novel coronavirus diseases (COVID-19)—China, 2020,” China CDC Weekly, vol. 2, no. 8, pp. 113–122, 2020.

[5] “Modes of transmission of virus causing COVID-19: implications for IPC precaution recommendations: scientific brief, 27 March 2020,” World Health Organization, Tech. Rep., 2020.

[6] S. Recalcati, “Cutaneous manifestations in COVID-19: a first perspective,” Journal of the European Academy of Dermatology and Venereology, 2020.

[7] J. Gu, B. Han, and J. Wang, “COVID-19: gastrointestinal manifestations and potential fecal–oral transmission,” Gastroenterology, vol. 158, no. 6, pp. 1518–1519, 2020.

[8] J. Shi, Z. Wen, G. Zhong, H. Yang, C. Wang, B. Huang, R. Liu, X. He, L. Shuai, Z. Sun et al., “Susceptibility of ferrets, cats, dogs, and other domesticated animals to SARS–coronavirus 2,” Science, 2020.

[9] L. Dong, S. Hu, and J. Gao, “Discovering drugs to treat coronavirus disease 2019 (COVID-19),” Drug discoveries & therapeutics, vol. 14, no. 1, pp. 58–60, 2020.

[10] J. Stebbing, A. Phelan, I. Griffin, C. Tucker, O. Oechsle, D. Smith, and P. Richardson, “COVID-19: combining antiviral and anti-inflammatory treatments,” The Lancet Infectious Diseases, vol. 20, no. 4, pp. 400–402, 2020.

[11] F. Touret and X. de Lamballerie, “Of chloroquine and COVID-19,” Antiviral Research, p. 104762, 2020.

[12] P. Horby, W. S. Lim, J. Emberson, M. Mafham, J. Bell, L. Linsell, N. Staplin, C. Brightling, A. Ustianowski, E. Elmahi et al., “Effect of dexamethasone in hospitalized patients with COVID-19: Preliminary report,” medRxiv, 2020.

[13] M. L. Ranney, V. Griffeth, and A. K. Jha, “Critical supply shortages—the need for ventilators and personal protective equipment during the COVID-19 pandemic,” New England Journal of Medicine, 2020.

[14] “Rational use of personal protective equipment for coronavirus disease (COVID-19): interim guidance, 27 February 2020,” World Health Organization, Tech. Rep., 2020.

[15] F. E. Alvarez, D. Argente, and F. Lippi, “A simple planning problem for COVID-19 lockdown,” National Bureau of Economic Research, Tech. Rep., 2020.

[16] “COVID-19 Projections (IHME),” https://covid19.healthdata.org/united-states-of-america, 2020.

[17] S. Das, P. Ghosh, B. Sen, and I. Mukhopadhyay, “Critical community size for COVID-19–a model based approach to provide a rationale behind the lockdown,” arXiv preprint arXiv:2004.03126, 2020.

[18] T. Greenhalgh, G. C. H. Koh, and J. Car, “COVID-19: a remote assessment in primary care,” BMJ, vol. 368, 2020.

[19] J. E. Hollander and B. G. Carr, “Virtually perfect? Telemedicine for COVID-19,” New England Journal of Medicine, 2020.

[20] R. M. Anderson, H. Heesterbeek, D. Klinkenberg, and T. D. Hollingsworth, “How will country-based mitigation measures influence the course of the COVID-19 epidemic?” The Lancet, vol. 395, no. 10228, pp. 931–934, 2020.

[21] CDC COVID-19 Response Team, “Severe outcomes among patients with coronavirus disease 2019 (COVID-19)—United States, February 12–March 16, 2020,” MMWR Morb Mortal Wkly Rep, vol. 69, no. 12, pp. 343–346, 2020.

[22] A. T. Cruz and S. L. Zeichner, “COVID-19 in children: initial characterization of the pediatric disease,” Pediatrics, 2020.

[23] Q. Zhao, M. Meng, R. Kumar, Y. Wu, J. Huang, N. Lian, Y. Deng, and S. Lin, “The impact of COPD and smoking history on the severity of COVID-19: a systemic review and meta-analysis,” Journal of medical virology, 2020.

[24] M. Bansal, “Cardiovascular disease and COVID-19,” Diabetes & Metabolic Syndrome: Clinical Research & Reviews, 2020.

[25] Y. Bai, L. Yao, T. Wei, F. Tian, D.-Y. Jin, L. Chen, and M. Wang, “Presumed asymptomatic carrier transmission of COVID-19,” JAMA, vol. 323, no. 14, pp. 1406–1407, 2020.

[26] C.-C. Lai, Y. H. Liu, C.-Y. Wang, Y.-H. Wang, S.-C. Hsueh, M.-Y. Yen, W.-C. Ko, and P.-R. Hsueh, “Asymptomatic carrier state, acute respiratory disease, and pneumonia due to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2): facts and myths,” Journal of Microbiology, Immunology and Infection, 2020.

[27] R. A. Neher, R. Dyrdak, V. Druelle, E. B. Hodcroft, and J. Albert, “Potential impact of seasonal forcing on a SARS-CoV-2 pandemic,” Swiss Medical Weekly, vol. 150, no. 1112, 2020.


[28] W. O. Kermack and A. G. McKendrick, “Contributions to the mathematical theory of epidemics–I. 1927.” 1991.

[29] “COVID-19 Scenarios,” https://neherlab.org/covid19/, 2020.

[30] “COVID-19 Hospital Impact Model for Epidemics,” https://penn-chime.phl.io, 2020.

[31] Z. Hu, Q. Ge, L. Jin, and M. Xiong, “Artificial intelligence forecasting of COVID-19 in China,” arXiv preprint arXiv:2002.07112, 2020.

[32] S. K. Bandyopadhyay and S. Dutta, “Machine Learning approach for confirmation of COVID-19 cases: positive, negative, death and release,” medRxiv, 2020.

[33] Z. Yang, Z. Zeng, K. Wang, S.-S. Wong, W. Liang, M. Zanin, P. Liu, X. Cao, Z. Gao, Z. Mai et al., “Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions,” Journal of Thoracic Disease, vol. 12, no. 3, p. 165, 2020.

[34] Z. Wang, L. Li, J. Yan, and Y. Yao, “Evaluating the traditional Chinese medicine (TCM) officially recommended in China for COVID-19 using ontology-based side-effect prediction framework (OSPF) and deep learning,” Preprints 2020, 2020020230, 2020.

[35] Y. Wang, M. Hu, Q. Li, X.-P. Zhang, G. Zhai, and N. Yao, “Abnormal respiratory patterns classifier may contribute to large-scale screening of people infected with COVID-19 in an accurate and unobtrusive manner,” arXiv preprint arXiv:2002.05534, 2020.

[36] S. Wang, B. Kang, J. Ma, X. Zeng, M. Xiao, J. Guo, M. Cai, J. Yang, Y. Li, X. Meng et al., “A deep learning algorithm using CT images to screen for corona virus disease (COVID-19),” medRxiv, 2020.

[37] Y. Song, S. Zheng, L. Li, X. Zhang, X. Zhang, Z. Huang, J. Chen, H. Zhao, Y. Jie, R. Wang et al., “Deep learning enables accurate diagnosis of novel coronavirus (COVID-19) with CT images,” medRxiv, 2020.

[38] L. Li, L. Qin, Z. Xu, Y. Yin, X. Wang, B. Kong, J. Bai, Y. Lu, Z. Fang, Q. Song et al., “Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT,” Radiology, p. 200905, 2020.

[39] P. K. Sethy and S. K. Behera, “Detection of coronavirus disease (COVID-19) based on deep features,” Preprints 2020, 2020030300, 2020.

[40] J. D. Thompson, F. Plewniak, and O. Poch, “A comprehensive comparison of multiple sequence alignment programs,” Nucleic acids research, vol. 27, no. 13, pp. 2682–2690, 1999.

[41] S. Vinga and J. Almeida, “Alignment-free sequence comparison—a review,” Bioinformatics, vol. 19, no. 4, pp. 513–523, 2003.


[42] A. Zielezinski, S. Vinga, J. Almeida, and W. M. Karlowski, “Alignment-free sequence comparison: benefits, applications, and tools,” Genome biology, vol. 18, no. 1, p. 186, 2017.

[43] G. S. Randhawa, M. P. Soltysiak, H. El Roz, C. P. de Souza, K. A. Hill, and L. Kari, “Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study,” Plos One, vol. 15, no. 4, p. e0232391, 2020.

[44] R. Thompson, “Pandemic potential of 2019-nCoV,” The Lancet Infectious Diseases, vol. 20, no. 3, p. 280, 2020.

[45] C.-C. Lai, T.-P. Shih, W.-C. Ko, H.-J. Tang, and P.-R. Hsueh, “Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and corona virus disease-2019 (COVID-19): the epidemic and the challenges,” International journal of antimicrobial agents, p. 105924, 2020.

[46] D. Wang, B. Hu, C. Hu, F. Zhu, X. Liu, J. Zhang, B. Wang, H. Xiang, Z. Cheng, Y. Xiong et al., “Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus–infected pneumonia in Wuhan, China,” JAMA, 2020.

[47] D. Chang, M. Lin, L. Wei, L. Xie, G. Zhu, C. S. D. Cruz, and L. Sharma, “Epidemiologic and clinical characteristics of novel coronavirus infections involving 13 patients outside Wuhan, China,” JAMA, 2020.

[48] W. G. Carlos, C. S. Dela Cruz, B. Cao, S. Pasnick, and S. Jamil, “Novel Wuhan (2019-nCoV) coronavirus,” American journal of respiratory and critical care medicine, vol. 201, no. 4, pp. P7–P8, 2020.

[49] “The Hammer and the Dance,” https://medium.com/@tomaspueyo/coronavirus-the-hammer-and-the-dance-be9337092b56, 2020.

[50] P. Fine, K. Eames, and D. L. Heymann, “‘Herd immunity’: a rough guide,” Clinical infectious diseases, vol. 52, no. 7, pp. 911–916, 2011.

[51] K. O. Kwok, F. Lai, W. I. Wei, S. Y. S. Wong, and J. W. Tang, “Herd immunity – estimating the level required to halt the COVID-19 epidemics in affected countries,” Journal of Infection, vol. 80, no. 6, pp. e32–e33, 2020.

[52] H. E. Randolph and L. B. Barreiro, “Herd immunity: understanding COVID-19,” Immunity, vol. 52, no. 5, pp. 737–741, 2020.

[53] A. Sud, M. E. Jones, J. Broggio, C. Loveday, B. Torr, A. Garrett, D. L. Nicol, S. Jhanji, S. A. Boyce, P. Ward et al., “Collateral damage: the impact on cancer outcomes of the COVID-19 pandemic,” 2020.

[54] L. Thunström, S. C. Newbold, D. Finnoff, M. Ashworth, and J. F. Shogren, “The benefits and costs of using social distancing to flatten the curve for COVID-19,” Journal of Benefit-Cost Analysis, pp. 1–27, 2020.

[55] M. Saez, A. Tobias, D. Varga, and M. A. Barceló, “Effectiveness of the measures to flatten the epidemic curve of COVID-19. The case of Spain,” Science of the Total Environment, p. 138761, 2020.

[56] C. C. Branas, A. Rundle, S. Pei, W. Yang, B. G. Carr, S. Sims, A. Zebrowski, R. Doorley, N. Schluger, J. W. Quinn et al., “Flattening the curve before it flattens us: hospital critical care capacity limits and mortality from novel coronavirus (SARS-CoV2) cases in US counties,” medRxiv, 2020.

[57] D. Kai, G.-P. Goldstein, A. Morgunov, V. Nangalia, and A. Rotkirch, “Universal masking is urgent in the COVID-19 pandemic: SEIR and agent based models, empirical validation, policy recommendations,” arXiv preprint arXiv:2004.13553, 2020.

[58] D. Proverbio, F. Kemp, S. Magni, A. D. Husch, A. Aalto, L. Mombaerts, A. Skupin, J. Goncalves, J. Ameijeiras-Alonso, and C. Ley, “Assessing suppression strategies against epidemic outbreaks like COVID-19: the SPQEIR model,” medRxiv, 2020.

[59] O. Coibion, Y. Gorodnichenko, and M. Weber, “The cost of the COVID-19 crisis: Lockdowns, macroeconomic expectations, and consumer spending,” National Bureau of Economic Research, Tech. Rep., 2020.

[60] N. Fernandes, “Economic effects of coronavirus outbreak (COVID-19) on the world economy,” SSRN 3557504, 2020.

[61] D. G. McNeil, “The coronavirus in America: the year ahead,” New York Times. April, vol. 20, 2020.

[62] J. H. Stock, “Data gaps and the policy response to the novel coronavirus,” National Bureau of Economic Research, Tech. Rep., 2020.

[63] L. Ferretti, C. Wymant, M. Kendall, L. Zhao, A. Nurtay, L. Abeler-Dörner, M. Parker, D. Bonsall, and C. Fraser, “Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing,” Science, vol. 368, no. 6491, 2020.

[64] J. Peto, “COVID-19 mass testing facilities could end the epidemic rapidly,” BMJ, vol. 368, 2020.

[65] J. Bedford, D. Enria, J. Giesecke, D. L. Heymann, C. Ihekweazu, G. Kobinger, H. C. Lane, Z. Memish, M.-d. Oh, A. Schuchat et al., “COVID-19: towards controlling of a pandemic,” The Lancet, vol. 395, no. 10229, pp. 1015–1018, 2020.

[66] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[67] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “LSTM: a search space odyssey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, 2016.

[68] “Bayesian Inference on SIR Model,” https://jonnylaw.rocks/blog/bayesian-inference-for-an-sir-model/, 2020.

[69] J. Lourenco, R. Paton, M. Ghafari, M. Kraemer, C. Thompson, P. Simmonds, P. Klenerman, and S. Gupta, “Fundamental principles of epidemic spread highlight the immediate need for large-scale serological surveys to assess the stage of the SARS-CoV-2 epidemic,” medRxiv, 2020.

[70] C. Imai, B. Armstrong, Z. Chalabi, P. Mangtani, and M. Hashizume, “Time series regression model for infectious disease and weather,” Environmental Research, vol. 142, pp. 319–327, 2015.

[71] A. LeNail, “NN-SVG: publication-ready neural network architecture schematics,” Journal of Open Source Software, vol. 4, no. 33, p. 747, 2019.

[72] R. Hecht-Nielsen, “Theory of the backpropagation neural network,” in Neural Networks for Perception. Elsevier, 1992, pp. 65–93.

[73] P. Sadowski, “Notes on backpropagation,” https://www.ics.uci.edu/pjsadows/notes.pdf (online), 2016.

[74] “Johns Hopkins COVID-19 Data Repository,” https://github.com/CSSEGISandData/COVID-19, 2020.

[75] F. Chollet et al., “Keras: the Python deep learning library,” ascl, pp. ascl–1806, 2018.

[76] D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[77] “Dataset for COVID-19 Genetic Approach,” https://sourceforge.net/projects/mldsp-gui/files/COVID19Dataset/, 2020.

[78] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: machine learning in Python,” The Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[79] C. Sievert, C. Parmer, T. Hocking, S. Chamberlain, K. Ram, M. Corvellec, and P. Despouy, “plotly: create interactive web graphics via plotly’s javascript graphing library [software],” 2016.

[80] A. Brzezinski, G. Deiana, V. Kecht, and D. Van Dijcke, “The COVID-19 pandemic: Government vs. community action across the United States,” COVID Economics: Vetted and Real-Time Papers, vol. 7, pp. 115–156, 2020.

[81] D. Ray, S. Subramanian, and L. Vandewalle, “India’s lockdown,” Centre for Economic Policy Research, Tech. Rep., 2020.