Predictive analytics - Leveraging Artificial Intelligence and Big Data for IFAD 2.0 Phase 2

In this workstream, which represents the core of the machine learning applications, the project developed algorithms to support the project cycle through ex-ante predictions of the performance and probability of positive impact of IFAD-supported interventions, given a specific set of portfolio and beneficiary features. Two main prediction models have been built, at project and household level, respectively. The first can identify the features associated with successful portfolio performance and guide the organization on which projects are likely to succeed; the second can strengthen household-level targeting at project design by determining the beneficiary and project-level features that drive positive impact. We also responded to the COVID-19 crisis by proposing an ML approach to enhance knowledge about the impact of the pandemic in IFAD’s beneficiary countries.

Impact of COVID-19 in IFAD beneficiary countries

COVID-19 (the disease caused by severe acute respiratory syndrome coronavirus 2, “SARS-CoV-2”) was recognized as a pandemic by the World Health Organization (WHO) on 11 March 2020, having caused economic and public health disruptions around the world. Addressing the pandemic requires economic and public health coordination at international, national and local levels, but the lack of reliable statistics is currently one of the most evident barriers to targeting the best policies.⁹ Where official statistics are not readily available or reliable, the use of “big data” can improve our ability to understand and predict the evolution of complex phenomena.

This workstream aimed at estimating the real COVID-19 incidence in selected countries using big data and machine learning approaches, in order to support IFAD’s understanding of the impact of the pandemic on beneficiary countries. A model was devised to predict regional spreading of COVID-19 in countries where IFAD operates and where official data is not available/reliable.

One of the most relevant information sources during the pandemic has been the database of the Johns Hopkins Coronavirus Resource Centre,¹⁰ which is constantly updated with data on cases and deaths. However, the data has some limitations, especially related to under-reporting of new cases. Figure 47 shows how unevenly cases per capita are distributed across countries.

Figure 48 shows the corresponding distribution of deaths per capita. In both instances, the reported prevalence of the pandemic is abnormally low in many developing nations, indicating possible under-reporting.

Figure 47 Cases per capita, as reported in the Johns Hopkins database (updated to September 2, 2020).

Figure 48 Deaths per capita, as reported in the Johns Hopkins database (updated to September 2, 2020).

Such a lack of data requires the development of solutions that correct the figures provided by Johns Hopkins. In this respect, data from internet search engines have shown high potential in the study of the spread of infections, with a significant impact on the “real time monitoring” of epidemics.¹¹ The idea underlying the approach of Ginsberg et al. (2008) is simple: Internet users who suspect they have a disease tend to look for information online about symptoms and conditions associated with it. Such searches leave a trace, producing a granular and massive dataset. A strong correlation between online search queries, provided by Google Trends, and influenza infection patterns has been confirmed by various studies, and the approach has now been applied to COVID-19.

9 Brunori, P. & Resce, G. (2020) Searching for the peak Google Trends and the Covid-19 outbreak in Italy; Fantazzini, D. (2020) Short-term forecasting of the COVID-19 pandemic using Google Trends data: Evidence from 158 countries. Applied Econometrics.

10 https://coronavirus.jhu.edu/map.html

Innovation Challenge 2020 – Final Report Leveraging Artificial Intelligence and Big Data for IFAD 2.0 – Phase 2

The maps shown in figures 49 to 53 present the distribution of searches related to the coronavirus and its symptoms, as represented by topics identified in Google Trends, namely: “coronavirus”, “cough”, “fever”, “sore throat” and “pneumonia”. Their distribution is clearly distinct from official data on contagions and deaths, with an increased presence of the topics in African countries, which supports the hypothesis of possible under-reporting.

Figure 49 Distribution of queries related to the topic “Coronavirus” on Google Trends.

Figure 50 Distribution of queries related to the topic “Cough” on Google Trends.

Figure 51 Distribution of queries related to the topic “Fever” on Google Trends.

Figure 52 Distribution of queries related to the topic “Sore Throat” on Google Trends.

Figure 53 Distribution of queries related to the topic “Pneumonia” on Google Trends.

11 Ginsberg J. et al (2008) Detecting influenza epidemics using search engine query data. Nature 457: 1012-1014; Cook S. et al (2011) Assessing Google Flu Trends Performance in the United States during the 2009 Influenza Virus A (H1N1) Pandemic. PLoS ONE 6(8); Broniatowski D.A. et al (2013) National and Local Influenza Surveillance through Twitter: An Analysis of the 2012-2013 Influenza Epidemic. PLoS ONE 8(12): e83672.

In addition to Google Trends, another web source available from Google is the COVID-19 Community Mobility Reports. Google Maps uses aggregated and anonymous data to show how crowded certain places are, which allows identification of, for example, the peak hours of a restaurant. This type of aggregated and anonymized data could be useful for making critical decisions in the mitigation of COVID-19. Indeed, responses to the pandemic worldwide have increasingly moved towards public health strategies based on restrictions on movement and social distancing, in order to slow down transmission or plan re-openings.

Information on mobility from Google is available for the following location types: food and pharmacies, parks, public transport stations, retail and leisure, residential areas, and workplaces.

To correct Johns Hopkins’ official estimates with alternative data provided by Google, it is necessary to estimate a model that adjusts the values of COVID-19 cases and deaths as certain covariates vary. Given that most of the countries where IFAD operates are possibly under-reporting coronavirus cases and deaths, the objective is to predict the number of cases and deaths as accurately as possible, through the application of the “Random Forest” machine learning algorithm.
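To make the modelling step more concrete, the following is a minimal pure-Python sketch of a random forest regressor: each tree is grown on a bootstrap sample of the data and considers a random subset of covariates at each split, and the forest averages the predictions of its trees. It is an illustration only and not the project's implementation. The two covariates (tests per 1,000 people and a search-intensity index), all numbers, and the function names `build_tree`, `fit_forest` and `predict_forest` are assumptions introduced here.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(rows, targets, n_features, depth, rng, min_leaf=2):
    """Grow one regression tree; every split sees a random subset of covariates."""
    if depth == 0 or len(rows) < 2 * min_leaf:
        return mean(targets)  # leaf node: average of the targets that reach it
    best = None
    for f in rng.sample(range(len(rows[0])), n_features):
        for threshold in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, f, threshold)
    if best is None:  # no admissible split: fall back to a leaf
        return mean(targets)
    _, f, threshold = best
    go_left = [r[f] <= threshold for r in rows]
    left = build_tree([r for r, g in zip(rows, go_left) if g],
                      [t for t, g in zip(targets, go_left) if g],
                      n_features, depth - 1, rng, min_leaf)
    right = build_tree([r for r, g in zip(rows, go_left) if not g],
                       [t for t, g in zip(targets, go_left) if not g],
                       n_features, depth - 1, rng, min_leaf)
    return (f, threshold, left, right)


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node


def fit_forest(rows, targets, n_trees=30, n_features=1, depth=3, seed=0):
    """Train n_trees trees, each on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [targets[i] for i in idx],
                                 n_features, depth, rng))
    return forest


def predict_forest(forest, row):
    """Forest prediction: the average of the individual tree predictions."""
    return mean(predict_tree(tree, row) for tree in forest)


# Made-up toy data: [tests per 1,000 people, search-intensity index 0-100]
X = [[5, 20], [8, 35], [12, 30], [20, 45], [25, 50], [30, 40],
     [40, 60], [55, 65], [60, 55], [75, 70], [90, 80], [100, 85]]
# Made-up reported cases per million for the same hypothetical countries
y = [150.0, 300.0, 420.0, 800.0, 950.0, 1000.0,
     1600.0, 2100.0, 2200.0, 2900.0, 3500.0, 3900.0]

forest = fit_forest(X, y, n_trees=30, n_features=1, depth=3, seed=42)
prediction = predict_forest(forest, [50, 75])  # hypothetical unseen country
print(round(prediction, 1))
```

In practice an established library implementation would normally be preferred over hand-written trees; the sketch is only meant to show the bootstrap, random-feature and averaging steps that define the method.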

Johns Hopkins’ data on number of cases per million and number of deaths per million have been analysed in relation to Google Trends topics, Google mobility reports, socioeconomic indicators, health indicators and governance indicators. After predicting for countries across the globe, predictions were compared with the observed cases, to highlight where there may be more deviation (i.e. under-reporting). The maps in figures 54 and 55 show that the most significant deviations refer to countries in Africa and Asia.
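The comparison between predicted and reported figures can be illustrated with a short sketch. It assumes, purely for illustration, that the deviation is measured as the ratio of predicted to reported incidence; the report does not state the exact measure used. The country names and values are made up, and `underreporting_ratio` is a hypothetical helper.

```python
def underreporting_ratio(predicted, reported):
    """Ratio of predicted to reported incidence; values above 1 suggest under-reporting.

    Countries with no positive reported figure are skipped, since the ratio
    is undefined for them.
    """
    return {
        country: predicted[country] / reported[country]
        for country in predicted
        if reported.get(country, 0) > 0
    }


# Made-up values per million for three hypothetical countries
predicted_cases = {"Country A": 1200.0, "Country B": 300.0, "Country C": 900.0}
reported_cases = {"Country A": 400.0, "Country B": 310.0, "Country C": 0.0}

ratios = underreporting_ratio(predicted_cases, reported_cases)
# Countries ordered from the largest to the smallest deviation
ranked = sorted(ratios, key=ratios.get, reverse=True)
print(ranked)
```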

Figure 54 Intensity of predicted cases compared to official Johns Hopkins cases.

Figure 55 Intensity of predicted deaths compared to official Johns Hopkins deaths.

Countries that are identified as under-reporting cases also tend to be those that are under-reporting deaths. To test this association, a Pearson correlation was computed; it is equal to 0.75 and significant (p-value < 0.001). As explained above, with Random Forest it is possible to identify which variables help to better predict the dependent variables. Regarding the number of COVID-19 cases, the most important variable is the number of tests performed, followed by GDP per capita and population. Among Google Trends topics, pneumonia and fever are the most important, and among the Google Mobility data, the most important variable is the one referring to trips to pharmacies.
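The association test described above can be sketched as follows. The function computes the Pearson correlation coefficient from its definition (covariance divided by the product of the standard deviations). The two input lists are made-up under-reporting measures for five hypothetical countries, not the report's data, and the p-value calculation is omitted.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up deviation measures (predicted / reported) for five hypothetical countries
case_gap = [3.0, 1.1, 2.4, 0.9, 1.8]
death_gap = [2.5, 1.0, 2.0, 1.2, 1.4]

r = pearson_r(case_gap, death_gap)
print(round(r, 2))
```

A positive coefficient close to 1 would indicate that countries with a large gap in cases also tend to have a large gap in deaths.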

In the case of deaths from COVID-19, the variables that best explain the number of deaths per million are health care spending and political management of violence. It is reasonable to think that lower health care expenditure can mean lower capacity to treat COVID-19 patients, which increases the risk of death. In this case too, pneumonia and fever are the Google Trends topics that best help predict the number of deaths. As far as Google Mobility data is concerned, the variable that contributes most to the prediction is park-related mobility.

Using COVID-19 incidence predictions for planning

The output from this study is a tool that can help international organisations target countries needing additional funding due to the COVID-19 outbreak.

It comprises a composite index that combines food security indicators with COVID-19 cases and deaths. The food security indicators considered are SOFI 2020¹² country level data for “prevalence of undernourishment in the total population” and “prevalence of stunting in children under 5 years of age”. For cases and deaths, the predictions developed above were employed.

Composite indices developed through data-driven methods have been extensively employed as a technique for aggregation.¹³ In particular, the data envelopment analysis (DEA) method compiles multi-dimensional metrics into one index using the combination of weights that is most favourable to the unit being evaluated.

Using DEA, the global score for each country was estimated by a linear programme. Estimating weights can be difficult and highly subjective. In line with DEA methodology, the linear programme is computed separately for each country. The weights in the objective function are chosen optimally with the purpose of maximizing the score of the evaluated country. The optimization ensures that each country is evaluated on the basis of its own best possible weights. Figure 56 shows the association between the composite index estimated using predictions and the composite index estimated on actual COVID-19 data.
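A common way to write such a country-specific linear programme is the “benefit of the doubt” form of DEA. The formulation below is a sketch under the assumption that the index follows this standard form; the report does not give its exact programme. The symbols are introduced here for illustration: y_{ci} is the normalised value of indicator i for country c, w_{ci} is the weight that country c assigns to indicator i, n is the number of countries, m the number of indicators, and CI_c the resulting composite score.

```latex
\begin{aligned}
CI_c = \max_{w_{c1},\dots,w_{cm}} \quad & \sum_{i=1}^{m} w_{ci}\, y_{ci} \\
\text{subject to} \quad & \sum_{i=1}^{m} w_{ci}\, y_{ki} \le 1 \quad \text{for every country } k = 1,\dots,n, \\
& w_{ci} \ge 0 \quad \text{for every indicator } i = 1,\dots,m.
\end{aligned}
```

Because the weights chosen for country c must keep the score of every country (including c itself) at or below 1, no country can score above 1. With the indicators described above (undernourishment, stunting, cases and deaths), m would presumably be 4.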

Overall, the correlation is positive, although there are countries showing significant differences.

Focusing on countries where IFAD has had projects, those with the largest differences in the composite index (combining food security measures with COVID-19 cases and deaths) before and after correction are: Oman (+50%), Armenia (+18%), Montenegro (+18%), Cabo Verde (+13%), Suriname (+12%), Dominican Republic (+11%), El Salvador (+11%), Republic of Moldova (+11%), Macedonia (+11%) and Honduras (+11%).

12 FAO, IFAD, UNICEF, WFP and WHO. 2020. The State of Food Security and Nutrition in the World 2020. Transforming food systems for affordable healthy diets. Rome, FAO.

13 Decancq, K. and Lugo, M. A. (2013) Weights in multidimensional indices of wellbeing: An Overview, Econometric Reviews, 32(1):7–34; Patrizii, V. et al (2017). The Cost of Well-Being. Social Indicators Research, 133(3):985–1010; Greco, S. et al (2018). On the methodological framework of composite indices: A review of the issues of weighting, aggregation, and robustness. Social Indicators Research: 1-34.

Figure 56 Correlation between the composite index estimated using predictions and the composite index estimated on actual COVID-19 data. Axes: composite index with predicted data on COVID-19; composite index with reported data on COVID-19.


Predicting performance at
