This may be the author’s version of a work that was submitted/accepted for publication in the following source:

Senanayake, Sameera, White, Nicole, Graves, Nicholas, Healy, Helen, Baboolal, Keshwar, & Kularatna, Sanjeewa

(2019)

Machine learning in predicting graft failure following kidney transplantation:

A systematic review of published predictive models.

International Journal of Medical Informatics, 130, Article number: 103957, 1-10.

This file was downloaded from: https://eprints.qut.edu.au/200343/

© 2019 Elsevier B.V.

This work is covered by copyright. Unless the document is being made available under a Creative Commons Licence, you must assume that re-use is limited to personal use and that permission from the copyright owner must be obtained for all other uses. If the document is available under a Creative Commons License (or other specified license) then refer to the Licence for details of permitted re-use. It is a condition of access that users recognise and abide by the legal requirements associated with these rights. If you believe that this work infringes copyright please provide details by email to [email protected]

License: Creative Commons: Attribution-Noncommercial-No Derivative Works 4.0

Notice: Please note that this document may not be the Version of Record (i.e. published version) of the work. Author manuscript versions (as Submitted for peer review or as Accepted for publication after peer review) can be identified by an absence of publisher branding and/or typeset appearance. If there is any doubt, please refer to the published source.

https://doi.org/10.1016/j.ijmedinf.2019.103957

Machine learning in predicting graft failure following kidney transplantation: a systematic review of published predictive models

Sameera Senanayake^a, Nicole White^a, Nicholas Graves^a, Helen Healy^b,c, Keshwar Baboolal^b,c, Sanjeewa Kularatna^a

a Australian Centre for Health Service Innovation, Queensland University of Technology, Australia

b Royal Brisbane Hospital for Women, Brisbane, Australia

c School of Medicine, University of Queensland, Australia

*Address for correspondence Sameera Senanayake

Australian Centre for Health Services Innovation,

School of Public Health, Institute of Health and Biomedical Innovation, Queensland University of Technology,

60 Musk Ave, Kelvin Grove, QLD 4059, Australia [email protected] +61450865361

Abstract – 279

Body of the article - 3605

Abstract

Introduction

Machine learning has been increasingly used to develop predictive models to diagnose different disease conditions. The heterogeneity of the kidney transplant population makes predicting graft outcomes extremely challenging. Several kidney graft outcome prediction models have been developed using machine learning, and are available in the literature.

However, a systematic review of machine learning based prediction methods applied to kidney transplant has not been done to date. The main aim of our study was to perform an in-depth systematic analysis of different machine learning methods used to predict graft outcomes among kidney transplant patients, and assess their usefulness as an aid to decision-making.

Methods

A systematic review of machine learning methods used to predict graft outcomes among kidney transplant patients was carried out using a search of the Medline, Cumulative Index to Nursing and Allied Health Literature, EMBASE, PsycINFO and Cochrane databases.

Results

A total of 295 articles were identified and extracted. Of these, 18 met the inclusion criteria.

Most of the studies were published in the United States after 2010. The population size used to develop the models varied from 80 to 92,844, and the number of features in the models ranged from 6 to 71. The most common machine learning methods used were artificial neural networks, decision trees and Bayesian belief networks. Most of the machine learning based predictive models predicted graft failure with high sensitivity and specificity. Only one machine learning based prediction model had modelled time-to-event (survival) information.

Seven studies compared the predictive performance of machine learning models with traditional regression methods; the relative performance of the machine learning methods was mixed.

Conclusion

There was a wide variation in the size of the study population and the input variables used. However, the prediction accuracy provided mixed results when machine learning and traditional predictive methods were compared. Based on reported gains in predictive performance, machine learning has the potential to improve kidney transplant outcome prediction and aid medical decision making.

Keywords : Machine learning; Predictive models; Kidney transplant; Graft failure

1 Introduction

1.1 Graft failure after kidney transplant

Increasing prevalence of Chronic Kidney Disease (CKD) and end-stage kidney disease over recent years has resulted in increased demand for kidney replacement therapy (KRT) (1, 2). Among the available KRT modalities, kidney transplantation has demonstrated superior quality of life and survival rates ⁽³⁾. However, health systems around the world have not been able to meet the growing demand for kidney grafts, as evidenced by the increased prevalence of other KRT modalities ⁽⁴⁾. Including kidney dialysis and transplant, current demand is estimated at 2,692 per million population in Japan (2015)⁽⁵⁾, 1,700 per million population in the United States (2009)⁽⁶⁾ and 782 per million population in the European Union (2013)⁽⁷⁾.

The ability to predict graft failure across different cohorts is crucial in systems of organ allocation to minimise the flow of people returning to an already-burdened waiting list ⁽⁸⁾. Models that can accurately predict graft failure following transplant may therefore help inform medical decision-making. A number of predictive models based on regression methods (e.g. logistic and Cox regression) are currently being used to predict graft failure among people with kidney transplants (9-12). These models have yielded mixed predictive power, motivating the need for alternative modelling approaches.

1.2 Machine learning in predicting graft failure following kidney transplant

Machine learning (ML) is a suite of methods whose theoretical construct may lead to improved predictive performance over conventional statistical modelling ⁽¹³⁾ (Supplementary File 1). ML is an efficient way of analysing large quantities of data and identifying hidden associations in complex data sets (14, 15). ML has evolved dramatically over recent decades and is already commonly used in medical diagnostics (16, 17). Its use in building predictive models to diagnose different disease conditions continues to expand (18-21).

For patients who receive KRT, the presence of multiple co-existing comorbidities and the complex nature of the immune response make predicting graft outcomes extremely challenging. Several kidney graft outcome models using ML methodology have been published (8, 22-24). However, the comparative performance of these models is unclear, limiting their translation into clinical practice. To address this uncertainty, we conducted a systematic analysis of ML methods applied to predict graft outcomes among kidney transplant recipients. Our results are intended to inform on the utility of ML-based predictive models in clinical decision making.

2 Methods

Search Strategy

A systematic review was undertaken to identify all published studies where ML methods were used to predict graft outcomes among kidney transplant recipients. Searches accessed the Medline, Cumulative Index to Nursing and Allied Health Literature (CINAHL), EMBASE, PsycINFO and Cochrane databases using pre-specified key words (Supplementary material 2). Reference lists of retrieved articles and review articles in the field were also searched to identify additional publications that met predefined inclusion and exclusion criteria (Table 1).

Table 1: Inclusion and exclusion criteria for review

Inclusion criteria

• Primary development of a clinical prediction model to predict long term graft survival following kidney transplant among Chronic Kidney Disease patients

• Models based on ML algorithms

• Full text article available

• Based in an adult patient population

• Written in English

Exclusion criteria

• Paediatric patients

• Predictive models to predict the graft survival following acute renal failure

• Non-English references and conference abstracts were excluded.

• Histology and molecular level based predictive models

• Prediction models for acute rejection

• Prediction models on sub-populations (CKD patients with SLE)

• Models not based on ML algorithms

• Did not contain an original analysis (e.g. editorials, reviews)

• Did not provide full details on methods (e.g. letters)

Data extraction

To identify the ML methods used, key details of methodology and results were recorded on a data extraction sheet. Data extraction was conducted by two independent reviewers (SS and SK) and discrepancies were resolved by discussion. Data elements extracted included study name, year of publication, country, study population, feature selection method used, the type of input variables (pre-transplant and/or post-transplant), ML method used, size of the training and validation data sets, validation methods and results, and follow up duration.

Input variables used in the ML models were organised into three categories. Models that used donor and recipient variables available before kidney transplant only were categorised as “Pre-transplant input”. Models that used pre-transplant, peri-transplant (e.g. cold ischemia time) and post-transplant (e.g. immunosuppressive regimen) input variables were categorised as “Pre and post-transplant input”. Finally, models that used post-transplant variables only were categorised as “Post-transplant input”.

Quality assessment

The quality of the studies included in the review was assessed using the criteria introduced by Qiao in 2019⁽²⁵⁾ (Supplementary material 3). The instrument has five categories: unmet need (limits in current non-machine-learning approach), reproducibility (feature engineering methods, platforms/packages, hyperparameters), robustness (valid methods to overcome over-fit, the stability of results), generalizability (external data validation) and clinical significance (predictors explanation and suggested clinical use). A quality assessment table was compiled by recording ‘yes’ or ‘no’ for the corresponding items in each category.

3 Results

A total of 295 articles were identified and reviewed, and 18 met the inclusion criteria (8, 22-24, 26-39). The reasons for the exclusion of 277 articles are described in Figure 1 in accordance with the PRISMA reporting guideline ⁽⁴⁰⁾. Of the 18 studies, 12 (66.7%) were published after 2010 (8, 23, 27, 30-38). Seven studies were conducted in the USA (8, 24, 26, 28, 29, 31, 37), three in Iran (33, 34, 36), two in Italy (23, 32) and one each in the UK ⁽³⁵⁾, Australia ⁽³⁹⁾, Korea ⁽³⁸⁾, Belgium ⁽²⁷⁾, Germany ⁽³⁰⁾ and Egypt ⁽²²⁾.

Figure 1: PRISMA flowchart for the selection of articles for the review

Quality assessment of the studies included in the review

The quality of the studies included in the review was generally satisfactory (Table 2). The feature selection method was not mentioned in nine of the papers (23, 26, 28, 31, 32, 34-36, 38). Two thirds (n=12) of the studies mentioned the platforms/packages used, while more than three quarters (n=14) reported the hyperparameters, which are needed for study replication.

None of the studies had validated the algorithm in an external data set.

Figure 1 (flowchart content): records identified in MEDLINE (N=139), CINAHL (N=6), EMBASE (N=137), PsycINFO (N=2) and Cochrane (N=11); initial search (N=295); duplicates (N=59); title read (N=236); abstract read (N=57); full article read (N=29); selected papers (N=18). Exclusion reasons shown: conference proceedings (9), not English (2), mathematical model (1).

Table 2: Quality assessment of machine learning studies used in the review

| Study | Limits in current non-machine-learning approach | Reproducibility: feature engineering | Reproducibility: platforms, packages | Reproducibility: hyperparameters | Robustness: valid methods for over-fitting | Robustness: stability of results | Generalizability: external data validation | Clinical significance: predictors explanation | Clinical significance: suggested clinical use |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Shaikhina et al ⁽³⁵⁾ | Yes | No | Yes | Yes | Yes | Yes | No | Yes | Yes |
| Decruyenaere et al. ⁽²⁷⁾ | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes |
| Brier et al ⁽²⁶⁾ | Yes | No | No | No | Yes | Yes | No | Yes | Yes |
| Yoo et al. ⁽³⁸⁾ | Yes | No | Yes | Yes | Yes | Yes | No | Yes | Yes |
| Topuz et al ⁽³⁷⁾ | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
| Nematollahi et al ⁽³³⁾ | No | Yes | Yes | No | No | No | No | Yes | No |
| Shahmoradi et al ⁽³⁴⁾ | Yes | No | Yes | Yes | Yes | No | No | Yes | Yes |
| Brown et al. ⁽⁸⁾ | Yes | Yes | Yes | Yes | Yes | No | No | Yes | Yes |
| Lasserre et al ⁽³⁰⁾ | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
| Lofaro et al ⁽³²⁾ | No | No | Yes | Yes | Yes | Yes | No | Yes | No |
| Greco et al ⁽²³⁾ | No | No | No | No | Yes | No | No | No | No |
| Li et al ⁽³¹⁾ | Yes | No | Yes | Yes | Yes | No | No | Yes | Yes |
| Akl et al ⁽²²⁾ | No | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| Lin et al ⁽²⁴⁾ | Yes | Yes | Yes | Yes | Yes | Yes | No | No | Yes |
| Krikov et al ⁽²⁹⁾ | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| Goldfarb et al ⁽²⁸⁾ | Yes | No | Yes | Yes | Yes | No | No | Yes | Yes |
| Petrovsky et al ⁽³⁹⁾ | Yes | Yes | No | Yes | Yes | No | No | No | Yes |
| Tapak et al ⁽³⁶⁾ | Yes | No | No | Yes | Yes | Yes | No | Yes | Yes |

Evaluation of the machine learning algorithms used in the studies included in the review

Ten studies utilised data from both living and deceased donor transplants in proposed models (23, 24, 29, 31-36, 38, 39). Six studies used data from deceased donor transplant records only (8, 26-28, 30, 37), and one study used living donor transplant information ⁽²²⁾ (Table 3).

Four of the seven ML predictive models developed in the USA utilised large data sets, including more than 30,000 kidney transplant recipients (24, 28, 29, 37). Half of the studies identified (n=9; 50.0%) developed models using datasets with fewer than 1,000 patients (23, 26, 27, 30, 32-36), and two studies had only 80 patients (32, 34).

Several studies implemented more than one ML method to predict the same outcome. Decision trees (n=8) (23, 27-29, 32, 34, 35, 38) were the most commonly used ML method for predicting graft outcome, followed by artificial neural networks (n=6) (22, 24, 26, 33, 34, 36, 39) and Bayesian belief networks (n=3) (8, 31, 37). Eight studies each used pre-transplant variables (8, 24, 28, 30, 31, 34, 35, 37) and pre- and post-transplant input variables (22, 23, 26, 27, 29, 33, 36, 38, 39). Only one model exclusively used post-transplant laboratory blood and urine tests, collected six months after transplantation ⁽³²⁾.

Most studies included a large number of input variables, and some used feature selection methods to identify the most important input variables prior to modelling. Nine of the papers reviewed in the present study did not mention a feature selection method (23, 26, 28, 31, 32, 34-36, 38), while four had used literature review and expert opinion (8, 24, 33, 37). The number of input variables used in the models ranged from 6⁽³⁰⁾ to 71 ⁽²⁴⁾, with 50.0% (n=9) of the models using fewer than 20 variables.

ML methods were applied to predict different graft outcomes at different time points. Shaikhina et al. (2017)⁽³⁵⁾ used decision tree and random forest methods to predict acute anti-body mediated rejection at 30 days post kidney transplant; Decruyenaere et al. (2015)⁽²⁷⁾ and Brier et al. (2003)⁽²⁶⁾ used ML methods such as decision tree, random forest, linear and radial support vector machines and artificial neural networks to predict delayed graft function within one week of kidney transplant. The majority of models sought to predict graft survival beyond one year post transplant, with two studies predicting outcomes after 10 years post-transplant (29, 31). Seven studies (8, 24, 29, 31, 34, 37, 38) developed predictive models to predict the outcome at different time points (e.g. outcome prediction at one year, three years and five years), but the majority reported outcomes at a single time point.

One study presented a prediction model that used time-to-event (survival) information ⁽³⁸⁾.
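Time-to-event models of this kind are usually summarised with an index of concordance rather than sensitivity or specificity. The following is a minimal sketch of Harrell's concordance index, assuming right-censored data; the function name, the toy follow-up times and the risk scores are all hypothetical and are not taken from any reviewed study. Pairs with tied follow-up times are simply skipped, which is a simplification.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Share of comparable patient pairs that the risk scores order correctly.

    times       -- follow-up time for each patient
    events      -- 1 if graft failure was observed, 0 if censored
    risk_scores -- model output; higher means earlier predicted failure
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter follow-up.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # A pair is only comparable if the earlier patient actually failed;
        # tied times are skipped in this simplified version.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four recipients, one of them censored at 7 years.
times = [2.0, 5.0, 7.0, 10.0]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]
c_index = concordance_index(times, events, risk)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of failure times.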

A number of approaches were used to validate the ML model outputs across studies. Seven studies (23, 24, 27, 30-32, 37) used cross validation methods, while 10 studies (8, 22, 26, 28, 29, 34-36, 38, 39) used a training and a test data set to derive the validation parameters. Common parameters used included area under the curve (AUC), sensitivity, specificity and accuracy.
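The two validation designs described above can be sketched as follows. This is an illustrative outline only, assuming records are addressed by index; the function names, the 70%/30% split and the choice of five folds are assumptions made for the example and do not reproduce the procedure of any particular study.

```python
import random


def holdout_split(n_records, train_fraction, seed=0):
    """Single training/test split, e.g. Tr 70% and Ts 30%."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_records * train_fraction))
    return indices[:cut], indices[cut:]


def k_fold_splits(n_records, k, seed=0):
    """k-fold cross validation: each record is held out exactly once."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out
                 for idx in fold]
        yield train, test


# Hypothetical cohort of 100 records.
train_idx, test_idx = holdout_split(100, 0.70)
splits = list(k_fold_splits(100, 5))
print(len(train_idx), len(test_idx), len(splits))
```

With a single split, the validation parameters depend on which records fall into the test set; cross validation averages over several such splits, which matters most for the smaller cohorts.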

For descriptive purposes, the identified studies were divided into three groups based on the prediction time frame: predicting graft failure before one year (early graft failure), predicting graft failure using time-to-event (survival) data, and predicting graft failure at and after one year (late graft failure). The validation parameters did not differ significantly between the models developed to predict graft failure in the short term and those developed for the longer term (Table 3).

In the six artificial neural network machine learning models, the AUC ranged from 0.67 to 0.88, with 0.67 in the model that was used to predict the delayed graft function within one week post kidney transplant and others being used to predict graft outcome at five years.

The artificial neural network method developed by Akl et al. (2008)⁽²²⁾ was evaluated using an independent data set, and it predicted graft survival at five years with an accuracy of 95%. Sensitivity was assessed in four (23, 27, 34, 35) of the eight studies that used decision trees, and ranged from 29.5%⁽²⁷⁾ to 88.2%⁽²³⁾. According to Shaikhina et al. (2017)⁽³⁵⁾, who developed a model to predict acute anti-body mediated rejection at 30 days post kidney transplant, random forest (AUC – 0.854) outperformed decision tree (AUC – 0.819). Furthermore, artificial neural networks (AUC – 0.865) outperformed support vector machine (AUC – 0.769) in predicting graft survival at five years, based on the study by Nematollahi et al. (2017)⁽³³⁾.
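For readers less familiar with the validation parameters quoted in this section, the sketch below shows how sensitivity, specificity, predictive values, accuracy and AUC are derived from predicted and observed outcomes. The labels and scores are hypothetical toy values, the 0.5 threshold is an assumption, and the AUC is computed with the pairwise (rank-based) definition rather than any specific package.

```python
def classification_metrics(y_true, y_pred):
    """Confusion-matrix metrics for a binary outcome (1 = graft failure)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),   # Sn
        "specificity": tn / (tn + fp),   # Sp
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "accuracy": (tp + tn) / len(y_true),
    }


def auc(y_true, scores):
    """Probability that a random failure is scored above a random non-failure."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: observed outcomes and model scores for 8 recipients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.6, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
print(metrics, area)
```

Accuracy alone can look high when graft failure is rare, which is one reason sensitivity and specificity are usually reported alongside it.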

Table 3: Studies evaluating machine learning algorithms used for kidney graft outcome prediction

| First author, year of publication, reference & country | Population | Input variable (Pre/Post/Both) | Feature selection method | Number of inputs in the final model | Output | ML method | Training and test set sizes | Validation results |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Predict graft failure before one year (early graft failure) | | | | | | | | |
| Shaikhina et al (2017) ⁽³⁵⁾ UK | 80 HLA incompatible KT (both living and deceased donor) | Pre | Not mentioned | 14 | Acute anti-body mediated rejection post KT 30 days | Decision tree | Tr - 75%; Ts - 25% | Ac - 85%; Sn - 81.8%; Sp - 88.9%; PPV - 90%; NPV - 80%; AUC - 0.854 |
| | | | | | | Random Forest | Tr - 75%; Ts - 25% | Ac - 85%; Sn - 92.3%; Sp - 71.4%; PPV - 85.7%; NPV - 83.3%; AUC - 0.819 |
| Decruyenaere et al. (2015) ⁽²⁷⁾ Belgium | 497 deceased donor KT patients | Pre & post | Recursive feature elimination procedure | 20 | Delayed graft function (Dialysis within 1st week after KT) | LDA | Cross validation method | Sn - 27.6%; PPV - 42.3%; AUC - 0.822 |
| | | | | | | QDA | | Sn - 37.6%; PPV - 37.9%; AUC - 0.796 |
| | | | | | | Linear SVM | | Sn - 83.8%; PPV - 30.6%; AUC - 0.843 |
| | | | | | | Decision Tree | | Sn - 29.5%; PPV - 14.2%; AUC - 0.525 |
| | | | | | | Random Forest | | Sn - 16.4%; PPV - 43.9%; AUC - 0.739 |
| | | | | | | SGB | | Sn - 16.2%; PPV - 58.3%; AUC - 0.772 |
| | | | | | | Radial SVM | | Sn - 88.8%; PPV - 23.6%; AUC - 0.833 |
| | | | | | | Polynomial SVM | | Sn - 10.9%; PPV - 24.0%; AUC - 0.798 |
| Brier et al (2003) ⁽²⁶⁾ USA | | Pre & post | Not mentioned | 10 | Delayed Graft Function within one week post KT | ANN | Tr - 65%; Ts - 35% | Sn - 63.5%; Sp - 64.8%; AUC - 0.668 |
| Predict graft failure using time-to-event (survival) data | | | | | | | | |
| Yoo et al. (2017) ⁽³⁸⁾ Korea | 3,117 KT patients (both living and deceased donor) | Pre & post | Not mentioned | 33 | Graft failure | Survival decision tree model | Tr - 80%; Ts - 20% | Index of concordance - 0.80 |
| Predict graft failure at and after one year (late graft failure) | | | | | | | | |
| Topuz et al (2018) ⁽³⁷⁾ USA | 31,207 deceased donor KT patients | Pre | Literature review; Elastic net; SVM, ANN & bootstrap forest combined with sensitivity analysis and information fusion | Not specifically mentioned | 3 output levels: high risk (GF before 3 years); medium risk (GF between 3 - 7 years); low risk (GF after 7 years) | Bayesian belief network model | Cross validation method | Total Ac - 68%; Ac high risk - 71%; Ac medium risk - 74%; Ac low risk - 59%; Sn - 41%; Sp - 84%; F measure - 0.60; G mean - 0.49 |
| Yoo et al. (2017) ⁽³⁸⁾ Korea | | Pre & post | Not mentioned | 33 | Graft failure at 10 years | Decision tree | Tr - 80%; Ts - 20% | Index of concordance - 0.71 |
| Nematollahi et al (2017) ⁽³³⁾ Iran | 717 KT patients (both living and deceased donor) | Pre & post | Clinical expertise and current available evidence | 07 | Graft failure at 5 years post KT | SVM | Not mentioned | Ac - 85.9%; Sn - 97.3%; Sp - 26.1%; AUC - 0.769 |
| | | | | | | ANN | | Ac - 90.4%; Sn - 98.2%; Sp - 49.6%; AUC - 0.865 |
| Shahmoradi et al (2016) ⁽³⁴⁾ Iran | 513 KT patients (donor method not specified) | Pre | Not mentioned | 11 | Graft survival at 1, 2, 3, 4, 5, & 6 yrs after KT | ANN | Tr - 70%; Ts - 30% | Sn - 87.1%; Sp - 65.0%; Ac - 83.7% |
| | | | | | | C & R Tree Model | | Sn - 87.1%; Sp - 57.3%; Ac - 83.3% |
| | | | | | | C5.0 Model | | Sn - 90.8%; Sp - 52.0%; Ac - 87.2% |
| Brown et al. (2012) ⁽⁸⁾ USA | | Pre | Clinical expertise and current available evidence | 52 | Graft survival at 1 year & 3 years post KT | Bayesian model | Tr - 70%; Ts - 30% | Prediction at 1 year: AUC - 0.63; Sn - 39.9%; Sp - 79.9%. Prediction at 3 years: AUC - 0.63; Sn - 39.8%; Sp - 80.2% |
| Lasserre et al (2012) ⁽³⁰⁾ Germany | | Pre | Recursive feature elimination | 6 | eGFR at 1 year post KT | SVM with a Gaussian kernel (G-SVM) | | Pearson correlation coefficient 0.48 |
| Lofaro et al (2010) ⁽³²⁾ Italy | | Post | Not mentioned | 23 | Chronic Allograft Nephropathy at 5 years | Decision tree | Cross validation method | Sn - 62.5%; Sp - 92.8%; AUC - 0.847 |
| Greco et al (2010) ⁽²³⁾ Italy | | Pre & post | Not mentioned | 09 | Graft failure at 5 years post KT | Decision tree | Cross validation method | Sn - 88.2%; Sp - 73.8% |
| Li et al (2010) ⁽³¹⁾ USA | | Pre | Not mentioned | 70 | Graft survival up to 1 year | Bayesian model | | Sn - 85.8%; Sp - 95.7%; Pr - 89.3%; F Measure - 0.875; AUC - 0.967 |
| | | | | | Graft survival >1 - 5 years | | | Sn - 63.8%; Sp - 88%; Pr - 63.8%; F Measure - 0.629; AUC - 0.866 |
| | | | | | Graft survival >5 - 10 years | | | Sn - 54.2%; Sp - 86%; Pr - 54.2%; F Measure - 0.542; AUC - 0.824 |
| | | | | | Graft survival > 10 years | | | Sn - 64.6%; Sp - 89%; Pr - 63.3%; F Measure - 0.639; AUC - 0.856 |
| Akl et al (2008) ⁽²²⁾ Egypt | 1,900 live donor KT | Pre & post | Factors significantly associated with graft survival in the univariate analysis | 11 | Graft survival at 5 years | ANN | Tr - 83%; Ts - 17% | Sn - 88.4%; Sp - 73.2%; PPV - 82.1%; NPV - 82.0%; Ac - 95%; AUC - 0.88 |
| Lin et al (2008) ⁽²⁴⁾ USA | | Pre | Clinical expertise and current available evidence | 71 | Graft survival at 1 year, 3 years, 5 years & 7 years | Single output ANN | | AUC - 1 yr 0.73; 3 yr 0.75; 5 yr 0.77; 7 yr 0.82 & % of non-monotonic predictions 2.34% |
| | | | | | | Multiple output ANN | | AUC - 1 year 0.61; 3 year 0.68; 5 year 0.73; 7 year 0.82 & % of non-monotonic predictions 5.46% |
| Krikov et al (2007) ⁽²⁹⁾ USA | | Pre & post | Significant predictors of survival analysis and multiple logistic regression. Additional variables added considering the clinical relevance | 29 | Graft survival at 1, 3, 5, 7 and 10 years after KT | Tree based model | Tr - 66%; Ts - 34% | AUC > 1 year - 0.626, AUC >2 year - 0.640, AUC >5 year - 0.717, AUC >7 year - 0.830, AUC > 10 year - 0.901 |
| Goldfarb et al (2003) ⁽²⁸⁾ USA | | Pre | Not mentioned | 17 | Graft survival at 3 years | Tree based model | Tr - 66%; Ts - 34% | Correlation between the prediction probability and the observed survival (r) - 0.984; PPV - 76%; NPV 53.8% |
| Duration not specified | | | | | | | | |
| Petrovsky et al (2002) ⁽³⁹⁾ Australia & New Zealand | | Pre & post | Principal Component Analysis (PCA) | 22 | GF. A time duration has not been specified | ANN | Tr - 70%; Ts - 30% | Ac - 71.7% |
| Tapak et al (2017) ⁽³⁶⁾ Iran | 378 KT (both living and deceased donor) patients | Pre & post | Not mentioned | 19 | GF. A time duration has not been specified | ANN | Tr - 70%; Ts - 30% | Sn - 91%; Sp - 74%; PPV 27%; NPV - 98%; Ac - 75%; AUC 0.88; Kendall tau-b 0.41 (0.002); Kappa 0.17 (<0.001) |

KT – Kidney Transplant; SVM – Support Vector Machines; ANN – Artificial neural networks; GF – Graft failure; Ac – Accuracy; Sn – Sensitivity; Sp – Specificity; Pr – Precision; Tr – Training; Ts – Testing; PPV – Positive Predictive value; NPV – Negative predictive value; AUC – Area under curve; LDA – Linear Discriminant Analysis; QDA – Quadratic Discriminant Analysis; SGB – Stochastic Gradient Boosting; PCA – Principal Component Analysis; FFS – Forward Feature Selection; RFE – Recursive feature elimination

ML Compared with Other Predictive Methods

Seven studies compared the performance of ML models to other conventional methods (22, 24, 26-28, 33, 36) (Table 4). Six studies compared ML models, four of which were artificial neural networks (24, 26, 33, 36), with logistic regression modelling (24, 26-28, 33, 36). The performance of all four artificial neural network models was reported as superior to the logistic regression models developed. The prediction accuracy was 20 and 5.7 percentage points higher in artificial neural network models than in logistic regression in the studies of Tapak et al. (2017)⁽³⁶⁾ (accuracy: artificial neural network 75% vs logistic regression 55%) and Nematollahi et al. (2017)⁽³³⁾ (accuracy: artificial neural network 90.4% vs logistic regression 84.7%), respectively. Furthermore, according to Lin et al. (2008)⁽²⁴⁾ (AUC: artificial neural network 0.77 vs logistic regression 0.71) and Brier et al. (2003)⁽²⁶⁾ (AUC: artificial neural network 0.668 vs logistic regression 0.636), the AUC was about 5% higher in the artificial neural network models.

Nematollahi et al. (2017)⁽³³⁾ (AUC: support vector machine 0.769 vs logistic regression 0.774) and Decruyenaere et al. (2015)⁽²⁷⁾ (AUC: support vector machine 0.843 vs logistic regression 0.817) employed SVM and reported slight model improvements compared with logistic regression. However, Goldfarb et al. (2003)⁽²⁸⁾ (correlation (r): decision tree 0.984 vs logistic regression 0.998) and Decruyenaere et al. (2015)⁽²⁷⁾ (AUC: decision tree 0.525 vs logistic regression 0.817), using the decision tree ML approach, found that logistic regression gave superior results.

Prediction models developed using Cox regression were compared to ML models developed using artificial neural network by Akl et al. (2008)⁽²²⁾ and Lin et al. (2008)⁽²⁴⁾. According to Akl et al. (2008)⁽²²⁾, the prediction accuracy (artificial neural network 95% vs Cox 90%) and AUC (artificial neural network 0.88 vs Cox 0.72) of artificial neural network were 5% and 0.16 higher compared to the Cox regression model.
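When reading these comparisons it helps to separate absolute differences (percentage points for accuracy, raw units for AUC) from relative differences (percent of the regression value). The short sketch below recomputes both for some of the accuracy and AUC pairs quoted above; the helper name `compare` and the rounding choices are assumptions made for illustration.

```python
def compare(ml_value, regression_value):
    """Return (absolute difference, relative difference in %) of ML vs regression."""
    absolute = ml_value - regression_value
    relative = absolute / regression_value * 100.0
    return round(absolute, 3), round(relative, 1)


# Value pairs (ML, regression) as quoted in the text above.
comparisons = {
    "Tapak accuracy, ANN vs logistic (%)": (75.0, 55.0),
    "Nematollahi accuracy, ANN vs logistic (%)": (90.4, 84.7),
    "Akl AUC, ANN vs Cox": (0.88, 0.72),
    "Brier AUC, ANN vs logistic": (0.668, 0.636),
}

results = {name: compare(ml, reg) for name, (ml, reg) in comparisons.items()}
for name, (absolute, relative) in results.items():
    print(f"{name}: absolute {absolute}, relative {relative}%")
```

For example, the 20-point accuracy gap reported by Tapak et al. corresponds to a relative improvement of roughly 36%, whereas the AUC gap reported by Brier et al. is 0.032 in absolute terms, or about 5% in relative terms.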

Table 4: Studies comparing machine learning methods with other predictive methods for kidney graft outcome prediction

| Study | Prediction duration | ML method: ANN | ML method: Decision Tree | ML method: Random Forest | ML method: SVM | Regression method: Logistic | Regression method: Cox |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Tapak et al (2017) ⁽³⁶⁾ | Not specified | Sn - 91%; Sp - 74%; Ac - 75%; AUC 0.88 | | | | Sn - 91%; Sp - 51%; Ac - 55%; AUC 0.75 | |
| Nematollahi et al (2017) ⁽³³⁾ | 5 Years | Ac 90.4%; Sn - 98.2%; Sp 49.6%; AUC 0.865 | | | Ac - 85.9%; Sn - 97.3%; Sp - 26.1%; AUC 0.769 | Ac - 84.7%; Sn - 97.5%; Sp 17.4%; AUC 0.774 | |
| Decruyenaere et al. (2015) ⁽²⁷⁾ | Delayed graft function within 1st week of KT | | Sn - 29.5%; PPV - 14.2%; AUC - 0.525 | Sn - 16.4%; PPV - 43.9%; AUC 0.739 | Sn - 83.8%; PPV - 30.6%; AUC - 0.843 | Sn - 85.5%; PPV - 26.5%; AUC - 0.817 | |
| Akl et al (2008) ⁽²²⁾ | 5 Years | Sn - 88.4%; Sp - 73.2%; PPV - 82.1%; Ac - 95%; AUC - 0.88 | | | | | Sn - 61.8%; Sp - 74.9%; PPV - 43.5%; Ac - 90%; AUC - 0.72 |
| Lin et al (2008) ⁽²⁴⁾ | 1 year | AUC 0.73 | | | | AUC 0.71 | AUC 0.65 |
| | 3 year | AUC 0.75 | | | | AUC 0.72 | AUC 0.67 |
| | 5 year | AUC 0.77 | | | | AUC 0.75 | AUC 0.71 |
| | 7 year | AUC 0.82 | | | | AUC 0.81 | AUC 0.75 |
| Brier et al (2003) ⁽²⁶⁾ | Delayed graft function within 1st week of KT | Sn - 63.5%; Sp - 64.8%; AUC - 0.668 | | | | Sn - 36.5%; Sp - 90.7%; AUC - 0.636 | |
| Goldfarb et al (2003) ⁽²⁸⁾ | | | Correlation between the prediction probability and the observed survival (r) - 0.984; PPV - 76%; NPV 53.8% | | | Correlation between the prediction probability and the observed survival (r) - 0.998; PPV - 76%; NPV 63.0% | |

SVM – Support Vector Machines; ANN – Artificial neural networks; Ac – Accuracy; Sn – Sensitivity; Sp – Specificity; Pr – Precision; PPV – Positive Predictive value; NPV – Negative predictive value; AUC – Area under curve

4 Discussion

This is the first study to systematically review current ML methods that have been developed to predict clinical outcomes following kidney transplant. Results showed heterogeneity in the types of ML methods used, including artificial neural networks, decision trees and Bayesian belief networks. Variation in the size of the study populations and the input variables used was also observed. Furthermore, the prediction accuracy of ML models was inconsistent when compared with traditional predictive methods (based on logistic and Cox regression).

ML techniques are being increasingly used in clinical and preclinical medical research. ML has been effectively used to predict survival of grafts after transplantation surgery, varying from stem cell transplants to heart transplants ⁽⁴¹⁾. Techniques such as artificial neural networks have been applied to the study of cancer development and progression and, according to Cruz and Wishart (2007), machine learning substantially improves the predictive accuracy (10% - 25% absolute improvement) of biologically meaningful outcomes like cancer risk, recurrence and mortality ⁽⁴²⁾.

ML techniques used in Kidney transplants

It was interesting to note that models developed using a small number of cases reported prediction accuracy similar to that of models developed using larger numbers of cases. For example, the prediction model developed by Lofaro et al. (2010)⁽³²⁾ using 80 records performed similarly to the model developed by Krikov et al. (2007)⁽²⁹⁾, which was built on a national database of 92,844 patient records. Little is known about the minimum sample size needed to develop a predictive model using ML methods, but it is closely linked to the complexity of the prediction and the complexity of the ML method. The evidence points to larger sample sizes resulting in better prediction accuracy ⁽⁴³⁾. The number of events per variable is a third factor in the performance of the model. Larger numbers of events per variable are associated with better model stability and higher predictive accuracy ⁽⁴³⁾. Present day patient registries, with large volumes of data, ought to yield high fidelity ML predictions.
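The events-per-variable idea mentioned above is simple arithmetic: the number of observed outcome events (here, graft failures) divided by the number of candidate input variables. The sketch below illustrates it with two entirely hypothetical cohorts; none of the numbers are taken from the reviewed studies, and the function name is an assumption.

```python
def events_per_variable(n_events, n_candidate_variables):
    """Events per variable (EPV): outcome events divided by candidate inputs."""
    if n_candidate_variables <= 0:
        raise ValueError("at least one candidate variable is required")
    return n_events / n_candidate_variables


# Hypothetical single-centre cohort: 24 graft failures, 12 candidate inputs.
small_cohort_epv = events_per_variable(24, 12)

# Hypothetical registry cohort: 3,000 graft failures, 30 candidate inputs.
registry_epv = events_per_variable(3000, 30)

print(small_cohort_epv, registry_epv)
```

The point of the comparison is that the event count, not the total cohort size, is the quantity that limits how many inputs a model can support.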

Eight studies used pre-transplant donor and recipient variables (8, 24, 28, 30, 31, 34, 35, 37) to predict graft outcomes. The models in these studies have immediate translation potential as decision-making tools in kidney organ allocation. Brown et al. (2012) proposed using a model based on a Bayesian belief network in kidney allocation, to determine which donor-recipient match would yield the longest graft survival. They hypothesise that such a system would have the potential to prevent more than 40% of graft failures within the first year ⁽⁸⁾. Despite this, ML approaches have not been widely adopted for organ allocation, possibly due to the non-acceptance of such methods by the organ allocation bodies ⁽³⁹⁾. The greatest strength of an organ allocation system based on ML techniques is that it circumvents human bias during the allocation process. However, human bias may be desirable, e.g. weighting of specific patient groups to deliver parity of access. Nevertheless, bias, for whatever reason, ought not to disqualify exploration of improved systems of organ allocation, particularly when they deliver a more accurate assessment of the likely graft outcome (39, 44).

The accuracy of a predictive model largely depends on the incorporation of prognostically significant variables in the model ⁽⁴⁵⁾. However, in a practical sense it is important that the number of input variables is manageable. Thus, the selection of variables that account for most of the variation in the outcome is an important pre-step. Furthermore, over-fitting leads to poor model performance and is associated with having too many irrelevant parameters in the model. Over-fitting occurs when a machine learning model adapts to the details of the data set to the extent that it negatively impacts the performance of the model on new data ⁽⁴⁶⁾. Feature selection methods, which identify the most important variables to include in the model, are commonly used to improve the performance of ML models. Principal Component Analysis (PCA) is considered one of the most popular methods for dimensionality reduction ⁽⁴⁷⁾. However, PCA was used in only one study identified in this review ⁽³⁹⁾.
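The gap between performance on the training data and on new data can be sketched with a small synthetic example. Everything below is an assumption made for illustration: the data are simulated, the nearest-neighbour classifier is chosen only because it memorises its training set, and none of it reproduces a model from the reviewed studies.

```python
import random


def make_data(n, n_noise, rng, flip=0.1):
    """Synthetic records: one informative feature followed by n_noise irrelevant ones."""
    rows = []
    for _ in range(n):
        signal = rng.gauss(0.0, 1.0)
        label = 1 if signal > 0 else 0
        if rng.random() < flip:  # label noise
            label = 1 - label
        features = [signal] + [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        rows.append((features, label))
    return rows


def predict_1nn(train, x, n_features):
    """Label of the closest training record, using the first n_features columns."""
    best_label, best_dist = None, float("inf")
    for feats, label in train:
        dist = sum((a - b) ** 2 for a, b in zip(feats[:n_features], x[:n_features]))
        if dist < best_dist:
            best_dist, best_label = dist, label
    return best_label


def accuracy(train, data, n_features):
    hits = sum(predict_1nn(train, x, n_features) == y for x, y in data)
    return hits / len(data)


rng = random.Random(0)
train = make_data(200, 20, rng)
test = make_data(200, 20, rng)

# Compare the informative feature alone with all 21 features.
results = {}
for n_features in (1, 21):
    results[n_features] = (
        accuracy(train, train, n_features),  # accuracy on data the model has seen
        accuracy(train, test, n_features),   # accuracy on new data
    )
    print(n_features, results[n_features])
```

Because each training record is its own nearest neighbour, training accuracy is perfect regardless of how many irrelevant features are included, so only the held-out accuracy says anything about how the model would behave on new patients.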

The studies in the review used different ML methods (artificial neural networks, decision trees, support vector machine, Bayesian belief networks) to develop the predictive models.

In the review, five studies used more than one method to develop models (27, 33-35, 38). The best ML method to develop a predictive model is a widely discussed topic ⁽⁴⁸⁾. The current consensus is that there is no one method that fits all data sets, with the complexity of the data pivotal ⁽⁴⁹⁾. To get around this uncertainty, investigators apply multiple machine learning methods to a single data set and choose the best based on validation parameters. Abbas et al. (2018) used six machine learning methods to classify foetal distress and hypoxia ⁽⁵⁰⁾ and Aljaaf et al. (2018) used four machine learning methods to predict chronic kidney disease early ⁽⁵¹⁾, highlighting the importance of using multiple ML methods on a single data set.
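A minimal sketch of this practice, fitting several candidate methods to one data set and keeping the one with the best cross-validated score, is given below. The three toy classifiers, the single simulated predictor and all names are assumptions for illustration only; they stand in for the methods used in the cited studies rather than reproducing them.

```python
import random


def k_fold_indices(n, k, rng):
    """Shuffle record indices and split them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def majority_model(train):
    """Baseline: always predict the most common training label."""
    ones = sum(y for _, y in train)
    label = 1 if 2 * ones >= len(train) else 0
    return lambda x: label


def threshold_model(train):
    """Pick the cut-point on the predictor that maximises training accuracy."""
    best_cut, best_hits = None, -1
    for cut, _ in train:
        hits = sum((x >= cut) == bool(y) for x, y in train)
        if hits > best_hits:
            best_cut, best_hits = cut, hits
    return lambda x: 1 if x >= best_cut else 0


def nearest_neighbour_model(train):
    """Predict the label of the closest training record."""
    return lambda x: min(train, key=lambda rec: abs(rec[0] - x))[1]


def cross_validated_accuracy(fit, data, folds):
    """Mean held-out accuracy of a fitting procedure across the folds."""
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [rec for i, rec in enumerate(data) if i not in held]
        model = fit(train)
        hits = sum(model(data[i][0]) == data[i][1] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / len(scores)


# Simulated data set: one predictor, noisy binary outcome.
rng = random.Random(1)
data = []
for _ in range(150):
    x = rng.gauss(0.0, 1.0)
    y = 1 if x + rng.gauss(0.0, 0.5) > 0 else 0
    data.append((x, y))

folds = k_fold_indices(len(data), 5, rng)
candidates = {
    "majority": majority_model,
    "threshold": threshold_model,
    "nearest_neighbour": nearest_neighbour_model,
}
scores = {name: cross_validated_accuracy(fit, data, folds) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

All candidates are scored on the same folds, so differences in the scores reflect the methods rather than the way the data happened to be split.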

Validation parameters have not yet been standardised in the literature. The studies included in the review used different validation parameters, such as sensitivity, area under the curve and accuracy. Standardisation of validation parameters would facilitate comparisons between different models.
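To make the relationship between these measures concrete, the sketch below computes sensitivity, specificity, accuracy and the area under the ROC curve from one set of predictions. The labels, scores and the 0.5 threshold are invented for illustration, and the function names are hypothetical.

```python
def confusion_counts(labels, scores, threshold=0.5):
    """Return (TP, FP, TN, FN) when scores at or above the threshold count as positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 0 and predicted == 1:
            fp += 1
        elif y == 0 and predicted == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def auc(labels, scores):
    """Area under the ROC curve as the share of (positive, negative) pairs
    in which the positive case has the higher score; ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: 1 = graft failure, 0 = no failure.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.5]

tp, fp, tn, fn = confusion_counts(labels, scores)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / len(labels)

print(sensitivity, specificity, accuracy, auc(labels, scores))
# 0.75 0.5 0.625 0.8125
```

The same predictions give four different numbers, and the first three change with the chosen threshold while the area under the curve does not, which is why models reported with different measures are hard to compare directly.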

ML Compared with Other Predictive Methods

Conventional models such as Cox models and logistic regression assume that the predictors are independent of each other ^{(52, 53)}. They are less suited to handling complex interactions among predictors, and are not often used to model non-linear relationships between predictors and outcomes ^{(24, 54)}. The survival of the graft after a kidney transplant depends on many factors, and evidence indicates that predictions of the outcome from conventional statistical modelling are imprecise ⁽⁵⁵⁾.
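For reference, the textbook forms of the two conventional models are written out below. These are the standard definitions rather than formulas taken from the reviewed studies, and the symbols (baseline hazard h_0, coefficients β, predictors x_1 to x_p) are the usual notation, introduced here for illustration.

```latex
% Cox proportional hazards model: hazard at time t given predictors x
h(t \mid x) = h_0(t)\,\exp\!\left(\beta_1 x_1 + \dots + \beta_p x_p\right)

% Logistic regression: log-odds of the binary outcome y given predictors x
\log\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p
```

In both cases the predictors enter through a single linear combination, so any interaction or non-linear effect has to be specified by the analyst as an additional term.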

However, our review found inconsistencies in the prediction accuracy of ML models compared to traditional predictive methods (based on logistic and Cox regression). This is consistent with the currently available literature. Several studies that compared ML and traditional statistical methods reported mixed results ^(56-58). However, Oermann et al. (2015) found that ML was superior to conventional prognostic scoring systems in predicting the outcome after stereotactic radiosurgery for cerebral arteriovenous malformations ⁽⁵⁹⁾. The superiority of ML, compared to expert opinion, in predicting outcomes has also been demonstrated in the medical literature ⁽⁶⁰⁾. Similarly, Senders et al. (2017) reported that ML methods outperformed logistic regression in predicting the outcome of neurosurgery in a systematic review of seven studies.

Limitations

This review has limitations. Although broad search strategies were used to find relevant articles, we may not have identified all the machine learning predictive models reported in this area. Studies and outcome measures demonstrating high-performing ML models might be published or reported more often, so publication and/or outcome reporting bias could be present in this review.

Conclusion

This first systematic review of ML methods used in the field of kidney transplant found that the main ML methods used were artificial neural networks, decision trees and Bayesian models. There is wide variation in the size of the study population and the input variables used in these ML models. Only one ML-based prediction model had modelled time-to-event (survival) information; the others instead used the binary outcome of failure or not. Adding information on the timing of the event could lead to improved model predictions. The prediction accuracy provided mixed results when ML and traditional predictive methods (based on logistic and Cox regression) were compared. However, ML has the potential to improve the prediction of kidney transplant outcomes. It is a novel tool for clinicians making decisions about the scant community resource of transplant organs. The barriers to the practical implementation of ML methods in clinical settings, including the ethical and societal implications of adoption briefly alluded to, warrant further exploration.

Future research should focus on modelling time-to-event information using machine learning methods such as survival trees ⁽⁶¹⁾, random survival forests ⁽⁶²⁾ and survival support vector machines ⁽⁶³⁾.
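The difference between a binary outcome and time-to-event information can be illustrated with a small sketch. It does not implement any of the ML survival methods named above; it uses the classical Kaplan-Meier estimator, chosen here only because it is short, and the follow-up records are invented.

```python
def kaplan_meier(records):
    """Kaplan-Meier survival curve.

    records: list of (time, event) pairs, with event = 1 for graft failure
    and event = 0 for censoring (follow-up ended without failure).
    Returns a list of (time, estimated survival probability) at each failure time.
    """
    survival = 1.0
    curve = []
    failure_times = sorted({time for time, event in records if event == 1})
    for t in failure_times:
        at_risk = sum(1 for time, _ in records if time >= t)
        failures = sum(1 for time, event in records if time == t and event == 1)
        survival *= 1 - failures / at_risk
        curve.append((t, survival))
    return curve


# Invented follow-up data in years: (time, event).
records = [(2, 1), (3, 0), (5, 1), (5, 1), (8, 0), (10, 1)]

# Binary view: proportion of grafts recorded as failed, ignoring when and
# ignoring that two patients were censored before any failure was seen.
binary_failure_rate = sum(event for _, event in records) / len(records)

curve = kaplan_meier(records)
print(binary_failure_rate)
print(curve)
```

The binary summary collapses the six records into a single proportion, whereas the survival curve keeps both the timing of each failure and the partial information carried by the censored patients.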

Authors’ Contributions

SS, NW & SK; Research idea, study design, analysis and interpretation. SS; Drafting of the manuscript. NW, SK, NG, HH, KB; Data analysis, interpretation, supervision and mentorship.

Acknowledgement

Sameera Senanayake is a recipient of an Australian Government Research Training Program (RTP) for Postgraduate Research (PhD) Scholarship and a Queensland University of Technology International Postgraduate Research (PhD) Scholarship (2018-2021).

Conflict of Interest Statement: The paper has not been published previously in whole or part. The authors of this manuscript have no conflicts of interest to disclose.

Support: This study received no specific funding.

Financial Disclosure: The authors declare that they have no relevant financial interests.

Summary Points

What was already known on the topic?

• Along with the increasing prevalence of Chronic Kidney Disease, the prevalence of patients in the end stage of renal disease, and the demand for renal replacement therapy have increased over the years

• The ability to predict graft failure before kidney transplant becomes crucial to facilitate donations to the most suitable recipient, and to minimise the flow of patients returning to the already-burdened transplant waiting list

• Several kidney graft outcome prediction models developed using machine learning are available in the literature

What this study added to our knowledge?

• We did a systematic review of the different machine learning methods used to predict graft outcomes among kidney transplant patients, and assessed their predictive performance compared with traditional statistical methods.

• Most of the machine learning based predictive models predicted graft failure with high sensitivity and specificity.

• However, the prediction accuracy provided mixed results when machine learning and traditional regression-based predictive methods were compared.

• Based on reported gains in predictive performance, machine learning has the potential to improve kidney transplant outcome prediction and aid medical decision making.

References

1. Wang T, Xi Y, Lubwama RN, Koro C. Chronic Kidney Disease (CKD) in US Adults with Self-Reported Cardiovascular Disease (CVD)—A National Estimate of Prevalence by KDIGO 2012 Classification. Am Diabetes Assoc; 2018.

2. Valley TS, Nallamothu BK, Heung M, Iwashyna TJ, Cooke CR. Hospital Variation in Renal Replacement Therapy for Sepsis in the United States. Critical care medicine. 2018;46(2):e158-e65.

3. Karam VH, Gasquet I, Delvart V, Hiesse C, Dorent R, Danet C, et al. Quality of life in adult survivors beyond 10 years after liver, kidney, and heart transplantation. Transplantation. 2003;76(12):1699-704.

4. Cecka J, Gritsch H. Why are nearly half of expanded criteria donor (ECD) kidneys not transplanted? American Journal of Transplantation. 2008;8(4):735-6.

5. Masakane I, Taniguchi M, Nakai S, Tsuchida K, Goto S, Wada A, et al. Annual Dialysis Data Report 2015, JSDT Renal Data Registry. Renal Replacement Therapy. 2018;4(1):19.

6. Foley RN, Collins AJ. The USRDS: what you need to know about what it can and can’t tell us about ESRD. Clinical Journal of the American Society of Nephrology. 2013;8(5):845-51.

7. Luxardo R, Kramer A, González-Bedat MC, Massy ZA, Jager KJ, Rosa-Diez G, et al. The epidemiology of renal replacement therapy in two different parts of the world: the Latin American Dialysis and Transplant Registry versus the European Renal Association-European Dialysis and Transplant Association Registry. Revista Panamericana de Salud Pública. 2018;42:e87.

8. Brown TS, Elster EA, Stevens K, Graybill JC, Gillern S, Phinney S, et al. Bayesian modeling of pretransplant variables accurately predicts kidney graft survival. Am J Nephrol. 2012;36(6):561-9.

9. Moore J, He X, Shabir S, Hanvesakul R, Benavente D, Cockwell P, et al. Development and evaluation of a composite risk score to predict kidney transplant failure. American Journal of Kidney Diseases. 2011;57(5):744-51.

10. Foucher Y, Daguin P, Akl A, Kessler M, Ladrière M, Legendre C, et al. A clinical scoring system highly predictive of long-term kidney graft survival. Kidney International. 2010;78(12):1288-94.

11. Tiong H, Goldfarb D, Kattan M, Alster J, Thuita L, Yu C, et al. Nomograms for predicting graft function and survival in living donor kidney transplantation based on the UNOS Registry. The Journal of urology. 2009;181(3):1248-55.

12. Rao PS, Schaubel DE, Guidinger MK, Andreoni KA, Wolfe RA, Merion RM, et al. A comprehensive risk quantification score for deceased donor kidneys: the kidney donor risk index. Transplantation. 2009;88(2):231-6.

13. Kaplan B, Schold J. Transplantation: neural networks for predicting graft survival. Nature Reviews Nephrology. 2009;5(4):190.

14. Ghahramani Z. Probabilistic machine learning and artificial intelligence. Nature. 2015;521(7553):452.

15. Obermeyer Z, Emanuel EJ. Predicting the future—big data, machine learning, and clinical medicine. The New England journal of medicine. 2016;375(13):1216.

16. Patel VL, Shortliffe EH, Stefanelli M, Szolovits P, Berthold MR, Bellazzi R, et al. The coming of age of artificial intelligence in medicine. Artificial intelligence in medicine. 2009;46(1):5-17.

17. Shortliffe EH. The adolescence of AI in medicine: will the field come of age in the '90s? Artificial intelligence in medicine. 1993;5(2):93-106.

18. Lee HC, Yoon HK, Nam K, Cho YJ, Kim TK, Kim WH, et al. Derivation and Validation of Machine Learning Approaches to Predict Acute Kidney Injury after Cardiac Surgery. Journal of clinical medicine. 2018;7(10).

19. Ryu S, Lee H, Lee DK, Park K. Use of a Machine Learning Algorithm to Predict Individuals with Suicide Ideation in the General Population. Psychiatry investigation. 2018:0.

20. Tolmeijer E, Kumari V, Peters E, Williams SCR, Mason L. Using fMRI and machine learning to predict symptom improvement following cognitive behavioural therapy for psychosis. NeuroImage Clinical. 2018;20:1053-61.

21. Xie Y, Jiang B, Gong E, Li Y, Zhu G, Michel P, et al. Use of Gradient Boosting Machine Learning to Predict Patient Outcome in Acute Ischemic Stroke on the Basis of Imaging, Demographic, and Clinical Information. AJR American journal of roentgenology. 2018:1-7.

22. Akl A, Ismail AM, Ghoneim M. Prediction of graft survival of living-donor kidney transplantation: nomograms or artificial neural networks? Transplantation. 2008;86(10):1401-6.

23. Greco R, Papalia T, Lofaro D, Maestripieri S, Mancuso D, Bonofiglio R. Decisional Trees in Renal Transplant Follow-up. Transplantation Proceedings. 2010;42(4):1134-6.

24. Lin RS, Horn SD, Hurdle JF, Goldfarb-Rumyantzev AS. Single and multiple time-point prediction models in kidney transplant outcomes. J Biomed Inform. 2008;41(6):944-52.

25. Qiao N. A systematic review on machine learning in sellar region diseases: quality and reporting items. Endocrine connections. 2019.

26. Brier ME, Ray PC, Klein JB. Prediction of delayed renal allograft function using an artificial neural network. Nephrology Dialysis Transplantation. 2003;18(12):2655-9.

27. Decruyenaere A, Decruyenaere P, Peeters P, Vermassen F, Dhaene T, Couckuyt I. Prediction of delayed graft function after kidney transplantation: comparison between logistic regression and machine learning methods. BMC medical informatics and decision making. 2015;15(1):83.

28. Goldfarb-Rumyantzev AS, Scandling JD, Pappas L, Smout RJ, Horn S. Prediction of 3-yr cadaveric graft survival based on pre-transplant variables in a large national dataset. Clinical Transplantation. 2003;17(6):485-97.

29. Krikov S, Khan A, Baird BC, Barenbaum LL, Leviatov A, Koford JK, et al. Predicting kidney transplant survival using tree-based modeling. Asaio Journal. 2007;53(5):592-600.

30. Lasserre J, Arnold S, Vingron M, Reinke P, Hinrichs C. Predicting the outcome of renal transplantation. Journal of the American Medical Informatics Association. 2012;19(2):255-62.

31. Li J, Serpen G, Selman S, Franchetti M, Riesen M, Schneider C. Bayes net classifiers for prediction of renal graft status and survival period. World Academy of Science, Engineering and Technology. 2010;39.

32. Lofaro D, Maestripieri S, Greco R, Papalia T, Mancuso D, Conforti D, et al. Prediction of Chronic Allograft Nephropathy Using Classification Trees. Transplantation Proceedings. 2010;42(4):1130-3.

33. Nematollahi M, Akbari R, Nikeghbalian S, Salehnasab C. Classification models to predict survival of kidney transplant recipients using two intelligent techniques of data mining and logistic regression. International Journal of Organ Transplantation Medicine. 2017;8(2):119-22.

34. Shahmoradi L, Langarizadeh M, Pourmand G, Fard ZA, Borhani A. Comparing Three Data Mining Methods to Predict Kidney Transplant Survival. Acta Informatica Medica. 2016;24(5):322-7.

35. Shaikhina T, Lowe D, Daga S, Briggs D, Higgins R, Khovanova N. Decision tree and random forest models for outcome prediction in antibody incompatible kidney transplantation. Biomedical Signal Processing and Control. 2017.

36. Tapak L, Hamidi O, Amini P, Poorolajal J. Prediction of Kidney Graft Rejection Using Artificial Neural Network. Healthcare Informatics Research. 2017;23(4):277-84.

37. Topuz K, Zengul FD, Dag A, Almehmi A, Yildirim MB. Predicting graft survival among kidney transplant recipients: A Bayesian decision support model. Decision Support Systems. 2018;106:97-109.

38. Yoo KD, Noh J, Lee H, Kim DK, Lim CS, Kim YH, et al. A Machine Learning Approach Using Survival Statistics to Predict Graft Survival in Kidney Transplant Recipients: A Multicenter Cohort Study. Sci. 2017;7(1):8904-.

39. Petrovsky N, Tam SK, Brusic V, Russ G, Socha L, Bajic VB. Use of artificial neural networks in improving renal transplantation outcomes. Graft. 2002;5(1):6.

40. Moher D, Liberati A, Tetzlaff J, Altman DG. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. Annals of internal medicine. 2009;151(4):264-9.

41. Sousa FS, Hummel AD, Maciel RF, Cohrs FM, Falcão AEJ, Teixeira F, et al., editors. Application of the intelligent techniques in transplantation databases: a review of articles published in 2009 and 2010. Transplantation proceedings; 2011: Elsevier.

42. Cruz JA, Wishart DS. Applications of machine learning in cancer prediction and prognosis. Cancer informatics. 2006;2:117693510600200030.

43. van der Ploeg T, Austin PC, Steyerberg EW. Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. BMC Med Res Methodol. 2014;14(1):137.

44. Schold JD, Segev DL. Increasing the pool of deceased donor organs for kidney transplantation. Nature Reviews Nephrology. 2012;8(6):325.

45. Kattan MW. When and how to use informatics tools in caring for urologic patients. Nature Reviews Urology. 2005;2(4):183.

46. Cawley GC, Talbot NL. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research. 2010;11(Jul):2079-107.

47. Khalaf M, Hussain AJ, Al-Jumeily D, Baker T, Keight R, Lisboa P, et al., editors. A Data Science Methodology Based on Machine Learning Algorithms for Flood Severity Prediction. 2018 IEEE Congress on Evolutionary Computation (CEC); 2018: IEEE.

48. Yousef AH. Extracting software static defect models using data mining. Ain Shams Engineering Journal. 2015;6(1):133-44.

49. Lorena AC, Garcia LP, Lehmann J, Souto MC, Ho TK. How Complex is your classification problem? A survey on measuring classification complexity. arXiv preprint arXiv:180803591. 2018.