Comparative study of machine learning methods

As can be seen in Tables 3.1 to 3.5, we tried three popular machine learning algorithms to build joint and stage-specific survivability models. As discussed earlier, these algorithms are the most popular techniques for building survivability models. We compared the performance of the stage-specific models created with these three algorithms to determine which algorithm yields the best-performing model.

Table 3.6: Algorithm comparison results for breast cancer survivability predictive models.

Stage       Comparison result
Localized   AD3 > Logistic regression > Naïve Bayes
Regional    Logistic regression > Naïve Bayes, AD3 > Naïve Bayes
Distant     No statistically significant differences found between the algorithms used

Table 3.7: Algorithm comparison results for colon cancer survivability predictive models.

Stage       Comparison result
Localized   Logistic regression > AD3 > Naïve Bayes
Regional    Logistic regression > Naïve Bayes, Logistic regression > AD3
Distant     Logistic regression > Naïve Bayes

Table 3.8: Algorithm comparison results for lung & bronchus cancer survivability predictive models.

Stage       Comparison result
Localized   Logistic regression > AD3 > Naïve Bayes
Regional    Logistic regression > AD3 > Naïve Bayes
Distant     Logistic regression > AD3

Tables 3.6 to 3.10 show the results of these comparisons. We used a two-tailed paired t-test to compare corresponding values (values in the same row but obtained with different machine learning techniques), and considered a difference statistically significant when the p-value was less than 0.05. The results are denoted using the “greater than” symbol (>): where one algorithm is statistically significantly better than another, we write it as greater than the other. For some models, we did not find any statistically significant difference.
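As a rough illustration of the comparison procedure described above, the following Python sketch runs a two-tailed paired t-test on two lists of paired scores and reports the outcome in the “greater than” notation used in the tables. The function names, the example score values, and the choice to compute the p-value by numerically integrating the t density are all assumptions made for this illustration; the text does not show how the test was actually implemented.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def paired_t_test(a, b, steps=2000):
    """Two-tailed paired t-test. Returns (t statistic, p-value)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean(diffs) / (sd / math.sqrt(n))
    df = n - 1
    # Simpson's rule (steps must be even) for the probability mass
    # of the t distribution between 0 and |t|.
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = total * h / 3
    # Two-tailed p-value: mass outside the interval [-|t|, |t|].
    p = max(0.0, 1.0 - 2.0 * central_mass)
    return t, p


def compare(name_a, scores_a, name_b, scores_b, alpha=0.05):
    """Express the test outcome in the '>' notation of the tables."""
    t, p = paired_t_test(scores_a, scores_b)
    if p >= alpha:
        return f"No statistically significant difference ({name_a}, {name_b})"
    return f"{name_a} > {name_b}" if t > 0 else f"{name_b} > {name_a}"


if __name__ == "__main__":
    # Hypothetical paired scores (one pair per row), not values from the study.
    logistic = [0.80, 0.82, 0.81, 0.83, 0.79]
    naive_bayes = [0.70, 0.71, 0.72, 0.70, 0.69]
    print(compare("Logistic regression", logistic, "Naïve Bayes", naive_bayes))
```

The test is paired because each row supplies one value per algorithm under the same conditions, so only the per-row differences enter the statistic.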

Tables 3.6 to 3.10 show that logistic regression and AD3 generally perform better than naïve Bayes on the localized and regional stages when building survivability prediction models for the cancers in the study (except for corpus uteri cancer). In general, there is not much difference between the algorithms on distant-stage instances.

Table 3.9: Algorithm comparison results for corpus uteri cancer survivability predictive models.

Stage       Comparison result
Localized   Naïve Bayes > AD3, Naïve Bayes > Logistic regression
Regional    Naïve Bayes > AD3, Logistic regression > AD3
Distant     Naïve Bayes > Logistic regression

Table 3.10: Algorithm comparison results for stomach cancer survivability predictive models.

Stage       Comparison result
Localized   AD3 > Logistic regression > Naïve Bayes
Regional    AD3 > Logistic regression > Naïve Bayes
Distant     No statistically significant differences found

Chapter 4 Conclusion and Outlook

In the past, only joint survival prediction models were built, by training machine learning methods on all stages together, and they were also evaluated on all stages together. In [33], stage-specific models were built for breast cancer survivability, and it was reported that joint models offer no advantage over stage-specific models for predicting breast cancer survivability and are often worse. It was also reported that evaluating on all stages together, as was always done in the past, leads to an overestimation of performance.

In this study, we investigated whether the observations made about breast cancer in [33] also generalize to other cancers. To this end, we built joint and stage-specific survivability predictive models for five cancers: breast, colon, lung & bronchus, corpus uteri, and stomach. We used the SEER dataset along with three machine learning algorithms (naïve Bayes, logistic regression, and decision tree) to build the survivability predictive models. According to the results obtained, joint models do not perform better than stage-specific models, and in most cases their performance is statistically significantly worse. We also saw that the order of importance of the features, as determined using the information gain statistic, is different for each stage, which further supports building a separate model for every stage instead of a joint model. Based on these results, we recommend building separate survivability predictive models for separate stages. We also found that evaluating models on all stages together is misleading, since it tends to overestimate performance. Therefore, to see the real performance of a model, we recommend evaluating it on each stage separately. In summary, we found that the observations reported in [33] for breast cancer also generalize to other cancers.

We also saw that there is no specific pattern that favors one particular algorithm for building accurate predictive models of cancer survivability; all the algorithms used here can give reasonable performance. However, we note that naïve Bayes requires much less computational time, while the models generated with logistic regression and AD3 were better for most cancers. We also observed that the performance of most models was generally worse on the distant stage. In the future, more training data for that stage may help improve performance.

Bibliography

[1] Centers for disease control and prevention-united states cancer statistics (uscs). 2016. https://nccd.cdc.gov/uscs/toptencancers.aspx.

[2] Medical news today. 2016. http://www.medicalnewstoday.com/articles/282929.php.

[3] National cancer institute. 2016. https://www.cancer.gov/about-cancer/understanding/what-is-cancer.

[4] National cancer institute-seer stat fact sheets. 2016. http://seer.cancer.gov/statfacts/html/urinb.html.

[5] National cancer institute-seer stat fact sheets. 2016. https://www.cancer.gov/about-cancer/diagnosis-staging/staging.

[6] Seer data. 2016. http://seer.cancer.gov/data/.

[7] Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov) Research Data (1973-2013), National Cancer Institute, DCCPS, Surveillance Research Program, Surveillance Systems Branch, released April 2016, based on the November 2015 submission. 2016.

[8] A. Agrawal, S. Misra, R. Narayanan, L. Polepeddi, and A. Choudhary. Lung cancer survival prediction using ensemble data mining on seer data. Scientific Programming, 20(1):29–42, 2012.

[9] E. A. Asare, L. Liu, K. R. Hess, E. J. Gordon, J. L. Paruch, B. Palis, A. R. Dahlke, R. McCabe, M. E. Cohen, D. P. Winchester, et al. Development of a model to predict breast cancer survival using data from the national cancer data base. Surgery, 159(2):495–502, 2016.

[10] D. M. Ashley, S. Gupta, T. Tran, L. Wei, P. K. Lorgelly, D. M. Thomas, S. B. Fox, and S. Venkatesh. Machine-learning prediction of cancer survival: A prospective study examining the impact of combining clinical and genomic data. In ASCO Annual Meeting Proceedings, volume 33, page 6521, 2015.

[11] A. Bellaachia and E. Guven. Predicting breast cancer survivability using data mining techniques. Age, 58(13):10–110, 2006.

[12] C. A. Bertelsen, A. U. Neuenschwander, J. E. Jansen, M. Wilhelmsen, A. Kirkegaard-Klitbo, J. R. Tenma, B. Bols, P. Ingeholm, L. A. Rasmussen, L. V. Jepsen, et al. Disease-free survival after complete mesocolic excision compared with conventional colon cancer surgery: a retrospective, population-based study. The Lancet Oncology, 16(2):161–168, 2015.

[13] H. Y. Chang, D. S. Nuyten, J. B. Sneddon, T. Hastie, R. Tibshirani, T. Sørlie, H. Dai, Y. D. He, L. J. van’t Veer, H. Bartelink, et al. Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival. Proceedings of the National Academy of Sciences of the United States of America, 102(10):3738–3743, 2005.

[14] C.-P. Chen. Predicting medical expenditure and survivability of the lung cancer patient by bayesian network. 2016.

[15] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge university press, 2000.

[16] J. A. Cruz and D. S. Wishart. Applications of machine learning in cancer prediction and prognosis. Cancer informatics, 2, 2006.

[17] J. A. Cruz and D. S. Wishart. Applications of machine learning in cancer prediction and prognosis. Cancer informatics, 2, 2006.

[18] D. Delen, G. Walker, and A. Kadam. Predicting breast cancer survivability: a comparison of three data mining methods. Artificial intelligence in medicine, 34(2):113–127, 2005.

[19] D. Delen, G. Walker, and A. Kadam. Predicting breast cancer survivability: a comparison of three data mining methods. Artificial intelligence in medicine, 34(2):113–127, 2005.

[20] D. Dooling, A. Kim, B. McAneny, and J. Webster. Personalized prognostic models for oncology: A machine learning approach. arXiv preprint arXiv:1606.07369, 2016.

[21] S. A. Forbes, D. Beare, P. Gunasekaran, K. Leung, N. Bindal, H. Boutselakis, M. Ding, S. Bamford, C. Cole, S. Ward, et al. Cosmic: exploring the world’s knowledge of somatic mutations in human cancer. Nucleic acids research, 43(D1):D805–D811, 2015.

[22] L. Frederick, D. L. Page, I. D. Fleming, A. G. Fritz, C. M. Balch, D. G. Haller, M. Morrow, et al. AJCC cancer staging manual, volume 1. Springer Science & Business Media, 2002.

[23] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data mining software: an update. ACM SIGKDD explorations newsletter, 11(1):10–18, 2009.

[24] J. Han, J. Pei, and M. Kamber. Data mining: concepts and techniques. Elsevier, 2011.

[25] F. E. Harrell Jr. Cox proportional hazards regression model. In Regression Modeling Strategies, pages 475–519. Springer, 2015.

[26] S. Haykin. Neural networks, a comprehensive foundation. 1994.

[27] M. Hilario, A. Kalousis, M. Müller, and C. Pellegrini. Machine learning approaches to lung cancer prediction from mass spectra. Proteomics, 3(9):1716–1719, 2003.

[28] D. W. Hosmer Jr and S. Lemeshow. Applied logistic regression. John Wiley & Sons, 2004.

[29] N. Japkowicz and M. Shah. Evaluating learning algorithms: a classification perspective. Cambridge University Press, 2011.

[30] T. Jonsdottir, E. T. Hvannberg, H. Sigurdsson, and S. Sigurdsson. The feasibility of constructing a predictive outcome model for breast cancer using the tools of data mining. Expert Systems with Applications, 34(1):108–118, 2008.

[31] K.-W. Jung, Y.-J. Won, C.-M. Oh, H.-J. Kong, H. Cho, D. H. Lee, and K. H. Lee. Prediction of cancer incidence and mortality in korea, 2015. Cancer Research and Treatment, 47(2):142–148, 2015.

[32] M. Kantardzic. Data mining: concepts, models, methods, and algorithms. John Wiley & Sons, 2011.

[33] R. J. Kate and R. Nadig. Stage-specific predictive models for breast cancer survivability. International Journal of Medical Informatics, 97:304–311, 2017.

[34] K. D. Kochanek, S. L. Murphy, J. Xu, and B. Tejada-Vera. Deaths: final data for 2014. National vital statistics reports: from the Centers for Disease Control and Prevention, National Center for Health Statistics, National Vital Statistics System, 65(4):1, 2016.

[35] K. Kourou, T. P. Exarchos, K. P. Exarchos, M. V. Karamouzis, and D. I. Fotiadis. Machine learning applications in cancer prognosis and prediction. Computational and structural biotechnology journal, 13:8–17, 2015.

[36] D. D. Lewis. Naive (bayes) at forty: The independence assumption in information retrieval. In European conference on machine learning, pages 4–15. Springer, 1998.

[37] Y. Liang, H. Chai, X.-Y. Liu, Z.-B. Xu, H. Zhang, and K.-S. Leung. Cancer survival analysis using semi-supervised learning method based on cox and aft models with L1/2 regularization. BMC medical genomics, 9(1):1, 2016.

[38] L. Macyszyn, H. Akbari, J. M. Pisapia, X. Da, M. Attiah, V. Pigrish, Y. Bi, S. Pal, R. V. Davuluri, L. Roccograndi, et al. Imaging patterns predict patient survival and molecular subtype in glioblastoma via machine learning techniques. Neuro-oncology, 18(3):417–425, 2016.

[39] R. S. Michalski, J. G. Carbonell, and T. M. Mitchell. Machine learning: An artificial intelligence approach. Springer Science & Business Media, 2013.

[40] P. M. Odell, K. M. Anderson, and R. B. D’Agostino. Maximum likelihood estimation for interval-censored data using a weibull-based accelerated failure time model. Biometrics, pages 951–959, 1992.

[41] K. Park, A. Ali, D. Kim, Y. An, M. Kim, and H. Shin. Robust predictive model for evaluating breast cancer survivability. Engineering Applications of Artificial Intelligence, 26(9):2194–2205, 2013.

[42] D. N. Pearlman, M. A. Clark, W. Rakowski, and B. Ehrich. Screening for breast and cervical cancers: the importance of knowledge and perceived cancer survivability. Women & health, 28(4):93–112, 1999.

[43] M. D. Podolsky, A. A. Barchuk, V. I. Kuznetcov, N. F. Gusarova, V. S. Gaidukov, and S. A. Tarakanov. Evaluation of machine learning algorithm utilization for lung cancer classification based on gene expression levels. Asian Pacific Journal of Cancer Prevention, 17(2):835–838, 2016.

[44] F. Provost and T. Fawcett. Data Science for Business: What you need to know about data mining and data-analytic thinking. O’Reilly Media, Inc., 2013.

[45] J. R. Quinlan. C4.5: programs for machine learning. Elsevier, 2014.

[46] A. Silva, T. Oliveira, V. Julian, J. Neves, and P. Novais. A mobile and evolving tool to predict colorectal cancer survivability. In IFIP International Conference on Artificial Intelligence Applications and Innovations, pages 14–26. Springer, 2016.

[47] A. C. Society. American cancer society: Cancer facts and figures 2016, 2016.

[48] B. W. Stewart and C. Wild. World cancer report 2014. international agency for research on cancer. World Health Organization, 505, 2014.

[49] D. G. Tang, L. Li, D. P. Chopra, and A. T. Porter. Extended survivability of prostate cancer cells in the absence of trophic factors: increased proliferation, evasion of apoptosis, and the role of apoptosis proteins. Cancer research, 58(15):3466–3479, 1998.

[50] L. Wei. The accelerated failure time model: a useful alternative to the cox regression model in survival analysis. Statistics in medicine, 11(14-15):1871–1879, 1992.

[51] I. H. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2005.

[52] H. M. Zolbanin, D. Delen, and A. H. Zadeh. Predicting overall survivability in comorbidity of cancers: A data mining approach. Decision Support Systems, 74:150–161, 2015.

[53] B. Zupan, J. Demšar, M. W. Kattan, J. R. Beck, and I. Bratko. Machine learning for survival analysis: a case study on recurrence of prostate cancer. Artificial intelligence in medicine, 20(1):59–75, 2000.
