
5.2 Comparison of concept drift adaptation methods

5.2.4 Comparison with windowed methods

We performed the same experiment with the windowed methods: windowed threshold, windowed retrain, and windowed hybrid. Figure 5.4 shows the running accuracy on 1000 consecutive files for baseline and the three windowed methods. The markers in the figure show when the windowed retrain and windowed hybrid methods retrained the machine learning model. Compared to the non-windowed methods in Figure 5.3, the windowed methods have slightly lower accuracy most of the time.

[Figure 5.2 plot: accuracy (0.9675 to 0.9850) and retrain count (0 to 60) against the accuracy threshold for retraining the model (0.89 to 1.0).]

Figure 5.2: Accuracy and retrain count for different accuracy thresholds that decide when to retrain the model. The model is retrained whenever the running accuracy for 1000 consecutive samples falls below the threshold, and there are at least 500 new samples since the model was last retrained. Choosing a higher threshold leads to a higher accuracy but also increases the number of times the model has to be retrained. A threshold of 0.89 or lower means that the model is not retrained, as the running accuracy never falls below 0.89. On the other hand, a threshold of 1.0 means that the model is retrained every 500 samples regardless of its performance. Choosing a threshold higher than 0.98 notably increases the retrain count, while the accuracy does not improve significantly.

CHAPTER 5. EXPERIMENTAL EVALUATION 45

[Figure 5.3 plot: accuracy (0.90 to 1.00) over time (2018-09-01 to 2019-01-01) for the baseline, threshold, retrain, and hybrid methods, with markers for the model retrains of the retrain and hybrid methods.]

Figure 5.3: Running accuracy of 1000 consecutive samples for baseline, threshold, retrain, and hybrid methods. The machine learning model was retrained with retrain and hybrid methods every 500 samples if the accuracy was below 0.98. The markers at the bottom indicate the times when the model was retrained. The baseline method has large spikes indicating sudden decreases in accuracy for files seen in the middle of November and in early December. In comparison, threshold, retrain, and hybrid methods are able to adapt and maintain higher accuracy. Most of the time, hybrid and retrain methods achieve the best accuracy.

Table 5.3: Total accuracy for the different methods of adapting to concept drift. For each method, the model was initially trained using the first half of the data and the accuracy was evaluated using the second half.

Method              Accuracy
baseline            0.967
threshold           0.973
retrain             0.984
hybrid              0.985

Table 5.4: Total accuracy for the windowed versions of methods for adapting to concept drift. The accuracy for each method was evaluated using the second half of files in our dataset.

Method              Accuracy
windowed threshold  0.973
windowed retrain    0.981
windowed hybrid     0.981

In some periods, the baseline method performs better than windowed retrain and windowed hybrid. This could be explained by the fact that the baseline model was trained using half of the dataset (30733 files), whereas the models in the windowed retrain and windowed hybrid methods were trained and retrained using only 10000 files. However, like the non-windowed methods for concept drift adaptation, the windowed variants are also able to mitigate the sudden drops in accuracy that appear with baseline.
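The retraining rule stated in the caption of Figure 5.2, combined with a window of the 10000 most recent files, can be sketched roughly as follows. This is a minimal illustration and not the implementation used in the experiments: the class name `WindowedRetrainPolicy`, its method names, and the choice to wait for a full evaluation window before judging accuracy are all assumptions made for the example.

```python
from collections import deque


class WindowedRetrainPolicy:
    """Hypothetical retrain trigger with a sliding training window."""

    def __init__(self, threshold=0.98, eval_size=1000,
                 min_new_samples=500, window_size=10000):
        self.threshold = threshold
        self.min_new_samples = min_new_samples
        # 1 for a correct prediction, 0 for a wrong one, over the last eval_size files.
        self.outcomes = deque(maxlen=eval_size)
        # Only the most recent window_size labelled files are kept for training.
        self.window = deque(maxlen=window_size)
        self.new_since_retrain = 0

    def observe(self, features, label, predicted):
        """Record one newly labelled file and the model's prediction for it."""
        self.window.append((features, label))
        self.outcomes.append(int(predicted == label))
        self.new_since_retrain += 1

    def running_accuracy(self):
        """Accuracy over the recorded outcomes (1.0 if nothing is recorded yet)."""
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_retrain(self):
        # Assumption: wait until the evaluation window is full.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        # Require enough new samples since the last retrain.
        if self.new_since_retrain < self.min_new_samples:
            return False
        return self.running_accuracy() < self.threshold

    def training_data(self):
        """Return the current window and reset the new-sample counter."""
        self.new_since_retrain = 0
        return list(self.window)
```

A caller would feed each newly labelled file to `observe`, and whenever `should_retrain()` returns true, fit a fresh model on `training_data()`. A non-windowed variant would differ only in keeping every file instead of bounding `window`.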

Overall, the windowed methods also improve on the baseline method by achieving higher accuracy. Table 5.4 gives the total accuracy for each of the windowed methods. While the windowed versions are more scalable, the retrain and hybrid methods in Table 5.3 achieved better accuracy in our experiment. This suggests that the windowed variants would benefit from more closely mimicking the behaviour of the non-windowed methods by using a larger window size.

While windowed retrain and windowed hybrid achieved similar accuracy, windowed hybrid retrained the machine learning model slightly fewer times. The model retrain counts for the retrain, hybrid, windowed retrain, and windowed hybrid methods are presented in Table 5.5. Overall, the hybrid methods outperform the respective retrain methods by achieving similar or better accuracy while retraining the model fewer times. While the windowed variants retrained the model more often, each retraining takes less time than for the non-windowed methods, because the non-windowed versions train the model on increasing amounts of data as more files are collected over time.

Table 5.5: The number of times different methods retrained the machine learning model when evaluated on the second half of our dataset. While the windowed variants retrained the model more often, each time the training took approximately the same amount of time. In contrast, with the non-windowed versions, training took increasingly longer amounts of time each time the model was retrained because all of the accumulated previous data was used for training. The hybrid and windowed hybrid did not need to retrain the model as often as the respective retrain methods.

Method              Retrain count
retrain             23
hybrid              19
windowed retrain    32
windowed hybrid     30
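To make the cost trade-off concrete, the following sketch counts how many training samples are processed in total over all retrains. It is only a back-of-the-envelope illustration: the function `total_training_samples` is hypothetical, and the even spacing of retrains over the stream is an assumption that does not match the experiment, where retrains cluster around drops in accuracy. The file counts and retrain counts are taken from the text and Table 5.5.

```python
def total_training_samples(initial_size, new_files, retrain_count, window_size=None):
    """Total number of samples processed over all retrains.

    Assumes, purely for illustration, that retrains are evenly spaced
    over the stream of new files.
    """
    step = new_files / (retrain_count + 1)
    total = 0.0
    for k in range(1, retrain_count + 1):
        available = initial_size + k * step  # all files seen so far
        if window_size is not None:
            # Windowed variant: train only on the most recent window_size files.
            available = min(available, window_size)
        total += available
    return total


# 30733 files for initial training, 30733 files in the evaluated second half.
non_windowed = total_training_samples(30733, 30733, retrain_count=23)
windowed = total_training_samples(30733, 30733, retrain_count=32, window_size=10000)
print(non_windowed, windowed)
```

For the windowed retrain method the total does not depend on the spacing assumption: 32 retrains on 10000 files each process 320000 samples. The non-windowed total depends on when the retrains actually occurred, but every retrain uses at least the 30733 initial files.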

[Figure 5.4 plot: accuracy (0.90 to 1.00) over time (2018-09-01 to 2019-01-01) for the baseline, windowed threshold, windowed retrain, and windowed hybrid methods, with markers for the model retrains of the windowed retrain and windowed hybrid methods.]

Figure 5.4: Running accuracy of 1000 consecutive samples for baseline, windowed threshold, windowed retrain, and windowed hybrid methods. The markers at the bottom indicate the times when the model was retrained with windowed retrain and windowed hybrid methods.

Chapter 6

Conclusions and discussion

In this Master’s thesis, we created a classifier for malware detection. For this purpose, we collected a dataset of 19648 malware and 41818 benign files and extracted features using static analysis of the files. In our experiments, we compared four different classifiers: random forest, gradient boosted decision trees, neural network, and support vector machines. Gradient boosted decision trees achieved higher accuracy and AUC than the other methods. Furthermore, by using a random forest model to select the most important features before training the gradient boosted decision trees model, we were able to reduce the total training time of our classifier. These results support previous research showing that gradient boosted decision trees perform well in the task of malware detection using static analysis [30].

Additionally, we tested the classifier on files that were sorted by when they were first seen by F-Secure. The classification accuracy was lower on temporally ordered data than when using cross-validation, which indicates that concept drift should be taken into account in malware detection. To deal with concept drift, we then compared the retrain, threshold, and hybrid methods, which adapt the classifier to changes in the data over time. All three methods improved the accuracy of the classifier, and the best results were achieved by the hybrid method, which merges the two other methods. These results show that combining different methods for adapting to concept drift can give a better malware detector than the individual methods.

We also compared the methods for adapting to concept drift with windowed variants that only consider the most recent files. While the windowed variants are more practical when the number of files is large, in our experiments the windowed methods achieved lower accuracy than the original methods. To better assess how forgetting older files influences malware detection, more studies on windowed methods should be carried out using a larger dataset.

In this thesis, the methods for adapting to changes in the data require that we obtain labels indicating if new files are malicious or clean. However, having human analysts create a large number of these labels is time-consuming and expensive. Future work could study how labeling only a fraction of the new files would impact the classifier over time and how to choose which files should be labeled by an analyst. In addition, it remains to be studied whether the classifier becomes unreliable if it is updated by using labels that are automatically created. Alternatively, further research could incorporate dynamic analysis of files for potentially more accurate malware detection.

Bibliography

[1] AV-Test. Test modules under windows, 2019. https://www.av-test.org/en/about-the-institute/test-procedures/. Accessed 20.9.2019.

[2] AV-Test. Malware statistics & trends report, 2019. https://www.av-test.org/en/statistics/malware/. Accessed 26.7.2019.

[3] Stephen H Bach and Marcus A Maloof. Paired learners for concept drift. In 2008 Eighth IEEE International Conference on Data Mining, pages 23–32. IEEE, 2008.

[4] Zahra Bazrafshan, Hashem Hashemi, Seyed Mehdi Hazrati Fard, and Ali Hamzeh. A survey on heuristic malware detection techniques. In The 5th Conference on Information and Knowledge Technology, pages 113–120. IEEE, 2013.

[5] BBC News. Where has ransomware hit hardest?, 2017. https://www.bbc.com/news/world-39919249. Accessed 5.8.2019.

[6] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

[7] Christopher JC Burges. A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2):121–167, 1998.

[8] Carnegie Mellon University. CERT Advisory CA-1999-02 Trojan Horses, 1999. https://resources.sei.cmu.edu/asset_files/WhitePaper/1999_019_001_496184.pdf. Accessed 10.9.2019.

[9] Ero Carrera. pefile, 2018. https://github.com/erocarrera/pefile. Accessed 18.9.2019.

[10] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794. ACM, 2016.

[11] Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, and Yuan Tang. Xgboost: extreme gradient boosting. R package version 0.4-2, pages 1–4, 2015.

[12] Fred Cohen. Computer viruses: theory and experiments. Computers & security, 6(1):22–35, 1987.

[13] Nello Cristianini, John Shawe-Taylor, et al. An introduction to support vector machines and other kernel-based learning methods. Cambridge university press, 2000.

[14] Tom Dietterich. Overfitting and undercomputing in machine learning. ACM computing surveys, 27(3):326–327, 1995.

[15] Ke-Lin Du and Madisetti NS Swamy. Neural networks and statistical learning. Springer Science & Business Media, 2013.

[16] Eastlake, Donald and Jones, Paul. US secure hash algorithm 1 (SHA1), 2001.

[17] Tom Fawcett. An introduction to roc analysis. Pattern recognition letters, 27(8):861–874, 2006.

[18] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.

[19] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM computing surveys (CSUR), 46(4):44, 2014.

[20] Ekta Gandotra, Divya Bansal, and Sanjeev Sofat. Malware analysis and classification: A survey. Journal of Information Security, 5(02):56, 2014.

[21] Thomas Hancock, Tao Jiang, Ming Li, and John Tromp. Lower bounds on learning decision lists and trees. Information and Computation, 126(2):114–122, 1996.

[22] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference and prediction. Springer, 2nd edition, 2009.

[23] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A practical guide to support vector classification, 2003.

BIBLIOGRAPHY 53

[24] Hemant Ishwaran. The effect of splitting on random forests. Machine Learning, 99(1):75–118, 2015.

[25] Peter H. M. Janssen, Petre Stoica, T Söderström, and Pieter Eykhoff. Model structure selection for multivariable systems by cross-validation methods. International Journal of Control, 47(6):1737–1758, 1988.

[26] Roberto Jordaney, Kumar Sharad, Santanu K Dash, Zhi Wang, Davide Papini, Ilia Nouretdinov, and Lorenzo Cavallaro. Transcend: Detecting concept drift in malware classification models. In 26th USENIX Security Symposium (USENIX Security 17), pages 625–642, 2017.

[27] Alex Kantchelian, Sadia Afroz, Ling Huang, Aylin Caliskan Islam, Brad Miller, Michael Carl Tschantz, Rachel Greenstadt, Anthony D Joseph, and JD Tygar. Approaches to adversarial drift. In Proceedings of the 2013 ACM workshop on Artificial intelligence and security, pages 99–110. ACM, 2013.

[28] Saurabh Karsoliya. Approximating number of hidden layer neurons in multiple hidden layer BPNN architecture. International Journal of Engineering Trends and Technology, 3(6):714–717, 2012.

[29] Ron Kohavi and George H John. Wrappers for feature subset selection. Artificial intelligence, 97(1-2):273–324, 1997.

[30] J Zico Kolter and Marcus A Maloof. Learning to detect and classify malicious executables in the wild. Journal of Machine Learning Research, 7(Dec):2721–2744, 2006.

[31] SB Kotsiantis, Dimitris Kanellopoulos, and PE Pintelas. Data preprocessing for supervised leaning. International Journal of Computer Science, 1(2):111–117, 2006.

[32] Sotiris B Kotsiantis, I Zaharakis, and P Pintelas. Supervised machine learning: A review of classification techniques. Emerging artificial intelligence applications in computer engineering, 160:3–24, 2007.

[33] Simon Kramer and Julian C Bradfield. A general definition of malware. Journal in computer virology, 6(2):105–114, 2010.

[34] Cheng Li. A Gentle Introduction to Gradient Boosting, 2019. http://www.chengli.io/tutorials/gradient_boosting.pdf. Accessed 9.10.2019.

[35] Gilles Louppe, Louis Wehenkel, Antonio Sutera, and Pierre Geurts. Understanding variable importances in forests of randomized trees. In Advances in neural information processing systems, pages 431–439, 2013.

[36] Mead, Steve. Unique file identification in the national software reference library. Digital Investigation, 3(3):138–150, 2006.

[37] Brad Miller, Alex Kantchelian, S Afroz, R Bachwani, R Faizullabhoy, L Huang, V Shankar, MC Tschantz, Tony Wu, George Yiu, et al. Back to the future: Malware detection with temporally consistent labels. Under submission, 2015.

[38] Andreas Moser, Christopher Kruegel, and Engin Kirda. Limits of static analysis for malware detection. In Twenty-Third Annual Computer Security Applications Conference (ACSAC 2007), pages 421–430. IEEE, 2007.

[39] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[40] Matt Pietrek. Peering inside the pe: a tour of the win32 (r) portable executable file format. Microsoft Systems Journal-US Edition, 9(3):15–38, 1994.

[41] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

[42] ReversingLabs. TitaniumScale Enterprise Scale File Analysis, 2019. https://cdn2.hubspot.net/hubfs/3375217/ReversingLabs%20Data%20Sheets/2019-Jan-RL-TitaniumScale-Integration-Datasheet-EN.pdf. Accessed 10.9.2019.

[43] Lior Rokach and Oded Maimon. Data Mining with Decision Trees: Theory and Applications. Singapore: World Scientific, 2008.

[44] S Rasoul Safavian and David Landgrebe. A survey of decision tree classifier methodology. IEEE transactions on systems, man, and cybernetics, 21(3):660–674, 1991.

[45] Igor Santos, Yoseba K Penya, Jaime Devesa, and Pablo Garcia Bringas. N-grams-based file signatures for malware detection. ICEIS (2), 9:317–320, 2009.


[46] Warren S. Sarle. Neural Network FAQ, 2002. http://www.faqs.org/faqs/ai-faq/neural-nets/. Accessed 23.10.2019.

[47] Michael Satran, Mark Leblanc, Colin Robertson, Karl Bridge, John Kennedy, Drew Batchelor, and Christopher Warrington. PE Format - Windows applications, 2019. https://docs.microsoft.com/en-us/windows/win32/debug/pe-format. Accessed 17.9.2019.

[48] Joshua Saxe and Konstantin Berlin. Deep neural network based malware detection using two dimensional binary program features. In 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), pages 11–20. IEEE, 2015.

[49] Enrique F Schisterman, David Faraggi, Benjamin Reiser, and Jessica Hu. Youden index and the optimal threshold for markers with mass at zero. Statistics in medicine, 27(2):297–315, 2008.

[50] Matthew G Schultz, Eleazar Eskin, F Zadok, and Salvatore J Stolfo. Data mining methods for detection of new malicious executables. In Proceedings 2001 IEEE Symposium on Security and Privacy. S&P 2001, pages 38–49. IEEE, 2000.

[51] Anshuman Singh, Andrew Walenstein, and Arun Lakhotia. Tracking concept drift in malware families. In Proceedings of the 5th ACM workshop on Security and artificial intelligence, pages 81–92. ACM, 2012.

[52] KP Soman, Shyam Diwakar, and V Ajay. Data mining: theory and
