9 Future work - Challenges and Open Questions of Machine Learning in Computer Security

Although the preceding sections have presented some solutions to the problems identified in the Introduction, these problems are far from solved. On the contrary, many questions remain unanswered and new questions have been raised.

Structured domains and learning from unlabeled data

Section 5 has described a practical and general approach to reflecting the structure of the data in the architecture of a neural network. Its main advantages are simplicity and generality, as almost any data with a tree structure (e.g. JSON documents) can be encoded in the architecture of a neural network with little effort. Moreover, any progress within the field of neural networks, such as new optimization methods, layers, transfer functions, etc., can be readily applied.

Presently, the solution is applicable only to supervised training problems. For simple data represented by vectors, there exist works trying to decrease the dependency on labeled samples, e.g. semi-supervised learning [31,49] or one-shot learning [54,58,25]. At the moment, the candidate is not aware of any similar solution for structured data. Similarly, prior art on anomaly detection for structured data is missing, except for works restricted to the simplest case of multi-instance learning [40,33,30].

The simplicity with which the presented framework allows structured data to be encoded is so impressive that it motivates us to work further in this area, mainly along the lines of decreasing the number of labeled samples needed. There are also interesting theoretical questions. It is not known how general the framework is and what its limitations are. For example, the choice of the aggregation function (mean, maximum, or a learned function) is not entirely clear. The theory behind MMD suggests that the mean should be sufficiently general to distinguish between any two probability distributions, yet for some applications the maximum is clearly more efficient in the setting of finite samples and finite computational resources. Similarly, it is not known whether the solution would suffer from vanishing gradients or forgetting, as recurrent neural networks do, when the tree grows taller. Some of these questions can be answered by applying the model to as many problems as possible, but some theoretical justification would be in order.
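The trade-off between the mean and the maximum can be made concrete with a toy example. The following is a minimal sketch, not the framework of Section 5: the helper `embed_bag` and the one-dimensional bags are hypothetical and were chosen only to show the finite-sample behaviour. A single unusual instance among 99 ordinary ones shifts the mean embedding by only 0.01, while the maximum shifts by 1. Conversely, two bags with the same maximum but a different composition are separated only by the mean.

```python
import numpy as np


def embed_bag(instances, aggregate="mean"):
    """Collapse a bag of shape (n_instances, n_features) into one vector."""
    instances = np.asarray(instances, dtype=float)
    if aggregate == "mean":
        return instances.mean(axis=0)
    if aggregate == "max":
        return instances.max(axis=0)
    raise ValueError(f"unknown aggregation: {aggregate}")


# Hypothetical bags: 99 ordinary instances, with and without one outlier.
ordinary = np.zeros((99, 1))
bag_clean = ordinary
bag_outlier = np.vstack([ordinary, [[1.0]]])

mean_gap = abs(embed_bag(bag_outlier, "mean") - embed_bag(bag_clean, "mean"))[0]
max_gap = abs(embed_bag(bag_outlier, "max") - embed_bag(bag_clean, "max"))[0]

# Two bags with the same maximum but a different share of large values.
bag_a = np.array([[0.0], [1.0]])
bag_b = np.array([[1.0], [1.0]])
same_max = embed_bag(bag_a, "max")[0] == embed_bag(bag_b, "max")[0]
mean_differs = embed_bag(bag_a, "mean")[0] != embed_bag(bag_b, "mean")[0]

print(mean_gap, max_gap, same_max, mean_differs)
```

In a real model the instances would first pass through learned layers, so the example only isolates the effect of the aggregation step itself.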

Game theory (GT) is by all means a correct mathematical tool for describing the adversarial nature of intrusion detection systems, but it is rarely used. The candidate's own experience suggests that this is because GT is computationally expensive, an improvement over non-GT models is hard to measure, GT frequently uses simplified and inaccurate models of the environment, and attackers are not always rational.

Why is it difficult to assess the improvement of GT models? Supervised classifiers are developed on data collected and labeled in the past, and these data are frequently used for the evaluation as well. This corresponds to a scenario in which the defender (detector) knows the attacker's strategy, which seems bizarre, yet it is used in all antivirus and intrusion detection solutions. The reason is the presence of many attackers employing similar tools against many different targets. Detecting already known attacks is therefore a good strategy, because a detector that misses them would be considered useless. Moreover, the detector will be evaluated on known attacks by third-party evaluators, and if it fails to detect them, its vendor will quickly go out of business. This implies that the success of GT solutions has to be measured in the long term, mainly on attacks of a new type, which is rarely done.

Solving the problem using game theory means that the detector has to be optimized with respect to all possible attacks. Putting aside for a moment how these attacks can be found, the detector has to make a trade-off between the false positive rate and detection accuracy. Increasing detection accuracy on unknown attacks either increases the false positive rate or decreases detection accuracy on known attacks, which again goes against the usual evaluation framework.

Although solving GT models can be complex, the complexity depends on the assumptions and the chosen type of equilibrium. Experimental results in [18] imply that if the attacker has full knowledge of the domain, the solution can be found quickly, and it works well enough in the case where the attacker does not have full knowledge, for which the solution is expensive to find. Similarly, finding a Stackelberg equilibrium [12] might be easier than finding other types of equilibria, while the loss of performance might not be dramatic. Finally, there remains the open question of whether it is better to have a sub-optimal solution of a precise model or an optimal solution of an imprecise one. The superiority of neural networks over support vector machines (or Gaussian processes) suggests that the former approach leads to more interesting solutions.
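To illustrate why a Stackelberg equilibrium can be cheap to compute, the sketch below enumerates the leader's pure commitments in a small bimatrix game and lets the follower best-respond to each of them. It is only an illustration under strong assumptions: the function `pure_stackelberg` and the payoff matrices are invented for this example, the defender (leader) is restricted to pure strategies, and ties are broken in the defender's favour. It is not the method of [12].

```python
import numpy as np


def pure_stackelberg(leader_payoff, follower_payoff):
    """Best pure commitment of the leader when the follower best-responds.

    Rows index leader actions, columns index follower actions.
    Returns (leader_action, follower_action, leader_value).
    """
    leader_payoff = np.asarray(leader_payoff, dtype=float)
    follower_payoff = np.asarray(follower_payoff, dtype=float)
    best = None
    for i in range(leader_payoff.shape[0]):
        # All follower actions that are optimal against leader action i.
        responses = np.flatnonzero(follower_payoff[i] == follower_payoff[i].max())
        # Assumption: ties are broken in favour of the leader.
        j = responses[np.argmax(leader_payoff[i, responses])]
        value = leader_payoff[i, j]
        if best is None or value > best[2]:
            best = (i, int(j), float(value))
    return best


# Hypothetical zero-sum game: rows are two detector settings,
# columns are two attack types; the attacker gains what the defender loses.
defender = np.array([[-1.0, -4.0],
                     [-3.0, -2.0]])
attacker = -defender

solution = pure_stackelberg(defender, attacker)
print(solution)
```

The loop visits each leader action once and scans the follower's payoffs for it, so the cost is linear in the size of the payoff matrix. Allowing the leader to commit to a mixed strategy would require solving linear programs instead.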

Finding a game-theoretic optimum is also interesting from the point of view of anomaly detectors, as it converts the problem of anomaly detection into that of supervised classification, which is a more thoroughly researched area. Moreover, it would change the paradigm of the intrusion detection system: instead of detecting attacks similar to those already seen in the wild and identified by a security analyst, it would be able to detect attacks never seen before.

lished [13,41,57], they are restricted to simple domains without complicated constraints. These constraints can be, for example, the practical feasibility of the attack and the satisfaction of requirements on the attack (e.g. the number of passwords tried in brute-force password cracking). Solutions to this problem will probably combine tools from many fields, such as planning, reinforcement learning, supervised learning, constraint satisfaction, and automation of test development. The benefits of automatically finding new attacks go beyond game-theoretic optimization. It can help to secure critical systems by identifying security holes, or it can guide the representation of the application domain, as it can reveal which parts are not modeled yet are important for security.

Neyman-Pearson classification

The Neyman-Pearson classification paradigm [53,52] seems more appropriate for security domains than the usual Bayesian approach, because it is easier to limit the number of false alarms (or the total number of alarms) than to define costs for all types of errors and to know the class ratio between malicious and benign use. Moreover, as identified in [8], Neyman-Pearson classification is important for learning from positive-unlabeled data, which arises in security domains.
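The idea of bounding the false alarm rate directly can be sketched in a few lines. The example below is a simplified illustration, not one of the solutions of Section 7: it assumes that a detector already produces a score per sample, and it only selects the decision threshold so that the empirical false alarm rate on benign data does not exceed a chosen level `alpha`. The function name `np_threshold` and all scores are hypothetical.

```python
import numpy as np


def np_threshold(benign_scores, alpha):
    """Return a threshold t such that at most a fraction alpha of the
    benign scores is strictly greater than t (alarm when score > t)."""
    scores = np.sort(np.asarray(benign_scores, dtype=float))
    n = len(scores)
    k = int(np.floor(alpha * n))  # benign samples allowed to raise an alarm
    if k >= n:
        return -np.inf  # every sample may raise an alarm
    return scores[n - k - 1]


benign = np.arange(1, 11) / 10.0              # hypothetical scores 0.1, ..., 1.0
malicious = np.array([0.5, 0.85, 0.95, 1.2])  # hypothetical attack scores

t = np_threshold(benign, alpha=0.2)
false_alarm_rate = float(np.mean(benign > t))
detection_rate = float(np.mean(malicious > t))
print(t, false_alarm_rate, detection_rate)
```

Note that no misclassification costs and no class ratio enter the computation; the only user-supplied quantity is the tolerated false alarm rate.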

Solutions presented in Section 7 have been developed and experimentally evaluated with linear classifiers. It is not yet known how to extend them to non-linear classifiers. Generally, there seem to be few works dealing with this problem, especially in conjunction with non-linear classifiers such as neural networks. The optimization problem there is not convex, so there will be no guarantees of optimality, yet the solution might lead to more understandable hyper-parameters and more precise classification around the operating point of interest.

References

[1] https://www.darpa.mil/program/cyber-grand-challenge.

[2] http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/.

[3] https://www.cisco.com/c/en/us/products/collateral/security/cognitive-threat-analytics/datasheet-c78-736557.html.

[4] https://www.omnicoreagency.com/facebook-statistics/.

[5] miproblems.org.

[6] Jaume Amores. Multiple instance classification: Review, taxonomy and comparative study. Artificial Intelligence, 201:81–105, 2013.

tion. Journal of Machine Learning Research, 11(Nov):2973–3009, 2010.

[9] Stephen Boyd, Corinna Cortes, Mehryar Mohri, and Ana Radovanovic. Accuracy at the top. In Advances in neural information processing systems, pages 953–961, 2012.

[10] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and regression trees. CRC press, 1984.

[11] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. Lof: identifying density-based local outliers. In ACM sigmod record, volume 29, pages 93–104. ACM, 2000.

[12] Michael Brückner and Tobias Scheffer. Stackelberg games for adversarial prediction problems. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 547–555. ACM, 2011.

[13] S. Rota Bulò, B. Biggio, I. Pillai, M. Pelillo, and F. Roli. Randomized prediction games for adversarial machine learning. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2466–2478, Nov 2017.

[14] Christian Cachin. An information-theoretic model for steganography. In International Workshop on Information Hiding, pages 306–318. Springer, 1998.

[15] Kevin M Carter, Nwokedi Idika, and William W Streilein. Probabilistic threat propagation for network security. IEEE Transactions on Information Forensics and Security, 9(9):1394–1405, 2014.

[16] Andreas Christmann and Ingo Steinwart. Universal kernels on non-standard input spaces. In Advances in neural information processing systems, pages 406–414, 2010.

[17] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13(Mar):795–828, 2012.

[18] Karel Durkota, Viliam Lisý, Christopher Kiekintveld, Karel Horák, Branislav Bošanský, and Tomáš Pevný. Optimal strategies for detecting data exfiltration by internal and external attackers. In Stefan Rass, Bo An, Christopher Kiekintveld, Fei Fang, and Stefan Schauer, editors, Decision and Game Theory for Security, pages 171–192, Cham, 2017. Springer International Publishing.

[19] Tomáš Filler, Andrew D Ker, and Jessica Fridrich. The square root law of steganographic capacity for Markov covers. In Media Forensics and Security, volume 7254, page 725408. International Society for Optics and Photonics, 2009.

[20] Yoav Freund, Robert E Schapire, et al. Experiments with a new boosting algorithm. In ICML, volume 96, pages 148–156. Bari, Italy, 1996.

main. Computer Networks, 107:55–63, 2016. Machine learning, data mining and Big Data frameworks for network monitoring and troubleshooting.

[23] M. Grill, T. Pevný, and M. Rehák. Reducing false positives of network anomaly detection by local adaptive multivariate smoothing. Journal of Computer and System Sciences, 83(1):43–57, 2017.

[24] Jan Jusko, Martin Rehak, Jan Stiborek, Jan Kohout, and Tomas Pevny. Using behavioral similarity for botnet command-and-control discovery. IEEE Intelligent Systems, 31(5):16–22, 2016.

[25] L. Kaiser, O. Nachum, A. Roy, and S. Bengio. Learning to remember rare events. In 5th International Conference on Learning Representations (ICLR) (workshop poster), May 2017.

[26] Andrew D Ker. Batch steganography and pooled steganalysis. In International Workshop on Information Hiding, pages 265–281. Springer, 2006.

[27] Andrew D Ker. A curiosity regarding steganographic capacity of pathologically non-stationary sources. In Media Watermarking, Security, and Forensics III, volume 7880, page 78800E. International Society for Optics and Photonics, 2011.

[28] Andrew D. Ker. The square root law of steganography: Bringing theory closer to practice. In Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security, IH&MMSec ’17, pages 33–44, New York, NY, USA, 2017. ACM.

[29] Andrew D Ker and Tomáš Pevný. A new paradigm for steganalysis via clustering. In Media Watermarking, Security, and Forensics III, volume 7880, page 78800U. International Society for Optics and Photonics, 2011.

[30] Andrew D Ker and Tomáš Pevný. The steganographer is the outlier: realistic large-scale steganalysis. IEEE Transactions on information forensics and security, 9(9):1424–1435, 2014.

[31] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.

[32] J. Kohout and T. Pevný. Automatic discovery of web servers hosting similar applications. In 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM), pages 1310–1315, May 2015.

[33] Jan Kohout and Tomáš Pevný. Network traffic fingerprinting based on approximated kernel two-sample test. IEEE Transactions on Information Forensics and Security, 13(3):788–801, 2018.

traffic anomalies. In ACM SIGCOMM Computer Communication Review, volume 34, pages 219–230. ACM, 2004.

[37] Aleksandar Lazarevic and Vipin Kumar. Feature bagging for outlier detection. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 157–166. ACM, 2005.

[38] Xin Li, Fang Bian, Mark Crovella, Christophe Diot, Ramesh Govindan, Gianluca Iannaccone, and Anukool Lakhina. Detection and identification of network anomalies using sketch subspaces. In Proceedings of the 6th ACM SIGCOMM conference on Internet measurement, pages 147–152. ACM, 2006.

[39] Siwei Lyu and Hany Farid. Steganalysis using higher-order image statistics. IEEE transactions on Information Forensics and Security, 1(1):111–119, 2006.

[40] Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, and Bernhard Schölkopf. Learning from distributions via support measure machines. In Advances in neural information processing systems, pages 10–18, 2012.

[41] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 582–597. IEEE, 2016.

[42] Tomáš Pevný. Loda: Lightweight on-line detector of anomalies. Machine Learning, 102(2):275–304, 2016.

[43] Tomáš Pevný and Jessica Fridrich. Novelty detection in blind steganalysis. In Proceedings of the 10th ACM workshop on Multimedia and security, pages 167–176. ACM, 2008.

[44] Tomáš Pevný and Andrew D Ker. The challenges of rich features in universal steganalysis. In Media Watermarking, Security, and Forensics 2013, volume 8665, page 86650M. International Society for Optics and Photonics, 2013.

[45] Tomáš Pevný and Andrew D Ker. Towards dependable steganalysis. In Media Watermarking, Security, and Forensics 2015, volume 9409, page 94090I. International Society for Optics and Photonics, 2015.

[46] Tomáš Pevný, Martin Rehák, and Martin Grill. Identifying suspicious users in corporate networks. In Proceedings of workshop on information forensics and security, pages 1–6, 2012.

[47] Tomas Pevny and Petr Somol. Discriminative models for multi-instance problems with tree structure. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, AISec ’16, pages 83–91, New York, NY, USA, 2016. ACM.

[50] M. Rehák, M. Pěchouček, M. Grill, J. Stiborek, K. Bartoš, and P. Čeleda. Adaptive multiagent system for network traffic monitoring. IEEE Intelligent Systems, 24(3):16–25, May 2009.

[51] Martin Rehak, Michal Pechoucek, Karel Bartos, Martin Grill, and Pavel Celeda. Network intrusion detection by means of community of trusting agents. In Proceedings of the 2007 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT ’07, pages 498–504, Washington, DC, USA, 2007. IEEE Computer Society.

[52] Clayton Scott. Performance measures for neyman–pearson classification. IEEE Transactions on Information Theory, 53(8):2852–2863, 2007.

[53] Clayton Scott and Robert Nowak. A neyman-pearson approach to statistical learning. IEEE Transactions on Information Theory, 51(11):3806–3819, 2005.

[54] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In Advances in neural information processing systems, pages 935–943, 2013.

[55] Jan Stiborek, T. Pevný, and Martin Rehák. Multiple instance learning for malware classification. Expert Systems with Applications, 93:346–357, 2018.

[56] Jan Stiborek, Tomáš Pevný, and Martin Rehák. Probabilistic analysis of dynamic malware traces. Computers & Security, 2018.

[57] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. 12 2013.

[58] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.

102(2), pp. 275-304, IF = 1.848. Authorship = 100%.

p.53 - Grill, M., Pevný, T., Rehák, M.: Reducing False Positives of Network Anomaly Detection by Local Adaptive Multivariate Smoothing. Journal of Computer and System Sciences. 2017, 83(1), pp. 43-57, IF = 1.678, Authorship = 20%.

p.69 - Ker, A. D. and Pevný, T.: The Steganographer is the Outlier: Realistic Large-Scale Steganalysis. IEEE Transactions on Information Forensics and Security. 2014, 9(9), pp. 1424-1435, IF = 4.332, Authorship = 50%.

p.81 - Kohout, J. and Pevný, T.: Network traffic fingerprinting based on approximated kernel two-sample test. IEEE Transactions on Information Forensics and Security. 2017, PP(99), IF = 4.332, Authorship = 50%.

p.95 - Pevný, T. and Somol, P.: Using Neural Network Formalism to Solve Multiple-Instance Problems. In: Advances in Neural Networks - ISNN 2017. 2017, pp. 135–142. Authorship = 50%.

p.103 - Pevný, T. and Somol, P.: Discriminative Models for Multi-instance Problems with Tree Structure. In: Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security. 2016, pp. 83-91. Authorship = 50%.

p.113 - Jusko, J., Rehák, M., Stiborek, J., Kohout, J., Pevný, T.: Using Behavioral Similarity for Botnet Command-and-Control Discovery. IEEE Intelligent Systems. 2016, 31(5), pp. 16-22, IF = 2.374, Authorship = 5%.

p.121 - Stiborek, J., Pevný, T., Rehák, M.: Probabilistic analysis of dynamic malware traces. Computers & Security. To appear, IF = 3.928, Authorship = 40%.

p.139 - Stiborek, J., Pevný, T., Rehák, M.: Multiple instance learning for malware classification. Expert Systems with Applications. 2018, 93, pp. 346–357, IF = 3.928, Authorship = 40%.

p.151 - Pevný, T. and Ker, A. D.: Towards dependable steganalysis. In: Proceedings of SPIE Media Watermarking, Security, and Forensics 2015. SPIE Photonics West 2015, 07.02.2015 - 12.02.2015. Authorship = 67%.
