Review on Machine Learning Techniques for Android Malware Detection and Categorization

(1)

Review on Machine Learning Techniques for

Android Malware Detection and

Categorization

Neha Sharma 1, Pooja Yadav 2

P.G. Student, Department of Computer Science & Engineering, SRCEM at Palwal, Haryana, India1 Assistant Professor, Department of Computer Science & Engineering, SRCEM at Palwal, Haryana, India 2

ABSTRACT: Android is right now the most famous cell phone working framework. In any case, clients feel their

private data at danger, confronting a quickly expanding number of malware for Android which fundamentally surpasses that of different stages. Antivirus programming guarantees to adequately secure against malware on cell phones and numerous items are accessible for nothing or at sensible costs. Their adequacy is upheld by different reports, authenticating extremely high identification rates. In any case, a progressively nitty gritty examination is required so as to comprehend the genuine hazard level emerging from malware for the Android stage. Neither do the exceedingly high quantities of various malware variations mirror the genuine danger in contrast with different stages, nor do the after effects of testing antivirus programming against a lot of definitely known malware tests (review tests) give an unmistakable image of the capacities and restrictions of antivirus programming on the Android stage. The essential target of this scheme is along these lines to encourage corporate and private clients to survey the genuine hazard levels forced by Android malware from one perspective i.e. using machine learning vide Support Vector Machine, and the assurance level offered by anti malware programming then again. For this reason, we examine how malware spreads and which constraints anti malware applications are liable to. We at that point assess how well Android anti malware programming using machine learning performs under real world conditions, instead of review identification rate tests. In view of our proposed scheme. For this scheme, we inculcate different tests on a few malware based applications for Android to examine the density and harmfulness of malware. As we plan to reflect genuine dangers superior to review tests, in which the proposed scheme is tried for perceiving known malware tests, our test setup considers the capacity to adapt to common malware appropriation channels, contamination schedules, and benefit heightening systems. We found that it is simple for malware to avoid identification by most anti malware applications with just unimportant modifications to their bundle records. So as to test distinctive malware location systems. This verification with classification scheme of anti malware proposes propelled usefulness which is absent in the greater part of the present Android malware, and is planned to decide how Android anti malware scheme will manage obscure and forthcoming malware..

KEYWORDS: Machine Learning, Malware Detection, Support Vector Machines, Anti Malware, Android Package.

I. INTRODUCTION

(2)

antivirus or antimalware arrangements are bore witness to a location rate of 90% or more, and over half of android antivirus or antimalware applications allegedly recognize over 65% of the tests. However, these procedures and tests come full circle in a rapidly expanding apparent risk level from one viewpoint, and apparently exceptionally high insurance rates on the other. In this manner, the general hazard circumstance for android clients or users, because of the danger level by malware, and the assurance level offered by android anti malware programming, is in-transparent and difficult to evaluate. Consequently, under the scheme we will develop the antimalware detection model using the machine learning techniques vide supervised learning model pertaining the best de-facto method namely Support Vector Machine and forming or utilizing tests of known Android malware and Android misuses to detect or identify the malwares under the framework and ecosystem residing as part of android package (APK). We will unequivocally not test review discovery rates. Rather, we will acquaint minor changes with malware tests to test discovery of re-disseminated malware. We plan to decide location strength against scarcely changed malware. Additionally, we will mimic one regular contamination channel – a position of safety dropper application using Support Vector Machine with the capacities of identifying benefit acceleration misuses within the range and context of dalvik virtual machine, which are utilized by more than 33% of malware to increase boundless command over an objective gadget. In this way, we will recreate malware re-conveyance or malware detection methodology which hereby benefits heightening situations to pick up knowledge into certifiable execution of ant malware programming on Android.

To test current Android anti malware programming against obscure and progressed forthcoming malware, we will likewise lead tests utilizing a proof of idea malware created as a major aspect of various past works or proposed solutions.. As it isn't conveyed in the wild, it is obscure to Android anti malware programs. It additionally exhibits different usefulness just found in the latest Android malware, however which is relied upon to be sent all the more generally later on, for example, cross-stage contamination. No uncommon confusion or stealth procedures have been utilized, and the centre pernicious usefulness could be actualized by malware creators likewise. On the specialized dimension, this trial of an obscure malware danger addresses anti malware applications and their investigation abilities.

To understand the locus standi of malware we hereby elaborating the channels which is responsible to distribute or dispense these harmful segments of program using underneath resources. Subsequently, by and large, any information transmission and correspondence channel can fill in as an assault and malware conveyance vector. Channels which are not implied for transmitting programming may at present be taken control of by regular methods for expecting command over any application's or administration's control stream because of mistaken programming. These vectors incorporate USB, Bluetooth, and NFS (Network File System) and their associations, scanner tags, QR codes, unbound remote associations which can be abused to infuse information, mistaken GSM/UMTS/LTE radio bundle dealing with, and some more. Normal correspondence channels which are intended to convey programming are the official Google Play Store and outsider application markets. In the accompanying, we will concentrate on the most imperative contamination channels for regular malware which can be found in the world today. For additional inside and out data on Android engendering channels, tenacious contamination, and malware approach all in all, allude to our best of information are :-

1. App Markets: The official Android application showcase at Google Play store is genuinely all around

(3)

are fated for this. Additionally, clients are deliberately searching for such pilfered applications. Therefore, these purported repackaged applications are of high appeal to clients, and thus likewise to malware designers and merchants. The level of their allure likewise connects with the usefulness the applications offer. The greater usefulness, the more consents are required for the application without giving the client a chance to get suspicious. This consent is then likewise accessible to the malware.

2. Rooting: Some users ”root” their phone, meaning they make the root account on their phone always accessible.

Root access is normally not possible for end users or end-user software. Rooting one’s device is necessary for some modifications, some installed software, and often for flashing alternative operating system images to a device. In order to grant root access to apps, the su utility is often installed on rooted devices. This utility can also be used by malware, which then does no longer need to run privilege escalation exploits on its own to acquire root privileges. Thus, rooting one’s own device, just like using third party app markets, can introduce higher risks.

3. PC-to-Device : Proliferation from PCs to Android gadgets can be accomplished most effectively by means of

Android Debug Bridge (adb). To start with, malware can show manufactured refresh warnings, advising the client to actuate Android Debug Bridge (adb) for the refresh to be introduced. At that point, the malware dynamic on a PC may begin an adb daemon and introduce malware on the Android gadget. Comparable contamination is as of now endeavored by the Zeus-in-the-Mobile (ZitMo) malware, however it is for the most part directed utilizing social building. ZitMo endeavors to accomplish cross gadget diseases giving its objectives manufactured SMS instant messages inciting them to introduce a refresh, likewise giving the connection purportedly pointing at the refresh itself. The refresh is an APK document the client needs to download and introduce physically.

To understand the locus standi of malware we hereby elaborating the channels which is responsible to distribute or dispense these harmful segments of program using underneath resources. Subsequently, by and large, any information transmission and correspondence channel can fill in as an assault and malware conveyance vector. Channels which are not implied for transmitting programming may at present be taken control of by regular methods for expecting command over any application's or administration's control stream because of mistaken programming. These vectors incorporate USB, Bluetooth, and NFS (Network File System) and their associations, scanner tags, QR codes, unbound remote associations which can be abused to infuse information, mistaken GSM/UMTS/LTE radio bundle dealing with, and some more. Normal correspondence channels which are intended to convey programming are the official Google Play Store and outsider application markets. In the accompanying, we will concentrate on the most imperative contamination channels for regular malware which can be found in the world today. For additional inside and out data on Android engendering channels, tenacious contamination, and malware approach all in all, allude to our best of techniques using machine learning.

(4)

endeavours to adapt to the shortcomings of both recognition approaches. It is another line of barrier to enlarge proficient yet permeable antivirus resistances and less-dependable, more asset serious heuristics. Such a resistance would permit the antivirus layer to square known dangers, while keeping even most of new dangers from achieving the frameworks. Following the comparable instinct, we proposed a viable recognition procedure expelling the imperfections of both existing discovery frameworks against malwares which utilizes the half breed approach. It for the most part expands the possibility of mark based strategies what's more with programmed conduct examination. Bolster Vector Machine (SVM), a regulated machine learning procedure is utilized for arrangement of noxious and amiable virtual products. Our proposed framework will distinguishes malware based on two highlights first by application code thickness of executables and framework call highlights of libraries used by dalvik virtual machine and powerfully following the conduct of executables amid runtime. We manifest around diminishing element categorization in the first dataset space and reduce the bogus caution rate by dissecting conduct of executables. We connected application code include separating to prune unessential and most critical highlights using classification techniques. Therefore, proposed scheme naturally examines the conduct of each malware executables and creates reports investigating framework call highlights. Further it prepares and constructs reference model of SVM classifier to approve test dataset segregating pernicious and generous executables given as information.

II. RELATEDWORK

Justin Sahs and Latifur Khan [1] depicts that, the recent emergence of mobile platforms capable of executing increasingly complex software and the rising ubiquity of using mobile platforms in sensitive applications such as banking, there is a rising danger associated with malware targeted at mobile devices. The problem of detecting such malware presents unique challenges due to the limited resources avalible and limited privileges granted to the user, but also presents unique opportunity in the required metadata attached to each application. In this article, we present a machine learningbased system for the detection of malware on Android devices. Our system extracts a number of features and trains a OneClass Support Vector Machine in an offline (off-device) manner, in order to leverage the higher computing power of a server or cluster of servers.

Milosevic, N., Dehghantanha, A. and Choo, K.-K.R [2] depicts that, malware have been used as a means for conducting cyber attacks for decades. Wide adoption of smartphones, which store lots of private and confidential information, made them an important target for malware developers. Android as the dominant mobile operating system has always been an interesting platform for malware developers and lots of Android malware species are infecting vulnerable users every day which make manual malware investigation an impossible mission. Leveraging machine learning techniques for malware forensics would assist cyber forensic investigators in their fight against malicious programs. In this paper, we present two machine learning aided approaches for static analysis of the mobile applications: one based on permissions , while the other based on source code analysis that utilizes a bag of words representation model. Our source code based classification achieved F-score of 95.1%, while the approach that used permission names only performed with F-measure of 89%. Our approach provides a method for automated static code analysis and malware detection with high accuracy and reduces smartphone malware analysis time.

(5)

models at detecting malicious apps under a certain category. The intensive machine learning experiments proved that category-based classifiers report a remarkable higher average performance compared to non-category based.

Ivan Firdausi, Charles lim, Alva Erwin [4] depicts that increase of malware that are exploiting the Internet daily has become a serious threat. The manual heuristic inspection of malware analysis is no longer considered effective and efficient compared against the high spreading rate of malware. Hence, automated behavior-based malware detection using machine learning techniques is considered a profound solution. The behavior of each malware on an emulated (sandbox) environment will be automatically analyzed and will generate behavior reports. These reports will be preprocessed into sparse vector models for further machine learning (classification). The classifiers used in this research are k-Nearest Neighbors (kNN), Naïve Bayes, J48 Decision Tree, Support Vector Machine (SVM), and Multilayer Perceptron Neural Network (MlP). Based on the analysis of the tests and experimental results of all the 5 classifiers, the overall best performance was achieved by J48 decision tree with a recall of 95.9%, a false positive rate of 2.4%, a precision of 97.3%, and an accuracy of 96.8%. In summary, it can be concluded that a proof-of-concept based on automatic behavior-based malware analysis and the use of machine learning techniques could detect malware quite effectively and efficiently.

III.PROPOSED METHODOLOGY

SVM Based Classification: SVM is a strategy for information characterization; it can create a nonlinear choice plane and orders information which has non-customary dissemination. It stays away from properties with more prominent numeric reaches commanding those with littler numeric extents and it maintains a strategic distance from numerical challenges amid the count as piece esteems more often than not rely upon the internal results of highlight vectors. SVM works in two stages preparing stage and testing stage. Amid the preparation stage a SVM takes a lot of info focuses as Attribute Relation File Format, every one of which is set apart as having a place with one of two classifications, and assembles a model speaking to the information focuses in such way that the purposes of various classifications are isolated by a reasonable hole that is as wide as could be expected under the circumstances. From that point, another information point is mapped into a similar space and anticipated to have a place with a class dependent on which side of the hole it falls on. A straight SVM show isolates information having a place with various classifications by utilizing a hyper-plane so the separation from its closest information point on each side is expanded. The part trap enables the SVM calculation to end up nonlinear to isolate focuses by a hyper-plane in a changed component space. The SVM is designed and prepared to cross through two kinds of highlights. At first SVM features those records whose framework calls are having digressing conduct than typical conduct of amiable documents and other one is SVM pinpoints those records having codes that are having positive effect on the grouping of amiable and noxious programming. At long last amid testing stage SVM approves the dataset separating the documents into sets of kind and malignant records using dynamic sort mechanism, the below diagram depicts the workflow diagram of proposed scheme :-

Figure 1: Work Flow Diagram of Proposed Scheme depicting various phases like data collection, pre-processing, feature extraction, profile building, binary vector generation, SVM classification and malware detection and

(6)

IV.CONCLUSION

Under the proposed plan we consider three viewpoints to create the powerful and increasingly suitable system. To start with, checking beneficial stationary highlights, for example, capacities brings in structure the scientific classification models to show signs of improvement comprehension of the procedures that applications may lunch in an approach to expand the location precision of the classifiers. Second, put into activity the proposed arrangement on an expansive scale level by building profile models for different classes and sub classifications dependent on profiles and vectors utilizing SVM. Third, testing the attainability of coordinating our answer with dynamic recognition procedures by profiling dynamic highlights for every classification; dynamic highlights like framework calls, arrange associations, assets utilization, and so on.

REFERENCES

1. Justin Sahs and Latifur Khan, A Machine Learning Approach to Android Malware Detection, 2012 European Intelligence and Security Informatics Conference

2. Milosevic, N., Dehghantanha, A. and Choo, K.-K.R. (2017) Machine learning aided Android malware classification. Computers & Electrical Engineering, 61. pp. 266-274. ISSN 0045-7906

3. Alatwi, Huda Ali, "Android Malware Detection Using Category-Based Machine Learning Classifiers" (2016). Thesis. Rochester Institute of Technology.

4. Ivan Firdausi ; Charles lim ; Alva Erwin ; Anto Satriyo Nugroho Analysis of Machine learning Techniques Used in Behavior-Based Malware Detection 2010 Second International Conference on Advances in Computing, Control, and Telecommunication Technologies. 5. Wen-Chieh Wu, Shih-Hao Hung : DroidDolphin: a dynamic Android malware detection framework using big data and machine learning

6. Ying-Chih Shen1, Roger Chien1, and Shih-Hao Hung1, Toward Efficient Dynamic Analysis and Testing for Android Malware 1Academia Sinica 2016, Taiwan, National Taiwan University, Taiwan

7. D. Arp, M. Spreitzenbarth, M. Hübner, H. Gascon, K. Rieck, and C. Siemens. Drebin: Effective and explainable detection of android malware in your pocket. 2014.

8. S.-J. Chang. Ape: A smart automatic testing environment for android malware. 2013.

9. S. Y. Yerima, S. Sezer and G. McWilliams. “Analysis of Bayesian Classifcation Aprroaches for Android Malware Detection,” IET Information Security, Vol 8, Issue 1, January 2014.

10. S. B. Kotsiantis, “Decision trees: a recent overview,” Artificial Intelligence Review (2013) 39: pp 261-283, 2013.

11. I. H. Witten, E. Frank and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed., Morgan Kaufmann, Burlington: MA 01803, USA, 2011.

12. S. Y. Yerima, S. Sezer, G. McWilliams and I. Muttik, “A New Android Malware Detection Approach Using Bayesian Classification,” Proc. 27th IEEE int. Conf. on Advanced Inf. Networking and Applications (AINA 2013), Barcelona, Spain, March 2013.

13. J. Sahs and L. Khan, “A Machine Learning approach to Android malware detection,” 2012 European Intelligence and Security Informatics Conference, 2012.

14. L. Batyuk, M. Herpich, S. A. Camtepe, K. Raddatz, A. Schmidt, S. Albayrak, "Using static analysis for automatic assessment and mitigation of unwanted and malicious activities within Android applications," Proc. 6th International Conference on Malicious and Unwanted Software (MALWARE), 2011.

15. H. Peng, C. Gates, B. Sarma, N. Li, A. Qi, R. Potharaju, C. NitaRotaru and I. Molloy, “Using probabilistic generative models for ranking risks of Android apps. In Proceedings of the 19th ACM Conf on Computer and Comms. Security (CCS 2012), Oct. 2012.

16. G. Dini, F. Martinelli, A. Saracino, and D. Sgandurra, “MADAM: A Multi-level Anomaly Detector for Android Malware,” Proc. 6th Int. Conf. on Mathematical Methods, Models and Architectures for Computer Network Security (MMM-ACNS 2012), Saint Petersburg, Russia.