A Review on Data Mining Concept for Diagnosis and Classify Malware from Code

(1)

A Review on Data Mining Concept for Diagnosis and Classify Malware from Code

(Paper ID: 24ET3003201508)

Dhaval Patel Sanjay Patel

M.E Student, Computer Engineering Asst. Prof, Computer Engineering

IIET, Dharmaj, India IIET, Dharmaj, India

Gujarat Technological University Gujarat Technological University

Abstract: In this review paper we discuss on various Data mining techniques and Malware detection methods. Malware is a computer program called malicious code which is used to harm or damage computer or network. So, different Malware detection methods have been proposed to detect and remove Malware. In now a day’s Malware is one of the major cyber frauds. Malware composer use Application Programming Interface (API) calls to perform this crime. The Data mining Classification techniques like decision tree Classification, Bayesian Classification and Malware detection techniques like Signature Based Detection, Heuristic Based Detection, Behaviour Based Detection, Log mining approach and Rev Match Model are used to detect Malware. RevMatch is an efficient and powerful model configured with VirusTotal API and Data mining tool WEKA which can make accurate Malware detection decision on the based on feedback.

Keywords: Data Mining, Malware, Classification, Rev Match, Virus Total, WEKA.

I. INTRODUCTION

Data mining, “the extraction of hidden predictive information from large databases”, is a powerful new technology with great potential to help for detecting the malware. Database is take important role to store various kinds of useful and meaningful information [13]. Data mining technique like classification is easily extract the information from large dataset [13]. In Malware Detection Data mining is used extensively, because of limited scalability, adaptability and validity. In Malware Detection data is gathered from various sources like network log data, host data etc. d u e t o l a r g e n e t wo r k t r a f f i c the analysis of data is too difficult. So, the need of different data mining techniques for Malware detection has been need [14].

II. MALWARE

“A Malware is a set of instructions that run on user computer and make system do something that an attacker wants it to do. A program that can infect other programs by modifying them to include a, possibly evolved, version of itself”.

Malware is shorter form of malicious software. It is software specifically designed to damage user’s computer data in some way .Malware consists of programming (code, scripts, active content, and other software’s) designed to disrupt or deny

operation, gather information that leads to loss of privacy or exploitation, gain unauthorized access to system resources, and other abusive behaviour. The expression is a general term used by computer professionals to mean a variety of forms of hostile, intrusive, or annoying software or program code.

Very often people call everything that corrupts their system as virus, not aware of what it actually means or does which is known as Malware.

Fig. 1. Malware Classification [15]

A malicious executable is defined to be a program that performs a malicious function, such as compromising a system’s security.

III. MALWARE REPARTITION

Any program that is purposefully created to harm the computer system operations or data is termed as malicious programs. Malware programs include viruses, worms, Trojans, backdoors, adwares, spywares, bots, root kits etc. All Malware are sometimes loosely known as virus (viruses, worms, Trojans specifically).

(2)

Fig. 2. Malware Repartition [16]

IV. REVIEW OF MALWARE DETECTION TECHNIQUES

A) Signature Based Detection

The most widely and common technique to detect malware is signature-based malware detection approach. It uses a simple pattern-matching approach to detect malicious code. This approach is highly accurate [6]. This involves searching for known malicious patterns within suspicious ﬁles. Signature- based detection performs fast and usually has a low FP [9].

Ideally, a signature should be able to detect any malware exhibiting the malicious behavior specified by the signature [8]. Once a signature has been created, it is added to the signature-based method’s knowledge (i.e. repository

Fig. 3. Signature Based Detection [10]

Figure 3 illustrates the major disadvantage of signature-based methods. Since the set of possible malicious behaviors, U, is inﬁnitely large, there are no known techniques for accurately representing U via signatures. Furthermore, a repository of signatures is a weak approximation to U [10].

Disadvantages:

 The main drawback of this method is that it can’t be detect the unseen Malware and also failure to detect the polymorphic and metamorphic Malware [6, 3].

 Another drawback of signature-based methods is that human involvement is usually needed to develop the signatures. This not only allows for the introduction of human error, but takes considerably more time than if signature development was completely automated.[10]

B) Behaviour Based Detection

In the behavioural based method the Malware is detect based on the static and dynamic behaviours of the Malware code. If the code of the any Malware is complicated but the detection methodology does not change the behaviour of the Malware.

This approach is also use in machine learning for program behaviours According to behaviours used by them, usually there are two types of behaviour based detection methods, semantics based formal detection method and system call based method [6]. The behavior-based approach analyzes behavior log or graph of a suspicious ﬁle, such as a sequence of system calls, and matches them with known malware behavior patterns [9]. This method can able to detect newly introduced malware which is also known as zero-day attack [9].

Fig. 4. system call interception (p is analyzed program) [11]

Disadvantages:

 Normal behavior for complicated programs is very difficult. For example, the set of behaviors of Mozilla Firefox are very complicated. Therefore, it is very difficult to develop a model of normal behavior of a complicated program [12].

 Non-availability of promising False Positive Ratio (FPR) and also high amount of scanning time are the main disadvantages of these behavior based malware detection methods [6].

 Behavioural security is useful when program or file has not previously been classified as “good” or

“bad.” It is an efficient (but not ideal) way to detect new threats without waiting for them to first do harm.

C) Heuristic Based Detection

This approach used to detect Malware because this method provides the counter action against the targeted and zero-day attacks. There are two main approaches focused on Malware detection using heuristic approach: making decisions based, processing dynamic data (the group of dynamic detection

(3)

methods) [7].Heuristic malware detection methods use data mining and machine learning techniques to learn the behavior of an executable file[6]. Malware detection methods must be applied with classification techniques where these methods require some features which represents the input instance in the way that can be used for classification [6].

The features have been classified in below figure:

Fig. 5. Heuristic Method Features [8]

API/System calls

Around all programs use application programming interface (API) calls to send the requests to the Operating System [8, 9].API call sequences is one of the most effective way that follow the behavior of a piece of code like malware. Hofmeyr et al, were among the first ones who concern API call sequences as a feature of a malware [8, 10].

OpCode

An OpCode which is a shorter form of (Operational Code) is the subdivision of a machine language instruction that classify the operation to be executed. A program is defined as a series of ordered assembly instructions. An instruction is a pair composed of an operational code and an operand or a list of operands [8]. The most significant research on OpCodes has been done by Bilar [8, 20].

N-Grams

N-Grams are all substrings of a larger string with a length of N [8, 29]. For example, the string “VIRUS” can be division into several 3-grams: “VIR”, “IRU”, “RUS” and so on [8].

Tesauro et al. [8, 30] were the first who try to use N-Grams as a feature for malware detection domain. They used N-Grams to detect Boot Sector Viruses using Artificial Neural Networks (ANN). A Boot Sector Virus is a malware variant which infects DOS Boot Sector or Master Boot Record (MBR) [8]. A certain researches have been motivated on the detection of unknown malware based on its binary code content. Schultz et al. [8, 8] were the first who introduced the idea of applying Malware techniques for detection of diverse malwares based on their own binary codes [8].

CFG (Control Flow Graph)

CFG is a directed graph which is used to represent control flow of the program, where each node represents a statement of the program and each edge represents control flow between

the statements (i.e. what happens after what). Statements may be assignments, copy statements, branches etc [8]. Zhao [8, 40] develop a detection method based on features of the control flow graph for PE files. At first, he created CFG for each executable file then he used features which extracted from CFG as the train data [8].

Hybrid Features

The performance of machine learning classifiers is control by two main factors: features and algorithms. Certain researchers set out to improve the attention of machine learning through the features. They combine features to get better accuracy. Eskandari et al. [8, 45] used the simple CFG and API calls to detect metamorphic malware. CFG was used to understand semantic of malware. Assuming called API on CFG; they got more semantic aspects of new malware [8].

A) RevMatch Model

A collaborative decision model named as RevMatch is based on feedback which is used to obtain high detection accuracy, the goal of the decision model is to achieve efficiency. This model can make powerful and effective as well as accurate Malware detection decision on the based of the feedback from neighbour [9].

Given a labelled history consisting of the feedback of n Avs on m files whose ground truth are known (Malware or goodware), decide whether a suspicious file is Malware based on the feedback set y from a subset of the Avs [9].

To solve this problem, the decision model describe as follows. Suppose a scenario where a set of Avs N are consulting each other for Malware assessment. AVi (i ∈ N ) sends a suspicious file to other Avs in its neighbour list Ni for consultation. Let random variable Yi :=

[Yj ]j∈Ni denote the feedback vector that contains the scanning results from the neighbours. Note that Yj ∈ {0, 1}, and Y=1 and Yj=0 indicate the suspicious file is a Malware or goodware respectively1 . Suppose AVi sends a suspicious file to its neighbours for consultation and receives a feedback set y = {y1,..., y|Ni |} from its neighbours, where yj ∈ {0, 1} is the feedback from neighbour j. AVi needs to decide whether the suspicious file is Malware or not, based on the feedback y [9].

In this model they describe two algorithm like Feedback Relaxation and Ground Truth Update algorithm [9].

(4)

Feedback Relaxation algorithm

The previous results are based on the condition that M(y)+G(y) ≥Tc , where Tc >0 is a system parameter to specify the minimum number of matches in order to reach some “conﬁdence” in decision making M(y) +G(y) < Tc

indicates there are not enough matches and thus no conﬁdent decision can be made. The RevMatch model handles this problem using feedback relaxation [9].

Fig. 6. Feedback Relaxation algorithm [9]

Ground Truth Update Algorithm

The labeled history (ground truth set) is highly important since all decisions are based on the ground truth (GT) search for matches. AVs in CMD may collect labeled history by sending test ﬁles to neighbors and recording their feedbacks and GT. Real consultation ﬁles can also be used when their GT are revealed afterward [9].

Fig. 7. Ground Truth Update Algorithm [9]

Several collaborative decision models for Malware detection have been evaluated which are listed as:

1. Static Threshold 2. Weighted Average 3. Decision Tree 4. Bayesian Decision

1) Static Threshold: The static threshold (ST) model [19]

raises an alarm if the total number of Malware diagnosis in the result set is higher than a defined threshold. This model is straight forward and easy to implement. The tunable threshold can be used to decide the sensitivity in Malware detection. However, the ST model considers the quality of all Avs equally, making the system vulnerable to attacks by colluded malicious insiders.

2) Weighted Average: The weighted average (WA) model [12] takes the weighted average of all feedback from Avs. If the weighted average is larger than the threshold, then the system raises an alarm. The weight of each AV can be the trust value or quality score of the AV. The impact from high-quality Avs is larger than from low- quality Avs. The WA model also provides a tuneable threshold for the sensitivity of detection.

3) Decision Tree: The decision tree (DT) model [9] uses a machine-learning approach to produce a decision tree, in order to maximize decision accuracy [9]. The decision tree approach can provide a fast, accurate, and easy-to- implement solution to the collaborative Malware detection problem. The training data with labelled samples is used to generate a binary tree and decisions are made based upon the tree. However, the decision tree approach does not work well with partial feedback, i.e., when not all participants give feedback. It is also not flexible (no easy way to tune the sensitivity of detection) since decision trees are usually precomputed [9].

4) Bayesian Decision: The Bayesian decision (BD) model [9, 13] is another approach for feedback aggregation in intrusion detection (or Malware detection). In this approach, the conditional probability of Malware/Goodware given a set of feedback is computed using Bayes’ theorem and the decision with the least risk cost is always chosen.

The BD model is based on the assumption that feedbacks from collaborators are independent, which is usually not the case [9].

V. CONCLUSION

From study and review different Malware detection techniques with different data mining techniques like classification the RevMatch Model is perform better then other Malware detection techniques.

VI. ACKNOELEDGEMENT

It gives me immense pleasure to thank Mr. Sanjay R. Patel (Assistant Professor, CE Department, I.I.E.T ) who took great efforts in providing good guidance and was enthusiastic about my work. I learnt from him the skills of writing the survey paper, and also the patience and persistence required to analyze one’s own work and address peer review comments. I owe him a great debt of gratitude.

(5)

I would like to thank and express my deep and sincere sense of gratitude to Mr. Premal J Patel (Head, CE Department, I.I.E.T) for providing me encouragement and good facilities to prepare my survey. Besides, I am also grateful to all the faculties of CE Department for providing timely supports and sharing of knowledge.

These acknowledgements would be incomplete without thanking my fellow M.E. students for always encouraging each other to work hard.

Finally, I am indebted to my family members, who have encouraged me and provided the moral support throughout till date.

REFERENCES

[1] Takumi Yamamoto, Kiyoto Kawauchi, “Proposal of a method detecting malicious Processes”, IEEE 28th International Conference on Advanced Information Networking and Applications Workshops, pref., 247-8501, 2014.

[2] Asmitha K A, Vinod P, “A Machine Learning Approach for Linux Malware Detection”, IEEE, 2014.

[3] G. Ganesh Sundarkumar, Vadlamani Ravi, “Malware Detection by Text and Data Mining”, IEEE, 2013.

[4] Ms. Apashabi Chandkhan Pathan, Mrs. Madhuri A.

Potey,“DETECTION OF MALICIOUS TRANSACTION IN DATABASED USING LOG MINING APPROACH”, IEEE, International Conference on Electronic Systems, Signal Processing and Computing Technologies, 2014.

[5] Tawfeeq S. Barhoom, Hanaa A. Qeshta, “Adaptive Worm Detection Model Based on Multi classifiers”, IEEE, Palestinian International Conference on Information and Communication Technology, 2013.

[6] XIAO XIAO, DING YUXIN, ZHANG YIBIN, TANG KE, DAI WEI,

“MALWAR DETECTION BASEDD ON OBJECTIVE-ORENTED ASSOCIATION MINING”, IEEE, Proceedings of the 2013 International Conference on Machine Learning and Cyberetics, 2013.

[7] Dmitriy Komashinskiy and Igor Kotenko, “Malware Detection by Data Mining Techniques Based on Positionally Dependent Features”, IEEE, 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, 2010.

[8] Zahra Bazrafshan, Hashem Hashemi, Seyed Mehdi Hazrati Fard, Ali Hamzeh, “A Survey on Heuristic Malware Detection Techniques”, IEEE, 5th Conference on Information and Knowledge Technology, 2013.

[9] Carol J. Fung, Disney Y. Lam, Raouf Boutaba, David R. Cheriton

“RevMatch: An Efficient and Robust Decision Model for Collaborative Malware detection”, IEEE, 2014

[10] Nwokedi Idika, Aditya P. Mathur ”A Survey of Malware Detection Techniques”, Department of Computer Science, Purdue University, West Lafayette

[11] Lorenzo Martignoni, Roberto Paleari, and Danilo Bruschi, “A framework for behavior-based malware analysis in the cloud”, Dipartimento di Fisica

[12] Ashwini Mujumdar, Gayatri Masiwal, Dr. B. B. Meshram, “Analysis of Signature-Based and Behavior-Based Anti-Malware Approaches”

[13] Nileshkumar D. Bharwad, Prof. Mukesh M. Goswami, “ Classification for Multi-Relational Data Mining based on Bayesian Belief Network”, International Conference on Convergence of Technology – 2014 [14] Abhaya, Kaushal Kumar, Ranjeeta Jha, Sumaiya Afroz, “Data

Mining Techniques for Intrusion Detection:A Review”, International Journal of Advanced Research in Computer and Communication Engineering Vol. 3, Issue 6, June 2014.

[15] Malware Analysis, http://www.techstin.com/category/basic- lessons/page/2/

[16] Malware Repartition, http:// www.securelist.com