Analyzing Big Data Using Updatable Classifiers

Shrwan Ram

Department of Computer Science and Engineering M.B.M Engineering College

Jodhpur, India

Shloak Gupta

Department of Computer Science and Engineering M.B.M Engineering College

Abstract: The amount of data generated in sectors such as genetics, telecommunications, and banking is growing exponentially. This data contains many patterns and a great deal of information, and there is a need to extract them. Machine learning is used to identify these patterns and the relations between inputs and outputs. Traditional algorithms such as decision trees, neural networks, and random forests have been used in machine learning models, but these models become inefficient when the number of instances is large and when the input data varies with time, as with stock markets, spam, or biological viruses. In this paper, techniques for analysing big data using updatable classifiers in WEKA are discussed. With the development of algorithms that automate adversarial sample generation, such as deep neural networks, there is an urgent need to counter the perturbations that are introduced to yield adversary-selected misclassifications. Adversaries adapt to the data miner's reactions, and data mining algorithms constructed from a fixed training dataset degrade quickly. Such weaknesses in classifiers help spammers and hackers to compromise our privacy. These environments call for classifiers that can update themselves as inputs evolve.

Keywords: Updatable Classifiers, WEKA, Big Data, Adversarial Learning, Cyber Security, Spam Filters, Machine Learning.

1. INTRODUCTION

The industrial revolution was a major turning point in the history of humanity. It enabled businesses to be more productive, create more jobs, and raise the overall standard of living. Today, we are on the precipice of another revolution. With machine learning done right, organizations can develop insights instantly and dramatically grow their business.

Machine learning enables cognitive systems to learn, reason, and engage with us in a natural and personalized way. Netflix movie recommendations, Internet ads based on browsing habits, and even stock trades are all ways in which machine learning helps us navigate our world. Learning here does not mean remembering and following step-by-step instructions, but recognizing complex patterns and making intelligent decisions based on data. The difficulty lies in the fact that the set of all possible decisions given all possible inputs is too complex to describe. To tackle this problem, the field of machine learning develops algorithms that discover knowledge from specific data and experience, based on sound statistical and computational principles.

With the exponential growth in the amount of data being generated, there is a great opportunity to exploit it by finding patterns and relations in the data. Traditionally, algorithms such as decision trees, random forests, and neural networks have been used; they produce good results, but they have limitations.

One problem with these algorithms is that they need the whole data set in memory while training their models. Another limitation is that they are prone to adversarial samples, which are crafted to force a target model to classify them in a class different from their legitimate class. This leads to many security issues, such as the failure of spam filters and image recognizers.

In this paper, we discuss how to apply machine learning to big data using updatable classifiers, and how similar algorithms can be used to deal with adversarial opponents. We use WEKA to analyze the working of updatable classifiers. Updatable classifiers do not need the whole training data in memory while building the model; they update the model tuple by tuple. They have an edge when dealing with inputs that change with time, such as spam mail, because the model can be updated according to how it performs on new inputs.

2. OPEN SOURCE DATA MINING TOOL: WEKA

Waikato Environment for Knowledge Analysis (Weka) is a popular suite of machine learning software written in Java, developed at the University of Waikato [9]. Weka is a workbench that contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to these functions. Weka supports several standard data mining tasks: data preprocessing, clustering, classification, regression, visualization, and feature selection. Weka's main user interface is the Explorer, but essentially the same functionality can be accessed through the component-based Knowledge Flow interface and from the command line. There is also the Experimenter, which allows the systematic comparison of the predictive performance of Weka's machine learning algorithms on a collection of datasets. In this paper we use the Explorer and the data generator of the WEKA tool [10].

3. UPDATABLE CLASSIFIERS

3.1 Naive Bayes Updatable

NaiveBayesUpdatable is an incremental form of the naive Bayes classifier, a simple Bayesian network that assumes each feature is independent of the remaining features. Naive Bayes is usually used for batch learning, because it does not perform well when the algorithm handles each training sample separately. Following the characteristics of incremental learning algorithms, however, naive Bayes can be trained in a single pass using the steps below [3]:

1. Initialize count and total = 0.
   - Go through all the training samples, one sample at a time.
   - Each training sample t = (x, y) has its label associated with it.
   - Increment the value of count as each training sample is processed.

2. The probability is calculated by dividing each individual count by the number of training samples of the same class.

3. Compute the prior probabilities as the fraction of all training samples that are in class y.
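The counting steps above can be sketched in a few lines of Python. This is a minimal illustration for categorical features only, not WEKA's NaiveBayesUpdatable implementation; the class name `IncrementalNaiveBayes`, its methods, the toy stream, and the add-one (Laplace) smoothing are all assumptions made for the example.

```python
from collections import defaultdict


class IncrementalNaiveBayes:
    """Count-based naive Bayes for categorical features, updated one sample at a time."""

    def __init__(self):
        self.total = 0                           # samples seen so far
        self.class_count = defaultdict(int)      # class y -> number of samples
        self.feature_count = defaultdict(int)    # (feature index, value, y) -> count
        self.values = defaultdict(set)           # feature index -> distinct values seen

    def update(self, x, y):
        """Step 1: fold one labelled sample t = (x, y) into the counts."""
        self.total += 1
        self.class_count[y] += 1
        for i, value in enumerate(x):
            self.feature_count[(i, value, y)] += 1
            self.values[i].add(value)

    def likelihood(self, i, value, y):
        """Step 2: count for (value, class) divided by the class count.

        Add-one smoothing is an assumption of this sketch, so that unseen
        values do not produce a zero probability.
        """
        numerator = self.feature_count[(i, value, y)] + 1
        denominator = self.class_count[y] + max(len(self.values[i]), 1)
        return numerator / denominator

    def prior(self, y):
        """Step 3: fraction of all training samples that are in class y."""
        return self.class_count[y] / self.total

    def predict(self, x):
        """Return the class maximizing prior times the product of likelihoods."""
        def score(y):
            s = self.prior(y)
            for i, value in enumerate(x):
                s *= self.likelihood(i, value, y)
            return s
        return max(self.class_count, key=score)


# Hypothetical toy stream: the model never holds more than one sample at a time.
model = IncrementalNaiveBayes()
stream = [(("high", "yes"), "A"), (("high", "no"), "A"),
          (("low", "no"), "B"), (("low", "yes"), "B")]
for x, y in stream:
    model.update(x, y)
```

Because only counts are stored, memory use depends on the number of distinct attribute values and classes rather than on the number of samples seen.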

3.2 IB1/IBk

IB1 is an instance-based nearest-neighbor classifier that handles one test instance at a time. The instance is classified by a majority vote of its neighbors: it is assigned to the class most common in its neighborhood, as measured by a distance function.

IBk implements k-nearest neighbors (kNN). It normalizes distances so that attributes on different scales have the same impact on the distance function. It may return more than k neighbors if there are ties among the neighbors. The neighbors then vote to form the final classification.
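As an illustration of the ideas in this section, the following is a minimal Python sketch of a k-nearest-neighbor vote with range-normalized distances and tie handling. It is not WEKA's IBk code; the function names and the small training set are hypothetical, and min-max scaling with Euclidean distance is assumed as the normalization.

```python
from collections import Counter
import math


def min_max_ranges(samples):
    """Per-attribute (min, max) over the stored numeric training samples."""
    return [(min(column), max(column)) for column in zip(*samples)]


def normalized_distance(a, b, ranges):
    """Euclidean distance after scaling every attribute difference by its range."""
    total = 0.0
    for ai, bi, (lo, hi) in zip(a, b, ranges):
        span = (hi - lo) or 1.0      # avoid division by zero for constant attributes
        total += ((ai - bi) / span) ** 2
    return math.sqrt(total)


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training samples.

    `train` is a list of (features, label) pairs. Samples tied with the k-th
    nearest distance are all kept, so more than k neighbors may vote.
    """
    ranges = min_max_ranges([x for x, _ in train])
    scored = sorted((normalized_distance(x, query, ranges), label) for x, label in train)
    cutoff = scored[min(k, len(scored)) - 1][0]
    neighbors = [label for dist, label in scored if dist <= cutoff]
    return Counter(neighbors).most_common(1)[0][0]


# Hypothetical (age, salary) samples: without normalization, salary would dominate.
train = [((25, 30000), "no"), ((30, 40000), "no"),
         ((60, 120000), "yes"), ((65, 140000), "yes"), ((55, 110000), "yes")]
label_a = knn_predict(train, (58, 115000), k=3)
label_b = knn_predict(train, (27, 35000), k=1)
```

Note that all work happens at prediction time: every query is compared against every stored sample, which is why lazy learners of this kind become slow on large data sets.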

3.3 KStar

KStar is an instance-based learner: the class of a test instance is decided from the class labels of the training instances that are similar to it, as determined by a similarity function. It uses an entropy-based distance function, defined by the probability of transforming one instance into another by randomly choosing between all possible transformations, which turns out to be much better than Euclidean distance for classification [2].

3.4 NNge

NNge is a nearest-neighbor-like algorithm that uses non-nested generalized exemplars. It builds on the IB1, IBk, and kNN algorithms. In instance-based learning the classification time is high, because stored instances are handled one at a time, and generalized exemplars are a way to deal with this. A generalized exemplar is a representation of more than one of the actual instances in the training set. In NNge, a generalization is formed each time a new example is added to the database, by joining it to its nearest neighbor of the same class [2].

3.5 LWL

Compared to more sophisticated classifiers, the performance of naive Bayes becomes less and less favorable as the sample size increases. LWL (locally weighted learning) is used here as a locally weighted version of naive Bayes that relaxes the independence assumption by learning local models at prediction time. Experimental results show that locally weighted naive Bayes rarely degrades accuracy compared to standard naive Bayes and, in many cases, improves accuracy dramatically. The main advantage of this method compared to other techniques for enhancing naive Bayes is its conceptual and computational simplicity [2].

3.6 AODE

Averaged one-dependence estimators (AODE) is a probabilistic classification learning technique. It was developed to address the attribute-independence problem of the popular naive Bayes classifier. It frequently develops substantially more accurate classifiers than naive Bayes at the cost of a modest increase in the amount of computation. Like naive Bayes, AODE does not perform model selection and does not use tunable parameters. As a result, it has low variance. It supports incremental learning, whereby the classifier can be updated efficiently with information from new examples as they become available. It predicts class probabilities rather than simply predicting a single class, allowing the user to determine the confidence with which each classification can be made.

3.7 SPegasos

SPegasos implements the stochastic variant of the Pegasos (Primal Estimated sub-GrAdient SOlver for SVM) method of Shalev-Shwartz et al. (2007). This implementation globally replaces all missing values and transforms nominal attributes into binary ones. It also normalizes all attributes, so the coefficients in the output are based on the normalized data.
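To make the stochastic sub-gradient idea concrete, here is a minimal Python sketch of the basic Pegasos update for a linear SVM with hinge loss. It is not WEKA's SPegasos code and omits its preprocessing (missing-value replacement, nominal-to-binary conversion, normalization) and any bias term; the function names, the regularization value `lam`, the shuffled passes, and the toy data are assumptions made for the example.

```python
import random


def pegasos_train(samples, lam=0.01, epochs=50, seed=0):
    """Train a linear SVM (no bias term) with stochastic sub-gradient steps.

    `samples` is a list of (x, y) pairs with x a list of floats and y in {-1, +1}.
    Shuffled passes over the data are used here instead of sampling with
    replacement; that choice is an assumption of this sketch.
    """
    rng = random.Random(seed)
    w = [0.0] * len(samples[0][0])
    t = 0
    for _ in range(epochs):
        for x, y in rng.sample(samples, len(samples)):
            t += 1
            eta = 1.0 / (lam * t)                       # step size 1 / (lambda * t)
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            w = [(1.0 - eta * lam) * wi for wi in w]    # shrink: regularizer sub-gradient
            if margin < 1.0:                            # hinge loss active for this sample
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w


def pegasos_predict(w, x):
    """Sign of the linear score, mapped to {-1, +1}."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


# Hypothetical linearly separable toy data.
samples = [([2.0, 1.0], 1), ([1.5, 2.0], 1), ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
w = pegasos_train(samples)
predictions = [pegasos_predict(w, x) for x, _ in samples]
```

Each step touches a single sample, so the model can keep being updated as new tuples arrive, which is the property that makes this family of methods usable as an updatable classifier.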

4. DATASET

We used the Agrawal data set for this problem. It can be generated with the data generator in the WEKA Explorer. We generated tuples of this data set to analyze updatable classifiers in WEKA. The data set was first introduced in the paper "Database Mining: A Performance Perspective" by R. Agrawal, T. Imielinski, and A. Swami (1993), IEEE Transactions on Knowledge and Data Engineering. Every tuple in this database has the nine attributes given in Table 1. Attributes elevel, car, and zipcode are categorical attributes; all others are non-categorical attributes. Attribute values were randomly generated [12].

TABLE 1.

Attribute    Description            Value
salary       salary                 uniformly distributed from 20000 to 150000
commission   commission             salary >= 75000 => commission = 0, else uniformly distributed from 10000 to 75000
age          age                    uniformly distributed from 20 to 80
elevel       education level        uniformly chosen from 0 to 4
car          make of the car        uniformly chosen from 1 to 20
zipcode      zip code of the town   uniformly chosen from 9 available zip codes
hvalue       value of the house     uniformly distributed from 0.5k × 100000 to 1.5k × 100000, where k depends on zip code
hyears       years house owned      uniformly distributed from 1 to 30
loan         total loan amount      uniformly distributed from 0 to 500000
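The attribute distributions in Table 1 can be sketched as a small Python generator. This is only an illustration, not WEKA's data generator: the function name is hypothetical, the zip codes are coded 0 to 8, the mapping k = zipcode + 1 is an assumption (the table only says that k depends on the zip code), age and hyears are drawn as integers, and no class label is produced because the classification functions are not given here.

```python
import random


def generate_tuple(rng):
    """Draw one record following the attribute distributions in Table 1."""
    salary = rng.uniform(20000, 150000)
    # Commission is zero for high salaries, otherwise uniform in [10000, 75000].
    commission = 0.0 if salary >= 75000 else rng.uniform(10000, 75000)
    zipcode = rng.randrange(9)      # 9 available zip codes, coded 0..8 (assumed coding)
    k = zipcode + 1                 # assumed mapping from zip code to k
    return {
        "salary": salary,
        "commission": commission,
        "age": rng.randint(20, 80),
        "elevel": rng.randint(0, 4),
        "car": rng.randint(1, 20),
        "zipcode": zipcode,
        "hvalue": rng.uniform(0.5 * k * 100000, 1.5 * k * 100000),
        "hyears": rng.randint(1, 30),
        "loan": rng.uniform(0, 500000),
    }


# Tuples can be produced one at a time and fed to an updatable classifier.
rng = random.Random(42)
rows = [generate_tuple(rng) for _ in range(1000)]
```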

5. ANALYSING

TABLE 2.

Classifiers            Efficiency
AODE                   95.042%
NaiveBayesUpdatable    88.297%
NNge                   93.017%
SPegasos               67.344%
IBk (k=5)              73.119%
KStar                  57.625%
LWL                    67.789%

Fig. 1. Accuracy of classifiers

These results show that NaiveBayesUpdatable, NNge, and AODE are the better performers on this data set. Because there are only two classes, in a ratio of 67% to 33%, always predicting the majority class would already give about 67% accuracy; measured against that, KStar performed very poorly, and IBk, SPegasos, and LWL were also not effective. When we work in a big data environment, time is also an important factor in classifying data, so we should be very selective in using lazy learners such as IBk, KStar, and LWL.

6. ADVERSARIAL LEARNING

Developments in machine learning have led to transformational new fields of technology and introduced capabilities not previously thought of. Emerging applications in self-driving cars, data analytics on massive datasets, face recognition, Web search, and sentiment analysis are but a few of the technologies that will impact society in the decades to come.

Perhaps no technology field has relied on or benefited from advances in machine learning more than systems and computer security. Machine learning is the basis for almost all non-signature-based detection, whether identifying malware, network intrusions, spam, rogue processes, fraudulent transactions, or other malicious activity. Indeed, machine learning has become so intertwined with security that the technical community's ability to apply machine learning securely will likely be crucial to future environments [1].

In machine learning, each sample is input into the classification process as a vector of features that describe the sample. For email, typical features might be keywords, sender and recipient domain names, existence of embedded content, or number of emails of a particular type. The system determination is based on how that set of input features is interpreted by the model for the classification process—in this case, a model of how email input features indicate spam or not.

Models are generally built on how a set of features relates to the output class and how the relations between features affect the output; they are trained iteratively to set the weight of each relation. The key assessment metric for these systems has been accuracy: how often does the model pick the correct class for a sample? Accuracy can be viewed as a measure of the system's average performance, whereas security evaluation is interested in worst-case performance [7].

One of the limitations of machine learning in practice is that it is subject to adversarial samples. Adversarial samples are carefully modified inputs crafted to dictate a selected output. In the context of classification, adversarial samples are crafted to force a target model to classify them in a class different from their legitimate class; for instance, spam emails that bypass a spam filter. The modifications, called perturbations, are introduced to yield a specific adversary-selected misclassification. On top of this, we now face another problem: algorithms such as deep learning can be used to craft adversarial samples very efficiently.

Consider an automated car that uses a neural network to identify road signs. Perturbations can be added to the input images of that network; most of these changes cannot be spotted by humans, yet they can lead to accidents. The problem with conventional learning is that, when training the model, it is very difficult to consider every possible type of input, so gaps remain in the decision boundaries of the trained model. Hackers or spammers find these gaps by repeatedly changing the input and testing it on the model until they find inputs on which the model misbehaves.

We use the term model resilience to describe how robust a model is against such adversaries. There are several methods to increase model resilience, and one of them is to use updatable classifiers or adversarial learning models.

7. ADVERSARIAL MODELS

In sequential learning, the learning problem can be thought of as a game between two players (the learner vs. nature), and the goal is to minimize losses regardless of the move played by the other player. The game proceeds as follows.

- Learner receives an input.
- Learner outputs a prediction.
- Nature looks at the output and sends the learner the true label.
- Learner suffers a loss and updates its model.

Since no distributional assumptions are made about the data, the goal is to perform as well as if the entire sequence of examples could be viewed ahead of time [11]. This is how classifiers are updated in adversarial environments to counter the perturbations introduced.
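The learner-versus-nature game described above can be sketched as a short Python loop. The learner used here is a simple perceptron with 0-1 loss, chosen only as an illustration; the function name and the toy stream are hypothetical, and any updatable classifier could take the perceptron's place.

```python
def online_learning(stream, n_features):
    """Run the predict-then-update protocol with a perceptron learner.

    `stream` yields (x, y_true) pairs with y_true in {-1, +1}. Returns the
    final weights and the number of mistakes (the cumulative 0-1 loss).
    """
    w = [0.0] * n_features
    mistakes = 0
    for x, y_true in stream:                     # learner receives an input x
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_pred = 1 if score >= 0 else -1         # learner outputs a prediction
        # Nature reveals the true label only after the prediction is made.
        if y_pred != y_true:                     # learner suffers a loss ...
            mistakes += 1
            w = [wi + y_true * xi for wi, xi in zip(w, x)]   # ... and updates its model
    return w, mistakes


# Hypothetical stream in which the label follows the sign of the first feature.
stream = [([1.0, 0.5], 1), ([-1.0, 0.2], -1), ([0.8, -0.3], 1), ([-0.7, -0.4], -1)] * 5
weights, mistakes = online_learning(stream, n_features=2)
```

The model is corrected only when it errs, so if an adversary shifts the inputs the learner keeps adapting as long as true labels continue to arrive.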

8. CONCLUSION

- With the large amount of data produced these days, it will become more difficult in future to hold all the training data in memory. In this paper we gave an insight into how to handle big data using updatable classifiers in WEKA.
- It was also observed that classifiers based on naive Bayes, such as NaiveBayesUpdatable and AODE, were both efficient and took less time to give results than lazy learners on the Agrawal data set.
- In this paper, we also looked at adversarial environments and at how classifiers can be updated to prevent adversaries from manipulating our results.

REFERENCES

1. Patrick McDaniel, Nicolas Papernot, and Z. Berkay Celik, "Machine Learning in Adversarial Settings", In Pennsylvania State University.

2. Roshani Ade, P. R. Deshmukh Ph.D, "Instance-based vs Batch-based Incremental Learning Approach for Students Classification", In International Journal of Computer Applications (0975 – 8887), Volume 106, No. 3, November 2014.

3. Liangxiao Jiang, Dianhong Wang, Zhihua Cai, Xuesong Yan, "Survey of Improving Naive Bayes for Classification", Advances in Artificial Intelligence, Vol. 8109, 2013, pp. 159-167.

4. D. Aha and D. Kibler, "Instance-based learning algorithms", Machine Learning, vol. 6, pp. 37–66, 1991.

5. Nilesh Dalvi, Pedro Domingos, Mausam, Sumit Sanghai, Deepak Verma, "Adversarial Classification", In Department of Computer Science and Engineering, University of Washington, Seattle.

6. Ling Huang, Anthony D. Joseph, Blaine Nelson, Benjamin I. P. Rubinstein, J. D. Tygar, "Adversarial Machine Learning", In Proceedings of 4th ACM Workshop on Artificial Intelligence and Security, October 2011, pp. 43-58.

7. Marco Barreno, Blaine Nelson, Russell Sears, Anthony D. Joseph, J. D. Tygar, "Can Machine Learning Be Secure?", In Proceedings of the ACM Symposium on Information, Computer, and Communication Security, March 2006.

8. Yan-Nei Law and Carlo Zaniolo, "An Adaptive Nearest Neighbor Classification Algorithm for Data Streams", pp. 108–120, Springer-Verlag Berlin Heidelberg 2005.

9. Ian H. Witten; Eibe Frank; Mark A. Hall (2011). "Data mining: Practical Machine learning tools and techniques, 3rd edition". Morgan Kaufmann, San Francisco. Retrieved 2011-01-19.

10. G. Holmes; A. Donkin; I.H. Witten (1994). "WEKA: A Machine Learning Workbench". Proc Second Australia and New Zealand Conference on Intelligent Information Systems, Brisbane, Australia. Retrieved 2007-06-25.

11. Shalev-Shwartz, Shai (2011). "Online Learning and Online Convex Optimization". Foundations and Trends® in Machine Learning. pp. 107–194.

12. R. Agrawal, T. Imielinski, A. Swami (1993). Database