A Comparison of Approaches for Large-Scale Data Mining


Utilizing MapReduce in Large-Scale Data Mining

Amy Xuyang Tan

Dept. of Computer Science University of Texas at Dallas

[email protected]

Valerie Li Liu

Dept. of Computer Science University of Texas at Dallas

[email protected]

Murat Kantarcioglu

Dept. of Computer Science University of Texas at Dallas

[email protected]

Bhavani Thuraisingham

Dept. of Computer Science University of Texas at Dallas

[email protected]

ABSTRACT

Massive data sets generated in many applications, ranging from astronomy to bioinformatics, provide various opportunities and challenges. In particular, scalable mining of such massive data sets is a challenging issue that has attracted recent research. Some of this recent work uses the MapReduce paradigm to build data mining models on the entire data set. In this paper, we analyze existing approaches for large-scale data mining and compare their performance to the MapReduce model. Based on our analysis, we introduce and discuss a data mining framework that integrates MapReduce and sampling.

Keywords

MapReduce, large-scale data mining

1. INTRODUCTION

Currently, enormous amounts of data are created every day. With the rapid expansion of data, we are moving from the Terabyte to the Petabyte Age. At the same time, new technologies make it possible to organize and utilize the massive amounts of data currently being generated. However, this trend creates a great demand for new data storage and analysis methods. In particular, the theoretical and practical aspects of extracting knowledge from massive data sets have become quite important. Consequently, large-scale data mining has attracted tremendous interest in the data mining community [7, 14, 9]. Based on this research, three alternatives have emerged for dealing with massive data sets:

• Sampling: The massive data set is sampled and a data mining model is built using the sampled data. Building a model on sampled data is fast and efficient, and can be as accurate as building a model on the entire data set [11]. However, some questions about sampling are challenging, such as: What is the right sample size? How do we increase the quality of samples?

• Ensemble Methods: In this case, data is partitioned and mined in parallel to create an ensemble of classifiers. This approach takes the “divide and conquer” idea: the entire data set is partitioned into several subsets, the in-memory mining algorithm is run on each subset, and the local classifiers are finally merged into a global classifier that performs the classification [31].

• MapReduce based Approaches: Recently, cloud computing platforms have been developed for processing large data sets. Examples are the Google file system [16], BigTable [10], and MapReduce [14] infrastructure; Amazon’s S3 storage cloud, SimpleDB data cloud, and EC2 compute cloud [1]; the Sector/Sphere infrastructure [20, 6]; and the open source Hadoop system [2], consisting of the Hadoop Distributed File System (HDFS) and HBase. The MapReduce infrastructure has provided data mining researchers with a simple programming interface for scaling up many data mining algorithms in parallel on large data sets. This makes it possible to mine larger data sets without being constrained by the memory limit of a single machine.

This paper compares the three main large-scale data mining approaches using a large real-world data set. The data set is composed of a large number of web pages retrieved from the Internet, and document classification is our data mining task. The results and analysis of the experiments are discussed in detail. Finally, ideas are proposed and explored for further studies in large-scale data mining.

This paper is structured as follows: in section 2, an overview of related work in large-scale data mining is given. In section 3, the MapReduce framework and the data mining algorithm used in our work are detailed. In section 4, the three different approaches used in this paper’s experiments for large-scale data mining are discussed. The experimental results are provided in section 5. In section 6, a new framework that combines sampling and MapReduce is explained. Finally, conclusions are drawn and future work is outlined in section 7.


2. RELATED WORK

Recently, large-scale data mining has been extensively investigated. Taking advantage of MapReduce techniques to tackle problems in the large-scale data mining domain is gaining more and more attention [30, 12, 29, 17, 8]. The MapReduce framework provides good performance with respect to scalability, reliability and efficiency. Utilizing MapReduce is one of the major methods in large-scale and parallel data mining, along with sampling [31, 15] and optimal algorithm design [31, 33, 26].

MapReduce has been used intensively for large-scale machine learning problems at Google Inc. [14, 29, 13]. Other researchers have used MapReduce in many data mining applications [12, 17, 30, 24, 35, 21]. Besides MapReduce, other cloud infrastructures have been used for data mining tasks as well [28, 18]. Chu et al. presented an applicable MapReduce data mining framework [12]. In their work, they showed possible implementations of machine learning algorithms using their framework. Furthermore, they illustrated that algorithms that fit the Statistical Query model [23] can be written in a certain “summation form”, which can be easily parallelized in the MapReduce framework. They demonstrated that the speedup is linear in the number of computer cores. Gillick et al. explored implementations on the MapReduce framework, specifically for applications in the natural language processing area [17]. Later, Papadimitriou et al. presented a MapReduce based framework for the entire data mining process [30]. Panda et al. presented a framework for learning decision trees in parallel with MapReduce [29].

Ensemble methodology is another promising approach. Provost et al. summarized the methods for scaling up inductive algorithms in [31] and presented ensemble methods. In another direction, approaches using sampling methods are well reviewed in [19].

This work focuses on comparing data mining task performance using these three major approaches in large-scale data mining. Although our work is still in the initial phase, we have obtained some interesting observations from our experiments, and hence we propose a very promising MapReduce framework for future work.

3. PRELIMINARY DEFINITION

3.1 MapReduce

Proposed by Google early this century, MapReduce [14] is a programming model that provides a distributed computation framework. Computation under this framework is declared through the Map and Reduce functions, executed in this order. The Map function takes a < key, value > pair as input and emits intermediate < key, value > pairs to the Reduce function. The Reduce function processes all intermediate values associated with the same intermediate key. By using the MapReduce framework to execute distributed computing in a cluster, users only need to consider the logical aspect of the computation and do not need to consider distributed programming issues, since the system automatically takes care of data partitioning and distribution, failure handling, load balancing and communication within the cluster. Recently, thanks to the Hadoop project [2], an open source implementation of the MapReduce framework, MapReduce has been widely used in academia and industry. Hadoop comes with the Hadoop Distributed File System (HDFS), provides a simple programming model, and serves as a scalable, reliable and efficient infrastructure for cluster computing on large data sets. The MapReduce framework has been used in many large-scale data tasks, for example distributed grep and sort, large-scale machine learning and graph computation. However, not every computation can be parallelized in the MapReduce framework. Parallel execution in MapReduce requires that the computation in each map step be independent of the others and that the result not depend on the order of the map outputs.
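To make the Map and Reduce roles described above concrete, the following is a minimal single-process sketch of the programming model, using word counting as the example task. It is an illustration only, not Hadoop code: the helper `run_mapreduce` and the functions `map_fn` and `reduce_fn` are hypothetical names, and the in-memory grouping step stands in for the shuffle that a real cluster performs.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: take one <key, value> input pair and emit intermediate <word, 1> pairs."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: process all intermediate values that share the same key."""
    yield word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory driver; a real framework distributes these steps."""
    groups = defaultdict(list)
    for key, value in records:                 # map phase
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # group by intermediate key
    output = {}
    for key in sorted(groups):                 # reduce phase
        for out_key, out_value in reducer(key, groups[key]):
            output[out_key] = out_value
    return output

docs = [("d1", "map reduce map"), ("d2", "reduce cluster")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

Each call to `map_fn` depends only on its own input record, which is the independence property that lets the map phase run in parallel.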

3.2 Naïve Bayes Classifier

Among all classifiers, the Naïve Bayes classifier is simple and efficient in most classification tasks. It is a probabilistic method based on Bayes' theorem and strong independence assumptions. The Naïve Bayes classifier selects the most likely classification $C_{nb}$ as follows [27]:

$$C_{nb} = \arg\max_{C_j \in C} P(C_j) \prod_{i} P(X_i \mid C_j) \qquad (1)$$

where $X = \{X_1, X_2, \ldots, X_n\}$ denotes the set of attributes, $C = \{C_1, C_2, \ldots, C_d\}$ denotes the finite set of possible class labels, and $C_{nb}$ denotes the class label output by the Naïve Bayes classifier.

In text categorization tasks, the Naïve Bayes classifier is one of the two most commonly used classifiers; the other is the SVM. In a text categorization context, the classifier can be written as equation 2 [25].

$$C(d) = \arg\max_{C_j \in C} P(C_j) \prod_{w_i \in d} P(w_i \mid C_j) \qquad (2)$$

where $d$ denotes a document represented as a stream of words $w = w_1, w_2, \ldots, w_n$, $C = \{C_1, C_2, \ldots, C_d\}$ denotes the finite set of possible class labels, and $C(d)$ denotes the class label output by the Naïve Bayes classifier.

To avoid zero values of $P(w_i \mid C_j)$, a smoothing method is applied. The Laplace smoothing method, shown in equation 3, is one of the most popular methods and is the one used in our experiments.

$$P(w_i \mid C_j) = \frac{1 + f(w_i, C_j)}{|W| + f(C_j)} \qquad (3)$$

where $f(w_i, C_j)$ denotes the frequency of word $w_i$ in the documents whose label is $C_j$, $f(C_j)$ denotes the total number of words seen in the documents whose label is $C_j$, and $|W|$ denotes the total number of distinct words seen in all documents.
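Equations 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names `train`, `word_prob` and `classify` are hypothetical, the prior $P(C_j)$ is assumed to be the fraction of training documents with label $C_j$, and the product in equation 2 is evaluated as a sum of logarithms to avoid numerical underflow on long documents.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs is a list of (label, list_of_words) pairs."""
    class_docs = Counter()            # number of documents per class
    word_freq = defaultdict(Counter)  # word_freq[C_j][w_i] = f(w_i, C_j)
    vocab = set()                     # W, the set of distinct words
    for label, words in docs:
        class_docs[label] += 1
        word_freq[label].update(words)
        vocab.update(words)
    return class_docs, word_freq, vocab

def word_prob(word, label, word_freq, vocab):
    """Laplace-smoothed P(w_i | C_j), as in equation 3."""
    total_words = sum(word_freq[label].values())  # f(C_j)
    return (1 + word_freq[label][word]) / (len(vocab) + total_words)

def classify(words, model):
    """Equation 2, computed in log space (an assumption for numerical stability)."""
    class_docs, word_freq, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / n_docs)  # assumed prior P(C_j)
        for w in words:
            score += math.log(word_prob(w, label, word_freq, vocab))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

toy_docs = [
    ("Science", ["atom", "energy", "atom"]),
    ("Game", ["player", "level"]),
    ("Game", ["player", "score"]),
]
model = train(toy_docs)
```

In the toy data the vocabulary has five distinct words and the Science class contains three words, so the smoothed estimate for "atom" in Science is (1 + 2) / (5 + 3).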

4. APPROACHES FOR LARGE-SCALE DATA MINING

• Sampling model: As stated in [19], there are several sampling methods used in data mining. Among them, random sampling is straightforward and works in many cases. In our sampling model, we use a random number to randomly select data from the training set and use the sampled data set to train the Naïve Bayes classifier.

Figure 1: Ensemble approach.

• Ensemble model: In this model, we randomly divide the whole training set into several groups, each of which has the same number of files. Because we randomly assign each file to a group, every group has approximately the same category distribution as the original data set. The idea of the ensemble model is to build an individual classifier on each group, use each classifier to classify the testing files, and then use a majority vote to decide the final category of each file, as shown in figure 1.

• Hadoop model: Thanks to its capability for reliable, scalable and distributed computing, Hadoop provides a framework for building a classifier on a large-scale data set. The model builds a classifier on the entire training set and the classifier is evaluated against the testing set. Both the training and the evaluation process can be parallelized in the MapReduce framework, as shown in figure 2.
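For the ensemble model in the list above, the partitioning and voting steps can be sketched as follows. This is a minimal sketch with hypothetical helper names (`shuffle_partition`, `majority_vote`, `ensemble_classify`); classifiers are assumed to be plain callables that map a document to a label, group sizes differ by at most one file when the data does not divide evenly, and ties in the vote are broken in favour of the label seen first, which the paper does not specify.

```python
import random
from collections import Counter

def shuffle_partition(items, n_groups, seed=0):
    """Randomly assign items to n_groups groups of (nearly) equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]

def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_classify(classifiers, document):
    """Ask every sub-model for a label and combine the answers by majority vote."""
    return majority_vote([clf(document) for clf in classifiers])

groups = shuffle_partition(range(10), n_groups=5)
```

Each group would then be used to train one sub-model, and `ensemble_classify` would be called once per test file.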

5. EXPERIMENTAL RESULTS

5.1 Data Set

The purpose of our experiments is to compare the three different data mining methods on a real-world large-scale data set. The most popular existing data sets used in the data mining field cannot meet our size requirement. Therefore, we chose to collect a real-world large data set from the Web. Our data set of html pages was fetched from www.dmoz.org. The Open Directory Project (ODP) [4] is the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a vast, global community of volunteer editors. DMOZ ODP is one of the most frequently used web directories in the web page classification research domain [32]. The DMOZ website contains a classification structure of 15 top-level categories, from which we chose 8 as our experiment data set. The eight categories we chose and the number of html pages from each category are listed in table 1. There is a total of 725,608 html pages, which constitute 30GB in size.

5.2 Data Preprocessing

Before we run topic classification, all html pages are preprocessed by our text preprocessor. The text preprocessor cleans the data for training and classification by (1) removing all html tags and extracting the content from the “title” and “description” metatags, and any meaningful content in the “body” section; (2) removing stop words and any symbol other than numbers or letters. After preprocessing, every html file has been converted to a text file containing a stream of terms. Since the purpose of our experiment is not to investigate the methodology of web page classification, we prefer to study the classification accuracy of simple text files rather than complex structured html files. Therefore, in the data preprocessing phase, we remove all the html related features and only keep the meaningful and important information presented in the html pages. In the rest of the experiment, the classification task can be seen as a text categorization task.
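The two preprocessing steps described above can be sketched with the Python standard library. This is only an illustration under assumptions, not the preprocessor used in the experiments: the class `TextExtractor`, the function `preprocess` and the tiny `STOP_WORDS` set are hypothetical, lowercasing is an added assumption, and all visible text outside script and style elements is treated as meaningful content.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop word list; a real preprocessor would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

class TextExtractor(HTMLParser):
    """Collects title and body text plus the "description" meta content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description":
                self.parts.append(attr.get("content") or "")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def preprocess(html):
    """Step 1: drop tags and keep text. Step 2: keep letters/digits, drop stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOP_WORDS]

page = (
    '<html><head><title>The Open Directory</title>'
    '<meta name="description" content="A guide to the Web">'
    '<style>p{color:red}</style></head>'
    '<body><p>Hello, world! 42</p></body></html>'
)
terms = preprocess(page)
```

The result is the stream of terms that the later classification steps would consume in place of the original html file.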

5.3 Experiment Setup

In our sampling and ensemble methods, we use Rainbow [5] to train the Naïve Bayes classifier. Rainbow has been widely used in text classification tasks. Rainbow reads the documents and writes to disk a model containing their statistics. Using these statistics, Rainbow computes the probabilities required to build a Naïve Bayes classifier, as discussed in Section 3.2. The classifiers are trained by Rainbow with its default Laplace smoothing method and no feature selection.

According to [28], Shuffle and Chop are two data partitioning methods. Shuffle distributes data randomly, while Chop divides data by its existing order. Because our data set has not been pre-randomized and Shuffle has been shown to perform better than Chop, we adopt the Shuffle method to partition data in our ensemble approach.

To build the Naïve Bayes classifier on the entire data set using MapReduce, we utilize the Mahout [3] Naïve Bayes library in our experiment. Mahout is an open source project providing scalable machine learning algorithms under the Apache license. To train a Naïve Bayes classifier, the statistics of the training files are retrieved through several rounds of MapReduce jobs. In the first round, each map function reads a chunk of the documents and emits three (key, value) pairs: (termCategory, 1), (term, 1) and (category, 1). In the next rounds of MapReduce jobs, the statistics required by the model are calculated and written to disk. In the testing phase, the model is loaded and test files are sent to the map function for classification; the reduce function then outputs the final confusion matrix.
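The first-round job described above can be illustrated with a small single-process sketch. This is not Mahout's code or API: the functions `map_document`, `reduce_counts` and `first_round` are hypothetical, the key layout is invented for illustration, and the sketch assumes that the three pairs are emitted once per term occurrence, which the text does not state explicitly.

```python
from collections import Counter

def map_document(category, terms):
    """Emit the three (key, 1) pairs for every term occurrence (an assumption)."""
    for term in terms:
        yield ("termCategory", term, category), 1
        yield ("term", term), 1
        yield ("category", category), 1

def reduce_counts(pairs):
    """Sum the values of all pairs that share the same key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals

def first_round(labelled_docs):
    """Run the map step over every document, then the summing reduce step."""
    pairs = (
        pair
        for category, terms in labelled_docs
        for pair in map_document(category, terms)
    )
    return reduce_counts(pairs)

counts = first_round([
    ("Art", ["paint", "museum", "paint"]),
    ("Game", ["paint"]),
])
```

Under this assumption the summed termCategory and category counts correspond to $f(w_i, C_j)$ and $f(C_j)$ in equation 3.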

5.4 Experimental Results

The Naïve Bayes classifier has proven to be a simple and effective text classifier. In our experiments, we deploy the Naïve Bayes classifier in the three models: the Hadoop model, the sampling model and the ensemble model. We run each experiment 10 times. In each round, we randomly select 50% of the files from the whole data set as the training set and the remaining 50% of the files as the test set. The three models each use the training set to train their classifiers and evaluate the classifiers against the test set. We then compute the average accuracy of each model. We use classification accuracy as our accuracy measure.

Figure 2: MapReduce Classifier Training and Evaluation Procedure

Table 1: Numbers of HTML Pages from Eight Categories.

| Category | Number |
| Art | 176,340 |
| Business | 188,100 |
| Computer | 88,830 |
| Game | 39,560 |
| Health | 43,680 |
| Home | 22,281 |
| Science | 85,197 |
| Society | 81,620 |

Table 2: Data mining accuracy of Hadoop approach.

| Category | Round1 | Round2 | Round3 | Round4 | Round5 | Round6 | Round7 | Round8 | Round9 | Round10 |
| Art | 80.09% | 80.12% | 81.18% | 80.31% | 80.20% | 80.17% | 80.86% | 79.70% | 79.79% | 80.06% |
| Business | 55.82% | 58.43% | 53.77% | 57.30% | 57.13% | 59.62% | 54.98% | 58.36% | 59.15% | 57.98% |
| Computer | 82.37% | 81.93% | 82.88% | 82.68% | 82.48% | 82.18% | 82.81% | 82.22% | 81.69% | 82.14% |
| Game | 78.83% | 78.84% | 76.86% | 77.70% | 78.45% | 78.94% | 78.17% | 79.78% | 78.49% | 79.02% |
| Health | 78.77% | 80.49% | 79.68% | 80.42% | 79.95% | 80.33% | 79.76% | 79.71% | 80.14% | 81.27% |
| Home | 68.01% | 66.78% | 67.97% | 66.41% | 67.16% | 67.12% | 67.49% | 67.13% | 66.16% | 67.44% |
| Science | 48.47% | 50.64% | 49.98% | 50.19% | 49.26% | 47.89% | 48.34% | 49.32% | 49.62% | 48.53% |
| Society | 63.07% | 61.00% | 61.10% | 61.62% | 61.77% | 61.50% | 61.92% | 62.16% | 62.16% | 61.39% |

Figure 3: Sampling data mining accuracy

5.4.1 Sampling Model

In the sampling model, we randomly select data from the training set and use the sampled data set to train the Naïve Bayes classifier. To investigate how accuracy varies with sample size, in each round we build four classifiers using four different sample sizes: 1,000, 2,000, 5,000 and 10,000 files from each of the eight categories. Figure 3 shows the accuracy of the models for the four sample sizes. From Figure 3, we can see that the classifier built on a sample size of 10k has the highest accuracy in each round. The average accuracy of the sampling model over 10 rounds improves from 64.24% to 67% when the sample is increased from 1k to 10k files per category. Unsurprisingly, the result indicates that increasing the sample size improves model accuracy.

5.4.2 Ensemble Model

In this model, we set the number of sub-models to 5, 10 and 15 and run the evaluations for every model in each round. Figure 4 shows the accuracy over the 10 rounds of the ensemble models with 5, 10 and 15 sub-models. From figure 4, we can see that the ensemble of 5 sub-models outperforms the ensembles of 10 and 15 sub-models. The average accuracy over 10 rounds decreases from 69.14% to 68.48% when the number of sub-models increases from 5 to 15. This suggests that sub-models built on more data are individually more accurate, which leads to a more accurate final result. However, the difference is slight.

5.4.3 Hadoop Model

In our Hadoop cluster, the Master node and the 22 Slave nodes run Ubuntu Linux. The Master is configured with an Intel(R) Pentium(R) 4 CPU at 2.80GHz and 4GB of memory. All the Slaves are configured with an Intel(R) Pentium(R) 4 CPU at 3.20GHz and 4GB of memory. For each round of classification, the model builds a classifier on the entire training set and the classifier is evaluated against the testing set. Table 2 shows the accuracy of the 10 rounds of evaluation in detail. The average accuracy over the 10 rounds of evaluation is 68.34%.

Figure 4: Ensemble data mining accuracy

Figure 5: Three approaches data mining accuracy

We select the best performing classifiers in the sampling and ensemble approaches, namely the one built on a sample size of 10k and the ensemble of 5 sub-models, to compare with the Hadoop model. In Figure 5, we can see that the ensemble approach has the highest overall accuracy among the three approaches. However, the accuracies of the three approaches are quite close.

6. DISCUSSION

The experiments show that the classifiers built on the entire data set have better accuracy than the one built on the sampled data set. It is not surprising that using the entire data set to build the model achieves a better outcome than sampling.

Undoubtedly, the MapReduce framework gives researchers a simple option for implementing parallel and distributed data mining algorithms on large-scale data sets. However, some data mining techniques, such as neural networks, may not be easy to implement in the MapReduce framework. At the same time, by increasing the sampling ratio, we achieve similar accuracy. However, this observation has some limitations. For example, it is not clear whether we would observe similar results for other classification algorithms.

Figure 6: MapReduce sampling framework.

Choosing an appropriate sampling method and sample size is a key factor that determines the quality of the classifier in the sampling approach. Poor quality samples will generate a poor classifier. A good sampling method should produce samples that are representative of the entire data set. Another important issue is the sample size. For Naïve Bayes, using the Hoeffding inequality [22], we can easily compute the sample size needed to accurately estimate $f(w_i, C_j)$. For example, with a sample size of fifty thousand, the relative error in the frequency estimate will be less than 0.007. Of course, such an analysis could be harder for other data mining tasks.
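As a rough check of the figure quoted above, the two-sided Hoeffding bound for the mean of $n$ independent observations bounded in $[0, 1]$ can be solved for the error $\epsilon$. The failure probability $\delta$ is not stated in the text; the value $\delta = 0.015$ used below is an assumption, chosen because it reproduces the quoted bound. Note that the inequality controls the absolute deviation $|\hat{p} - p|$ between the estimated and the true frequency.

```latex
P\left(|\hat{p} - p| \ge \epsilon\right) \le 2\exp\left(-2 n \epsilon^{2}\right) = \delta
\quad\Longrightarrow\quad
\epsilon = \sqrt{\frac{\ln(2/\delta)}{2n}}

% With n = 50000 and the assumed delta = 0.015:
\epsilon = \sqrt{\frac{\ln(2/0.015)}{2 \cdot 50000}}
         \approx \sqrt{4.89 \times 10^{-5}}
         \approx 0.007
```

Under these assumptions, a sample of fifty thousand keeps the estimation error below about 0.007 with probability of roughly 98.5%.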

In our research work, we realized that it would take considerable effort to implement every data mining algorithm in MapReduce fashion, and some algorithms, such as neural networks, may be very difficult to implement in a distributed fashion. Meanwhile, as our experiments indicate, the sampling approach can yield acceptable data mining accuracy.

To take full advantage of MapReduce and to reduce the implementation time, we propose to utilize MapReduce in the sampling phase. In this way, we only need to implement the sampling algorithm using MapReduce. We use the scalability of MapReduce to efficiently explore the entire data set in the sampling phase, and then feed the sampled data to data mining tools, such as WEKA [34], to build the data mining model. To obtain a good data mining model, we could design a feedback loop as shown in figure 6.

The main idea of this framework is to utilize the MapReduce technique to continuously adjust the sampled data and the sample size until we produce a good data mining model. As figure 6 shows, the system starts by setting the sampling thresholds (the sampling times threshold and the sampling iteration limit) that the existing in-memory data mining toolkits can handle, and then enters the self-adaptive feedback sampling loop. In this loop, the system uses the entire data set for sampling. The MapReduce component takes the entire data set as input, and the map function serves as the sampling function, which either selects an instance for the training set or discards it according to the particular sampling method the system follows. The reduce function outputs all selected instances as the training set. Whenever a sampled data set is retrieved successfully, the system calls the data mining toolkit, e.g. WEKA, to build a model. The system tests the model with the data that did not appear in the sampled data set, and then checks whether the accuracy has significantly improved compared to the previous sample. If the model’s accuracy has improved significantly, the system does another round of sampling, continuing until there is no substantial improvement in accuracy or the sampling iteration threshold is reached.
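The feedback loop described above can be sketched as follows. This is a hypothetical single-process illustration of the proposal, not an implemented system: `train` and `evaluate` are placeholder callables standing in for an in-memory toolkit such as WEKA, and the parameters `rate`, `growth`, `max_rate`, `max_iterations` and `min_gain` are assumed names. In particular, growing the sampling rate between rounds is only one possible way to adjust the sample.

```python
import random

def sample_map(index, rate, rng):
    """Map-side sampling: emit the instance index with probability `rate`."""
    if rng.random() < rate:
        yield index

def draw_sample_indices(n, rate, rng):
    """The reduce step simply collects every index that the map step selected."""
    return {i for idx in range(n) for i in sample_map(idx, rate, rng)}

def adaptive_sampling(dataset, train, evaluate, rate=0.01, growth=2.0,
                      max_rate=0.5, max_iterations=10, min_gain=0.005, seed=0):
    """Resample until accuracy stops improving or the iteration limit is hit."""
    rng = random.Random(seed)
    best_model, best_accuracy = None, float("-inf")
    for _ in range(max_iterations):
        chosen = draw_sample_indices(len(dataset), rate, rng)
        sample = [dataset[i] for i in chosen]
        held_out = [dataset[i] for i in range(len(dataset)) if i not in chosen]
        if not sample or not held_out:
            rate = min(rate * growth, max_rate)
            continue
        model = train(sample)
        # Test only on data that did not appear in the sample.
        accuracy = evaluate(model, held_out)
        improved = accuracy - best_accuracy >= min_gain
        if accuracy > best_accuracy:
            best_model, best_accuracy = model, accuracy
        if not improved or rate >= max_rate:
            break
        rate = min(rate * growth, max_rate)  # assumed adjustment policy
    return best_model, best_accuracy
```

Because the toolkit is passed in as a pair of callables, the same loop could wrap any in-memory learner without changing the sampling code.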

The system has the following advantages.

• First, by using the MapReduce framework, the system can explore huge data sets for sampling by following a simple and unified format.

• Second, it can utilize any existing data mining toolkit to build the models that are hard to implement in a distributed fashion.

• Third, the system can increase accuracy by comparing multiple models to obtain a good model.

7. CONCLUSIONS AND FUTURE WORK

In this paper, we explored three large-scale data mining approaches using a real-world large-scale data set. We used the same classification algorithm, Naïve Bayes, with different methods of handling the training set. We compared the accuracy of the three approaches in a series of experiments. The results of our experiments show that using more data in training can build a more accurate model. As the sample size increases, the sampling model can also achieve accuracy similar to that of the models built on the entire data set. Furthermore, we explored the relationship between model accuracy and sample size in the sampling model, and between model accuracy and data partition size in the ensemble model.

Based on our observations, we proposed the idea of utilizing the MapReduce framework to help improve the accuracy and efficiency of sampling. In the future, we plan to implement and explore our new framework by integrating it with different data mining toolkits and evaluating it on different data sets, and then to compare it with existing large-scale data mining approaches.

8. REFERENCES

[1] Amazon web services: http://aws.amazon.com/about-aws/.

[2] Hadoop: http://hadoop.apache.org.

[3] Mahout: http://lucene.apache.org/mahout.

[4] ODP: http://www.dmoz.org.

[5] Rainbow: http://www.cs.cmu.edu/~mccallum/bow/rainbow/.

[6] Sector/Sphere open source software: http://sector.sourceforge.net/.

[7] J. Bacardit and X. Llorà. Large scale data mining using genetics-based machine learning. In GECCO ’09: Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference, pages 3381–3412, New York, NY, USA, 2009. ACM.

[8] H. Cao, D. Jiang, J. Pei, Q. He, Z. Liao, E. Chen, and H. Li. Context-aware query suggestion by mining click-through and session data. In KDD ’08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 875–883, New York, NY, USA, 2008. ACM.

[9] E. Y. Chang, H. Bai, and K. Zhu. Parallel algorithms for mining large-scale rich-media data. In MM ’09: Proceedings of the seventeen ACM international conference on Multimedia, pages 917–918, New York, NY, USA, 2009. ACM.

[10] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: a distributed storage system for structured data. In OSDI ’06: Proceedings of the 7th symposium on Operating systems design and implementation, pages 205–218, Berkeley, CA, USA, 2006. USENIX Association.

[11] N. V. Chawla, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. Learning ensembles from bites: A scalable and accurate approach. J. Mach. Learn. Res., 5:421–451, 2004.

[12] C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. pages 281–288. MIT Press, 2006.

[13] A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW ’07: Proceedings of the 16th international conference on World Wide Web, pages 271–280, New York, NY, USA, 2007. ACM.

[14] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI ’04: Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, pages 10–10, Berkeley, CA, USA, 2004. USENIX Association.

[15] C. Domingo, R. Gavaldà, and O. Watanabe. Adaptive sampling methods for scaling up knowledge discovery algorithms. Data Min. Knowl. Discov., 6(2):131–152, 2002.

[16] S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37(5):29–43, 2003.

[17] D. Gillick, A. Faria, and J. DeNero. MapReduce: Distributed computing for machine learning.

[18] R. Grossman and Y. Gu. Data mining using high performance data clouds: experimental studies using Sector and Sphere. In KDD ’08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 920–927, New York, NY, USA, 2008. ACM.

[19] B. Gu, F. Hu, and H. Liu. Sampling and its application in data mining: A survey. 2000.

[20] Y. Gu and R. Grossman. Sector and Sphere: The design and implementation of a high performance data cloud. Theme Issue of the Philosophical Transactions of the Royal Society, E-Science and Global E-Infrastructure, 367:2429–2455, 2009.

[21] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang. Mars: a MapReduce framework on graphics processors. In PACT ’08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques, pages 260–269, New York, NY, USA, 2008. ACM.

[22] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.

[23] M. Kearns. Efficient noise-tolerant learning from statistical queries. J. ACM, 45(6):983–1006, 1998.

[24] J. Lee and S. Cha. Page-based anomaly detection in large scale web clusters using adaptive MapReduce (extended abstract). In RAID ’08: Proceedings of the 11th international symposium on Recent Advances in Intrusion Detection, pages 404–405, Berlin, Heidelberg, 2008. Springer-Verlag.

[25] A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification, 1998.

[26] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In EDBT ’96: Proceedings of the 5th International Conference on Extending Database Technology, pages 18–32, London, UK, 1996. Springer-Verlag.

[27] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.

[28] C. Moretti, K. Steinhaeuser, D. Thain, and N. V. Chawla. Scaling up classifiers to cloud computers. In ICDM ’08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 472–481, Washington, DC, USA, 2008. IEEE Computer Society.

[29] B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. Lyon, France, 2009. ACM Press.

[30] S. Papadimitriou and J. Sun. DisCo: Distributed co-clustering with map-reduce: A case study towards petabyte-scale end-to-end mining. In ICDM ’08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 512–521, Washington, DC, USA, 2008. IEEE Computer Society.

[31] F. Provost, V. Kolluri, and U. Fayyad. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3:131–169, 1999.

[32] X. Qi and B. D. Davison. Web page classification: Features and algorithms. ACM Comput. Surv., 41(2):1–31, 2009.

[33] J. C. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In VLDB ’96: Proceedings of the 22th International Conference on Very Large Data Bases, pages 544–555, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc.

[34] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2 edition, 2005.

[35] H. C. Yang, A. Dasdan, R. L. Hsiao, and D. S. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In SIGMOD ’07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 1029–1040, New York, NY, USA, 2007. ACM.
