PARADIGMS FOR REALIZING MACHINE LEARNING ALGORITHMS

(1)

Abstract

The article explains the three generations of machine learning algorithms—with all three trying to operate on big

data. The first generation tools are SAS, SPSS, etc., while second generation realizations include Mahout and

RapidMiner (that work over Hadoop), and the third generation paradigms include Spark and GraphLab, among

others. The essence of the article is that for a number of machine learning algorithms, it is important to look

beyond the Hadoop’s Map-Reduce paradigm in order to make them work on big data. A number of promising

contenders have emerged in the third generation that can be exploited to realize deep analytics on big data.

Introduction

Google’s seminal paper on Map-Reduce (MR)1 was the

trigger that led to a lot of developments in the big data space. Though the Map-Reduce paradigm was known in the func-tional programming literature, the paper provided scalable implementations of the paradigm on a cluster of nodes. The paper, along with Apache Hadoop, the open source im-plementation of the MR paradigm, enabled end users to process large data sets on a cluster of nodes—a usability paradigm shift. Hadoop, which comprises the MR im-plementation along with the Hadoop Distributed File System (HDFS), has now become the de facto standard for data processing, with lot of Industrial game-changers such as Disney, Sears, Walmart, and AT&T having their own Hadoop cluster installations.

The big data puzzle can be understood better by looking at venture capitalist (VC) funding information. This informa-tion has been generated from various VC sources on the web. Some of the interesting areas of funding, along with the corresponding startups, are also given in Table 1. It captures a futuristic landscape of the big data space—our perspective of important players/start-ups/software of big data.

It is evident from the table that a number of companies are focusing on building analytics on top of the Hadoop framework, leading to the emergence of the term ‘‘big data analytics.’’ By the term big data analytics, we refer to the ability to ask questions on large data sets and answer them appropriately, possibly by using machine learning techniques as the foundation.

The emerging focus of big data analytics is to make tradi-tional techniques such as market basket analysis scale and work on large data sets. This is reflected in the approach of SAS and other traditional vendors to build Hadoop con-nectors. The other emerging approach for analytics focuses on new algorithms including machine learning and data mining techniques for solving complex analytical problems including those in video and real-time analytics. The goal of this article is to review the literature and practices in this emerging subject area and to explore the fundamental para-digms that are useful to realizing big data analytics. Our perspective is that Hadoop is just one such paradigm, with a whole new set of others that are emerging including Bulk Synchronous Parallel (BSP)-based paradigms and graph processing paradigms, which are more suited to realize iter-ative machine learning algorithms.

PARADIGMS

FOR REALIZING

MACHINE

LEARNING

ALGORITHMS

Vijay Srinivas Agneeswaran, PhD,

Pranay Tonpay, and Jayati Tiwary, BE

Impetus Infotech India Private Limited Bangalore, Karnataka, India

(2)

The rest of the article is organized as follows: The next sec-tion, ‘‘Big Data Analytics,’’ gives a bird’s eye view of the three generations of realizations of machine learning algorithms. The subsequent two sections explain the two third-generation paradigms, namely Spark and GraphLab. The final section provides concluding remarks for the article.

Big Data Analytics: Three Generations

of Machine Learning Realizations

We will explain the different paradigms available for im-plementing machine learning (ML) algorithms both from the literature and from the open source community. First of all, we would like to furnish a view of

the three generations of ML tools available to us today:

1. The traditional ML tools for machine learning and statistical analysis—including SAS, SPSS, Weka, and the R language— allow deep analysis on smaller data sets that can fit the mem-ory of the node on which the tool runs.

2. Second-generation ML tools such as Mahout, Pentaho, or RapidMiner allow what we call a shallow analysis of big data.

3. The third-generation tools such as Spark, Twister, HaLoop, Hama, and GraphLab facilitate deeper analysis of big data.

First-generation ML tools/paradigms

The first-generation ML tools can facilitate deep analytics, as they have a wide set of ML algorithms. However, not all of them can work on large data sets like tera-petabytes of data, due to scalability limitations (limited by the nondistributed nature of the tool). In other words, they are vertically scalable (i.e., you can increase the processing power of the node on which the tool runs), but not horizontally scalable (i.e., not all of them can run on a cluster). These limitations are being addressed by building Hadoop connectors as well as pro-viding clustering options—meaning that the vendors have made efforts to reengineer the tools such as R and SAS to scale horizontally. This would fall under the second/third-generation tools as covered subse-quently.

Second-generation ML tools/

paradigms

The second-generation tools (we can now term the traditional ML tools such as SAS as first-generation tools) such as Mahout (http:// mahout.apache.com), Rapidminer, or Pentaho provide the ability to scale to large data sets by implementing the algorithms over Hadoop, the open source MR implementation. These tools are maturing fast and are open source (especially Mahout). Mahout has a set of algo-rithms for clustering and classification as well as a very good recommendation algorithm.2 Mahout can thus be said to work on big data with a number of production use cases

Table1. Important Players, Start-Ups, and Software of Big Data

Area of funding Important players/start-ups/software

Analytics Analytical Apps and App Dev: Digital Reasoning, Klouddata, JackBe, Accretive, Tibco Spotfire, Pentaho, ParStream, Concurrent, Birst, SAS, ClearStory Data, Terracotta, MailChimp, WibiData, Palantir, Quibole

Analytics & Tools: Karmasphere, Pervasive (Actian), Zettaset, Datameer, Splunk, Alpine Data Labs, Knime, HStreaming, Skytree, Bloomreach, Versium, Alteryx, Zemetis, Gauvus, MuSigma, RevolutionR

Analytic DBs & Platforms: Teradata/Asterdata, IBM/Netezza, HP/Vertica, Pivotal/Greenplum DB, ParAccel/Actian (and Actian Vectorwise), 1010data

Video Analytics: OpenCV, Ooyala, TubeMoghul, 3VR, Video Breakouts Computing paradigms Map-Reduce – Hadoop, Spark, HaLoop, Twister MR,

Real-time/CEP – Storm, Kafka, S4, Akka

_{Graph Processing – GraphLab, Pregel , Apache Giraph, Stanford GPS, Golden ORB, GraphX, Graph}

Search (Facebook)

_{Interactive Query – Dremel, Apache Drill} _{Distributed SQL – Cloudera Impala, Shark}

Hadoop distributions _{Cloudera, HW, GP/Pivotal, MapR, IBM, DataStax, AWS (EMR), Intel, WanDisco}

Storage _{NoSQL/NewSQL: DataStax (Cassandra), 10gen (MongoDB), VoltDB, Couchbase, TinyDB, NuoDB,}

Phoenix, FoundationDB, SciDB, Neo4J, OrientDB, GraphDB, HBase, Redis, etc.

File/Storage: Appistry, AmpliStor, HDFS, RainStor, EMC Isilon, Lustre, QFS

Visualization Datameer, Pentaho, Actuate, Jaspersoft, QlikTech, Tableau, Platfora, chart.io, Cirro, Cognos, Dataspora, Intellicus, Ayasdi, Zoomdata, SiSense, Centrifuge Systems, JQ Plot, D3.js, etc.

Miscellaneous Software defined networking: Nicira (VMware), Contrail ( Juniper), Arista

Data Munging: Dataspora, Trifacta

Big-data Governance: Druid (Metamarkets), Infochimps, DataMarket, Timetric, Intel Rhino, Zettaset

‘‘THE FIRST-GENERATION ML

TOOLS CAN FACILITATE DEEP

ANALYTICS, AS THEY HAVE

A WIDE SET OF ML

(3)

mainly for the recommendation system. We have also used Mahout in production systems for realizing recommendation algorithms in financial domain and found it to be scalable, though not without issues (we had to tweak the source sig-nificantly). One observation about Mahout is that it imple-ments only a smaller subset of ML algorithms over Hadoop: only 25 algorithms are production quality, with only 8 of 9 usable over Hadoop, meaning scalable over large data sets. These include linear regression, linear support vector ma-chine (SVM), k-means clustering, etc. It does provide a fast sequential implementation of the logistic regression, with parallelized training. However, as several others have also noted (see Quora, for instance), it does not have im-plementations of nonlinear SVMs or multivariate logistic regression (otherwise known as discrete choice model). Overall, this article is not intended for Mahout bashing; however, our point is that it is quite hard to implement certain ML algorithms including the

kernel SVM and conjugate gradient descent (note that Mahout has an implementation of stochastic gradi-ent descgradi-ent) over Hadoop. This has been pointed out by several others as well—for instance, see the paper by Srirama.3 This paper makes detailed comparisons between Hadoop and Twister Map-Reduce4with respect to iterative algorithms such as Con-jugate Gradient Descent (CGD) and shows that the overheads can be significant for Hadoop.

What do we mean by iterative? A set of entities that perform a certain computation, wait for results from neighbors or other entities, and start the next iteration. The CGD is a perfect example of an iterative ML algorithm: each CGD can be broken down into daxpy, ddot, and matmul as the primitives. We will explain these three primitives: daxpy is an operation that takes a vectorx, multiplies it by a constant k and adds another vector y to it; ddot computes the dot product of two vectorsxand y; matmul multiplies a matrix by a vector and produces a vector output. This means 1 MR per primitive, leading to 6 MRs per iteration, and eventually 100s of MRs per CG computation, as well as few GBs of communication, even for small matrices. In essence, the setup cost per iteration (which includes reading from HDFS into memory) over-whelms the computation for that iteration, leading to perfor-mance degradation in Hadoop MR. In contrast, Twister distinguishes between static and variable data, allowing data to be in memory across MR iterations as well as a combine phase for collecting all reduce outputs, and hence performs signifi-cantly better. We would like to make the comment about Hadoop that it is good for embarrassingly parallel applications, but certain hindrances remain for enterprise adoption, in-cluding the following:

1. Lack of object database connectivity (ODBC). A lot of business intelligence (BI) tools are forced to build sep-arate Hadoop connectors.

2. Data splits.If data splits are interrelated or computation needs to access data across splits, this may involve joins and may not run efficiently over Hadoop.

3. Iterative computations.Hadoop MR is not well suited for two reasons: One is the overhead of fetching data from HDFS for each iteration (which can be amortized by a distributed caching layer) and second, is the lack of long-lived MR jobs in Hadoop. This implies that for each it-eration, new MR jobs need to be initialized – overhead of initialization could overwhelm computation for the iter-ation and this could cause significant performance hits. The other second-generation tools are the traditional tools that have been scaled to work over Hadoop. The choices in this space include the work done by Revolution Analytics to scale R over Hadoop and the work to implement a scalable runtime over Hadoop for R programs.5The SAS in-memory analytics, part of the High Performance Analytics toolkit from SAS, is another attempt at scaling a traditional tool by using a Hadoop cluster. However, the re-cently released version works over Greenplum/Teradata in addition to Hadoop: this could then be seen as a third-generation approach. The other interesting work is by a small start-up known as concurrent sys-tems, which provides a Predictive Modeling Markup Lan-guage (PMML) run-time over Hadoop. PMML is like the XML of modeling, allowing models to be saved in descriptive language files. Traditional tools such as R and SAS allow the models to be saved as PMML files. The runtime over Hadoop would allow these model files to be scaled over a Hadoop cluster, so this also falls under our second-generation tools/ paradigms.

Third-generation ML tools/paradigms

The limitations of Hadoop and its lack of suitability for certain classes of applications has motivated some researchers to come up with alternatives. Researchers at the University of California, Berkeley have proposed Spark6 as one such al-ternative. In other words, Spark could be seen as the next generation data processing alternative to Hadoop in the big data space. The key idea distinguishing Spark is its in-memory computation, allowing data to be cached in in-memory across iterations/interactions. The Berkeley researchers have proposed Berkeley Data Analytics Stack (BDAS) as a collec-tion of technologies that help in running data analytics tasks across a cluster of nodes. The lowest level component of the BDAS is Mesos, the cluster manager that helps in task allo-cation and resource management tasks of the cluster. The

‘‘THE SECOND-GENERATION

TOOLS

.

PROVIDE THE ABILITY

TO SCALE TO LARGE DATA

SETS BY IMPLEMENTING THE

ALGORITHMS OVER HADOOP,

THE OPEN SOURCE MR

IMPLEMENTATION.’’

(4)

second component is the Tachyon file system built on top of Mesos. Tachyon provides a distributed file system abstraction and provides interfaces for file operations across the cluster. In Spark, the computation paradigm is realized over Tachyon and Mesos in a specific embodiment, though it could be realized without Tachyon and even without Mesos for clus-tering. Shark, which is realized over Spark, provides an SQL abstraction over a cluster—similar to the abstraction Hive provides over Hadoop. This article explores Spark, which is the main ingredient over which ML algorithms can be built. The main motivation for Spark was

that the commonly used MR para-digm, while being suitable for some applications that can be expressed as acyclic data flows, was not suitable for other applications, such as those that need to reuse working sets across iterations. So they proposed a new paradigm for cluster computing that can provide similar guarantees or fault-tolerance as MR but would also be suitable for iterative and in-teractive applications.

The HaLoop work7 also extends Hadoop for iterative ML algorithms. HaLoop not only provides a programming ab-straction for expressing iterative applications, it also uses the notion of caching to share data across iterations and for fixpoint verification (termination of iteration), thereby im-proving efficiency. Twister (http://iterativemapreduce.org) is another effort similar to HaLoop.

The other important tool that has looked beyond Hadoop MR comes from Google: the Pregel framework for realizing graph computations.8 Computations in Pregel comprise a series of iterations known as supersteps. Each vertex in the graph is associated with a user-defined compute function; Pregel ensures at each superstep that the user-defined com-pute function is invoked in parallel on each edge. The vertices can send messages through the edges and exchange values with other vertices. There is also the global barrier, which moves forward once all compute functions are terminated. Readers familiar with bulk synchronous parallel (BSP) can realize why Pregel is a perfect example of BSP (a set of entities computing in parallel with global synchronization and able to exchange messages).

Apache Hama9is the open source equivalent of Pregel, being an implementation of the BSP. Hama realized BSP over HDFS, as well as the Dryad engine from Microsoft. It may be that they do not want to be seen as being different from the Hadoop community. But the important thing is that BSP is an inherently well-suited paradigm for iterative computations and Hama has parallel implementations of the CG, which we said is not easy to realize over Hadoop. It must be noted that the BSP engine in Hama is realized over message passing

interface (MPI), the father (and mother) of parallel pro-gramming literature (www.mcs.anl.gov/research/projects/ mpi/). The other projects that are similar to Pregel are Apache Giraph, Golden orb, and Stanford GPS. Google are yet to open source Pregel, to the best knowledge of the authors.

Third-Generation ML Tool: Spark

The key notion in Spark is the Resilient Distributed DataSet (RDD), which can be cached in memory on different nodes and can be used across iterations, thereby improving the performance significantly. Spark also addresses fault-tolerance for the RDDs, meaning that if a node crashes, there is enough information in other RDDs to recreate or reconstruct the lost RDD partition through what they call a ‘‘lineage.’’ Spark is im-plemented in Scala,10 a high-level language that is similar to Java and gaining popularity. The main dif-ference between Java and Scala is that Scala is purely object oriented, much like the classical object-oriented language Smalltalk. Scala also unifies functional programming (of the Lisp kind) with object-oriented programming. This means that some classes may inherit from functions [e.g., array type of Scala inherits from functions and is written as A(i)].

Spark offers some alternatives for building RDDs: 1. RDDs can be built from a file in HDFS.

2. They can be built by parallelizing a Spark collection (this is slicing like an array).

3. RDDs can also be built by transforming existing RDDs (specify operations such as map or filter or join). 4. RDDs can also be built by changing the persistence or

saving options of existing RDDs (‘‘save’’ allows an RDD to be saved to HDFS or cache constructs). It must be noted that the cache construct is only a hint: if there is not enough memory across the cluster, Spark will re-construct the RDD on the fly. This implies that Spark can scale (handle increasing data size with reduced performance) and will be fault tolerant.

RDDs can be used in actions—operations that return a value or export data to storage. Examples of such actions include count, collect, save, and reduce. Parallelism in RDDs is facilitated by constructs like foreach, reduce, and collect and the user-defined function, which is a first-class object in Scala. ‘‘Reduce’’ is an associative function that combines the data set elements to produce a result at the driver program. The ‘‘collect’’ construct collects all elements of the data set to the driver program. An array can be updated by parallelize, map and collect operations. The ‘‘foreach’’ construct allows a user provided function to be executed on each element of an RDD in parallel.

‘‘THE LIMITATIONS OF HADOOP

AND ITS LACK OF SUITABILITY

FOR CERTAIN CLASSES

OF APPLICATIONS

HAS MOTIVATED SOME

RESEARCHERS TO COME

UP WITH ALTERNATIVES.’’

(5)

Programmers can pass functions or closures to invoke the map, filter, and reduce operations in Spark. Normally, when Spark runs these functions on worker nodes, the local vari-ables within the scope of the function are copied. Spark has the notion of shared variables for emulating ‘‘globals’’ using the broadcast variable and accumulators. Broadcast variables are used by the programmer to copy read-only data once to all the workers (static matrices in CGD kind of algorithms can be broadcast variables). Accumulators are variables that can only be added by the workers and read by the driver program: parallel sums can be realized fault-tolerantly. It must be noted that the programming interface of Spark is similar to DyradLinQ.11However, DyradLinQ does not have the concept of RDDs or any way for data to be shared across iterations, the main differentiator of Spark. Spark is emerging as a promising Hadoop alternative in the big data analytics space, with a number of production use cases, mainly from start-ups, small, and medium enterprises.

We have conducted performance studies of Spark compared with second generation paradigms. In this direction, we have implemented several ML algorithms over Spark, including the

k-means clustering – a commony used algorithm for clustering data sets. The performance comparison of the Mahout’s k-means algorithm realized over Hadoop versus the k -means algorithm realized over Spark is given in Table 2. The performance studies were done over a three-node cluster, each node being a 32-core Xeon with 32 GB RAM and 32 GB swap. This tells us that the Spark implementation can be significantly

faster and would scale better over a cluster of nodes com-pared with the Mahout implementation. The main reason for the slow performance of Mahout is because of the use of sequence files and consequent disk accesses, whereas Spark

performs in-memory computation with the RDDs. Figure 1 shows the end-to-end performance of the logistic regression algorithm in Mahout and Spark on the same 3-node cluster. This comparison is in line with the comparison made by the AmpLab team on the Spark website for logistic regression (http://spark.incubator.apache.org/examples.html).

Third-Generation ML Tool: GraphLab

While Pregel is good at graph parallel abstraction, is easy to reason with, and ensures deterministic computation, it leaves it to the user to architect the movement of data. Further, like all BSP systems, it also suffers from the curse of the slow jobs, meaning that even a single slow job (which could be due to load fluctuations or other reasons) can slow down the whole computation. To alleviate some of the above issues, Graph-Lab has been proposed (http://graphlab.org). Their paper appeared inProceedings of the VLDB Endowment.12The main motivation for coming up with GraphLab was to build an asynchronous graph processing paradigm—one that is not affected by the curse of the slow job (asynchrony implies does not need barrier synchronization). It also allows dynamic it-erative computations. User-defined update functions live on each vertex and transform data in the scope of the vertex. It can choose to trigger neighbor updates (for example, can be triggered only if the rank changes drastically in a page-rank kind of algorithm) and can run without global synchronization. Importantly, while Pregel lives with sequential dependencies on the graphs, Graph-Lab allows parallelism, which is im-portant in certain ML algorithms, including collaborative filtering. GraphLab has implemented several ML algorithms including Alternating Least Squares (ALS), collaborative fil-tering, kernel SVM, belief propagation, matrix factorization, Gibbs sampling, etc. The paper also shows significant speed-up compared with Hadoop for an expectation maximization algorithm.

GraphLab213 is a new asynchronous shared memory ab-straction in which the user-defined vertex programs share a distributed graph, which has data associated with both ver-tices and edges. Each vertex program can access data associ-ated with the vertex or incident edges or neighboring vertices. It can also schedule computation to be run on the neigh-boring vertices in the future. Serializability is ensured, as GraphLab does not allow neighboring vertex programs to run simultaneously. It characterizes a graph computation as comprising three phases: gather (where values from neigh-bors are gathered by the vertex program), apply (these values are applied to the current vertex—sum or union of data on this and neighboring vertices and edges can be specified), and scatter (neighboring vertex programs can be scheduled to

Table2. Performance Comparison of the Mahout’s

k-means Algorithm Realized over Hadoop Versus thek-means Algorithm Realized over Spark

Time taken (in seconds) Number of points

clustered (in millions) Spark Mahout

0.1 4 363 0.2 6 368 0.3 7 374 0.4 9 391 0.5 11 407 0.6 13 425 0.7 16 428 0.8 19 433 0.9 21 435 1 24 437 2.5 91 570

‘‘THE KEY IDEA DISTINGUISHING

SPARK IS ITS IN-MEMORY

COMPUTATION, ALLOWING

DATA TO BE CACHED

IN MEMORY ACROSS

ITERATIONS/INTERACTIONS.’’

(6)

take the new value of this vertex). The gather and scatter phases can determine appropriate fan-in and fan-out of the vertices. By separating this out, GraphLab can be efficient on high fan-in or high fan-out edges,

which may occur in power law graphs.

The main challenges of natural graphs include the asymmetric fan-in/fan-out distribution leading to work imbalance in systems (such as Pregel) that treat vertices uniformly. The partitioning mechanism is an-other challenge. Since Pregel and Graphlab1 use hash-based random partitioning, they are not well suited for natural graphs. Handling natural graphs also requires parallelism ab-straction within individual vertices, which is not handled in systems such

as Pregel. GraphLab2 provides new ways of partitioning a power law graph (where high-degree vertices could limit parallelism) by defining an abstraction known as Power-Graph. Vertices are associated with nodes, while edges can span multiple nodes. PowerGraph gets the best of both Pregel

(associative, commutative gather concept) and GraphLab1 (shared memory abstraction). PowerGraph introduces novel-way vertex cuts, which allow it to represent the graph better in a distributed system compared with GraphLab1 or Pregel. PowerGraph factors the vertex program into gather, sum, apply, and scatter functions and can hence distribute the vertex program across the nodes of a distributed system. The gather function is invoked on the edges adjacent to the cur-rent one, depending on whether the gather_nbrs parameter is set to none, all, in, or out. The sum is a commutative and associative operator. The apply function computes a new value of the vertex. The scatter function is invoked on the neighbors (again based on the scatter_nbrs parameter). PowerGraph has both a synchronous (resembling Pregel) and an asynchronous execution mode (resembling Piccolo). However, asynchronous execution is nondeterministic in Piccolo, which could lead to instability or nonconvergence for certain ML algorithms such as statistical simulations.14 In contrast, PowerGraph enforces serializability, which implies that every parallel execution has a cor-responding serial execution order. While GraphLab1 provided the same, it was inefficient and unfair to high-degree vertices due to a sequential locking protocol. Pow-erGraph introduces a new parallel locking protocol that is fair to high-degree vertices, allowing paralleli-zation of the vertex program on a cluster. The PowerGraph paper also shows significant performance benefits for PageRank and triangle counting problems on the Twitter graph. The different graph proces-sing paradigms have been compared and contrasted in Table 3 for quick reference.

Table 4 carries a comparison of the various paradigms across different nonfunctional features such as scalability,

FIG. 1. Logistic regression—comparison of Spark versus Mahout.

Table3. Comparison of Graph Processing Paradigms

Graph paradigms

Computation (synchronous or

asynchronous) Determinism and its effects Efficiency vs. asynchrony

Efficiency of processing power law graphs

GraphLab1 Asynchronous Deterministic: serializable computation

Uses inefficient locking protocols

Inefficient: locking protocol is unfair to high degree vertices Pregel Synchronous: Bulk

Synchronous Parallel based

Deterministic: serial computation NA Inefficient: curse of slow jobs

Piccolo Asynchronous Nondeterministic: nonserializable computation

Efficient, but may lead to incorrectness

May be efficient GraphLab2

(PowerGraph)

Asynchronous Deterministic: serializable computation

Uses parallel locking for efficient, serializable, and asynchronous computations

Efficient: parallel locking and other optimizations for processing natural graphs

‘‘THE MAIN MOTIVATION FOR

COMING UP WITH GRAPHLAB

WAS TO BUILD AN

ASYNCHRONOUS GRAPH

PROCESSING PARADIGM—

ONE THAT IS NOT AFFECTED

BY THE CURSE OF THE SLOW

JOB (ASYNCHRONY IMPLIES

DOES NOT NEED BARRIER

(7)

fault tolerance, and the algorithms that have been im-plemented. It can be inferred that while the traditional tools have worked on only a single node and may not scale hor-izontally and may also have single point of failure, recent reengineering efforts have made them move across genera-tions. The other point to be noted is that most of the graph processing paradigms are not fault tolerant, while Spark and HaLoop are among the third-generation tools that provide fault tolerance.

Conclusions

This article has provided a comprehensive review of the three generations of tools/paradigms that realize machine learning algorithms for big data:

The first-generation tools, which include SAS and SPSS, can help in deep analytics but may only scale vertically.

The second-generation tools such as Mahout and RapidMi-ner can scale horizontally but may be limited by the Hadoop MR over which they are realized (in terms of the number of algorithms that can be implemented nonserially).

The third-generation realizations such as Spark and GraphLab, among others, promise the most in terms of horizontal scalability and the large number of ML algo-rithms that can be potentially implemented. However, it will take plenty of effort, time, and widespread adoption and industrial use to make sure that these third-generation tools realize their true potential and allow deep analytics of big data. It must be noted that only Spark has production use cases, with other tools starting to be used for testing/de-velopment. This article has also given certain performance

studies of ML algorithms realized over Spark and Mahout over Hadoop, illustrating the power of the third-generation paradigm.

Acknowledgments

The authors wish to thank Impetus management including Pankaj Mittal and Vineet Tyagi for providing encouragement and support. We also wish to thank Dr. Nitin Agarwal and Joydeb Mukherjee from the Data Sciences Practice team at Impetus. We also wish to express our gratitude to Gurvinder Arora, our technical writer, for reviewing the document and improving the language.

Author Disclosure Statement

No financial conflicts exist.

References

1. Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM 51, 1. 2008, pp. 107–113.

2. Ekstrand MD, Ludwig M, Konstan JA, Riedl JT. Re-thinking the recommender research ecosystem: Repro-ducibility, openness, and LensKit. Proceedings of the fifth ACM conference on Recommender systems (RecSys¢11). ACM, New York: 2011, pp.133–140.

3. Srirama SN, Jakovits P, Vainikko E. Adapting scientific computing problems to clouds using MapReduce. Future Gener Comput Syst 2012; 28: 184–192.

4. Ekanayake J, Li H, Zhang B, et al. A runtime for iterative MapReduce. Proceedings of the 19th ACM International

Table4. Comparison of the Various Paradigms Across Different Nonfunctional Features

Generation First generation Second generation Third generation

Examples SAS, R, Weka, SPSS

in native form

Mahout, Pentaho, Revolution R, SAS In-memory Analytics (Hadoop)

Spark, HaLoop, GraphLab, Pregel, SAS In-memory Analytics (Greenplum/Teradata), Giraph, GoldenOrb, Stanford GPS, ML over Storm

Scalability Vertical Horizontal (over Hadoop) Horizontal (Beyond Hadoop)

Algorithms Available

Huge collection of algorithms

Small subset: sequential logistic regression, linear SVMs, Stochastic Gradient Descent,

k-means clustering, Random Forests, etc.

Much wider: including Conjugate Gradient Descent (CGD), Alternating Least Squares (ALS), collaborative filtering,

kernel SVM, belief propagation, matrix factorization,

Gibbs sampling, etc.

Algorithms Not Available

Practically nothing Vast no.: Kernel SVMs, Multivariate Logistic Regression, Conjugate Gradient Descent, ALS, etc.

Multivariate logistic regression in general form,k-means clustering etc.: work in progress

to expand the set of algorithms available.

Fault-Tolerance Single point of failure Most tools are FT, as they are built on top of Hadoop

FT: HaLoop, Spark

Not FT: Pregel, GraphLab, Giraph FT, fault-tolerant; GPS, graph processing system; ML, machine learning; SVM, support vector machine.

(8)

Symposium on High Performance Distributed Comput-ing. Chicago, IL: 2010.

5. Venkataraman S, Roy I, Young AA, Schreiber RS. Using R for iterative and incremental processing. Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing (HotCloud ’12). Berkeley, CA: USENIX As-sociation, 2012, p. 11.

6. Zaharia M, Chowdhury M, Franklin MJ, et al. Spark: cluster computing with working sets. Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud ’10). Berkeley, CA: USENIX As-sociation, 2010, p. 10.

7. Bu Y, Howe B, Balazinska M, Ernst MD. HaLoop: Effi-cient iterative data processing on large clusters. Pro-ceedings VLDB Endowment 2010; 3:285–296.

8. Malewicz G, Malewicz G, Austern MH, et al. Pregel: A system for large-scale graph processing. Proceedings of SIGMOD International Conference on Management of Data (SIGMOD ’10). New York: Association for Com-puting Machinery (ACM), 2010, pp. 135–146.

9. Seo S, Yoon EJ, Kim J, et al. HAMA: An efficient ma-trix computation with the MapReduce framework. Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Sci-ence (CLOUDCOM ’10). Washington, DC: IEEE Com-puter Society, 2010, pp. 721–726.

10. Odersky M, Spoon L, Venners B. Programming in Scala, 2nd ed. Walnut Creek, CA: Artima Publishers, 2010. 11. Yu Y, Isard M, Fetterly D, et al. DryadLINQ: A system

for general-purpose distributed data-parallel computing

using a high-level language. Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI’08). Berkeley, CA: USENIX As-sociation, 2008, pp. 1–14.

12. Low Y, Bickson D, Gonzalez J, et al. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endow-ment 2012; 5:716–727.

13. Gonzalez J, Low Y, Gretton A, Guestrin C. Parallel gibbs sampling: From colored fields to thin junction trees. AISTATS 2011; 15:324–332.

14. Gonzalez JE, Low Y, Gu H, et al. PowerGraph: Distributed graph-parallel computation on natural graphs. Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’12). Berkeley, CA: USENIX Association, 2012, pp. 17–30.

Address correspondence to:

Vijay Srinivas Agneeswaran, PhD Innovation Labs

Impetus Infotech India Private Limited Pritech Park SEZ

Bellandur Outer Ring Road Bangalore, Karnataka 560103 India

E-mail: [email protected]

This work is licensed under a Creative Commons Attribution 3.0 United States License. You are free to copy, distribute, transmit and adapt this work, but you must attribute this work as ‘‘Big Data. Copyright 2013 Mary Ann Liebert, Inc. http://liebertpub.com/big, used under a Creative Commons Attribution License: http://creativecommons.org/licenses/ by/3.0/us/’’