Analysis Pipelines for Benchmarking Big Data Systems

(1)

Analysis Pipelines for Benchmarking Big Data Systems

Ref. Code: Berlin_EN_2182, Suggested starting date: May 15, 2013

Today, practically everyone ranging from web companies to traditional enterprises to natu-ral science researchers to anthropologists, is either already experiencing or at least antici-pating challenges due to two ongoing trends. First, the data that is acquired, stored, and analyzed to gain insight and value becomes increasingly voluminous, diverse, and time-critical – leading to the coinage of the term of Big Data. Second, the depth of analytics over this Big Data steadily increases to drive processes, e.g. for spam detection and ad-vertisement placement. Within the past few years, industrial and academic organizations designed a wealth of new systems to cope with these challenges. These include MapRe-duce [1], SCOPE [2], ASTERIX [3], and Stratosphere [4]; the latter being developed at our group.

Right now, there is no generally accepted way to benchmark such systems. In our previ-ous work, we implemented Myriad [5], a toolkit for the creation of parallel data generators and Big Darwin, a repository for workloads from different domains in this space, e.g. ma-chine learning and graph analysis. As a next step towards standardized benchmarks in this field, we plan to conceptualize and implement representative data analysis pipelines as they are commonly executed by Big Data users such as Walmart, Google, or CERN. Such pipelines consist of related jobs that potentially perform data transformations, rela-tional algebra operations, or multi-stage machine learning / statistical algorithms.

In this project, we will specify and implement analytic pipelines for the massively-parallel data analysis system Stratosphere, and quantitatively compare it to Big Data stacks devel-oped elsewhere (e.g., Spark from UC Berkeley or ASTERIX from UC Irvine) – on the basis of resulting pipelines – in a cluster or cloud (like Amazon EC2) setting. Candidates for this project should be interested in MapReduce, database internals, machine learning, and have good programming skills in Java.

[1]! Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004:137-150

[2]! Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Ake Larson, Ronnie Chaiken, Darren Shakib: SCOPE: parallel databases meet MapReduce. VLDB J. (VLDB) 21(5):611-636 (2012)

[3]! Alexander Behm, Vinayak R. Borkar, Michael J. Carey, Raman Grover, Chen Li, Ni-cola Onose, Rares Vernica, Alin Deutsch, Yannis Papakonstantinou, Vassilis J. Tso-tras: ASTERIX: towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases (DPD) 29(3):185-216 (2011)

[4]! Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, Daniel War-neke: Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. SoCC 2010:119-130

[5]! Alexander S. Alexandrov, Berni Schiefer, John Poelman, Stephan Ewen, Thomas Bodner, Volker Markl: Myriad - Parallel Data Generation on Shared-Nothing Architec-tures. Proceedings of the First Workshop on Architectures and Systems for Big Data (ASBD), 2011

(2)

Implementing a DSL for Machine Learning on top of a

Massively-Parallel Data Processing System

Alexander Alexandrov – [email protected]

The boom of digital data in the past decades has triggered unprecedented interest in dis-ciplines that provide tools to discover and extract useful patterns from the expanding sea of information. Machine Learning is probably the most prominent example of such disci-pline. A Machine Learning cycle consists of a learning phase, where the parameters of a statistical model are fitted to training data, and a discovery phase, where the trained model is used to predict features of new, unseen data.

Empirical evidence shows that the quality of many Machine Learning algorithms improves with the size of the training data, and that a simple algorithm is likely to outperform a more complex one if it gets more training data [1]. Modern Machine Learning applications there-fore should

be able to scale to very large sets of training data. In the past few years, several projects have demonstrated that general-purpose parallel computation models like Google’s Ma-pReduce (and its open-source implementation Hadoop) [2,3] can be used to implement highly scalable, fault- tolerant machine learning algorithms [4,5].

The Stratosphere system [6] provides an extension of the MapReduce programming model. In Stratosphere, the data processing pipelines are specified as directed acyclic graphs, and the nodes of the graph are higher-order functions called parallelization con-tracts (PACTs), e.g. Map and Reduce, parametrized with UDFs that handle the actual computation. We are currently building a new Stratosphere optimizer that will allow effi-cient specification of Domain Specific Languages (DSLs) on top of the PACT programming model.

The goal of this project is to implement a small DSL for Machine Learning on top of the Stratosphere optimizer. In the course of the project, you will (A) specify an algebra of suit-able operators for the Machine Learning DSL (e.g. matrix multiplication, matrix decomposi-tion),

(B) provide PACT implementations for the algebra, and (C) find algebraic equivalences that

can be used by the optimizer for rewriting. Candidates for this project should be interested in MapReduce, database query optimization and linear algebra, and have good program-ming skills in Java.

[1]! A. Halevy, P. Norvig, F. Pereira: "The Unreasonable Effectiveness of Data". IEEE In-telligent Systems, pp. 8-12, March/April, 2009

[2]! J. Dean, S. Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters”. OSDI 2004

[3]! Apache Hadoop. http://hadoop.apache.org/

[4]! A. Ghoting, R. Krishnamurthy, E. P. D. Pednault, B. Reinwald, V. Sindhwani, S. Tatik-onda, Y. Tian, S. Vaithyanathan, “SystemML: Declarative machine learning on Ma-pReduce”. ICDE 2011:231-242

[5]! Apache Mahout: Scalable machine learning and data mining. http://mahout.apache.org/

(3)

GPU-assisted model optimization in PostgreSQL

Max Heimel– [email protected]

Ref. Code: Berlin_EN_2221, Suggested starting date: June 01, 2013

In Machine Learning, statistical models are fitted - or trained - to observations. A trained model can be used to predict features of unseen data or to discover new aspects of ob-served data. Specific model parameters are typically learned using gradient-based nu-merical optimization algorithms like Gradient Descent or Quasi-Newton methods: Given the gradient of an objective function, these methods converge to an optimal set of parame-ters that fit the model to the given training data.

The quality of Machine Learning algorithms improves with the available amount of training data [1]. Today, models are usually trained using statistical / numerical systems such as R or Matlab. When training data is kept in a database, this approach requires data transfer between different systems; a process that does not scale well to large data volumes. In order to avoid this transfer, it is helpful to push model training directly into the database kernel [2]. While this approach reduces the need to im- and export data, it also adds non-trivial computational load for the database. To alleviate this problem, we propose to use graphics cards as "model-training co-processors" for the database kernel [3]. Graphics cards are well-suited to accelerate the computations encountered in model training [4] and can therefore help to reduce the additional load that is put on the database.

The goal of this project is to implement gradient-based optimization routines on a graphics card, integrate them into the open-source database engine PostgreSQL [5] and evaluate the performance of this integrated optimization module. You should be interested in ma-chine learning, numerical optimization and parallel programming on graphics cards using OpenCL [6], and have - at least basic - programming experience in C/C++ (or Java).

[1]! Halevy, A., Norvig, P., & Pereira, F. (2009). The unreasonable effectiveness of data. Intelligent Systems, IEEE, 24(2), 8-12.

[2]! Hellerstein, J. M., Ré, C., Schoppmann, F., Wang, D. Z., Fratkin, E., Gorajek, A., ... & Kumar, A. (2012). The MADlib analytics library: or MAD skills, the SQL. Proceedings of the VLDB Endowment, 5(12), 1700-1711.

[3]! Heimel, M., & Markl, V. (2012). A First Step Towards GPU-assisted Query Optimiza-tion.

[4]! Steinkraus, D., Buck, I., & Simard, P. Y. (2005, August). Using GPUs for machine learning algorithms. In Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on (pp. 1115-1120). IEEE.

[5]! http://www.postgresql.org/ [6]! http://www.khronos.org/opencl/

(4)

Using Compiler Techniques for Database Query

Optimi-zation

Marcus Leich– [email protected]

Kostas Tzoumas– [email protected]

During the last years, we have seen a boom in data availability, coupled by a plunging cost of stor-age and computing. These trends, accelerated by the opportunity of cloud computing are deliver-ing the "Big Data Analytics" promise: the ability to analyze data of an unprecedented volume and variety. This data is usually heterogeneous, from structured enterprise data, to unstructured Web and text data, and graph data from social networks. Traditional database systems were designed for tightly structured data, and are not a good match for such variety. In addition, the volume of such data necessitates the use of massively distributed computation platforms.

In reaction to these trends, a new breed of parallel data processing systems such as MapReduce/ Hadoop [3], Asterix[2], Scope[6], and Stratosphere[1] have emerged, the latter being developed at our group in TU Berlin. These systems have typically in their core a flexible functional dataflow model that allows many degrees of freedom to the programmer. Typically, programmers are free to write the data analysis code in a full-fledged imperative language such as Java. This poses chal-lenges to the traditional query optimizer designs that typically assume an algebraic query lan-guage.

In our previous work, we provided some building blocks for analyzing the code of user- defined code using program analysis, and using the outcomes for this analysis to emulate transformations done by traditional query optimizers without knowing full semantics [5]. The Scope system at Mi-crosoft has also arrived at similar techniques [4]. The goal of this project is to go beyond traditional query optimizers, by taking a holistic view at program optimization including compiler and database query optimization. We will investigate (1) intrusive techniques that split user code to pieces and optimize pieces independently, and (2) extracting semantics of user code in order to estimate re-source consumption. The ideal candidate should have experience with Java programming, data-base internals (query optimization), compilers, and program optimization. Experience with the Soot framework for program analysis is a plus.

[1]! Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, Daniel Warneke: Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. SoCC 2010:119-130

[2]! Alexander Behm, Vinayak R. Borkar, Michael J. Carey, Raman Grover, Chen Li, Nicola Onose, Rares Vernica, Alin Deutsch, Yannis Papakonstantinou, Vassilis J. Tsotras: ASTERIX: towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases (DPD) 29(3):185-216 (2011)

[3]! Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clus-ters. OSDI 2004:137-150

[4]! Zhenyu Guo, Xuepeng Fan, Rishan Chen, Jiaxing Zhang, Hucheng Zhou, Sean McDirmid, Chang Liu, Wei Lin, Jingren Zhou, and Lidong Zhou. Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE. OSDI 2012

[5]! Fabian Hueske, Mathias Peters, Matthias Sax, Astrid Rheinländer, Rico Bergmann, Aljoscha Krettek, Kostas Tzoumas: Opening the Black Boxes in Data Flow Optimization. PVLDB 5(11):1256-1267 (2012)

[6]! Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Åke Larson, Ronnie Chaiken, Darren Shakib: SCOPE: parallel databases meet MapReduce. VLDB J. (VLDB) 21(5): 611-636 (2012)

(5)

Large Scale Data Mining with Iterative Data Flows

Sebastian Schelter – [email protected]

Ref. Code: BerlinGermany_EN_2414, Suggested starting date: May 16, 2013

During the last decade, the cost of acquiring and storing huge amounts of data has dropped significantly. As the technologies to process and analyze these datasets are being developed, we face previously unimaginable new possibilities. Scientists can test hypothe-ses on data several orders of magnitude larger than before, and formulate hypothehypothe-ses by exploring large data sets.

The basis for developing the technology to enable such analysis is rooted in the fields of database systems, information management, machine learning and distributed systems. While the first generation of parallel processing platforms has focused on aggregation and relational queries, there is a growing demand for applying more complex analytics. The tools to derive these come in the form of complex mathematical algorithms from the fields of machine learning, information extraction and graph mining. The majority of these algo-rithms are composed of applying mathematical functions iteratively to the data until con-vergence. Enormous amounts of training data are used to successively refine a mathe-matical prediction model.

It is desirable to handle iterative algorithms in general, multi-purpose distributed data flow systems. The Stratosphere system [1] is a database-inspired parallel processing stack for Big Data analytics. The system is part of a large-scale research project lead by our group at TU Berlin. We are currently investigating the efficient execution of iterative data flows [2]. These data flows can be composed of second-order functions, such as the commonly known map and reduce, but also of more advanced functions, e.g. for grouping and matching several inputs.

The goal of this project is to implement several data mining algorithms from the fields of social network analysis [3] and collaborative filtering [4] using Stratosphere’s iterative data flows. The implementations have to be benchmarked on several public datasets. Further-more we will investigate selected interesting aspects regarding the compilation, optimiza-tion and fault tolerance of the iterative data flows.

[1]! Alexander Alexandrov, Dominic Battré, Stephan Ewen, Max Heimel, Fabian Hueske, Odej Kao, Volker Markl, Erik Nijkamp, Daniel Warneke: Massively Parallel Data Analysis with PACTs on Nephele, VLDB, 2010

[2]! Stephan Ewen, Moritz Kaufmann, Kostas Tzoumas: Spinning Fast Iterative Data Flows, VLDB, 2012

[3]! Kang, U., Tsourakakis, C. E., & Faloutsos, C.: PEGASUS: A Peta-Scale Graph Min-ing System Implementation and Observations. ICDM, 2009

[4]! Yehuda Koren, Robert M. Bell, Chris Volinsky: Matrix Factorization Techniques for Recommender Systems. IEEE Computer, 2009

(6)

Conditions and Application

Terms:

•

Monthly scholarship of 650€

•

Allowance of 160€ for a Rail Pass

•

Health insurance as well as accident and personal liability insurance

•

Flights to and from Germany are not covered

•

Starting dates may be adjusted to you needs, but have to be between May 15 and early July

•

All projects have a duration of 2-3 months Requirements:

1. You are currently enrolled at a university/college in the United States, Canada or the UK as a full-time student in the field of biology, chemistry, physics, earth sciences or engi-neering (or a closely related field),

2. You are an undergraduate who will have completed at least two years of a degree pro-gram by the time of the placement

3. Proof that you will be registered as an undergraduate at your university/college for the academic year 2013/2014 (i.e., that you will still have undergraduate status upon your return to your home university).

4. VISA Requirements:

• British Citizens: no visa required

• U.S. and Canadian citizens: no visa required if you stay 90 days or less

• Others: please check with the German Federal Foreign (see www.auswaertiges-amt.de/EN/)

Application Process:

The full application process is described here: http://www.daad.de/rise/en/11638/index.html

•

Visit RISE website

•

http://www.daad.de/rise/en/index.html

•

Register at the website and log-in

•

Search the database for topics that suit your interests

•

Prepare and submit your application online

•

Required documents for your application: 1. Your completed application form

2. Full curriculum vitae/résumé

3. Up-to-date official university/college transcript(s)

4. As a supplement to this transcript, a list of courses you will have completed by the time the internship begins.

5. A letter of reference from a senior academic in your field of study from the university you are enrolled at

6. A cover letter in which you state your motivation to apply for this particular project (projects)

7. A certificate of enrollment

http://www.daad.de/imperia/md/content/rise/certificate_of_enrolment.pdf

(after having been accepted for a scholarship, you have to submit this document as hard copy as well)

Analysis Pipelines for Benchmarking Big Data Systems