Big Data Analytics
Zhian He, Mayuresh Kunjir, Harold Lim,
Eric Lo, Shivnath Babu, Meichun Hsu, and Malu Castellanos
The Hong Kong Polytechnic University and Duke University and HP labs
Abstract. We have been designing a benchmark for big data analytics. We only very recently came to know about the community effort and workshop series being organized on this problem. This extended abstract outlines discussion points that we would like to bring up before the big data benchmarking community.
“Benchmarks shape a field (for better or worse); they are how we
deter-mine the value of change.” (David Patterson, University of California
Berkeley, 1994).
If what David Patterson said is true, then the field of big data analytics is in serious need of benchmarks. The history of popular benchmarks like TPC-H indicate that a benchmark can address a number of different needs.
Need 1: System comparison for customers: Perhaps, the most obvious need
that a benchmark can address is enabling apples-to-apples performance compar-isons between competing systems such as parallel database systems Vs.
MapRe-duce systems [?]. Customers who are looking to deploy a big data analytics
system can use these comparisons to make informed decisions by better under-standing the advantages and disadvantages of each system.
Need 2: Application-level performance insights for system designers:System
designers and developers can use the benchmark to identify usability and per-formance issues in their systems from an end-to-end usage perspective instead of relying solely on microbenchmarks.
Need 3: Best practices for application developers: The variety of data and
processing needs in big data analytics can be confusing for application develop-ers. Application developers who are not fairly well versed in statistics, machine learning, programming, and system tuning may find it hard to develop scalable and efficient solutions. Reference implementations of a benchmark can help such developers understand best practices for designing and implementing big data analytics applications.
Need 4: Identifying new research problems:A well-designed benchmark may
capture the salient features of big data analytics in ways that can help researchers formulate new research problems.
We have been designing a benchmark for big data analytics. Appendix A gives a description of our work in progress on the benchmark design.
Our primary motivation for developing the benchmark came from Need 4 above. Because of lack of due diligence on our part, we were unaware of the efforts that are happening in the benchmarking space through the series of big data benchmarking workshops. In particular, we found that our efforts have some
strong similarities with theBigBench benchmark.
There are a few different ways in which we could proceed going forward. We have listed them below. We would like to attend the workshop in order to discuss with the community on what would be the best approach going forward.
Approach 1: Develop our benchmark as a potential competitor to BigBench: We could continue to develop our benchmark fully while targeting it specifically at Needs 1 and 2 like BigBench. Or, we could target our benchmark more at Needs 3 and 4. Some of these needs may partially conflict with each other. For example, the simplicity of TPC-H continues to make it more widely used for Needs 1 and 2 compared to the more realistic, but much more complex, TPC-DS.
Approach 2: Incorporate our ideas into BigBench: Compared to BigBench,
our design has some aspects like graph processing and data stream (or, contin-uous query) processing which are increasingly being seen in big data analytics. The similarity in design between BigBench and our benchmark will simplify this process.
Approach 3: Develop a benchmarking framework for the “no one size fits
all” nature of big data applications:Instead of creating a single one-size-fits-all
benchmark for big data analytics, it may be better to create a benchmarking framework that can instantiate different benchmarks. Each benchmark instanti-ated by the framework would be tailored to a specific set of data and workload requirements. For example, some enterprises may be grappling with increasing data volumes; with the variety and velocity of their data not being pressing con-cerns. Some other enterprises may be interested in benchmarking systems that can handle larger volumes and variety of data; but with the volume of unstruc-tured data dominating that of the strucunstruc-tured data by orders of magnitude. A third category of enterprises may be interested in understanding reference ar-chitectures for data analytics applications that need to deal with large rates of streaming data (i.e., large velocity of data). As part of our efforts on creating a benchmark, we have identified ways in which such a benchmarking framework can be created.
A
Design of Our Benchmark
Our benchmark separates the analytics into two types: (1) offline-analytics and (2) real-time analytics.
A.1 The data model
Customer TwitterID Mapping Item Customer Sales Promotion Followee/Follower Tweets Structured data Graph data Semi-Structured && Unstructured data TPC-DS Contributed by our benchmark
Fig. 1.Our Data Model
Figure 1 shows our data model. Sharing the same thoughts asBigBench[?],
we bring in semi-structured and unstructured data in addition to structured
data. Furthermore, we bring in graph data into our benchmark in which
Big-Benchhas not considered.
The data model reflects the analytics in a retail company that has carried out (offline-analytics) or carrying out (real-time analytics) a promotion on a number
of items and wants to analyze the promotion effectiveness. The structured data
is mostly adopted from the TPC-DS’sSalestable,Item table,Customertable,
andPromotiontable. The promotion effectiveness is measured by carrying out
a sentiment analysis on Tweets, which is represented bysemi-structured JSON
data type and containing unstructured text. A (simplified) example tweet in
JSON is as follows:
{
“created at”: “Thu Oct 21 16:02:46 +0000 2010”,
“id str”: “28039652140”,
“text”: “playing the latest angry bird, free for 24 hours — splendid new levels!”
“user”: {
“name”: “A. Hero.”,
“create at”: “Fri Oct 24 23:22:09 +0000 2008”
“id str”: “16958875”
}
“place”: null,
}
The relationship betweenItem table and theTweetdata is to associate the
promotion items with the items mentioned in the tweets. Furthermore, we no-tice that there is an increasing trend of using social network user accounts (e.g., Facebook accounts, Twitter accounts) as the log-in credential (e.g., see
SlideShare.net) and recent work such as TwitterRank[?] emphasizes that
qual-ity of the sentiment analysis can further be improved by taking into account of the existence of influential tweet senders. As a result, we also bring in the follower-followee graph (which is used to identify influential tweet users) in the data model and associate them with the customers.
A.2 Workload
We distinguish two types of workloads based on the data model: (1) Historical Query and (2) Continuous Query.
Join on itemKey
(itemKey, sales=quality*unitprice, soldDate)
Roll-up:
itemkey, sum(sales) as total_sales Group by itemKey
Filter:
Tweet contains itemName
Sentiment analysis
Join on itemKey
(twitterID, itemKey, sentiment_score, date)
Filter:
dateBegin<= date<=dateEnd
Join on TwitterID, itemKey (twitterID, itemKey, sentiment_score, influence_score)
Roll-up:
itemKey, sentiment_score*influence_score Group by itemKey
Join on itemKey
Run the TwitterRank algorithm to get the influence score for each user on each specific
Product (Graph Processing)
Lookup:
itemKey, dateBegin, dateEnd By promotionKey Sales Tweets Promotion Relevant Tweets UserInfluByItem (twitterID, itemKey, influence_score)
ReportSalesByItem (itemKey, total_sales)
RptSAbyItem (itemKey, total_sales, sentiment_score) Filter:
dateBegin <= soldDate <= dateEnd
Lookup: itemKey, itemName By itemKey 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Fig. 2.A historical workflow
Historical Query In this type of workload, data are refreshed at a rate as in traditional data warehouses (e.g., weeks).
We have a query Q1 that tries to associate the sales of a chosen set of items with their sentiment scores before a promotion. Query Q2 aims to report the association of the sales of the chosen set of items with their sentiment scores after a promotion. The workflow of Q2 is illustrated in Figure 2. Briefly, nodes 6 to 14 select the tweets that are relevant to the items in promotion and carry out a basic sentiment analysis on them. Nodes 19, 20, 15, and 16 further refine the sentiment results by incorporating the influential scores of the tweet users. Nodes 1 to 5 compute the sales of items within the promotion period. Finally, the association between the sales of the promotion items and their sentiment scores are joined (node 17) and a report (node 18) is generated.
Join on itemKey (itemKey, sales)
Roll-up:
itemkey, sum(sales) as total_sales Group by item_key
Filter:
Tweet contains itemName
Sentiment analysis
Join on itemKey
(twitterID, itemKey, sentiment_score)
Join on TwitterID, itemKey (twitterID, itemKey, sentiment_score, influence_score)
Roll-up:
itemKey, sentiment_score*influence_score Group by itemKey
Join on itemKey
Run the TwitterRank algorithm to get the influence score for each user on each specific
Product (Graph Processing)
Lookup: itemKey By promotionKey Sales Tweets Promotion Window:
tweetTime >= currentTime - 1hour
UserInfluByItem (twitterID, itemKey, influence_score)
LiveReportSalesByItem (itemKey, total_sales)
LiveReportSentimentAnalysisbyItem (itemKey, total_sales, sentiment_score)
Window:
soldTime >= currentTime – 1hour (itemKey, sales=quality*unitprice) Lookup: itemKey, itemName By itemKey Real-time Decision made by User Relevant Tweets
Fig. 3.A continuous query
Continuous Query The continuous query version of Q2, in which we name as RQ2 in our benchmark, is illustrated in Figure 3. Comparing with Q2 (Figure 2), RQ2 is a standing query that continuously crawls the latest tweets, but a one-hour window is applied to the data. At the same time, the sales of promotion
items are also continuously updated and a one-hour window-of-interest is applied to the data. RQ2 is to monitor the real-time effect of a short-term promotion (e.g., 1 day free download of the latest Angry Bird). It enables real-time decision making based on the instant update on the sales and the users’ feedback (e.g., extending the promotion to other regions immediately). Following up promotions