
Big Data Analysis Technology

Tobias Hardes (6687549) Email: Tobias.Hardes@autistici.org

Seminar: Cloud Computing and Big Data Analysis, L.079.08013

Summer semester 2013, University of Paderborn

Abstract—This paper introduces several Big Data processing techniques. Association rule mining is used to find relationships in large data sets; frequent pattern mining is an important technique for this. The Apriori algorithm is a classic algorithm to detect these frequent item sets and to generate rules from them. Compared to the Apriori algorithm, the FP-Growth algorithm needs fewer scans of the database to extract frequent item sets and performs the mining even faster.

Next to rule mining, cluster analysis is another technique for analyzing big data sets. Here the K-Means algorithm is the most common one; it minimizes the Euclidean distance between entities in the same cluster. Furthermore the K-Means++ algorithm is discussed, which provides an extension of K-Means. With the preprocessing of K-Means++, the K-Means algorithm converges much faster to a solution that is O(log k) competitive with the optimal K-Means solution. Finally Google's MapReduce is discussed. It is a parallel and distributed computing framework for processing Big Data on a cluster with hundreds or thousands of computers.

Keywords—Big Data, Cloud Computing, Association rule mining, Apriori algorithm, FP-growth algorithm, frequent pattern, K-means, K-means++, MapReduce framework

I. INTRODUCTION

Business Intelligence has long been an important topic for decision making and reporting in companies, but the field has changed. Today it is possible to buy storage units for less than $600 that can store all of the world's music [?]. Storing data, working with it, and using the results for further analysis becomes ever more accessible, driven by trends such as Moore's Law [?].

Big Data is a term describing data with three main characteristics. First, it involves a great volume of data. Second, the data cannot be structured into regular database tables because of its variety, and third, the data is produced with great velocity and must be captured and processed rapidly [?]. Other literature adds one additional keyword, veracity [?]. Big Data analysis is needed in order to discover trends or hidden patterns across such data.

In addition to transactional data, there is data from the web, e.g. from social networks or sensor networks. Looking more closely at changes in our environment during the last years, we should also mention radio-frequency identification (RFID) tags, geodata (GPS), data from satellites, and medical applications, which all create a variety of data. Because of this high volume, it is not possible to insert this data into traditional databases, which already contain terabytes or petabytes of data. Furthermore, the use of sensors often leads to a high velocity of new data.

New information is created every second and may have to be processed in real time. This also creates a challenge of veracity, because there can be wrong values which have to be detected.

The main purpose of this paper is to present different techniques for processing Big Data.

The remainder of this paper is organized as follows. Section II discusses some issues from the research at CERN and shows some problems related to Big Data. Section III discusses advanced and related techniques for Big Data. Section IV discusses techniques from the area of Data Mining, presenting the most important algorithms and methods to analyze large data sets. Finally, Section V concludes the aspects described in this paper.

II. BACKGROUND

In July 2012, the Large Hadron Collider (LHC) experiments ATLAS and CMS announced that they had each observed a new particle in the mass region around 126 GeV (gigaelectronvolts; one electronvolt is a unit of energy equal to approximately 1.6 × 10⁻¹⁹ joule), now known as the Higgs boson [?].

The LHC is the world's largest and most powerful particle accelerator. To find the Higgs boson, particles collide within the LHC approximately 600 million times per second. Each collision generates particles that often decay in complex ways into even more particles. Sensors record the passage of each particle through a detector as a series of electronic signals and send the data to the CERN Data Centre for digital reconstruction. Physicists must sift through the approximately 15 petabytes of data produced annually to determine whether the collisions have thrown up any interesting physics [?].

The Worldwide LHC Computing Grid (WLCG) was launched in 2002 to address the lack of computational resources. The grid links thousands of computers in 140 computing centers in 45 countries. To date, the WLCG is the largest computing grid in the world [?] and it runs more than one million jobs per day. At peak rates, 10 gigabytes of data may be transferred from its servers every second [?], [?].

Similar scenarios exist in the medical area, in government, and in the private sector (e.g. Amazon.com, Facebook). Dealing with large data sets becomes more and more important, because an accurate data basis is needed to face the problems of these areas. The challenge is to perform complex analysis algorithms on Big Data in order to generate or find new knowledge that the data contains. This new knowledge can be used to address important, time-critical, and everyday problems, such as the research at CERN, the analysis of the stock market, or the analysis of company data gathered over a long period of time.

A. Just in time analysis

Often it is necessary to analyze data just in time, because it has temporal significance (e.g. stock market or sensor networks). For this it might be necessary to analyze data for a certain time period e.g. the last 30 seconds.

Such requirements can be addressed through the use of Data Stream Management Systems (DSMS). The corresponding software tools are called Complex Event Processing (CEP) engines, and the queries are written in declarative languages such as the Event Processing Language (EPL) used by Esper [?]. The syntax is similar to SQL for databases.

Listing 1. Sample EPL command

select avg(price)
from StockTickEvent.win:time(30 sec)

As shown in Listing 1, an EPL supports windows over data streams, which can be used to buffer events during a defined time. This technique is used for data stream analysis. The window can move or slide in time. Figure 1 shows the difference between the two variants: if a window moves by as much as the window size, it is called a tumbling window; the other type, the sliding window, slides in time and buffers the last x elements.

Fig. 1. Windows - Based on [?]
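To make the window semantics concrete, the following minimal Python sketch (an illustration only, not Esper code; the class and method names are invented) keeps a 30-second sliding window over (timestamp, price) events and exposes the running average, mimicking the query of Listing 1:

from collections import deque

class SlidingTimeWindow:
    """Buffer events for a fixed time span and expose an aggregate over them."""

    def __init__(self, span_seconds=30.0):
        self.span = span_seconds
        self.events = deque()                      # (timestamp, price), oldest first

    def insert(self, timestamp, price):
        self.events.append((timestamp, price))
        # Drop events that have fallen out of the time window.
        while self.events and timestamp - self.events[0][0] > self.span:
            self.events.popleft()

    def avg_price(self):
        if not self.events:
            return None
        return sum(price for _, price in self.events) / len(self.events)

# Usage, mimicking "select avg(price) from StockTickEvent.win:time(30 sec)":
window = SlidingTimeWindow(30.0)
for t, price in [(0.0, 10.0), (10.0, 12.0), (45.0, 8.0)]:
    window.insert(t, price)
    print(t, window.avg_price())

A tumbling window would instead empty the buffer completely every 30 seconds and report one aggregate per interval.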

The size of the window can be set by the user within the select command. Note that smaller windows contain less data to compare, so they often result in high false positive rates; using a larger window can compensate for this effect [?]. When using a CEP engine, the computation happens in near real time. Because of this, the complexity of the analysis cannot be very high, so it might not be possible to detect the Higgs boson using data stream analysis alone; the analysis is too complex, and further mining algorithms on the full data set are required. Looking at the Twitter service, it might be possible to use CEP to analyze the stream of tweets to detect emotional states or something similar. This example shows some additional aspects of Big Data analysis with respect to fault tolerance. Consider the following tweet:

After a whole 5 hours away from work, I get to go back again, I’m so lucky!

This tweet contains sarcasm, and it is a complex task to detect this. One way to solve this problem is to collect training data and apply it to an appropriate learning algorithm [?].

B. Rule mining and clustering

A more advanced technique is called Association rule mining.

Association rule mining searches for relationships between items in data sets, but it can also be implemented for analyzing streams. For this there are association rules, which are used by algorithms to find associations, correlations or causal structures. Chapters IV-A2 and IV-A3 discuss this technique in more detail.

Cluster analysis is the task of classifying similar objects into groups or classes. The classes are defined during the clustering; it is not a mapping to already pre-defined classes [?]. Today, cluster analysis techniques are used in different areas, such as data compression, process monitoring, DNA analysis, finance, radar scanning and further research areas. In all of these areas huge amounts of data are stored. A clustering algorithm can be hierarchical or partitional [?].

III. RELATED WORK

Big Data became a very important topic during the last years. It is still an active area of research, but it has also reached business. Various companies work in the field of Big Data, whether in research or consulting: McKinsey & Company [?], IBM Corporation [?], Fujitsu Limited [?], Intel Corporation [?] and many more.

This chapter discusses some further developments to improve the algorithms and methods that are usually used to process Big Data. In addition, alternative solutions are presented that can be used for analyzing or processing Big Data.

A. Basic principles

There are different techniques which allow the analysis of large data sets. Not all techniques strictly require the use of Big Data, but they can apply to this area.

Most algorithms for large item sets are related to the Apriori algorithm that is discussed in Chapter IV-A2. All algorithms and methods are usually based on the same basic principles: statistics, probability theory and machine learning, in the sense of a system that can learn from data. The largest such area is Data Mining. Data Mining can be described as a set of techniques which can be used to extract patterns from small as well as big data sets [?]. In the field of Data Mining the most important techniques are association rule learning, cluster analysis and classification.


B. Improvements

Today the performance of such algorithms becomes more and more important, since analyzing large data sets results in high computational costs. In recent years, much progress has been made in the following directions [?]:

• Selection of the start parameters for the algorithms
• Reducing the number of passes over the database
• Sampling the database
• Adding extra constraints on the structure of patterns
• Parallelization

Chapter IV-B1 discusses the K-means method, which is used for clustering. The classic implementation has a worst-case runtime of 2^Ω(n) [?]. In [?], Arthur et al. published an alternative implementation which leads to a lower runtime by improving the start parameters of the algorithm. Other implementations pursue an optimization approach to reduce the runtime.

Many variations of the Apriori algorithm follow the same directions. These variations are discussed in [?]. Many of them are concerned with the generation of the temporary candidate data set, because this part has the greatest computational cost.

In [?], Yang Kai et al. discuss a new algorithm called FA-DMFI that is used for discovering maximum frequent item sets. For this, the algorithm stores association information by scanning the database only once; the temporary data set is stored in an association matrix. In summary, the efficiency is achieved in two steps:

1) The huge database is compressed into a smaller data structure, which avoids repeated and costly database scans.

2) The maximum frequent item set is ultimately generated by means of cover relations in this information matrix, and the costly generation of a large number of candidates is avoided.

With these steps the I/O time and CPU time are reduced, and the generation of the candidate item sets and the frequent item sets can be done in less time. The Apriori algorithm, which is discussed in Chapter IV-A2, spends a lot of time generating the frequent item sets and needs to scan the data source multiple times. Using some implementation details of the FA-DMFI algorithm could also improve the runtime of the Apriori algorithm.

C. Distributed computing

Often the computational tasks are very complex, and the data needed for an individual task amounts to several terabytes or even more. It is therefore not possible to compute these tasks on a single machine, because there are not enough hardware resources. These are the reasons why the parallelization of computational tasks becomes more and more important.

MapReduce is a distributed computing technology that was developed by Google. The programming model allows users to implement custom mapper and reducer functions programmatically and to run batch processes on hundreds or thousands of computers. Chapter IV-C discusses this method in more detail. Based on this technology, Google developed a web service called Google BigQuery which can be used for analyzing large data sets. BigQuery is based on Dremel, a query service that allows running SQL-like queries against very large data sets and getting accurate results in seconds [?]. So the data set can be accessed and queried, but BigQuery is not meant to execute complex mining. Usually the queries are executed with Microsoft Office or the Google Chrome browser. Figure 2 gives an example of the BigQuery workflow.

Fig. 2. Google BigQuery workflow: upload the data set to Google Storage, import the data into tables, then run queries to process it.

With this approach Google tries to offer a suitable system for OLAP (Online Analytical Processing) or BI (Business Intelligence) on Big Data. Here most queries are quite simple, consisting of simple aggregations or filtering on columns. Because of this, BigQuery is not a programming model for deep analysis of Big Data. It is a query service with functionality to perform a quick analysis, like aggregation of data, and there is no possibility to run user code [?].

D. Big Data and Cloud Computing

In [?], Wang et al. present a development platform based on virtual clusters. Virtual clusters are a kind of virtual machine environment provided by cloud computing. The platform presented in [?] provides techniques for parallel data mining based on these virtual clusters. To measure the performance of the system, Wang et al. used a parallelized PSO (Particle Swarm Optimization) algorithm. The platform reduces the computing time of the swarm and improves the cluster quality.

To exploit the advantages of parallel processing, a corresponding system architecture is required.


IV. MAIN APPROACHES

This section analyses and describes the implementation details and practical issues of Big Data analysis techniques. The focus is on algorithms for association rule mining and cluster analysis. The last part discusses a technique for the distributed and parallel analysis of large data sets.

Subsection IV-A2 discusses the Apriori algorithm, Subsection IV-A3 the Frequent Pattern (FP)-Growth algorithm, and Subsection IV-B1 the K-means and K-means++ algorithms. Finally, Subsection IV-C discusses the MapReduce framework.

A. Association rule mining

Association rule mining can be used to identify items that are related to other items. A concrete everyday example is the analysis of baskets in an online shop or in a supermarket. Because each customer purchases different combinations of products at different times, there are questions that can be addressed with association rule mining:

• Why do customers buy certain products?
• Who are the customers? (Students, families, ...)
• Which products might benefit from advertising?

The calculated rules can be used to restructure the market or to send special promotions to certain types of customers. This chapter discusses the Apriori and the FP-Growth algorithm. Both algorithms often provide the basis for further specialized algorithms.

A common strategy, used by association rule mining algorithms, is to split the problem into two tasks:

1) Generation of frequent item sets: find the item sets that satisfy a minimum support, which is usually defined by the user.

2) Generation of rules: find high-confidence rules using the frequent item sets.

The Apriori algorithm (see Chapter IV-A2) and the FP-Growth algorithm (see Chapter IV-A3) are both based on this principle.

1) Terminology: The following terminology has been taken from [?], [?], [?]. The definitions were adapted in order to discuss the algorithms of this chapter more accurately. Let S be a stream or a database with n elements. Then I = {i_1, i_2, ..., i_n} is an item set with n items; an item set could, for example, be a basket in a supermarket containing n elements. Furthermore, an association rule is of the form A ⇒ B, where A ⊂ I, B ⊂ I and A ∩ B = ∅. Rule mining can thus extract a trend such as "when stocks of credit institutes go down, companies for banking equipment follow". There are no restrictions on the number of items which are used for such a rule [?].

The Support of an item set is the percentage of records which contain the specific item set. Assume that φ is the frequency of occurrence of an item set. Then the support of an item set A is:

sup(A) = φ(A) / |S|

The Confidence is calculated as the relative percentage of records which contain both A and B. It determines how frequently items in B appear in a transaction that contains A:

conf(A ⇒ B) = Support(A ∪ B) / Support(A)

Additionally, the frequency of an item set is defined as the number of transactions in S that contain the item set I. L_k denotes the set of frequent item sets of length k [?].

The worked example at the end of Section IV-A2 (Table I) shows an everyday situation where association rule mining is used.
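The definitions above translate directly into code. The following Python sketch (hypothetical helper functions for illustration; transactions are given as plain Python sets) computes support and confidence exactly as defined:

def support(itemset, transactions):
    """sup(A) = phi(A) / |S|: fraction of transactions that contain all items of A."""
    itemset = set(itemset)
    phi = sum(1 for t in transactions if itemset <= set(t))
    return phi / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(A => B) = sup(A u B) / sup(A)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

# Small usage example:
transactions = [{"bread", "butter"}, {"bread"}, {"butter", "milk"}]
print(support({"bread"}, transactions))                 # 2/3
print(confidence({"bread"}, {"butter"}, transactions))  # 0.5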

2) Apriori algorithm: This section discusses the Apriori algorithm, one of the most common algorithms for association rule mining. The Apriori algorithm uses prior knowledge to mine frequent item sets in a database or stream. It is a level-wise, iterative search algorithm, where the frequent item sets of size k are used to explore the item sets of size k + 1. The algorithm is based on a simple principle:

If an item set is frequent, then all subsets of this item set are also frequent.

To use the algorithm, the user specifies the minimum support (min_sup), the data source (S) and the minimum confidence (min_conf) (see Chapter IV-A1 for terminology). The first task is to extract the frequent item sets.

Extract frequent item sets: Before the iteration starts, the algorithm scans the database to count the occurrences of each item; only items that satisfy the min_sup input are kept. Figure 4 shows the pseudocode of the Apriori algorithm, and Figure 3 shows the corresponding UML (Unified Modeling Language) activity diagram.

Fig. 3. Activity diagram - Apriori frequent item set generation - Notation: [?]


1: procedure APRIORI_FREQUENT_ITEMSETS(min_sup, S)
2:   L_1 ← {frequent 1-item sets}
3:   for k = 2; L_{k-1} ≠ ∅; k++ do
4:     C_k = aprioriGen(L_{k-1})            ▷ Create the candidates
5:     for each c ∈ C_k do
6:       c.count ← 0
7:     end for
8:     for each I ∈ S do
9:       C_r ← subset(C_k, I)               ▷ Identify candidates that belong to I
10:      for each c ∈ C_r do
11:        c.count++                        ▷ Count the support values
12:      end for
13:    end for
14:    if c.count ≥ min_sup then
15:      L_k = L_k ∪ c
16:    end if
17:  end for
18:  return L_k
19: end procedure

Fig. 4. Apriori algorithm - Based on [?], [?]

The algorithm outputs a set of item sets I with support(I) ≥ min_sup.

The function aprioriGen in line 4 is the part where the set of candidates is created. It takes L_{k-1} as an argument and is performed in two steps:

1) Join: generation of a new candidate item set C_k.

2) Prune: elimination of item sets with support(I_j) < min_sup.

For this it is assumed that the items are ordered, e.g. alphabetically.

Assume that L_{k-1} = {{a_1, a_2, ..., a_{k-1}}, {b_1, b_2, ..., b_{k-1}}, ..., {n_1, n_2, ..., n_{k-1}}} contains distinct item sets with a_i ≤ b_i ≤ ... ≤ n_i for 1 ≤ i ≤ k − 1. The join step then joins all (k−1)-item sets that differ only in their last element.

The following SQL statement shows a way to select those items:

select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

Lines 8 to 13 (Figure 4) are used for the support counting. This is done by comparing each transaction I with the candidates C_k (line 9) and incrementing the support counts. This is a very expensive task, especially if the number of transactions and candidates is very large.

The result of the join is added to the set C_k, which is then used to proceed with the prune step. In the prune step, the set L_k is generated from the candidates in C_k by pruning those combinations of items which cannot satisfy the min_sup value (line 14). After the for-loop in line 3 terminates, the algorithm outputs the set of item sets whose support is at least min_sup [?], [?], [?].

With this it is possible to generate association rules.

Generation of association rules: The generation of the association rules is similar to the generation of the frequent item sets and is based on the same principle. Figures 5 and 6 show the pseudocode of the rule-generation part of the Apriori algorithm.

1: procedure APRIORI_RULEGEN(min_conf, S)
2:   for each item set L_k do
3:     H_1 = {i | i ∈ L_k}          ▷ Rules with a 1-item consequent
4:     GENRULES(L_k, H_1, min_conf)
5:   end for
6: end procedure

Fig. 5. Apriori rule generation - Based on [?]

1: procedure GENRULES(L_k, H_m, min_conf)
2:   if k > m + 1 then
3:     H_{m+1} = aprioriGen(H_m)
4:     for each h_{m+1} ∈ H_{m+1} do
5:       conf = Support(L_k) / Support(L_k − h_{m+1})
6:       if conf ≥ min_conf then
7:         output (L_k − h_{m+1}) → h_{m+1}
8:       else
9:         H_{m+1} ← H_{m+1} \ h_{m+1}
10:      end if
11:    end for
12:    GENRULES(L_k, H_{m+1}, min_conf)
13:  end if
14: end procedure

Fig. 6. Apriori Genrules - Based on [?]

First, all confidence rules are extracted using the frequent item sets. These rules are used to generate new candidate rules.

For example, if the rules {a, b, c} → {d} and {a, f, c} → {g} both fit the minimum confidence, then there will be a candidate rule {a, c} → {d, g}, created by merging the two rules. The new rule has to satisfy the min_conf value specified by the user.

The generation of such rules is done iteratively. The function aprioriGen (Figure 6, line 3) is used again to generate candidates for new rules. Then the set of candidates is checked against the min_conf value (Figure 6, line 6). Here the same reasoning as in the generation of the frequent item sets is used: if a new rule does not satisfy the minimum confidence, the rules derived from it will not satisfy it either, so the rule can be removed (Figure 6, line 9).
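The rule-generation phase can be sketched in the same style. The snippet below is only an illustration: it enumerates all possible consequents of a frequent item set directly instead of using the aprioriGen pruning of Figure 6, and it assumes a dictionary that maps frequent item sets (for example the output of the sketch above) to their absolute support counts:

from itertools import combinations

def generate_rules(frequent, min_conf):
    """frequent: dict mapping frozenset item sets to support counts.
    Returns (antecedent, consequent, confidence) triples with confidence >= min_conf."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        # Try every non-empty proper subset of the item set as a consequent.
        for r in range(1, len(itemset)):
            for consequent in combinations(itemset, r):
                consequent = frozenset(consequent)
                antecedent = itemset - consequent
                conf = count / frequent[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, consequent, conf))
    return rules

# Minimal usage with hand-written support counts over 5 transactions:
frequent = {
    frozenset({"Bread"}): 2,
    frozenset({"Butter"}): 3,
    frozenset({"Bread", "Butter"}): 2,
}
for a, c, conf in generate_rules(frequent, min_conf=0.7):
    print(set(a), "=>", set(c), f"confidence={conf:.0%}")   # {'Bread'} => {'Butter'} confidence=100%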

Example - Association rule mining: Usually, algorithms for association rule mining are used for basket analysis in online stores. To give an example, assume the transactions shown in Table I.


Transaction   Items
T1            {Coffee, Pasta, Milk}
T2            {Pasta, Milk}
T3            {Bread, Butter}
T4            {Coffee, Milk, Butter}
T5            {Milk, Bread, Butter}

TABLE I. TRANSACTIONS

Assume the rule Bread ⇒ Butter. To compute the confidence for this rule, the support of the item set is needed first:

sup(Bread ⇒ Butter) = φ({Bread, Butter}) / |S| = 2/5 = 40%

Then the confidence for this rule is computed as follows:

conf(Bread ⇒ Butter) = Support(Bread ∪ Butter) / Support(Bread) = 40% / 40% = 100%

So the rule Bread ⇒ Butter has a confidence of 100%: if there is bread in a basket, the probability that the same basket also contains butter is 100%.

Now assume the rule Butter ⇒ Milk. Again the support value has to be calculated:

sup(Butter ⇒ Milk) = φ({Butter, Milk}) / |S| = 2/5 = 40%

The calculation of the confidence is done as above:

conf(Butter ⇒ Milk) = Support(Butter ∪ Milk) / Support(Butter) = 40% / 60% ≈ 66.7%

So the rule Butter ⇒ Milk has a confidence of about 67%: if there is butter in a basket, the probability that the same basket also contains milk is about 67%.

Using such rules to identify trends can be critical, because a trend might not hold for long. If the streaming data produces many rules, the natural reaction of the data analysts is to increase the support and confidence thresholds to obtain fewer rules with stronger confidence, but by the end of that period the rules may already be outdated. An example is the sale of ice cream during the hottest month of the summer: after the trend is gone, there is no sales opportunity any more. Another example is the stock market, which was already mentioned in Chapter II.

3) Frequent Pattern (FP)-Growth algorithm: The Frequent Pattern (FP)-Growth method is used with databases rather than streams. The Apriori algorithm needs n + 1 scans of the database, where n is the length of the longest pattern. By using the FP-Growth method, the number of scans of the entire database can be reduced to two.

The algorithm extracts frequent item sets that can be used to derive association rules. This is done using the support of an item set. The terminology used for this algorithm is described in Chapter IV-A1.

The main idea of the algorithm is to use a divide and conquer strategy:

Compress the database which provides the frequent sets; then divide this compressed database into a set of conditional databases, each associated with a frequent set, and apply data mining on each such database [?].

To compress the data source, a special data structure called the FP-Tree is needed [?]. The tree is used for the data mining part.

Finally, the algorithm works in two steps:

1) Construction of the FP-Tree
2) Extraction of the frequent item sets

a) Construction of the FP-Tree: The FP-Tree is a compressed representation of the input. While reading the data source, each transaction t is mapped to a path in the FP-Tree. As different transactions can have several items in common, their paths may overlap, which makes it possible to compress the structure. Figure 7 shows an example of the generation of an FP-tree from 10 transactions. Related to the worked example in Section IV-A2, an item like a, b, c or d could be an item of a basket, e.g. a product purchased in a supermarket.

TID   Items
1     {a, b}
2     {b, c, d}
3     {a, c, d, e}
4     {a, d, e}
5     {a, b, c}
6     {a, b, c, d}
7     {a}
8     {a, b, c}
9     {a, b, d}
10    {b, c, e}

Fig. 7. Construction of an FP-tree, shown (i) after reading TID = 1, (ii) after reading TID = 2, (iii) after reading TID = 3 and (iv) after reading TID = 10 - Based on [?]

The FP-Tree is generated in a simple way.

First a transaction t is read from the database. The algorithm checks whether the prefix of t maps to a path in the FP-Tree. If this is the case, the support counts of the corresponding nodes in the tree are incremented. If there is no overlapping path, new nodes are created with a support count of 1. Figure 8 shows the corresponding UML (Unified Modeling Language) activity diagram. Additionally, an FP-Tree uses pointers connecting nodes that contain the same item, creating a singly linked list per item.

These pointers, represented as dashed lines in Figures 7, 9 and 10, are used to access the individual items in the tree even faster. The FP-Tree is then used to extract frequent item sets directly from this structure. Each node in the tree contains the label of an item along with a counter that shows the number of transactions mapped onto the given path. In the best-case scenario there is only a single path, because all transactions have the same set of items.


Fig. 8. Activity diagram - Construction of the FP-Tree - Notation: [?]

A worst-case scenario would be a data source where every transaction has a unique set of items. Usually the FP-tree is smaller than the uncompressed data, because many transactions share items. As already mentioned, the algorithm has to scan the data source twice.

Pass 1

The data set is scanned to determine the support of each item. The infrequent items are discarded and not used in the FP-Tree. All frequent items are ordered based on their support.

Pass 2

The algorithm does the second pass over the data to construct the FP-tree.

The following example shows how the algorithm works. According to Figure 7 the first transaction is {a, b}.

1) Because the tree is empty, two nodes a and b with counter 1 are created and the path null → a → b is created.

2) After {b, c, d} was read, three new nodes b, c and d have to be created. The value for count is 1 and a new path null → b → c → d is created. Because the value b was already in transaction one, there is a new pointer between the b’s (dashed lines).

3) The transaction {a, c, d, e} overlaps with transaction one, because of the a in the first place. The frequency count for a will be incremented by 1. Additional pointers between the c’s and d’s are added.

After all transactions have been scanned, the full FP-Tree is created (Figure 7 (iv)). Now the FP-Growth algorithm uses the tree to extract the frequent item sets.
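A minimal Python sketch of the construction phase is given below (illustrative only; pass 1 determines the frequent items and their support order, pass 2 inserts the transactions, and the node links are kept in a simple header dictionary instead of the dashed pointers of Figure 7):

class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}                     # item -> FPNode

def build_fp_tree(transactions, min_sup):
    """Two passes: count item supports, then insert every transaction as a path."""
    # Pass 1: determine the frequent items and their support.
    support = {}
    for t in transactions:
        for item in t:
            support[item] = support.get(item, 0) + 1
    frequent = {i: s for i, s in support.items() if s >= min_sup}

    root = FPNode(None, None)
    header = {}                                # item -> list of nodes (node links)
    # Pass 2: insert the transactions, items ordered by decreasing support.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            node = node.children[item]
            node.count += 1
    return root, header

# Usage with the ten transactions of Figure 7 and min_sup = 2:
transactions = [{'a','b'}, {'b','c','d'}, {'a','c','d','e'}, {'a','d','e'},
                {'a','b','c'}, {'a','b','c','d'}, {'a'}, {'a','b','c'},
                {'a','b','d'}, {'b','c','e'}]
root, header = build_fp_tree(transactions, min_sup=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})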

b) Extract frequent item sets: A bottom-up strategy starts with the leaves and moves up to the root using a divide and conquer strategy. Because every transaction is mapped to a path in the FP-Tree, it is possible to mine frequent item sets ending in a particular item, for example e or d. According to Figure 9, the algorithm first searches for frequent item sets ending with e and then with d, c, b and a, until the root is reached. Using the pointers, each of these paths can be accessed very efficiently by following the list. Furthermore, each path of the tree can be processed recursively to extract the frequent item sets, so the problem can be divided into smaller subproblems whose solutions are merged at the end. This strategy also allows the algorithm to be executed in parallel on multiple machines [?].

Fig. 9. Subproblems - Based on [?]

The FP-Growth algorithm finds all item sets ending with a specified suffix using the divide and conquer strategy. Assume the algorithm analyzes item sets ending with e. To do so, the item set {e} first has to be frequent. This can be checked using the prefix paths ending in e (Figure 10 (a)). If it is frequent, the algorithm has to solve the subproblems of finding frequent item sets ending in de, ce, be and ae. These subproblems are solved using the conditional FP-Tree (Figure 10 (b)).

The following example (Figure 10) shows how the algorithm solves the subproblems with the task of finding frequent item sets ending with e [?]. Assume the minimum support is set to two.

1) The first step is to collect the initial prefix path (Figure 10 a)

2) From this prefix path the support count of e is calculated by adding up the support counts of all nodes labeled e. In the example the support count is 3.

3) Because 3 ≥ 2 = min_sup, the algorithm has to solve the subproblems of finding frequent item sets ending with de, be, ce and ae. To solve the subproblems, the prefix path has to be converted into a conditional FP-Tree (Figure 10 (b)). This tree is used to find frequent item sets ending with a specific suffix.

a) Update the support counts along the prefix paths, because some of the counted transactions do not contain e. Consider the rightmost path of the tree: null → b:2 → c:2 → e:1. This path includes the transaction {b, c}, which does not contain the item e. Because of this, the count along this prefix path has to be set to 1.

b) The node e has to be removed, because the support counts have been updated to reflect only transactions that contain e. The subproblems of finding item sets ending in de, ce, be and ae no longer need information about the node e.

c) Because the support counts were updated, there might be some prefix paths that are no longer frequent. According to Figure 10 (a) the node b appears only once with support equal to 1. It follows that there is only one transaction that contains b and e. So be is not frequent and can be ignored.

4) The conditional FP-tree in Figure 10 (b) is used to solve the subproblems of finding frequent item sets ending with de, ce and ae. Consider the subproblem of finding frequent item sets ending with de. A new prefix-path tree is needed (Figure 10 (c)). After the frequency counts for d have been updated, the support count for {d, e} is equal to 2, which satisfies the defined conditions. Next, the conditional FP-tree for de is constructed using the method from step 3. Figure 10 (d) shows the conditional FP-tree for de. It contains only the single item a with a support of 2, which also satisfies the conditions. The algorithm extracts the item set {a, d, e} and this subproblem is completely processed.

5) The algorithm starts with the next subproblem.

Fig. 10. Example of FP-Growth algorithm - Based on [?]

This example demonstrates that the runtime depends on the compression of the data set.

The FP-Growth algorithm has some advantages compared to the Apriori algorithm:

• Compressed data structure.

• No expensive computation of candidates.
• Use of a divide and conquer strategy.
• Scanning of the data source only twice.

It can be assumed that the FP-Growth algorithm is more efficient. The next chapter, IV-A4, compares the Apriori algorithm and the FP-Growth algorithm.

4) Efficiency: To compare both algorithms, two databases with 10,000 and 100,000 records are used. Figure 11 shows the total runtime. The FP-growth method is faster than the Apriori algorithm because the FP algorithm scans the database only twice.

Fig. 11. Total runtime [?]

Fig. 12. Memory usage [?]

From figure 12 we can see that the FP-growth algorithm uses less memory than the Apriori algorithm, because the FP method tries to compress the structure [?].
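In practice, both algorithms are available in open-source libraries, which makes such a comparison easy to reproduce on one's own data. The sketch below assumes the third-party Python library mlxtend (version 0.17 or newer) and pandas are installed; it one-hot encodes the transactions of Table I and runs the library's apriori and fpgrowth implementations with the same minimum support. The exact API may differ between library versions:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth

transactions = [['Coffee', 'Pasta', 'Milk'],
                ['Pasta', 'Milk'],
                ['Bread', 'Butter'],
                ['Coffee', 'Milk', 'Butter'],
                ['Milk', 'Bread', 'Butter']]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Same minimum support, two different mining algorithms; the frequent item sets are identical.
print(apriori(df, min_support=0.4, use_colnames=True))
print(fpgrowth(df, min_support=0.4, use_colnames=True))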

B. Clustering

Cluster analysis is the task of classifying similar objects into groups or classes; the classes are defined during the clustering itself.

One area of use is marketing: clustering can discover groups among all customers, which can then be used for targeted marketing. Cluster analysis is a very important area in data mining. This chapter discusses the K-Means algorithm and a variation called the K-Means++ algorithm. Although the first version of the K-Means algorithm was published in 1964, it is still one of the most important methods of cluster analysis and the basis for further algorithms that often improve the accuracy or the runtime.


1) K-means algorithm: Clustering is the task of dividing data into groups of similar objects. This task can be solved in various ways and there is no specific algorithm to solve this task.

There are two main methods of clustering:

1) Hierarchical: one cluster is built at the beginning; iteratively, points are added to existing clusters or a new cluster is created.

2) Centroid based: a cluster is represented by a central point, which need not be part of the data set.

The K-means algorithm is one of the most common techniques used for clustering. It is a kind of learning algorithm and it is centroid based. The goal of the K-Means algorithm is to find the best division of an arbitrary data set into k classes or clusters, such that the distance between the members of a cluster and its corresponding centroid is minimized. Various metrics can be used to calculate the distance. In the standard implementation of the K-means method, a partition of a data set with n entities into k sets is sought that minimizes the within-cluster sum of squares (WCSS):

WCSS = Σ_{j=1}^{k} Σ_{i=1}^{n} || x_i^(j) − c_j ||²

The expression || x_i^(j) − c_j || describes the Euclidean distance between an entity and its cluster's centroid. This implementation is also known as Lloyd's algorithm. It provides a solution that can be trapped in a local minimum, and there is no guarantee that it corresponds to the best possible solution. The accuracy and running time of K-means clustering depend heavily on the position of the initial cluster centers [?].

The initial K-means implementation requires three parameters: the number of clusters k, a distance metric, and the cluster initialization.

There are different implementations of the K-means method. The classic K-means clustering algorithm just takes the parameter k: it uses the Euclidean distance and initializes the clusters with the first items of the input. The algorithm creates k clusters by assigning each data point to its closest cluster mean using the distance metric. Using all data of a cluster, the new mean is calculated, and based on the new means the data points are reassigned. This loop continues until the maximum number of iterations is reached or the newly calculated means do not move any more. In fact it is a heuristic algorithm. The basic K-means algorithm is as follows [?]:

1) Select k entities as the initial centroids.
2) (Re)assign all entities to their closest centroids.
3) Recompute the centroid of each newly assembled cluster.
4) Repeat steps 2 and 3 until the centroids do not change or until the maximum number of iterations is reached (a sketch in code follows below).
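A minimal Python sketch of these four steps is given below (an illustration assuming NumPy; it uses the Euclidean distance and, like the classic variant described above, takes the first k entities as initial centroids):

import numpy as np

def k_means(data, k, max_iter=100):
    """Lloyd's algorithm: data is an (n, d) array; returns (centroids, labels)."""
    data = np.asarray(data, dtype=float)
    centroids = data[:k].copy()                          # step 1: initial centroids
    for _ in range(max_iter):
        # Step 2: assign every entity to its closest centroid.
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Usage on a toy two-dimensional data set:
points = [[1, 1], [1.5, 2], [8, 8], [8, 9], [0.5, 1.5], [9, 8.5]]
centroids, labels = k_means(points, k=2)
print(centroids, labels)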

Figure 13 shows an exemplary result of a K-means execution; the purple x represents the centroid of a cluster.

Fig. 13. Exemplary K-means result

2) K-means++ algorithm: The K-means algorithm converges to an optimum without any guarantee that it is the global one. There may be a large number of local optima, depending on the total number of entities n, the number of clusters k and the original layout of the entities.

Because of this, the choice of the starting values for the algorithm is critical, since different parameters can lead to different optima [?], [?]. According to Aloise et al. [?], the K-means clustering problem is NP-hard in Euclidean space even for k = 2, and the algorithm has a worst-case runtime of 2^Ω(n). The K-means algorithm is the starting point for many further algorithms. Because the decision version of the K-means problem is NP-hard, there are additional, slightly modified implementations [?], [?]. A common way to handle NP-hard problems is to compute an approximate solution.

Based on K-means, Arthur and Vassilvitskii described the serial K-means++ method [?] as an approximation algorithm for K-means that improves the theoretical runtime. The K-means++ method selects the first centroid at random. Each subsequent centroid is selected randomly, with a probability proportional to its squared distance from the closest centroid that has already been chosen. Let D(x) denote the shortest distance from an entity x of the data source X to the closest center which has already been chosen. Then the K-means++ initialization is as follows:

1) Take one center c_1, chosen uniformly at random from X.
2) Take a new center c_i, choosing x ∈ X with probability D(x)² / Σ_{x'∈X} D(x')².
3) Repeat step 2 until k centers have been taken.
4) Proceed as with the standard K-means algorithm.

Step 2 is also called "D² weighting"; a minimal sketch of this seeding follows below.
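The D² weighting of steps 1-3 can be sketched as follows (illustrative Python assuming NumPy; the returned centers would then be handed to the standard K-means loop as step 4):

import numpy as np

def k_means_pp_init(data, k, seed=0):
    """Choose k initial centers using D^2 weighting."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Step 1: the first center is chosen uniformly at random from X.
    centers = [data[rng.integers(len(data))]]
    # Steps 2 and 3: draw further centers with probability D(x)^2 / sum D(x')^2.
    for _ in range(k - 1):
        d2 = np.min(np.linalg.norm(data[:, None, :] - np.array(centers)[None, :, :],
                                   axis=2) ** 2, axis=1)
        centers.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.array(centers)

# Step 4 would proceed with the standard K-means algorithm using these centers.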

Although the K-means++ method needs extra time for steps 1-3, step 4 converges much faster than the K-means method without this initial selection. The main idea is to choose the centers in a controlled fashion, where the current set of chosen centers stochastically influences the choice of the next center [?]. According to Arthur et al. [?], K-means++ is O(log k) competitive with the optimal solution, where k is the number of clusters.

C. Distributed computation - MapReduce algorithm

Chapter II discussed the research at CERN. Because of the huge amount of data, CERN created the Worldwide LHC Computing Grid (WLCG) to perform the analysis of that large data set.

Google’s MapReduce framework can be used for generating and processing large data sets. It allows the development of scalable parallel applications to process Big Data using a cluster or a grid [?]. For this the data can be stored in a structured (database) or unstructured way (filesystem). A typical MapReduce computation processes many terabytes of data on thousands of machines [?].

1) Programming model: The programmer expresses the computation with two functions, map and reduce. The MapReduce framework contains another function called shuffle. This function contains no user code; it is a generic step which is executed after the user's map function.

The map and reduce functions are written by the user. The map function takes an input pair and produces a set of intermediate key/value pairs. This output is used by the MapReduce framework to group the values by the intermediate keys, which is done by the shuffle function. The aggregated values are then passed to the reduce function. Since the reduce function contains arbitrary user code, it outputs whatever the user specifies; usually it merges the given values into a possibly smaller set of values, and typically there is just one output value per key at the end [?].

2) Execution overview: Figure 14 shows the complete flow when user code calls the MapReduce framework.

Fig. 14. MapReduce computational model: the user program forks a Master and several workers; Map workers read the input files and write intermediate files to local disk, which are shuffled to the Reduce workers that produce the output files - Based on [?], [?]

When the user calls the MapReduce function in the user code, the following sequence of actions occurs [?], [?]:

1) The MapReduce framework splits the specified set of input files I into m pieces, I_j ∈ {I_1, ..., I_m}. Then it starts up copies of the user's program on a cluster of machines, plus one additional copy called the Master.

2) All copies except the Master are workers that are assigned tasks by the Master. The numbers of workers in the Map phase and the Reduce phase do not have to be equal, so assume there are M map tasks and R reduce tasks. The Master picks idle workers and assigns Map or Reduce tasks to them.

3) A worker with a Map task reads the content of the corresponding input file I_j. The intermediate key/value pairs produced by the map function are buffered in memory.

4) Periodically, the buffered pairs are written to local disk, partitioned into R regions. The storage locations of each Map worker are passed back to the Master, which is responsible for forwarding these locations to the Reduce workers.

5) A Reduce worker uses remote procedure calls to read the buffered data from the local disks of the Map workers. After the Reduce worker has read all intermediate data for its partition, it sorts it by the intermediate key so that all occurrences of the same key are grouped together.

6) The Reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's Reduce function. The result of the Reduce worker is an output file which is written to an arbitrary destination.

7) After all Map and Reduce tasks have been completed, the Master notifies the user program. With this, the MapReduce call in the user code returns.

3) Example: Consider the problem of counting words in a single text or on a website. For this problem, the map function is very simple.

1: procedure MAP(key, value)
2:   for each word in value.split() do
3:     output.collect(word, 1)
4:   end for
5: end procedure

Fig. 15. Map implementation

The map function processes one line at a time. It splits the line and emits a key/value pair <word, 1> for every word. For the line "Hello World Bye World" the following output is produced:

<Hello, 1> <World, 1> <Bye, 1> <World, 1>

The Reducer implementation (Figure 16) just sums up the values. The output of the job is:

<Hello, 1> <World, 2> <Bye, 1>


1: procedure REDUCE(key, values)
2:   sum ← 0
3:   while values.hasNext() do
4:     sum ← sum + values.next()      ▷ Sum up the values
5:   end while
6:   return (key, sum)
7: end procedure

Fig. 16. Reduce implementation
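The whole map-shuffle-reduce flow of this word-count example can be simulated on a single machine with a few lines of Python (an illustration of the data flow only, not of a distributed runtime):

from collections import defaultdict

def map_phase(line):
    # Emit an intermediate <word, 1> pair for every word of the line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group all intermediate values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum up the values for one key.
    return key, sum(values)

lines = ["Hello World Bye World"]
intermediate = [pair for line in lines for pair in map_phase(line)]
print([reduce_phase(key, values) for key, values in shuffle(intermediate).items()])
# [('Hello', 1), ('World', 2), ('Bye', 1)]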

MapReduce is designed as a batch-processing framework and is therefore not suitable for ad-hoc data analysis: the time to process such an analysis is too large, and the model does not allow programmers to conveniently perform iterative or one-shot analyses on Big Data [?].

The project Apache Hadoop provides an open-source implementation of the MapReduce framework [?]. Today this implementation is used by many companies such as Fujitsu Limited [?] or Amazon.com Inc. [?].

V. CONCLUSION

This paper introduces the basic principles of Big Data analysis techniques and technology. As the examples have shown, Big Data is a very important topic in daily business. It helps companies to understand their customers and to improve business decisions. Further research in the medical and many other areas would not be possible without Big Data analysis.

Due to these various requirements, many approaches are needed. This paper focused on association rule mining, cluster analysis and distributed computing. In the field of association rule mining, the Apriori and FP-Growth algorithms were presented. The Apriori algorithm is the most common one in this area, but it has to scan the data source quite often. In addition, its complexity is comparatively high because of the generation of the candidate item sets.

In order to reduce the scans of the data source, the FP-Growth algorithm was published. Instead of n + 1 scans it needs only two. After the special data structure, the FP-Tree, has been built, the further work can easily be parallelized and executed on multiple machines.

The K-Means algorithm is the most common one for cluster analysis. Given a data source, it creates k clusters based on the Euclidean distance. The result depends on the initial cluster positions and the number of clusters. Because the initial clusters are chosen arbitrarily, there is no unique solution for a specific problem.

The K-Means++ algorithm improves on this by performing a preprocessing step before proceeding as the K-Means algorithm. With this preprocessing, the K-Means++ algorithm is guaranteed to find a solution that is O(log k) competitive with the optimal K-Means solution.

Finally, the MapReduce framework allows parallel and distributed processing of Big Data. For this, hundreds or even thousands of computers are used to process large amounts of data. The framework is easy to use, even without knowledge of parallel and distributed systems.

All techniques in this paper are used for processing large data sets. Therefore it is usually not possible to get a response within seconds or even minutes; it usually takes many minutes, hours or days until the result is computed. For most companies it is, however, necessary to get a very fast response, as in the OLAP approach. For this, Google's BigQuery is one possible solution. But with such tools for ad-hoc analysis there is no possibility for a deep analysis of the data set. To combine ad-hoc analysis with a deep analysis of the data set, algorithms are needed that are more efficient and more specialized for a certain domain.

Because of this, technology for analyzing Big Data is also an important area in the academic environment. Here the focus is on the running time and the efficiency of these algorithms. Furthermore, the data volume is still rising, so another promising approach is the further parallelization of the computational tasks. On the other hand, the running time can also be improved by specialized algorithms, like the Apriori algorithm for the analysis of related transactions.
