Big Data Analytics - Big Data Optimization : Algorithmic Framework for Data Analysis Guided by

Figure 2.4: Big Data Sources.

Big Data analytics reflects the challenges of data that are too vast, unstructured, and fast-moving to be managed by traditional methods. Input to Big Data systems comes from various sources. Traditionally, the largest amount of data was created by businesses, in the form of paper, audio, or photographic records. Nowadays, however, new technologies have given rise to many emerging sources across different fields, which create new data in real time. Social networks are also a powerful data source. Figure 2.4 gives a brief overview of data sources.

Big Data analytics has become increasingly important in both the academic and the business communities over the past years. Gleaning meaningful information and competitive advantage from massive amounts of data matters to organizations globally, yet extracting meaningful insights from such data sources quickly and easily is challenging. Thus, analytics has become vital for organizations to realize the full value of Big Data, improve their business performance, and increase their market share.


2.3.1 Data Mining

Since Big Data is not only large but also varied and fast-growing, many technologies and analytical techniques are needed to extract relevant information. A myriad of analytic techniques can be employed when tackling a Big Data project. Which ones are used depends on the type of data being analyzed, the technology available, and the research questions to be answered. One of the most widely used, owing to the number of variants it offers, is Data Mining.

Brown et al. [71] define Data Mining as “combining methods from statistics and machine learning with database management” in order to pinpoint patterns in large datasets. Data mining has taken much of its inspiration and many of its techniques from machine learning (and some also from statistics).

Data mining can be divided into two main classes of tools: model building and pattern discovery. Model building produces a high-level, global descriptive summary of a dataset; in modern statistics such models include regression models, cluster decomposition, and Bayesian networks. Models describe the overall shape of the data. A pattern is a local structure, in a possibly vast search space, describing data with an anomalously high density compared with that expected under a baseline model. Patterns are usually embedded in a mass of irrelevant data [72].
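The distinction between a global model and a local pattern can be illustrated with a minimal sketch. It is not taken from [72]: the function names `global_model` and `dense_bins`, the uniform baseline, and the threshold `ratio` are all assumptions chosen for illustration. The model summarizes the whole dataset with two parameters, whereas the pattern search flags only those bins whose count is anomalously high relative to the baseline.

```python
from collections import Counter
from statistics import mean, pstdev

def global_model(values):
    """Model building: summarize the whole dataset with two global parameters."""
    return {"mean": mean(values), "std": pstdev(values)}

def dense_bins(values, low, high, n_bins, ratio=3.0):
    """Pattern discovery: return the indices of bins whose count is at least
    `ratio` times the count expected under a uniform baseline on [low, high)."""
    width = (high - low) / n_bins
    # Assign each value to a bin; clamp the top edge into the last bin.
    counts = Counter(min(int((v - low) / width), n_bins - 1) for v in values)
    expected = len(values) / n_bins
    return sorted(b for b, c in counts.items() if c >= ratio * expected)

# Hypothetical data: 100 points spread evenly over [0, 10) (the "irrelevant mass")
# plus 40 points packed tightly just above 7.5 (the local pattern).
background = [i / 10 for i in range(100)]
cluster = [7.5 + i / 1000 for i in range(40)]
data = background + cluster

model = global_model(data)
patterns = dense_bins(data, low=0.0, high=10.0, n_bins=10)
print(model, patterns)
```

In this sketch the global summary says nothing about where the dense region lies; only the comparison against the baseline isolates it.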

However, in this PhD thesis we focus on Big Data optimization because, as stated in Chapter 1, it is of special interest to industry (SRIA 4.0) and has not yet been widely studied.

2.3.2 Big Data Optimization

Currently, the growing interest in Big Data applications [43], many of which require processing large amounts of data arriving at great speed from streaming sources, brings new opportunities to apply dynamic multi-objective optimization. The rationale is that both Big Data and multi-objective optimization are found in many disciplines, such as transportation, economics, agri-food, mobility, and medicine, so it is foreseeable that they will converge in the near future, leading to multi-objective Big Data optimization problems. Hence, Big Data optimization involves the high dimensionality of data, the dynamic change of data, and the multi-objective nature of problems and algorithms.

In recent years, the content of Big Data has been increasing over time, and the goal of Big Data analytics also changes with time. An algorithm should therefore be able not only to handle dynamically changing data but also to adjust the target of the analysis. One of the foremost characteristics of Big Data is that the data come from different sources, which together make up the large dataset. Generally, more than one objective needs to be satisfied at the same time on these large datasets. Most traditional methods can only be applied to continuous and differentiable functions and have to perform a series of separate runs to satisfy different objectives [73].
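To make the contrast concrete, the following minimal sketch compares a Pareto-based view with a weighted-sum scalarization. The weighted sum is used here only as one assumed example of a traditional approach; it is not claimed to be the method discussed in [73]. The candidate objective vectors are hypothetical, and both objectives are assumed to be minimized.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def non_dominated(points):
    """Return the points that no other point dominates (the Pareto front)."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

def weighted_sum_best(points, weights):
    """Scalarized view: one run with fixed weights yields one compromise solution."""
    return min(points, key=lambda p: sum(w * x for w, x in zip(weights, p)))

# Hypothetical objective vectors (f1, f2) of seven candidate solutions.
candidates = [(1.0, 9.0), (2.0, 7.0), (3.0, 8.0), (4.0, 4.0),
              (6.0, 3.0), (7.0, 5.0), (9.0, 1.0)]

# A Pareto-based method exposes the whole set of trade-offs at once.
front = non_dominated(candidates)

# A scalarizing method needs a separate run per weight vector.
runs = [weighted_sum_best(candidates, w)
        for w in [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]]
print(front)
print(runs)
```

Each weight vector recovers a single point of the front, so covering the trade-off surface requires repeating the run with different weights.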

In the last decade, a number of approaches that adapt metaheuristic techniques to work in parallel on Hadoop ecosystems have been proposed. These proposals are related to data mining or data management applications, such as a swarm intelligence method to optimize feature selection in Big gene expression datasets [74], data partitioning in Big Databases [75], dimension reduction in Big Data analytics [76], pattern detection with Artificial Immune Algorithms [77], a parallel MapReduce evolutionary algorithm for graph inference [78], and a parallel artificial ant colony optimization for task scheduling in cluster environments [79]. Most of these approaches are based on the MapReduce programming model. However, MapReduce entails a series of disadvantages that make it unsuitable for proper integration with global optimization techniques in general, and metaheuristics in particular. These drawbacks are mostly related to high-latency queries, a non-iterative programming model, and weak real-time processing. Therefore, new approaches are needed to integrate Big Data technologies with global optimization algorithms in order to address all these issues. In this thesis we present a new framework, namely jMetalSP, to cope with them.
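The non-iterative nature of MapReduce can be illustrated with a plain-Python simulation. This sketch does not use the Hadoop API and does not represent jMetalSP: the functions `map_phase`, `reduce_phase`, `run_job`, and `optimize`, the sphere objective, and the simple elitist search are all hypothetical. The point is only that an iterative metaheuristic has to launch one complete map-then-reduce job per generation, and on a real cluster each such job would be expected to add its own scheduling and I/O overhead.

```python
import random
from functools import reduce

def sphere(x):
    """Toy objective to minimize (an assumption for illustration)."""
    return sum(v * v for v in x)

def map_phase(population):
    """Map: turn each candidate solution into a (fitness, solution) pair."""
    return [(sphere(ind), ind) for ind in population]

def reduce_phase(pairs):
    """Reduce: keep the pair with the lowest fitness."""
    return reduce(lambda a, b: a if a[0] <= b[0] else b, pairs)

def run_job(population):
    """One simulated job: a single map pass followed by a single reduce pass."""
    return reduce_phase(map_phase(population))

def optimize(generations, pop_size=20, dim=3, seed=0):
    """Elitist random search that needs one full job per generation."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(dim)]
                  for _ in range(pop_size)]
    jobs_launched = 0
    history = []
    for _ in range(generations):
        best_fit, best_ind = run_job(population)  # a new job every iteration
        jobs_launched += 1
        history.append(best_fit)
        # Next generation: keep the best individual and perturb it.
        population = [best_ind] + [
            [v + rng.gauss(0, 0.5) for v in best_ind]
            for _ in range(pop_size - 1)
        ]
    return history, jobs_launched

history, jobs = optimize(generations=10)
print(jobs, history[0], history[-1])
```

The number of jobs grows linearly with the number of generations, which is why a model built around single-pass jobs fits iterative optimization poorly.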

Big Data analytics, and specifically Big Data optimization, can be divided into four components: handling large amounts of data, handling high-dimensional data, handling dynamic data, and multi-objective optimization. Most real-world Big Data problems can be modelled as large-scale, dynamic, multi-objective problems.

In document Big Data Optimization : Algorithmic Framework for Data Analysis Guided by Semantics (Page 34-36)