Some Studies on Big Data Analytics with Machine Learning


105 International Journal for Modern Trends in Science and Technology, Volume 3, Special Issue 5, October 2017


S.Krishna Reddy¹| P.Lakshmikanth²| B.Ramakrishna³

1,2,3Department of CSE, Kallam Harnadhareddy Institute of Technology, Guntur, Andhra Pradesh, India.

2Associate Professor, Department of CSE, Chalapathi Institute of Engineering & Technology, Guntur, India.

To Cite this Article

S.Krishna Reddy, P.Lakshmikanth and B.Ramakrishna, “Some Studies on Big Data Analytics with Machine Learning”, International Journal for Modern Trends in Science and Technology, Vol. 04, Special Issue 01, January 2018, pp. 105-110.

Big data is much more than the storage of and access to data. Big data analytics plays a vital role in making sense of the data and exploiting its value. However, it is a big challenge to discover and develop new types of machine learning algorithms. Scaling big data to the correct dimensionality is a challenge that machine learning algorithms may encounter, and there are further challenges of handling velocity, volume and many more for all types of machine learning algorithms. In this paper, we first explore the big data concept, which brings with it an urgent need for advanced data acquisition, management, and analysis mechanisms. We present the concept of big data and highlight its four phases: data generation, data acquisition, data storage, and data analysis. The next part of this paper focuses on handling big data using machine learning (ML) and highlights the three ML methods, supervised learning, unsupervised learning and reinforcement learning, and their impact on big data.

Keywords—Big Data, Machine Learning, Supervised learning, Unsupervised learning, Reinforcement learning.

I. INTRODUCTION

The emerging big-data paradigm, owing to its broader impact, has profoundly transformed our society and will continue to attract diverse attention from both technological experts and the general public. It is obvious that we are living in a data deluge era, evidenced by the sheer volume of data from a variety of sources and its growing rate of generation. For instance, an IDC report [5] predicts that, from 2005 to 2020, the global data volume will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, representing a double growth every two years. The term “big data” was coined to capture the profound meaning of this data-explosion trend, and data has indeed been touted as the new oil, which is expected to transform our society.

Machine learning can be defined as a “field of study that gives computers the ability to learn without being explicitly programmed”.

It is a science of algorithms in which the algorithms “learn” from the dataset, identifying patterns or classifying trends for example, and then automate the output, whether that is sorting data into categories or making predictions on future outputs.

Machine learning is now quite a mature discipline and generally captures any activity that involves automated learning from data or experience. At the core of machine learning is the ability of software or a machine to improve the performance of certain tasks through being exposed to data and experiences. A machine learning model first learns the knowledge from the data it is exposed to and then applies this knowledge to deliver predictions about new, previously unseen data [5] [1].

Available online at: http://www.ijmtst.com/ncracse2018.html

Special Issue from 3rd National Conference on Recent Advances in Computer Science and Engineering, 27^th – 28^th January 2018, Guntur, Andhra Pradesh, India

The quality of the predictions delivered by a machine learning model depends on a number of factors: how well the relevant knowledge is represented by the model, how complete and representative the training data were, and how easily the problem can be predicted in general.

Most of the time, machine learning has become a very good tool for certain recognition, identification or categorization tasks such as fingerprint, face or voice recognition [5][7]. Likewise, recent clustering algorithms have become very good at automatically grouping people according to their profiles, identifying market segments, forming communities, and even segmenting images or distinguishing genes [5], [4]. The successful application of ML models to these problems was possible not only due to the ingenuity of the learning model but primarily due to very careful data preparation, including the transformation, selection, and generation of multidimensional features, which made the learning problem discriminative enough to draw easy distinctions between the predicted targets based on the data [5], [3], [4].

Depending on the depth of knowledge that is available for learning, machine learning models are often categorized into supervised, unsupervised and semi-supervised learning algorithms [5], [2], [1].

Big data is of two types: structured data and unstructured data. Structured data are numbers and words that can be easily categorized and analyzed. These data are generated by things like network sensors embedded in electronic devices, smartphones, and global positioning system (GPS) devices. Structured data also include things like account balances, sales figures, and transaction data.

Unstructured data include more complex information such as customer reviews and forums from commercial websites, photos and other multimedia, and comments on social networking sites. These data cannot easily be separated into categories or analyzed numerically.

II. BIG DATA AND ITS IMPACT

A. Layered Architecture for Big Data

The big data system can be decomposed into a layered structure, as illustrated in Fig. 1.

The layered structure is divisible into three layers, i.e., the infrastructure layer, the computing layer, and the application layer, from bottom to top. This layered view only provides a conceptual hierarchy to underscore the complexity of a big data system. The function of each layer is as follows.

The infrastructure layer consists of a pool of ICT resources, which can be organized by cloud computing infrastructure and enabled by virtualization technology. These resources are exposed to upper-layer systems in a fine-grained manner with a specific service-level agreement (SLA). Within this model, resources must be allocated to meet the big data demand while achieving resource efficiency by maximizing system utilization, energy awareness, operational simplification, etc.

The computing layer encapsulates various data tools into a middleware layer that runs over raw ICT resources. In the context of big data, typical tools include data integration, data management, and the programming model. Data integration means acquiring data from disparate sources and integrating the dataset into a unified form with the necessary data pre-processing operations. Data management refers to mechanisms and tools that provide persistent data storage and highly efficient management, such as distributed file systems and SQL or NoSQL data stores. The programming model implements abstraction application logic and facilitates the data analysis applications. MapReduce [12], Dryad [10], Pregel [12], and Dremel [6] exemplify programming models.
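To make the notion of a programming model more concrete, the following is a minimal single-process sketch of the MapReduce pattern applied to word counting. It is only an illustration: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and do not correspond to the API of any particular framework, and a real system would distribute the three phases over many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


docs = ["big data needs big tools", "data tools scale"]
print(word_count(docs))
```

Because each call to `map_phase` touches only one document and each key is reduced independently, both steps could in principle run in parallel, which is the property that makes this model attractive for big data.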

The application layer exploits the interface provided by the programming models to implement various data analysis functions, including querying, statistical analyses, clustering, and classification; then, it combines basic analytical methods to develop various field-related applications.

Fig. 1. Big Data Layered Architecture

B. Big Data Phases

A big data system is complex, providing functions to deal with different phases in the digital data life cycle, ranging from its birth to its destruction. At the same time, the system usually involves multiple distinct phases for different applications [8], [11]. In this section, we focus on the value chain for big data analytics. Specifically, we describe a big data value chain that consists of four stages (generation, acquisition, storage, and processing).

Fig. 2. Big Data Value Chain

a. Data Generation

Data generation is the first, and a vital, part of the big data value chain. This section highlights the trends of big data generation, which can be characterized by the data generation rate. Specifically, the data generation rate is increasing as a result of technological advancements. We roughly classify data generation patterns into three sequential stages.

Stage I: The first stage began in the 1990s. As digital technology and database systems were widely adopted, many management systems in various organizations were storing large volumes of data, such as bank trading transactions, store records, and government sector archives. These datasets are structured and can be analyzed through database-based storage management systems.

Stage II: The second stage began with the growing popularity of web systems. The Web 1.0 systems, characterized by web search engines and e-commerce businesses after the late 1990s, generated large amounts of semi-structured and/or unstructured data, including web pages and transaction logs. Since the early 2000s, many Web 2.0 applications have created an abundance of user-generated content from online social networks, such as forums, online groups, blogs, social networking sites, and social media sites.

Stage III: The third stage is triggered by the emergence of mobile devices, such as smartphones, tablets, sensors and sensor-based Internet-enabled devices. The mobile-centric network has created and will continue to create highly mobile, location-aware, person-centered, and context-relevant data in the near future.

With this classification, we can see that the data generation pattern is evolving rapidly, from passive recording in Stage I to active generation in Stage II and automatic production in Stage III.

These three types of data constitute the primary sources of big data, of which the automatic production pattern will contribute the most in the near future.

b. Data Acquisition

As illustrated in the big data value chain, the task of the data acquisition phase is to aggregate information in a digital form for further storage and analysis. Intuitively, the acquisition process consists of three sub-steps, data collection, data transmission, and data pre-processing, as illustrated in Fig. 3.

Fig. 3. Subtasks of Data Acquisition

Data pre-processing is of equal importance in the data acquisition phase. Because of their diverse sources, the collected data sets may have different levels of quality in terms of noise, redundancy, consistency, etc. Transferring and storing raw data would have necessary costs. On the demand side, certain data analysis methods and applications might have strict requirements on data quality. As such, data pre-processing techniques that are designed to improve data quality should be in place in big data systems.
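As a rough illustration of what such pre-processing might look like, the sketch below cleans a small batch of sensor readings with respect to the three quality issues named above. The record layout (`sensor`, `time`, `value`), the function name `preprocess` and the plausible value range are assumptions made purely for this example.

```python
def preprocess(records, valid_range=(-50.0, 60.0)):
    """Clean raw sensor readings before they are transferred and stored."""
    low, high = valid_range
    seen = set()
    cleaned = []
    for record in records:
        sensor = record.get("sensor")
        time = record.get("time")
        value = record.get("value")
        # Consistency: skip records with missing fields.
        if sensor is None or time is None or value is None:
            continue
        # Noise: skip readings outside the assumed plausible range.
        if not (low <= value <= high):
            continue
        # Redundancy: keep only the first copy of a (sensor, time) reading.
        key = (sensor, time)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"sensor": sensor, "time": time, "value": float(value)})
    return cleaned


raw = [
    {"sensor": "a", "time": 1, "value": 21.5},
    {"sensor": "a", "time": 1, "value": 21.5},   # duplicate reading
    {"sensor": "b", "time": 1, "value": 999.0},  # implausible value
    {"sensor": "c", "time": 2, "value": None},   # missing value
    {"sensor": "a", "time": 2, "value": 22},
]
print(preprocess(raw))
```

Filtering at acquisition time in this way reduces the volume that has to be transmitted and stored, at the price of discarding records that a later analysis might have wanted to inspect.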

c. Data Storage

The data storage subsystem in a big data platform organizes the collected information in a convenient format for analysis and value extraction. For this purpose, the data storage subsystem should provide two sets of features: the storage infrastructure must accommodate information persistently and reliably, and the data storage subsystem must provide a scalable access interface to query and analyze a vast quantity of data. This functional decomposition shows that the data storage subsystem can be divided into hardware infrastructure and data management.

d. Data Analysis

The last and most important stage of the big data chain is data analysis, the goal of which is to extract useful values, suggest conclusions and/or support decision making. First, we discuss the purpose and classification metric of data analytics.

Second, we review the application evolution for various data sources and summarize the six most relevant areas. Finally, we introduce several common methods that play fundamental roles in data analytics.

III. MACHINE LEARNING FOR BIG DATA

The core abilities of data analytics and machine learning to model, learn from, and predict data were deeply affected by the emergence of big data. Almost in an instant the challenge came from all sides: accessing big amounts of data, learning the models from it and carrying out mass predictions, all in a reasonable time.

Since big data processing requires decomposition, parallelism, modularity and/or recurrence, inflexible black-box type machine learning models failed at the outset. In fact, all machine learning algorithms with computational complexity of O(n²) immediately become intractable when faced with billions of data points.
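A back-of-the-envelope calculation illustrates why quadratic complexity is prohibitive at this scale. The sketch below assumes, purely for illustration, a single machine executing 10⁹ elementary operations per second and ignores constant factors, memory and I/O; under these assumptions a linear pass over one billion points takes about a second, whereas a quadratic algorithm would need on the order of decades.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def estimated_seconds(operations, ops_per_second=1e9):
    """Idealized running time: operation count divided by an assumed rate."""
    return operations / ops_per_second


n = 1_000_000_000  # one billion data points

linear = estimated_seconds(n)
linearithmic = estimated_seconds(n * math.log2(n))
quadratic = estimated_seconds(n ** 2)

print(f"O(n):       {linear:.0f} s")
print(f"O(n log n): {linearithmic:.0f} s")
print(f"O(n^2):     {quadratic / SECONDS_PER_YEAR:.1f} years")
```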

Depending on the depth of knowledge that is available for learning, ML models can be categorized into supervised, unsupervised and semi-supervised learning algorithms [5], [2], [1].

A. Supervised Learning

Supervised learning algorithms use labeled training examples. The input data and the target outputs are given explicitly for the model to learn the mapping or function between them. Once this is captured, the model uses the learned mapping and the new unseen input data to predict their outputs. Within the supervised learning family we can further distinguish between classification models, which focus on the prediction of discrete (categorical) outputs, and regression models, which predict continuous outputs. Among the large number of models reported in the literature, linear and nonlinear density-based classifiers, decision trees, naïve Bayes, support vector machines, neural networks and nearest neighbor are the most frequently cited and applied in practical applications [5].
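The idea of learning a mapping from labeled examples and applying it to unseen inputs can be sketched with the simplest of the listed models, the nearest neighbor classifier. The toy data and the labels `small` and `large` are invented for this example, and the function name is hypothetical.

```python
import math


def predict_nearest_neighbor(train_inputs, train_labels, query):
    """Return the label of the training example closest to `query`."""
    distances = [math.dist(x, query) for x in train_inputs]
    nearest = distances.index(min(distances))
    return train_labels[nearest]


# Labeled training examples: inputs paired with their target outputs.
train_inputs = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
train_labels = ["small", "small", "large", "large"]

# Predictions for previously unseen inputs.
print(predict_nearest_neighbor(train_inputs, train_labels, (2.0, 1.5)))
print(predict_nearest_neighbor(train_inputs, train_labels, (7.0, 9.0)))
```

Note that every prediction scans the whole training set, so answering n queries against n examples is itself a quadratic computation, which hints at the scalability issues discussed in this section.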

Decision trees, despite their simplicity, were particularly successful in various industrial applications due to their robustness, transparency and the portability of the model, which effectively results in a set of SQL queries. Despite this operational simplicity, inducing a decision tree is at best O(n log n) complex and still requires significant data reduction effort to handle very big data [5].
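The portability claim can be illustrated by translating an already trained tree into SQL. The sketch below assumes a binary tree stored as nested dictionaries with numeric threshold tests; the tree itself, the table name `customers` and the column names are hypothetical, and the generated statements are meant only to show the principle that each leaf corresponds to one query.

```python
def tree_to_sql(node, table, conditions=()):
    """Return one SELECT statement per leaf of a binary decision tree."""
    if "label" in node:  # leaf: all tests on the path form the WHERE clause
        where = " AND ".join(conditions) if conditions else "1=1"
        return [
            f"SELECT *, '{node['label']}' AS prediction "
            f"FROM {table} WHERE {where};"
        ]
    passed = f"{node['feature']} <= {node['threshold']}"
    failed = f"{node['feature']} > {node['threshold']}"
    return (
        tree_to_sql(node["left"], table, conditions + (passed,))
        + tree_to_sql(node["right"], table, conditions + (failed,))
    )


# Hypothetical tree: split on age first, then on income.
tree = {
    "feature": "age", "threshold": 30,
    "left": {"label": "low"},
    "right": {
        "feature": "income", "threshold": 50000,
        "left": {"label": "medium"},
        "right": {"label": "high"},
    },
}

for query in tree_to_sql(tree, "customers"):
    print(query)
```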

Support vector machines (SVMs) were also reported as very powerful in terms of achieved performance on numerical data thanks to their explicit effort to maximize the margin along the boundaries between the classes of data [5]. Traditional SVMs are quadratically complex, O(n²), yet there are successful attempts to cleverly reduce their complexity to linear [6]. We intend to expand on these developments.

Neural networks (NNs) are yet another example of very powerful, flexible and robust predictors for both classification and regression problems [5]. Their generic structure allows learning virtually any function, given enough hidden layers of processing nodes and training data. As with SVMs, NNs are typically O(n³) or at best quadratically complex in the number of examples. Complex structures further expand the computational demands of learning NNs. There have been recent attempts to redesign NNs into more scalable architectures [8], but further work is required to make NNs fully scalable for big data processing.

B. Unsupervised Learning

Unsupervised learning, generally known as clustering, is concerned with recognizing natural groupings among the data, i.e., separating similar from dissimilar data based on multidimensional data. Figure 2 visually depicts some examples with k-means and mixture of Gaussians clustering methods.

Many clustering methods, from k-means through density-based DBSCAN to hierarchical clustering models, were reported for simple problems on small data sizes [4], [9]. Traditional hierarchical clustering typically involves computation of the pairwise distances among the input data examples, which is usually a quadratically complex computation step. Assigning data samples to different clusters based on the distance hierarchy poses yet another computational overhead.
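The quadratic cost of the pairwise distance step is easy to see in code: for n examples there are n(n-1)/2 unordered pairs. The sketch below computes them with the Euclidean distance; the function name and the sample points are chosen only for illustration.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Distance for every unordered pair of points: n * (n - 1) / 2 entries."""
    return {
        (i, j): math.dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]
distances = pairwise_distances(points)
print(len(distances), "distances for", len(points), "points")
```

Doubling the number of examples roughly quadruples the number of distances, which is why this step alone rules out naive hierarchical clustering for billions of points.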

DBSCAN is one of the most cited clustering algorithms. It is based on the data density and uses the concept of local reachability among points to decide on cluster memberships. However, even if optimized with an efficient indexing structure, it is at best O(n log n) complex in the number of examples [4].

K-means clustering, on the other hand, is linearly complex in the number of examples and the data dimensionality, but it assumes a fixed number of algorithm iterations and a fixed number of clusters. Because of its linear complexity it is nevertheless the most scalable clustering model so far, and it has also evolved into a successful artificial neural network equivalent, the Kohonen nets [4]. Some recent research reported further improved scalability of the k-means model when the average aggregation is replaced with the median [9]. We intend to expand on the promising scalable supervised and unsupervised learning models as well as develop new models that flexibly manage the balance between the computational complexity of learning and the prediction performance through modularization and recurrence at different data granularity.
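A minimal sketch of k-means with a fixed number of clusters and a fixed number of iterations is given below. Taking the first k points as the initial centroids is an assumption made to keep the example deterministic; practical implementations usually choose the starting centroids more carefully. Each iteration performs one distance computation per point and centroid, which is where the linear dependence on the number of examples comes from.

```python
import math


def k_means(points, k, iterations=10):
    """Cluster `points` into `k` groups using a fixed number of iterations."""
    # Assumption for this sketch: the first k points serve as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels


points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, labels = k_means(points, k=2)
print(labels)
print(centroids)
```

Replacing the mean in the update step with a per-coordinate median would give the median-based variant mentioned above.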

C. Reinforcement Learning

Reinforcement learning lies somewhat in between supervised and unsupervised learning, in the sense that learning is not done on fully labeled or evaluated data; instead, a hint in the form of a local reward is provided for each actionable data item. The goal of reinforcement learning is to maximize the long-term reward through exploring and learning the best possible action policy in response to the environmental data. Reinforcement learning has found prime applications in artificial intelligence, agent technologies and many complex exploration systems requiring continuous response to the changing environment while maximizing strategic longer-term objectives.
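One common way to learn such an action policy from local rewards is tabular Q-learning, which is used here only as an illustrative technique and is not claimed to be the method intended in this paper. The sketch assumes a hypothetical environment of five states on a line in which only reaching the rightmost state yields a reward; the learning rate, discount factor and number of episodes are arbitrary choices.

```python
import random

N_STATES = 5      # states 0..4 on a line; state 4 is the goal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical environment: reward 1 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=200, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values from local rewards with tabular Q-learning."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore uniformly at random; Q-learning is off-policy, so the
            # greedy policy can still be read from the table afterwards.
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Long-term value: immediate reward plus discounted future value.
            future = 0.0 if done else gamma * max(q[next_state])
            q[state][action] += alpha * (reward + future - q[state][action])
            state = next_state
    return q


q_table = q_learning()
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

Although the reward is only observed at the goal, the discounted update propagates it backwards through the table, so states far from the goal also learn which action serves the long-term objective.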

IV. BIG DATA REVOLUTION

The emergence of the notion of big data initiated a true revolution in the way we store, manage and process data. Now arriving in exabytes per day, big data and its complexity continue to grow exponentially.

Big data are linked to the explosive growth in mobile devices and their ability to generate, collect, share and access large amounts of textual, numeric, image and video data. Combined with the related growth of web resources, networked services and the cult of sharing on ever-growing social networks, what we are experiencing is a true data revolution.

V. CONCLUSION

The era of big data is upon us, bringing with it an urgent need for advanced data acquisition, management, and analysis mechanisms. In this paper, we have presented the concept of big data and highlighted the big data value chain, which covers the entire big data lifecycle. The big data value chain consists of four phases: data generation, data acquisition, data storage, and data analysis. The next part of this paper focused on dealing with big data using machine learning (ML) and highlighted the three ML methods, supervised learning, unsupervised learning and reinforcement learning, and their impact on big data.

REFERENCES

[1]. O. Chapelle, B. Schölkopf, A. Zien. Semi-Supervised Learning (Adaptive Computation and Machine Learning Series). MIT Press, Cambridge, 2006.

[2]. C.M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

[3]. R.O. Duda, P.E. Hart and D.G. Stork. Pattern Classification. John Wiley & Sons, New York, 2001.

[4]. C.C. Aggarwal and C.K. Reddy. Data Clustering: Algorithms and Applications. Chapman and Hall / CRC, Boca Raton, 2014.

[5]. T.M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.

[6]. S. Melnik et al., “Dremel: Interactive analysis of web-scale datasets,” Proc. VLDB Endowment, vol. 3, nos. 1-2, pp. 330-339, 2010.

[7]. S. Marsland. Machine Learning, An Algorithmic Perspective. Chapman and Hall / CRC Press, Boca Raton, 2009.

[8]. D. Agrawal et al., “Challenges and opportunities with big data: A community white paper developed by leading researchers across the United States,” The Computing Research Association, CRA White Paper, Feb. 2012.

[9]. YC Kwon, D. Nunley, J.P. Gardner, M. Balazinska, B. Howe, S. Loebman. Scalable Clustering Algorithm for N-Body Simulations in a Shared-Nothing Cluster. Scientific and Statistical Database Management 6187: 132-150, 2010.

[11]. J. Gantz and D. Reinsel, “The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east,” in Proc. IDC iView, IDC Anal. Future, 2012.

[10]. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: Distributed data-parallel programs from sequential building blocks,” in Proc. 2nd ACM SIGOPS/EuroSys Eur. Conf. Comput. Syst., Jun. 2007, pp. 59-72.

[11]. D. Fisher, R. DeLine, M. Czerwinski, and S. Drucker, “Interactions with big data analytics,” Interactions, vol. 19, no. 3, pp. 50-59, May 2012.

[12]. G. Malewicz et al., “Pregel: A system for large-scale graph processing,” in Proc. ACM SIGMOD Int. Conf. Manag. Data, Jun. 2010, pp. 135-146.