
Volume 3, Special Issue 1, ICSTSD 2016


Optimization of Parallel Data Accessing in Big Data Processing

Prof. Saurabh A. Ghogare*1

MCA, Prof. Ram Meghe Institute of Technology & Research, Amravati, India

Saurabh.a.ghogare@gmail.com

Abstract- Big data usually includes data sets with sizes beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from datasets that are diverse, complex, and of a massive scale. The one aspect that differs now, compared with the past, is the sheer scale and accessibility of data, a direct result of the super-efficient speeds at which data can now be computed. Big data is therefore an all-encompassing term for any collection of large data sets that were once difficult to process. Organizations are capturing data at deeper levels of detail and keeping more history than ever before.

1. INTRODUCTION

Knowledge Discovery from Data (KDD) refers to a set of activities designed to extract new knowledge from complex datasets. The KDD process is often interdisciplinary.


The original 12 rules defined for OLAP systems specified the methods of data mining required for the analysis of data and defined standard data analysis (SDA), which supported the analysis of data in aggregated form; these methods were much more timely than decisions reached with traditional methods. [3]

According to Gartner's definition, big data is characterized by three V's: volume, variety, and velocity, with value often added as a fourth. [4][5]

Volume: Big data uses massive datasets, including, for example, metadata from internet searches, credit and debit card purchases, social media postings, mobile phone location data, and data from sensors in cars and other devices. The volume of data being produced in the world continues to increase rapidly. [6]

Variety: Big data often involves bringing together data from different sources. Currently it appears that big data analytics mainly uses structured data [7], e.g. in tables with defined fields, but it can also include unstructured data. For example, it is possible to obtain a feed of all the data coming from a social media source such as Twitter. This is often used for sentiment analysis, i.e. to analyse what people are saying about products or organisations. [3][7]

Velocity: In some contexts, it is important to analyse data as quickly as possible, even in real time. Big data analytics can be used to analyse data 'in motion', as it is produced or recorded, as well as data 'at rest' in data stores. A potential application of 'in motion' analysis is in credit card payments; for example, Visa is looking at using big data analytics to develop new ways of authorising credit card payments. [7]
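The 'in motion' idea above can be sketched as a program that decides on each record as it arrives, without first landing the stream in a data store. The transaction data and the single-threshold rule below are invented for illustration; they are not Visa's actual authorisation logic.

```python
# Hypothetical sketch of analysing payment data "in motion" rather than "at
# rest": each transaction is classified as it streams past, nothing is stored.

def authorise(stream, limit=1000):
    """Yield a decision for each (card, amount) record as it arrives."""
    for card, amount in stream:
        decision = "review" if amount > limit else "approve"
        yield (card, amount, decision)

transactions = [("A", 120), ("B", 4500), ("A", 80)]
decisions = list(authorise(transactions))
# ("B", 4500) exceeds the illustrative limit and is routed for review
```

A real deployment would replace the list with a message queue or stream processor, but the per-record decision loop is the essence of 'in motion' analysis.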

2. ASSUMPTION-I


A. Statistical Analysis: It is concerned with summarising large datasets. Most statistical tools (e.g., R, SAS) prefer to compute over numerical data organised in tabular format, which requires an organisation step, especially for unstructured data. Various statistical tools, such as SQL, R, and Python, are currently available in our system.
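As a minimal sketch of the tabular organisation these tools expect, the example below arranges records into rows with named columns, extracts one numeric column, and summarises it with Python's standard library. The column names and values are assumptions made up for illustration.

```python
# Illustrative sketch: numerical data organised in tabular form, then
# summarised -- the workflow statistical tools such as R or Python expect.
import statistics

rows = [  # a tiny invented table; real datasets would be far larger
    {"region": "east", "sales": 120.0},
    {"region": "west", "sales": 95.0},
    {"region": "east", "sales": 130.0},
]

sales = [r["sales"] for r in rows]  # pull out one numeric column

summary = {
    "mean": statistics.mean(sales),
    "stdev": statistics.stdev(sales),  # sample standard deviation
    "max": max(sales),
}
```

The organisation step the text mentions is exactly the work of getting unstructured input into a row-and-column shape like `rows` before any summary statistic can be computed.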

B. Data Mining: Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD) [10] is an interdisciplinary subfield of computer science. Every day, people work with vast amounts of data drawn from different fields. Because the data are available in different formats, the proper action must be taken so that, whenever a customer requires it, the data can be retrieved from the database to support better decisions. This technique is what we call data mining, Knowledge Hub, or simply KDD. [1]
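A toy stand-in for the pattern-discovery step of KDD is counting which items co-occur in customer baskets, the seed of association-rule mining. The basket data below is invented for illustration; real mining systems work at vastly larger scale.

```python
# Minimal data-mining sketch: count co-occurring item pairs across baskets.
from collections import Counter
from itertools import combinations

baskets = [  # hypothetical transaction data
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
]

pair_counts = Counter()
for basket in baskets:
    # sort so each pair has a canonical order before counting
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1
# ("bread", "milk") and ("eggs", "milk") each co-occur in two baskets
```

Frequent pairs like these are the raw material from which association rules ("customers who buy bread also buy milk") are derived.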

C. Data Visualization and Visual Analysis: A primary goal of data visualization is to communicate information clearly and efficiently to users via the information graphics selected, such as tables and charts. Effective visualization helps users in analyzing and reasoning about data and evidence.
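The core idea, turning a table of numbers into a graphic the eye can compare at a glance, can be sketched even without a charting library. The year labels and values below are invented; a real system would render the same mapping with a plotting toolkit.

```python
# Tiny visualization sketch: map each value to a proportional text bar so
# relative magnitudes can be compared at a glance.
def bar_chart(data, width=20):
    """Render {label: value} as proportional text bars."""
    peak = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(width * value / peak)
        lines.append(f"{label:>6} | {bar} {value}")
    return "\n".join(lines)

chart = bar_chart({"2014": 50, "2015": 80, "2016": 100})
print(chart)
```

The design choice here, scaling every bar against the peak value, is the same proportional encoding a bar chart in any charting tool performs.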

3. ASSUMPTION-II

A comprehensive KDD architecture provides a variety of data analysis methods. It must also supply a means of storing and processing data. A single storage mechanism, such as a local file system, works for small data volumes but is problematic for large-scale data analysis. We argue that different types of analysis, and the intermediate data structures they require (e.g. graphs for social network analysis), call for specialized data management systems. Others have also recognized that the time of the single-style database that fits all needs is gone [2].

A. Data Preparation and Batch Analytics:


B. Processing Structured Data: Although Hadoop can process such data (via Hive), we have found distributed analytic databases [6] to be useful for storing and analyzing such data.

C. Processing Semi-structured Data: Not all data can be easily modeled using relational techniques, for example hierarchical documents, graphs, and geospatial data. Such data is extremely useful for social network analysis, natural language processing, and semantic web analysis. We provide HBase [7] and Cassandra for hierarchical, key-value data organization. For graph analysis, we employ both open-source tools (e.g., Neo4j [6]) and proprietary hardware solutions.
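The key-value organization mentioned above can be illustrated by flattening a hierarchical document into (path, value) pairs, the general shape that column-family stores such as HBase or Cassandra index under a row key. The nested document below is an invented example, not any store's actual wire format.

```python
# Hedged sketch: flatten a hierarchical document into key-value pairs,
# the organization style used for semi-structured data in key-value stores.
def flatten(doc, prefix=""):
    """Turn nested dicts into {dotted.path: value} pairs."""
    items = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            items.update(flatten(value, path + "."))  # recurse into subtree
        else:
            items[path] = value
    return items

profile = {"user": {"name": "ada", "links": {"follows": 42}}}
rows = flatten(profile)
# rows == {"user.name": "ada", "user.links.follows": 42}
```

Each flattened path plays the role of a column qualifier; the point is that hierarchy survives as structured keys rather than being forced into fixed relational columns.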

Furthermore, data warehouses may not be able to handle the processing demands posed by sets of big data that need to be updated frequently or even continually; for example, real-time data on the performance of mobile applications or of oil and gas pipelines.

Hadoop: Hadoop is a free Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure.

MapReduce: MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. It was developed at Google for indexing web pages and replaced their original indexing algorithms and heuristics in 2004. The framework is divided into two parts: Map, a function that parcels out work to different nodes in the distributed cluster, and Reduce, another function that collates the work and resolves the results into a single value. If a node remains silent for longer than the expected interval, the master records it as failed and reassigns its work to other nodes. [9]
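The two phases described above can be sketched with the canonical word-count example. This is a single-process illustration of the programming model only, not Hadoop's distributed implementation; the input documents are invented.

```python
# Word count expressed as the two MapReduce phases described in the text.
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in the input split."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: collate the pairs and resolve each key to a single value."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data", "big big deal"]
counts = reduce_phase(map_phase(docs))
# counts == {"big": 3, "data": 1, "deal": 1}
```

In a real cluster the map outputs would be partitioned by key and shuffled to many reducer nodes; the shape of the two functions, however, is exactly this.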

4. APPLICATIONS

1. Understanding and Targeting Customers:


Companies are augmenting traditional data sets with social media data, browser logs, text analytics, and sensor data to get a more complete picture of their customers. The big objective, in many cases, is to create predictive models.

2. Understanding and Optimizing Business Processes: Big data is also increasingly used to optimize business processes. Retailers are able to optimize their stock based on predictions generated from social media data, web search trends, and weather forecasts. One particular business process that is seeing a lot of big data analytics is supply chain or delivery route optimization. [4]

3. Improving Science and Research: Science and research are currently being transformed by the new possibilities big data brings. Take, for example, CERN, the Swiss nuclear physics lab whose Large Hadron Collider, the world's largest and most powerful particle accelerator, generates huge amounts of data. [4]

4. Optimizing Machine and Device Performance: Big data analytics help machines and devices become smarter and more autonomous. For example, big data tools are used to operate Google's self-driving car: a Toyota Prius fitted with cameras, GPS, powerful computers, and sensors to drive safely on the road without human intervention. Big data tools are also used to optimize energy grids using data from smart meters. We can even use big data tools to optimize the performance of computers and data warehouses. [1]

5. FUTURE WORK

Although our infrastructure is used for real-world applications, we treat these systems as a research platform and expect them to continuously evolve as the state of the art advances. We have developed our knowledge discovery principles during the course of implementing these applications and standing up our own systems. However, there is still much to do, and many architectural questions remain open. Some immediate ones include: [5]

• How do we take advantage of cloud computing to instantiate big data services in an optimal manner (i.e., to reduce cost and maximize performance)?

• How do we automate and formalize the process of instantiating the entire data analysis pipeline? [7][9]

6. CONCLUSIONS


The growth of data at this scale has introduced complex, interesting questions for the community. As organizations continue to collect more data, formalizing the process of big data analysis will become paramount. In this paper, we introduced principles that we believe can guide organizations in developing a sound, useful, and flexible data analysis pipeline. We have instantiated these principles in our own infrastructure and have found them to be useful guides.

7. REFERENCES

[1] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, 1996.

[2] "Big Data for Development: Challenges and Opportunities", Global Pulse, May 2012.

[3] Ivanka Valova, Monique Noirhomme, "Processing of Large Data Sets: Evolution, Opportunities and Challenges", Proceedings of PCaPAC08.

[4] Joseph McKendrick, "Big Data, Big Challenges, Big Opportunities: 2012 IOUG Big Data Strategies Survey", IOUG, Sept 2012.

[5] Nigel Wallis, "Big Data in Canada: Challenging Complacency for Competitive Advantage", IDC, Dec 2012.

[6] Robert Souza et al., "How to Get Started with Big Data", BCG Perspectives, The Boston Consulting Group, 29 May 2015. Accessed 25 June 2014.

[7] Philip Russom, Managing Big Data, The Data Warehousing Institute, 2013. Available from http://www.pentaho.com/resources. Accessed 25 June 2014.

[8] T. White, Hadoop: The Definitive Guide, Yahoo Press, 2010.
