The Conceptual Overview: Challenges and Opportunities with BIG DATA

Shailesh Tejram Gahane, Assistant Professor, Dr D Y Patil School of MCA, Charholi Bk., Via Lohegaon, Pune-412105 Maharashtra, India

ABSTRACT

We now live in a world of:

• Massive multi-core processing systems

• Very large main memory systems

• Fast networking components

• Big computing clusters

• Large data centers, which consume massive amounts of energy

The question that arises is: what is the next evolution for the data-driven world?

In this paper we look at some of these challenges and argue that the answer is “Big Data”. Big Data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery and/or analysis (Source: IDC). Big Data is an inflection point for information technology. In short, Big Data is a big deal: it is going to change the way you do things in the future, how you gain insights, and how you make decisions. We consider Big Data technologies such as Hadoop and their extension into an enterprise-ready Big Data platform. We also discuss Hadoop and NOSQL databases as components of Big Data, and the various resources that are affected by Big Data problems.

Keywords:

Big Data, Multi-Core Processing Systems, Main Memory, Computing Clusters, Hadoop, NOSQL Database, Petabytes, Terabytes, GPS, Content Management, Content Analysis.

INTRODUCTION

“Big data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving threshold for how big a dataset needs to be in order to be considered Big Data; that is, we do not define Big Data as being larger than a certain number of terabytes (thousands of gigabytes). We assume that, as technology advances over time, the size of datasets that qualify as big data will also increase. The definition can also vary by sector, depending on what kinds of software tools are commonly available and what sizes of datasets are common in a particular industry. With those cautions, Big Data in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes).

A flood of data is created every day by the interactions of billions of people using computers, GPS devices, cell phones, medical devices and more. Many of these interactions occur through mobile devices used by people in the developing world, whose needs and habits have been poorly understood until now. Big data is a term applied to a new generation of software, applications, and system and storage architectures, all designed to provide business value from unstructured and unbalanced data. Over the next ten or more years, Big Data is expected to create significant value in the marketplace.

In this market, widely used applications such as eBay, Amazon, LinkedIn, Facebook, the Apple Store, Wikipedia and many more all depend on their online communities to build a brand and, in turn, to create value. Researchers and policymakers are beginning to realize the potential for channeling these torrents of data into actionable information that can be used to identify needs, provide services, and predict and prevent crises for the benefit of low-income populations. Concerted action is needed by governments, development organizations and companies to ensure that this data helps the individuals and communities who create it. Big data technology and analytics have emerged as a discipline to enable business enterprises to deal with the large volume and variety of data that is generated at high velocity. In this paper we study, with reference to mobile phone technology, the major areas affected by Big Data.

We are swimming in a sea of data, and the sea level is rising rapidly. Tens of millions of connected people, billions of sensors, and trillions of transactions now work to create incomprehensible amounts of information. An equivalent amount of data is generated by people simply going about their lives, creating what the McKinsey Global Institute calls “digital exhaust”: data given off as a byproduct of other activities, such as Internet browsing and searching or moving around with a smartphone in one’s pocket.

REVIEW OF LITERATURE

Big Data is a growing problem for corporations. Current discussion has frequently centered on Hadoop and NOSQL, but these are only a starting point. There is a growing realization that there is no single solution to the problems being faced in this area. Instead, there is a range of available solutions that need to be tuned to the specific problems being addressed. Hadoop has clearly emerged as an important component of Big Data solutions, but it is not a panacea, and there is widespread divergence in how it is being used, how the results are being integrated with other data, and how programming and usage are facilitated. There is also a range of other processing methods that should be examined with respect to the problem at hand. NOSQL databases have also arisen as part of the response to the Big Data problem, particularly under the argument that data and processing requirements now exceed the capabilities of standard RDBMS and Analytic RDBMS systems. While this is to some extent true, it is more significant to look at the wide range of current alternative data stores, which include columnar databases, IMS and CODASYL solutions, and the range of NOSQL variations. Integration with current corporate data will be critical, and this means that even these solutions must eventually be brought into the corporate data warehouse and analytics solution. The major RDBMS suppliers are now all providing some accommodation. Structured corporate data will not go away, and neither will it lose its usefulness; the synergies between Big Data solutions and current analytics solutions will provide a key area of growth for tomorrow.

In addition to integration issues, Big Data problems are likely to involve an increasing array of new issues in data location, security, processing and purpose that have only begun to emerge. For example, storage of Personally Identifiable Data / Personally Identifiable Information (PID/PII) is illegal in many areas, but what about data and processing that permit such data to be generated? Companies are only at the very edge of this emerging area, as the vast wave of digital data enables new approaches and new information, and generates new territories of risk.

SIGNIFICANCE OF THE STUDY

Since the 1990s, data warehousing and data mining have brought numerous examples of Big Data. Walmart’s Teradata-based data warehouse, dating from that period, has been consistently among the largest data warehouses in the world. Walmart is notoriously secretive about its system, but a variety of sources on the web give figures. In the early 1990s, it started at 340GB. By 2000, it was said to be 70TB. By 2004, it was 500TB, or half a petabyte. In 2008, a figure of 4PB was being mentioned. For their time, these were certainly big data, before the term was invented, with correspondingly large storage and management costs. Walmart, as the world’s largest retailer through most of that time, is certainly renowned for getting value for its money. From these and other examples, we may draw a couple of conclusions. First, data size has always been pushing the limits of computer storage and processing technology. In that sense, what we are seeing today is not new. So far, technology has been able to accommodate this data growth, so it is probably a reasonable assumption that it will continue to do so. Second, some highly successful, leading-edge companies have consistently invested in the (relatively) expensive technologies needed to use big data and have reaped significant value from it. This, in a nutshell, is the importance of big data: it enables innovative businesses to become leading-edge adopters of new approaches to doing business and thus become particularly successful.

Of course, this is not to say that any business adopting big data will necessarily become a leading-edge adopter of a new business approach. There are other highly significant factors, such as the market environment in which the business operates and the organization’s ability to adapt to change. Furthermore, like any technical innovation, big data may confer first-mover advantage on a particular business, but then becomes mandatory for the competition simply to survive. There is hardly a large retailer that has not used data in a manner similar to Walmart. However, for a variety of reasons, perhaps related to Walmart’s market share or business philosophy, they have been unable to achieve Walmart’s level of success.

THE VALUE OF BIG DATA

Recognizing that big data has long been with us allows us to look at the historical value of big data, as well as current examples. This allows a wider sample of use cases, beyond the Internet giants who are currently leading the field in using big data. This leads us to the identification of value in two broad categories: pattern discovery and process invention.

PATTERN DISCOVERY

First, let me be clear that pattern discovery alone is not of value to the business. The title is simply shorthand for “pattern discovery and innovative reaction.” Clearly, discovering a pattern in, for example, customer behavior may be very interesting, but the real value occurs when we put that discovery to use by changing something that reduces costs or increases sales.

Pattern discovery leads us directly back to data mining. There are probably few of us who haven’t heard and perhaps repeated the “beer and diapers (or nappies)” story: a large retailer, supposedly Walmart, discovered through basket analysis (data mining of till receipts) that men who buy diapers on Friday evenings are also likely to buy beer. They rearranged the store layout to place the beer near the diapers and watched beer sales climb. Sadly, this story is now widely believed to be an urban legend or sales pitch rather than a true story of unexpected and momentous business value gleaned from data mining. Nevertheless, it makes the point as well as any number of real examples: there are nuggets of useful information to be discovered through statistical methods in any large body of data, and action can be taken to benefit from these insights.

This particular story illustrates data mining in a single, well-understood data set, with the insight used to target a previously unidentified segment of the customer set: “fathers who buy diapers when supplies run low at home” or some similar categorization. Such uses of big data remain common and provide business value through targeted marketing to smaller and ever more specific micro-segments of the market, until we reach the nirvana of the “segment of one.”
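To make the basket analysis idea concrete, the following is a minimal sketch of how the strength of a rule such as "diapers implies beer" might be measured using the standard support, confidence and lift measures from association rule mining. The receipts and the function name `rule_metrics` are hypothetical and invented purely for illustration; they are not data from any retailer.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(baskets)
    both = sum(1 for b in baskets if antecedent in b and consequent in b)
    ante = sum(1 for b in baskets if antecedent in b)
    cons = sum(1 for b in baskets if consequent in b)
    support = both / n                              # share of baskets with both items
    confidence = both / ante if ante else 0.0       # P(consequent | antecedent)
    lift = confidence / (cons / n) if cons else 0.0 # > 1 suggests a positive association
    return support, confidence, lift


# Hypothetical till receipts, one set of items per basket (illustrative only).
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
]

support, confidence, lift = rule_metrics(baskets, "diapers", "beer")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.3f}")
```

A lift above 1 indicates that the two items appear together more often than would be expected if they were bought independently; whether acting on such a pattern pays off is the "innovative reaction" part discussed above.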

PROCESS INVENTION

As we discussed, mining behavior data allows existing processes to be tweaked and changed to provide better business value. However, the second approach to getting business value from big data involves using the data operationally to invent an entirely new process or substantially re-engineer an existing one. Beyond basket analysis, most retailers in the 1990s also used cash register data to re-engineer their restocking and supply chain processes. It makes sense that goods purchased at the register deplete stock on the shelves, which must then be replenished. Restocking shelves depletes the store room, and supplies must be reordered, and so on. In the past, restocking and reordering were largely triggered by a manual process in which floor or warehouse staff had to first notice that stocks were low and then take action. Automating these processes from triggers in the sales/inventory system yields enormous financial and customer satisfaction benefits. The twin terrors of retail, out-of-stock and overstock situations, are avoided.
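The automated trigger described above can be sketched as a simple reorder-point rule. This is only an illustration under stated assumptions: the `StockItem` fields, the threshold values and the `record_sale` function are hypothetical names, not part of any particular sales or inventory system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StockItem:
    sku: str
    on_shelf: int
    reorder_point: int      # assumed threshold at or below which restocking is triggered
    reorder_quantity: int   # assumed fixed quantity requested per restock order


def record_sale(item: StockItem, units_sold: int) -> Optional[dict]:
    """Deduct a register sale from shelf stock and return a restock order if needed."""
    item.on_shelf = max(item.on_shelf - units_sold, 0)
    if item.on_shelf <= item.reorder_point:
        # The sale itself raises the order; nobody has to notice the empty shelf.
        return {"sku": item.sku, "quantity": item.reorder_quantity}
    return None


item = StockItem(sku="beer-6pk", on_shelf=10, reorder_point=4, reorder_quantity=24)
first = record_sale(item, 3)    # 7 units left, above the threshold
second = record_sale(item, 3)   # 4 units left, threshold reached
print(first, second)
```

The same pattern can be chained: a shelf restock depletes the store room, which can raise a supplier order through an identical rule with its own threshold.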

Machine sensor data, the second category, is key to this level of process re-engineering. Such data consists essentially of raw, unadulterated operational events that can drive new or re-engineered processes. A prime example of this type of process invention comes from the automobile insurance industry, where on-board sensors of acceleration and braking, among others, transmit driving behavior information in real time to the insurance company. Premiums are adjusted based on measured behavior. This is an entirely new way of offering automobile insurance that was impossible before the advent of this type of big data, and it clearly allows new ways of achieving business value.
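As a rough illustration of premiums adjusted from measured behavior, the sketch below scales a base premium by the rate of harsh driving events. The pricing rule (a 5% discount for no events, a 2% surcharge per event per 100 km, capped at 30%) and the function name `adjusted_premium` are assumptions made up for this example; real insurers use their own actuarial models.

```python
def adjusted_premium(base_premium, harsh_brakes, rapid_accels, km_driven):
    """Scale a base premium by the rate of harsh driving events per 100 km.

    All weights below are illustrative assumptions, not an actual pricing model.
    """
    if km_driven <= 0:
        return base_premium  # no driving data yet, so leave the premium unchanged
    events_per_100km = (harsh_brakes + rapid_accels) / km_driven * 100
    if events_per_100km == 0:
        factor = 0.95  # assumed discount for event-free driving
    else:
        # Assumed 2% surcharge per event per 100 km, capped at +30%.
        factor = min(1.0 + 0.02 * events_per_100km, 1.30)
    return round(base_premium * factor, 2)


careful = adjusted_premium(1000.0, harsh_brakes=0, rapid_accels=0, km_driven=500)
average = adjusted_premium(1000.0, harsh_brakes=5, rapid_accels=5, km_driven=500)
print(careful, average)
```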

STATEMENT OF THE PROBLEM

Big Data is about the massive growth that we have seen in digital data as everything knowable becomes digitized and new forms of communication that only exist within the digital arena continue to be added to the mix. Data has been growing very quickly for a very long time, with the conventional estimate being a doubling every 18 months. The McKinsey Global Institute has estimated that enterprises globally stored more than 7 exabytes (7 × 2^60 bytes) of new data on disk drives in 2010, while consumers stored more than 6 exabytes of new data on devices such as PCs and notebooks. Volume by itself does not begin to describe the true picture, as illustrated in Figure 1. As different types of objects are brought into the data stream, such as voice, video, architectural plans, and customer comments, new issues emerge in how these items can be processed, stored, accessed and indeed how they can be differentiated from each other. As we move into a world where the digital stream is largely unstructured, is not necessarily stored in an ordered way, exists in real time, and exists in formats that have special processing requirements, the old assumptions begin to break down. This is the root of the Big Data problem.

Big Data is not just about Analytics, though this is perhaps the most critical area. It is also about Organization, Categorization and Access to data. There is an increasing realization that all data is not alike, and this means that the uniform models previously used to manage, store, analyze and retrieve it no longer operate so effectively. Not only is the amount much greater, but the differentiation is also greater, and techniques used to shoehorn unwilling data objects (for example BLOBs) into unnatural arrangements soon break down when any kind of real access is required. Extraordinary growth in data, although predictable, continues to strain corporate resources in both the infrastructure and processing sectors.
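The arithmetic behind these figures can be checked with a short sketch. It takes an exabyte as 2^60 bytes and applies the conventional 18-month doubling estimate; combining that estimate with the 2010 enterprise figure is purely an illustrative assumption here, not a projection made by the McKinsey Global Institute.

```python
EXABYTE = 2 ** 60  # bytes, using the binary convention


def projected_volume(initial_bytes, months, doubling_months=18):
    """Project data volume assuming it doubles every `doubling_months` months."""
    return initial_bytes * 2 ** (months / doubling_months)


enterprise_2010 = 7 * EXABYTE   # new enterprise data stored in 2010
consumer_2010 = 6 * EXABYTE     # new consumer data stored in 2010

# Illustrative extrapolation only: 36 months is two doubling periods.
after_three_years = projected_volume(enterprise_2010, months=36)
print(enterprise_2010, "bytes in 2010")
print(after_three_years / EXABYTE, "exabytes after three years at this rate")
```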

OBJECTIVES OF PROPOSED RESEARCH WORK

The primary objective of the research work is the development of a low-cost software platform for big data. The proposed development has the following subsidiary objectives:

• Data storage

• Data recovery

• Applications

• Business Continuity

• Security

• Networking and network infrastructure

• Content management

• Content hosting

• Content analysis

The current areas of impact are within the development of alternative database designs, particularly within the “NOSQL” movement, and within Advanced Analytics, where Hadoop and MapReduce are becoming increasingly important and are defining new market sectors.
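For readers unfamiliar with the MapReduce model mentioned above, the following is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to a word count. It does not use the Hadoop API; the function names are hypothetical, and a real Hadoop job would distribute these stages across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["Big Data is a big deal", "data about data"])
print(counts)
```

Because each map call looks at one document and each reduce call looks at one key, both stages can in principle run in parallel on separate machines, which is what makes the model attractive for very large data sets.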

RESEARCH METHODOLOGY

The methodology of implementing the proposed system is divided into the following steps:

Analysis and Interpretation of Data: During this step the analysis and interpretation of data will be prepared, and the functionalities of different Big Data techniques will be studied. The general architecture of the proposed research will be planned on paper.

Data Collection: During this step the various big data complexities are considered. As different types of objects are brought into the data stream, such as voice, video, architectural plans, and customer comments, new issues emerge in how these items can be processed, stored, accessed and indeed how they can be differentiated from each other. The data will be collected in light of this discussion.

Testing: In this step, the analysis will be carried out with respect to the big data complexities. Several areas of fundamental importance in understanding Big Data will be tested against these complexities.

These are:

• Big Data infrastructure

• Big Data Analytics

• Cloud facilitation of Big Data processing

While in this testing phase much of the current attention has been focused upon Analytics, the infrastructure issues are equally important, particularly with respect to developing a capacity to access, manage and process petabytes of data.

CONCLUSION AND DISCUSSION

For those contemplating investment in big data, the most important conclusion from this article is to recognize that there are very specific combinations of circumstances in which big data can drive real business value. Sometimes, of course, it is the price for simply staying in the game, as we saw in the case of retailing. Other times, it can open up a new market slot for profitable exploitation, at least for the first movers. Recognizing and taking advantage of such opportunities demands a close partnership between IT and the business to understand all aspects of the situation.

In this paper we have looked at how the market for Big Data infrastructure, technologies and solutions is evolving in response to the explosion in both structured and unstructured content of all types. The Cloud is emerging as a particularly useful platform for Big Data solutions for both infrastructure and analytics, as it is an ideal platform for massive data crunching and analysis at affordable prices.

“Big Data” is an increasingly used but often ill-defined term, spurred in large part by the growth of Cloud IT and Cloud Business. In this paper we address two important aspects of Big Data for enterprise IT and business leaders:

• Determining how emerging Big Data technologies can aid them in developing real business solutions.

• Understanding the components of Big Data solutions to make appropriate choices that meet specific requirements.

SCOPE OR DETAILS OF THE PROPOSED IMPLEMENTATION

Big data has the potential to change the way governments, organizations, and academic institutions conduct business and make discoveries, and it is likely to change how everyone lives their day-to-day lives. Today, many organizations can use a number of tools to make the management of big data more tolerable. These solutions include Microsoft SQL Server, PowerPivot, SharePoint, Office and Windows Server. As Figure 1 denotes, the areas affected include:

• Data storage

• Data recovery

• Applications

• Business Continuity

• Security

• Networking and network infrastructure

• Content management

• Content hosting

The future scope of big data solutions will be decided by considering several factors.

These are as follows:

• Does it handle high data velocity?

• Can it tackle all types of data?

• How well does it perform with large data volumes?

• Can it handle complex distribution and implementation use cases?

• How does it stack up in hitting the big data “bull’s eye” (i.e. cost, scalable performance, etc.)?

As processing of enormous databases and multi-gigabyte artifacts becomes more common, processes themselves will advance to both provide better management and to take advantage of the rich nature of evolving digital content. Thus, data growth will continue to impact all areas.

The current areas of impact are within the development of alternative database designs, particularly within the “NOSQL” movement, and within Advanced Analytics, where Hadoop and MapReduce are becoming increasingly important and are defining new market sectors.

REFERENCES

[1] www.ibm.com/software/data/bigdata/

[2]

[3] www.nextbillion.net/blog/the-age-of-big-data

[4] jana.com/research/sample/, www.nextbillion.net/blog/2011/10/05/reaching-the-next-billion-through-mobile

[5] ess.santafe.edu/publications.html

[6] ushahidi.com

[7]

[8]

[9] http://www.panasas.com/solutions/big-data

[10] http://compnetworking.about.com/od/itinformationtechnology/l/aa070101b.htm

[11] http://pcsupport.about.com/od/termss/g/serialata.htm

[12] http://en.wikipedia.org/wiki/Marketing_effectiveness

[13] http://en.wikipedia.org/wiki/Decision_model

[14] Advancing Discovery in Science and Engineering. Computing Community Consortium. Spring 2011.

[15] Using Data for Systemic Financial Risk Management. Mark Flood, H V Jagadish, Albert Kyle, Frank Olken, and Louiqa Raschid. Proc. Fifth Biennial Conf. Innovative Data Systems Research, Jan. 2011.

[16] The Age of Big Data. Steve Lohr. New York Times, Feb 11, 2012. http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html

[17] McKinsey Global Institute. “Big data: The next frontier for innovation, competition, and productivity.” May 2011. Available at: http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation

[18] The Economist, “Data, data everywhere.” Feb. 25, 2010. Available at: http://www.economist