
http://dx.doi.org/10.14257/AJMSCAHS.2012.06.04

Big Data is Not just Hadoop

Ronnie D. Caytiles


Abstract

When we started discussing this topic with our customers, we received very interesting feedback. First of all, they all thought that big data is just a buzzword that everyone is talking about, while nobody has actually seen anyone doing something reasonable with it. They also had the feeling that big data and Hadoop are synonyms, and moreover that Hadoop is dedicated solely to the processing of unstructured data, mainly text data [1][2]. In this paper, we will show that this is not the truth, at least from IBM's point of view.

Another reason that big data is a hot topic in the market today is the new technology that enables an organization to take advantage of the natural resource of big data. Big data itself isn't new; it has been around for a while and is growing exponentially. What is new is the technology to process and analyze it. The purpose of big data technology is to cost-effectively manage and analyze all of the available data, any data, as is. If you want to analyze structured data, then structure it. If you want to analyze an acoustic file, then analyze the acoustic file with appropriate analytics.

Keywords: Big Data is not just Hadoop, Business-Centric

1. Introduction

Big data technology is a new technology that enables an organization to take advantage of the natural resource of big data. Big data itself isn't new; it has been around for a while and is growing exponentially. What is new is the technology to process and analyze it. The purpose of big data technology is to cost-effectively manage and analyze all of the available data, any data, as is. If you want to analyze structured data, then structure it. If you want to analyze an acoustic file, then analyze the acoustic file with appropriate analytics.

Big data comes from a wide variety of sources. It comes from our traditional systems: billing systems, ERP systems, CRM systems. It also comes from machine data: RFID tags, sensors, network switches. And it comes from human-generated data: website data, social media, etc. [3].


traditional systems run are not there at all, and this is often something suspicious. So be prepared: in future, much deeper discussions with vendors, you will hear things that will surprise you and that will require a mental switch.

2. Related Work

As stated earlier, big data is not equal to Hadoop; Hadoop is just one subset of it. There are many more areas and components that are part of this field. The yellow elephant often shown alongside it is the Hadoop logo.

Many people think big data is about Hadoop technology. It is and it isn't. It's about a lot more than Hadoop. One of the key requirements is to understand and navigate federated sources of big data, that is, to discover data in place. New technology has emerged that discovers, indexes, searches, and navigates diverse sources of big data. Of course, big data is also about Hadoop. Hadoop is a collection of open source capabilities; two of the most prominent ones are the Hadoop Distributed File System (HDFS) for storing a variety of information, and MapReduce, a parallel processing engine.

Data warehouses also manage big data: the volume of structured data is growing quickly, and the ability to run deep analytic queries on huge volumes of structured data is a big data problem. It requires massively parallel processing data warehouses and purpose-built appliances for deep analytics. Big data isn't just at rest; it's also in motion. Streaming data represents an entirely different big data problem: the ability to quickly analyze and act upon data while it's still moving. This new technology opens a world of possibilities, from processing volumes of data that were simply not practical to store, to detecting insight and responding quickly [3][4]. As much of the world's big data is unstructured, textual content, text analytics is a critical component to analyze and derive meaning from text. And finally, there is integration and governance technology: ETL, data quality, security, MDM, and lifecycle management. Integration and governance technology establishes the veracity of big data and is critical in determining whether information is trusted.
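To make the MapReduce model mentioned above concrete, the following is a minimal word-count sketch written against the standard Hadoop MapReduce API (the classic introductory example, not anything specific to this paper); the input and output paths passed on the command line are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word across all mappers.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner is optional but reduces shuffle traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. raw text already sitting in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The mappers run in parallel wherever the data blocks live, and the framework shuffles each word to a single reducer, which is exactly the "parallel processing engine" role described above.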


[Fig. 1] System architecture diagram

3. Performance Evaluation

3.1 Unlock Big Data

The first pain point is around unlocking big data. What does it mean? A lot of business users at our customers have problems getting the data they need and looking at it from all the different perspectives. For example, imagine you are a marketing officer of a company producing cars, such as BMW, and you would like to see what the sales figures for a particular car type look like. But not only the sales data: you would also like to see how the marketing campaigns are going, what the feedback from the customers is, what their complaints are, how many cars of a given type are ready to be sold, and a lot of other different things. All these little puzzle pieces sit in different systems (ERP, CRM, campaign management, e-mails, social networks), and it is very valuable to have insight into all of them [5][6].

IBM acquired a company called Vivisimo, whose product Vivisimo Velocity was renamed to InfoSphere Data Explorer, and it is a solution for this business need. It leaves the data where it is and simply enables search and navigation over the existing sources of big data.

This type of implementation can yield significant business value, from cutting the manual effort needed to search and retrieve big data, to gaining a better understanding of existing sources of big data before further analysis. The payback period is often short.

Customer example: Procter & Gamble.

The entry point into the big data platform is Vivisimo Velocity (now InfoSphere Data Explorer); it enables federated search and navigation.
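The Data Explorer internals are not described in this paper, so purely as an illustration of the leave-the-data-in-place idea, here is a hypothetical sketch in plain Java: each existing system (ERP, CRM, e-mail archive) is wrapped in a small adapter, and a federator fans a query out and merges the hits. Every class and method name here is invented for the example and is not the product's API.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of federated search: the data stays in each
// source system; only the query is fanned out and the hits are merged.
public class FederatedSearch {

  // One hit coming back from any source (ERP, CRM, e-mail archive, ...).
  public record Hit(String sourceName, String documentId, double score) {}

  // Adapter that each existing system implements; it queries the system
  // in place instead of copying its data anywhere.
  public interface Source {
    String name();
    List<Hit> search(String query) throws Exception;
  }

  private final List<Source> sources;

  public FederatedSearch(List<Source> sources) {
    this.sources = sources;
  }

  // Send the query to all sources in parallel and merge the hits by score.
  public List<Hit> search(String query) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, sources.size()));
    try {
      List<Future<List<Hit>>> futures = new ArrayList<>();
      for (Source source : sources) {
        Callable<List<Hit>> task = () -> source.search(query);
        futures.add(pool.submit(task));
      }
      List<Hit> merged = new ArrayList<>();
      for (Future<List<Hit>> future : futures) {
        merged.addAll(future.get());
      }
      merged.sort(Comparator.comparingDouble(Hit::score).reversed());
      return merged;
    } finally {
      pool.shutdown();
    }
  }
}

The point of the sketch is only the shape of the solution: no central copy of the data is built, and each source answers from where the data already lives.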

3.2 Analyze Raw Data

Another pain point is around analyzing raw data. Everybody is talking about Hadoop. The Hadoop Distributed File System, or something similar to it, is a good place for storing big data. So, if you want to store raw data, meaning data as it is in its native format, no matter whether it comes from social networks, web pages, or various sensors or machines, it is a good idea to store it on a distributed file system for future analysis. But storing this data is just the first step. What will you do with this data next? It's valuable to analyze it, isn't it? But how? Existing tools for business intelligence or data mining don't work with data in a raw format, and mostly don't work with data stored in distributed parallel file systems.
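As a small illustration of that first storage step, the sketch below copies a local raw file into HDFS in its native format using the standard Hadoop FileSystem API; the namenode address and file paths are placeholders, not values from the paper.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: land a raw file in HDFS exactly as it is, with no parsing
// and no schema; analysis happens later with whatever tool fits the format.
public class RawIngest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS normally comes from core-site.xml; it is set explicitly
    // here only to make the example self-contained (placeholder address).
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

    try (FileSystem fs = FileSystem.get(conf)) {
      Path local = new Path("/tmp/sensor-dump.bin");           // placeholder local raw file
      Path target = new Path("/data/raw/sensors/2012/06/");    // landing directory in HDFS
      fs.mkdirs(target);
      fs.copyFromLocalFile(local, target);                     // stored as-is, in native format
      System.out.println("Stored " + local + " under " + target);
    }
  }
}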

You basically have two options: either use tools that can convert this data into something that existing data analysis tools can process, or use new tools designed specifically to analyze these new types of data. It is hard to say which of these approaches is better; it really depends on the situation at hand.

The primary need here is to analyze unstructured or semi-structured data from one or multiple sources. Often the content is textual and the meaning is hidden within the text. Another common need is to combine different data types, structured and unstructured, for a combined analysis.

Customers often gain significant value from this approach: they unlock insights that were previously unknown. Those insights can be the key to retaining a valuable customer, to identifying previously undetected fraud, or to discovering a game-changing efficiency in operational processes [3][7].

One client, a financial services regulatory organization, analyzed a variety of new data sources and integrated the insights with their existing data warehouse to further enhance their risk modeling processes. The big data platform entry point is InfoSphere BigInsights, a Hadoop-based analytics system.

InfoSphere BigInsights


There are two editions:

Basic Edition. The key point is that it is free. It includes the open source components as well as IBM value-add (DB2 integration, integrated installation), and a second key point is that support can be purchased for it. This is an excellent choice for companies that want to get a Hadoop environment up and running or conduct a proof of concept (POC), and it lays the foundation to turn that POC into a pilot or full enterprise deployment. As most companies start with POCs and research/IT is looking heavily at Hadoop, there is no need to choose between a paid-for enterprise platform and free open source; this is a perfect starting point.

Enterprise Edition. The key point is that it adds significant value on the same base platform, so you can grow into it. It includes text analytics, security, an integrated web console, scheduling, and more.

InfoSphere BigInsights provides an integrated solution for analyzing hundreds of terabytes, petabytes, or more of raw data derived from an ever-growing variety of sources. Starting at the bottom of the architecture, the core of BigInsights is a flexible data processing platform that includes an integrated installer that pre-configures environments for rapid deployment. Also included are monitoring and system management tools, as well as support for integrating with popular enterprise software products, such as DB2. A portion of this core is IBM's distribution of the Apache Hadoop project, which features the standard IBM licensing terms and conditions. (Some firms are concerned about conditions imposed by the GPL licensing associated with open source software.) The core technology of BigInsights fully supports massive scale-out on commodity hardware, allowing capacity to be added as needed. Furthermore, its inherent fault tolerance reduces downtime and maximizes productivity [8].

Moving up the stack, the enabling infrastructure of BigInsights includes a variety of application services to help firms analyze text and unstructured data, as well as to mine and score information. In addition, IBM Information Management is partnering with other IBM organizations and third parties to deliver solutions and applications based on BigInsights. Candidate areas include sentiment analysis, complex analytics, search and discovery applications, and others. BigInsights complements popular software solutions, including data warehouses and analytic tools. This speeds time-to-value, reduces ownership costs, and enables firms to leverage existing investments to drive new insights [4].
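BigInsights' own text analytics tooling is not detailed in this paper, so purely to illustrate the kind of scoring such application services automate, here is a toy lexicon-based sentiment sketch in plain Java; the word lists and the scoring rule are invented for the example and do not reflect the product.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy illustration only: score text by counting words from tiny hand-made
// positive/negative lexicons. Real text analytics (tokenization, negation,
// entity extraction, ...) is far more involved than this.
public class ToySentiment {
  private static final Set<String> POSITIVE =
      new HashSet<>(Arrays.asList("good", "great", "love", "excellent", "happy"));
  private static final Set<String> NEGATIVE =
      new HashSet<>(Arrays.asList("bad", "poor", "hate", "broken", "angry"));

  // Returns > 0 for mostly positive text, < 0 for mostly negative.
  public static int score(String text) {
    int score = 0;
    for (String token : text.toLowerCase().split("[^a-z]+")) {
      if (POSITIVE.contains(token)) score++;
      if (NEGATIVE.contains(token)) score--;
    }
    return score;
  }

  public static void main(String[] args) {
    System.out.println(score("I love the new model, great mileage"));          // positive
    System.out.println(score("The dealer was angry and the trunk is broken")); // negative
  }
}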

By moving suitable workloads onto the Hadoop platform, organizations are able to preserve their queries and take advantage of Hadoop's cost-effective processing capabilities.

As one customer example, a financial services firm moved the processing of applications and reports from an operational data warehouse to Hadoop HBase; they were able to preserve their existing queries and reduce the operating costs of their data management platform.
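The paper does not show that firm's actual schema, so as a rough sketch under assumed table and column names, this is how a report row might be written to and read back from HBase with the standard Java client API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: the table name, column family, and qualifiers below are
// assumptions for illustration, not the customer's real schema.
public class ReportStore {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath

    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("reports"))) {

      // Write one report row keyed by account and month.
      Put put = new Put(Bytes.toBytes("acct42-2012-06"));
      put.addColumn(Bytes.toBytes("r"), Bytes.toBytes("balance"), Bytes.toBytes("1034.50"));
      put.addColumn(Bytes.toBytes("r"), Bytes.toBytes("status"), Bytes.toBytes("OK"));
      table.put(put);

      // Read it back.
      Result result = table.get(new Get(Bytes.toBytes("acct42-2012-06")));
      String balance = Bytes.toString(result.getValue(Bytes.toBytes("r"), Bytes.toBytes("balance")));
      System.out.println("balance = " + balance);
    }
  }
}

Because rows are retrieved by key, existing report queries that look up a known account and period map naturally onto this kind of access pattern.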

The entry point for this pain is InfoSphere BigInsights – IBM’s Hadoop-based product.

References

[1] Rackspace, http://www.rackspace.com/.

[2] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, “Above the Clouds: A Berkeley View of Cloud Computing”, EECS, University of California, Berkeley, Tech. Rep., (2009).

[3] S. Pandey, L. Wu, S. Guru, and R. Buyya, “A Particle Swarm Optimization (PSO)-based Heuristic for Scheduling Workflow Applications in Cloud Computing Environment”, Proc. of IEEE AINA, (2010).

[4] Human Genome Project, http://www.ornl.gov/hgmis/home.shtml.

[5] Hadoop- Facebook, http://www.facebook.com/note.php?note id=1612 1578919.

[6] Hadoop at Twitter, http://www.slideshare.net/kevinweil/hadoop-attwitter- hadoop-summit-201.

[7] M. Hajjat, X. Sun, Y. E. Sung, D. Maltz, and S. Rao, “Cloudward Bound: Planning for Beneficial Migration of Enterprise Applications to the Cloud”, Proc. of ACM SIGCOMM, (2010) August.

[8] X. Cheng and J. Liu, “Load-Balanced Migration of Social Media to Content Clouds”, Proc. of ACM NOSSDAV, (2011) June.

[9] Y. Wu, C. Wu, B. Li, L. Zhang, Z. Li, and F. Lau, “Scaling Social Media Applications into Geo-Distributed Clouds”, Proc. of IEEE INFOCOM, (2012) March.
