Taming the Beast of Big Data


Jeff Zakrzewski


What is Big Data?

Some Sources of Big Data

Approaches to Big Data

The Hadoop Buzz

Vertical Perspective

Vendor Perspective

Role of the Future

Q & A


• As much as 80% of the world’s data is now in unstructured formats, created and held on the web. This data is increasingly associated with genuine cloud-based services that sit outside enterprise IT. The part of Big Data that relates to the expected explosive growth and creation of new value is the unstructured data mostly arising from these external sources.

• Data sets are growing at a staggering pace

• Expected to grow by 100% every year for at least the next 5 years.

• Most of this data is unstructured or semi-structured – generated by servers, network devices, social media, and distributed sensors.

• “Big Data” refers to such data because the volume (petabytes and exabytes), the type (semi-structured and unstructured, distributed), and the speed of growth (exponential) make traditional data storage and analytics tools insufficient and cost-prohibitive.

• An entirely new set of processing and analytic systems is required for Big Data; Apache Hadoop is one example of a Big Data processing system that has gained significant popularity and acceptance.

• According to a recent McKinsey Big Data report, Big Data can provide up to $300 billion in annual value to the US healthcare industry and can increase US retail operating margins by up to 60%. It’s no surprise that Big Data analytics is quickly becoming a critical priority for large enterprises across all verticals.

What is Big Data?

“Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.”


The usual Big Data characteristics are:

Volume: there is a lot of data to be analyzed and/or the analysis is extremely intense; either way, a lot of hardware is needed.

Variety: the data is not organized into simple, regular patterns as in a table; rather, text, images, and highly varied structures (or structures unknown in advance) are typical.

Velocity: the data comes into the data management system rapidly and often requires quick analysis or decision making.


Big Data – Trend Overview

Drivers

• Volume, variety, velocity, and complexity of incoming data streams
• Growth of the “Internet of Things” results in an explosion of new data
• Commoditization of inexpensive terabyte-scale storage hardware is making storage less costly… so why not store it?
• Enterprises increasingly need to store non-traditional and unstructured data in a way that is easily queried
• Desire to integrate all the data into a single source


Big Data – Trend Overview

Challenges

• Data comes from many different sources (enterprise apps, web, search, video, mobile, social conversations, and sensors)
• All of this information has become increasingly difficult to store in traditional relational databases and even data warehouses
• Unstructured or semi-structured text is difficult to query: how does one query a table with a billion rows?
• Culture, skills, and business processes
• Conceptual data modeling


Big Data – Trend Overview

Implications

• Emerging capabilities to process vast quantities of structured and unstructured data are bringing about changes in the technology and business landscapes
• As data sets get bigger and the time allotted to their processing shrinks, look for ever more innovative technology to help organizations glean the insights they’ll need to face an increasingly data-driven world


Have you processed your yottabyte today?

With the advent of Big Data comes even bigger storage capacity: we can now deal in yottabytes!

The National Security Agency (NSA) is already building a gigantic supercomputer to process this gigantic amount of information in the biggest spy center ever, larger than 17 football fields. The million-square-foot center will be more than five times the size of the US Capitol and will be able to sift through literally all electronic communications all over the world.

According to the Gizmodo technology blog, the Utah-based facility, which can process yottabytes (a quadrillion gigabytes) of data, is designed to “intercept, decipher, analyze, and store vast swaths of the world’s communications as they zap down from satellites and zip through the underground and undersea cables of international, foreign, and domestic networks.” It will be the centerpiece of the Global Information Grid and is set to go live in September 2013.


Big Data – The Byte Scale

The file size conversion table below shows the relationship between the file storage sizes that computers use. Binary calculations are based on units of 1,024, and decimal calculations are based on units of 1,000.

Name       Symbol  Binary  Decimal  Number of Bytes                        Equal to
kilobyte   KB      2^10    10^3     1,024                                  1,024 bytes
megabyte   MB      2^20    10^6     1,048,576                              1,024 KB
gigabyte   GB      2^30    10^9     1,073,741,824                          1,024 MB
terabyte   TB      2^40    10^12    1,099,511,627,776                      1,024 GB
petabyte   PB      2^50    10^15    1,125,899,906,842,624                  1,024 TB
exabyte    EB      2^60    10^18    1,152,921,504,606,846,976              1,024 PB
zettabyte  ZB      2^70    10^21    1,180,591,620,717,411,303,424          1,024 EB
yottabyte  YB      2^80    10^24    1,208,925,819,614,629,174,706,176      1,024 ZB

File size measures the size of a computer file. It is typically measured in bytes with a prefix. The actual amount of disk space consumed by the file depends on the file system. The maximum file size a file system supports depends on the number of bits reserved to store size information and the total size of the file system. For example, with FAT32, the size of one file cannot be equal to or larger than 4 GiB.
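As a quick worked example of the binary steps in the table above, here is a small Python sketch (standard library only; the function name is ours) that converts a raw byte count into a human-readable unit:

def human_size(num_bytes: int) -> str:
    """Convert a raw byte count into a human-readable binary unit."""
    units = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{size:,.2f} {unit}"
        size /= 1024  # binary units advance in steps of 2^10

print(human_size(2**80))  # 1.00 YB, per the last row of the table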


An Explosion in Data in Recent History!

• 6 billion mobile phones worldwide
• RFID tags: 1.8 billion in 2005, 4 billion in 2009, 30 billion in 2010
• Over 2.3 billion Internet users
• Twitter processes 12 terabytes of data every day (230 million tweets)
• Facebook processes 25 terabytes of data every day
• World Data Centre for Climate: 220 terabytes of web data, plus 9 petabytes of additional data
• Billions of financial transactions daily generate terabytes of data, with 24 petabytes processed in a single day
• The Human Genome Project, fully mapped in 2003: petabytes of data
• Hundreds of millions of videos


The Challenge: Bring Together a Large Volume and Variety of Data to Find New Insights

• Identify criminals and threats from disparate video, audio, and data feeds
• Make risk decisions based on real-time transactional data
• Predict weather patterns to plan optimal wind turbine usage and optimize capital expenditure on asset placement
• Detect life-threatening conditions at hospitals in time to intervene
• Multi-channel customer sentiment and experience analysis

This means analyzing a variety of data at enormous volumes, gaining insights on streaming data, and handling large volumes of structured data.


The Big Data Approach: Information Sources Drive Creative Discovery

1. Business and IT identify the information sources available.
2. IT delivers a platform that enables creative exploration of all available data and content.
3. Business determines what questions to ask by exploring the data and relationships.
4. New insights drive integration into traditional technology.


[Diagram: an Enterprise Data Platform linking traditional data sources, the operational data store, warehouses and appliances, and business analytic applications and solutions with a Big Data platform that spans streaming analytics and Internet-scale analytics over source data (web, sensors, logs, media, etc.), plus a Big Data enterprise engine, Big Data applications and solutions, a user environment for developers, end users, and administrators, and client and partner solutions.]

• Manage Big Data from the instant it enters the enterprise
• High fidelity: no changes to the original format
• Available for new uses, analyses, and integrations
• Govern: quality, lifecycle management, security, privacy


Data Processing and Analytics: The Old Way

Traditionally, data processing for analytic purposes follows a fairly static blueprint. Through the regular course of business, enterprises create modest amounts of structured data with stable data models via enterprise applications such as CRM, ERP, and financial systems. Data integration tools extract, transform, and load the data from enterprise applications and transactional databases into a staging area, where data quality checks and data normalization (hopefully) occur and the data is modeled into neat rows and tables. The modeled, cleansed data is then loaded into an enterprise data warehouse. This routine occurs on a scheduled basis, usually daily or weekly, sometimes more frequently.
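As a compressed sketch of that blueprint (the table names, file names, and schedule are placeholders, and SQLite stands in for both the source system and the warehouse):

import sqlite3

source = sqlite3.connect("crm.db")    # placeholder transactional system
warehouse = sqlite3.connect("dw.db")  # placeholder enterprise data warehouse

# Extract: pull yesterday's orders from the enterprise application.
rows = source.execute(
    "SELECT customer_id, amount, order_date FROM orders "
    "WHERE order_date = date('now', '-1 day')"
).fetchall()

# Transform: the staging step, where cleansing and normalization (hopefully) occur.
staged = [(cid, round(amt, 2), day) for cid, amt, day in rows if amt is not None]

# Load: append into the modeled warehouse table on the nightly schedule.
warehouse.executemany(
    "INSERT INTO fact_orders (customer_id, amount, order_date) VALUES (?, ?, ?)",
    staged,
)
warehouse.commit()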


Big Data Analytics Complements the DW

• Transactional Big Data projects cannot use Hadoop, as it is not real-time. Transactional systems that do not need a database with ACID² guarantees can use NoSQL databases, though these come with constraints such as weak consistency guarantees (e.g., eventual consistency) or transactions restricted to a single data item. For Big Data transactional SQL databases that do need ACID² guarantees, the choices are limited: traditional scale-up databases are usually too costly for very large-scale deployment and don’t scale out very well, and most social media sites have had to hand-craft solutions. Recently a new breed of scale-out SQL databases has emerged with architectures that move the processing next to the data (in the same way as Hadoop), such as Clustrix; these allow much greater scale-out.

• This area is growing extremely fast, with many new entrants into the market expected over the next few years.

² ACID stands for atomicity, consistency, isolation, and durability.
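For context, here is a minimal illustration of what those ACID guarantees mean in practice, using Python’s built-in sqlite3 module (a single-node stand-in, not one of the scale-out products above): both writes commit together or not at all.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])

try:
    # "with conn" opens a transaction: commit on success, rollback on any error.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
except sqlite3.Error:
    pass  # atomicity: on failure, neither update is visible

print(conn.execute("SELECT id, balance FROM accounts").fetchall())  # [(1, 50), (2, 50)]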


Merging Traditional and Big Data Approaches

Traditional approach: business users determine what question to ask, and IT structures the data to answer that question. Examples: monthly sales reports, profitability analysis, customer surveys.

Big Data approach (iterative and exploratory analysis): IT delivers a platform to enable creative discovery, and business explores what questions could be asked. Examples: brand sentiment, product strategy, maximum asset utilization, preventative care.


Enterprise Integration

Trusted information and governance: companies need to govern what comes in, and the insights that come out.

Data management: insights from Big Data must be incorporated into the warehouse.

[Diagram: the Big Data platform and the data warehouse linked through enterprise integration.]


Big Data and Hadoop

What is Hadoop?

The most well-known technology used for Big Data is Hadoop. It was inspired by Google’s publications on MapReduce, GoogleFS, and BigTable. Because Hadoop can be hosted on commodity hardware (usually Intel PCs running Linux with one or two CPUs and a few TB of HDD, without any RAID replication technology), it allows organizations to store huge quantities of data (petabytes or even more) at very low cost compared to SAN systems.

Hadoop is an open-source version of Google’s MapReduce framework. It is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation: http://hadoop.apache.org/.

The Hadoop “brand” contains many different tools. Two of them are core parts of Hadoop:

Hadoop Distributed File System (HDFS) is a virtual file system that looks like any other file system, except that when you move a file onto HDFS, the file is split into many smaller blocks, each of which is replicated and stored on three servers (the default; this is customizable) for fault tolerance.
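As a toy illustration of that split-and-replicate idea (this is not HDFS code; the block size, server names, and round-robin placement are simplified assumptions):

import itertools

BLOCK_SIZE = 64 * 1024 * 1024  # HDFS's historical default block size
REPLICATION = 3                # default replica count, customizable in HDFS

def place_blocks(data: bytes, servers: list[str]):
    """Split data into fixed-size blocks and assign each to REPLICATION servers."""
    rotation = itertools.cycle(servers)
    placements = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        replicas = [next(rotation) for _ in range(REPLICATION)]
        placements.append((offset, len(block), replicas))
    return placements

# A 150 MB "file" becomes three blocks, each stored on three of four nodes.
for row in place_blocks(b"x" * (150 * 1024 * 1024), ["node1", "node2", "node3", "node4"]):
    print(row)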

Hadoop MapReduce is a way to split every request into smaller requests that are sent to many small servers, allowing truly scalable use of CPU power.
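The canonical example of this split-apply-combine pattern is word count. Below is a minimal sketch using Hadoop Streaming, which lets mappers and reducers be plain scripts reading stdin and writing stdout (the input/output paths and jar location in the run command are illustrative):

#!/usr/bin/env python
# mapper.py: Hadoop feeds each input line on stdin; emit "word<TAB>1" pairs.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python
# reducer.py: Hadoop delivers mapper output sorted by key, so equal words arrive adjacent.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

# Run across the cluster, e.g.:
# hadoop jar hadoop-streaming.jar -input /logs -output /counts \
#     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py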


How does Hadoop help?

What problems can Hadoop solve?

• The Hadoop framework is used by major players including Google, Yahoo!, IBM, eBay, LinkedIn, and Facebook, largely for applications involving search engines and advertising. The preferred operating systems are Windows and Linux, but Hadoop can also work with BSD and OS X. Hadoop was originally the name of a stuffed toy elephant belonging to a child of the framework’s creator, Doug Cutting.

• Mike Olson (Cloudera): The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn't fit nicely into tables. It's for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That's exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.

• Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine; Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they’re more likely to buy the thing you show them, that sort of problem is well addressed by the platform.

What does the Hadoop architecture look like?


The free open-source application Apache Hadoop is available for enterprise IT departments to download, use, and change however they wish. But for many business users, the need for support and technical expertise often overshadows the lure of free do-it-yourself applications, especially when critical IT systems are at stake.

That’s where supported, enterprise-ready versions of Hadoop can be a better, more realistic option. Here is a sampling of some of the major commercial vendors that can help your company get started with Hadoop. Some offer on-premises software packages; others sell Hadoop in the cloud. Hadoop database appliances are also beginning to appear, including the recently announced joint effort by Oracle and Cloudera.

• Amazon Web Services runs Amazon Elastic MapReduce, a hosted Hadoop framework running on Amazon’s Elastic Compute Cloud and its Simple Storage Service
• The Cloudera Enterprise subscription service
• The Datameer Analytics Solution using Hadoop
• The DataStax Enterprise Hadoop software
• Greenplum, a division of EMC, offers Greenplum HD Enterprise-Ready Apache Hadoop
• The Hortonworks Data Platform
• BigInsights, an unstructured-data cloud service from IBM based on Hadoop
• Karmasphere Analyst, a toolkit to help produce data using Hadoop
• MapR provides an enterprise-ready M5 edition of its Hadoop software

This list features only some of the many vendors offering enterprise Hadoop products and services today. The number of vendors is constantly growing as Hadoop gains steady traction in the data marketplace.


WHY HADOOP?

Hadoop
• Open-source platform supporting large-scale parallel processing across thousands of servers
• Massive-scale distributed file system holding petabytes of data

Customer Requirements
• Very affordable, scalable storage (petabytes)
• Want to store complete transaction data
• Flexible schema: new datasets with new schemas created regularly
• Scalable, flexible analytics, e.g., generation of models of fraudulent card usage
• Job fault tolerance

Hadoop Benefits
• We showed that jobs that took multiple weeks were reduced to hours with Hadoop
• “Fundamentally change what they are able to do”


Big Data Market

The Big Data market is on the verge of a rapid growth spurt that will see it top the $50 billion mark worldwide within the next five years.

As of early 2012, the Big Data market stands at just over $5 billion based on related software, hardware, and services revenue. Increased interest in and awareness of the power of Big Data and related analytic capabilities to gain competitive advantage and improve operational efficiencies, coupled with developments in the technologies and services that make Big Data a practical reality, will result in a super-charged CAGR of 58% between now and 2017.
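The arithmetic behind that projection, as a quick sanity check (figures from the paragraph above):

market = 5.0  # $ billions, early 2012
cagr = 0.58
for year in range(2013, 2018):
    market *= 1 + cagr
    print(year, round(market, 1))
# prints roughly 49.2 for 2017, in line with the ~$50 billion forecast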


Big Data Market Forecast

Big Data is the new definitive source of competitive advantage across all industries. For organizations that understand and embrace the new reality of Big Data, the possibilities for new innovation, improved agility, and increased profitability are nearly endless.

Below is Wikibon’s five-year forecast for the Big Data market as a whole:


Big Data Pure-Play Vendors Annual Revenue

Below is a worldwide revenue breakdown of the top Big Data pure-play vendors as of February 2012. (Source: Wikibon, 2012)


Big Data Pure-Play Vendors Market Share

Below is a breakdown of market share among the pure-play segment of the Big Data market. (Source: Wikibon, 2012)


Components of Big Data Processing

Big Data projects have a number of different layers of abstraction, from abstraction of the data through to running analytics against the abstracted data. Figure 1 shows the common components of analytical Big Data systems and their relationships to each other. The higher-level components help make Big Data projects easier and more productive. Hadoop is often at the center of Big Data projects, but it is not a prerequisite.


“The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012”

The Forrester Wave is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave are trademarks of Forrester Research, Inc. The Forrester Wave is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.


IBM Big Data Platform

[Diagram: the IBM Big Data Platform, built on open-source foundation components (Eclipse, Oozie, Hadoop, HBase, Pig, Lucene, Jaql), with Big Data enterprise engines (InfoSphere BigInsights, InfoSphere Streams), productivity tools and optimization, and Big Data accelerators (text, image/video, financial time series, statistics, mining, geospatial, mathematical, acoustic, workload management and optimization). Connectors, applications, and blueprints link the platform to client and partner solutions and to IBM solutions for data growth management (InfoSphere Optim), database (DB2), data warehousing (InfoSphere Warehouse), master data management (InfoSphere MDM), warehouse appliances (IBM Netezza), marketing (IBM Unica), content analytics (ECM), business analytics (Cognos and SPSS), and information integration (InfoSphere Information Server).]


BigInsights Platform and Roadmap

[Diagram: the BigInsights Enterprise Console (data explorer, application flows, dashboards/reports, administration) sits atop the BigInsights Enterprise Engine, which combines languages (Jaql, Pig, Hive, HBase), workflow orchestration, workload prioritization, MapReduce (Hadoop), file systems (GPFS+, HDFS), and analytics (machine learning, text). It indexes data from DBs, JMS, HTTP, web and application logs, crawlers, and Streams, and integrates with Unica, DB2, Netezza, DataStage, SPSS, and Cognos. Components are marked as open source, IBM complementary value, or IBM differentiating value, serving analysts, DBAs, and programmers, with manageability, consumability, and integration as the roadmap themes.]


The Informatica Approach

[Diagram: Informatica connects application, database, and partner data, industry formats (SWIFT, NACHA, HIPAA), cloud computing, and unstructured sources to data warehousing, data migration, test data management and archiving, master data management, data synchronization, B2B data exchange, and data consolidation.]


Pentaho and DataStax

Pentaho and DataStax will offer the first Cassandra-based big data analytics solution that combines the highly scalable, low-latency performance of Cassandra with Kettle’s visual interface for high-performance data extract, transformation and load, as well as integrated reporting, visualization and interactive analysis capabilities. This will make it easier for developers and data scientists to operationalize, integrate and analyze both big data and traditional data sources.


DataStax Enterprise – real-time, analytic, and search capabilities in one integrated big data platform


Enhancing Fraud Detection for Banks and Credit Card Companies

Scenario
• Build up-to-date models from transactional data to feed real-time risk-scoring systems for fraud detection.

Requirement
• Analyze volumes of data with response times that are not possible today.
• Apply analytic models to the individual client, not just the client segment (see the sketch below).

Benefits
• Detect transaction fraud in progress; allow fraud models to be updated in hours rather than weeks.
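A hedged sketch of the per-client scoring idea (every name, model, and threshold here is hypothetical, not from any vendor’s product): score each transaction against the individual client’s own history rather than a segment average.

from dataclasses import dataclass

@dataclass
class ClientModel:
    """Hypothetical per-client model: mean and spread of past transaction amounts."""
    mean_amount: float
    std_amount: float

def risk_score(model: ClientModel, amount: float) -> float:
    # Distance from this client's own history, in standard deviations.
    if model.std_amount == 0:
        return 0.0
    return abs(amount - model.mean_amount) / model.std_amount

models = {"client-42": ClientModel(mean_amount=80.0, std_amount=25.0)}
for client_id, amount in [("client-42", 75.0), ("client-42", 900.0)]:
    score = risk_score(models[client_id], amount)
    print(client_id, amount, "FLAG" if score > 3 else "ok")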


Social Media Analysis for Products, Services, and Brands

Scenario
• Monitor data from various sources such as blogs, boards, news feeds, tweets, and social media for information pertinent to the brand and products, as well as competitors.

Requirement
• Extract and aggregate relevant topics and relationships, discover patterns, and reveal up-and-coming topics and trends (a minimal sketch follows below).

Benefits
• Brand management for marketing campaigns; brand protection for ad placement networks.
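A minimal sketch of the extract-and-aggregate step, counting hashtag topics across a stream of posts (the sample posts and brand names are invented):

from collections import Counter

posts = [
    "Loving the new #WidgetPro battery life!",
    "#WidgetPro camera is disappointing compared to #RivalCam",
    "Anyone tried #RivalCam? Thinking of switching.",
]

# Treat hashtags as topic markers; strip trailing punctuation and normalize case.
topic_counts = Counter(
    word.strip(".,!?").lower()
    for post in posts
    for word in post.split()
    if word.startswith("#")
)
print(topic_counts.most_common())  # rising topics ranked by mention volume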


Store Clustering Analysis in the Retail Industry

Clustering attributes: age range, education, income, children, assets, urbanicity.

Scenario
• A retailer with a large number of stores needs to understand cluster patterns of shoppers.

Requirement
• Use shopping patterns across multiple characteristics such as location, income, and family size for better product placement (see the sketch below).

Benefits
• Store-specific clustering of products; clustering specific types of products by location.
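A minimal sketch of the clustering step, assuming scikit-learn is available (the store attributes and values below are invented placeholders for the characteristics listed above):

import numpy as np
from sklearn.cluster import KMeans

# One row per store: [median shopper age, median income ($k),
# children per household, urbanicity score]. All values invented.
stores = np.array([
    [34, 95, 0.4, 0.9],
    [52, 60, 2.1, 0.2],
    [29, 48, 0.3, 0.8],
    [47, 72, 1.8, 0.3],
    [31, 88, 0.5, 0.9],
    [55, 58, 2.3, 0.1],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(stores)
for store_id, label in enumerate(kmeans.labels_):
    print(f"store {store_id} -> cluster {label}")  # group stores for product placement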


Healthcare and Energy Industry

Healthcare: IBM Stream Computing for Smarter Healthcare. InfoSphere Streams-based analytics can alert hospital staff to impending life-threatening infections in premature infants up to 24 hours earlier than current practices.

Energy: Vestas Wind Systems uses IBM big data analytics software and powerful IBM systems to improve wind turbine placement for optimal energy output.


Big Data – Some References

• Forrester: The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012
• IBM Software: Big Data and Data Management
• IBM Systems: Big Data
• IBM: Big Data and Better Business Outcomes – A Strategic Foundation for Analytics
• International Data Corporation (IDC)
• McKinsey Global Institute
• Microsoft: Big Data
• Oracle: Big Data
• EMC Greenplum: Big Data
• Cloudera.com
• Hadoop.com
• Wikibon: Big Data
