Taming the Beast of Big Data
Jeff Zakrzewski
Local Touch, Global Reach
What is Big Data?
Some Sources of Big Data
Approaches to Big Data
The Hadoop Buzz
Vertical Perspective
Vendor Perspective
Role of the Future
Q & A
• As much as 80% of the world’s data is now in unstructured formats, much of it created and held on the web. This data is increasingly associated with genuine cloud-based services used outside enterprise IT. The part of Big Data expected to deliver explosive growth and new value is the unstructured data arising mostly from these external sources.
• Data sets are growing at a staggering pace
• Expected to grow by 100% every year for at least the next 5 years.
• Most of this data is unstructured or semi-structured – generated by servers, network devices, social media, and distributed sensors.
• “Big Data” refers to such data because the volume (petabytes and exabytes), the type (semi- and unstructured, distributed), and the speed of growth (exponential) make traditional data storage and analytics tools insufficient and cost-prohibitive.
• An entirely new set of processing and analytic systems is required for Big Data, with Apache Hadoop being one example of a Big Data processing system that has gained significant popularity and acceptance.
• According to a recent McKinsey Big Data report, Big Data can provide up to $300 billion annual value to the US Healthcare industry, and can increase US retail operating margins by up to 60%. It’s no surprise that Big Data analytics is quickly becoming a critical priority for large enterprises across all verticals.
What is Big Data?
“Big data is a term applied to data sets whose size is beyond the ability
of commonly used software tools to capture, manage, and process the
data within a tolerable elapsed time.”
The usual big data characteristics are:
Volume: there is a lot of data to be analyzed and/or the analysis is extremely intense; either way, a lot of hardware is needed.
Variety: the data is not organized into simple, regular patterns as in a table; rather, text, images and highly varied structures—or structures unknown in advance—are typical.
Velocity: the data comes into the data management system rapidly and often requires quick analysis or decision making.
Big Data – Trend Overview
Drivers
Volume, variety, velocity, and complexity of incoming data streams
Growth of “Internet of Things” results in explosion of new data
Commoditization of inexpensive terabyte-scale storage hardware is making storage less costly … so why not store it?
Increasingly, enterprises need to store non-traditional and unstructured data in a way that is easily queried
Desire to integrate all the data into a single source
Big Data – Trend Overview
Challenges
Data comes from many different sources (enterprise apps, web,
search, video, mobile, social conversations and sensors)
All of this information has been getting increasingly difficult to store in
traditional relational databases and even data warehouses
Unstructured or semi-structured text is difficult to query. How does one
query a table with a billion rows?
Culture, skills, and business processes
Conceptual Data Modeling
Big Data – Trend Overview
Implications
Emerging capabilities to process vast quantities of structured and
unstructured data are bringing about changes in technology and
business landscapes
As data sets get bigger and the time allotted to their processing
shrinks, look for ever more innovative technology to help organizations
glean the insights they'll need to face an increasingly data-driven world.
Have you processed your Yottabyte today?
With the advent of big data comes
even bigger storage capacity – now
we can deal in Yottabytes!
The National Security Agency (NSA) is already building a gigantic supercomputer to process this flood of information in the biggest spy center ever (bigger than 17 football fields). The million-square-foot center will be more than five times the size of the US Capitol and will be able to sift through literally all electronic communications all over the world.
The Utah-based facility, which can process yottabytes (a quadrillion gigabytes) of data (according to the Gizmodo technology blog), is designed to “intercept, decipher, analyze, and store vast swaths of the world’s communications as they zap down from satellites and zip through the underground and undersea cables of international, foreign, and domestic networks.” It will be the centerpiece of the Global Information Grid and is set to go live in September 2013.
Big Data – The Byte Scale
The file size conversion table below shows the relationship between the file storage sizes that computers use. Binary calculations are based on units of 1,024, and decimal calculations are based on units of 1,000.
File size measures the size of a computer file. Typically it is measured in bytes with a prefix. The actual amount of disk space consumed by the file depends on the file system. The maximum file size a file system supports depends on the number of bits reserved to store size information and the total size of the file system. For example, with FAT32, the size of one file cannot be equal to or larger than 4 GiB.

Name        Symbol   Binary Measurement   Decimal Measurement   Number of Bytes                        Equal to
kilobyte    KB       2^10                 10^3                  1,024                                  1,024 bytes
megabyte    MB       2^20                 10^6                  1,048,576                              1,024 KB
gigabyte    GB       2^30                 10^9                  1,073,741,824                          1,024 MB
terabyte    TB       2^40                 10^12                 1,099,511,627,776                      1,024 GB
petabyte    PB       2^50                 10^15                 1,125,899,906,842,624                  1,024 TB
exabyte     EB       2^60                 10^18                 1,152,921,504,606,846,976              1,024 PB
zettabyte   ZB       2^70                 10^21                 1,180,591,620,717,411,303,424          1,024 EB
yottabyte   YB       2^80                 10^24                 1,208,925,819,614,629,174,706,176      1,024 ZB
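As a quick illustration of the two conventions in the table, the snippet below (not part of the original deck; names are illustrative only) prints each unit's binary (1,024-based) and decimal (1,000-based) byte counts.

```java
import java.math.BigInteger;

// Illustrative sketch: prints the binary (2^(10n)) and decimal (10^(3n)) sizes for each
// unit from the table above, so the two columns can be reproduced and checked.
public class ByteScale {
    public static void main(String[] args) {
        String[] units = {"kilobyte", "megabyte", "gigabyte", "terabyte",
                          "petabyte", "exabyte", "zettabyte", "yottabyte"};
        for (int i = 0; i < units.length; i++) {
            BigInteger binary  = BigInteger.valueOf(2).pow(10 * (i + 1)); // 1,024-based
            BigInteger decimal = BigInteger.TEN.pow(3 * (i + 1));         // 1,000-based
            System.out.printf("%-9s binary=%s decimal=%s%n", units[i], binary, decimal);
        }
    }
}
```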
An Explosion in Data in Recent History!
• 6 billion mobile phones worldwide
• 1.8 billion RFID tags in 2005, 4 billion in 2009, 30 billion in 2010
• Over 2.3 billion Internet users
• Twitter processes 12 terabytes of data every day (230 million tweets)
• Facebook processes 25 terabytes of data every day
• World Data Centre for Climate: 220 terabytes of web data, 9 petabytes of additional data
• Billions of financial transactions daily (terabytes of data); 24 petabytes of data processed in a single day
• The Human Genome Project: fully mapped in 2003, petabytes of data
• 100s of millions of videos
The Challenge:
Bring Together a Large Volume and Variety of Data to Find New Insights
• Identify criminals and threats from disparate video, audio, and data feeds
• Make risk decisions based on real-time transactional data
• Predict weather patterns to plan optimal wind turbine usage, and optimize capital expenditure on asset placement
• Detect life-threatening conditions at hospitals in time to intervene
• Multi-channel customer sentiment and experience analysis
• Analyzing a variety of data at enormous volumes: insights on streaming data, large-volume structured data
The Big Data Approach:
Information Sources Drive Creative Discovery
• Business and IT identify the information sources available
• IT delivers a platform that enables creative exploration of all available data and content
• Business determines what questions to ask by exploring the data and relationships
• New insights drive integration to traditional technology
[Diagram: Business Analytic Applications and Solutions; Warehouse and Appliances; traditional data sources and Operational Data Store; Big Data; Enterprise Data Platform]
Big Data Platform
• Manage Big Data from the instant it enters the enterprise
• High fidelity – no changes to original format
• Available for new uses, analyses, and integrations
[Diagram: Big Data Applications; Big Data Solutions; Big Data Enterprise Engine; Big Data User Environment (Developers, End Users, Admin.); Client and Partner Solutions; source data (Web, sensors, logs, media, etc.); streaming analytics; Internet-scale analytics; Govern: Quality, Lifecycle Management, Security, Privacy]
Data Processing and Analytics: The Old Way
Traditionally, data processing for analytic purposes follows a fairly static blueprint. Namely, through the regular course of business, enterprises create modest amounts of structured data with stable data models via enterprise applications like CRM, ERP and financial systems. Data integration tools are used to extract, transform and load the data from enterprise applications and transactional databases to a staging area where data quality and data normalization (hopefully) occur and the data is modeled into neat rows and tables. The modeled, cleansed data is then loaded into an enterprise data warehouse. This routine usually occurs on a scheduled basis, usually daily or weekly, sometimes more frequently.
• Transactional Big-data projects cannot use Hadoop, as it is not real-time. For transactional systems that do not need a database with ACID² guarantees, NoSQL databases can be used, though there are constraints such as weak consistency guarantees (e.g., eventual consistency) or transactions restricted to a single data item. For big-data transactional SQL databases that do need the ACID² guarantees, the choices are limited. Traditional scale-up databases are usually too costly for very large-scale deployment and don't scale out very well; most large social media sites have had to hand-craft solutions. Recently a new breed of scale-out SQL databases has emerged with architectures that move the processing next to the data (in the same way as Hadoop), such as Clustrix. These allow far greater scale-out.
• This area is growing extremely fast, with many new entrants into the market expected over the next few years.
² ACID stands for atomicity, consistency, isolation, and durability.
Big Data Analytics Complements the DW
Merging Traditional and Big Data Approaches
Traditional Approach
• Business users determine what question to ask
• IT structures the data to answer that question
• Examples: monthly sales reports, profitability analysis, customer surveys
Big Data Approach – Iterative & Exploratory Analysis
• IT delivers a platform to enable creative discovery
• Business explores what questions could be asked
• Examples: brand sentiment, product strategy, maximum asset utilization, preventative care
Enterprise Integration
• Trusted Information & Governance: companies need to govern what comes in, and the insights that come out
• Data management: insights from Big Data must be incorporated into the warehouse
[Diagram: Big Data Platform and Data Warehouse connected through Enterprise Integration]
Big data and Hadoop
What is Hadoop?
The most well-known technology used for Big Data is Hadoop. It was inspired by Google's publications on MapReduce, GoogleFS and BigTable. Because Hadoop can be hosted on commodity hardware (usually Intel PCs running Linux with one or two CPUs and a few TB of HDD, without any RAID replication technology), it allows organizations to store huge quantities of data (petabytes or even more) at very low cost compared to SAN systems.
Hadoop is an open-source version of Google’s MapReduce framework. It is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation:
http://hadoop.apache.org/.
The Hadoop “brand” contains many different tools. Two of them are core parts of Hadoop:
Hadoop Distributed File System (HDFS) is a virtual file system that looks like any other file system, except that when you move a file onto HDFS the file is split into many blocks, and each block is replicated and stored on three servers by default (the replication factor is configurable) for fault tolerance. A minimal API sketch follows this list.
Hadoop MapReduce is a way to split every job into smaller tasks that are sent to many servers, allowing a truly scalable use of CPU power; see the word-count sketch below.
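To make the HDFS behavior above concrete, here is a minimal sketch (not from the deck) using the standard org.apache.hadoop.fs API; the path and the replication setting are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a small file to HDFS and report its size.
// Assumes fs.defaultFS points at a running cluster; the path and replication are illustrative.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");              // default block replication described above
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");                // the file is split into blocks, each replicated
        }
        System.out.println("Stored " + fs.getFileStatus(file).getLen() + " bytes");
        fs.close();
    }
}
```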
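And here is a minimal sketch of the canonical word-count job written against the org.apache.hadoop.mapreduce API, showing how a job is split into map tasks whose partial results are merged by reducers; the HDFS input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Each map task tokenizes its input split and emits (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Each reduce task sums the counts for the words routed to it.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);           // local pre-aggregation on each node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```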
How does Hadoop help?
What problems can Hadoop solve?
• The Hadoop framework is used by major players including Google, Yahoo, IBM, eBay, LinkedIn and Facebook, largely for applications involving search engines and advertising. The preferred operating systems are Windows and Linux, but Hadoop can also work with BSD and OS X. Hadoop was originally the name of a stuffed toy elephant belonging to a child of the framework's creator, Doug Cutting.
• Mike Olson (Cloudera): The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn't fit nicely into tables. It's for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That's exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.
• Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they're more likely to buy the thing you show them, that sort of problem is well addressed by the platform.
What does the Hadoop architecture look like?
The free open source application, Apache Hadoop, is available for enterprise IT departments to download, use and change however they wish. But for many business users, the need for support and technical expertise often largely overshadows the lure of free do-it-yourself applications, especially when there are critical IT systems at stake.
That's where supported, enterprise-ready versions of Hadoop can instead be a better, more realistic option. Here is a sampling of some of the major commercial vendors that can help your company get started with Hadoop. Some offer on-premises software packages; others sell Hadoop in the cloud. There are also some Hadoop database appliances beginning to appear, including the recently announced joint effort by Oracle and Cloudera.
• Amazon Web Services runs Amazon Elastic MapReduce, a hosted Hadoop framework running on Amazon's Elastic Compute Cloud and its Simple Storage Service
• The Cloudera Enterprise subscription service
• The Datameer Analytics Solution using Hadoop
• The DataStax Enterprise Hadoop software
• Greenplum, a Division of EMC, offers Greenplum HD Enterprise-Ready Apache Hadoop
• The Hortonworks Data Platform
• BigInsights, an unstructured-data cloud service from IBM based on Hadoop
• Karmasphere Analyst, a toolkit to help produce data using Hadoop
• MapR provides an enterprise-ready M5 edition of its Hadoop software
• This list features only some of the many vendors offering enterprise Hadoop products and services today. The number of vendors is constantly growing as Hadoop gains steady traction in the data marketplace.
Why Hadoop?
Hadoop
• Open source platform supporting large-scale parallel processing – 1000’s of
servers
• Massive scale distributed file system – Petabytes of data
Customer Requirements
• Very affordable, scalable storage (petabytes)
• Want to store complete transaction data
• Flexible schema – new datasets with new schema created regularly
• Scalable, flexible analytics – generation of models of fraudulent card usage
• Job fault-tolerance
Hadoop Benefits
• We showed that jobs that took multiple weeks were reduced to hours with Hadoop
• “Fundamentally change what they are able to do”
Big Data Market
The Big Data market is on the verge of a rapid growth spurt that will see it top the $50 billion
mark worldwide within the next five years.
As of early 2012, the Big Data market stands at just over $5 billion based on related
software, hardware, and services revenue. Increased interest in and awareness of the power
of Big Data and related analytic capabilities to gain competitive advantage and to improve
operational efficiencies, coupled with developments in the technologies and services that
make Big Data a practical reality, will result in a super-charged CAGR of 58% between now
and 2017.
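As a rough sanity check on those figures (assuming the roughly $5 billion 2012 base stated above), compounding at 58% for five years gives about $5B × 1.58^5 ≈ $5B × 9.8 ≈ $49B, consistent with the $50 billion forecast.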
Big Data Market Forecast
Big Data is the new definitive source of competitive advantage across all industries. For those
organizations that understand and embrace the new reality of Big Data, the possibilities for new
innovation, improved agility, and increased profitability are nearly endless.
Below is Wikibon’s five-year forecast for the Big Data market as a whole:
Big Data Pure-Play Vendors Annual Revenue
Source: Wikibon 2012
Below is a worldwide revenue breakdown of the top Big Data pure-play vendors as of
February 2012.
Big Data Pure-Play Vendors Market Share
Source: Wikibon 2012
Below is a breakdown of market share among the pure-play segment of the Big
Data market.
Components of Big-data Processing
Big-data projects have a number of different layers of abstraction, from abstraction of the data through to running analytics against the abstracted data. Figure 1 shows the common components of analytical Big-data and their relationship to each other. The higher-level components help make Big-data projects easier and more productive. Hadoop is often at the center of Big-data projects, but it is not a prerequisite.
“The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012”
The Forrester Wave is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave are trademarks of Forrester Research, Inc. The Forrester Wave is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
IBM Big Data Platform
• Client and Partner Solutions
• IBM Big Data Solutions
• Big Data Accelerators: Text, Image/Video, Financial, Time Series, Statistics, Mining, Geospatial, Mathematical, Acoustic
• Big Data Enterprise Engines: InfoSphere BigInsights, InfoSphere Streams, with Workload Management and Optimization
• Productivity Tools and Optimization: Connectors, Applications, Blueprints
• Open Source Foundation Components: Eclipse, Oozie, Hadoop, HBase, Pig, Lucene, Jaql
• Consumability and Management Tools: Data Growth Management (InfoSphere Optim), Database (DB2), Data Warehouse (InfoSphere Warehouse), Master Data Management (InfoSphere MDM), Warehouse Appliance (IBM Netezza), Marketing (IBM Unica), Content Analytics (ECM), Business Analytics (Cognos & SPSS), InfoSphere Information Server
BigInsights Platform and Roadmap
• BigInsights Enterprise Console: Data Explorer, Application Flows, Dashboards/Reports, Administration
• BigInsights Enterprise Engine: Languages (Jaql, Pig, Hive, HBase), workflow orchestration, workload prioritization, MapReduce (Hadoop), file system (GPFS+, HDFS), indexing, analytics (machine learning, text)
• Integration points: Unica, DB2, Streams, Netezza, DataStage, SPSS, Cognos
• Source data: DBs, JMS, HTTP, web and application logs, crawlers, streams
• Roles: DBA, Performance Analyst, Analyst, DBA/Analyst/Programmer
• Themes: Manageability, Consumability, Integration
• Legend: open source, IBM unique value, IBM complementary value, IBM differentiating value
The Informatica Approach
• Data sources: Application, Database, Partner Data, SWIFT, NACHA, HIPAA, …, Cloud Computing, Unstructured
• Use cases: Data Warehouse, Data Migration, Test Data Management & Archiving, Master Data Management, Data Synchronization, B2B Data Exchange, Data Consolidation
Pentaho and DataStax
Pentaho and DataStax will offer the first Cassandra-based big data analytics solution that combines the highly scalable, low-latency performance of Cassandra with Kettle’s visual interface for high-performance data extract, transformation and load, as well as integrated reporting, visualization and interactive analysis capabilities. This will make it easier for developers and data scientists to operationalize, integrate and analyze both big data and traditional data sources.
DataStax Enterprise – real-time, analytic, and search capabilities in one integrated big data platform