BIG DATA
TRENDS AND TECHNOLOGIES
THE WORLD OF DATA IS CHANGING
Cloud
WHAT IS BIG DATA?
Big data are datasets that grow so large that they become
awkward to work with using on-hand database management tools.
Difficulties include capture, storage, search, sharing,
analytics, and visualization.
We are entering the era in which sensors are collecting data in our physical world and delivering it to networks that aggregate and analyze the information. Big data defines us and will increasingly
dictate how we live in a fully interconnected world.
WHAT IS BIG DATA?
Forrester’s Brian Hopkins describes big data as
“techniques and technologies
that make handling data at
extreme scale economical.”
DIGITAL DATA AGE
• BEFORE 1990
ALL THE PHOTOGRAPHS AND VIDEO TAKEN BY ONE PROFESSIONAL OVER AN ENTIRE LIFETIME OCCUPY AROUND 10 GIGABYTES
• 2010
PHOTOGRAPHS AND VIDEO TAKEN IN ONE YEAR BY ORDINARY PEOPLE TAKE UP ABOUT 5 GB
WHERE DATA COME FROM
• PHOTOGRAPHS
• VIDEO
• MACHINE LOGS
• RFID READER
• VEHICLE GPS TRACE
• RETAIL TRANSACTION
• FINANCIAL TRANSACTION
DIGITAL DATA FACTS AND FIGURES
STORAGE CAPACITY AND TRANSFER RATE
• 1990
1,000 MB WITH 4.4 MB/S TRANSFER RATE
It takes approximately 227 s, or about 4 minutes, to read the whole disk.
• 2011
1,000 GB WITH 135 MB/S TRANSFER RATE
It takes approximately 7,600 s, or about 2 hours, to read the whole disk.
MORE OXEN BETTER THAN ONE BIGGER OX?
• READING 2 TERABYTES FROM 1 DISK
TAKES ABOUT 4 HOURS
• WHAT ABOUT READING 2 TB FROM 1,000 DISKS IN PARALLEL?
TAKES ABOUT 15 SECONDS
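The ox comparison comes down to simple division. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1,000,000 MB), the 135 MB/s drive quoted for 2011, and data split evenly across the disks; the function name `read_time_seconds` is illustrative only.

```python
MB_PER_GB = 1_000
MB_PER_TB = 1_000_000


def read_time_seconds(size_mb: float, rate_mb_per_s: float, disks: int = 1) -> float:
    """Seconds needed to read size_mb when it is split evenly over `disks`
    drives that are all read at the same time."""
    return size_mb / disks / rate_mb_per_s


# Whole-disk reads for the two drives quoted in the slides.
disk_1990 = read_time_seconds(1_000, 4.4)
disk_2011 = read_time_seconds(1_000 * MB_PER_GB, 135)

# The same 2 TB dataset on one disk versus 1,000 disks in parallel.
one_disk_2tb = read_time_seconds(2 * MB_PER_TB, 135)
parallel_2tb = read_time_seconds(2 * MB_PER_TB, 135, disks=1_000)

print(f"1990 disk: {disk_1990:.0f} s")
print(f"2011 disk: {disk_2011:.0f} s")
print(f"2 TB, 1 disk: {one_disk_2tb / 3600:.1f} h")
print(f"2 TB, 1,000 disks: {parallel_2tb:.1f} s")
```

Under these assumptions a single disk needs roughly four hours for 2 TB, while 1,000 disks read in parallel need about 15 seconds. Real clusters add network and coordination overhead that this sketch ignores.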
HADOOP
• Hadoop Distributed File System (HDFS) - reliable data storage
DATA IS DISTRIBUTED AND REPLICATED OVER MULTIPLE MACHINES
DESIGNED FOR LARGE FILES (TB, PB, OR LARGER)
• MapReduce - high-performance parallel data processing
Inspired by Google's Google File System (GFS) and MapReduce papers, circa 2003-2004; created by Doug Cutting
HADOOP DISTRIBUTED FILE SYSTEM
MAP/REDUCE ADVANTAGES
• SCALABLE
Automatically Parallelizes Map & Reduce Operations, Supporting Thousands of Processors and Petabytes of Data
• FAULT TOLERANCE
Replicated Data in HDFS
Failed Jobs Are Restarted Automatically without Losing the Other Jobs
• ELASTIC AND FLEXIBLE
Degree of Parallelism Can Be Determined at Runtime
Flexible Data Model and Programming
• AFFORDABLE AND EASY TO USE
Open Source and Designed to Work on Commodity Hardware
Two Routines: Map & Reduce
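The "two routines" idea can be illustrated with the classic word count. This is a single-process sketch in plain Python, not Hadoop's actual API: the names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and the shuffle step that a real framework performs across machines is simulated here with a dictionary.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: combine all values seen for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    # Each line could be mapped on a different machine; here it is sequential.
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


result = word_count(["big data big clusters", "data is big"])
print(result)  # {'big': 3, 'data': 2, 'clusters': 1, 'is': 1}
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both phases can be spread over many machines without changing the code the user writes.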
HADOOP ARCHITECTURE
HADOOP ADVANTAGES
• DISTRIBUTED
DATA IS REPLICATED AND PROCESSED ACROSS THE CLUSTER
• FAULT TOLERANT
KEEPS RUNNING WHEN NODES FAIL
• SELF HEALING
REBALANCES FILES ACROSS CLUSTER
• SCALABLE
JUST BY ADDING NEW NODES
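Replication and self-healing can be pictured with a toy model. The sketch below is a simplification under stated assumptions: a replication factor of 3, random placement rather than HDFS's real rack-aware policy, and hypothetical helper names `place_blocks` and `handle_node_failure`.

```python
import random

REPLICATION = 3  # assumed replication factor for this sketch


def place_blocks(blocks, nodes, replication=REPLICATION, seed=0):
    """Assign each block to `replication` distinct nodes (random, not rack-aware)."""
    rng = random.Random(seed)
    return {block: set(rng.sample(nodes, replication)) for block in blocks}


def handle_node_failure(placement, nodes, dead, replication=REPLICATION, seed=1):
    """Remove the dead node, then copy under-replicated blocks to survivors."""
    rng = random.Random(seed)
    survivors = [n for n in nodes if n != dead]
    for holders in placement.values():
        holders.discard(dead)
        # Only nodes that do not already hold this block can take a new copy.
        candidates = [n for n in survivors if n not in holders]
        rng.shuffle(candidates)
        while len(holders) < replication and candidates:
            holders.add(candidates.pop())
    return survivors


nodes = [f"node{i}" for i in range(5)]
placement = place_blocks(["blk_1", "blk_2", "blk_3", "blk_4"], nodes)
nodes = handle_node_failure(placement, nodes, dead="node2")

fully_replicated = all(len(holders) == REPLICATION for holders in placement.values())
print(fully_replicated)  # True
```

With five nodes and three copies per block, losing one node still leaves enough survivors to restore every block to full replication, which is the behavior the "fault tolerant" and "self healing" bullets describe.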
HADOOP FACTS
• OPEN SOURCE
• BATCH / OFF-LINE ORIENTED
• DATA AND I/O INTENSIVE (READ)
• HADOOP IS NOT A RELATIONAL DATABASE
• HADOOP IS NOT AN OLTP SYSTEM AND NOT
A STRUCTURED DATA STORE OF ANY KIND
HADOOP STACK
HIVE
DATA WAREHOUSE PLATFORM ON HADOOP
HBASE
TABLE STORAGE ON HADOOP
CASSANDRA
DATA STORE
ZOOKEEPER
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
FLUME, PIG, etc.
WHO'S USING HADOOP?
• TWITTER
WHO TO FOLLOW
• YAHOO
SEARCH ASSIST
• LINKEDIN
PEOPLE YOU MAY KNOW
• YOUTUBE
VIDEO SUGGESTIONS
• FACEBOOK
FRIENDS YOU MAY KNOW AND ALMOST EVERYTHING
• AMAZON, EBAY, GOOGLE
LEVERAGES TRADITIONAL AND NEW CAPABILITIES
TRADITIONAL
Relational Database Management System
NEW
Petabyte-Scale Services
Microsoft's approach to Big Data
• Any Data, Any Size, Anywhere
Simplicity and manageability of Windows to Hadoop
Extended data warehousing with Hadoop
Scale & elasticity of cloud
• Immersive Insight, Wherever You Are
Analyze Big Data with familiar tools
Immersive insights from any data
JavaScript-based simple programming
• Connecting with the World's Data
Share your data with the world via Azure Marketplace
Enrich with social media data via Social Analytics
Advanced analytics with Hadoop
MICROSOFT BIG DATA ANALYTICS
Hadoop connectors for SQL Server, built on Sqoop, move data seamlessly between Hadoop and SQL Server or SQL Server Parallel Data Warehouse.
A new Hive ODBC Driver and a Hive Add-in for Excel let customers move data from Hive directly into Excel, or into Microsoft BI tools such as PowerPivot, for analysis.
Extending your Enterprise Data Warehouse with Hadoop
Benefits and Key Features
• Integration with enterprise BI solutions
• Microsoft SQL Server connector for Apache Hadoop with Sqoop (SQL to Hadoop)
• Integration with Microsoft Enterprise Data Warehouses
• SQL Server Parallel Data Warehouse connector for Apache Hadoop with Sqoop
• Deeper insights from structured and unstructured data
Delivering insights to everyone by enabling big data analysis with familiar end-user tools
Benefits and Key Features
• Hive Add-in for Excel
• Interaction and analysis of unstructured data in Hadoop from Microsoft Excel
Unlocking new insights from all data with Microsoft BI tools
Benefits and Key Features
• Hive ODBC Driver integrates Hadoop with SQL Server Analysis Services, PowerPivot, and Power View
• Familiar BI tools with structured and unstructured data
Simplifying programming on Hadoop with JavaScript
Benefits and Key Features
• New JavaScript libraries for Hadoop
• Deploy JavaScript Hadoop jobs from a simple web browser
• MapReduce programs in JavaScript
• Simplified programming
• Simplified deployment of MapReduce jobs
Providing Choice of Deployment Options
Benefits and Key Features
• Elastic peta-scale analytics on Microsoft's cloud platform
• Hadoop-based service on Windows Azure platform
• Enterprise-class Big Data platform on-premises
• Hadoop-based distribution on Windows Server
Connects Hadoop to the world via Windows Azure Marketplace
Benefits and Key Features
• Mashing up of internal and public data sets via Data Explorer
• Integration with third-party data and services
• Sharing of data and insights through Windows Azure Marketplace
• Integration with Windows Azure Marketplace
Simplicity and manageability of Windows to Hadoop
Benefits and Key Features
• Enterprise-class security
• Integration with Microsoft System Center
• Integration with Windows Server® Active Directory
• Simplified management of Hadoop on Windows
• Smart packaging of Hadoop on-premises
• Fast deployment of Hadoop on Azure
• Easy setup on-premises and in the cloud
A holistic BIG DATA Solution from Microsoft spanning relational and non-relational Worlds
[Diagram: the solution in three layers]
• DATA MANAGEMENT
RELATIONAL, NON-RELATIONAL, MULTIDIMENSIONAL, STREAMING
• DATA ENRICHMENT
DISCOVER AND RECOMMEND, TRANSFORM AND CLEAN, SHARE AND GOVERN, EXTERNAL DATA AND SERVICES (MARKETPLACE)
• INSIGHTS
SELF-SERVICE, COLLABORATIVE, MOBILE, REAL-TIME, OPERATIONAL, PREDICTIVE
Hadoop on Windows & Azure: Roadmap
H2 2011
Hadoop on Azure Preview 2
• More capacity
• Disaster Recovery for HDFS
• Support for Mahout
Hadoop on Azure Private CTP
Hadoop on Server Private TAP
• Hadoop Core & Common
• JavaScript Framework
Hadoop on Azure GA
• Portal Integration & Billing
• Azure SDK integration
Hadoop on Server GA
• JavaScript, Pig, Hive, HBase
• Active Directory Integration
• System Center Integration
2012
Hive ODBC Driver Preview 2
Azure Labs
• Data Explorer
• Social Analytics
• Data Hub (Private Data Market)
Hadoop Connectors
Azure Data Market
Excel Integration Preview 2
• Hive Add-in for Excel
• PowerPivot Add-in for Excel
• Power View for SharePoint