So What’s the Big Deal?
NCOAUG Training Day – February 22, 2013
Presentation Agenda
• Introduction
• What is Big Data?
• So What is the Big Deal?
• Big Data Technologies
• Identifying Big Data Opportunities
• Conducting a Big Data Proof‐of‐Concept
• Big Data Case Study (if we have time)
• Q&A
• Links to More Information
Introduction
• RhinoSource, Inc.
– Oracle App/Tech Consulting and Managed Services
• Oracle E‐Business Suite
• Oracle Business Intelligence
• Oracle Database (performance, partitioning and replication)
• Application Development and Advanced PL/SQL Development
– Advanced Technology Consulting and Managed Services
• Big Data
• Mobile Applications
• Cloud Computing
– CIO‐Level Advisory Services
• IT Strategy, Planning and Project Management
• ERP/CRM Evaluation and Implementation
WHAT IS BIG DATA?
What Makes Up Big Data?
• Blog posts, user comments
• Emails and Messaging
• Web server logs
• Instrumentation of online stores
• Image and video uploads
• Process data, such as RFID
• Sensor device data
• External data sets
– Census data
– Weather data
– Geographical data
• “Shadow Data” (replicated copies and change journals)
The 3 V’s of Big Data
• Velocity
– Data generated at a faster rate than ever before
– Server logs, smart phones, sensor devices, RFID
– Challenge: Existing systems cannot process new data fast enough
• Variety
– Data more varied and complex
– Structured and unstructured
– Many formats: text, document, image, video
– Challenge: Existing databases do not handle varying data formats well
• Volume
– Orders of magnitude larger
– 2.5 Zettabytes of new data created in 2012
– 8 Zettabytes of new data projected to be created in 2015
• 3 Billion Internet users, 15 Billion connected devices
– Challenge: Existing databases do not cost‐effectively scale to Big Data sizes
Big Data Growth Trend
[Chart: data volume in Zettabytes over time, 40% CAGR]
How Much is 1 Zettabyte?
New Data by End of 2015
17 ZB of New Data by End of 2015!
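The volume figures quoted earlier (2.5 ZB in 2012, 8 ZB projected for 2015) can be sanity-checked with a little arithmetic. The sketch below assumes a constant yearly growth rate between those two figures; that assumption, and the function names, are illustrative and not from the slides.

```python
# Figures quoted on the slides (zettabytes of new data per year).
NEW_DATA_2012_ZB = 2.5
NEW_DATA_2015_ZB = 8.0


def implied_annual_growth(start, end, years):
    """Constant yearly growth rate that turns `start` into `end` after `years` years."""
    return (end / start) ** (1 / years) - 1


def yearly_volumes(start, rate, years):
    """Volumes for each of the `years` years that follow the starting year."""
    return [start * (1 + rate) ** k for k in range(1, years + 1)]


# Assumption: growth is constant between the 2012 and 2015 figures.
rate = implied_annual_growth(NEW_DATA_2012_ZB, NEW_DATA_2015_ZB, 3)
volumes_2013_to_2015 = yearly_volumes(NEW_DATA_2012_ZB, rate, 3)

print(f"Implied annual growth: {rate:.1%}")
print(f"New data 2013-2015: {sum(volumes_2013_to_2015):.1f} ZB")
```

Under that assumption the implied growth is about 47% per year and the three years 2013 through 2015 add up to about 17 ZB, which is consistent with the 17 ZB callout if it counts those three years. The 40% CAGR on the growth chart presumably covers a different period.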
SO, WHAT’S THE BIG DEAL?
“Can Big Data Show Us The Way?”
• Scientific American, Dec ’11:
– "...the rise of 'big data' [is] a trend that is striking many scientists as being on a par with the invention of the telescope and microscope."
– "...many experts believe we are on the cusp of opening up new worlds of inquiry."
Big Data Advantages
• Better, more accurate predictions
• Deeper, richer insights into customers, business partners and operations
• Real‐time Big Data analytics enables faster decision‐making
• Creates competitive advantage
• Improves bottom line
Big Data Spending
• Companies have spent $4.3 billion on Big Data as of the end of 2012.
• Gartner predicts those initial investments will in turn trigger a domino effect of upgrades and new initiatives
– Valued at $34 billion for 2013, per Gartner.
– Over a 5‐year period, spend is estimated at $232 billion.
BIG DATA TECHNOLOGIES
A Brief History of Big Data
[Figure: Scale versus Time, with stages labeled RDBMS, Data Warehouse, RAC, and Distributed Big Data Cluster]
How Do We Store Big Data?
• NoSQL databases store data records as key‐value pairs
– Or as triplets with a timestamp.
• Schema‐less or “schema‐optional”
– Values may be structured or unstructured (developer’s choice).
• Not relational
– No relationships between records
– No join support in a NoSQL database.
– Does not use SQL to store and retrieve records.
• Highly optimized for retrieval and appending operations.
– High performance writes.
– High performance retrieval by primary key.
– Little functionality beyond record storage and retrieval.
• Highly Scalable to huge amounts of data
– Millions or Billions of records
– Partition data across many distributed, inexpensive servers for cost‐effective scalability and availability
• Must trade off Availability against Consistency (CAP Theorem).
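A toy illustration of the storage model described above: records are key, value, timestamp triplets, retrieved by primary key and partitioned across several servers by hashing the key. This is a minimal in-memory sketch, not how any particular NoSQL product is implemented; the class name, the three-node default and the last-write-wins rule are assumptions made for the example.

```python
import hashlib
import time


class ToyKeyValueStore:
    """In-memory stand-in for a partitioned key-value store (illustrative only)."""

    def __init__(self, node_count=3):
        # Each dict plays the role of one inexpensive server.
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # Hash the key so records spread across the nodes.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value, timestamp=None):
        """Store a (value, timestamp) pair under key; the newest timestamp wins."""
        ts = time.time() if timestamp is None else timestamp
        node = self._node_for(key)
        current = node.get(key)
        if current is None or ts >= current[1]:
            node[key] = (value, ts)

    def get(self, key):
        """Retrieve by primary key; there are no joins or secondary lookups."""
        record = self._node_for(key).get(key)
        return None if record is None else record[0]


store = ToyKeyValueStore()
# Values may be structured (a dict) or unstructured (raw text).
store.put("user:42", {"name": "Ada", "points": 120})
store.put("log:2013-02-22", "GET /index.html 200")
print(store.get("user:42"))
```

Everything beyond `put` and `get` (joins, aggregation, constraints) is left to the application, which is the trade-off the bullets above describe.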
Popular NoSQL Databases
• Key‐Value Stores
• Column‐Oriented Databases
• Graph Databases
• Document Databases
Why Not Relational for Big Data?
• Transforming and loading data into RDBMS requires extensive pre‐processing of data into a pre‐defined schema
– Doesn’t work well for semi‐structured and unstructured data
– Can take more time than is available before next batch must be loaded
• Joining multiple data sets at query time is an expensive operation
• RDBMS scaling must be done vertically to larger and more expensive servers and storage solutions
• RDBMS clustering requires expensive networking and shared storage infrastructures
– Fibre Channel, InfiniBand, SAN, NAS
• Challenging to distribute data across data centers
– Replication strategies are “add‐ons” and complex
• Strict Consistency requirement is enforced at the cost of write performance and availability (CAP Theorem)
Dr. Brewer’s CAP Theorem
“Pick 2”
• CP: BigTable, Hadoop/HBase, MongoDB, Oracle NoSQL, Redis
• AP: Cassandra, CouchDB, Dynamo, Riak, SimpleDB
• CA: RDBMS, Oracle RAC
Scalability Comparison
(Logarithmic Scale)
[Chart: Terabytes and Server Nodes for MongoDB, Oracle RAC, Cassandra and Hadoop; axis from 1 to 100000]
• 21 PB, 2000 Nodes at Facebook
• 300 TB, 400 Nodes at Digital Reasoning
• 71 TB, 48 Nodes at Amazon
• 10 TB, 100 Nodes at CraigsList
Scalability Comparison
(Linear Scale)
[Chart: the same Terabytes and Server Nodes figures for MongoDB, Oracle RAC, Cassandra and Hadoop, on a linear axis from 0 to 25000]
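One way to read the four deployments quoted in the scalability charts is data stored per server node. The short sketch below derives that figure from the chart labels only; the decimal conversion of 1 PB to 1000 TB and the variable names are assumptions for the example.

```python
TB_PER_PB = 1000  # assumption: decimal units, 1 PB = 1000 TB

# (site, terabytes, server nodes) as labeled on the charts.
deployments = [
    ("Facebook", 21 * TB_PER_PB, 2000),
    ("Digital Reasoning", 300, 400),
    ("Amazon", 71, 48),
    ("CraigsList", 10, 100),
]


def terabytes_per_node(terabytes, nodes):
    """Average amount of data held by each server node."""
    return terabytes / nodes


density = {site: terabytes_per_node(tb, nodes) for site, tb, nodes in deployments}

for site, value in density.items():
    print(f"{site}: {value:.2f} TB per node")
```

The ratio is only a rough indicator: it says nothing about replication, hardware or workload, none of which are given on the slides.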
Feature Comparison
Cassandra
• Best of the NoSQLs for Cross‐Data Center Replication and High‐Availability
• Known to scale to 100’s of Terabytes (but theoretically to Petabytes)
• Tunable Consistency at operation‐level for writes and reads. Availability model (AP).
• Primary and Secondary Indexes
• Queries are Real‐Time (CQL, Thrift)
• No Join Support
• Masterless Peer‐to‐Peer Ring Architecture = No S.P.O.F.
• Provides most cost‐effective HA and scalability of the NoSQLs
• Written in Java
• Minimum of 3 nodes recommended.
• Easy to install and set up on commodity hardware.
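Tunable consistency can be reasoned about with simple replica arithmetic. In Dynamo-style systems such as Cassandra, a read is guaranteed to overlap the most recent acknowledged write when the number of replicas written plus the number of replicas read exceeds the replication factor. The sketch below only models that rule; it does not call Cassandra, and the function names are hypothetical.

```python
def quorum(replication_factor):
    """Smallest majority of replicas, e.g. 2 out of 3."""
    return replication_factor // 2 + 1


def read_overlaps_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must share at least one replica with every write set."""
    return write_replicas + read_replicas > replication_factor


rf = 3
# Quorum writes plus quorum reads: consistent, at the cost of waiting for 2 replicas.
print(read_overlaps_latest_write(rf, quorum(rf), quorum(rf)))
# Single-replica writes and reads: fastest and most available, but reads may be stale.
print(read_overlaps_latest_write(rf, 1, 1))
```

Choosing the replica counts per operation is how an application moves between the availability and consistency ends of the CAP trade-off.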
Hadoop/HBase
• The current “Gold Standard” of the NoSQLs for Data Analysis
• Known to scale to Petabytes (1000’s of Terabytes)
• Consistency model (CP)
• Hadoop Queries are Batch (MapReduce). HBase provides real‐time queries similar to Cassandra.
• Joins are Possible
• Master‐Slave Architecture = S.P.O.F. (NameNode/JobTracker Node)
• Written in Java
• Minimum of 5 nodes recommended.
• More challenging installation and setup.
• Warm Standby and Shared Storage Required for High‐Availability Failover, so higher infrastructure costs.
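The batch MapReduce model mentioned above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases, not Hadoop code; the word-count job and all function names are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run the whole batch: map every line, shuffle, then reduce each key."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["big data big deal", "Big Data technologies"])
print(counts)
```

In a real cluster the map and reduce calls run in parallel on many nodes and the whole input is processed per job, which is why such queries are batch rather than real‐time.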
Best of All Worlds
• DataStax Enterprise
– Cassandra Real‐Time Database
• Peer‐to‐Peer HA Architecture
• Cross‐Data Center Replication
• Real‐Time, Low‐Latency Queries
– Hadoop Analytics
• Map/Reduce, Hive, Pig (Joins)
– Solr Search
• Full‐Text Search
• Rich Document Handling (Word, PDF)
Plus Cluster Management
Current Big Data Challenges
• Integrating Big Data with existing databases and BI/reporting systems.
– JDBC, ODBC
– Sqoop
• Security and Encryption
– DataStax Enterprise 3.0 (In Beta)
• Transparent Data Encryption
• Internal and External Authentication
• Data Auditing
IDENTIFYING BIG DATA OPPORTUNITIES
Big Data Use Cases
• Context for Interactions and Transactions
– Reward Points
– Warranty Policies
– Social media chatter
– Survey response feedback
– Website requests
• Connection with Outside Patterns
– Weather Data
– Demographic Data
– Geographical Data
– Government Compliance Data