Big Data for the Enterprise
DAMA 12/15/2011
The following is intended to outline our general
product direction. It is intended for information
purposes only, and may not be incorporated into any
contract. It is not a commitment to deliver any
material, code, or functionality, and should not be
relied upon in making purchasing decisions.
Agenda
• Big Data Overview
• Oracle Big Data solutions
• Oracle NoSQL Database
• Architecture
• Technical Overview
Big Data
What is Big Data?
Big data is a term applied to data sets that are large,
complex and dynamic (or a combination thereof) and
for which there is a requirement to capture, manage
and process the data set in its entirety (not
transformed), such that it is not possible to process the
data using traditional software tools and analytic
What is Big Data?
[Figure labels: GEODATA, TEXT; VOLUME, VELOCITY, VARIETY, VALUE]
What is Driving Big Data?
• The cost per TB of data is now under $100. Over the last 30 years, the amount of space per unit cost has doubled roughly every 14 months (increasing by an order of magnitude every 48 months).
• The cost of CPU cores and memory has dropped massively. Core counts and computing power have increased several fold for commodity hardware, but the cost per server has decreased.
• The software licensing costs for manipulating terabytes, petabytes or exabytes of data have gone from millions of dollars (Teradata, for example) to $0 for open source Hadoop and NoSQL systems. These new open source systems are capable of handling the demands of extremely large datasets and massive distribution.
• Development and operational costs have remained about the same but are insignificant compared to cost of licensing in traditional
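As a quick check on the storage-cost bullet above, the arithmetic below relates a 14-month doubling period to the time needed for a tenfold gain. It is only a back-of-the-envelope sketch, and the function name is illustrative.

```python
import math

def months_for_factor(doubling_months: float, factor: float) -> float:
    """Months needed to grow by `factor` when space per unit cost
    doubles every `doubling_months` months."""
    return doubling_months * math.log2(factor)

# Doubling every 14 months gives a tenfold gain in about 46.5 months,
# which the slide rounds to 48 months.
tenfold = months_for_factor(14, 10)
print(f"10x every {tenfold:.1f} months")

# Thirty years (360 months) corresponds to roughly 25.7 doublings.
doublings = 360 / 14
print(f"{doublings:.1f} doublings in 30 years")
```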
What is Driving Big Data? (Cont’d)
• Big Data projects are being initiated within the business, marketing and BI segments of companies because there is a demand for deeper and more comprehensive analytics.
• The impetus to use Hadoop and NoSQL comes from the desire to mine years of “throw away” raw data for a competitive edge. These new systems are providing a way for the business people to do this.
• Hadoop and NoSQL get around the restrictions imposed by operations on traditional RDBMS systems as far as performance, space and license costs are concerned.
• Even though these systems are still in their infancy, they are providing a means to mine almost any kind of data. Text, binary data, XML,
Why is Big Data important?
• US HEALTH CARE: Increase industry value per year by $300 B
• US RETAIL: Increase net margin by 60+%
• MANUFACTURING: Decrease dev., assembly costs by 50%
• GLOBAL PERSONAL LOCATION DATA: Increase service provider revenue by $100 B
• EUROPE PUBLIC SECTOR ADMIN: Increase industry value per year by €250 B
“In a big data world, a competitor that fails to sufficiently
develop its capabilities will be left behind.”
Big Data Is About*
Tapping into new data sets
Finding and monetizing previously unknown relationships
Big Data in Action
Acquire, Organize, Analyze
• Deep Analytics
• Low, predictable Latency
• High Transaction Count
• Agile Development
• Massive Scalability
• Real Time Results
• High Throughput
• In-Place Preparation
• All Data Sources/Structures
• Flexible Data Structures
Provide Value Added Services to your Customers
Early Adopters
A Divided Solution Spectrum
[Figure: Acquire, Organize, Analyze. Labels: MapReduce Solutions; Distributed File Systems; Transaction (Key-Value) Stores; NoSQL; Flexible; Specialized; Developer Centric; Schema-less, Unstructured, Data Variety]
Early Adopter Dilemma
• Time to Build?
• Expertise?
Paradigm Shift!
• Big data is a hot topic for 2011 and beyond, driven by the user community, not the vendor community.
• Processing now goes to the data instead of data being transported to the process (traditional RDBMS).
• The competition for the big data marketplace is not between vendors but between the inspired do-it-yourself team or individual and the hardware and software vendors.
• Until recently, big data projects were being
Hadoop to Oracle
Bridging the Gap
[Figure: Acquire, Organize, Analyze. Labels: Schema-less, Unstructured, Data Variety; Hadoop MapReduce, HDFS; Cassandra; ETL; Oracle Loader for Hadoop; RDBMS (OLTP); RDBMS (DW); Advanced Analytics]
Oracle Integrated Software Solution
[Figure: Acquire, Organize, Analyze. Labels: Schema-less, Unstructured, Data Variety; Hadoop HDFS; Oracle NoSQL DB; Oracle Loader for Hadoop; Oracle Analytics: Data Mining, R; Oracle]
Oracle Engineered Solutions
Acquire, Organize, Analyze
Big Data Appliance
• Hadoop
• NoSQL Database
• Oracle Loader for Hadoop
• Oracle Data Integrator
Oracle Exadata
• OLTP & DW
• Data Mining & Oracle R
• Semantics
• Spatial
Exalytics
• Speed of Thought
[Figure labels: Schema-less, Unstructured, Data Variety; Schema; HDFS; Hadoop; Oracle NoSQL DB; Oracle Data Integrator; Oracle Loader for Hadoop; Oracle Database (OLTP); Oracle Database (DW); In-DB Analytics: “R”, Mining, Text, Graph, Spatial; Oracle BI EE]
Oracle’s Big Data Solution
Engineered Systems
• Oracle Big Data Appliance
• Oracle Exadata
• Oracle Exalytics
• InfiniBand
Stream, Acquire, Organize, Analyze & Visualize
Oracle Big Data Appliance Hardware
• 18 Sun X4270 M2 Servers
– 48 GB memory per node = 864 GB memory
– 12 Intel cores per node = 216 cores
– 24 TB storage per node = 432 TB storage
• 40 Gb/sec InfiniBand
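The appliance totals on the hardware slide follow directly from the per-node figures. The short check below multiplies them out; the dictionary keys are illustrative names, not Oracle terminology.

```python
NODES = 18  # Sun X4270 M2 servers per appliance, as listed on the slide

# Per-node figures from the slide (key names are illustrative).
per_node = {"memory_gb": 48, "cores": 12, "storage_tb": 24}

# Multiply each per-node figure by the node count to get appliance totals.
totals = {name: value * NODES for name, value in per_node.items()}
print(totals)
```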
Oracle Big Data Appliance Software
• Oracle Linux 5.6
• Java Hotspot VM
• Apache Hadoop Distribution v0.20.x
• R Distribution
• Oracle NoSQL Database Enterprise Edition
• Oracle Data Integrator Application Adapter for Hadoop
The Challenge – NoSQL
• New, rapidly emerging database technology
• Simple data storage, typically non-SQL or Not-only-SQL
• Distributed (Cloud) storage
• Large amounts of data (Terabyte – Petabyte range)
• Solution categories
– Storage for Web Services (our focus is here)
– ETL Processing (MR & Hadoop) (... and we integrate here)
• Common data models
– Key-Value (our focus is here)
– Document
– Columnar
Target Use Cases
• Large semi- or un-structured data repositories
• Simple data structure, simple relationships
• High volume random reads and/or data capture
• Data capture
– Sensor data capture (e.g. IA, SmartGrid, Earth Sci., BioMedical Sci.)
– Statistics & network capture (QOS Network Mgmt)
– Web applications (click-through capture)
– Backup services for mobile devices
• Data services
– NoSQL data sharing (Earth Sci., BioMedical)
– Scalable authentication
– Real-time communication (MMS, SMS, routing)
Technical Requirements
• Based on talking to many prospects
• Requirements
– Terabytes to petabytes of unstructured or semi-structured data
– No single point of failure
– Cost effective, distributed storage
– Elastic scalability on commodity hardware
– Fast, bounded response to simple queries
– Fast, reliable transactions
– Simple administration, enterprise support
• Commercial-grade NoSQL solution
– Real 24x7 support
– Real database expertise
Oracle NoSQL DB Overview
A Distributed, Scalable Key-Value Database
• Simple Data Model
– Key-value pair with major+minor-key paradigm
– Read/insert/update/delete
• Scalability
– Dynamic data partitioning and distribution
– Optimized data access via intelligent driver
• High availability
– One or more replicas
– Resilient to partition master failures
– No single point of failure
– Disaster recovery through location of replicas
• Transparent load balancing
– Reads from master or replicas
– Driver is network topology & latency aware
• Elastic (Planned for Release 2)
– Online addition/removal of storage nodes and automatic data redistribution
[Figure: Application, NoSQL DB Driver, Storage Nodes, Data Center A]
Oracle NoSQL Database
Enterprise Topology
• Replicated Application Servers
• Driver is linked into each Application
• Data Nodes are kept current (underlying BDB JE HA technology)
• Storage Nodes can be created across Data Centers
• Automatic Storage Node failure handling
– Graceful degradation until the node is fixed or replaced
– Automatic recovery
Oracle NoSQL Database Driver
Resolving a Request
Operation + Key[M,m] + Value + Transaction Policy
• Hash Major Key to determine Partition id
• Use Partition Map to map Partition id to a Rep Group
• Use State Table to determine eligible Storage Node(s) within Rep Group
• Use Load Balancer to select best eligible Rep Node
• Operation result
• New Partition Map
• RepNodeStorageTable information
[Figure: Application, NoSQL DB Driver]
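The four resolution steps above can be sketched in a few lines. This is not the Oracle driver: the hash function (SHA-256), the table layouts, the "writes go to the master" rule and the lowest-latency load balancer are all assumptions made for illustration, and every name is hypothetical.

```python
import hashlib
from dataclasses import dataclass

N_PARTITIONS = 12   # illustrative; a real store would use far more
N_REP_GROUPS = 3

@dataclass
class RepNode:
    name: str
    is_master: bool
    online: bool
    latency_ms: float

def partition_id(major_key: str) -> int:
    # Step 1: hash the major key to a partition id (hash choice is an assumption).
    digest = hashlib.sha256(major_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % N_PARTITIONS

# Step 2: partition map, partition id -> rep group id.
PARTITION_MAP = {p: p % N_REP_GROUPS for p in range(N_PARTITIONS)}

# Step 3: state table, rep group id -> nodes with their current state.
STATE_TABLE = {
    g: [RepNode(f"rg{g}-rn{i}", is_master=(i == 0), online=True,
                latency_ms=3.0 - i)
        for i in range(3)]
    for g in range(N_REP_GROUPS)
}

def resolve(major_key: str, is_write: bool) -> RepNode:
    group = PARTITION_MAP[partition_id(major_key)]
    # Assumption: writes must go to the master, reads may use any online node.
    eligible = [n for n in STATE_TABLE[group]
                if n.online and (n.is_master or not is_write)]
    if not eligible:
        raise RuntimeError(f"no eligible node in rep group {group}")
    # Step 4: the "load balancer" here simply picks the lowest-latency node.
    return min(eligible, key=lambda n: n.latency_ms)

print(resolve("user/42", is_write=True).name)   # the group's master
print(resolve("user/42", is_write=False).name)  # the fastest online node
```

Because all records under one major key hash to the same partition, they land in the same rep group, which is what makes single-call transactions within a major key practical.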
Oracle NoSQL Database Key Features
Simple Data Model
• Simple data model – key-value pair (major+minor-key paradigm)
• Simple operations – read/insert/update/delete, read-modify-write (RMW) support
• Scope of transaction – records within a major key, single API call
• Unordered scan of all data (non-transactional)
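To make the major+minor-key idea concrete, here is a toy in-memory model. It is not the Oracle NoSQL Database API; the class and method names are hypothetical, and the all-or-nothing write is simulated by staging changes before swapping them in.

```python
from collections import defaultdict

class ToyKVStore:
    """Toy store addressed by (major key, minor key). Illustrative only."""

    def __init__(self):
        self._data = defaultdict(dict)

    def put(self, major, minor, value):
        self._data[major][minor] = value

    def get(self, major, minor):
        return self._data.get(major, {}).get(minor)

    def delete(self, major, minor):
        records = self._data.get(major, {})
        return records.pop(minor, None) is not None

    def multi_put(self, major, records):
        """Write several minor keys under ONE major key in a single call.
        Changes are staged on a copy, so a failure leaves the store untouched."""
        staged = dict(self._data.get(major, {}))
        staged.update(records)
        self._data[major] = staged

    def multi_get(self, major):
        """Return all records that share the given major key."""
        return dict(self._data.get(major, {}))

store = ToyKVStore()
store.multi_put("user/42", {"name": "Ann", "email": "ann@example.com"})
store.put("user/42", "phone", "555-0100")
print(store.multi_get("user/42"))
```

The transaction scope mirrors the slide: one call, one major key, many minor keys.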
Oracle NoSQL Database Key Features
ACID Transactions – Write Durability
• Specified on per-operation basis, application can set defaults
• Durability consists of ...
a) Sync policy (on Master and Replica)
• Sync – force to disk
• Write No Sync – force to OS buffer
• No Sync – write to local log buffer, flush when convenient
b) Replica Ack Policy
• All
• Simple Majority
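The two parts of a durability setting can be modelled as a small value object. The sketch below is an illustration, not the product API: the type names are hypothetical, and the majority rule assumes the master's own copy counts toward the majority.

```python
from dataclasses import dataclass
from enum import Enum

class SyncPolicy(Enum):
    SYNC = "force to disk"
    WRITE_NO_SYNC = "force to OS buffer"
    NO_SYNC = "write to local log buffer, flush when convenient"

class ReplicaAckPolicy(Enum):
    ALL = "all"
    SIMPLE_MAJORITY = "simple majority"

@dataclass(frozen=True)
class Durability:
    master_sync: SyncPolicy
    replica_sync: SyncPolicy
    replica_ack: ReplicaAckPolicy

def replica_acks_needed(policy: ReplicaAckPolicy, replication_factor: int) -> int:
    """Replica acknowledgements the master waits for before a write returns."""
    if policy is ReplicaAckPolicy.ALL:
        return replication_factor - 1
    # Assumption: a simple majority of the whole group, master included.
    majority = replication_factor // 2 + 1
    return majority - 1

# A per-operation setting: fast on the master, confirmed by a majority.
fast_but_replicated = Durability(SyncPolicy.WRITE_NO_SYNC,
                                 SyncPolicy.NO_SYNC,
                                 ReplicaAckPolicy.SIMPLE_MAJORITY)
print(replica_acks_needed(fast_but_replicated.replica_ack, 3))
```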
Oracle NoSQL Database Key Features
ACID Transactions – Read Consistency
• Specified on per-operation basis, default can be changed
• Consistency
– Absolute (read from Master)
– Time-based
– Version
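One way to picture the three consistency levels is as filters over the nodes of a replication group. The following is a sketch under assumptions: the node fields, the mode strings and the lag/version thresholds are invented for illustration and are not the driver's actual data structures.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    is_master: bool
    lag_ms: int    # how far the node trails the master (assumed metric)
    version: int   # highest update applied on this node (assumed metric)

def eligible_nodes(nodes, mode, max_lag_ms=0, min_version=0):
    """Return the nodes allowed to serve a read under the given consistency."""
    if mode == "absolute":
        # Only the master is guaranteed to hold the latest data.
        return [n for n in nodes if n.is_master]
    if mode == "time":
        # Replicas qualify if they are within the permitted lag.
        return [n for n in nodes if n.is_master or n.lag_ms <= max_lag_ms]
    if mode == "version":
        # Any node that has caught up to the requested version qualifies.
        return [n for n in nodes if n.version >= min_version]
    raise ValueError(f"unknown consistency mode: {mode}")

group = [
    Node("rn1", is_master=True, lag_ms=0, version=100),
    Node("rn2", is_master=False, lag_ms=50, version=98),
    Node("rn3", is_master=False, lag_ms=500, version=90),
]
print([n.name for n in eligible_nodes(group, "time", max_lag_ms=100)])
```

Looser consistency widens the set of eligible nodes, which is what lets the driver spread reads across replicas.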
High Availability/Replication
Failure and Recovery
• Storage Node Failure
– System continues to function using remaining nodes in the replication group
– Failed nodes can be removed/replaced via the Administrative API
– Rejoining nodes automatically synchronize with the master
– Isolated nodes can still service reads as long as consistency guarantees are met
• Master Failover
– Nodes detect failure via missing heartbeat or connection failure
– Automatic election of new master, via a distributed two-phase election algorithm (Paxos)
– Master election based on highest LSN (log sequence number)
• Replication group failure
– System continues to function using remaining replication groups
– Read/Write requests to unavailable subset of keys time out
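The "highest LSN wins" rule from the master failover bullets can be illustrated in isolation. The snippet below is not Paxos and not the product's election code; it only shows the selection criterion, and the majority check and name-based tie-break are assumptions.

```python
def elect_master(reachable_lsns, group_size):
    """Pick a new master from the reachable nodes of a replication group.

    reachable_lsns: dict mapping node name -> last log sequence number (LSN).
    Returns the chosen node name, or None when no majority is reachable.
    """
    # Assumption: an election needs a majority of the group to take part.
    if len(reachable_lsns) <= group_size // 2:
        return None
    # The node with the most complete log (highest LSN) becomes master;
    # ties are broken by name so the result is deterministic.
    return max(reachable_lsns, key=lambda name: (reachable_lsns[name], name))

# The old master rn1 has failed; rn3 has the more complete log.
print(elect_master({"rn2": 1040, "rn3": 1050}, group_size=3))
```

Choosing the highest LSN means the new master has lost as little of the old master's log as possible.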
Oracle NoSQL Database
Simple Administration
• Web-based console and CLI commands
• Manages and Monitors
– Topology
– Configuration changes
– Load: Number of operations, data size
– Performance: Latency, throughput. Min, max, average, trailing, ...
– Events: Failover, recovery, load distribution
Oracle NoSQL Database Differentiation
Commercial Grade Software and Support
• General Purpose
• Reliable – Based on proven Berkeley DB JE HA
• Easy to Install & Configure
Scalable throughput and bounded Latency
• Intelligent Oracle NoSQL DB Driver
– Evenly distributes Data
– Sends operation to fastest node
– Bounded network hops
Simple Programming and Operation Model
• Simple Major + Minor key and Value data structure
• ACID transactions
• Configurable consistency and durability
Easy Management
• Web-based Console and CLI commands
• Manages and Monitors: Topology, Load, Performance, Events, Alerts
Integrates seamlessly with Oracle Stack (ODI, CEP, OLH)
Oracle NoSQL Database
Easy to use, easy to manage
Scalable, Available, Bounded Latency
A NoSQL Database from a vendor you trust
Community VS Enterprise Edition
• Two versions
– Oracle NoSQL Database Community Edition. Open Source. GPLv2 or AGPL license.
– Oracle NoSQL Database Enterprise Edition. Closed Source. Standard Oracle License.
• Community Edition has all of the basic functionality and APIs. Gets you started. Competes with other open source NoSQL solutions.
• Enterprise Edition for large, production, multi-data center,
Key-Value Store Workloads
Data Capture
• Write data as fast as you can
• Minimal indexing
• No referential integrity
• Relaxed durability guarantees (lower value data)
• Scale write throughput via data distribution
• Optimize write throughput per storage node (master,
append-only log file)
• Asynchronous replication
• Bulk operation support is useful for some applications
• Workload can be steady and/or bursty
Key-Value Store Workloads
Data Services
• Simple data reads, minimize I/O
• Primary key lookup(s)
• Relaxed consistency requirements (recent is good enough)
• Scale read throughput via load balancing
• Optimize read throughput per storage node – single I/O
• Node failure/partition tolerance (shift in CAP focus)
• Minimize number of operations to get data
• Workload tends to be highly random
• Data caching doesn’t help much
• Data clustering pays off
Oracle NoSQL Database
Storage Terminology
• Key space hashes into multiple hash buckets (partitions)
• A set of partitions maps to a replication group (aka “shard” – a logical container for a subset of data)
• A set of replication nodes for each replication group provides HA and read scalability for each replication group
• A Storage node (physical or virtual machine) runs each replication node
[Figure: “full” key space]
Oracle NoSQL Database
Topology Example
• 100M keys, 9 Storage Nodes might be configured as
– 1000 Partitions
– 3 Rep Groups
– Replication Factor 3
– 9 Rep/Storage Nodes
[Figure: partitions P1, P2, P3, ... P333, ... and storage nodes SN1 to SN9]
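The numbers in this topology example can be worked through directly. The calculation below assumes keys hash evenly across partitions and that partitions are spread as evenly as possible across rep groups; neither assumption is stated on the slide.

```python
KEYS = 100_000_000
PARTITIONS = 1000
REP_GROUPS = 3
REPLICATION_FACTOR = 3

# Each rep group runs one rep node per replica: 3 groups x 3 replicas.
storage_nodes = REP_GROUPS * REPLICATION_FACTOR

# With an even hash, each partition holds about the same number of keys.
keys_per_partition = KEYS // PARTITIONS

# 1000 partitions do not divide evenly by 3, so one group takes an extra one.
base, extra = divmod(PARTITIONS, REP_GROUPS)
partitions_per_group = [base + (1 if g < extra else 0) for g in range(REP_GROUPS)]

# Every key is stored once per replica in its group.
stored_copies = KEYS * REPLICATION_FACTOR

print(storage_nodes, keys_per_partition, partitions_per_group, stored_copies)
```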
Oracle NoSQL Database Benefits
Commercial Grade Software and Support
• Simple installation, configuration, setup
– Intelligent NoSQL DB Driver “knows” about the topology
• Reliability - no single point of failure
– NoSQL DB Driver installed on all clients
– Replication groups for HA, DR
– Storage node based on proven Berkeley DB JE HA
– Automated node failover and recovery
• General-purpose
Backup and Restore
• Backup
– Backups transactionally consistent per storage node, not system-wide
– Uses fast snapshots, creating hard link to database log files
– User can optionally copy files to remote storage
• Restore
– Supports system-wide recovery from remote storage
– Requires identical storage node topology
– Copy snapshot backup files to at least one replica in each replication
group and bring up nodes. Remaining nodes will synch up.
– Individual storage nodes recover via HA/replication
• Load from Backup
– Supports loading system into a new NoSQL Database with different topology
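The "fast snapshot via hard links" idea from the Backup bullets can be demonstrated with the standard library. This is a generic sketch, not the product's backup tool: it assumes log files are append-only and never rewritten in place, that both directories sit on the same filesystem, and the file name used is only illustrative.

```python
import os
import tempfile

def snapshot(log_dir, snapshot_dir):
    """Hard-link every file in log_dir into snapshot_dir.

    No data is copied, so the snapshot is nearly instant; the linked files
    can later be copied to remote storage at leisure.
    """
    os.makedirs(snapshot_dir, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(log_dir)):
        src = os.path.join(log_dir, name)
        if os.path.isfile(src):
            os.link(src, os.path.join(snapshot_dir, name))
            linked.append(name)
    return linked

with tempfile.TemporaryDirectory() as root:
    logs = os.path.join(root, "env")
    os.makedirs(logs)
    with open(os.path.join(logs, "00000000.jdb"), "wb") as f:
        f.write(b"log records")
    names = snapshot(logs, os.path.join(root, "snapshots", "backup1"))
    same = os.path.samefile(os.path.join(logs, names[0]),
                            os.path.join(root, "snapshots", "backup1", names[0]))
    print(names, same)
```

Because a hard link is a second name for the same file, the snapshot stays valid even if the database later deletes the original log file.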
Oracle NoSQL Database Benefits
Bounded Latency At Scale
• Intelligent Oracle NoSQL DB Driver
– Data evenly distributed across nodes
– Operations sent to fastest appropriate node
– Bounded network hops for all operations
• Optimized node configuration
– Highly-tuned memory management
• Read/write scale-out with bounded latency
Oracle NoSQL Database Benefits
Simple Programming and Operational Model
• Simple Major + Minor key and Value data structure
• Simple, consistent transaction model
– Conflict resolution not required
– ACID transactions
• Configurable consistency & durability, where needed
– Read consistency
Oracle NoSQL Database Benefits
Easy Management
• Web-based console, API accessible
• Manages and Monitors
– Topology
– Load: Number of operations, data size
– Performance: Latency, throughput. Min, max, average, trailing, ...
– Events: Failover, recovery, load distribution
– Alerts