No-SQL Databases
for High Volume Data
Edward Wijnen
3 November 2014
The New Connected World Needs
a Revolutionary New DBMS
“The Internet of Things”
1970’s: Mainframe (Isolated)
1990’s: Client-Server (Semi-Connected)
Today: Mobile, Cloud, Social (Radically Connected)
©2014 DataStax Confidential. Do not distribute without consent.
Businesses Must Close the Gap…and Fast
Connected Customers
Connected Partners
Connected Employees
Connected Devices
Connected Products
You Would End With This
Distributed Transactional Database
Connected
Apache Cassandra™
• Apache Cassandra™ is a massively scalable, open source, NoSQL,
distributed database built for modern, mission-critical online applications
• Written in Java; a hybrid of Amazon Dynamo and Google BigTable
• Masterless, with no single point of failure
• Distributed and data centre aware
• 100% uptime
• Predictable scaling
BigTable: http://research.google.com/archive/bigtable-osdi06.pdf
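The masterless, Dynamo-style design can be illustrated with a toy token ring: each node owns a position on a hash ring, and a row's replicas are the next nodes clockwise from the hash of its key. This is a minimal sketch, assuming MD5 as the hash and one token per node. Cassandra's actual partitioner and replication strategies differ in detail, and the node names and the `Ring` and `replicas_for` names are hypothetical.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (toy hash: MD5 digest as an integer)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy masterless ring: any node can compute where a key lives."""

    def __init__(self, nodes):
        # Sort nodes by token so the ring can be walked clockwise.
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas_for(self, key: str, replication_factor: int):
        """Return the nodes that hold copies of `key`."""
        if replication_factor > len(self._ring):
            raise ValueError("replication factor exceeds node count")
        # First node whose token is >= the key's token, wrapping around.
        start = bisect.bisect_left(self._tokens, token(key)) % len(self._ring)
        return [self._ring[(start + i) % len(self._ring)][1]
                for i in range(replication_factor)]


# Hypothetical five-node cluster with three copies of every row.
ring = Ring(["node1", "node2", "node3", "node4", "node5"])
owners = ring.replicas_for("user:42", replication_factor=3)
print(owners)  # three distinct nodes, computed without consulting a master
```

Because placement is a pure function of the key and the node list, every node arrives at the same answer, which is why no single coordinator has to stay alive.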
Distributed Transactional Database Advantages
The Hague
Distributed Transactional Database Advantages
[Diagram: C* (Cassandra) nodes in The Hague and London]
Delivers 150+ Billion Content Recommendations Per Month
Serves content for largest media brands in the world: Reuters, Wall St Journal, USA Today
Needed a massively scalable data store
High velocity of data with 58,000 links to content per second
Always-on data architecture
Distributed Transactional Database Advantages
[Diagram: C* (Cassandra) nodes in The Hague, Groningen and London]
Netflix Delights Customers with Personal Recommendations
World’s leading streaming media provider with digital revenue $1.5BN+
Tailors content delivery based on viewing preference data captured in Cassandra
Increased market cap by 600% since 2012
Introduction of ‘Profiles’ drove throughput to over 10M transactions per second
Replaced Oracle in six data centers, worldwide, 100% in the cloud
Cassandra – Always On, No Matter What
[Diagram: four C* (Cassandra) nodes across The Hague, Groningen and London]
• Transactional Backbone
• Industry Leading Performance
• Predictable Scalability
• Operational Simplicity
• Business Flexibility
Cassandra – Tunable Consistency
• Consistency Level (CL)
• Client specifies per read or write
• Handles multi-data center operations
• ALL = all replicas ack
• QUORUM = a majority (more than 50%) of replicas ack
• LOCAL_QUORUM = a majority of replicas in the local DC ack
• ONE = only one replica acks
• Plus more (see docs)
• Blog: Eventual Consistency != Hopeful Consistency
http://planetcassandra.org/blog/post/a-netflix-experiment-eventual-consistency-hopeful-consistency-by-christos-kalantzis/
[Diagram: five-node ring. A write at CL=QUORUM is sent in parallel to Node 1 (1st copy), Node 2 (2nd copy) and Node 3 (3rd copy). Acks shown: 5 μs, 12 μs, 12 μs, 500 μs.]
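The arithmetic behind the consistency levels can be sketched in a few lines. QUORUM is a majority of replicas, floor(RF / 2) + 1, counted over all data centres, while LOCAL_QUORUM counts only the coordinator's data centre. The helper names and the three-replicas-per-site topology below are assumptions for illustration, not part of any driver API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_acks(level: str, rf_per_dc: dict, local_dc: str) -> int:
    """Replica acks needed for a consistency level (only the four levels listed)."""
    total = sum(rf_per_dc.values())
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(total)
    if level == "LOCAL_QUORUM":
        return quorum(rf_per_dc[local_dc])
    if level == "ALL":
        return total
    raise ValueError(f"unsupported level: {level}")


def overlapping(read_acks: int, write_acks: int, replication_factor: int) -> bool:
    """True when every read set must intersect every write set (R + W > RF)."""
    return read_acks + write_acks > replication_factor


# Hypothetical topology: three replicas in each of three data centres.
rf = {"The Hague": 3, "Groningen": 3, "London": 3}
for level in ("ONE", "LOCAL_QUORUM", "QUORUM", "ALL"):
    print(level, required_acks(level, rf, local_dc="The Hague"))
```

With three replicas, QUORUM reads plus QUORUM writes give 2 + 2 > 3, so a read always touches at least one replica that saw the latest write. ONE plus ONE gives 1 + 1, which is not greater than 3, so that combination trades the guarantee for lower latency.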
• Messaging
• Product Catalogs and Playlists
• Recommendation/Personalization
• Fraud detection
• Internet of Things/Sensor data
A product catalog is an organized collection of products or services. Playlists refer to user-defined queues of songs, movies, games and lessons.
Examples: Shopping carts, gift registries, media playlists.
Challenges
• Rigidity of relational databases
• Increase in volume and diversity of data
• Application must have zero downtime
• Predictable scalability is hard
• Desire to operate in the cloud
Why DataStax?
• Real-time database infrastructure
• Rich analytics for flexible access to information
• Fast search and indexing of data
• Add new features while the application is online
• Multiple data centers to ensure applications and data have 100% uptime
Recommendation and Personalization Engines understand each person’s unique habits and preferences and bring to light products and items that a user may be unaware of and not looking for.
Examples: News sites, shopping carts.
Challenges
• Large volumes of user data make accuracy challenging
• Merging real-time and historical information
• Cross-product information
• Response times need to be fast
• Predictable scalability is hard
Why DataStax?
• Rich query language and enterprise search to store, search and analyze user activity data
• Integrations with data lakes allow for the merging of real-time and historical data
• Multi-data center replication ensures applications and data suffer no downtime
• Linear scalability is predictable
Fraud detection solutions identify out-of-the-ordinary patterns to prevent malicious attacks on digital and physical assets from unauthorized applications and individuals.
Examples: credit card monitoring, application infiltration.
Challenges
• Increasing volume of fraudulent attacks across all industries
• Technology sophistication
• Limited historical and trend information
• Information is stored across multiple channels
• The customer can be the first to spot the fraud
Why DataStax?
• Easy management of high data volumes
• Real-time monitoring across channels, sites and data centers
• Integrations with data lakes allow for the merging of real-time and historical data
• Ease of use in managing and monitoring data
• Multi-data center replication ensures applications and data suffer no downtime
IoT refers to the revolution of a growing number of internet-connected devices that can network and communicate with each other.
Challenges
• Vast and diverse amounts of unstructured data from internet-enabled devices
• Volume of sensors is increasing exponentially
• Fast-changing technology
• Support multiple channels with varying data types
• Predictable scalability is hard
Why DataStax?
• Easy management of high data volumes
• Rich query language and enterprise search to store, search and analyze data
• Dynamic database schema
• Linear scalability offers predictability
Messaging facilitates communication, interaction and collaboration between diverse user groups and applications via social networks, cloud services and more.
Examples: SMS, email and instant messaging.
Challenges
• Managing large data volumes at a reasonable cost
• Real-time updates and information, getting detailed alerts and notifications
• Predictable scalability is hard
• Information is stored across multiple platforms and systems
• Agility
Why DataStax?
• Easy management of high data volumes
• Real-time monitoring across channels, sites and data centers
• Multi-data center replication ensures applications and data suffer no downtime
• Ease of use in managing and monitoring data
• Dynamic database schema
The Weather Channel on Learning to use Cassandra
“If you had a look in the past, you may have found Cassandra had a high learning curve and a fair amount of complexity. CQL3, the native drivers, and virtual nodes have changed the game entirely, making Cassandra a much more accessible and friendly platform.
While I have years of experience using Cassandra, my team was mostly new to it; CQL made their transition essentially painless. But where Cassandra really shines is in speed and operational simplicity, and I would say those two points were critical.”
• Application drivers and connectors for all popular developer languages exist for Cassandra and DataStax Enterprise.
• CQL (Cassandra Query Language) is the primary API
• Drivers/connectors include: Java, C++, Python, Ruby, PHP
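To give a feel for CQL through a driver, the sketch below writes and reads sensor readings. The `demo.readings` table, its columns and the helper names are hypothetical and not taken from the slides. With the DataStax Python driver (the `cassandra-driver` package), a real session would typically come from `Cluster(["127.0.0.1"]).connect()`; the address is an assumption, and a small stand-in session is used here so the example runs without a cluster.

```python
from datetime import datetime, timezone

# Hypothetical schema: one partition per sensor, rows ordered by time.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS demo.readings (
    sensor_id text,
    taken_at  timestamp,
    value     double,
    PRIMARY KEY (sensor_id, taken_at)
)
"""
INSERT_READING = (
    "INSERT INTO demo.readings (sensor_id, taken_at, value) "
    "VALUES (%s, %s, %s)"
)
LATEST_READINGS = (
    "SELECT taken_at, value FROM demo.readings "
    "WHERE sensor_id = %s ORDER BY taken_at DESC LIMIT %s"
)


def store_reading(session, sensor_id, value, taken_at=None):
    """Insert one reading; `session` only needs an execute(query, params) method."""
    taken_at = taken_at or datetime.now(timezone.utc)
    session.execute(INSERT_READING, (sensor_id, taken_at, value))


def latest_readings(session, sensor_id, limit=10):
    """Return the newest readings for one sensor as a list of rows."""
    return list(session.execute(LATEST_READINGS, (sensor_id, limit)))


class RecordingSession:
    """Stand-in that records statements instead of contacting a cluster."""

    def __init__(self):
        self.calls = []

    def execute(self, query, parameters=None):
        self.calls.append((query, parameters))
        return []


session = RecordingSession()
store_reading(session, "sensor-1", 21.5)
print(session.calls[0][0])  # the parameterised INSERT statement
```

Against a real cluster, the `demo` keyspace and the table from `CREATE_TABLE` would have to exist before the insert is issued.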
Connected Partners
Connected Employees
Connected Devices
Connected Products
DataStax – Enabling The Future
Distributed Transactional Database
Connected