Big Data and Databases
Vijay Gadepally ([email protected])
Lauren Milechin ([email protected])
This work is sponsored by the Department of the Air Force under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Outline
• Challenge Overview
• General Strategies
• Database Fundamentals and Technologies
Big Data Challenge
[Figure: data sources (providers: things and humans) versus users (deciders: kids, adults, elderly), with a gap shown along a timeline: 10 years ago, 5 years ago, today, in 5 years. Example sources: smartphones, classroom tablets, fitness wearables, commuter, work, and transport vehicles, building security, building usage, building environment.]
Rapidly increasing:
- Data volume
- Data velocity
- Data variety
- Data veracity
Challenge of Data Volume
Where do I store my data?
How much do I store?
How do I access it?
How do I index it?
[Figure: storage options at increasing scale: flat file, spreadsheet, database, distributed database. Labels: 1 TB total (applications and data), 2 TB total (data), scalable data center.]
Challenge of Data Velocity
2011: Data Generated Per Minute
• Facebook: 684,478 pieces of content
• Twitter: 100,000 tweets
• YouTube: 48 hours of new video
• Google: 2,000,000 new queries
Challenge of Data Velocity
2014: Data Generated Per Minute
• Facebook: 2,460,000 pieces of content
• Twitter: 277,000 tweets
• YouTube: 72 hours of new video
• Google: 4,000,000 new queries
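The change between the two snapshots can be made concrete with a little arithmetic. The sketch below divides the 2014 per-minute figures by the 2011 figures from the lists above. The dictionary names are illustrative, and the 2014 Facebook figure is read as 2,460,000 (an assumption about the slide's intended value).

```python
# Per-minute figures copied from the 2011 and 2014 lists above.
# Assumption: the 2014 Facebook figure is read as 2,460,000.
per_minute_2011 = {
    "Facebook (pieces of content)": 684_478,
    "Twitter (tweets)": 100_000,
    "YouTube (hours of video)": 48,
    "Google (queries)": 2_000_000,
}
per_minute_2014 = {
    "Facebook (pieces of content)": 2_460_000,
    "Twitter (tweets)": 277_000,
    "YouTube (hours of video)": 72,
    "Google (queries)": 4_000_000,
}

# Growth factor = 2014 rate divided by 2011 rate, per source.
growth = {
    name: per_minute_2014[name] / per_minute_2011[name]
    for name in per_minute_2011
}

for name, factor in growth.items():
    print(f"{name}: {factor:.2f}x")
```

Under these figures, every source grew over the three years, by factors ranging from 1.5x (YouTube) to roughly 3.6x (Facebook).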
Challenge of Data Velocity
2011 – 2014: Increase in Data Generated
• Facebook: 350 MB/min
• Twitter: 50 MB/min
• YouTube: 24 – 48 GB/min
How do I process the data within the specified time constraints?
How do I capture my data for
Challenge of Data Variety
What does the data look like?
Challenge of Data Variety
How do I index heterogeneous data formats?
• Strings may be easily stored in a database
• Image and document metadata may fit in a traditional database
• Raw images/documents may require a file system or an alternate database
Challenge of Data Variety
How do I fuse heterogeneous data formats to provide a uniform view?
• Fusion drives:
– Indexing/schema decisions
– Technology (databases, storage, etc.) selection
– Selection of software (visualization, language) tools
Challenge of Data Variety
How do I develop algorithms for heterogeneous data formats?
• Images can use High Performance Computing tools
• Strings and documents require a new algebra to take advantage of High Performance Computing systems
• Visualization requires merging image data with string data
Challenge of Data Veracity
How do I balance privacy with availability?
• What level of security is required?
• How do I make data available only to vetted analysts?
• How is data kept secure and private while minimizing impact on analysis?
Challenge of Data Veracity
How confident am I in the integrity of my data?
• Where did it come from?
• Who has accessed it?
• Has anyone modified the data stream?
• Has anyone tampered with the data stream?
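One common way to detect tampering in a data stream is to attach a keyed message authentication code (MAC) to each record at the source and verify it on receipt. The sketch below uses HMAC-SHA256 from the Python standard library. The key, the record format, and the function names are illustrative assumptions, not part of the original material.

```python
import hashlib
import hmac

# Hypothetical key shared by the data source and the consumer.
SECRET_KEY = b"demo-key-shared-by-source-and-consumer"

def sign(record: bytes, key: bytes = SECRET_KEY) -> str:
    """Return a hex HMAC-SHA256 tag for one record."""
    return hmac.new(key, record, hashlib.sha256).hexdigest()

def is_intact(record: bytes, tag: str, key: bytes = SECRET_KEY) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(record, key), tag)

record = b"sensor=42,temp=21.5"          # made-up record for illustration
tag = sign(record)                        # computed at the source

ok_original = is_intact(record, tag)                  # unmodified record
ok_tampered = is_intact(b"sensor=42,temp=99.9", tag)  # modified in transit

print(ok_original, ok_tampered)
```

A MAC only answers whether a record was modified. Questions of provenance and access history (where the data came from, who has accessed it) need separate mechanisms such as audit logging.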
General Strategy: System Design
[Figure: system architecture that bridges the gap between sources (providers: things and humans) and users (deciders). Components: ingest & enrichment, files, scheduler, computing, analytics (A to E), user interface.]
General Strategies: Collection
• Collect, store, and process only useful data
• Intelligently reduce the amount of data through sampling techniques
Example background model: power-law graph
[Figure: signal versus noise in N-D space, and a degree distribution plot (count versus degree on log-log axes, with dmax marked).]
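As one illustration of reducing data through sampling, the sketch below implements reservoir sampling, which keeps a uniform random sample of fixed size from a stream of unknown length in a single pass. This is one possible technique chosen for illustration; the material above does not say which sampling methods are intended, and the function name and parameters are hypothetical.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Sample 10 records from a stream of 100,000 without storing the stream.
sample = reservoir_sample(range(100_000), k=10, seed=0)
print(sample)
```

Memory use stays at k items regardless of stream length, which is the property that matters when the full stream is too large to store.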
General Strategy: Privacy-Preserving Technology
[Figure: system architecture from things (providers) through ingest & enrichment, raw data, databases, scheduler, computing, analytics (A to E), and web to humans (deciders). Successive builds annotate it with threats: data integrity attack, data loss / exfiltration, insider threat.]
Use Cryptographic Protocols to Protect the Confidentiality, Integrity, and/or Availability of Data
• Lots of ongoing research
• Popular techniques:
– Fully Homomorphic Encryption
– Multiparty Computation
– Computing on Masked Data (CMD)
Computing on Masked Data (CMD):
• Cryptographic protections for the NoSQL Accumulo database
• Uses order-preserving, deterministic, and semantically secure encryption
[Figure: CMD workflow. A plaintext query is encrypted into a masked query and sent to the big data cloud; the masked analytic result is returned and decrypted into a plaintext analytic result.]
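The idea of querying masked data can be illustrated with a deterministic keyed token: equal plaintexts map to equal tokens, so a server can answer equality queries without seeing the plaintext. The sketch below is not the CMD implementation. It uses an HMAC token, which is one-way and cannot be decrypted, as a stand-in for deterministic encryption, and it does not show order-preserving or semantically secure schemes. All names and data are hypothetical.

```python
import hashlib
import hmac

def mask(value: str, key: bytes) -> str:
    """Deterministic keyed token: equal inputs give equal tokens."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

key = b"client-side-secret"  # hypothetical; stays with the data owner

# The client masks the column before upload; the server only sees tokens.
plaintext_rows = [("t1", "earthquake"), ("t2", "traffic"), ("t3", "earthquake")]
server_table = [(row_id, mask(word, key)) for row_id, word in plaintext_rows]

# The client masks the query term; the server matches tokens only.
masked_query = mask("earthquake", key)
matching_ids = [row_id for row_id, token in server_table if token == masked_query]

print(matching_ids)
```

Determinism is what makes the equality match possible, and it is also what leaks which rows share a value. That trade-off is one reason a system may combine several encryption types.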
Database Fundamentals
Database: collection of data and supporting data structures
Database Management System (DBMS): software that provides an interface between the user and the database
• Define new data and schema
• Update data
• Retrieve (query) data
• DB administration: set security and permissions
Database Fundamentals
ACID:
• Atomicity: each transaction either fully succeeds or fully fails
• Consistency: all nodes see the same valid data at all times
• Isolation: concurrent transactions leave the system in the same state as if they had run serially
• Durability: committed transactions remain committed
BASE:
• Basically Available, Soft-state services with Eventual consistency
[Figure: animation builds labeled with successful and failed transactions, concurrent updates, and a transaction followed by an update.]
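Atomicity can be demonstrated with a small transaction that fails halfway. The sketch below uses Python's built-in sqlite3 module. The accounts table, the names, and the transfer function are hypothetical and chosen only for illustration.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts inside one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            # The debit above has already run; raising undoes it.
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

transfer(conn, "alice", "bob", 30)       # successful transaction
try:
    transfer(conn, "alice", "bob", 500)  # failed transaction, rolled back
except ValueError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

The failed transfer leaves no trace: the debit that ran before the error is undone, so the balances reflect only the successful transaction.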
[Figure: ACID versus BASE, with BigTable labeled.]
Database Fundamentals
CAP Theorem: it is impossible for a distributed system to simultaneously provide all three of:
• Consistency
• Availability
• Partition Tolerance
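The trade-off can be sketched with two replicas that stay available during a network partition and reconcile afterwards, which is the eventual-consistency behavior of BASE systems. The Replica class below is a toy model, assuming a last-write-wins rule keyed on timestamps. It is not how any particular database resolves conflicts.

```python
class Replica:
    """Hypothetical replica that resolves conflicts by last-write-wins."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def merge_from(self, other):
        """Pull the other replica's state, keeping the newest write per key."""
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)

a, b = Replica(), Replica()

# Network partition: both replicas keep accepting writes (availability),
# so reads can disagree (no consistency).
a.write("status", "open", timestamp=1)
b.write("status", "closed", timestamp=2)
during_partition = (a.read("status"), b.read("status"))

# Partition heals: replicas exchange state and converge.
a.merge_from(b)
b.merge_from(a)
after_heal = (a.read("status"), b.read("status"))

print(during_partition, after_heal)
```

A system that instead refused writes during the partition would keep the replicas consistent at the cost of availability.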
Database Fundamentals
[Figure: timeline from 1995 to 2016 with tracks labeled DATABASES and PARALLEL PROCESSING. Entries: Cluster, MapReduce, Hadoop, BigTable, Dremel, NoSQL, Pregel, D4M, Giraph, SQL, NewSQL.]
Slide Source: S. Sawyer, B. D. O'Gwynn, A. Tran, T. Yu. Understanding Query Performance in Accumulo. HPEC 2013.
Database Fundamentals
[Figure: performance versus consistency for relational, NoSQL, and NewSQL database systems.]
Relational Databases
What it Is
• Database that stores information about data and how it is related
• Highly structured, normalized, table-based database
• Predefined schema/organization of data
• Vertically scalable with good-quality hardware
• Uses SQL as the query interface
• Typically provides full consistency
Relational Databases
Who Uses It
When to Use It
• Dealing with transactional data
• Problem sizes are moderate
• Need for ACID guarantees
How to Use It
• JDBC (Java Database Connectivity)
Relational Databases
Tweet Table
Tweet ID  | User ID | Location ID | Tweet Text
096360448 | 67555   | wwz4p7jd    | Omg earthquake
544019456 | 67554   | wwh1hss5    | We're gonna have an earthquake
600791040 | 67556   | wwwygbvq    | Omg it's a earthquake

User ID | Username   | Friends Count
67554   | _zariaaa_  | 541
67555   | gnvrly_ron | 693
67556   | yolvndv    | 424

Location ID | Latitude  | Longitude
wwh1hss5    | 33.951186 | -118.328370
wwwygbvq    | 37.754312 | -122.164388
wwz4p7jd    | 38.337154 | -122.670192
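The three normalized tables above are tied together by their ID columns, and a SQL join reassembles them into one view. The sketch below loads the example rows into an in-memory SQLite database and joins them. The table and column names are illustrative choices, and SQLite stands in for whichever relational database is used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY, username TEXT, friends_count INTEGER);
CREATE TABLE locations (
    location_id TEXT PRIMARY KEY, latitude REAL, longitude REAL);
CREATE TABLE tweets (
    tweet_id TEXT PRIMARY KEY,
    user_id INTEGER REFERENCES users(user_id),
    location_id TEXT REFERENCES locations(location_id),
    tweet_text TEXT);
""")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [
    (67554, "_zariaaa_", 541),
    (67555, "gnvrly_ron", 693),
    (67556, "yolvndv", 424),
])
conn.executemany("INSERT INTO locations VALUES (?, ?, ?)", [
    ("wwh1hss5", 33.951186, -118.328370),
    ("wwwygbvq", 37.754312, -122.164388),
    ("wwz4p7jd", 38.337154, -122.670192),
])
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", [
    ("096360448", 67555, "wwz4p7jd", "Omg earthquake"),
    ("544019456", 67554, "wwh1hss5", "We're gonna have an earthquake"),
    ("600791040", 67556, "wwwygbvq", "Omg it's a earthquake"),
])

# Join tweets to their author and location through the ID columns.
rows = conn.execute("""
    SELECT t.tweet_id, u.username, l.latitude, l.longitude, t.tweet_text
    FROM tweets AS t
    JOIN users AS u ON u.user_id = t.user_id
    JOIN locations AS l ON l.location_id = t.location_id
    WHERE t.tweet_text LIKE '%earthquake%'
    ORDER BY t.tweet_id
""").fetchall()

for row in rows:
    print(row)
```

Each fact (a username, a coordinate) is stored once and referenced by key, which is what the predefined, normalized schema buys at the cost of joins at query time.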
NoSQL Databases
What it Is
• Database based on documents, key-value pairs, graphs, or wide-column stores
• Dynamic schema
• Horizontal scalability
• Typically provides “eventual consistency”
NoSQL Databases
Who Uses It
When to Use It
• Large unstructured datasets
• Strong need for high performance
• Only require BASE guarantees
How to Use It
• Python/Java bindings
• Lincoln Laboratory D4M
NoSQL Databases
[Figure: the same tweet data stored in four NoSQL tables.]
Edge Table: rows are tweet IDs (096360448, 544019456, 600791040); columns are "field|value" keys: FriendCount|424, FriendCount|541, FriendCount|693, Latitude|33.951186, Latitude|37.754312, Latitude|38.337154, Location|wwh1hss5, Location|wwwygbvq, Location|wwz4p7jd, …, UserID|67556, UserName|_zariaaa_, UserName|gnvrly_ron, UserName|yolvndv, Word|Omg, Word|a, Word|an, Word|earthquake, …
Transpose Table: rows are the "field|value" keys (FriendCount|424, FriendCount|541, …, Word|an, Word|earthquake, …); columns are tweet IDs (096360448, 544019456, 600791040)
Degree Table:
Key                | Degree
FriendCount|424    | 1
FriendCount|541    | 1
FriendCount|693    | 1
Latitude|33.951186 | 1
…
Word|an            | 1
Word|earthquake    | 3
…
Text Table:
Tweet ID  | Text
096360448 | Omg earthquake
544019456 | We're gonna have an earthquake
600791040 | Omg it's a earthquake
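The four-table layout can be sketched in plain Python to show how one record is "exploded" into "field|value" column keys. This is a minimal illustration using dictionaries, not the D4M library API. The record field names and the whitespace word split are assumptions.

```python
from collections import defaultdict

# Example records assembled from the tweet data shown earlier.
tweets = [
    {"TweetID": "096360448", "UserID": "67555", "UserName": "gnvrly_ron",
     "FriendCount": "693", "Location": "wwz4p7jd", "Latitude": "38.337154",
     "Text": "Omg earthquake"},
    {"TweetID": "544019456", "UserID": "67554", "UserName": "_zariaaa_",
     "FriendCount": "541", "Location": "wwh1hss5", "Latitude": "33.951186",
     "Text": "We're gonna have an earthquake"},
    {"TweetID": "600791040", "UserID": "67556", "UserName": "yolvndv",
     "FriendCount": "424", "Location": "wwwygbvq", "Latitude": "37.754312",
     "Text": "Omg it's a earthquake"},
]

def explode(record):
    """Turn one record into a set of 'field|value' column keys."""
    keys = {f"{field}|{value}" for field, value in record.items()
            if field not in ("TweetID", "Text")}
    keys |= {f"Word|{word}" for word in record["Text"].split()}
    return keys

edge = {}                     # tweet ID -> column keys      (Edge Table)
transpose = defaultdict(set)  # column key -> tweet IDs      (Transpose Table)
degree = defaultdict(int)     # column key -> number of rows (Degree Table)
text = {}                     # tweet ID -> raw text         (Text Table)

for record in tweets:
    row = record["TweetID"]
    edge[row] = explode(record)
    text[row] = record["Text"]
    for column in edge[row]:
        transpose[column].add(row)
        degree[column] += 1

# Which tweets mention "earthquake"? One lookup in the transpose table.
print(sorted(transpose["Word|earthquake"]), degree["Word|earthquake"])
```

Because every value becomes a key, any field can be looked up directly, and the degree table tells a query planner how many rows a key will return before the lookup runs.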
Accumulo Design Drivers
1. Cell-Level Security
• Express common security requirements in the infrastructure, not just in the application
• Data-centric approach encourages secure sharing
2. Scalability
• Near-linear performance improvements at thousands of nodes
• Durable and reliable under the increased failures that come with scale
3. Diverse, Interactive Analytics
• Sorted key/value core performs well in a diverse set of domains
• Information retrieval, statistics, graph analysis, geo indexing, and more
4. Flexible, Adaptive Schema
• Start with universal structures and indexing
• Refine the schema over time
Accumulo Features
• Visibility Labels
• Iterators
• Automatic table splitting
• Support for Apache Thrift proxy
Visibility Iterator Table-split Thrift Schema D4M
volume ✓ ✓ ✓
velocity ✓ ✓ ✓
variety ✓ ✓ ✓
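Two of these ideas, a sorted key/value core and visibility labels checked at read time, can be sketched together in a few lines. The class below is a toy in-memory model and not the Accumulo API. It treats a cell's visibility as a plain set of required labels, whereas Accumulo visibility expressions also support boolean combinations of labels. The class name, keys, and labels are hypothetical.

```python
import bisect

class TinySortedStore:
    """Hypothetical in-memory stand-in for a sorted key/value table."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._cells = {}  # row key -> (value, required labels)

    def put(self, key, value, labels=frozenset()):
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = (value, frozenset(labels))

    def scan(self, start, stop, authorizations):
        """Yield (key, value) for start <= key < stop that the caller may see."""
        auths = frozenset(authorizations)
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            value, labels = self._cells[key]
            if labels <= auths:  # caller holds every required label
                yield key, value

store = TinySortedStore()
store.put("Word|earthquake", 3)
store.put("Word|an", 1)
store.put("Word|Omg", 2)
store.put("UserName|yolvndv", 1, labels={"pii"})

# Range scan over one key prefix; sorted keys make this a contiguous slice.
words = list(store.scan("Word|", "Word|~", authorizations=set()))

# The same cell is hidden or visible depending on the caller's labels.
hidden = list(store.scan("UserName|", "UserName|~", authorizations={"analyst"}))
visible = list(store.scan("UserName|", "UserName|~", authorizations={"pii"}))

print(words, hidden, visible)
```

Enforcing the label check inside the scan, rather than in each application, is the point of expressing security requirements in the infrastructure.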
NewSQL Databases
What it Is
• Database systems that emulate the performance of NoSQL while keeping the ACID guarantees of relational databases
• Usually a scaled-up version of a relational database
• Often uses an array data model
• Other data models include graph-based data structures and distributed relational tables
• May make use of in-memory processing or specialized hardware
NewSQL Databases
Who Uses It
When to Use It
• Large multidimensional datasets
• Data that doesn’t fit in traditional databases
• Have the volume for NoSQL but need ACID guarantees
How to Use It
• Each has a custom API
SciDB
• Massive Parallel Processing Database
• Array data model
• Complex analytics
• Commodity clusters or cloud
• R, Python, Matlab, Julia, …
SciDB Example Schema
[Figure: two-dimensional array with dimensions time (12342778213, 12342778214, 12342778215, …) and stock (“MSFT”, “MSFVX”, “MT”, …). Each non-empty cell holds price and volume attributes, for example price: 15.76, volume: 200; volume may be null, as in price: 17.50, volume: null.]
• Highly customizable to application
• Each cell is a strongly-typed structure of attributes: <int>, or <double, string, float>, or …
• Nullable attributes, empty cells, sparse, or dense
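The array data model can be mimicked in plain Python to show what "strongly-typed cells with nullable attributes in a sparse array" means. This is a conceptual sketch, not SciDB syntax or its client API. The attribute values come from the example schema, but which value sits at which (time, stock) coordinate is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass(frozen=True)
class Cell:
    """One array cell: a typed structure of attributes."""
    price: float
    volume: Optional[int]  # nullable attribute

# Sparse array: only non-empty cells are stored, keyed by (time, stock).
# Assumption: the placement of values at coordinates is illustrative.
array: Dict[Tuple[int, str], Cell] = {
    (12342778213, "MSFT"): Cell(price=15.76, volume=200),
    (12342778214, "MSFT"): Cell(price=17.50, volume=None),
    (12342778215, "MSFT"): Cell(price=17.40, volume=100),
    (12342778213, "MSFVX"): Cell(price=234.2, volume=10),
    (12342778214, "MT"): Cell(price=0.02, volume=None),
}

def total_volume(arr, stock):
    """Sum known volumes along the time dimension for one stock, skipping nulls."""
    return sum(cell.volume for (_, name), cell in arr.items()
               if name == stock and cell.volume is not None)

msft_volume = total_volume(array, "MSFT")
empty_cell = array.get((12342778215, "MT"))  # empty cells are simply absent

print(msft_volume, empty_cell)
```

A null attribute (volume unknown) is different from an empty cell (no observation at that coordinate), and analytics over the array need to handle both.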
SciDB Features
• Massive Parallel Database
• Array Data Model
• Analytic language support
• In-database analytics
MPP DB Array Languages Analytics
volume ✓ ✓ ✓ ✓
velocity ✓ ✓
variety ✓ ✓ ✓
Quick Reference: RDBMS vs. NoSQL vs. NewSQL

             | Relational Databases                     | NoSQL                               | NewSQL
Examples     | MySQL, PostgreSQL, Oracle                | HBase, Cassandra, Accumulo          | SciDB, VoltDB, MemSQL
Schema       | Typed columns with relational keys       | Schema-less                         | Strongly-typed structure of attributes
Architecture | Single-node or sharded                   | Distributed, scalable               | Distributed, scalable
Guarantees   | ACID transactions                        | Eventually consistent               | ACID transactions (most)
Access       | SQL, indexing, joins, and query planning | Low-level API (scans and filtering) | Custom API, JDBC, bindings to popular languages
Slide Source: S. Sawyer, B. D. O'Gwynn, A. Tran, T. Yu. Understanding Query Performance in Accumulo. HPEC 2013.
On The Horizon
New Technologies and Techniques
New database and processing technology such as:
• Apache Spark: in-memory distributed processing
• TileDB: database for scientific big data
• S-Store: database tuned for streaming data
New cross-database and storage engine standards, APIs, and practices:
• BigDAWG: an API to simplify big data analytics, currently being designed
• GraphBLAS: an effort to standardize graph algorithms and databases
Advances in privacy-preserving technology:
• SPED: signal processing in the encrypted domain
• Greater efficiency of protocols such as Functional Encryption and Multiparty Computation
Tools and technologies will continue to evolve; it is important to keep students abreast of new developments
Conclusions
• Lots of stuff going on!
• Very important to understand the details of your dataset, end analytic, and other requirements
• Topics covered:
– Challenge overview (What is the problem?)
– Some general strategies
– Databases