Big Data and Databases

Academic year: 2021


Vijay Gadepally ([email protected])

Lauren Milechin ([email protected])

This work is sponsored by the Department of the Air Force under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

(2)

Outline

Challenge Overview

General Strategies

Database Fundamentals and Technologies

(3)

Big Data Challenge

[Figure: gap between data sources (things, providers) and users (humans, deciders) over time: 10 years ago, 5 years ago, today, in 5 years]

Rapidly increasing
•  Data volume
•  Data velocity
•  Data variety
•  Data veracity

(4)

Challenge of Data Volume

Where do I store my data?

How much do I store?

How do I access it?

How do I index it?

[Figure: data storage scaling from 1 TB total (applications & data) to 2 TB total (data) to a scalable data center; storage formats: flat file, spreadsheet, database, distributed database]

(5)

Challenge of Data Velocity

2011

Data Generated Per Minute

Facebook: 684,478 pieces of content

Twitter: 100,000 tweets

YouTube: 48 hours of new video

Google: 2,000,000 new queries

(6)
(7)

Challenge of Data Velocity

2014

Data Generated Per Minute

Facebook: 2,460,000 pieces of content

Twitter: 277,000 tweets

YouTube: 72 hours of new video

Google: 4,000,000 new queries
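
To put the 2011 and 2014 per-minute figures side by side, a short calculation gives the growth factor for each service. This is only an illustrative sketch: the dictionary and function names are made up for the example, the numbers are the ones quoted on the two slides, and the 2014 Facebook figure is read as 2,460,000.

```python
# Per-minute figures quoted on the 2011 and 2014 slides.
per_minute_2011 = {
    "Facebook": 684_478,    # pieces of content
    "Twitter": 100_000,     # tweets
    "YouTube": 48,          # hours of new video
    "Google": 2_000_000,    # new queries
}
per_minute_2014 = {
    "Facebook": 2_460_000,  # assumption: slide value read as 2,460,000
    "Twitter": 277_000,
    "YouTube": 72,
    "Google": 4_000_000,
}


def growth_factors(before, after):
    """Return after/before for every service present in `before`."""
    return {name: after[name] / before[name] for name in before}


factors = growth_factors(per_minute_2011, per_minute_2014)
for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x in three years")
```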

(8)

Challenge of Data Velocity

2011 – 2014

Increase in Data Generated

Facebook: 350 MB/min

Twitter: 50 MB/min

YouTube: 24 – 48 GB/min

How do I process the data within the specified time constraints?

How do I capture my data for

(10)

Challenge of Data Variety

What does the data look like?

(11)

Challenge of Data Variety

How do I index heterogeneous data formats?

Strings may be easily stored in a database

Image and document metadata may fit in a traditional database

Raw images/documents may require a file system or alternate database

What does the data look like?

(12)

Challenge of Data Variety

How do I fuse heterogeneous data formats to provide a uniform view?

Fusion drives

Indexing/schema decisions

Technology (databases, storage, etc.) selection

Selection of software (visualization, language) tools

What does the data look like?

(13)

Challenge of Data Variety

How do I develop algorithms for heterogeneous data formats?

Images can use High Performance Computing tools

Strings and documents require a new algebra to take advantage of High Performance Computing systems

Visualization requires merging image with string data

What does the data look like?

(14)

Challenge of Data Veracity

How do I balance privacy with availability?

What level of security is required?

How do I make data available only to vetted analysts?

How is data kept secure and private while minimizing impact on analysis?

(15)

Challenge of Data Veracity

How confident am I in the integrity of my data?

Where did it come from?

Who has accessed it?

Has anyone modified the data stream?

Has anyone tampered with the data stream?
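
One common way to answer the tampering question is to attach a keyed message authentication code to each record and re-check it on read. The sketch below uses Python's standard `hmac` module; the key, the record format, and the function names `sign` and `verify` are assumptions made for illustration and are not taken from the slides.

```python
import hashlib
import hmac


def sign(key: bytes, record: bytes) -> str:
    """Return a hex HMAC-SHA256 tag for the record."""
    return hmac.new(key, record, hashlib.sha256).hexdigest()


def verify(key: bytes, record: bytes, tag: str) -> bool:
    """True only if the record still matches the tag made with this key."""
    return hmac.compare_digest(sign(key, record), tag)


key = b"demo-key-not-for-production"  # hypothetical shared secret
record = b"sensor=42,temp=21.5"       # hypothetical record
tag = sign(key, record)

ok = verify(key, record, tag)                           # unmodified record
tampered_ok = verify(key, b"sensor=42,temp=99.9", tag)  # modified record
print(ok, tampered_ok)
```

A tag like this detects modification but does not by itself record where the data came from or who accessed it; those questions need provenance and access logging.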

(16)

(17)

General Strategy:

System Design

[Figure: system design overlaid on the sources/users gap chart: ingest & enrichment, files, scheduler, computing, analytics (A–E), and user interface connect sources (things, providers) to users (humans, deciders)]

(18)

General Strategies:

Collection

[Figure: degree distribution, count versus degree on logarithmic axes, with maximum degree dmax marked]

Collect, Store, and Process only Useful Data

(19)

General Strategies:

Collection

[Figure: signal and noise in an N-D space]

Example background model: Power Law Graph

[Figure: degree distribution, count versus degree on logarithmic axes, with maximum degree dmax marked]

Intelligently Reduce the Amount of Data through Sampling Techniques

Collect, Store, and Process only Useful Data
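
The slides do not say which sampling technique is meant. As one example, reservoir sampling keeps a fixed-size uniform random sample of a stream without storing the stream itself. The sketch below is a standard "Algorithm R" implementation; the function name and the `seed` parameter are choices made for this example.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


sample = reservoir_sample(range(1_000_000), k=10, seed=0)
print(sample)
```

Memory use stays at k items however long the stream is, which is the point when volume and velocity rule out storing everything.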

(20)

General Strategy:

Privacy-Preserving Technology

[Figure: system architecture from things (providers) through ingest & enrichment, databases, raw data, scheduler, computing, analytics (A–E), and web to humans (deciders), annotated with threats]

Threats
•  Insider Threat
•  Data Loss / Exfiltration
•  Data Integrity Attack

(24)

General Strategy:

Privacy-Preserving Technology

Use Cryptographic Protocols to Protect the Confidentiality, Integrity, and/or Availability of Data

Lots of ongoing research

Popular techniques:

Fully Homomorphic Encryption

Multiparty Computation

Computing on Masked Data (CMD)

•  Cryptographic protections for NoSQL Accumulo database

•  Uses order preserving, deterministic and semantically secure encryption

[Figure: CMD workflow: plaintext query → encrypt → masked query → big data cloud (CMD) → masked analytic result → decrypt → plaintext analytic result]
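
To see why a deterministic scheme lets a server answer equality queries without seeing plaintext, consider the toy sketch below. It is not the CMD implementation: it swaps in a keyed HMAC-SHA256 token, which is deterministic but one-way, so nothing can be decrypted from it. The key, the word list, and the function names are assumptions made for illustration.

```python
import hashlib
import hmac


def mask(key: bytes, value: str) -> str:
    """Deterministic keyed token: equal inputs always give equal tokens."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


client_key = b"demo-key-held-by-client"  # hypothetical; never sent to the server

# The client masks values before upload, so the server stores only tokens.
words = ["earthquake", "Omg", "earthquake", "an", "earthquake"]
server_store = [mask(client_key, w) for w in words]


def server_count(store, masked_query):
    """Runs on the server: equality matching works on tokens alone."""
    return sum(1 for token in store if token == masked_query)


# The client masks the query, the server counts matches on masked data.
count = server_count(server_store, mask(client_key, "earthquake"))
print(count)
```

The trade-off is that deterministic masking reveals which stored values are equal to each other, which is why the slide lists it alongside semantically secure encryption rather than instead of it.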

(25)

(26)

Database Fundamentals

Database: Collection of data and supporting data structures

Database Management System (DBMS): Software that provides interface between user and database

•  Define new data and schema

•  Update data

•  Retrieve (Query) data

•  DB administration: set security and permissions

(27)

Database Fundamentals

ACID

•  Atomicity: each transaction either fully succeeds or fully fails

•  Consistency: all nodes see the same valid data at all times

•  Isolation: concurrent transactions result in the same system state as would be obtained from serial transactions

•  Durability: committed transactions remain committed

BASE

•  Basically Available, Soft-state services with Eventual consistency

[Figure: ACID versus BASE, with BigTable marked]
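
Atomicity is easy to observe with Python's built-in `sqlite3` module. In this sketch, a transfer that would drive a balance negative violates a `CHECK` constraint, and the whole transaction, including the update that had already succeeded, is rolled back. The table, account names, and amounts are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would make alice negative
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

After the failed transfer, bob's balance does not include the 500 that was credited before the constraint fired, which is exactly the "fully succeeds or fully fails" behavior.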

(54)

Database Fundamentals

CAP Theorem

Impossible for a distributed system to simultaneously provide:

•  Consistency

•  Availability

•  Partition Tolerance

(57)

Database Fundamentals

[Figure: timeline (1995 to 2016) with two tracks, Databases and Parallel Processing, showing Cluster, MapReduce, Hadoop, BigTable, Dremel, NoSQL, Pregel, D4M, Giraph, SQL, and NewSQL]

Slide Source: S. Sawyer, B. D. O'Gwynn, A. Tran, T. Yu. Understanding Query Performance in Accumulo. HPEC 2013.

(58)

Database Fundamentals

[Figure: performance versus consistency for Relational DB Systems, NoSQL DB Systems, and NewSQL DB Systems]

(59)
(60)

Relational Databases

What it Is

Database that stores information about data and how it is related

Highly structured, normalized, table-based database

Predefined schema/organization of data

Vertically scalable with good quality hardware

Uses SQL as the query interface

Typically provides full consistency

(61)

Relational Databases

Who Uses It

When to Use It

Dealing with transactional data

Problem sizes are moderate

Need for ACID guarantees

How to Use It

JDBC (Java Database Connectivity)

(62)

Relational Databases

Tweet Table

Tweet ID   User ID  Location ID  Tweet Text
096360448  67555    wwz4p7jd     Omg earthquake
544019456  67554    wwh1hss5     We're gonna have an earthquake
600791040  67556    wwwygbvq     Omg it's a earthquake

User ID  Username    Friends Count
67554    _zariaaa_   541
67555    gnvrly_ron  693
67556    yolvndv     424

Location ID  Latitude   Longitude
wwh1hss5     33.951186  -118.328370
wwwygbvq     37.754312  -122.164388
wwz4p7jd     38.337154  -122.670192
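
The three tables on this slide can be loaded into an in-memory SQLite database and recombined with a join. This is a sketch under stated assumptions: the table and column names (`tweets`, `users`, `locations`, and so on) are invented here, and tweet IDs are stored as text to keep the leading zero.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT, friends_count INTEGER);
CREATE TABLE locations (location_id TEXT PRIMARY KEY, latitude REAL, longitude REAL);
CREATE TABLE tweets (
    tweet_id    TEXT PRIMARY KEY,
    user_id     INTEGER REFERENCES users(user_id),
    location_id TEXT REFERENCES locations(location_id),
    tweet_text  TEXT
);
""")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [
    (67554, "_zariaaa_", 541),
    (67555, "gnvrly_ron", 693),
    (67556, "yolvndv", 424),
])
conn.executemany("INSERT INTO locations VALUES (?, ?, ?)", [
    ("wwh1hss5", 33.951186, -118.328370),
    ("wwwygbvq", 37.754312, -122.164388),
    ("wwz4p7jd", 38.337154, -122.670192),
])
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", [
    ("096360448", 67555, "wwz4p7jd", "Omg earthquake"),
    ("544019456", 67554, "wwh1hss5", "We're gonna have an earthquake"),
    ("600791040", 67556, "wwwygbvq", "Omg it's a earthquake"),
])

# Join the normalized tables back together: who tweeted "earthquake", and where?
rows = conn.execute("""
    SELECT u.username, l.latitude, l.longitude, t.tweet_text
    FROM tweets AS t
    JOIN users AS u ON u.user_id = t.user_id
    JOIN locations AS l ON l.location_id = t.location_id
    WHERE t.tweet_text LIKE '%earthquake%'
    ORDER BY t.tweet_id
""").fetchall()
for row in rows:
    print(row)
```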

(63)

NoSQL Databases

What it Is

Database based on documents, key-value pairs, graphs, or wide-column stores

Dynamic schema

Horizontal scalability

Typically provide “eventual consistency”

(64)

NoSQL Databases

Who Uses It

When to Use It

Large unstructured datasets

Strong need for high performance

Only require BASE guarantees

How to Use It

Python/Java bindings

Lincoln Laboratory D4M

(65)

NoSQL Databases

[Figure: tweet data stored in four NoSQL tables]

Edge Table: one row per tweet ID (096360448, 544019456, 600791040) and one column per field|value key:
FriendCount|424, FriendCount|541, FriendCount|693, Latitude|33.951186, Latitude|37.754312, Latitude|38.337154, Location|wwh1hss5, Location|wwwygbvq, Location|wwz4p7jd, UserID|67556, UserName|_zariaaa_, UserName|gnvrly_ron, UserName|yolvndv, Word|Omg, Word|a, Word|an, Word|earthquake

Transpose Table: one row per field|value key (FriendCount|424, FriendCount|541, Word|an, Word|earthquake shown) and one column per tweet ID

Degree Table

Key                 Degree
FriendCount|424     1
FriendCount|541     1
FriendCount|693     1
Latitude|33.951186  1
Word|an             1
Word|earthquake     3

Text Table

Tweet ID   Text
096360448  Omg earthquake
544019456  We're gonna have an earthquake
600791040  Omg it's a earthquake
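
The edge, transpose, and degree tables on this slide can be derived mechanically from the records. The sketch below builds them with plain dictionaries; it only illustrates the idea and uses no Accumulo or D4M API. The `records` layout and the `explode` function are assumptions for this example, and only a few of the slide's fields are included.

```python
from collections import Counter, defaultdict

# A few fields per tweet, keyed by tweet ID (values taken from the slide).
records = {
    "096360448": {"UserName": "gnvrly_ron", "FriendCount": "693",
                  "Word": ["Omg", "earthquake"]},
    "544019456": {"UserName": "_zariaaa_", "FriendCount": "541",
                  "Word": ["We're", "gonna", "have", "an", "earthquake"]},
    "600791040": {"UserName": "yolvndv", "FriendCount": "424",
                  "Word": ["Omg", "it's", "a", "earthquake"]},
}


def explode(records):
    """Build edge, transpose, and degree tables from field/value records."""
    edge = defaultdict(dict)       # row key -> {"field|value": 1}
    transpose = defaultdict(dict)  # "field|value" -> {row key: 1}
    degree = Counter()             # "field|value" -> number of rows containing it
    for row, fields in records.items():
        for field, values in fields.items():
            if isinstance(values, str):
                values = [values]
            for value in set(values):
                col = f"{field}|{value}"
                edge[row][col] = 1
                transpose[col][row] = 1
                degree[col] += 1
    return edge, transpose, degree


edge, transpose, degree = explode(records)

# "Which tweets mention earthquake?" becomes one row lookup in the transpose table.
matching = sorted(transpose["Word|earthquake"])
print(matching, degree["Word|earthquake"])
```

The degree table lets a query planner check how many rows a key will return before fetching them, which matters when some keys are far more common than others.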

(66)
(67)

Accumulo Design Drivers

Scalability

•  Near linear performance improvements at thousands of nodes

•  Durable and reliable under increased failures that come with scale

Diverse, Interactive Analytics

•  Sorted key/value core performs well in a diverse set of domains

•  Information retrieval, statistics, graph analysis, geo indexing, and more

Cell-Level Security

•  Express common security requirements in the infrastructure, not just in the application

•  Data-centric approach encourages secure sharing

Flexible, Adaptive Schema

•  Start with universal structures and indexing

•  Refine the schema over time
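
A sorted key/value core makes range scans cheap, because all keys with a common prefix sit next to each other. The sketch below is a single-process toy built on Python's `bisect` module, not Accumulo's API; the `SortedKV` class and its methods are hypothetical names for illustration.

```python
import bisect


class SortedKV:
    """Toy in-memory store that keeps its keys in sorted order."""

    def __init__(self):
        self._keys = []    # sorted list of keys
        self._values = {}  # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, stop):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedKV()
for key, value in [("Word|earthquake", 3), ("FriendCount|541", 1),
                   ("Word|an", 1), ("Latitude|33.951186", 1)]:
    store.put(key, value)

# Prefix scan: every key that starts with "Word|" ("~" sorts after letters).
words = list(store.scan("Word|", "Word|~"))
print(words)
```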

(68)

Accumulo Features

Visibility Labels

Iterators

Automatic table splitting

Support for Apache Thrift proxy

Visibility Iterator Table-split Thrift Schema D4M

volume ✓ ✓ ✓

velocity ✓ ✓ ✓

variety ✓ ✓ ✓

(69)

NewSQL Databases

What it Is

Database systems that emulate the performance of NoSQL along with the ACID guarantees of relational databases

Usually a scaled-up version of a relational database

Often uses an array data model

Other data models include graph-based data structures and distributed relational tables

May make use of in-memory processing or specialized hardware

(70)

NewSQL Databases

Who Uses It

When to Use It

Large multidimensional datasets

Data that doesn’t fit in traditional databases

Have the volume for NoSQL, but need ACID guarantees

How to Use It

Each has a custom API

(71)
(72)

SciDB

•  Massive Parallel Processing Database

•  Array data model

•  Complex analytics

•  Commodity clusters or cloud

•  R, Python, Matlab, Julia,…

(73)

SciDB Example Schema

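[Figure: example two-dimensional array with dimensions time (12342778213, 12342778214, 12342778215, …) and stock (“MSFT”, “MSFVX”, “MT”, …); each cell holds price and volume attributes, for example price: 15.76, volume: 200; price: 17.50, volume: null; price: 17.40, volume: 100; price: 234.2, volume: 10; price: 0.02, volume: null]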

Highly customizable to application

Each cell is a strongly-typed structure of attributes:

<int>, or <double, string, float>, or

Nullable attributes, empty cells, sparse, or dense
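
The array model can be mimicked in plain Python to show what "strongly-typed cells", "nullable attributes", and "empty cells" mean in practice. This sketch does not use SciDB or its query language. The `Cell` type and `total_volume` function are invented for illustration, and the pairing of each price/volume cell with a particular (time, stock) coordinate is an assumption, since the slide's layout does not survive in the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Cell:
    price: float
    volume: Optional[int]  # nullable attribute


# Sparse 2-D array: only non-empty cells are stored, keyed by (time, stock).
# Assumption: the coordinates assigned to each cell are illustrative only.
trades: Dict[Tuple[int, str], Cell] = {
    (12342778213, "MSFT"): Cell(price=15.76, volume=200),
    (12342778214, "MSFT"): Cell(price=17.50, volume=None),
    (12342778215, "MSFT"): Cell(price=17.40, volume=100),
    (12342778214, "MSFVX"): Cell(price=234.2, volume=10),
    (12342778213, "MT"): Cell(price=0.02, volume=None),
}


def total_volume(array, stock):
    """Sum volume along the time dimension for one stock, skipping nulls."""
    return sum(cell.volume
               for (_, name), cell in array.items()
               if name == stock and cell.volume is not None)


print(total_volume(trades, "MSFT"))
print((12342778215, "MT") in trades)  # an empty cell is simply absent
```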

(74)

SciDB Features

Massive Parallel Database

Array Data Model

Analytic language support

In-database analytics

MPP DB Array Languages Analytics

volume ✓ ✓ ✓ ✓

velocity ✓ ✓

variety ✓ ✓ ✓

(75)

Quick Reference

RDBMS vs. NoSQL vs. NewSQL

              Relational Databases                      NoSQL                                NewSQL
Examples      MySQL, PostgreSQL, Oracle                 HBase, Cassandra, Accumulo           SciDB, VoltDB, MemSQL
Schema        Typed columns with relational keys        Schema-less                          Strongly-typed structure of attributes
Architecture  Single-node or sharded                    Distributed, scalable                Distributed, scalable
Guarantees    ACID transactions                         Eventually consistent                ACID transactions (most)
Access        SQL, indexing, joins, and query planning  Low-level API (scans and filtering)  Custom API, JDBC, bindings to popular languages

Slide Source: S. Sawyer, B. D. O'Gwynn, A. Tran, T. Yu. Understanding Query Performance in Accumulo. HPEC 2013.

(76)

(77)

On The Horizon

New Technologies and Techniques

New database and processing technology such as:

•  Apache Spark: In-memory distributed processing

•  TileDB: Database for scientific big data

•  S-Store: Database tuned for streaming data

New cross-database and storage engine standards, APIs, and practices:

•  BigDAWG: An API to simplify big data analytics, currently being designed

•  GraphBLAS: An effort to standardize graph algorithms and databases

Advances in privacy-preserving technology:

•  SPED: Signal processing in the encrypted domain

•  Greater efficiency of protocols such as Functional Encryption and Multiparty Computation

Tools and technologies will continue to evolve, so it is important to keep students abreast of new developments

(78)

Conclusions

Lots of stuff going on!

Very important to understand details of your dataset, end analytic, and other requirements

Topics covered:

–  Challenge overview (What is the problem?)

–  Some general strategies

–  Databases

(79)

Leading Science and Engineering Research University

80 Nobel laureates, 50 National Medal of Science recipients

Thousands of companies (11th largest world economy)

1000 faculty, 10000 employees, 10000 students

$1.4B in annual external research funding
