
Data Growth. From 2000 to 2002, more data was generated than in the years before


Academic year: 2021


Data Growth

 From 2000 to 2002, more data was generated than in the 40,000 years before

 From 2004 to 2005, this data volume quadrupled again

 Data volume in 2012: 2.5 zettabytes, i.e. 10x the data volume of 2006!

 Data volume in 2020: 100 zettabytes

 Not only data volume is growing but, above all, data variety.

BitKom (2012)

(3)

3

Data, data, everywhere…

 Unstructured, coming from sources that haven’t been mined before

 Compounded by internet, social media, cloud computing, mobile devices, digital images…

 Exponential. Every 2 days we create as much data as from the dawn of civilisation to 2003*

 Hard to keep up. Communication operators managing petabyte scale expect 10-100x data growth in the next 5 years**

(4)

4

Generating statistical models out of high-volume databases

(5)

5

Smart Everything

- Smart “things”
- Smart places
- Smart networks
- Smart services
- Smart solutions

 “Smart-*” infrastructure

 need to make things Smart…!

 Requirements for “Smart Everything”

- Interactive (“tangible”) -> low latency
- High volume -> high throughput

(6)

6

…from smart phone to smart lenses

http://ngm.nationalgeographic.com

 novel Big Data Analytics apps with ms response time, incorporating local context as well as global state

your personal coupon arrived!!! Buy x get y free

(7)

7

Big Computing

 First phase of the next-generation HRSK - 7,000 cores

 Second phase (by end of March 2015) - >40,000 cores in total

(8)

8

Observation 1: Infrastructure

 Massive computing power in cloud/cluster environments

 Huge variety of „mobile/distributed“ devices

- Significant computing power in “mobile” devices

 Significant communication and computing capabilities

5G Lab

(9)

9

Observation 2: Computing Hardware

(1) Main memory and non-volatile memory as the main driver

 Main memory is KING, disk is DEAD

(2) Non-Uniform Memory Access requires data-centric database system architectures

 Shared Everything (within a box)

(3) Dark Silicon Effect allows for highly specific chip sets

 Application support on chip level (“DB on a chip”)

(10)

10

Observation 3: Data Production Process

 Different steps with quality gates - from raw data to knowledge extraction

Data acquisition -> Data extraction / cleaning -> Data integration / annotation -> Data analysis and visualization -> Interpretation
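The staged process with quality gates can be sketched as a chain of transformations, each followed by a check that must pass before the data moves on. This is only a minimal illustration: the stage functions, the gate predicates and the sample records are hypothetical and not taken from the slides.

```python
from typing import Callable, List, Tuple

Records = List[dict]
# A stage is (name, transformation, quality gate on the transformed data).
Stage = Tuple[str, Callable[[Records], Records], Callable[[Records], bool]]


def run_pipeline(raw: Records, stages: List[Stage]) -> Records:
    """Run the stages in order; stop as soon as a quality gate rejects the data."""
    data = raw
    for name, transform, gate in stages:
        data = transform(data)
        if not gate(data):
            raise ValueError(f"quality gate failed after stage: {name}")
    return data


# Hypothetical stage logic; only the stage names follow the slide.
def clean(records: Records) -> Records:
    """Drop records without a measured value."""
    return [r for r in records if r.get("value") is not None]


def annotate(records: Records) -> Records:
    """Attach a unit to every record."""
    return [{**r, "unit": "ms"} for r in records]


STAGES: List[Stage] = [
    ("extraction / cleaning", clean, lambda d: len(d) > 0),
    ("integration / annotation", annotate, lambda d: all("unit" in r for r in d)),
]

raw_data = [{"value": 12}, {"value": None}, {"value": 7}]
result = run_pipeline(raw_data, STAGES)
```

A gate that fails stops the run early, so bad data never reaches the analysis step.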

(13)

13

…a plea for specialized DB systems

 They are selling “one size fits all”

 Which is 30-year-old legacy technology that is good at nothing

Is/Was he right?

(almost 10 years ago!)

(14)

14

The Extremes

DBMS

 strict consistency

 internal data format (data lock-in)

 sophisticated access methods

 defined schema (semantics known to the system/optimizer)

 semantics of the operators known to the system (closed set of operators)

MR

 only read access, focus on scalability

 use (CSV) files as data container

 scan and shuffle methods

 schema defined during query time (schema on the fly)

 2nd order functions; semantics of the operators are totally unknown to the system, only a contract exists between operators and infrastructure

[Figure: operators, schema and data placed between Infrastructure and Application, for DBMS and for MR]
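The MapReduce-style extreme can be made concrete with a small sketch. The infrastructure below is a second-order function that knows only the contract (a mapper emits key/value pairs, a reducer folds the values of one key); the schema is imposed by user code while the CSV lines are scanned. All function names and the sample data are assumptions made for this illustration.

```python
import csv
import io
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

KV = Tuple[str, int]


def map_reduce(
    lines: Iterable[str],
    mapper: Callable[[str], Iterable[KV]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Second-order function: relies only on the mapper/reducer contract."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for line in lines:                      # scan the raw input
        for key, value in mapper(line):
            groups[key].append(value)       # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}


# User code: the schema (city, amount) exists only here, at query time.
def parse_sale(line: str) -> Iterable[KV]:
    city, amount = next(csv.reader(io.StringIO(line)))
    yield city, int(amount)


def total(key: str, values: List[int]) -> int:
    return sum(values)


csv_lines = ["Dresden,10", "Berlin,5", "Dresden,7"]
totals = map_reduce(csv_lines, parse_sale, total)
```

Because `map_reduce` never inspects what the operators do, it cannot reorder or optimize them the way a DBMS optimizer can with its closed operator set.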

(15)

15

Limits of classical DB systems

 preserving consistency in a distributed environment is costly…

 ensure serializability even if the application can ensure no conflicting writes

 necessity to put data completely under the control of the system (effort to load data into a database system, perform runstats, …)

 no native support for regular CSV files

 -> optimize time to query

 need to follow the “data comes first, schema comes second” principle

 … with the data model – the tabular model is still very popular (with flexibility)

 … with the query model – SQL is just fine (everybody knows SQL)

- NoSQL systems are hard to program, e.g. Cassandra 1.0 did not ensure consistency within a row!

- responsibility is left to the application programmer (e.g. store redundant hash codes to compare versions at the application level)

 R1 recovery for queries/statements, no easy compensation for losing a node
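The workaround of storing redundant hash codes at the application level might look roughly like the sketch below. It assumes a plain dictionary as a stand-in for the key-value store; no real Cassandra API is used, and the `_digest` column name is hypothetical.

```python
import hashlib
import json
from typing import Dict, Optional

Row = Dict[str, str]


def row_digest(columns: Row) -> str:
    """Hash over all columns, serialized in a stable key order."""
    payload = json.dumps(columns, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def write_row(store: Dict[str, Row], key: str, columns: Row) -> None:
    """Store the row together with a redundant digest column."""
    store[key] = {**columns, "_digest": row_digest(columns)}


def read_row(store: Dict[str, Row], key: str) -> Optional[Row]:
    """Return the row only if its columns still match the stored digest."""
    stored = dict(store[key])
    digest = stored.pop("_digest")
    return stored if row_digest(stored) == digest else None


store: Dict[str, Row] = {}
write_row(store, "user:1", {"name": "Ada", "city": "Dresden"})
consistent = read_row(store, "user:1")

# Simulate a partially applied write: one column changed, digest did not.
store["user:1"]["city"] = "Berlin"
torn = read_row(store, "user:1")
```

The check detects a torn row but cannot repair it; retrying or reconciling versions remains the application's job, which is exactly the burden the slide points out.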

(16)

16

Impact on Database Systems

Extreme data

Extreme performance

Dynamo

“Three things are important in the database world: performance, performance and performance.” Bruce Lindsay

(18)

18

Apache Data Management Family

Apache Spark

(24)

24

A Look at Hardware Trends – 201x

(25)

25

A Look at Hardware Trends - 2015

(1) Extreme NUMA Effects

(2) Application-Specific Instruction Sets

(3) Storage-Class Memory

[Figure labels: System Level, Component Level]


(27)

27

TA versus Data-oriented Architecture (DORA)

[Figure: Transactions, Indirection, Data]

Pros & Cons

Transaction-Oriented Architecture (shared-everything)

 Lack of scalability

 No load balancing & indirection required

 Energy proportional by design

 Well investigated and widely deployed

Data-Oriented Architecture (mixed everything & shared-nothing)

 Scales on massive parallel systems

 Load balancing and indirection required

 Not energy proportional by design

Challenges

(1) Speed up load balancing and indirection to work efficiently for in-memory systems

(2) How to make the data-oriented architecture energy proportional

Which Architecture?
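To make the data-oriented idea concrete: each partition is owned by exactly one worker, and every request passes through an indirection step that finds the owner. The sketch below is a toy model under stated assumptions (hash partitioning by key, Python threads and queues as the message layer, hypothetical class names); it is not the design of any particular system.

```python
import queue
import threading
from typing import Dict, List, Optional


class PartitionWorker(threading.Thread):
    """Owns one partition; only this thread touches `self.data` while running."""

    def __init__(self) -> None:
        super().__init__(daemon=True)
        self.inbox: queue.Queue = queue.Queue()
        self.data: Dict[int, str] = {}

    def run(self) -> None:
        while True:
            message = self.inbox.get()
            if message is None:              # shutdown signal
                break
            op, key, value, reply = message
            if op == "put":
                self.data[key] = value
                reply.put(True)
            else:                            # "get"
                reply.put(self.data.get(key))


class Router:
    """Indirection layer: maps a key to the worker owning its partition."""

    def __init__(self, num_partitions: int) -> None:
        self.workers: List[PartitionWorker] = [
            PartitionWorker() for _ in range(num_partitions)
        ]
        for worker in self.workers:
            worker.start()

    def _owner(self, key: int) -> PartitionWorker:
        return self.workers[key % len(self.workers)]

    def put(self, key: int, value: str) -> None:
        reply: queue.Queue = queue.Queue(maxsize=1)
        self._owner(key).inbox.put(("put", key, value, reply))
        reply.get()                          # wait for the acknowledgement

    def get(self, key: int) -> Optional[str]:
        reply: queue.Queue = queue.Queue(maxsize=1)
        self._owner(key).inbox.put(("get", key, None, reply))
        return reply.get()

    def shutdown(self) -> None:
        for worker in self.workers:
            worker.inbox.put(None)
        for worker in self.workers:
            worker.join()


router = Router(num_partitions=4)
for k in range(8):
    router.put(k, f"row-{k}")
value = router.get(5)
router.shutdown()
sizes = [len(w.data) for w in router.workers]
```

No locks protect the partition data because ownership is exclusive; the price is the routing step on every access, and a skewed key distribution would require rebalancing partitions between workers.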

(28)

28

ERIS Data Management Core

 data-oriented architecture (via message passing)

 NUMA awareness

 heterogeneous hardware

 aggressive elasticity strategies

 dynamic data placement policies

dynamic loading


(30)

30

What’s Next? Wireless Interconnects!

 Optical interconnects, …

 High-speed, short-range wireless links

Antennas and Wave Propagation


(32)

32

xPU Developments and Consequences


http://upload.wikimedia.org/wikipedia/en/c/ce/Clock_CPU_Scaling.jpg

(33)

33

Motivation of „DB Processor“

 HW/SW co-design based on customizable processor

 Application-specific ISA extensions

(34)

35

Selectivity: Intersection

[Figure: intersection at varying selectivity; series: DBA_2LSU_EIS w/ partial loading, DBA_1LSU_EIS w/ partial loading, DBA_2LSU_EIS w/o partial loading, DBA_1LSU_EIS w/o partial loading]

Processor variants: 108Mini, DBA_1LSU, final processor

Extensions: +1 load-store unit; data bus 32 -> 128 bit; + partial loading; + extended ISA

(35)

36

Timing and Area

Relative Area Consumption (DBA_2LSU_EIS)

(36)

37

Comparison

[Figure: comparison with annotated improvements of 7x, 963x and 175x]


(38)

39

Storage Class Memory / Non-Volatile RAM

 Examples: MRAM (IBM), FRAM (Ramtron), PCM (Samsung)

 Merging point between storage and memory

Adapted from: M. K. Qureshi, V. Srinivasan, and J. A. Rivers. Scalable high performance main memory system using phase-change memory technology. In ISCA 2009

 ~4x denser than DRAM

 SCM does not consume energy if not used

 SCM is cheaper, persistent, byte-addressable

 Number of writes is limited (life expectancy 2-10 years)

 SCM has higher latency than DRAM

- Read latency ~2x slower than DRAM
- Write latency ~10x slower than DRAM
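The relative latencies suggest that the raw slowdown of SCM depends on the read/write mix. A back-of-the-envelope model, assuming DRAM reads and writes both cost one unit and ignoring caches and prefetching (which hide much of the latency in practice), could be written as follows. The function name and the 10% write share are illustrative assumptions.

```python
def scm_slowdown(write_fraction: float,
                 read_factor: float = 2.0,
                 write_factor: float = 10.0) -> float:
    """Average raw access cost on SCM relative to DRAM (DRAM = 1.0).

    read_factor and write_factor default to the ~2x and ~10x figures
    quoted on the slide; write_fraction is the share of writes.
    """
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    read_fraction = 1.0 - write_fraction
    return read_fraction * read_factor + write_fraction * write_factor


read_only = scm_slowdown(0.0)   # pure reads: the read factor itself
mixed = scm_slowdown(0.1)       # 90% reads, 10% writes: 0.9*2 + 0.1*10
```

Even a small share of writes dominates the average, which is one reason write-heavy structures are the harder case for SCM.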

(39)

40

SCM Access // File Level

 Application may distinguish between

- Traditional memory

- Persistent memory blocks

 File names are used as identification handle

 Shows how to get persistent memory to the application level

 Requires a persistent memory-aware file system

 Direct access to regions of
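File-level access with a file name as the handle can be imitated with an ordinary memory-mapped file. The sketch below uses Python's `mmap` module on a regular temporary file, so it only emulates the programming model; on real SCM the file would live on a persistent memory-aware file system. The file name and region size are hypothetical.

```python
import mmap
import os
import tempfile

REGION_SIZE = 4096  # hypothetical region size for this example


def open_region(path: str, size: int = REGION_SIZE) -> mmap.mmap:
    """Map the file named `path` into memory, creating it if needed."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        if os.fstat(fd).st_size < size:
            os.ftruncate(fd, size)       # reserve space for the region
        return mmap.mmap(fd, size)
    finally:
        os.close(fd)                     # the mapping keeps its own handle


path = os.path.join(tempfile.mkdtemp(), "region.pmem")

# First "run": write through plain memory stores, then make them durable.
region = open_region(path)
region[0:5] = b"hello"
region.flush()
region.close()

# Second "run": the same file name leads back to the same bytes.
reopened = open_region(path)
recovered = bytes(reopened[0:5])
reopened.close()
```

The application addresses the region byte by byte like normal memory, while the file name is what lets it find the data again after a restart.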

(40)

41

Hybrid Storage Architecture for Column Stores

 Write optimized store (WOS)

- Dictionary compressed

- Unsorted dictionary

- Efficient B-tree structures

 Read optimized (compressed) store (ROS)

- Compression schemes according to existing data distribution

- Sorted dictionary

[Figure: update/insert/delete enter the WOS; merge / tuple mover moves data into the ROS; REDO log, savepoint, data area]
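The difference between an unsorted and a sorted dictionary can be shown with a small sketch. It assumes a single string column, uses a Python dict where the slide mentions B-tree structures, and reduces the merge to a simple re-encoding; class and function names are hypothetical.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple


class WriteOptimizedColumn:
    """Unsorted dictionary: a new value gets the next free code, so inserts are cheap."""

    def __init__(self) -> None:
        self.dictionary: List[str] = []     # code -> value, in arrival order
        self.code_of: Dict[str, int] = {}   # value -> code (stand-in for an index)
        self.codes: List[int] = []          # the encoded column

    def insert(self, value: str) -> None:
        if value not in self.code_of:
            self.code_of[value] = len(self.dictionary)
            self.dictionary.append(value)
        self.codes.append(self.code_of[value])


def merge_to_read_optimized(wos: WriteOptimizedColumn) -> Tuple[List[str], List[int]]:
    """Re-encode with a sorted dictionary, so code order equals value order."""
    sorted_dict = sorted(wos.dictionary)
    remap = {
        old_code: bisect_left(sorted_dict, value)
        for old_code, value in enumerate(wos.dictionary)
    }
    return sorted_dict, [remap[code] for code in wos.codes]


wos = WriteOptimizedColumn()
for city in ["Dresden", "Berlin", "Dresden", "Aachen"]:
    wos.insert(city)

ros_dictionary, ros_codes = merge_to_read_optimized(wos)
```

With the sorted dictionary, a range predicate on values becomes a range predicate on integer codes, which suits scans; the unsorted variant avoids re-encoding the column on every insert.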

(41)

42

NVRAM for ROS-Structure

 With prefetching: average penalty for using SCM instead of DRAM is only 8%.

 Without prefetching: average penalty for using SCM instead of DRAM is 41%.

 For operators with sequential memory access patterns, SCM performs almost as well as DRAM

(42)

43

NVRAM for WOS-Structure

 Skip List read/write performance on DRAM and SCM

 47% penalty for reads, and 43-47% penalty for writes for using SCM instead of DRAM.

 Writing persistent and concurrent data structures is NOT trivial

(43)

44

Experiment: Recovery Performance

Different recovery schemes. TATP scale factor 500, 4 users. The database is crashed at second 15.

 Scenario 1: rebuild all secondary data structures before starting to answer queries.

 Scenario 2: rebuild secondary data structures in the background and start answering queries immediately using primary, persistent data. Recovery area decreased by 16%.

 Scenario 3: similar to scenario 2, with 40% persistent secondary data structures. Recovery area decreased by 82%. Throughput decreased by 14%.

 Scenario 4: all secondary data structures are persistent. Recovery area decreased by 99.8%.

(45)

46

Summary and Conclusion

 In general – Big Data is an enabler! – NOT a final product

 Let’s head for new frontiers!

 Significant developments on infrastructure level

 Significant developments in the hardware sector

- HTM, SCM, etc.

- Heterogeneous systems (specialized cores)

(46)
