## Big Data & Scripting

## storage networks and distributed file systems

• in the remainder we use networks of computing nodes to enable computations on even larger datasets

• for a computation, each node will work on the part of the dataset that is locally available to the node

• computing nodes will have a partial, local copy of the whole dataset

• an optimal scenario distributes the data in advance and uses the nodes in parallel for both storage and computation

### general setting

• nodes connected by network

• each node has external memory (e.g. hard disk)

• in addition: internal memory and computing capacity

in this part we consider only storage and distribution of data

**design issues for storage networks**

• Space and Access balance

– even distribution of data to machines

• Availability

– implement redundancy and tolerance for data loss

• Resource Efficiency

– use resources in useful way (don’t waste space)

• Access Efficiency

– provide fast access to stored data

• Heterogeneity

– integrate different types of hardware

• Adaptivity

– storage of growing amounts of data

• Locality

– minimize degree of communication for data access

**storage networks – model**

• $n$ nodes $N_1, \dots, N_n$

• node $N_i$ has capacity $C_i$

• total capacity: $S = \sum_{i=1}^{n} C_i$, i.e. space for $S$ blocks in total

• blocks stored on $N_i$: $F_i$ (filling state)

• nodes are connected by network:

$N_i$ can send data to $N_j$ for arbitrary $i, j$

• data is accessed by users from outside:

– retrieve a set of blocks (for now)

– retrieve the result of an operation on a set of blocks (later)

**balancing problem:**

• consider a simplified scenario with $C_i$ constant, i.e. all nodes have the same capacity

• distribute $m$ blocks to $n$ nodes subject to:

• minimize $\sum_i |F_i - m/n|$ (close to equal distribution) and

• minimize $\max_i F_i$ (minimize maximum load)

**striping**

• all objects combined to single stream of data

• divide the data into blocks $B_i$

• divide the blocks into striping units $U_i$ of $k$ blocks each

• store striping unit $i$ on node $N_{i \bmod n}$ at position $\lfloor i/n \rfloor$

[figure: striping units 0–15 written round-robin across four nodes D1–D4; stripe 0 and stripe 1 marked; labels: block, stripe, unit]

advantage: units in one stripe can be read in parallel
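The addressing scheme above is simple enough to sketch in a few lines; `stripe_address` is a hypothetical helper, not part of any real file system:

```python
# Striping address scheme: unit i goes to node (i mod n) at position
# (i div n), so consecutive units of one stripe sit on different nodes.

def stripe_address(unit: int, n: int) -> tuple[int, int]:
    """Return (node, position) for striping unit `unit` on n nodes."""
    return unit % n, unit // n

# units 0..7 on n = 4 nodes: stripe 0 is units 0-3, stripe 1 is units 4-7
placement = [stripe_address(i, 4) for i in range(8)]
print(placement)
# units 0-3 land on nodes 0-3 at position 0, units 4-7 at position 1
```

Because the node index depends only on `unit mod n`, all units of one stripe can be read in parallel, which is exactly the advantage stated above.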

**striping: size of striping unit k?**

assumptions:

• operations tend to involve adjacent blocks

example: one big file (e.g. large csv table) spanning several blocks

• several data accesses happen in parallel, e.g. different users using different files

• small $k$

– high bandwidth (a single access is served by many nodes in parallel)

– but many parallel accesses block each other

• large $k$

– low bandwidth (most files reside on a single node)

– parallel accesses (to different files) are distributed among the nodes

the choice of $k$ depends only on the access structure and the average node performance¹

¹ Chen, Patterson, *Maximizing performance in a striped disk array*, 1990

**striping: advantages/disadvantages**

### advantages

• perfectly balanced data distribution

• simple addressing/storage scheme

### disadvantages

• modifying stored data (blocks)

– block deletion leaves holes; new data goes to the end or into the holes

– fragmentation (additional indexing required)

• adding and removing nodes (machines)

– addition could be handled by a new striping

– removal leads to (partial) redistribution

• solutions exist, but striping is best suited for static scenarios

**balancing: centralized approach**

### idea

• one central address and positioning node **master**

• coordinates all data access, knows state of nodes

• store new blocks to nodes with lowest filling state

• adding/removing storage nodes is straightforward

### data access:

• client sends operation to server (read/write, add, delete)

• server answers with address of node to interact with

• operation is executed between client and node
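The centralized scheme can be sketched as follows; all names (`Master`, `place`, `locate`) are made up for illustration. The master only answers "which node?"; the data transfer itself then happens directly between client and storage node:

```python
# Centralized balancing sketch: one master tracks fill states and the
# block-id -> node dictionary; new blocks go to the least filled node.

class Master:
    def __init__(self, n_nodes: int):
        self.fill = [0] * n_nodes          # filling state F_i per node
        self.directory = {}                # block id -> node

    def place(self, block_id) -> int:
        """Assign a new block to the node with the lowest filling state."""
        node = self.fill.index(min(self.fill))
        self.directory[block_id] = node
        self.fill[node] += 1
        return node

    def locate(self, block_id) -> int:
        """Answer a read request with the node storing the block."""
        return self.directory[block_id]

master = Master(3)
nodes = [master.place(b) for b in "abcd"]
print(nodes)                # blocks spread evenly over the 3 nodes
print(master.locate("a"))   # read request: master returns the node id
```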

**centralized approach: advantages/disadvantages**

### advantages

• optimal data distribution can be guaranteed

• operations can be synchronized

### disadvantages

• the address and positioning node is a bottleneck

• one centralized dictionary block id → node

• we return to access schemes later

**balancing: distribution by hashing**

• treat nodes as bins, use a hash function $h()$ for distribution

• write block $B$ to node $N_{h(B)}$

• load factor $\alpha \gg 1$ (many blocks per node)

### the “balls to bins” model

• usual assumption in hashing: $\alpha < 1$, avoid collisions

• here: $\alpha \gg 1$, achieve a balanced distribution of blocks (balls) to nodes (bins)

optimal distribution: $m/n$ blocks (out of $m$) on each node (out of $n$)

question: can we guarantee that the maximum number of elements in one bin is not too large?

**balancing: distribution by hashing**

when using the distribution of a hash function directly, the fill states of the bins tend to be unbalanced

[figure: bin fill state, m = 10,000 blocks in n = 100 bins; x-axis: bin, y-axis: (blocks in bin) − m/n]

experiment: distribute 10,000 blocks to 100 bins
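The experiment can be reproduced in a few lines; a sketch that stands in for a uniform hash with a pseudo-random generator:

```python
import random

# Balls-to-bins experiment: throw m blocks into n bins uniformly at
# random and look at the deviation from the optimum m/n per bin.

random.seed(0)
m, n = 10_000, 100
fill = [0] * n
for _ in range(m):
    fill[random.randrange(n)] += 1   # direct hash placement

deviation = [f - m // n for f in fill]
print(max(deviation), min(deviation))  # typically tens of blocks off optimum
```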

**balancing: distribution by hashing**

the "simple case": $m$ elements, $n$ bins, $m > n \log n$,
**assumption: $h(x)$ uniformly distributed**

**then with high probability:**

• expected number of elements in each bin $B_i$: $m/n$

• $\exists$ bin with $m/n + \Theta\!\left(\sqrt{m \ln(n)/n}\right)$ elements

• i.e. an additional load of more than $\sqrt{m/n}$ (compared to the optimum)

### "with high probability"

In a system with some parameter $n$, an event $X$ appears *with high probability* if $P(X) \geq 1 - \frac{1}{n^{\alpha}}$ for some constant $\alpha > 0$.

similar cases are often denoted as $P(X) = 1 - o(1)$

**balancing: greedy improvement**

• the expected load $O(m/n)$ in each node is good

• bins with higher load can block computations and data access

**improvement: greedy(d)**

• for each block, choose $d \geq 2$ candidate nodes $N_{i_1}, \dots, N_{i_d}$

• find $b = \arg\min_{k \in \{1,\dots,d\}} F_{i_k}$ (break ties arbitrarily)

• place the block on $N_{i_b}$

**example:** consider bin $h(B)$ and the neighboring bins to its left and right

**retrieval:** recalculate the candidate addresses and test all of them (in parallel)
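The placement step can be sketched as follows (stand-alone illustration; ties are broken by the first minimum, and candidates are drawn uniformly rather than from fixed hash functions):

```python
import random

# greedy(d) placement: for each block draw d candidate bins and put the
# block into the least filled one.

def greedy_insert(fill: list[int], d: int, rng: random.Random) -> int:
    candidates = [rng.randrange(len(fill)) for _ in range(d)]
    best = min(candidates, key=lambda i: fill[i])   # lowest fill state
    fill[best] += 1
    return best

rng = random.Random(0)
m, n = 10_000, 100
fill = [0] * n
for _ in range(m):
    greedy_insert(fill, d=2, rng=rng)

# the spread around m/n = 100 is far tighter than with direct hashing
print(max(fill) - min(fill))
```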

**balancing: greedy improvement**

experiment: comparing default choice and greedy improvement

[figure: bin fill states, m = 10,000 blocks in n = 100 bins (2 alternatives in greedy); x-axis: bin, y-axis: (blocks in bin) − m/n; legend: direct vs. greedy, with the greedy fill states much closer to the optimum]

each greedy insert uses the bin among $h(B)-1$, $h(B)$, $h(B)+1$ with minimal fill state

**analysis of greedy(d)**²

### theorem: maximal load

Insert $m$ blocks into $n$ nodes using greedy($d$); then with high probability: $\max_i F_i = m/n + \Theta(\ln\ln(n)/\ln(d))$

### theorem: number of “overloaded” bins

Let $\gamma$ be a suitable constant. If $m$ balls are distributed into $n$ bins using strategy greedy($d$), then with probability $> 1 - \frac{1}{n}$ at most $n \cdot \exp(-d^{\,i})$ bins have load $> \frac{m}{n} + i + \gamma$.

1. the maximal load is not too extreme

2. only few bins with much more than the optimal load exist

² c.f. Berenbrink, Czumaj, Steger, Vöcking, *Balanced allocations: The heavily loaded case*, 2000

**heterogeneity**

• implicit assumption above: $C_i = C_j$

– all nodes have equal capacities

– a useful assumption, but not realistic

• heterogeneity:

– arbitrary hardware for nodes

– in general $C_i \neq C_j$ (differing capacities)

• load balancing is more complicated

• more freedom of hardware choice

– e.g. upgrade with constantly larger nodes

**heterogeneity: virtual buckets**

the hashing approach can be extended to heterogeneous settings by subdividing all node capacities into *virtual buckets*

• choose a largest common storage unit $C$ as the size of a virtual bucket

• the real capacities $C_i$ should be approximate multiples of $C$: $C_i \approx k_i \cdot C$ with $k_i \in \mathbb{N}$

• every node $N_i$ is split into $k_i$ buckets s.t. $K = \sum_i k_i$ ($K$ is the total number of buckets)

• the hash function maps blocks to $\{1, \dots, K\}$ (buckets)

• second mapping $m : \{1, \dots, K\} \to \{1, \dots, N\}$ with $|m^{-1}(i)| = k_i$

– map the $K$ buckets to the $N$ nodes

– the number of buckets for each node corresponds to the node size
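The two-step mapping can be sketched like this (helper names are made up; for simplicity, buckets and nodes are numbered from 0 and each node owns a consecutive bucket range):

```python
# Virtual buckets for heterogeneous capacities: node i with capacity
# C_i ≈ k_i * C owns k_i buckets; blocks are hashed to buckets, and a
# second mapping sends buckets back to nodes.

def build_bucket_map(k: list[int]) -> list[int]:
    """Second mapping m: bucket id -> node id, node i owning k[i] buckets."""
    mapping = []
    for node, k_i in enumerate(k):
        mapping.extend([node] * k_i)
    return mapping

k = [1, 2, 4]                  # capacities C, 2C, 4C  ->  K = 7 buckets
bucket_to_node = build_bucket_map(k)
print(bucket_to_node)          # [0, 1, 1, 2, 2, 2, 2]

def node_for_block(block_hash: int) -> int:
    bucket = block_hash % len(bucket_to_node)   # hash into {0, ..., K-1}
    return bucket_to_node[bucket]

print(node_for_block(10))      # bucket 3 belongs to node 2
```

A uniform hash over buckets then loads each node in proportion to its capacity, which is the point of the construction.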

**availability: prevent data loss**

avoid loss of data, i.e. ensure that stored data is available

### motivational example

• storage network with $N$ uniform nodes

• probability of node failure within one month is $p$, so $P(\text{node survives a month}) = 1 - p$

• $P(N \text{ nodes survive } k \text{ months}) = (1 - p)^{N \cdot k}$

• the failure probability is exponential in the number of nodes and in time: failures will happen eventually

• this cannot be avoided with fail-safe hardware; use redundancy to handle failures
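The formula above is easy to evaluate numerically (the function name is hypothetical); even a small monthly failure probability makes long-term survival of all nodes unlikely:

```python
# P(N nodes survive k months) = (1 - p)^(N * k), as in the model above.

def p_all_survive(p: float, n_nodes: int, months: int) -> float:
    return (1 - p) ** (n_nodes * months)

# e.g. p = 1% per node and month, N = 100 nodes, one year:
print(p_all_survive(0.01, 100, 12))   # vanishingly small
```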

**availability: implementing redundancy**

### basic principle

• store additional information (more than only the given data)

• use that information to recover in case of partial data loss

### two basic approaches

• mirroring

– store data elements several times

• parity codes

– create additional information to recover missing bits

**availability: redundancy by mirroring**

### idea (simple version)

• for each block store $r$ duplicates on different nodes

• failure rate for one node $p$ ⇒ probability of losing a block: $p^r$

• problem: need $r \cdot m$ space instead of $m$

• when node fails:

create copies of all blocks on failed node from duplicates

• on update of nodes:

update all duplicates

**availability: parity codes**

• assume a string of bits $s = s_1 s_2 s_3 \dots s_n$, e.g. 0110101001001110

• parity: $p(s) = \sum_i s_i \bmod 2$, e.g. 0

• if one bit of $s$ is lost, e.g. $s' = s_1 x s_3 \dots s_n$: was $x = 1$ or $x = 0$?

use the parity of the available part:

• $x = \begin{cases} 0 & \text{if } p(s) = p(s') \\ 1 & \text{else} \end{cases}$

• one additional bit allows recovery of one arbitrary lost bit

• can be extended to larger amounts of missing bits

• one example: Hamming code

store additional bits instead of duplicates and restore on data loss

• often implemented on hardware level
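The single-bit recovery rule can be sketched directly (a stand-alone illustration, not a full Hamming code):

```python
# Parity-bit recovery: the lost bit x is 0 iff the surviving bits
# already reproduce the stored parity of the full string.

def parity(bits: list[int]) -> int:
    return sum(bits) % 2

def recover_bit(available: list[int], stored_parity: int) -> int:
    return 0 if parity(available) == stored_parity else 1

s = [0, 1, 1, 0, 1, 0, 1, 0]
p = parity(s)                         # one extra bit, stored with the data
lost_index = 4
survivors = s[:lost_index] + s[lost_index + 1:]
print(recover_bit(survivors, p))      # reconstructs s[4] = 1
```

The same check works for any single lost position, which is why one parity bit suffices for one arbitrary lost bit.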

**adaptivity**

• capacity is constantly extended by adding nodes

• problem: rehashing for every new node is too expensive

### idea: adaptive hash function

• hash function with adaptive range

• change of range avoids total reorganization, but rearranges only (small) portion of input values

when new nodes are added, only a few blocks have to be rearranged

**adaptivity: adaptive hashing**

### basic idea

• position the nodes in a space $S$

• for each block, determine a position in $S$ by a hash function

• store block on nearest node

• find nearest position for arbitrary point by binary search

• adapt to new/removed nodes:

– remove/add points in the space

– reassign neighboring blocks

• problem:

– when a node is removed, all its blocks go to its neighbor(s)

– when a node is added, it takes a huge load from its neighbors

refine using multiple positions for each node

**adaptivity: adaptive hashing**

• use the one-dimensional ring $[0, 1)$ as space (distance computed modulo 1)

• assign $k$ positions to each node $i$: $P^i_1, \dots, P^i_k$

• every block is mapped to a $[0, 1)$-position by a hash function $h$

**• block positioning**

– determine the hash value $h(B)$ for block $B$

– assign block $B$ to the nearest node by position:

$\arg\min_i \min_j \min\{\,|h(B) - P^i_j|,\; 1 - |h(B) - P^i_j|\,\}$

**• adding a node**

– create new positions for node

– reassign blocks from neighboring positions

**• removing a node**

– reassign its blocks

– remove its positions, remove the node
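The scheme above (commonly known as consistent hashing) can be sketched as follows; the helper names and the choice of SHA-256 are assumptions for illustration:

```python
import hashlib

# Adaptive hashing on the ring [0, 1): each node gets k positions derived
# from hash values; a block goes to the node owning the nearest position
# in ring distance.

def h(key: str) -> float:
    """Map a key to [0, 1) via a cryptographic hash."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def ring_distance(a: float, b: float) -> float:
    d = abs(a - b)
    return min(d, 1 - d)          # shorter way around the ring

def build_positions(nodes: list[str], k: int) -> list[tuple[float, str]]:
    """k pseudo-random positions P^i_1 .. P^i_k per node."""
    return [(h(f"{node}#{j}"), node) for node in nodes for j in range(k)]

def node_for(block: str, positions: list[tuple[float, str]]) -> str:
    pos = h(block)
    return min(positions, key=lambda p: ring_distance(pos, p[0]))[1]

positions = build_positions(["N1", "N2", "N3"], k=4)
print(node_for("block-42", positions))   # deterministic owner among N1..N3
```

Adding or removing a node only inserts or deletes its $k$ positions, so only the blocks nearest to those positions need to move.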

**adaptivity: adaptive hashing**

• the points $P^i_j$ of node $i$ can be determined by hash functions

• for each insertion, a search for the nearest point has to be done

• until now: homogeneous setting ($C_i$ constant)

• heterogeneous settings:

– model different sizes by additional points

– reflect capacity by a corresponding number of points, using the "virtual buckets" approach (this leads to a large number of points)