CS 91: Cloud Systems & Datacenter Networks: Failures & Replication

(2)

Types of Failures

•  “Fail-stop”: process/machine dies and doesn’t come back. Relatively easy to detect. (Often planned.)

•  Performance degradation: something has failed that makes it slow, but it’s still correct (straggler). Harder to detect.

•  “Byzantine”: process has failed but is still running. Might be incorrect (spewing garbage) or even malicious. Can’t be trusted. VERY hard to detect.

(3)

Failure Impact: Availability

•  The degree to which your system is operating

•  Options aren’t necessarily discrete:

– Fully operational: all is good, full capacity

– Down: all is broken, no service for anybody

– In-between: service is degraded, but accessible

•  Example: 100 servers needed to handle load,

(6)

Failure Sources

•  Hardware (often disks, why?)

•  https://www.youtube.com/watch?

(7)

Failure Sources

•  Hardware (often disks, why?)

•  Software bugs

•  Configuration / human mistakes

•  Network (Internet) connectivity

•  Planned maintenance

(8)

Cloud / Datacenter Scale

•  So, how reliable must our hardware and software be to become “reliable enough”?

•  If it’s not 100%, then it doesn’t really matter…

•  Even if the failure rate of any one thing is really low, there are SO MANY things in a datacenter that something will fail soon.
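
The scale argument can be checked with a little arithmetic. The sketch below assumes every component fails independently with the same probability over some time window; both the independence assumption and the example numbers are illustrative and do not come from the slides.

```python
def prob_any_failure(p_single: float, n_components: int) -> float:
    """Probability that at least one of n components fails.

    Assumes failures are independent and equally likely (an idealization).
    """
    return 1.0 - (1.0 - p_single) ** n_components


if __name__ == "__main__":
    # Hypothetical per-component failure probability for one time window.
    p = 1e-4
    for n in (1, 100, 10_000, 100_000):
        print(f"{n:>7} components: P(at least one failure) = "
              f"{prob_any_failure(p, n):.4f}")
```

Even with a tiny per-component rate, the probability of seeing at least one failure approaches 1 as the component count grows.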

(9)

Fault-tolerant Software

•  With so many failure sources, it’s critical that software be made reliable.

•  Pros:

– can handle unexpected failures

– can handle planned maintenance, (de)commissions

(10)

Fault-tolerant Software

•  Common solution:

(What’s the most important principle in systems design?)

(11)

Fault-tolerant Software

•  Common solution: Abstraction!

– Hide complex details whenever possible

•  Typically build a layer of software infrastructure that can handle common failures

•  Build application logic on top of that

(12)

Seen Before: ISIS (+ others)

ISIS Software Reliability Layer

The realities of networks and distributed systems…

Important system that doesn’t want to worry

(13)

Will See Again: Harp (+ others)

Harp Reliability Layer

(14)

In General

Layer that does something to handle some failures.

(15)

Failure: what can we do?

•  Suppose you’re soon to take an exam, but you’re worried about your pencil breaking (let’s say it’s a 20% chance)

•  Easy solution: bring multiple (equal) pencils

•  Redundancy: chances that they all break is
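
The pencil arithmetic can be sketched as follows, assuming each pencil breaks independently with the same 20% chance (independence is an assumption added here). The helper names are hypothetical.

```python
def prob_all_fail(p_single: float, copies: int) -> float:
    """Probability that every one of `copies` independent spares fails."""
    return p_single ** copies


def copies_needed(p_single: float, target: float) -> int:
    """Smallest number of copies so that P(all fail) <= target."""
    if not (0.0 < p_single < 1.0) or target <= 0.0:
        raise ValueError("need 0 < p_single < 1 and target > 0")
    copies = 1
    while prob_all_fail(p_single, copies) > target:
        copies += 1
    return copies


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        print(f"{k} pencil(s): P(all break) = {prob_all_fail(0.2, k):.4f}")
```

Each extra spare multiplies the chance of total failure by the single-item failure probability, which is why a few replicas go a long way when failures really are independent.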

(16)

Replication

•  If something important might fail, keep some backups / spares around ready to stand in

•  For data, this implies it must be copied to multiple locations (replicated)

(17)

Tough Questions

•  What type(s) of failures must we survive?

•  How many failures must we survive?

•  How do we find a replica if failure happens?

•  What sort of consistency semantics must be maintained between replicas?

•  If failures make the situation bad, what are we

(18)

Brewer’s CAP Theorem

•  Consistency

•  Availability

•  Partition tolerance

•  Pick two*.

* http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed

(19)

Brewer’s CAP Theorem

Consistency, Availability, Partition tolerance

(20)

Can’t operate, even if online. (As if these two stop-failed.)

Brewer’s CAP Theorem

Any machine that continues operating must be in the majority partition (also applies to stop failures).

(21)

Can’t operate, even if online. (As if these two stop-failed.)

Brewer’s CAP Theorem

Generally, to survive N fail-stop failures you need 2N + 1 machines, because the N + 1 that remain still constitute a majority.
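
The majority rule can be written down directly. This is a minimal sketch; the function names are hypothetical and it only counts machines, ignoring how a real system detects failures or partitions.

```python
def majority(total: int) -> int:
    """Smallest group size that is a strict majority of `total` machines."""
    return total // 2 + 1


def machines_needed(failures: int) -> int:
    """Machines required to survive `failures` fail-stop failures."""
    return 2 * failures + 1


def can_operate(reachable: int, total: int) -> bool:
    """A group may keep operating only if it holds a majority."""
    return reachable >= majority(total)


if __name__ == "__main__":
    for n in (1, 2, 3):
        total = machines_needed(n)
        print(f"survive {n} failure(s): {total} machines, "
              f"majority = {majority(total)}")
```

With 5 machines split 3 / 2 by a partition, only the side with 3 may continue; the side with 2 must stop, as if those two had fail-stopped.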

(22)

Brewer’s CAP Theorem

•  This case is less well-defined. Can’t really build your system such that partitions are impossible.

•  If you think you’re doing this, you probably still have to give up one or the other if a partition does occur.

(23)

Brewer’s CAP Theorem

[Figure: two copies of the state, each with X = 1, Y = 2]

(24)

Brewer’s CAP Theorem

[Figure: two copies of the state, each starting with X = 1, Y = 2; one update sets Y = 9 and another sets X = 5]

Changes might be no problem.

(25)

Brewer’s CAP Theorem

[Figure: two copies of the state, each starting with X = 1, Y = 2; one update sets X = 7 and another sets X = 5]

Changes might conflict with one another.
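
The difference between the harmless case and the conflicting case can be sketched as a three-way merge against the state both sides shared before the partition. This is an illustrative approach, not something the slides prescribe; the function name and the assignment of writes to sides are assumptions, and deletions are not handled.

```python
def merge_partitions(base, side_a, side_b):
    """Three-way merge of two diverged copies against their common base.

    Returns (merged, conflicts). A key conflicts when both sides changed
    it to different values; conflicting keys keep their base value.
    """
    merged = dict(base)
    conflicts = {}
    for key in set(base) | set(side_a) | set(side_b):
        old = base.get(key)
        a = side_a.get(key, old)
        b = side_b.get(key, old)
        if a == b:            # both sides agree (changed or not)
            merged[key] = a
        elif a == old:        # only side B changed this key
            merged[key] = b
        elif b == old:        # only side A changed this key
            merged[key] = a
        else:                 # both changed it differently
            conflicts[key] = (a, b)
    return merged, conflicts


if __name__ == "__main__":
    base = {"X": 1, "Y": 2}
    # Writes to different keys: merges cleanly.
    print(merge_partitions(base, {"X": 1, "Y": 9}, {"X": 5, "Y": 2}))
    # Writes to the same key: reported as a conflict.
    print(merge_partitions(base, {"X": 7, "Y": 2}, {"X": 5, "Y": 2}))
```

Detecting the conflict is the easy part; deciding which value wins is an application-level policy question.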

(26)

System Classification

•  Reality: systems fall somewhere in a spectrum

– Some systems even let you tune to your taste

•  ACID (Atomicity, Consistency, Isolation, Durability)

– Strongly consistent and conservative

•  BASE (Basically Available, Soft state, Eventually consistent)

(27)

ACID (Favors C in CAP)

•  From database world: data protection is key

•  Based on idea of “transaction”

–  Sequence of commands that are related

•  Atomicity: transaction is all or nothing

•  Consistency: transaction sequence is ordered

•  Isolation: transactions behave as if serial

•  Durability: if transaction commits, it’s safe on disk
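
Atomicity is easy to see in a small example. The sketch below uses Python's standard `sqlite3` module with a hypothetical `accounts` table; the schema, names, and amounts are made up for illustration. It uses an in-memory database, so it demonstrates all-or-nothing behavior but not durability.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 0)])
    conn.commit()
    return conn


def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction.

    Returns True if committed, False if rolled back.
    """
    try:
        # `with conn` commits on success and rolls back on an exception.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            (left,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?",
                (src,)).fetchone()
            if left < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    db = make_db()
    print(transfer(db, "alice", "bob", 30), balances(db))
    print(transfer(db, "alice", "bob", 500), balances(db))
```

The failed transfer leaves both balances untouched even though both updates had already run inside the transaction.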

(28)

BASE (Favors A in CAP)

•  Relaxes the constraints of ACID

•  Basically Available: don’t panic on failures

•  Soft state: keep performance hints (eh…)

•  Eventual consistency: data will eventually converge to all replicas if given time
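
One common way to make replicas converge is last-writer-wins merging, sketched below. This is only one possible policy and is an assumption here, not something the slides specify. Each replica maps a key to a hypothetical `(timestamp, value)` pair, and ties on timestamp are broken by comparing values, so values must be mutually comparable.

```python
def merge_lww(state_a, state_b):
    """Merge two replica states, keeping the newest entry for each key.

    Each state maps key -> (timestamp, value).
    """
    merged = dict(state_a)
    for key, entry in state_b.items():
        if key not in merged or entry > merged[key]:
            merged[key] = entry
    return merged


def anti_entropy(replicas):
    """Bring every replica to the merge of all replica states."""
    combined = {}
    for state in replicas:
        combined = merge_lww(combined, state)
    return [dict(combined) for _ in replicas]


if __name__ == "__main__":
    r1 = {"X": (1, 5)}
    r2 = {"X": (2, 7), "Y": (1, 9)}
    for state in anti_entropy([r1, r2]):
        print(state)
```

Because the merge gives the same result regardless of the order in which states are combined, replicas that keep exchanging state end up identical once updates stop. The cost is that the older of two concurrent writes is silently discarded.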

(29)

Comparison

ACID

•  Easier to reason about

•  Safer

•  Less scalable

•  Lower performance

•  Failures might render system unusable (fewer 9’s)

BASE

•  Scales well, more performance

•  Usually available (more 9’s)

•  Consistency unclear to users

•  Reconciling diverged state is a

(30)

Paper Preview

•  “Characterizing Cloud Computing Hardware Reliability”

•  “Replication in the Harp File System”

•  “The Google File System”
