CS 91: Cloud Systems &
Datacenter Networks
Types of Failures
• “fail stop”: process/machine dies and doesn’t come back. Relatively easy to detect. (Often planned.)
• performance degradation: something has failed that makes it slow, but it’s still correct (straggler). Harder to detect.
• “Byzantine”: process has failed but is still running. Might be incorrect (spewing garbage) or even malicious. Can’t be trusted. VERY hard to detect.
Failure Impact: Availability
• The degree to which your system is operating
• Options aren’t necessarily discrete:
– Fully operational: all is good, full capacity
– Down: all is broken, no service for anybody
– In-between: service is degraded, but accessible
• Example: 100 servers needed to handle load,
Failure Sources
• Hardware (often disks, why?)
• https://www.youtube.com/watch?
• Software bugs
• Configuration / human mistakes
• Network (Internet) connectivity
• Planned maintenance
Cloud / Datacenter Scale
• So, how reliable must our hardware and software be to become “reliable enough”?
• If it’s not 100%, then it doesn’t really matter…
• Even if the failure rate of any one thing is really low, there are SO MANY things in a datacenter that something will fail soon.
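The scale argument can be checked with a little arithmetic. A minimal sketch, assuming component failures are independent and all share the same failure probability p over some time window (both are simplifying assumptions, and the value of p below is made up for illustration):

```python
def prob_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n independent components fails,
    given each fails with probability p in the time window considered."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p = 0.0001  # hypothetical per-component failure probability per day
    for n in (1, 1_000, 100_000):
        chance = prob_any_failure(p, n)
        print(f"{n:>7} components: P(at least one failure) = {chance:.4f}")
```

Even with a tiny per-component probability, the chance that something fails approaches 1 as the component count grows.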
Fault-tolerant Software
• With so many failure sources, it’s critical that software be made reliable.
• Pros:
– can handle unexpected failures
– can handle planned maintenance, (de)commissions
• Common solution: (What’s the most important principle in systems design?) Abstraction!
– Hide complex details whenever possible
• Typically build a layer of software infrastructure that can handle common failures
• Build application logic on top of that
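To make the layering idea concrete, here is a minimal sketch of a reliability layer that hides fail-stop replica failures from application code. Every name in it (`ReliabilityLayer`, `ReplicaUnavailable`, the two toy replicas) is hypothetical and chosen for illustration; real systems such as ISIS or Harp do far more than this.

```python
from typing import Callable, Sequence


class ReplicaUnavailable(Exception):
    """Raised by a replica that cannot serve a request (hypothetical)."""


class ReliabilityLayer:
    """Tries each replica in turn so callers never see a single failure."""

    def __init__(self, replicas: Sequence[Callable[[str], str]]):
        self.replicas = list(replicas)

    def read(self, key: str) -> str:
        for replica in self.replicas:
            try:
                return replica(key)
            except ReplicaUnavailable:
                continue  # fail over to the next replica
        raise ReplicaUnavailable("all replicas failed")


def dead_replica(key: str) -> str:
    # Stand-in for a machine that has fail-stopped.
    raise ReplicaUnavailable(key)


def live_replica(key: str) -> str:
    # Stand-in for a healthy machine.
    return f"value-of-{key}"


# Application logic only talks to the layer, not to individual replicas.
layer = ReliabilityLayer([dead_replica, live_replica])
print(layer.read("x"))
```

The application calls `layer.read` and is unaware that the first replica was down. Only when every replica fails does the failure surface.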
Seen Before: ISIS (+ others)
Important system that doesn’t want to worry
ISIS Software Reliability Layer
The realities of networks and distributed systems…
Will See Again: Harp (+ others)
Important system that doesn’t want to worry
Harp Reliability Layer
The realities of networks and distributed systems…
In General
Important system that doesn’t want to worry
Layer that does something to handle some failures.
The realities of networks and distributed systems…
Failure: what can we do?
• Suppose you’re soon to take an exam, but you’re worried about your pencil breaking (let’s say it’s a 20% chance)
• Easy solution: bring multiple (equal) pencils
• Redundancy: chances that they all break is
Replication
• If something important might fail, keep some backups / spares around ready to stand in
• For data, this implies it must be copied to multiple locations (replicated)
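The pencil example and the idea of keeping spares can be worked through numerically. A minimal sketch, assuming each copy fails independently with the same probability (the 20% figure comes from the pencil example; the 1% target is made up for illustration):

```python
def prob_all_fail(p: float, copies: int) -> float:
    """Probability that every one of `copies` independent replicas fails."""
    return p ** copies


def copies_needed(p: float, target: float) -> int:
    """Smallest number of copies such that P(all fail) <= target."""
    if not (0.0 <= p < 1.0) or target <= 0.0:
        raise ValueError("need 0 <= p < 1 and target > 0")
    copies = 1
    while prob_all_fail(p, copies) > target:
        copies += 1
    return copies


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        print(f"{k} pencil(s): P(all break) = {prob_all_fail(0.2, k):.4f}")
    print("copies for <= 1% risk:", copies_needed(0.2, 0.01))
```

The independence assumption matters: replicas that share a rack, power supply, or software bug can fail together, which this calculation does not capture.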
Tough Questions
• What type(s) of failures must we survive?
• How many failures must we survive?
• How do we find a replica if failure happens?
• What sort of consistency semantics must be maintained between replicas?
• If failures make the situation bad, what are we
Brewer’s CAP Theorem
• Consistency
• Availability
• Partition tolerance
• Pick two*.
* http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
Consistency + Partition tolerance (give up Availability)
Machines in the minority partition can’t operate, even if online. (As if these two stop-failed.)
Any machine that continues operating must be in the majority partition (also applies to stop failures).
Generally, to deal with fail-stop failures, you need 2N + 1 machines to survive N failures, because N + 1 constitutes a majority.
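The 2N + 1 rule is simple enough to express directly. A minimal sketch with hypothetical helper names, covering only fail-stop failures and partitions as discussed above:

```python
def machines_needed(n_failures: int) -> int:
    """Machines required to survive n_failures fail-stop failures."""
    return 2 * n_failures + 1


def majority(total: int) -> int:
    """Smallest group size that is a strict majority of `total` machines."""
    return total // 2 + 1


def can_operate(total: int, reachable: int) -> bool:
    """A partition may keep operating only if it holds a majority."""
    return reachable >= majority(total)


if __name__ == "__main__":
    total = machines_needed(2)  # survive 2 failures
    print("machines:", total, "majority:", majority(total))
    print("3 reachable, may operate:", can_operate(total, 3))
    print("2 reachable, may operate:", can_operate(total, 2))
```

Because two disjoint groups cannot both hold a majority, at most one side of a partition continues to accept operations.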
Consistency + Availability (give up Partition tolerance)
• This case is less well-defined. Can’t really build your system such that partitions are impossible.
• If you think you’re doing this, you probably still have to give up one or the other if a partition does occur.
Availability + Partition tolerance (give up Consistency)
• Both sides of the partition start with X = 1, Y = 2.
• One side writes Y = 9 while the other writes X = 5: changes might be no problem.
• One side writes X = 7 while the other writes X = 5: changes might conflict with one another.
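The two scenarios above can be sketched as a merge step that runs once the partition heals. This is an illustrative sketch only: `merge_partition_writes` is a hypothetical helper, and it merely detects conflicts rather than resolving them.

```python
def merge_partition_writes(base: dict, side_a: dict, side_b: dict):
    """Merge writes made independently on two sides of a partition.

    Returns (merged_state, conflicts). A key is in conflict when both
    sides wrote it with different values; conflicting keys keep their
    pre-partition value in merged_state.
    """
    merged = dict(base)
    conflicts = {}
    for key in set(side_a) | set(side_b):
        in_a, in_b = key in side_a, key in side_b
        if in_a and in_b and side_a[key] != side_b[key]:
            conflicts[key] = (side_a[key], side_b[key])
        elif in_a:
            merged[key] = side_a[key]
        else:
            merged[key] = side_b[key]
    return merged, conflicts


if __name__ == "__main__":
    base = {"X": 1, "Y": 2}
    # Disjoint writes: no problem.
    print(merge_partition_writes(base, {"Y": 9}, {"X": 5}))
    # Both sides wrote X: conflict.
    print(merge_partition_writes(base, {"X": 7}, {"X": 5}))
```

What to do with a detected conflict (pick a winner, keep both, ask the application) is a policy decision that this sketch leaves open.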
System Classification
• Reality: systems fall somewhere in a spectrum
– Some systems even let you tune to your taste
• ACID (Atomicity, Consistency, Isolation, Durability)
– Strongly consistent and conservative
• BASE (Basically Available, Soft state, Eventually consistent)
ACID (Favors C in CAP)
• From database world: data protection is key
• Based on idea of “transaction”
– Sequence of commands that are related
• Atomicity: transaction is all or nothing
• Consistency: transaction sequence is ordered
• Isolation: transactions behave as if serial
• Durability: if transaction commits, it’s safe on disk
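Atomicity is the easiest of these properties to demonstrate in a few lines. A minimal sketch using Python's standard `sqlite3` module; the `accounts` table, the names, and the balances are made up for illustration, and because the database is in memory, durability is not shown.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

transfer(conn, "alice", "bob", 30)       # succeeds and commits
try:
    transfer(conn, "alice", "bob", 500)  # fails midway, debit is rolled back
except ValueError:
    pass
print(balances(conn))
```

The failed transfer leaves no trace: the debit that had already been applied inside the transaction is undone by the rollback.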
BASE (Favors A in CAP)
• Relaxes the constraints of ACID
• Basically Available: don’t panic on failures
• Soft state: keep performance hints (eh…)
• Eventual consistency: data will eventually converge across all replicas if given time
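Eventual consistency can be illustrated with a toy anti-entropy exchange. This is a sketch of one possible approach (last-writer-wins by timestamp), not the mechanism of any particular system; it assumes each replica stores `key -> (timestamp, value)`, that timestamps are positive, and that they are comparable across replicas, which real clocks do not guarantee.

```python
def anti_entropy(a: dict, b: dict) -> None:
    """Exchange state between two replicas; the newest timestamp wins per key."""
    missing = (0, None)  # sentinel older than any real (positive) timestamp
    for key in set(a) | set(b):
        newest = max(a.get(key, missing), b.get(key, missing), key=lambda tv: tv[0])
        a[key] = b[key] = newest


def gossip_round(replicas: list) -> None:
    """Run one pairwise exchange between every pair of replicas."""
    for i in range(len(replicas)):
        for j in range(i + 1, len(replicas)):
            anti_entropy(replicas[i], replicas[j])


if __name__ == "__main__":
    replicas = [
        {"x": (1, "old")},
        {"x": (2, "new"), "y": (1, "b")},
        {},
    ]
    gossip_round(replicas)
    for r in replicas:
        print(r)
```

Between exchanges, replicas may return different answers for the same key. They agree only after updates stop and the exchanges have had time to run.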
Comparison
ACID
• Easier to reason about
• Safer
• Less scalable
• Lower performance
• Failures might render system unusable (fewer 9’s)
BASE
• Scales well, more performance
• Usually available (more 9’s)
• Consistency unclear to users
• Reconciling diverged state is a
Paper Preview
• “Characterizing Cloud Computing Hardware Reliability”
• “Replication in the Harp File System”
• “The Google File System”