CS 91: Cloud Systems & Datacenter Networks: Failures & Replication

Academic year: 2021


(1)

CS 91: Cloud Systems & Datacenter Networks

(2)

Types of Failures

• “Fail stop”: process/machine dies and doesn’t come back. Relatively easy to detect. (Often planned.)
• Performance degradation: something has failed that makes it slow, but it’s still correct (straggler). Harder to detect.
• “Byzantine”: process has failed but is still running. Might be incorrect (spewing garbage) or even malicious. Can’t be trusted. VERY hard to detect.

(3)

Failure Impact: Availability

• The degree to which your system is operating
• Options aren’t necessarily discrete:
– Fully operational: all is good, full capacity
– Down: all is broken, no service for anybody
– In-between: service is degraded, but accessible
• Example: 100 servers needed to handle load,

(6)

Failure Sources

• Hardware (often disks, why?)
• https://www.youtube.com/watch?
• Software bugs
• Configuration / human mistakes
• Network (Internet) connectivity
• Planned maintenance

(8)

Cloud / Datacenter Scale

• So, how reliable must our hardware and software be to become “reliable enough”?
• If it’s not 100%, then it doesn’t really matter…
• Even if the failure rate of any one thing is really low, there are SO MANY things in a datacenter, something will fail soon.
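The scale argument can be made concrete with a little arithmetic. The sketch below assumes every component fails independently with the same probability over some time window; the function name and the numbers are illustrative, not from the slides.

```python
def prob_any_failure(p_single: float, n_components: int) -> float:
    """Probability that at least one of n components fails.

    Assumes failures are independent and equally likely (an assumption
    made for illustration; real datacenter failures are often correlated).
    """
    return 1.0 - (1.0 - p_single) ** n_components


if __name__ == "__main__":
    # A hypothetical per-component failure probability of 0.01% per window.
    for n in (1, 100, 10_000, 100_000):
        p = prob_any_failure(0.0001, n)
        print(f"{n:>7} components: P(at least one failure) = {p:.4f}")
```

Even a very small per-component probability approaches certainty once the component count is large enough.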

(9)

Fault-tolerant Software

• With so many failure sources, it’s critical that software be made reliable.
• Pros:
– can handle unexpected failures
– can handle planned maintenance, (de)commissions

(10)

Fault-tolerant Software

• Common solution:

(What’s the most important principle in systems design?)

(11)

Fault-tolerant Software

• Common solution: Abstraction!
– Hide complex details whenever possible
• Typically build a layer of software infrastructure that can handle common failures
• Build application logic on top of that

(12)

Seen Before: ISIS (+ others)

ISIS Software Reliability Layer

The realities of networks and distributed systems…

Important system that doesn’t want to worry

(13)

Will See Again: Harp (+ others)

Harp Reliability Layer

The realities of networks and distributed systems…

Important system that doesn’t want to worry

(14)

In General

Layer that does something to handle some failures.

The realities of networks and distributed systems…

Important system that doesn’t want to worry

(15)

Failure: what can we do?

• Suppose you’re soon to take an exam, but you’re worried about your pencil breaking (let’s say it’s a 20% chance)
• Easy solution: bring multiple (equal) pencils
• Redundancy: chances that they all break is
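The pencil example can be worked through numerically. This is a minimal sketch that assumes each pencil breaks independently with the stated 20% chance; the function name is hypothetical.

```python
def prob_all_break(p_break: float, n_pencils: int) -> float:
    """Probability that every one of n pencils breaks.

    Assumes the pencils break independently of one another, which is an
    assumption added here for illustration.
    """
    return p_break ** n_pencils


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} pencil(s): P(all break) = {prob_all_break(0.2, n):.5f}")
```

Under that independence assumption, each extra pencil multiplies the chance of being left with nothing by 0.2.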

(16)

Replication

• If something important might fail, keep some backups / spares around ready to stand in
• For data, this implies it must be copied to multiple locations (replicated)
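As a rough illustration of copying data to multiple locations, here is a toy in-memory store. The class and method names are made up for this sketch, and it ignores everything hard about real replication (recovery, ordering, partial writes).

```python
class ReplicatedStore:
    """Toy key-value store that copies every write to several replicas."""

    def __init__(self, n_replicas: int = 3) -> None:
        self.replicas = [dict() for _ in range(n_replicas)]
        self.alive = [True] * n_replicas

    def put(self, key, value) -> None:
        # Write the value to every replica that is currently up.
        for i, replica in enumerate(self.replicas):
            if self.alive[i]:
                replica[key] = value

    def fail(self, index: int) -> None:
        # Simulate a fail-stop failure of one replica.
        self.alive[index] = False

    def get(self, key):
        # Read from the first live replica that holds the key.
        for i, replica in enumerate(self.replicas):
            if self.alive[i] and key in replica:
                return replica[key]
        raise KeyError(key)


if __name__ == "__main__":
    store = ReplicatedStore(n_replicas=3)
    store.put("x", 1)
    store.fail(0)
    store.fail(1)
    print(store.get("x"))  # still readable from the last surviving replica
```

With three copies, the data stays readable after two fail-stop failures; once the last replica is also marked failed, `get` raises `KeyError`.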

(17)

Tough Questions

• What type(s) of failures must we survive?
• How many failures must we survive?
• How do we find a replica if failure happens?
• What sort of consistency semantics must be maintained between replicas?
• If failures make the situation bad, what are we

(18)

Brewer’s CAP Theorem

• Consistency
• Availability
• Partition tolerance
• Pick two*.

* http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed

(19)

Brewer’s CAP Theorem

Consistency, Availability, Partition tolerance

(20)

Brewer’s CAP Theorem

Consistency, Availability, Partition tolerance

Can’t operate, even if online. (As if these two stop-failed.)

Any machine that continues operating must be in the majority partition (also applies to stop failures).
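The majority rule can be expressed as a one-line check. This sketch assumes a fixed, known cluster size; the function name and the five-machine split are illustrative.

```python
def can_operate(partition_size: int, total_machines: int) -> bool:
    """True if this partition holds a strict majority of all machines."""
    return 2 * partition_size > total_machines


if __name__ == "__main__":
    # Hypothetical cluster of 5 machines split 3 / 2 by a network partition.
    print(can_operate(3, 5))  # majority side keeps running
    print(can_operate(2, 5))  # minority side must stop, even though it is online
```

Note that an even split (for example 2 / 2 out of 4) leaves neither side with a majority, so neither may continue.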

(21)

Brewer’s CAP Theorem

Consistency, Availability, Partition tolerance

Can’t operate, even if online. (As if these two stop-failed.)

Generally, to deal with fail-stop failures, need 2N + 1 machines to survive N failures because N + 1 constitutes a majority.
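The 2N + 1 sizing rule follows from simple arithmetic, sketched below. The helper names are hypothetical.

```python
def machines_needed(failures_to_survive: int) -> int:
    """Cluster size required to survive N fail-stop failures: 2N + 1."""
    return 2 * failures_to_survive + 1


def majority(total_machines: int) -> int:
    """Smallest number of machines that forms a strict majority."""
    return total_machines // 2 + 1


def max_failures_tolerated(total_machines: int) -> int:
    """Largest number of failures that still leaves a majority alive."""
    return (total_machines - 1) // 2


if __name__ == "__main__":
    for n in range(4):
        total = machines_needed(n)
        print(f"survive {n} failure(s): {total} machines, majority = {majority(total)}")
```

One consequence: going from 3 to 4 machines does not help, since both sizes tolerate only one failure.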

(22)

Brewer’s CAP Theorem

Consistency, Availability, Partition tolerance

• This case is less well-defined. Can’t really build your system such that partitions are impossible.
• If you think you’re doing this, you probably still have to give up one or the other if a partition does occur.

(23)

Brewer’s CAP Theorem

Consistency, Availability, Partition tolerance

X = 1, Y = 2    X = 1, Y = 2

(24)

Brewer’s CAP Theorem

Consistency, Availability, Partition tolerance

X = 1, Y = 2    X = 1, Y = 2

Y = 9    X = 5

Changes might be no problem.

(25)

Brewer’s CAP Theorem

Consistency, Availability, Partition tolerance

X = 1, Y = 2    X = 1, Y = 2

X = 7    X = 5

Changes might conflict with one another.
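The two cases on these slides can be told apart mechanically. The sketch below assumes both sides of the partition started from X = 1, Y = 2 and that each update landed on a different side; which side received which write is a guess, and the function name is hypothetical.

```python
def conflicting_keys(base: dict, side_a: dict, side_b: dict) -> set:
    """Keys that both sides changed from the base to different values."""
    conflicts = set()
    for key in base.keys() | side_a.keys() | side_b.keys():
        a_changed = side_a.get(key) != base.get(key)
        b_changed = side_b.get(key) != base.get(key)
        if a_changed and b_changed and side_a.get(key) != side_b.get(key):
            conflicts.add(key)
    return conflicts


base = {"X": 1, "Y": 2}

# One side writes Y = 9, the other writes X = 5: different keys, no conflict.
print(conflicting_keys(base, {"X": 1, "Y": 9}, {"X": 5, "Y": 2}))  # set()

# One side writes X = 7, the other writes X = 5: same key, conflict.
print(conflicting_keys(base, {"X": 7, "Y": 2}, {"X": 5, "Y": 2}))  # {'X'}
```

Detecting the conflict is the easy part; deciding which value wins is an application-level policy that this sketch leaves open.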

(26)

System Classification

• Reality: systems fall somewhere in a spectrum
– Some systems even let you tune to your taste
• ACID (Atomicity, Consistency, Isolation, Durability)
– Strongly consistent and conservative
• BASE (Basically Available, Soft state, Eventually consistent)

(27)

ACID (Favors C in CAP)

• From database world: data protection is key
• Based on idea of “transaction”
– Sequence of commands that are related
• Atomicity: transaction is all or nothing
• Consistency: transaction sequence is ordered
• Isolation: transactions behave as if serial
• Durability: if transaction commits, it’s safe on disk
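Atomicity is the easiest of these properties to demonstrate on a single machine. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table, account names, and helper functions are invented for illustration, and an in-memory database says nothing about durability.

```python
import sqlite3


def make_bank() -> sqlite3.Connection:
    """Create a tiny in-memory bank with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move `amount` between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
        return True
    except ValueError:
        return False


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts"))


if __name__ == "__main__":
    conn = make_bank()
    print(transfer(conn, "alice", "bob", 30), balances(conn))
    print(transfer(conn, "alice", "bob", 500), balances(conn))
```

The failed transfer leaves no trace: the debit that had already been applied inside the transaction is rolled back along with everything else.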

(28)

BASE (Favors A in CAP)

• Relaxes the constraints of ACID
• Basically Available: don’t panic on failures
• Soft state: keep performance hints (eh…)
• Eventual consistency: data will eventually converge to all replicas if given time
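One simple way replicas can converge is pairwise merging with a last-writer-wins rule. This is an assumption chosen for illustration, not a technique named on the slides; the function name and the (timestamp, value) layout are hypothetical.

```python
def merge(replica_a: dict, replica_b: dict) -> None:
    """Anti-entropy step: both replicas keep the newest (timestamp, value) per key.

    Last-writer-wins is only one possible reconciliation policy and it
    silently discards the older of two concurrent writes.
    """
    for key in replica_a.keys() | replica_b.keys():
        newest = max(
            replica_a.get(key, (0, None)),
            replica_b.get(key, (0, None)),
            key=lambda stamped: stamped[0],
        )
        replica_a[key] = newest
        replica_b[key] = newest


# Three replicas accept writes independently; values are (timestamp, value).
replicas = [{}, {}, {}]
replicas[0]["X"] = (1, 5)
replicas[2]["X"] = (2, 7)
replicas[1]["Y"] = (1, 9)

# A few rounds of pairwise exchange spread the newest values everywhere.
merge(replicas[0], replicas[1])
merge(replicas[1], replicas[2])
merge(replicas[0], replicas[1])

print(replicas[0] == replicas[1] == replicas[2])  # True
```

Between the writes and the last merge, different replicas would have answered reads differently, which is the window that "eventual" refers to.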

(29)

Comparison

ACID
• Easier to reason about
• Safer
• Less scalable
• Lower performance
• Failures might render system unusable (fewer 9’s)

BASE
• Scales well, more performance
• Usually available (more 9’s)
• Consistency unclear to users
• Reconciling diverged state is a

(30)

Paper Preview

• “Characterizing Cloud Computing Hardware Reliability”
• “Replication in the Harp File System”
• “The Google File System”
