• No results found

LSST Database Design Jacek Becla

N/A
N/A
Protected

Academic year: 2021

Share "LSST Database Design Jacek Becla"

Copied!
21
0
0

Loading.... (view fulltext now)

Full text

(1)

1

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013 Name  of  Mee)ng  •  Loca)on  •  Date    -­‐    Change  in  Slide  Master  

LSST  Database  Design

 

Jacek  Becla  

Database  and  Data  Access  Lead

 

 

October  21-­‐25,  2013  

FINAL DESIGN REVIEW

(2)

2

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

Outline  

− 

Driving  requirements  

− 

Baseline  architecture  

−  Baseline  schema  highlights  

− 

Prototype  design  

− 

Tes)ng  results  

− 

Summary  

Docushare  LDM-­‐135   (LSST  Database  Design)     WBS:  02C.06.02.03   (Query  Services)  

(3)

3

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

Driving  Requirements  –  Database  PerspecGve  

Level  1

 

− 

(Almost)  real  )me  

− 

Live  updates    

− 

…  with  reproducibility  

− 

…  and  user  queries  

C 

Moderate  data  volume  

C 

Moderate  access  paRerns  

C 

No  complex  queries  

Level  2

 

− 

Large  data  volume  

− 

Large  query  volume  

− 

Wide  range  of  query  types  

C 

Immutable  

C 

Reasonable  response    

)me  expecta)ons  

(4)

4

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

(5)

5

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

Level  1  Requirements  –  Database  PerspecGve  

 

(Almost)  real  )me  

Live  updates    

…  with  reproducibility  

…  and  user  queries  

C

Moderate  data  volume  

C

Moderate  access  paRerns  

C

No  complex  queries  

 

"  

189  CCD-­‐size  queries  every  39  sec  

"  

~90K/visit  

"  

Well  understood  update  paRerns  

"  

Low  volume  queries  

"  

Core  table  up  to  44  B  rows,  75  TB  

"  

Hot  spots,  no  heavy  I/O  

"  

Small  area,  )me  series  for  

(6)

6

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

Baseline  Database  Architecture  for  Level  1  

− 

Off-­‐the-­‐shelf  RDBMS  

•  Horizontally  par))oned  and  spa)ally  sorted  

− 

Live  database  for  produc)on  +  replica  for  user  query  access  

•  Real-­‐)me  master-­‐slave  replica)on  

− 

Reproducibility  

•  No-­‐overwrite  updates   •  Validity  )me  ranges  

− 

Fault  tolerance  

•  Hot  stand-­‐by  replica  

•  Plus,  the  user  replica  can  be  turned  into  live  database  

− 

Annual  refresh  –  DR  catalogs  brought  to  L1  

LDM-­‐135,  chapter  3.1              

(7)

7

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

(8)

8

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

Level  2  Requirements  –  Database  PerspecGve  

Large  data  volume  

Large  query  volume  

Wide  range  of  query  types  

C

Immutable  

C

Reasonable  response    

)me  expecta)ons  

 

− 

Data  volume  

•  Correla)ons  on  mul)-­‐billion-­‐row  tables   •  Scans  through  petabytes  

•  Mul)-­‐billion  to  mul)-­‐trillion  table  joins  

− 

Query  volume  &  types  

•  Interac)ve  queries  

•  Concurrent  scans/aggrega)ons/joins   •  Spa)al  correla)ons  

•  Time  series  

•  Unpredictable,  ad-­‐hoc  analysis  

− 

Plus…  

•  Mul)-­‐decade  data  life)me   •  Low  cost  

1.  Massively  parallel,  distributed   2.  Indices  

3.  Shared  scans      

4.  Highly  specialized  indexing   5.  Efficient  joins  

6.  Robust  schema  and  catalog   7.  Commodity  H/W,  open  source     1 2 5 3 3 4 6 7

(9)

9

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

Baseline  Database  Architecture  for  Level  2    

− 

MPP*  RDBMS  on  shared-­‐nothing  commodity  cluster,    

with  incremental  scaling,  non-­‐disrup)ve  failure  recovery  

− 

Data  clustered  spa)ally  and  by  )me,  par))oned  w/overlaps  

•  Two-­‐level  par))oning  

•  2nd  level  materialized  on-­‐the-­‐fly  

•  Transparent  to  end-­‐users  

− 

Selec)ve  indices  to  speed  up  interac)ve  queries,    

spa)al  searches,  joins  including  )me  series  analysis  

− 

Shared  scans  

•  Predictable  I/O  cost  and  response  )me  

− 

Custom  somware  based  on  open  source    

RDBMS  (MySQL)  +  XRootD  

*MPP  –  Massively  Parallel  Processing  

LDM-­‐135,  chapter  3.3              

(10)

10

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013 FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

(11)

11

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

Baseline  Schema  Highlights  

− 

Object  

•  ~330  columns,  ~0.1  PB   •  Most  frequently  used   •  Advanced  analy)cs  

− 

Object_Extras  

•  ~7,650  columns,  ~1  PB   •  Specialized  analy)cs  

− 

Source  

•  ~50  columns,  up  to  ~5  PB  

•  Time  series  (high  SNR)  analysis  

− 

ForcedSource  

•  6  columns,  up  to  ~2  PB  

•  Time  series  (low  SNR)  analysis  

class CoreTables Object_Periodic DiaObject DiaObject_To_Object_Match DiaSource ForcedDiaSource ForcedSource

Object Object_APMean Object_NonPeriodic Object_Extra SSObject Source_APMean Source Name: Package: Version: Author: CoreTables CoreTables 1.0 Jacek Becla Hourly  scan   3  per  day   2  per  day   2  per  day  

(12)

12

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

Prototype  ImplementaGon  -­‐  Qserv  

Intercep)ng  user  queries  

Worker  dispatch,  query   fragmenta)on  genera)on,  

spa)al  indexing,  query   recovery,  op)miza)ons,  

scheduling,  aggrega)on  

Communica)on,   replica)on  

Metadata,  result  cache  

MySQL  dispatch,  shared   scanning,  op)miza)ons,  

scheduling  

Single  node  RDBMS  

RDBMS-­‐agnos)c  

(13)

13

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

Fault  Tolerance  /  Recoverability  

− 

Spare  nodes  -­‐  3%  of  cluster  

− 

20%  space  on  each  disk  

reserved  for  serving  chunks    

from  failed  node(s)  

− 

2  replicas  

− 

Chunks  appropriately  

distributed  

− 

Components  replicated  

− 

Failures  isolated  

− 

Narrow  interfaces  

− 

Every  table  checksumed  

− 

Logic  for  handling  errors  

− 

Logic  for  recovering  from  

errors  

− 

Most  implemented  and  

demonstrated  

LDM-­‐135,  chapter    8.13  AND  10.2              

(14)

14

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

Tests  &  DemonstraGons  

− 

Tests  

− 

Demonstra)ons    

•  Concurrency  (up  to  100K  in-­‐flight  segment-­‐queries,  on  ~100  nodes)   •  Fault  tolerance  (catching  errors,  transparent  fail  over  to  a  replica)  

•  Shared  scanning    (30-­‐query  scan:  5m27s,  avg  speed  for  a  single  query:  3m)  

Scale   acGve  Inter-­‐ Table  scans   Large  joins   Notes  

The  PDR  test   (2011)  

150  nodes,  32TB,   2B  objects,    

55B  sources   4-­‐9  sec   3-­‐8  min   10  min  –  5  h  

Problems  with     >4  concurrent  queries   (<20K  segment-­‐queries)   JHU   (2012)   20  nodes,  100TB,   2B  objects,  

80B  sources   ~5  sec   <  7  min  

Numerous  problems   with  unstable  hardware   IN2P3  

(2013)  

300  nodes,  10TB,   0.4B  objects,    

14B  sources   1.2-­‐4  sec   10  sec  –  10  min   ~  5  min  

Showed  good  scaling  and     low  dispatch  overhead,   proved  concurrency  

(15)

15

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

Qserv’s  Development  (R&D)  

− 

Pre-­‐construc)on  

•  Improve  automated  tes)ng  suite   •  Develop  unit  tests  

•  Refactor  and  op)mize  low-­‐level  design  details   •  Revisit  build  system  and  packaging  

•  Rewrite  XRootD  client   •  Logging  

FY’13   FY’09   FY’10   FY’11   FY’12  

Scale/speed  tes)ng:  

FY’14  

All  major  risks   reGred  

Design  and  development:   Core  func)ons   Scalability  /  performance  

Usability  /  

stability   Code  refactoring   Shared  

(16)

16

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

Qserv’s  Development  (ConstrucGon)  

FY’19   FY’15   FY’16   FY’17   FY’18  

Scale/speed  tes)ng:  

FY’20  

Scalability  /  fault  tolerance   Resource  mgmt   Shared  scans   Usability  /  stability   Level  3   Administra)on   Par)al  results   Query  syntax   Performance  

(17)

17

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

(18)

18

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

Level  3    

− 

“MyDB”  –  per-­‐user  database  space  

•  Storage  near  L2  (designated  drives  on  db  nodes)  

− 

Op)ons  for  storage  

•  Updatable,  centralized  

•  Immutable  (post-­‐crea)on),  distributed  

− 

Next-­‐to-­‐database  analy)cs  

•  Load  user  code  into  external  daemons   •  Issue  special  SELECT  query  in  Qserv  

•  Worker  streams  rows  to  external  daemons   •  User  code  processes  rows  arbitrarily  

(19)

19

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

(20)

20

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

Summary  of  Post-­‐PDR  AcGviGes  

− 

Level  1  

•  Refined  baseline  design  based  on  more  detailed  requirements  

− 

Level  2  

•  Major  redesign  of  query  parser,  analyzer,  and  dispatch   •  Resolved  concurrency  problems  

•  Implemented/demonstrated  basic  shared  scans,  fault  tolerance,     cluster  consistency,  installa)on  and  cluster  mgmt  tools,  automated     test  suite,  RDBMS-­‐independence,  improved  query  coverage  and   robustness.  Numerous  op)miza)ons  

•  Scalability  tests  (100TB  @JHU,  300-­‐node  @IN2P3)  

(21)

21

FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013

Summary  

− 

Baseline  architecture:  MPP  RDBMS  on  shared-­‐nothing  cluster  

•  With  custom  par))oning  and  indices,  shared  scans   •  Architecture  driven  by  volume,  access  paRerns,    

query  complexity,  data  life)me  and  low-­‐cost  

− 

Have  baseline  schema  (see  other  talk)  

− 

Have  working,  scalable  Qserv  prototype  

•  Based  on  simple  open  source  RDBMS  and  XRootD     •  Will  be  turned  into  a  produc)on  system  

We  are  confident  we  will  deliver    the  LSST  database  &  

query  access  system  mee@ng  the  LSST  requirements    

References

Related documents