FINAL DESIGN REVIEW | TUCSON, AZ | OCTOBER 21-25, 2013
LSST Database Design
Jacek Becla
Database and Data Access Lead
October 21-25, 2013
FINAL DESIGN REVIEW
Outline
− Driving requirements
− Baseline architecture
− Baseline schema highlights
− Prototype design
− Testing results
− Summary
Docushare LDM-135 (LSST Database Design); WBS: 02C.06.02.03 (Query Services)
Driving Requirements – Database Perspective
Level 1
− (Almost) real time
− Live updates
− … with reproducibility
− … and user queries
• Moderate data volume
• Moderate access patterns
• No complex queries
Level 2
− Large data volume
− Large query volume
− Wide range of query types
• Immutable
• Reasonable response time expectations
Level 1 Requirements – Database Perspective
− (Almost) real time
− Live updates
− … with reproducibility
− … and user queries
• Moderate data volume
• Moderate access patterns
• No complex queries
Details:
• 189 CCD-size queries every 39 sec
• ~90K/visit
• Well understood update patterns
• Low volume queries
• Core table up to 44 B rows, 75 TB
• Hot spots, no heavy I/O
• Small area, time series for
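The Level 1 figures above imply a modest sustained load. A minimal back-of-the-envelope sketch, assuming decimal units (1 TB = 10^12 bytes) and that the 189 queries are spread evenly over the 39-second period; the variable names are illustrative only:

```python
# Figures quoted on the slide.
CCD_QUERIES = 189        # CCD-size queries per period
PERIOD_S = 39            # seconds per period
CORE_ROWS = 44e9         # core table: up to 44 B rows
CORE_BYTES = 75e12       # core table: 75 TB (assumption: decimal TB)

# Average query rate if the queries are spread evenly over the period.
queries_per_second = CCD_QUERIES / PERIOD_S

# Average row width of the core table implied by the two totals.
bytes_per_row = CORE_BYTES / CORE_ROWS

print(f"{queries_per_second:.2f} queries/s, {bytes_per_row:.0f} bytes/row")
```

This gives roughly 4.8 queries per second and about 1.7 KB per row, which is consistent with the "moderate" characterization of Level 1.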
Baseline Database Architecture for Level 1
− Off-the-shelf RDBMS
• Horizontally partitioned and spatially sorted
− Live database for production + replica for user query access
• Real-time master-slave replication
− Reproducibility
• No-overwrite updates
• Validity time ranges
− Fault tolerance
• Hot stand-by replica
• Plus, the user replica can be turned into live database
− Annual refresh – DR catalogs brought to L1
LDM-135, chapter 3.1
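The reproducibility items (no-overwrite updates with validity time ranges) can be illustrated with a minimal sketch. The table name `DiaObjectDemo`, its columns, and the `FOREVER` sentinel are hypothetical and not taken from the LSST schema; SQLite stands in for the off-the-shelf RDBMS:

```python
import sqlite3

FOREVER = 2**62  # hypothetical sentinel meaning "still valid"

def open_db():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE DiaObjectDemo ("
        " objectId INTEGER, flux REAL,"
        " validityStart INTEGER, validityEnd INTEGER)"
    )
    return db

def upsert(db, object_id, flux, now):
    # Close the validity range of the current version; its measured
    # values are left untouched, so earlier states stay reproducible.
    db.execute(
        "UPDATE DiaObjectDemo SET validityEnd = ? "
        "WHERE objectId = ? AND validityEnd = ?",
        (now, object_id, FOREVER),
    )
    # Append the new version, valid from `now` onward.
    db.execute(
        "INSERT INTO DiaObjectDemo VALUES (?, ?, ?, ?)",
        (object_id, flux, now, FOREVER),
    )

def as_of(db, object_id, t):
    # Return the flux that was valid at time t, or None if no version was.
    row = db.execute(
        "SELECT flux FROM DiaObjectDemo "
        "WHERE objectId = ? AND validityStart <= ? AND ? < validityEnd",
        (object_id, t, t),
    ).fetchone()
    return None if row is None else row[0]
```

Calling `upsert(db, 1, 10.0, 100)` and then `upsert(db, 1, 12.5, 200)` leaves two rows; `as_of(db, 1, 150)` still returns the first value, which is the property a reproducible query needs.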
Level 2 Requirements – Database Perspective
− Large data volume
− Large query volume
− Wide range of query types
• Immutable
• Reasonable response time expectations
− Data volume
• Correlations on multi-billion-row tables
• Scans through petabytes
• Multi-billion to multi-trillion table joins
− Query volume & types
• Interactive queries
• Concurrent scans/aggregations/joins
• Spatial correlations
• Time series
• Unpredictable, ad-hoc analysis
− Plus…
• Multi-decade data lifetime
• Low cost
1. Massively parallel, distributed
2. Indices
3. Shared scans
4. Highly specialized indexing
5. Efficient joins
6. Robust schema and catalog
7. Commodity H/W, open source
Callout numbers on the slide (1, 2, 5, 3, 3, 4, 6, 7) link the requirements above to these features.
Baseline Database Architecture for Level 2
− MPP* RDBMS on shared-nothing commodity cluster, with incremental scaling, non-disruptive failure recovery
− Data clustered spatially and by time, partitioned w/overlaps
• Two-level partitioning
• 2nd level materialized on-the-fly
• Transparent to end-users
− Selective indices to speed up interactive queries, spatial searches, joins including time series analysis
− Shared scans
• Predictable I/O cost and response time
− Custom software based on open source RDBMS (MySQL) + XRootD
*MPP – Massively Parallel Processing
LDM-135, chapter 3.3
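The two-level partitioning with overlaps can be sketched as follows. This is a deliberately simplified illustration, not the Qserv partitioner: it treats (ra, dec) as a flat grid and ignores spherical geometry, and the chunk size, subchunk size, and overlap margin are made-up values.

```python
import math

CHUNK_DEG = 10.0     # hypothetical first-level chunk size
SUBCHUNK_DEG = 1.0   # hypothetical second-level subchunk size
OVERLAP_DEG = 0.1    # hypothetical overlap margin

def chunk_of(ra, dec):
    # First level: coarse cell that decides which node stores the row.
    return (math.floor(ra / CHUNK_DEG), math.floor((dec + 90.0) / CHUNK_DEG))

def subchunk_of(ra, dec):
    # Second level: fine cell, cheap enough to compute per query.
    return (math.floor(ra / SUBCHUNK_DEG),
            math.floor((dec + 90.0) / SUBCHUNK_DEG))

def subchunks_with_overlap(ra, dec):
    # A row belongs to its own subchunk and is also copied into every
    # neighbouring subchunk whose edge lies within OVERLAP_DEG, so that
    # near-neighbour joins never have to look outside one subchunk.
    cells = set()
    for dra in (-OVERLAP_DEG, 0.0, OVERLAP_DEG):
        for ddec in (-OVERLAP_DEG, 0.0, OVERLAP_DEG):
            r = (ra + dra) % 360.0
            d = min(max(dec + ddec, -90.0), 90.0)
            cells.add(subchunk_of(r, d))
    return cells
```

A point well inside a cell, such as (15.5, 0.5), lands in one subchunk only, while a point near an edge, such as (10.05, 0.5), is also visible from the neighbouring subchunk.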
Baseline Schema Highlights
− Object
• ~330 columns, ~0.1 PB
• Most frequently used
• Advanced analytics
− Object_Extras
• ~7,650 columns, ~1 PB
• Specialized analytics
− Source
• ~50 columns, up to ~5 PB
• Time series (high SNR) analysis
− ForcedSource
• 6 columns, up to ~2 PB
• Time series (low SNR) analysis
[Figure: CoreTables class diagram (package CoreTables, version 1.0, author Jacek Becla) showing Object, Object_Extra, Object_Periodic, Object_NonPeriodic, Object_APMean, Source, Source_APMean, ForcedSource, DiaObject, DiaObject_To_Object_Match, DiaSource, ForcedDiaSource, SSObject. Scan-frequency labels: hourly scan, 3 per day, 2 per day, 2 per day.]
Prototype Implementation - Qserv
Intercepting user queries
Worker dispatch, query fragmentation generation, spatial indexing, query recovery, optimizations, scheduling, aggregation
Communication, replication
Metadata, result cache
MySQL dispatch, shared scanning, optimizations, scheduling
Single node RDBMS
RDBMS-agnostic
Fault Tolerance / Recoverability
− Spare nodes - 3% of cluster
− 20% space on each disk reserved for serving chunks from failed node(s)
− 2 replicas
− Chunks appropriately distributed
− Components replicated
− Failures isolated
− Narrow interfaces
− Every table checksummed
− Logic for handling errors
− Logic for recovering from errors
− Most implemented and demonstrated
LDM-135, chapters 8.13 and 10.2
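One way to read "2 replicas" together with "chunks appropriately distributed" is that the second copies of one node's chunks are spread over many other nodes, so a failure adds a small load to each survivor rather than doubling the load on one. The placement rule below is an assumption made for illustration, not the scheme described in LDM-135:

```python
from collections import Counter

def place(chunk_id, n_nodes):
    # Hypothetical placement rule: primary by modulo, replica at an
    # offset that changes from chunk to chunk and is never zero.
    primary = chunk_id % n_nodes
    offset = 1 + (chunk_id // n_nodes) % (n_nodes - 1)
    replica = (primary + offset) % n_nodes
    return primary, replica

def takeover_load(failed, n_nodes, n_chunks):
    # Count, per surviving node, how many chunks it must serve because
    # their primary copy lived on the failed node.
    load = Counter()
    for chunk_id in range(n_chunks):
        primary, replica = place(chunk_id, n_nodes)
        if primary == failed:
            load[replica] += 1
    return load

load = takeover_load(failed=0, n_nodes=10, n_chunks=900)
print(sorted(load.items()))
```

With 10 nodes and 900 chunks, the 90 chunks of the failed node are picked up evenly, 10 by each of the 9 survivors.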
Tests & Demonstrations
− Tests
Test (scale) | Interactive | Table scans | Large joins | Notes
The PDR test (2011): 150 nodes, 32TB, 2B objects, 55B sources | 4-9 sec | 3-8 min | 10 min – 5 h | Problems with >4 concurrent queries (<20K segment-queries)
JHU (2012): 20 nodes, 100TB, 2B objects, 80B sources | ~5 sec | < 7 min | | Numerous problems with unstable hardware
IN2P3 (2013): 300 nodes, 10TB, 0.4B objects, 14B sources | 1.2-4 sec | 10 sec – 10 min | ~ 5 min | Showed good scaling and low dispatch overhead, proved concurrency
− Demonstrations
• Concurrency (up to 100K in-flight segment-queries, on ~100 nodes)
• Fault tolerance (catching errors, transparent fail over to a replica)
• Shared scanning (30-query scan: 5m27s, avg speed for a single query: 3m)
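The shared-scanning figures are easier to appreciate with the idea and the arithmetic side by side. The sketch below is a toy model: `shared_scan` is a hypothetical function, not Qserv code, and the comparison assumes that "3m" is the time one scan query takes when run alone.

```python
def shared_scan(chunks, predicates):
    # Read every chunk once and evaluate all queued queries against each
    # row, instead of re-reading the data once per query.
    hits = [0] * len(predicates)
    chunk_reads = 0
    for chunk in chunks:
        chunk_reads += 1
        for row in chunk:
            for i, predicate in enumerate(predicates):
                if predicate(row):
                    hits[i] += 1
    return hits, chunk_reads

def to_seconds(minutes, seconds=0):
    return 60 * minutes + seconds

shared_s = to_seconds(5, 27)       # 30 queries sharing one scan
single_s = to_seconds(3)           # one query alone (assumed reading)
sequential_s = 30 * single_s       # 30 queries run one after another
speedup = sequential_s / shared_s  # roughly 16.5x

print(shared_s, sequential_s, round(speedup, 1))
```

Under that reading, 30 queries cost 327 s with a shared scan versus 5400 s back to back, because I/O is paid once per chunk rather than once per query.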
Qserv’s Development (R&D)
− Pre-construction
• Improve automated testing suite
• Develop unit tests
• Refactor and optimize low-level design details
• Revisit build system and packaging
• Rewrite XRootD client
• Logging
[Timeline figure, FY’09 through FY’14. Design and development: Core functions; Scalability / performance; Usability / stability; Code refactoring; Shared. Scale/speed testing. All major risks retired.]
Qserv’s Development (Construction)
[Timeline figure, FY’15 through FY’20. Scale/speed testing. Scalability / fault tolerance; Resource mgmt; Shared scans; Usability / stability; Level 3; Administration; Partial results; Query syntax; Performance.]
Level 3
− “MyDB” – per-user database space
• Storage near L2 (designated drives on db nodes)
− Options for storage
• Updatable, centralized
• Immutable (post-creation), distributed
− Next-to-database analytics
• Load user code into external daemons
• Issue special SELECT query in Qserv
• Worker streams rows to external daemons
• User code processes rows arbitrarily
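The next-to-database analytics flow (worker streams rows, external daemon runs user code) can be sketched with a queue between two threads. All names here are hypothetical; a thread and an in-memory queue stand in for the external daemon and its transport, neither of which the slide specifies.

```python
import queue
import threading

END = None  # sentinel marking the end of the row stream

def worker_stream(rows, out_q):
    # Stand-in for a worker streaming the rows of a special SELECT.
    for row in rows:
        out_q.put(row)
    out_q.put(END)

def daemon(in_q, user_fn, results):
    # Stand-in for the external daemon: applies user code row by row.
    while True:
        row = in_q.get()
        if row is END:
            break
        results.append(user_fn(row))

def run(rows, user_fn):
    q = queue.Queue(maxsize=4)  # small buffer gives back-pressure
    results = []
    t = threading.Thread(target=daemon, args=(q, user_fn, results))
    t.start()
    worker_stream(rows, q)
    t.join()
    return results

# Example: user code doubles the second column of each row.
print(run([(1, 2.0), (2, 3.0)], lambda row: row[1] * 2))
```

The bounded queue means the producer blocks when user code falls behind, so arbitrary per-row processing cannot exhaust memory on the database node.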
Summary of Post-PDR Activities
− Level 1
• Refined baseline design based on more detailed requirements
− Level 2
• Major redesign of query parser, analyzer, and dispatch
• Resolved concurrency problems
• Implemented/demonstrated basic shared scans, fault tolerance, cluster consistency, installation and cluster mgmt tools, automated test suite, RDBMS-independence, improved query coverage and robustness. Numerous optimizations
• Scalability tests (100TB @JHU, 300-node @IN2P3)
Summary
− Baseline architecture: MPP RDBMS on shared-nothing cluster
• With custom partitioning and indices, shared scans
• Architecture driven by volume, access patterns, query complexity, data lifetime and low cost
− Have baseline schema (see other talk)
− Have working, scalable Qserv prototype
• Based on simple open source RDBMS and XRootD
• Will be turned into a production system