CERN - IT Department CH-1211 Genève 23 Switzerland
The LCG Distributed
Database Infrastructure
Dirk Düllmann, CERN & LCG 3D DESY Computing Seminar
The LCG Distributed Database Infrastructure -
Outline of the Talk
• Why databases and why distributed?
• LCG Distributed Database Project (LCG 3D)
• Building blocks for scalable database services
– database clusters
• Distribution techniques
– streams replication and distributed caching
• Database and replication monitoring, optimisation
• Experiment models and schedules
[Figure: conditions data flow. Elements shown: Online Master Data Storage (Configurations, Conditions, Calibration, LHC data, Logbook, DDD, EMDB; sub-systems Pixel, Strips, ECAL, HCAL, RPC, DT, ES, CSC, Trigger, DAQ, DCS); "Create rec conDB data set"; Conditions Formatting; Offline Reconstruction Conditions DB, ONline subset; Offline Reconstruction Conditions DB, OFfline subset (master copy); Production, Validation, Development; CERN Computer Center; Tier 0.]
Distributed Deployment of Databases (=3D)
• LCG provided initially an infrastructure for distributed access to file based data and file replication
• Physics applications (and grid services) depend crucially on relational databases for meta data and require similar services for databases
– Physics applications and grid services use RDBMS
• e.g. configuration, conditions, calibration, event tags, file and collection catalogues, production/transfer workflow
– LCG sites already have experience in providing RDBMS services
• Goals for common database project as part of LCG
– increase the availability and scalability of LCG and experiment database components
– allow applications to access databases in a consistent, location independent way
– connect database services via data replication mechanisms
– shared deployment and administration of this infrastructure during 24 x 7 operation
• Scope set by LCG PEB: Databases Online, Offline and at LCG Tier sites
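One way to read the "consistent, location independent" goal: an application names a database logically, and a lookup step resolves that name to a physical replica near the client. The following is a minimal sketch under that reading only; the logical name, the replica table and the connection strings are invented for illustration and do not describe the project's actual lookup service.

```python
# Hypothetical replica table: logical database name -> {site: connection string}.
REPLICAS = {
    "conditions": {
        "CERN": "oracle://cern-db/conditions",
        "RAL": "oracle://ral-db/conditions",
    },
}


def resolve(logical_name, local_site, master_site="CERN"):
    """Return the local replica if one exists, else fall back to the master."""
    sites = REPLICAS[logical_name]
    return sites.get(local_site, sites[master_site])


# A job at RAL reads locally; a job at a site without a replica reads from CERN.
print(resolve("conditions", "RAL"))
print(resolve("conditions", "PIC"))
```

The application code stays the same at every site; only the table changes when a replica is added or removed.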
LCG 3D Service Architecture
• Online DB – autonomous reliable service
• T0 – autonomous reliable service, database (DB) cluster
• T1 – DB backbone
– all data replicated
– reliable service
• T2 – local DB cache
– subset data
– only local service
• Read-only access at Tier-1/2 (at least initially)
• Distribution techniques: Oracle Streams; http cache (Squid) with FroNTier; cross DB copy and MySQL/SQLite files
[Figure legend symbols: O, M, S (Squid), F (FroNTier), MySQL/SQLite DB file]
Database Services for Physics at CERN in 2002
• 24x7 service based on Oracle 9i / Solaris
• Most experiment databases hosted on
– 2-node (4 CPU) cluster on Solaris with 6GB of RAM
– 18 disks, 32GB each
– Veritas Volume Manager
• But also an increasing number of Linux disk servers for LHC and non-LHC data (COMPASS, HARP)
– Locally attached disk
– Spread all over the CERN computing centre
– Several different versions of Linux and Oracle
How to build a highly available and
scalable DB service?
• Scaling Up
• Scaling out -> clustering
Chosen Architecture:
Database Cluster on Linux
• Oracle Real Application Cluster (RAC) on
commodity hardware
– Redundancy at all levels (CPU, storage, networking)
– Oracle 10gR2 and RedHat ES 4
– Oracle ASM as volume manager
[Figure: RAC1 cluster. Elements shown: CERN LAN; GB ethernet switch; nodes lxs5030, lxs5033, lxs5037, lxs5038; SANbox 5200 – A; Infortrend storage.]
Hardware configuration (ASGC)
• Four servers
– CPU: Intel Pentium-D 830, 3.0 GHz
– Memory: 2G (ECC)
– Local disk: S-ATA2, 80G, 7200 rpm
– Fiber Channel: LSI 7102XP-LC, PCI X 1
• SAN switch: Silkworm 3850, 16 ports
• Backend RAID subsystem: StorageTek B280
• Dual channel & redundant controller
• Each RAC group shares 1.7TB exported from SAN
– RAC group for 3D
– RAC group for other LCG services
Beginning of 2007
• ~220 Database CPUs
• ~440 GB of RAM (shared DB cache)
• ~1100 disks
• Some 15 DB clusters deployed
– One production cluster per LHC experiment for offline
applications
– ATLAS Online, COMPASS cluster
– Number of nodes varying from 4 to 8
• Several validation and test clusters
– 1 or 2 per experiment (typically 2 node clusters)
Application Validation and
Optimisation
• Database clusters can amplify the effects of poor application design, so scalability tests are needed during the application release cycle
– Development DB service -> Validation DB service -> Production DB service
• Significant fraction of DB administrator work
• needs close collaboration with application developers
• Focus on application with large resource consumption:
– File transfer systems (FTS, PhEDEx)
– Grid catalogs (LFC)
– Experiment Dashboards
– Condition data (COOL)
Evolving Database Hardware
• Need to continuously replace database h/w with
next generation CPU and storage
– still maintaining as few s/w configurations as
possible: only a single Linux and Oracle version
• CPU side - continued performance increase via multi-core
– Recently tested dual quad-core CPU & 16 GB of RAM
• performance similar to 5 node RAC built with the hardware currently used
– Multi core works well for database servers
• but implies a move to 64-bit Linux & Oracle
– May run into memory bandwidth limitations with many cores
• Storage side - slower performance increase
– Sizing for IO operations per second rather than just volume
increase -> disk numbers increase
– Investigating higher performance disks (eg raptor) and
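
The point about sizing for IO operations per second rather than volume can be made concrete with a small calculation: the number of disks is set by whichever requirement, capacity or IO rate, needs more spindles. A sketch with illustrative numbers only; none of the figures below come from the CERN setup.

```python
import math


def disks_needed(volume_gb, iops, disk_gb, disk_iops):
    """Number of disks required to satisfy both capacity and IO rate."""
    by_volume = math.ceil(volume_gb / disk_gb)
    by_iops = math.ceil(iops / disk_iops)
    return max(by_volume, by_iops)


# Assumed workload: 2000 GB of data and 5000 IO operations per second,
# served by assumed 250 GB disks that each sustain 100 IO operations per second.
by_capacity_only = math.ceil(2000 / 250)
total = disks_needed(2000, 5000, 250, 100)
print(by_capacity_only, total)
```

With these assumed numbers capacity alone would suggest 8 disks, while the IO rate requires 50, which is why disk counts grow faster than data volume.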
Database Replication and
Distributed Caching
13 Sept, 2006 CMS Frontier Report
FroNTier “Launchpad” software
• Squid caching proxy
– Load shared with Round-Robin DNS
– Configured in “accelerator mode”
– Peer-to-peer caching
– “Wide open frontier”*
• Tomcat - standard
• FroNTier servlet
– Distributed as “war” file
• Unpack in Tomcat webapps dir
• Change 2 files if name is different
– One xml file describes DB connection
[Figure: Round-Robin DNS in front of server1, server2 and server3, each running Squid, Tomcat and the FroNTier servlet, with the DB behind them.]
*In the past, we required registration so that we could add the IP/mask to our Access Control List (ACL) at CERN. We recently decided to run in “wide-open” mode so installations can be tested without registration.
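
The combination of round-robin load sharing and caching proxies can be illustrated with a toy model. This is a sketch only: `CachingProxy` and `database` are hypothetical stand-ins, not Squid or FroNTier code, and peer-to-peer caching between the proxies is not modelled.

```python
from itertools import cycle


class CachingProxy:
    """Toy stand-in for a caching proxy: repeated queries come from a local cache."""

    def __init__(self, name, backend):
        self.name = name
        self.backend = backend
        self.cache = {}

    def get(self, query):
        if query not in self.cache:  # cache miss: forward to the database
            self.cache[query] = self.backend(query)
        return self.cache[query]


db_calls = []


def database(query):
    """Hypothetical backend; records every query that actually reaches it."""
    db_calls.append(query)
    return f"result of {query}"


# Round-robin over three servers, as a DNS alias with three addresses would do.
servers = cycle([CachingProxy(f"server{i}", database) for i in (1, 2, 3)])

# Six identical requests are spread over the three servers; each server
# queries the database once and answers its second request from cache.
answers = [next(servers).get("SELECT calib") for _ in range(6)]
print(len(db_calls))
```

Read-mostly data such as conditions suits this pattern because many clients ask the same question.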
How to keep Databases up-to-date?
Asynchronous Replication via Streams
[Figure, shown as an animation: an update at CERN (insert into emp values ( 03, “Joan”, ….)) passes through the stages capture, propagation and apply. The change travels as LCRs from CERN to the sites CNAF, RAL, Sinica, FNAL, IN2P3 and BNL.]
LCG 3D Status Dirk Duellmann 15
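
The capture, propagation and apply stages can be mimicked in a few lines to show what "asynchronous" means here: the insert returns at the source before any replica has seen it. This is a toy model using plain Python containers, not Oracle Streams; the table layout is invented, and only the site names and the example row are taken from the slide.

```python
from collections import deque

source = {}                         # source table: key -> row
replicas = {"CNAF": {}, "RAL": {}}  # destination tables
redo_log = []                       # changes recorded at the source
queue = deque()                     # staged change records awaiting propagation
captured = 0                        # position of the capture step in the redo log


def insert(key, row):
    """Commit at the source; replicas are not touched."""
    source[key] = row
    redo_log.append(("insert", key, row))


def capture():
    """Turn redo entries not yet seen into queued change records."""
    global captured
    queue.extend(redo_log[captured:])
    captured = len(redo_log)


def propagate_and_apply():
    """Ship each queued change record to every destination and apply it."""
    while queue:
        _op, key, row = queue.popleft()
        for table in replicas.values():
            table[key] = row


insert(3, "Joan")
lag_seen = replicas["CNAF"] == {}   # True: the replica lags until the pipeline runs
capture()
propagate_and_apply()
print(replicas)
```
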
Database s/w Licenses and Support
• Tier 1 licenses acquired
– Experiment and grid service requests collected for all T1 sites
– License payment and signed agreement forms received
• LCG Support ID has been created with Oracle
• Accounts for Oracle support (MetaLink) created for all T1 database contacts
– access to Oracle s/w, patches, security upgrades and problem database
Tier 0 setup and operational procedures
• Main streams operations have been included in the Tier 0 DBA operations manual
• Automated alerts for database or streams problems from 3D monitoring
– Integrated with GGUS (grid user support)
– so far: handling of streams problems during working hours only
• Downstream capture setup has been installed as part of the planned Tier 0 service extension
– log-mining step runs on a separate box, offloading the source database
Further Decoupling between Databases
[Figure: redo log files are copied from the source database (CERN RAC) to a downstream database at CERN, which runs one capture process per destination; propagation jobs feed the destination sites CNAF, FNAL and CERN.]
Objectives
• Remove impact of capture from Tier 0 database
• Isolate destination sites from each other
– one capture process + queue pair for each target site
– big Streams pool size
– redundant events (x number of queues)
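
The isolation objective can be sketched as one capture position and one queue per destination, so that a site that is down only backs up its own queue. A toy model with invented names (`SiteStream`, `run`), not the Streams configuration itself; it also shows the cost noted on the slide, since every change is captured once per queue.

```python
from collections import deque


class SiteStream:
    """One capture position plus one queue, dedicated to a single destination."""

    def __init__(self):
        self.position = 0     # how far this site's capture has read the log
        self.queue = deque()  # change records waiting for this site
        self.applied = []     # change records applied at this site
        self.online = True


def run(redo_log, streams):
    """Capture new log entries per site and deliver them where the site is up."""
    for stream in streams.values():
        # Each capture reads the copied redo log independently (redundant work).
        stream.queue.extend(redo_log[stream.position:])
        stream.position = len(redo_log)
        while stream.online and stream.queue:
            stream.applied.append(stream.queue.popleft())


streams = {"CNAF": SiteStream(), "FNAL": SiteStream()}
streams["FNAL"].online = False            # one destination is unreachable
run(["lcr1", "lcr2"], streams)
cnaf_during_outage = list(streams["CNAF"].applied)
fnal_backlog = len(streams["FNAL"].queue)

streams["FNAL"].online = True             # the site comes back and catches up
run(["lcr1", "lcr2", "lcr3"], streams)
print(cnaf_during_outage, fnal_backlog, streams["FNAL"].applied)
```
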
Database Backup / Recovery with Streams
• Collected DB / streams recovery scenarios
– Recovery after T1 data loss - OK
• RAL recovered and re-synchronised
• Replication CERN to CNAF continued unaffected
– Recovery after T0 data loss - OK
– Coordinated point-in-time recovery - OK
• Service procedure documented and validated with two Tier 1 sites and CERN
• Full recovery exercise with all sites scheduled for the 3D Workshop at CNAF June 12-13
– Recover Tier 1 database from tape
– Resynchronise replication streams
Streams Performance Tuning
• Focus recently: Wide Area Network
– Remote sites significantly affected by latency
• e.g. ASGC with 300ms round trip time
• Studied TCP level data flow between CERN and Taiwan
• Resulting Optimisations
– increased TCP buffer size (OS level)
– decreased frequency of acknowledgements between src and dest DB
• Total improvement of factor 10
– Now: 4000 logical change records / sec
– Checklist for Tier 1 sites has been prepared
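
Why buffer size matters at 300ms round trip time follows from a standard bound: a TCP sender can have at most one window of unacknowledged data in flight, so throughput cannot exceed window size divided by round trip time. A sketch of that arithmetic; the two buffer sizes are assumptions chosen for illustration, not the values used at CERN or ASGC.

```python
def max_throughput_bytes_per_s(window_bytes, rtt_s):
    """Upper bound on TCP throughput: one window in flight per round trip."""
    return window_bytes / rtt_s


rtt = 0.300                                                # 300 ms, as quoted for ASGC
small = max_throughput_bytes_per_s(64 * 1024, rtt)         # assumed 64 KiB buffer
large = max_throughput_bytes_per_s(4 * 1024 * 1024, rtt)   # assumed 4 MiB buffer
print(f"{small / 1e6:.2f} MB/s -> {large / 1e6:.2f} MB/s")
```

The same reasoning applies to application-level acknowledgements: the fewer round trips spent waiting per batch of change records, the less the latency limits the rate.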
www.cern.ch/it
Experiment Database
Activities & Plans
Andrea Valassi Conditions Databases Rimini, 7 May 2007
LHCb – COOL service model
• Two servers at CERN – essentially for online and offline
– Replication to Tier1’s from the online database is a two-step replication
– Online server at the pit (managed by LHCb); offline server in the CC
[Figure: replication to the Tier 1 sites GRIDKA, RAL, IN2P3, CNAF, SARA, PIC]
(Marco Clemencic, 3D workshop 13 Sep 2006)
LFC Replication Testbed
[Figure: population clients write to LFC R-W servers backed by the LFC Oracle server (master DB); Oracle Streams replicates to the LFC Oracle server (replica DB), which backs the LFC read-only server used by read-only clients. Hosts shown: rls1r1.cern.ch, lxb0716.cern.ch, lxb0717.cern.ch, lfc-streams.cr.cnaf.infn.it, lfc-replica.cr.cnaf.infn.it]
File Catalogue replication between CERN and CNAF
• LFC replication via streams between CERN and CNAF in production since last November
– Requested by LHCb to provide read-only catalog replicas
• Stable operation without major problems
– Several site interventions at CNAF have been performed
– Site restart and resynchronisation worked
– Rate is low compared to conditions
• In contact with LHCb for adding remaining
ATLAS – COOL service model
• COOL Oracle services at Tier0 and ten Tier1’s
– Two COOL servers at CERN for online/offline (similar to LHCb)
• Online database within the Atlas pit network, but physically in the CC
– In addition: Oracle at three ‘muon calibration center’ Tier2’s
[Figure: three zones: ATLAS pit, computer centre, outside world. Elements shown: Online / PVSS / HLT farm; Online OracleDB; ATLAS pit network (ATCN); gateway; CERN public network; dedicated 10Gbit link; offline master CondDB; Tier-0; Sqlite replica (1 file/run); calibration updates; Streams replication; Tier-1 replicas.]
(Sasha Vaniachine and Richard Hawkings, 3D workshop 14 Sep 2006)
Tier-1 sites: GRIDKA, TAIWAN, RAL, IN2P3, CNAF, SARA, BNL, TRIUMF, PIC, Nordugrid
LCG 3D Project Status
3D Database Resource Request and Current Predictions
Columns: Dual CPU DB Nodes (ATLAS, LHCb); DB Storage [TB usable] (ATLAS, LHCb)
• Request: no change wrt GDB Nov’05
• Conditions Challenges (April - Jul): nodes 3, 2; storage 0.3, 0.1
– ATLAS: COOL + TAGs
– LHCb: COOL + LFC r/o replica
• Predictions: next review after initial CDC phase (e.g. May)
• Dress Rehearsals (Jul-Nov): nodes 3, 2; storage 0.3, 0.1
– ATLAS: 4GB on 64bit DB server
– LHCb: 2 LFC r/o servers in place
– expect resource upgrade: double storage and CPU review
• LHC Startup (from Nov): nodes 3+x, 2+y; storage 1.0, 0.3
• from June 2008 2009 nominal year 0.2 + 1.4 0.5 + 3.7 0.8 + 6.0
CMS – conditions data at CERN
Vincenzo Innocente, CMS Software Week, April 2007
• ORCON and ORCOFF conditions data are in POOL_ORA format
• ORCON-ORCOFF Oracle Streams prototype set up (integration RAC). Production set up later in 2007.
11 Oct 2006
FroNTier Launchpad Setup
• CERN DNS round robin: provides load balancing and failover
• 3 servers running Frontier & Squid (worker nodes)
• Backend Oracle Database 10gR2 (4-node RAC)
[Figure: the T0 Farm and WAN clients connect to the launchpad servers. Slide: L. Lueking]
CMS Request Update
26 Jan 2007 CMS Req. Update and Plans
Rough Estimates of CMS DB Resources, March 2007 through August 2007
Area (CMS contact) | Disk Maximum (GB) | Concurrent Users, peak usage | Transactions, peak usage (Hz)
Online P5 | 500 | 20 | 10
Offline Conditions Tier-0 (CMSR) | 500 (DB), 100 (per Squid), 2-3 Squids | 10 (DB)*, 10 (Squid) | 10 (DB)*, 10 (Squid)
Offline Conditions Tier-1 (each site) | 100 (per Squid), 2-3 Squids/site | 10 (per Squid), >100 all sites | 10 (per Squid), >100 all sites
Offline DBS Tier-0 (CMSR) | 20 | 10 (currently ~2) | 10 (currently ~5)
* Incl. On2off Xfer
Deployment Status and
Next Steps
• CMS tested Frontier/SQUID setup during CSA ’06
– now some 30 Tier 1 and Tier 2 sites connected
• ATLAS and LHCb moved to production mode in April
• All ten Tier 1 database sites integrated into a single
distributed database infrastructure
– ASGC, BNL, CNAF, GridKA, IN2P3, SARA/
NIKHEF, NDGF, PIC, RAL, TRIUMF
– One of the largest distributed database setups
worldwide
• Preparing for Tier 1 replica scaling test with O(100) client nodes using the ATLAS offline framework ATHENA
• In summer: participate in WLCG “Dress Rehearsal” tests
More information at
• WLCG 3D Project
– http://lcg3d.cern.ch or
– http://twiki.cern.ch/twiki/bin/view/PSSGroup/LCG3DWiki
• CERN Physics Database Service
– http://phydb.web.cern.ch/phydb/
• WLCG Persistency Framework
– http://pool.cern.ch
– http://pool.cern.ch/coral