The LCG Distributed Database Infrastructure


CERN - IT Department CH-1211 Genève 23 Switzerland


Dirk Düllmann, CERN & LCG 3D DESY Computing Seminar


[email protected]


Outline of the Talk

• Why databases and why distributed?

• LCG Distributed Database Project (LCG 3D)

• Building blocks for scalable database services

– database clusters

• Distribution techniques

– streams replication and distributed caching

• Database and replication monitoring

optimisation

• Experiment models and schedules


[Figure: conditions data flow from online to Tier 0. Labels: Online Master Data Storage (Configurations, Conditions; Pixel, Strips, ECAL, HCAL, RPC, DT, ES, CSC, Trigger, DAQ, DCS; LHC data, Logbook, DDD, EMDB, Calibration); Offline Reconstruction Conditions DB, online subset; Offline Reconstruction Conditions DB, offline subset (master copy); Conditions Formatting; Create rec conDB data set; CERN Computer Center; Tier 0; Production, Validation, Development]


Distributed Deployment of Databases (=3D)

• LCG initially provided an infrastructure for distributed access to file-based data and file replication

• Physics applications (and grid services) depend crucially on relational databases for metadata and require similar services for databases

– Physics applications and grid services use RDBMS

• e.g. configuration, conditions, calibration, event tags, file and collection catalogues, production/transfer workflow

– LCG sites already have experience in providing RDBMS services

• Goals for common database project as part of LCG

– increase the availability and scalability of LCG and experiment database components

– allow applications to access databases in a consistent, location independent way

– connect database services via data replication mechanisms

– shared deployment and administration of this infrastructure during 24 x 7 operation

• Scope set by LCG PEB: Databases Online, Offline and at LCG Tier sites


LCG 3D Service Architecture

T0 – autonomous reliable service

Online DB – autonomous reliable service

T1 – DB backbone

– all data replicated

– reliable service

T2 – local DB cache

– subset data

– only local service

Read-only access at Tier-1/2 (at least initially)

Distribution techniques: Oracle Streams; http cache (Squid); cross DB copy and MySQL/SQLite files

[Diagram key symbols: O, M, S, F; labels: Squid, FroNTier, MySQL/SQLite DB file, database (DB) cluster]

Database Services for Physics at CERN in 2002

• 24x7 service based on Oracle 9i / Solaris

• Most experiment databases hosted on

– 2-node (4 CPU) cluster on Solaris with 6GB of RAM

– 18 disks, 32GB each

– Veritas Volume Manager

• But also an increasing number of Linux disk servers for LHC and non-LHC data (COMPASS, HARP)

– Locally attached disk

– Spread all over the CERN computing centre

– Several different versions of Linux and Oracle

How to build a highly available and scalable DB service?

• Scaling Up

• Scaling out -> clustering

Chosen Architecture: Database Cluster on Linux

• Oracle Real Application Cluster (RAC) on commodity hardware

– Redundancy at all levels (CPU, storage, networking)

– Oracle 10gR2 and RedHat ES 4

– Oracle ASM as volume manager

[Diagram labels: CERN LAN; RAC1 nodes lxs5030, lxs5033, lxs5037, lxs5038; GB ethernet switch; SANbox 5200 – A; Infortrend storage]

ASGC

Hardware configuration

• Four servers

• CPU: Intel Pentium-D 830 3.0 GHz

• Memory: 2G (ECC)

• Local Disk: S-ATA2 80G 7200 rpm

• Fiber Channel: LSI 7102XP-LC, PCI X 1

• SAN Switch: Silkworm 3850, 16 ports

• Backend RAID subsystem: StorageTek B280

• Each RAC group shares 1.7TB exported from SAN

[Diagram labels: RAC group for 3D; RAC group for other LCG services; dual channel & redundant controller]


Beginning of 2007

• ~220 Database CPUs

• ~440 GB of RAM (shared DB cache)

• ~1100 disks

• Some 15 DB clusters deployed

– One production cluster per LHC experiment for offline applications

– ATLAS Online, COMPASS cluster

– Number of nodes varying from 4 to 8

• Several validation and test clusters

– 1 or 2 per experiment (typically 2 node clusters)


Application Validation and Optimisation

• Database clusters can amplify effects of poor application design - need scalability test during application release cycle

Development DB service -> Validation DB service -> Production DB service

• Significant fraction of DB administrator work

• needs close collaboration with application developers

• Focus on applications with large resource consumption:

– File transfer systems (FTS, PhEDEx)

– Grid catalogs (LFC)

– Experiment Dashboards

– Condition data (COOL)


Evolving Database Hardware

• Need to continuously replace database h/w with next generation CPU and storage

– still maintaining as few s/w configurations as possible: only a single Linux and Oracle version

• CPU side - continued performance increase via multi-core

– Recently tested dual quad-core CPU & 16 GB of RAM

• performance similar to 5 node RAC built with the hardware currently used

– Multi core works well for database servers

• but implies a move to 64-bit Linux & Oracle

– May run into memory bandwidth limitations with many cores

• Storage side - slower performance increase

– Sizing for IO operations per second rather than just volume increase -> disk numbers increase

– Investigating higher performance disks (e.g. Raptor) and
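The point about sizing storage for IO operations per second rather than volume can be illustrated with a small calculation. This is only a sketch: the workload and per-disk figures below are assumptions chosen for illustration, not numbers from the talk.

```python
import math

def disks_needed(volume_gb, iops, disk_gb, disk_iops):
    """Return (disks for volume, disks for IOPS, disks required)."""
    by_volume = math.ceil(volume_gb / disk_gb)
    by_iops = math.ceil(iops / disk_iops)
    # The larger of the two constraints decides the purchase.
    return by_volume, by_iops, max(by_volume, by_iops)

# Hypothetical workload and disk characteristics (assumptions).
by_volume, by_iops, required = disks_needed(
    volume_gb=2000, iops=6000, disk_gb=250, disk_iops=100
)
print(by_volume, by_iops, required)
```

With these assumed figures the IOPS requirement, not the data volume, determines the disk count, which is the effect described above.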

Database Replication and Distributed Caching

13 Sept, 2006 CMS Frontier Report

FroNTier “Launchpad” software

• Squid caching proxy

– Load shared with Round-Robin DNS

– Configured in “accelerator mode”

– Peer-to-peer caching

– “Wide open frontier”*

• Tomcat - standard

• FroNTier servlet

– Distributed as “war” file

• Unpack in Tomcat webapps dir

• Change 2 files if name is different

– One xml file describes DB connection

[Diagram: Round-Robin DNS in front of server1, server2 and server3, each running Squid, Tomcat and the FroNTier servlet, with a DB behind them]

*In the past, we required registration so we could add the IP/mask to our Access Control List (ACL) at CERN. We recently decided to run in “wide-open” mode so installations can be tested w/o registration.
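The combination of round-robin load sharing and per-server caching can be sketched with a small simulation. This is not FroNTier or Squid code: `CachingServer`, `fake_database` and the query string are hypothetical stand-ins, and peer-to-peer caching between the servers is not modelled.

```python
from itertools import cycle

class CachingServer:
    """Stand-in for one Squid + servlet node that caches query results."""

    def __init__(self, name, backend):
        self.name = name
        self.backend = backend   # callable that runs the query on the database
        self.cache = {}
        self.db_hits = 0

    def get(self, query):
        if query not in self.cache:
            # Cache miss: only now is the backend database contacted.
            self.db_hits += 1
            self.cache[query] = self.backend(query)
        return self.cache[query]

def fake_database(query):
    """Hypothetical backend; a real setup would query the database here."""
    return f"result of {query!r}"

servers = [CachingServer(f"server{i}", fake_database) for i in (1, 2, 3)]
round_robin = cycle(servers)     # mimics round-robin DNS handing out servers

def fetch(query):
    return next(round_robin).get(query)

for _ in range(30):
    fetch("select * from conditions where run = 1")

total_db_hits = sum(server.db_hits for server in servers)
print(total_db_hits)
```

In this sketch thirty identical client requests reach the database only once per server, which is why a read-mostly workload such as conditions data suits this kind of cache.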

How to keep Databases up-to-date?

Asynchronous Replication via Streams

[Diagram, built up over several slides: a change at the source database, e.g. insert into emp values ( 03, “Joan”, ….), is captured as a logical change record (LCR), propagated to the destination databases, and applied there. Stages: capture -> propagation -> apply. Sites shown: CERN, CNAF, RAL, Sinica, FNAL, IN2P3, BNL]
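The capture, propagation and apply stages can be sketched as a toy model. This is not the Oracle Streams API: it uses in-memory SQLite databases as stand-ins, a Python list as an assumed redo log, and two example destinations. It only shows why the replication is asynchronous, in that the source commits without waiting for any replica.

```python
import sqlite3
from collections import deque

def make_db():
    """Create an in-memory database with the example table."""
    db = sqlite3.connect(":memory:")
    db.execute("create table emp (id integer primary key, name text)")
    return db

def row_count(db):
    return db.execute("select count(*) from emp").fetchone()[0]

source = make_db()
replicas = {"CNAF": make_db(), "RAL": make_db()}   # two example destinations
redo_log = []                                      # stand-in for the source redo log
queues = {site: deque() for site in replicas}      # one LCR queue per destination

def source_insert(emp_id, name):
    """Commit at the source without waiting for any replica."""
    with source:
        source.execute("insert into emp values (?, ?)", (emp_id, name))
    redo_log.append((emp_id, name))

def capture_and_propagate(position):
    """Capture redo entries after `position` as LCRs and queue them per site."""
    for lcr in redo_log[position:]:
        for queue in queues.values():
            queue.append(lcr)
    return len(redo_log)

def apply_changes(site):
    """Apply all queued LCRs at one destination."""
    db, queue = replicas[site], queues[site]
    while queue:
        with db:
            db.execute("insert into emp values (?, ?)", queue.popleft())

source_insert(3, "Joan")
counts_before = {site: row_count(db) for site, db in replicas.items()}
position = capture_and_propagate(0)
apply_changes("CNAF")
counts_after = {site: row_count(db) for site, db in replicas.items()}
print(counts_before, counts_after)
```

After the source insert both replicas are still empty; once the change is propagated and applied at one site, the other site simply lags until its own apply step runs.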


Database s/w Licenses and Support

• Tier 1 licenses acquired

– Experiment and grid service requests collected for all T1 sites

– License payment and signed agreement forms received

• LCG Support ID has been created with Oracle

• Accounts for Oracle support (MetaLink) created for all T1 database contacts

– access to Oracle s/w, patches, security upgrades and problem database


Tier 0 setup and operational procedures

• Main streams operations have been included in the Tier 0 DBA operations manual

• Automated alerts for database or streams problems from 3D monitoring

– Integrated with GGUS (grid user support)

– so far: handling of streams problems during working hours only

• Downstream capture setup has been installed as part of the planned Tier 0 service extension

– log-mining step runs on a separate box, offloading the source database


Further Decoupling between Databases

[Diagram: redo log files are copied from the source database (CERN RAC) to a downstream database at CERN, which runs one capture process per destination; propagation jobs deliver the changes to the destination sites (CNAF, FNAL, CERN)]

Objectives

• Remove impact of capture from Tier 0 database

• Isolate destination sites from each other

– one capture process + queue pair per target site

– big Streams pool size

– redundant events (x number of queues)
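The trade-off listed in the objectives, isolation between destination sites at the cost of redundant events, can be sketched as follows. `DownstreamCapture` is a hypothetical toy class, not part of any Oracle interface, and the site names are taken from the diagram only as examples.

```python
from collections import deque

class DownstreamCapture:
    """Toy model: one queue per destination site, fed with every change."""

    def __init__(self, sites):
        self.queues = {site: deque() for site in sites}
        self.events_enqueued = 0

    def capture(self, lcr):
        # Each change is stored once per queue: the "redundant events" cost.
        for queue in self.queues.values():
            queue.append(lcr)
            self.events_enqueued += 1

    def propagate(self, site, reachable=True):
        """Deliver the backlog of one site; an unreachable site keeps its backlog."""
        if not reachable:
            return []
        queue = self.queues[site]
        delivered = list(queue)
        queue.clear()
        return delivered

dc = DownstreamCapture(["CNAF", "FNAL", "CERN"])
for n in range(5):
    dc.capture(f"LCR-{n}")

delivered_fnal = dc.propagate("FNAL")
delivered_cnaf = dc.propagate("CNAF", reachable=False)   # simulated site outage
backlog = {site: len(queue) for site, queue in dc.queues.items()}
print(len(delivered_fnal), backlog, dc.events_enqueued)
```

In this model the simulated outage at one site leaves delivery to the other sites untouched, while the total number of stored events grows with the number of queues.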


Database Backup / Recovery with Streams

• Collected DB / streams recovery scenarios

– Recovery after T1 data loss - OK

• RAL recovered and re-synchronised

• Replication CERN to CNAF continued unaffected

– Recovery after T0 data loss - OK

– Coordinated point-in-time recovery - OK

• Service procedure documented and validated with two Tier 1 sites and CERN

• Full recovery exercise with all sites scheduled for the 3D Workshop at CNAF June 12-13

– Recover Tier 1 database from tape

– Resynchronise replication streams


Streams Performance Tuning

• Focus recently: Wide Area Network

– Remote sites significantly affected by latency

• e.g. ASGC with 300ms round trip time

• Studied TCP level data flow between CERN and Taiwan

• Resulting Optimisations

– increased TCP buffer size (OS level)

– decreased frequency of acknowledgements between src and dest DB

• Total improvement of factor 10

– Now: 4000 logical change records / sec

– Checklist for Tier 1 sites has been prepared
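Why a larger TCP buffer helps on a 300ms link can be seen from the bandwidth-delay relation: a single TCP connection cannot send more than one window of data per round trip. The sketch below uses the 300ms round trip time quoted above; the two window sizes and the target rate are assumptions for illustration, not the values used at CERN or ASGC.

```python
def max_throughput_bytes_per_s(window_bytes, rtt_s):
    """Upper bound for one TCP connection: one window per round trip."""
    return window_bytes / rtt_s

def window_needed_bytes(target_bytes_per_s, rtt_s):
    """Window (bandwidth-delay product) needed to sustain a target rate."""
    return target_bytes_per_s * rtt_s

rtt = 0.300                                                      # seconds, from the slide
small = max_throughput_bytes_per_s(64 * 1024, rtt)               # assumed 64 KiB window
large = max_throughput_bytes_per_s(4 * 1024 * 1024, rtt)         # assumed 4 MiB window
needed = window_needed_bytes(10_000_000, rtt)                    # assumed 10 MB/s target

print(round(small), round(large), round(needed))
```

The same relation explains the second optimisation: every acknowledgement exchanged between source and destination database costs a full round trip, so fewer of them means less time spent waiting on a high-latency link.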


Experiment Database Activities & Plans

Andrea Valassi Conditions Databases Rimini, 7 May 2007

LHCb – COOL service model

• Two servers at CERN – essentially for online and offline

– Replication to Tier1’s from the online database is a two-step replication

– Online server at the pit (managed by LHCb); offline server in the CC

[Diagram: Tier 1 sites GRIDKA, RAL, IN2P3, CNAF, SARA, PIC]

(Marco Clemencic, 3D workshop 13 Sep 2006)


LFC Replication Testbed

[Diagram labels: Population Clients; LFC R-W Servers; LFC Oracle Server, Master DB; Oracle Streams; LFC Oracle Server, Replica DB; LFC Read-Only Server; Read Only Clients. Hosts: rls1r1.cern.ch, lxb0716.cern.ch, lxb0717.cern.ch, lfc-streams.cr.cnaf.infn.it, lfc-replica.cr.cnaf.infn.it]


File Catalogue replication between CERN and CNAF

• LFC replication via streams between CERN and CNAF in production since last November

– Requested by LHCb to provide read-only catalog replicas

• Stable operation without major problems

– Several site interventions at CNAF have been performed

– Site restart and resynchronisation worked

– Rate is low compared to conditions

• In contact with LHCb for adding remaining


ATLAS – COOL service model

• COOL Oracle services at Tier0 and ten Tier1’s

– Two COOL servers at CERN for online/offline (similar to LHCb)

• Online database within the Atlas pit network, but physically in the CC

– In addition: Oracle at three ‘muon calibration center’ Tier2’s

[Diagram labels: ATLAS pit (Online / PVSS / HLT farm, Online OracleDB, ATLAS pit network (ATCN), gateway); Computer centre (Offline master CondDB, Tier-0, Tier-0 Sqlite replica (1 file/run), dedicated 10Gbit link, CERN public network); Outside world (Tier-1 replicas); Calibration updates; Streams replication. Tier 1 sites: GRIDKA, TAIWAN, RAL, IN2P3, CNAF, SARA, BNL, TRIUMF, PIC, Nordugrid]

(Sasha Vaniachine and Richard Hawkings, 3D workshop 14 Sep 2006)

ATLAS Muon Calibration Data Flow

3D Database Resource Request and Current Predictions

Columns: Dual CPU DB Nodes; DB Storage [TB usable]

Request: no change wrt GDB Nov’05

Conditions Challenges (April - Jul): 3 2 0.3 0.1

– ATLAS: COOL + TAGs

– LHCb: COOL + LFC r/o replica

– Predictions: next review after initial CDC phase (e.g. May)

Dress Rehearsals (Jul-Nov): 3 2 0.3 0.1

– ATLAS: 4GB on 64bit DB server

– LHCb: 2 LFC r/o servers in place

– expect resource upgrade: double storage and CPU review

LHC Startup (from Nov): 3+x 2+y 1.0 0.3

ATLAS LHCb from June 2008 2009 nominal year 0.2 + 1.4 0.5 + 3.7 0.8 + 6.0


CMS – conditions data at CERN

Vincenzo Innocente, CMS Software Week, April 2007

ORCON and ORCOFF conditions data are in POOL_ORA format

ORCON-ORCOFF Oracle Streams prototype set up (integration RAC). Production set up later in 2007.

11 Oct 2006

FroNTier Launchpad Setup

• CERN DNS round robin: provides load balancing and failover

• 3 servers running Frontier & Squid (worker nodes)

• Backend Oracle Database 10gR2 (4-node RAC)

[Diagram labels: WAN, T0 Farm]

Slide: L. Lueking


CMS Request Update

26 Jan 2007, CMS Req. Update and Plans

Rough Estimates of CMS DB Resources, March 2007 through August 2007

Area (CMS contact) | Disk Maximum (GB) | Concurrent Users Peak usage | Transactions Peak usage (Hz)

Online P5 | 500 | 20 | 10

Offline Conditions Tier-0 (CMSR) | 500 (DB), 100 (per Squid), 2-3 Squids | 10 (DB)*, 10 (Squid) | 10 (DB)*, 10 (Squid)

Offline Conditions Tier-1 (each site) | 100 (per Squid), 2-3 Squids/site | 10 (per Squid), >100 all sites | 10 (per Squid), >100 all sites

Offline DBS Tier-0 (CMSR) | 20 | 10 (currently ~2) | 10 (currently ~5)

* Incl. On2off Xfer


Deployment Status and Next Steps

• CMS tested Frontier/SQUID setup during CSA ’06

– now some 30 Tier 1 and Tier 2 sites connected

• ATLAS and LHCb moved to production mode in April

• All ten Tier 1 database sites integrated into a single distributed database infrastructure

– ASGC, BNL, CNAF, GridKA, IN2P3, SARA/NIKHEF, NDGF, PIC, RAL, TRIUMF

– One of the largest distributed database setups worldwide

• Preparing for Tier 1 replica scaling test with O(100) client nodes using ATLAS offline framework ATHENA

• In summer: participate in WLCG “Dress Rehearsal” tests


More information at

• WLCG 3D Project

– http://lcg3d.cern.ch or

– http://twiki.cern.ch/twiki/bin/view/PSSGroup/LCG3DWiki

• CERN Physics Database Service

– http://phydb.web.cern.ch/phydb/

• WLCG Persistency Framework

– http://pool.cern.ch

– http://pool.cern.ch/coral