Managing managed storage

(1)

CERN Disk Server operations

HEPiX 2004 / BNL

(2)

Outline

What are our “Data Services”?

Disk server hardware @ CERN

Management tools

(3)

A lot of hardware

Disk storage

350 “storage in a box” Linux diskservers

6700 disks

550 TeraBytes of raw disk space

Tape storage

2 robotic installations

each with 5 STK 9310 silos

(4)

Many applications

[Pie chart: share of diskservers by application: CASTOR 55%, R&D 14%, Oracle 11%, repair/spare 11%, CDR 6%, AFS 3%]

200 CASTOR!

40 Oracle

20 CDR

10 AFS scratch

dCache, LHC@home, …

LCG, OpenLab, EGEE, data challenges

40 in repair/spare

A very heterogeneous environment!

(5)

Players

Many teams involved:

Application responsibles / Users

Service managers

System administrators team

Suppliers

Software often not redundant…

need to minimize downtime!

(6)

“Storage in a box”

13 different hardware configurations:

8 – 26 IDE disks, hot-swappable trays

2 – 4 3-Ware RAID controllers

2 CPUs

2 – 3 power supplies

GigE network card

(7)

Hardware interventions

55 interventions since Sep 1

disk replacements (70%)

trays, cables, fans, PSU

33% involve (un)scheduled downtime

Older hardware harder to maintain

One supplier out of business

Incidents to spice up life…

(8)

Disk replacement

10 months before case agreed: head instabilities

4 weeks to execute

1224 disks exchanged (=18%), and the cages as well

[Figure: % broken mirrors per month, Dec-03 to Sep-04; vertical axis 0.0% to 4.5%; annotation: 1224 disks replaced]

Jumbo servers

(9)

65 Jumbos

1 – 1.5 TB raw disk space

3-Ware 6800 controllers

600 MHz PIII

No PXE

Becoming hard to maintain

Many still under warranty

(10)

175 4U servers

4U (5U) rack mounted

1 – 1.5 TB

2 * 3-Ware 7000 series

currently upgrading firmware

2 * 1 GHz PIII’s

No PXE (yet)

Various maintenance issues

(11)

115 8U servers

8U rack mounted

2 – 2.5 TB

3 – 4 * 3-Ware 7500(6)-8

2 * 2.4 GHz Xeon

Well controlled, well maintained, well behaved
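The per-class figures on the three server slides can be cross-checked against the totals quoted earlier (350 diskservers, 550 TB raw). The following is a minimal sketch, assuming the capacity ranges are per server, in TB, and refer to raw space for all three classes (the slides say "raw" only for the Jumbos). The names `FLEET` and `fleet_totals` are illustrative only.

```python
# Figures quoted on the slides: (server count, min TB per server, max TB per server).
# Assumption: all ranges are raw capacity per server, in TB.
FLEET = {
    "Jumbo": (65, 1.0, 1.5),
    "4U": (175, 1.0, 1.5),
    "8U": (115, 2.0, 2.5),
}


def fleet_totals(fleet):
    """Return (server count, lowest total TB, highest total TB) for the fleet."""
    servers = sum(count for count, _, _ in fleet.values())
    low_tb = sum(count * low for count, low, _ in fleet.values())
    high_tb = sum(count * high for count, _, high in fleet.values())
    return servers, low_tb, high_tb


servers, low_tb, high_tb = fleet_totals(FLEET)
print(f"{servers} servers, {low_tb:.1f} to {high_tb:.1f} TB")
```

Under these assumptions the sum comes to 355 servers and between 470 and 647.5 TB, which brackets the roughly 350 servers and 550 TB quoted for the whole installation.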

(12)

Diskserver evolution

[Figure: gross capacity and usable capacity in GB (left axis, 0 to 3000) and price per usable GB in CHF/GB (right axis, 0 to 45), by tender date, Oct-00 to Apr-03]

(13)

That was then…

HW RAID1

Ext2 filesystems, many of them

13 different kernels!

RedHat 6.1/6.2, 7.2/7.3, 2.1ES

Need for automation + standardization

ELFms toolsuite

Quattor – installation + configuration

LEMON – performance + exception monitoring

(14)

…this is now

RedHat 7.3, preparing for SLC3

Oracle: RHEL 2.1, preparing RHEL 3

kernel has old 3-Ware driver

HW RAID5 + hot spare disk

Up to 50% more usable space

On 3-Ware 7000 controller with up-to-date firmware
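The "up to 50% more usable space" figure can be reproduced with simple arithmetic. This is a sketch under stated assumptions, not the configuration actually deployed: one array per controller, equal-sized disks, and at most 8 disks per controller (suggested by the "-8" in the controller model names). The function names are illustrative.

```python
def usable_raid1(disks):
    """Mirrored pairs: half of the disks hold copies."""
    return disks // 2


def usable_raid5_hot_spare(disks):
    """One disk's worth of parity plus one idle hot spare."""
    if disks < 4:
        raise ValueError("RAID5 plus a hot spare needs at least 4 disks")
    return disks - 2


def gain_percent(disks):
    """Extra usable space of RAID5 + hot spare relative to RAID1, in percent."""
    before = usable_raid1(disks)
    after = usable_raid5_hot_spare(disks)
    return 100.0 * (after - before) / before


# Assumption: up to 8 disks behind one controller.
for disks in (4, 6, 8):
    print(disks, usable_raid1(disks), usable_raid5_hot_spare(disks), gain_percent(disks))
```

With 8 disks per controller, RAID1 yields 4 disks of usable space and RAID5 plus a hot spare yields 6, a gain of 50%. Smaller arrays gain less, which fits the "up to" wording.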

SW RAID0 + XFS

Improved performance expected

iozone benchmark

Old XFS version

(15)

Updating the toolbox

SMART – to predict disk failure

daily and weekly self-tests, on every disk

IPMI v1.5

HW monitoring and event control

Power control, resets

Lm_sensors – temperature monitoring
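One way to schedule daily and weekly SMART self-tests on every disk is through the `smartd` daemon from smartmontools. The slides do not say which tool was used, so this is an assumption, as are the device name `/dev/twe0`, the 8 ports, and the test times. The sketch below generates one `smartd.conf` line per disk behind a 3ware controller, with a short self-test every day and a long self-test once a week; `smartd_lines` is a hypothetical helper.

```python
def smartd_lines(controller_dev, ports, short_hour=2, long_weekday=6, long_hour=3):
    """Build one smartd.conf line per disk behind a 3ware controller.

    The -s schedule uses smartd's T/MM/DD/d/HH pattern:
    S = short self-test (here every day), L = long self-test (here one weekday).
    """
    schedule = f"(S/../.././{short_hour:02d}|L/../../{long_weekday}/{long_hour:02d})"
    return [
        f"{controller_dev} -d 3ware,{port} -a -s {schedule}"
        for port in range(ports)
    ]


# Assumed example: one controller exposed as /dev/twe0 with 8 ports.
for line in smartd_lines("/dev/twe0", ports=8):
    print(line)
```

Generating the lines from the port count keeps the schedule identical on every disk, which suits template-driven configuration management.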

(16)

(17)

This is now

Quattorized + Lemonized

Rely on Operator and SysAdmin teams

Operated in same way as PC farms

Getting more out of suppliers

(18)

What’s next?

New hardware

360 TB “SATA in a box”, 2 different suppliers

140 TB FC attached external SATA disk arrays

New software

SLC3, RHEL 3

New CASTOR stager

New challenges

Oracle SAN setup

ALICE data challenge

(19)

Conclusions

A lot of work has been done to:

Stabilize hardware and software

Automate + hand over basic operations

Integrate into standard work flows

Get more out of available hardware

(20)

Useful links

“Standing on the shoulders of giants”

Tim Smith CHEP 2004

http://indico.cern.ch/contributionDisplay.py?contribId=374&sessionId=10&confId=0

Helge Meinhard CHEP 2004

http://indico.cern.ch/contributionDisplay.py?contribId=325&sessionId=10&confId=0

Peter Kelemen CERN IT “After C5”

http://cern.ch/Peter.Kelemen/talk/2004/C5/diskserver

Jan Iven HEPiX 2004 Edinburgh

http://hepwww.rl.ac.uk/hepix/nesc/iven.pdf