Controls System UNIX Computing Environment and Data Management Facilities

(1)

Controls System UNIX Computing Environment and Data Management Facilities

• Review

• Issues

– data storage

– backup

– security

– scalability

– reliability

– system administration

• Solutions

• Production data staging and distribution

• Local NFS system

(2)

Existing Production Unix System

[Network diagram; labels recovered from the figure]

• Gateway servers: opi00gtw00, opi00gtw01, opi00gtw02, opi00gtw04 (nlcta server), slcs2 (babar gateway), slcs1 (backup for slcs2)

• Monitors: netmon-pep (network monitor), opi99rfs02 (system monitor)

• MCC VMS (backup & dns), NAS 0, NAS 1

• slcsun1 (epics dev), mccux02 (gpib)

• opi04rfs00, opi04lfb00, opi04lfb01, opi04lfb02, opi08rfs00, opi12rfs00

• nlcdev, nlcdev-opi01, nlcdev-opi02, nlcdev-opi03, nlcdev-opi04, nlcdev-opi05

• slclavc, pepii, leb, babar, lan, ioc

• SCS pubs: AFS, NFS, NIS server, Web server, ORACLE server, staging system (backup), SUN

(3)

Issues

• Standalone computing environment

– Controls functional if disconnected

• Balance of high security and availability?

– Systems in production

• Security enforcement lags behind

• Reboots often needed; system upgrades needed

– Latest incident: GPIB server (kernel corrupted)

• Balance of reliability and scalability?

– Standalone:

• redundant, robust, very reliable

• problem: scalability, system admin (local services)

– A new model: Localized yet Transparent

• Data storage and backup

– Limited capacity

– Constraints

(4)

A Transparent Model

[Diagram labels: Production Server 0, Production Server 1; unix clients; pepii, slclavc, nlcdev, leb, babar; SCS pubs: AFS, NFS, NIS server, Web server, ORACLE server, staging system (backup), SUN]

• Model

– Router added

– Gateways -> Production servers

– SCS: OS, security, customization, extensive services

– We:

• Local NIS and DNS

• Automount: NFS locally and AFS from SCS

• Data storage buffer

• Prod (and some dev) software

• Procedures (backup, data distribution, monitoring and tools)

• Localized yet Transparent and Scalable

– Served by local NFS, NIS and DNS, local prod software

– Controls functional if disconnected from SCS

– Resources and data transparent

– Dev and prod platforms symmetric (OS)

– Easy development, deployment and maintenance

– Monitoring in a timely fashion

– Simple upgrades, easy expansion

• Concerns

– We: reliability, availability, controls network security

– SCS: 24 hrs on call, heavier workload

(5)

Storage Facilities

[Diagram labels: MCC VMS; NAS 0, NAS 1 (cmlog data); LAVC; PEPII; PEPII gateway servers; /u1 (PEPII data); LEB; NLCTA gateway server; SCS pubs: backup, SCS NFS, Oracle server (NLCTA Channel Archiver data); iocs]

• Scattered

– PEPII -> /u1

– CMLOG -> NAS

– NLCTA -> /u1 and SCS NFS

– Channel Archiver -> NFS & AFS

– Luminosity -> AFS

– etc.

• Processes needed, CPU drain

• Increased network traffic

[Diagram label: BABAR gateway server]

(6)

Current Archiver Architecture

[Architecture diagram; legend: Data Flow, Networks]

• OPI00GTW04: TARF Archive Engine and Archive Browser

• SLCS2: PEPII Archive Engine and Archive Browser

• RTR-MCC (disconnect point)

• SLCLAVC, BABAR, LEB

• Data Backup

• NFS File System

• AFS File System

(7)

NLCTA Production

[Diagram labels: LAVC; PEPII; PEPII gateway servers; /u1 (PEPII data); LEB; NLCTA server; SCS pubs: backup, SCS NFS; ioc]

• computing resources shared with PEPII

• NLCTA server (opi00gtw04)

• iocs booted from PEPII gateway servers

• data logged to PEPII gateway storage area (/u1)

• accumulating at a high rate of > 100 MB/day (approx. 1600 binary files, 48K/file)

• constraints:

– limited storage capacity

– constraints on adding disks

– isolated production network

• a solution:

– data staging

– distribution

(8)

Data Distribution

[Diagram labels: LAVC; PEPII; PEPII gateway servers; /u1 (PEPII data); LEB; NLCTA server; SCS pubs: backup, SCS NFS; ioc]

NLCTA accumulating data

• logged into PEPII staging area ($STAG/fault)

Process: distnlctadata (on NLCTA server)

• archives the data files (if > 1 day old)

• creates a tar file: fault_mm-dd-yy.tar

• distributes it to SCS NFS

– tar -> $NLCTA/fault/archive for safekeeping

– data restored -> $NLCTA/fault/data (mirrored) for analysis

– compressed, backed up via SCS facility

• elapsed time of distnlctadata is measured

Process: cleanup (on PEPII gateway server)

• triggered when archiving is completed

• offset by the time distnlctadata takes

• archived data removed from staging

Automatic: trscrontab.nlcta

• daily, extended trscrontab token

Reliability:

• all involved systems validated

• any failure

– tracked and reported via e-mail

– cleanup disabled, local backup, data protected

(Thanks Krisiti.)

[Diagram labels: distnlctadata, trscrontab, staging, cleanup]
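The archiving step described on this slide can be sketched roughly as follows. This is a minimal illustration, not the actual distnlctadata script: the function names, the directory arguments (standing in for $STAG/fault, $NLCTA/fault/archive and $NLCTA/fault/data) and the use of file modification time to decide what is "> 1 day old" are all assumptions.

```python
import tarfile
import time
from pathlib import Path

ONE_DAY = 24 * 60 * 60  # seconds


def old_files(staging: Path, now: float, min_age: float = ONE_DAY) -> list:
    """Return regular files in staging last modified more than min_age ago."""
    return sorted(p for p in staging.iterdir()
                  if p.is_file() and now - p.stat().st_mtime > min_age)


def archive_old_files(staging: Path, archive_dir: Path, data_dir: Path,
                      now: float):
    """Tar the old files as fault_mm-dd-yy.tar, then restore a copy for analysis.

    Hypothetical helper: returns the tar path and the list of archived files,
    so that a later cleanup step can remove exactly those files.
    """
    files = old_files(staging, now)
    archive_dir.mkdir(parents=True, exist_ok=True)
    data_dir.mkdir(parents=True, exist_ok=True)
    tar_path = archive_dir / time.strftime("fault_%m-%d-%y.tar",
                                           time.localtime(now))
    with tarfile.open(tar_path, "w") as tar:
        for f in files:
            tar.add(f, arcname=f.name)
    # Restore the archived files into the analysis area (the "mirrored" copy).
    with tarfile.open(tar_path, "r") as tar:
        for member in tar.getmembers():
            src = tar.extractfile(member)
            if src is not None:
                (data_dir / member.name).write_bytes(src.read())
    return tar_path, files
```

Returning the list of archived files is a design choice of this sketch: it lets cleanup act only on files known to be in the tar, in the spirit of "cleanup disabled" on any failure.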

(9)

Protection against data loss

• Data loss (incident I)

– Incremental

– Fault trips occurred after archiving was done and cleanup was triggered

– Solution

• Archive the data files (> 1 day old)

• On a daily basis

• Name: fault_mm-dd-yy.tar

• Data loss (incident II)

– The elapsed time of distnlctadata

– Solution

• Time distnlctadata

• Offset cleanup by that time

[Timeline diagram labels: Data > 1 day archived; Data > 1 day removed (cleanup); offset; Archived]
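One way to read incident II: if cleanup applies the same "> 1 day" rule but starts later than the archiver did, files that crossed the one-day threshold while the archiver was running are removed without ever being archived. The sketch below illustrates that reading with plain timestamps; the function names and the numbers in the usage note are hypothetical, and the real scripts may compute the offset differently.

```python
ONE_DAY = 24 * 60 * 60  # seconds


def age_cutoff(run_time: float, offset: float = 0.0) -> float:
    """Files modified before this timestamp count as '> 1 day old'.

    offset pushes the cutoff back, e.g. by the archiver's elapsed time.
    """
    return run_time - ONE_DAY - offset


def lost_files(mtimes, archived_before: float, removed_before: float):
    """Timestamps of files that cleanup would remove but the archiver skipped."""
    return [t for t in mtimes if archived_before <= t < removed_before]
```

For example, with the archiver starting at t = 100000 s and taking 600 s, a file modified at t = 13900 s is too new to archive (cutoff 13600) but old enough for an un-offset cleanup starting at t = 100600 (cutoff 14200). Passing `offset=600` to the cleanup cutoff brings it back to 13600, so nothing unarchived is removed.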

(10)

Backup Facilities

• In different forms:

– PEPII systems, applications and data -> VMS buffer -> tape

– CMLOG data -> NAS disk (mirrored)

– NLCTA data -> SCS tape staging facility

– Channel Archiver data -> SCS AFS & CD

– GPIB server system and application -> local tape

– Network monitor utilities -> none

• Restoring by different methods

• A dilemma

– VMS buffer limited (image and incremental)

– problem writing files > 2 GB to VMS buffer from Solaris

– Multinet NFS protocol file size limitation ~2GB
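The file size limit can be checked before a backup image is written to the buffer. The sketch below is only an illustration: the slide gives the limit as "~2GB", so the exact value used here (2^31 - 1 bytes) and both helper names are assumptions.

```python
from pathlib import Path

# Assumed limit: largest size representable in a signed 32-bit offset.
ASSUMED_LIMIT = 2**31 - 1


def oversized(paths, limit: int = ASSUMED_LIMIT):
    """Return the paths whose size in bytes exceeds the limit."""
    return [p for p in paths if Path(p).stat().st_size > limit]


def parts_needed(size: int, limit: int = ASSUMED_LIMIT) -> int:
    """Number of pieces needed so that no piece exceeds the limit."""
    return -(-size // limit)  # ceiling division
```

A backup procedure could use `oversized` to flag images that would fail, and `parts_needed` to decide how many pieces to split them into.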

(11)

A Long Term Solution: Local NFS

Local NFS file systems:

• Two SUN Netra 20

– Redundant (master & failover)

– Expandable

– Optimized for serving files on network

• Sun StorEdge Disk Array

– RAID controller: disk mirrored

– Dynamic reconfiguration of disk storage

– Hot-swap redundant components

– Optical fiber cable

Backup facility:

• Magneto-optical storage media technology

• HP Optical Jukeboxes

– High-volume reference data

– Write once, read forever

– Scalable (up to 2 TB)

– Transfer rate up to 10 MB/s

• More under research

(Thanks KenB)

[Diagram labels: Sun StorEdge Disk Array; SUN Netra 20 Master Server; SUN Netra 20 Failover Server; HP Optical Jukeboxes; optical]

(12)

New Storage System

[Diagram labels: MCC VMS; NAS 0, NAS 1 (Cmlog data); LAVC; PEPII; PEPII gateway servers; LEB; /u1 (PEPII & NLCTA data); NLCTA gateway server; BABAR gateway server; SCS pubs: backup, SCS NFS, Oracle server (NLCTA Channel Archiver data); All production data; iocs]

(13)

Conclusions

• The standalone controls computing environment

– Robust (e.g., no failure for two years)

– Dilemmas & solutions

• An enlightenment: local NFS solution

– Centralized

• All production data in one place, backup in one form and in one place

• Higher volume, higher data rate, yet less network traffic

• Fast data access and simple data restore

– Scalable and reliable

• Systems expandable

• Redundant, robust systems; fewer network routes, fewer platforms, fewer procedures

– No dependence on SCS at all

• All services critical to production provided locally

• Except for development and e-mail