Controls System UNIX Computing Environment
And Data Management Facilities
• Review
• Issues
– data storage
– backup
– security
– scalability
– reliability
– system administration
• Solutions
• Production data staging and distribution
• Local NFS system
Existing Production Unix System
[Diagram: existing production network. Gateway servers: opi00gtw00, opi00gtw01, opi00gtw02, opi00gtw04 (nlcta server). Monitors: netmon-pep (network monitor), opi99rfs02 (system monitor). MCC VMS (backup & dns); NAS 0, NAS 1. slcs2 (babar gateway), slcs1 (backup for slcs2), slcsun1 (epics dev), mccux02 (gpib). Hosts: nlcdev-opi01 to nlcdev-opi05, opi04rfs00, opi04lfb00, opi04lfb01, opi04lfb02, opi08rfs00, opi12rfs00. Networks: slclavc, pepii, leb, babar, nlcdev, each LAN with iocs. SCS: pubs, AFS, NFS, NIS server, Web server, ORACLE server, staging system (backup), SUN.]
Issues
• Standalone computing environment
– Controls functional if disconnected
• Balance of high security and availability?
– Systems in production
• Security enforcement lags behind
• Reboots and system upgrades often needed
– Latest incident: GPIB server (kernel corrupted)
• Balance of reliability and scalability?
– Standalone:
• redundant, robust, very reliable
• problem: scalability, system admin (local services)
– A new model: Localized yet Transparent
• Data storage and backup
– Limited capacity
– Constraints
A Transparent Model
[Diagram: Production Server 0 and Production Server 1 serving unix clients on pepii, slclavc, nlcdev, leb and babar. SCS: pubs, AFS, NFS, NIS server, Web server, ORACLE server, staging system (backup), SUN.]
•Model
•Router added
•Gateways -> Production servers
•SCS: OS, security, custom tailoring, extensive services
•We:
•Local NIS and DNS
•Automount: NFS locally and AFS from SCS
•Data storage buffer
•Prod (and some dev) software
•Procedures (backup, data dist. monitoring and tools)
•Localized yet Transparent and Scalable
•Served by local NFS, NIS and DNS, local prod software
•Control functional if disconnected from SCS
•Resources and data transparent
•Dev and prod platform symmetric (OS)
•Easy in dev, deployment, maintenance
•Monitoring in a timely fashion
•Upgrade simple, easy expansion
•Concerns
•We: reliability, availability, controls network security
•SCS: 24 hrs on call, more workloads
Storage Facilities
[Diagram: storage locations. MCC VMS (backup); NAS 0, NAS 1 (cmlog data); PEPII gateway servers with /u1 (PEPII data); NLCTA gateway server; LAVC, PEPII and LEB networks with iocs; SCS: pubs, backup, NFS, Oracle server; label "(NLCTA Channel Archiver data)".]
•Scattered
•PEPII -> /u1
•CMLOG -> NAS
•NLCTA -> /u1 and SCS NFS
•Channel Archiver -> NFS & AFS
•Luminosity -> AFS
•etc.
•Processes needed, CPU drain
•Increased network traffic
Current Archiver Architecture
[Diagram: data flow and networks. BABAR gateway server; OPI00GTW04 (TARF Archive Engine and Archive Browser); SLCS2 (PEPII Archive Engine and Archive Browser); RTR-MCC (disconnect point); networks SLCLAVC, BABAR, LEB; data backup; NFS file system; AFS file system.]
NLCTA Production
[Diagram: LAVC, PEPII and LEB networks with iocs; PEPII gateway servers with /u1 (PEPII data); NLCTA server; SCS: pubs, backup, NFS.]
• computing resources shared with PEPII
• NLCTA server (opi00gtw04)
• iocs booted from PEPII gateway servers
• data logged to the PEPII gateway storage area (/u1)
• data accumulating at a high rate of > 100MB/day (approx. 1600 binary files, 48K/file)
•constraints:
•limited storage capacity
•restrictions on adding disks
•isolated production network
• a solution:
•data staging
•distribution
Data Distribution
[Diagram: LAVC, PEPII and LEB networks with iocs; PEPII gateway servers with /u1 (PEPII data); NLCTA server; SCS: pubs, backup, NFS.]
NLCTA accumulating data
•logged into PEPII staging area ($STAG/fault)
Process: distnlctadata (on NLCTA server)
•archives the data files (if > 1 day old)
•creates a tar file: fault_mm-dd-yy.tar
•distributes it to SCS NFS
– tar -> $NLCTA/fault/archive, for safekeeping
– data restored -> $NLCTA/fault/data (mirrored), for analysis
– compressed, backed up via SCS facility
•run as "time distnlctadata" to record the elapsed time
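The archiving step described above can be sketched roughly as follows. This is only an illustration, not the actual distnlctadata script: the function names `select_old_files` and `archive_old_files`, their arguments, and the use of file modification time to decide what is older than one day are all assumptions. Only the one-day threshold and the fault_mm-dd-yy.tar naming come from the slides.

```python
import os
import tarfile
import time
from datetime import datetime

ONE_DAY = 24 * 60 * 60  # seconds

def select_old_files(staging_dir, now, min_age=ONE_DAY):
    """Return sorted names of regular files older than min_age seconds (by mtime)."""
    old = []
    for name in sorted(os.listdir(staging_dir)):
        path = os.path.join(staging_dir, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > min_age:
            old.append(name)
    return old

def archive_old_files(staging_dir, archive_dir, now=None):
    """Tar files older than one day into archive_dir/fault_mm-dd-yy.tar.

    Returns (tar_path, archived_names, elapsed_seconds). The elapsed time
    plays the role of "time distnlctadata" (hypothetical interpretation).
    """
    now = time.time() if now is None else now
    started = time.time()
    names = select_old_files(staging_dir, now)
    stamp = datetime.fromtimestamp(now).strftime("%m-%d-%y")
    tar_path = os.path.join(archive_dir, f"fault_{stamp}.tar")
    with tarfile.open(tar_path, "w") as tar:
        for name in names:
            # Store files under their bare names, not the full staging path.
            tar.add(os.path.join(staging_dir, name), arcname=name)
    elapsed = time.time() - started
    return tar_path, names, elapsed
```

Returning the list of archived names along with the elapsed time lets a later cleanup step remove only what was actually archived.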
Process: cleanup (on PEPII gateway server)
•triggered when archiving completed
•offset by the time distnlctadata takes
•archived data removed from staging
Automatic: trscrontab.nlcta
•daily, extended trscrontab token
Reliability:
•all involved systems validated
•any failure:
– tracked and reported via e-mail
– cleanup disabled, local backup, data protected
(Thanks Krisiti.)
[Diagram labels: distnlctadata, trscrontab, staging, cleanup]
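The reliability rules on this slide (validate first, report any failure, never clean up after a failed run) might be wired together along these lines. This is a minimal sketch under stated assumptions: `run_daily`, the `checks` mapping and the `notify` callable are hypothetical names, e-mail delivery is abstracted behind `notify`, and the local backup step is not modelled.

```python
def run_daily(checks, archive, cleanup, notify):
    """Run archive then cleanup; on any failure skip cleanup and notify.

    checks:  dict of name -> zero-argument callable, True when the system is usable
    archive: zero-argument callable returning the elapsed time in seconds
    cleanup: callable taking that elapsed time
    notify:  callable taking a message string (e.g. an e-mail sender)
    """
    failed = [name for name, ok in checks.items() if not ok()]
    if failed:
        notify("validation failed: " + ", ".join(failed))
        return False
    try:
        elapsed = archive()
    except Exception as exc:
        notify(f"archiving failed: {exc}")
        return False  # cleanup never runs, so staged data stays in place
    cleanup(elapsed)
    return True
```

The point of the ordering is that cleanup is only reachable after a successful archive, so a failure anywhere earlier leaves the staging area untouched.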
Protection against data loss
•Data loss (incident I)
•Incremental
•Fault trips occurred after archiving was done and cleanup was triggered
•Solution
•Archive the data files (> 1 day old)
•On a daily basis
•Name: fault_mm-dd-yy.tar
[Timeline diagram: data > 1 day archived; data > 1 day removed (cleanup), with offset]
•Data loss (incident II)
•The elapsed time of distnlctadata
•Solution
•Time distnlctadata
•Offset cleanup by that time
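One way to read the incident II fix: cleanup runs later than archiving, so a file that was younger than one day when archiving started can look older than one day by the time cleanup runs, and would be deleted without having been archived. Subtracting the measured run time moves the cleanup cutoff back to the moment archiving began. The sketch below assumes age is judged by modification time; `cleanup_candidates` and its arguments are hypothetical names.

```python
ONE_DAY = 24 * 60 * 60  # seconds

def cleanup_candidates(files, cleanup_now, archive_elapsed, min_age=ONE_DAY):
    """Return names that are safe to remove from staging.

    files:           dict of name -> modification time (epoch seconds)
    cleanup_now:     time at which cleanup runs
    archive_elapsed: measured run time of the archiving step, in seconds
    """
    # Judge age as of when archiving started, not as of when cleanup runs.
    cutoff = cleanup_now - archive_elapsed - min_age
    return sorted(name for name, mtime in files.items() if mtime < cutoff)

# Usage: archiving started at t0 and took 600 s; "b" crossed the one-day
# mark during the run, so it was not archived and must not be removed.
t0 = 1_000_000
files = {"a": t0 - ONE_DAY - 10, "b": t0 - ONE_DAY + 300}
safe = cleanup_candidates(files, cleanup_now=t0 + 600, archive_elapsed=600)
```

With the offset only "a" is selected; with `archive_elapsed=0` both files would be, which is the failure the slide describes.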
Backup Facilities
• In different forms:
– PEPII systems, applications and data -> VMS buffer -> tape
– CMLOG data -> NAS disk (mirrored)
– NLCTA data -> SCS tape staging facility
– Channel Archiver data -> SCS AFS & CD
– GPIB server system and application -> local tape
– Network monitor utilities -> none
• Restoring by different methods
• A dilemma
– VMS buffer limited (image and incremental)
– problem writing files > 2GB to the VMS buffer from Solaris
– Multinet NFS protocol file size limitation ~2GB
A Long Term Solution: Local NFS
Local NFS file systems:
Two SUN Netra 20
Redundant (master & failover)
Expandable
Optimized for serving files on network
Sun StorEdge Disk Array
RAID controller: disk mirrored
Dynamic reconfig. of disk storage
Hot-swap redundant components
Optical fiber cable
Backup facility:
Magneto-optical storage media technology
HP Optical Jukeboxes
High-volume reference data
Write once, read forever
Scalable (up to 2TB)
Transfer rate up to 10MB/s
More in research
(Thanks KenB)
[Diagram: Sun StorEdge Disk Array; SUN Netra 20 Master Server; SUN Netra 20 Failover Server; HP Optical Jukeboxes; link labelled "optical".]
New Storage System
[Diagram: MCC VMS (backup); NAS 0, NAS 1 (Cmlog data); LAVC, PEPII and LEB networks with iocs; PEPII gateway servers; /u1 (PEPII & NLCTA data); NLCTA gateway server; BABAR gateway server; SCS: pubs, backup, NFS, Oracle server; labels "(NLCTA Channel Archiver data)" and "All production data".]