
Rosen Center for Advanced Computation

Automatic Software Updates on Heterogeneous Clusters with STACI

Michael Shuey

Linux Developer and Administrator

LCI: The HPC Revolution

May 19, 2004

Outline

• Introduction
• Common maintenance problems
• STACI
• Usage experiences


Our World

• We operate a variety of HPC resources
  • All resources are 24x7, used by local researchers nearly at capacity
  • Clusters becoming pervasive
  • New nodes, clusters appear often
• We're a PBS shop
  • Jobs last up to 30 days
  • Several servers, about 1 per resource
• We support a LOT of applications
  • Frequently adding more on request

Problem Child 1

• “Recycled” cluster
  • Older machines, usually desktops
  • Cheap to build, easy to assemble
  • Random nodes die!
  • VERY large, heavily used – scheduling maintenance times is problematic


Problem Child 2

• 100-node dual-Athlon cluster
  • Maintained for a prominent researcher
  • Nodes die frequently due to hardware flaws
  • Keeping the software stack consistent across nodes is challenging...

Large Cluster Headaches

• Frequent software changes require frequent maintenance windows, hampering long jobs
• Frequent hardware updates cause administrator oversights
• Dead nodes make software updates tedious
  • Difficult to ensure consistent software stack
  • Automated tools often choke on dying nodes
• Multiple clusters, multiple architectures
  • Multiple software update mechanisms


Possible solutions

• Reload nodes on job completion
  • Easy to automate
  • Hard to do with multiple jobs on an SMP node
  • May be possible to reload image during image update...
• Schedule maintenance time
  • Nearly impossible to do all at once
  • Too much bookkeeping to do piecemeal
  • Causes large, user-visible disruption
• Write something new...

Goals

• Must be fully automated
  • Manually running rsync is a waste of time
• Must provide minimal user disruption
  • Shouldn't change software during active jobs!
  • Shouldn't reserve large chunks of the cluster(s)
• Must handle multiple clusters
• Must handle multiple architectures, operating systems
• Should keep administrators from stepping on each other's toes
• Needs to get list of nodes from the job scheduler
  • Only the scheduler will know current node (and job) status
• Should allow different nodes to have different software


STACI Design

• Node software lives in a chroot-able image
  • Images have version numbers
  • Must be locked (and mounted rw) for edits
  • Software is delivered via rsync over ssh
• Two processes continually update nodes
  • nodescanner – marks nodes to update
  • rolling_upgrade – does the heavy lifting
• Rely on the PBS server(s) for info
  • PBS always knows node availability, state, etc.
  • Job scheduler can be told to keep work off a node until it can be upgraded
• All other info (per-node config data, image applied on node, etc.) is stored in a ;-delimited “nodes” file (see the example below)
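As an illustration only, a couple of entries in the “nodes” file might look like the following. The field layout (hostname;cluster;image;arch) and all names are hypothetical; the slides only specify that the file is ;-delimited and carries per-node details:

    node042;recycled;rhel3-athlon;i686
    node107;athlon100;rhel3-athlon;i686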

Administrator commands

• lock_img
  • prepares image for edits, creates lockfile
• unlock_img
  • removes lockfile, increments version number
• cimage
  • sends out image irrespective of lock, compatible with c3 tools
• cmark


Config files

• /etc/staci/clusters.conf
  • Lists available clusters, scheduler types
• /etc/staci/staci.conf
  • Various configuration parameters
  • PBS install root, delay between scans, etc.
• /var/lib/mgmt/nodes
  • Lists all nodes, image names, other details
  • ;-delimited
• /var/lib/mgmt/postinstall
  • Postinstall script (invoked after imaging), lives on nodes
  • Can parse nodes file for node details (see the sketch after this list)
• Image directories (and lockfiles) live in /var/lib/mgmt/images
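Since the postinstall script can parse the nodes file, a minimal Python sketch of that idea follows. It assumes the same hypothetical field layout as the earlier example (hostname;cluster;image;arch); the real STACI postinstall script may be written differently and do considerably more.

    #!/usr/bin/env python
    # Sketch of a postinstall-style script: look up this node's entry in the
    # ;-delimited nodes file and act on its per-node details.  The field
    # layout (hostname;cluster;image;arch) is a hypothetical example.
    import socket

    NODES_FILE = "/var/lib/mgmt/nodes"

    def lookup_node(hostname, path=NODES_FILE):
        """Return the fields for this host, or None if it isn't listed."""
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                fields = line.split(";")
                if fields[0] == hostname:
                    return fields
        return None

    if __name__ == "__main__":
        host = socket.gethostname().split(".")[0]
        entry = lookup_node(host)
        if entry is None:
            raise SystemExit("%s not found in %s" % (host, NODES_FILE))
        # e.g. apply cluster-specific settings, fix up local config, etc.
        print("node %s uses image %s (cluster %s)" % (entry[0], entry[2], entry[1]))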

Software update cycle

• lock_img <directory>
• chroot <directory>
• <fiddle with things>
• optional – use cimage for test deploys
• unlock_img <directory>
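As a concrete, purely illustrative walk-through (the image directory name and the work done inside the chroot are hypothetical; only the commands themselves come from the slides):

    lock_img /var/lib/mgmt/images/rhel3-athlon
    chroot /var/lib/mgmt/images/rhel3-athlon
      (install or reconfigure software inside the chroot, then exit)
      (optional: push the image to a spare node with cimage as a test deploy)
    unlock_img /var/lib/mgmt/images/rhel3-athlon

unlock_img removes the lockfile and bumps the image's version number, after which nodescanner notices the new version and the rolling upgrade proceeds on its own.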


STACI's behavior (1)

• nodescanner:
  • contacts each node from each server
  • compares /etc/image_version on node with version in image
  • marks node for upgrade
    » usually involves PBS comments, adding reservation to scheduler
  • after unlock_img, nodescanner will immediately notice new version numbers and act accordingly
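The comparison step can be pictured with a minimal Python sketch like the one below. It is not STACI's implementation: it assumes the image's own copy of /etc/image_version lives inside the image directory, assumes passwordless ssh to nodes, and reduces "marking" a node to a print statement instead of PBS comments and scheduler reservations.

    #!/usr/bin/env python
    # Sketch of the nodescanner idea: compare /etc/image_version on each node
    # with the version recorded in its image, and flag stale or dead nodes.
    import subprocess

    IMAGE_ROOT = "/var/lib/mgmt/images"

    def image_version(image):
        # Assumption: the image's version file sits inside the chroot-able image
        with open("%s/%s/etc/image_version" % (IMAGE_ROOT, image)) as f:
            return f.read().strip()

    def node_version(node):
        try:
            out = subprocess.run(
                ["ssh", "-o", "ConnectTimeout=10", node, "cat", "/etc/image_version"],
                capture_output=True, text=True, timeout=30)
        except subprocess.TimeoutExpired:
            return None
        return out.stdout.strip() if out.returncode == 0 else None

    def scan(nodes):
        """nodes: dict of hostname -> image name, as read from the nodes file."""
        for node, image in nodes.items():
            current = node_version(node)
            wanted = image_version(image)
            if current is None:
                print("%s: unreachable - flag for fault handling" % node)
            elif current != wanted:
                print("%s: running %s, image is %s - mark for upgrade"
                      % (node, current, wanted))

    if __name__ == "__main__":
        scan({"node042": "rhel3-athlon", "node107": "rhel3-athlon"})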

STACI's behavior (2)

• rolling_upgrade:
  • activates periodically (default: 30 min), scans for nodes to upgrade
  • selects a group of like-minded nodes
    » same image, limited number
  • pushes out upgrade(s) via parallel ssh
    » rsyncs image
    » runs postinstall script on nodes
    » reboots nodes
  • removes comments, reservations – node can be used again
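A rough Python sketch of the push step follows. The rsync/ssh command lines and paths are illustrative assumptions, not STACI's actual commands, and PBS comment and reservation handling is omitted; the cap on simultaneous pushes anticipates the image-server load lesson noted under Usage experiences below.

    #!/usr/bin/env python
    # Sketch of the rolling_upgrade push: take a batch of nodes that share an
    # image and update them in parallel, bounding how many run at once.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    IMAGE_ROOT = "/var/lib/mgmt/images"
    MAX_SIMULTANEOUS = 8   # cap concurrent pushes to protect the image server

    def push(node, image):
        src = "%s/%s/" % (IMAGE_ROOT, image)
        # sync the image onto the node's root filesystem
        if subprocess.run(["rsync", "-a", src, "%s:/" % node]).returncode != 0:
            return (node, False)
        # run the postinstall script delivered with the image
        if subprocess.run(["ssh", node, "/var/lib/mgmt/postinstall"]).returncode != 0:
            return (node, False)
        # reboot; the ssh session may drop, so its exit status is not checked
        subprocess.run(["ssh", node, "reboot"])
        return (node, True)

    def rolling_upgrade(batch, image):
        with ThreadPoolExecutor(max_workers=MAX_SIMULTANEOUS) as pool:
            for node, ok in pool.map(lambda n: push(n, image), batch):
                print("%s: %s" % (node, "upgraded" if ok else "FAILED"))

    if __name__ == "__main__":
        rolling_upgrade(["node042", "node107"], "rhel3-athlon")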


Usage experiences (1)

• Reduces admin time spent pushing out changes
• Removes the need to schedule maintenance time for most changes
  • Security updates can still cause issues!
• Updates appear instantaneous (to users)
  • Deploy new software to head node, image
  • Mark appropriate nodes as needing upgrade
  • New jobs will only run on upgraded nodes

Usage experiences (2)

• Drives up load on the image server like crazy
  • Not bad for 10 or 20 machines, but for 100...
  • Lesson learned: use a big server, or set an upper bound on simultaneous updates
• ssh overhead is definitely not a factor...
  • Nodes are idle HPC resources, ssh is trivial
• ...network bandwidth can be an issue


Future directions (1)

• No sense of indirect clusters from c3
  • 3rd-party machine could act as image server
  • Reduces load on main server
  • Potentially less network congestion
  • Can be used to update firewalled clusters
• Deploying to more platforms
  • Several Linux distributions now
  • AIX, Solaris soon

Future directions (2)

• Integration with non-PBS schedulers
  • PBSPro (with Maui) supported now
  • Integration with SGE, LSF, etc. feasible
• Better fault detection and error reporting
  • The nodescanner visits all nodes regularly
  • Very easy to page the sysadmin on timeouts...
  • Would be better to invoke problem recovery scripts, avoid manual intervention if possible
