• No results found

Bulk Synchronous Programmers and Design

N/A
N/A
Protected

Academic year: 2021

Share "Bulk Synchronous Programmers and Design"

Copied!
19
0
0

Loading.... (view fulltext now)

Full text

(1)

Linux Kernel Co-Scheduling For

Bulk Synchronous Parallel Applications

ROSS 2011

Tucson, AZ

Terry Jones

(2)

Outline

Motivation

Approach & Research

Design Attributes

Achieving Portable Performance

Measurements

(3)

We’re Experiencing an Architectural Renaissance

Increased
Core
Counts


Disrup1ve
Technologies


Factors To Change

Moore’s Law -- Number of transistors per IC

double every 24 months

No Power Headroom -- Clock speed will not

increase (and may decrease) because of

Power

Power


α

Voltage

2


*
Frequency


Power


α

Frequency


Power

α

Voltage

3

Increased Transistor

Density

(4)

A Key Component of the Colony Project

Adaptive System Software For Improved Resiliency and Performance

Approach


Objec;ves



Impact


Provide technology to make portable scalability a reality.

Remove the prohibitive cost of full POSIX APIs and

full-featured operating systems.

Enable easier leadership-class level scaling for domain

scientists through removing key system software

barriers.

Automatic and adaptive load-balancing plus fault

tolerance.

High performance peer-to-peer and overlay

infrastructure.

Address issues with Linux to provide the familiarity

and performance needed by domain scientists.

Full-featured environments allow for a full range of

programming development tools including debuggers,

memory tools, and system monitoring tools that depend

on separate threads or other POSIX API.

Automatic load balancing helps correct problems

associated with long running dynamic simulations.

Coordinated scheduling removes the negative impact of

OS jitter from full-featured system software.

Collaborators


Terry Jones, Project PI Laxmikant Kalé, UIUC PI José Moreira, IBM PI

Challenges


Computational work often includes large amounts of

state which places additional demands on successful

work migration schemes.

For widespread acceptance from the Linux community,

the effort to validate and incorporate HPC originated

advancements into the Linux kernel must be minimized.

(5)

Motivation – App Complexity

 

Linux

-> Familiar

-> Open Source

-> Support for common system calls

 

Support for daemons & threading packages

-> Debugging strategies

-> Asynchronous strategies

 

Support for administrative monitoring

 

OS Scalability

-> Eliminate OS Scalability Issues Through Parallel Aware Scheduling

(6)

The Need For Coordinated Scheduling

Bulk


Synchronous


Programs


(7)

The Need For Coordinated Scheduling

• 

Permit
Full
Linux
Func1onality


• 

Eliminate
Problema1c
OS
Noise


(8)

What About …

Core Specialization

Minimalist OS

Will Apps Always Be Bulk Synchronous?

(9)

The Tau team at the University of

Oregon has reported 23% to 32%

increase in runtime for parallel

applications running at 1024 nodes

and 1.6% operating system noise

Ferreira, Bridges and Brightwell

confirmed that a 1000 Hz 25

µ

s

noise interference (an amount

measured on a large-scale

commodity Linux cluster) can cause

a 30% slowdown in application

performance on ten thousand nodes

Time

Node1a
 Node1b Node1c Node1d
 Node2a Node2b Node2c Node2d
 Node1a
 Node1b Node1c
 Node1d Node2a Node2b Node2c Node2d

Time

HPC
Colony
Technology
–



Coordinated
Scheduling


(10)

Goals

Portable Performance

• Make OS Noise a non-issue for bulk-synchronous codes

• Permit sysadmin best practices

(11)

Proof of Concept – Blue Gene / L

Allreduce 0.1 1 10 100 1000 10000 1024 2048 4096 8192 CNK

Colony with SchedMods (quiet) Colony with SchedMods (30% noise) Colony (quiet) Colony (30% noise) GLOB 1 10 100 1000 10000 CNK

Colony with SchedMods (quiet) Colony with SchedMods (30% noise) Colony (quiet)

Colony (30% noise)

(12)

Approach

Introduces two new process flags & two new tunables

total time of epoch

percentage to parallel app (percentage of blue from co-schedule figure)

• Dynamically turned on or off with new system call

• Tunables are adjusted through use of a second new system call

Salient Features

Utilizes a new clock synchronization scheme

• Uses existing fair round-robin scheduler for both epochs

(13)
(14)
(15)

…and in conclusion…

 

For Further Info

contact:
Terry
Jones


[email protected]

hOp://www.hpc‐colony.org


hOp://charm.cs.uiuc.edu


 

Partnerships and Acknowledgements

 

Synchronized Clock work done by Terry Jones and Gregory Koenig

 

DOE Office of Science – major funding provided by FastOS 2

(16)
(17)

Improved
Clock
Synchroniza1on
Algorithms

 

Achievement


Developed
a
new
clock
synchroniza1on
algorithm.
The
new
algorithm
is
a
high
precision


design
suitable
for
large
leadership‐class
machines
like
Jaguar.
Unlike
most
high‐precision


algorithms
which
reach
their
precision
in
a
post‐mortem


analysis
aWer
the
applica1on
has
completed,
the
new
ORNL
developed
algorithm
rapidly


provides
precise
results
during
run1me.



Relevance


To
the
Sponsor;


Makes
more
effec1ve
use
of
OLCF
and
ALCF
systems
possible.


To
the
Laboratory,
Directorate,
and
Division
Missions;
and


Demonstrates
capabili1es
in
cri1cal
system
soWware
for
leadership‐class
machines.


To
the
Computer
Science
Research
Community.


High
precision
global
synchronized
clock
of
growing
interest
to
system
soWware


needs
including
parallel
analysis
tools,
file
systems,
and
coordina1on
strategies.



Demonstrates
techniques
for
high‐precision
coupled
with
guaranteed
answer
at


run1me.


Sponsor:
DOE
ASCR


FWP
ERKJT17


(18)

Test Setup

Compute Nodes

18,688 nodes

(

12 Opteron cores per node

)

Commodity Network

InfiniBand Switches

(3000+ ports)

Gateway Nodes

192 nodes

(

2 Opteron cores per node

)

Storage Nodes

192 nodes

(

8 Xeon cores per node

)

Enterprise Storage

48 Controllers

(DataDirect S2A9900)

Jaguar XT5 SeaStar2+ 3D Torus

9.6 Gbit/sec

SION InfiniBand

16 Gbit/sec

InfiniBand

16 Gbit/sec

Serial ATA

3.0 Gbit/sec

(19)

References

Related documents

Our numerical experiments indicate that the second-order exponential Rosenbrock-Euler method outperforms the second- order backward differentiation formula, but the performance of

At the end of the intervention an interview was conducted with the students. It was made with two selected students’ one boy and a girl. The result of the

Creates targeted delivery for safer gene therapy for anticancer treatment by redirecting binding target of the adeno-associated virus capsid (viii) single-chain antibody

This study has some practical applications such as looking at using aphids and ants as a source of biological control on plants that aphids colonize, and the long-term effects on

Consequently, the constraining effect on mobile termination rates resulting from the ability of mobile-to-mobile calls to substitute for fixed- to-mobile calls depends

Abstract: In Human Rights Watch v Secretary of State for the Foreign and Commonwealth Office the UK Investigatory Powers Tribunal found that the relevant standard of

* Note: A boot loader is the first software program that runs when a computer starts.. It is responsible for

Examples of funds that do not invest in fossil fuels are IA Clarington Inhance Global Equity SRI Class and AGF Global Sustainable Growth Equity Fund (here’s a link to AGF’s