Applications and Runtime for
multicore/manycore
March 21 2007
Geoffrey Fox
Community Grids Laboratory Indiana University
505 N Morton Suite 224 Bloomington IN
Pradeep K. Dubey
RMS: Recognition Mining Synthesis
• Recognition (What is …?): Model
– Today: Model-less
– Tomorrow: Model-based multimodal recognition
• Mining (Is it …?): Find a model instance
– Today: Real-time streaming and transactions on static, structured datasets
– Tomorrow: Real-time analytics on dynamic, unstructured, multimodal datasets
• Synthesis (What if …?): Create a model instance
– Tomorrow: Photo-realism and physics-based animation
Discussed in Seminars
http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/
Rest mainly classic parallel computing
Some Bioinformatics Datamining
• 1. Multiple Sequence Alignment (MSA)
– Kernel Algorithms
• HMM (Hidden Markov Model)
• pairwise alignments (dynamic programming) with heuristics (e.g. progressive, iterative method)
• 2. Motif Discovery
– Kernel Algorithms
• MEME (Multiple Expectation Maximization for Motif Elicitation)
• Gibbs sampler
• 3. Gene Finding (Prediction)
– Hidden Markov Methods
• 4. Sequence Database Search
– Kernel Algorithms
• BLAST (Basic Local Alignment Search Tool)
• PatternHunter
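As an illustration of the dynamic programming kernel behind pairwise alignment, here is a minimal sketch of a global alignment score (Needleman-Wunsch style). The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example; they are not taken from the slides or from any of the tools listed above.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of strings a and b.

    Scoring parameters are illustrative assumptions, not values from the slides.
    Only two rows of the dynamic programming table are kept in memory.
    """
    # Row 0: aligning the empty prefix of a with prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # column 0: prefix of a aligned against nothing
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                cur[j - 1] + gap,    # gap in a
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
```

Each table cell depends only on its upper, left and upper-left neighbours, so cells on the same anti-diagonal are independent of one another, which is one common way such kernels are parallelized.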
Berkeley Dwarfs
• Dense Linear Algebra
• Sparse Linear Algebra
• Spectral Methods
• N-Body Methods
• Structured Grids
• Unstructured Grids
• Pleasingly Parallel
• Combinatorial Logic
• Graph Traversal
• Dynamic Programming
• Branch & Bound
• Graphical Models (HMM)
• Finite State Machine
Consistent in Spirit with Intel Analysis
Client side Multicore applications
• “Lots of not very parallel applications”
• Gaming; Graphics; Codec conversion for multiple user conferencing ……
• Complex Data querying and data manipulation/optimization/regression; database and datamining (including computer vision) (Recognition and Mining for Intel Analysis)
– Statistical packages as in Excel and R
• Scenario and Model simulations (Synthesis for Intel)
• Multiple users give rise to several server-side multicore applications
Approach I
• Integrate Intel, Berkeley and other sources, including database (successful on current parallel machines like scientific applications), and define parallel approaches in a “white paper”
• Develop some key examples testing 3 parallel programming paradigms
– Coarse Grain functional Parallelism (as in workflow) including pleasingly parallel instances with different data
– Fine Grain functional Parallelism (as in Integer Programming)
– Data parallel (Loosely Synchronous as in Science)
• Construct these so they can use different runtimes, including perhaps CCR/DSS, MPI, Data Parallel .NET
• Maybe these will become libraries used as in
Approach II
• Have looked at CCR in MPI style applications
– Seems to work quite well and supports more general messaging models
• NAS Benchmark using CCR to confirm its utility
• Developing 4 exemplar multi-core parallel applications
– Support Vector Machines (linear algebra): Data Parallel
– Deterministic Annealing (statistical physics): Data Parallel
– Computer Chess or Mixed Integer Programming: Fine Grain Parallelism
– Hidden Markov Method (Genetic Algorithms): Loosely Coupled functional Parallelism
• Test high level coordination of such parallel applications
CCR for Data Parallel (Loosely Synchronous) Applications
• CCR supports general coordination of messages queued in ports in Handler or Rendezvous mode
• DSS builds a service model on CCR and supports coarse grain functional parallelism
• Basic CCR supports fine grain parallelism as in computer chess (and use STM enabled primitives?)
• MPI has well-known collective communication operations that supply scalable global synchronization etc.
• Look at performance of MPI_Sendrecv
• What is the model that best encompasses shared and distributed memory approaches for “data parallel” problems?
– This could be put on top of CCR?
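To make the idea of messages queued in ports more concrete, here is a minimal sketch using Python threads and queues. It is not CCR's API: the `Port` class, `run_handler` and `demo` are hypothetical names, and the mapping of "Handler" to a worker that invokes a callback and "Rendezvous" to a blocking receive is one loose reading of the two modes named above.

```python
import queue
import threading


class Port:
    """Hypothetical stand-in for a port: a thread-safe FIFO of messages."""

    def __init__(self):
        self._q = queue.Queue()

    def post(self, msg):
        self._q.put(msg)

    def receive(self):
        # Rendezvous-style (assumed meaning): block until a message arrives.
        return self._q.get()


def run_handler(port, handler, count):
    """Handler-style (assumed meaning): a worker thread calls `handler`
    on each of the next `count` messages posted to `port`."""
    def loop():
        for _ in range(count):
            handler(port.receive())

    worker = threading.Thread(target=loop)
    worker.start()
    return worker


def demo():
    port = Port()
    seen = []
    worker = run_handler(port, seen.append, 3)
    for i in range(3):
        port.post(i)
    worker.join()            # all three messages have been handled
    port.post("done")
    return seen, port.receive()


if __name__ == "__main__":
    print(demo())
```

With a single consumer per port, messages are handled in the order they were posted, which is the property the loosely synchronous tests rely on between stages.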
[Figure: four panels, each connecting Thread0 to Thread3 with Port 0 to Port 3: (a) Pipeline, (b) Shift, (c) Two Shifts, (d) Exchange]
Four Communication Patterns used in CCR Tests. (a) and (b) use CCR Receive while (c) and (d) use CCR Multiple Item Receive
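The two single-receive patterns can be sketched as a mapping from a writing thread to a destination port. This is a hedged illustration only: the slides do not spell out the wiring, so the sketch assumes that Pipeline sends from thread i to port i+1 with no wrap-around and that Shift is the same on a ring. The function names are hypothetical, and the Two Shifts and Exchange patterns are deliberately left out because their exact wiring is not stated here.

```python
def destination_port(pattern, thread, n_threads):
    """Port index that `thread` writes to, or None if it writes nowhere.

    Assumed wiring (not confirmed by the slides):
      pipeline: thread i -> port i+1, last thread sends nothing
      shift:    thread i -> port (i+1) mod n_threads (ring)
    """
    if pattern == "pipeline":
        nxt = thread + 1
        return nxt if nxt < n_threads else None
    if pattern == "shift":
        return (thread + 1) % n_threads
    raise ValueError(f"pattern not covered by this sketch: {pattern}")


def run_stage(pattern, n_threads):
    """Deliver one message per thread; return the senders queued at each port."""
    ports = [[] for _ in range(n_threads)]
    for thread in range(n_threads):
        dest = destination_port(pattern, thread, n_threads)
        if dest is not None:
            ports[dest].append(thread)
    return ports


if __name__ == "__main__":
    print(run_stage("pipeline", 4))
    print(run_stage("shift", 4))
```

Under these assumptions every port receives at most one message per stage, which is consistent with patterns (a) and (b) using a single-item receive.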
Used on AMD 4-core, Xeon 4-core and Xeon 8-core machines; the latter do up to 8-way parallelism
Exchanging Messages with 1D Torus Exchange topology for loosely synchronous execution in CCR
[Figure: in each stage, Thread0 to Thread3 write exchange messages to Port 0 to Port 3 and then read messages; the sequence repeats stage after stage]
Stage Overhead versus Thread Computation time
• Break a single computation into a varying number of stages, with computation per stage varying from 1.4 microseconds to 14 seconds for AMD
• Overhead per stage is constant up to about a million stages and then increases
[Figure: Time (seconds) versus Stages (millions). 4-way Pipeline Pattern, 4 Dispatcher Threads, HP Opteron. Annotations: 8.04 microseconds overhead per stage averaged from 1 to 10 million stages; computation component if no overhead; point where overhead = computation; stage computation marked at 14 microseconds and 1.4 microseconds]
Fixed amount of computation (4 × 10^7 units) divided into 4 cores and from 1 to 10^7 stages on HP Opteron Multicore. Each stage is separated by reading and writing CCR ports in Pipeline mode
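The HP Opteron numbers can be tied together with a small back-of-envelope model. This is a sketch under stated assumptions, not the measurement procedure: it takes the 14 seconds of computation per core and the 8.04 microsecond average overhead from the slide, and assumes the overhead is the same for every stage (the slide notes it actually grows beyond about a million stages). All names are hypothetical.

```python
TOTAL_COMPUTE_S = 14.0           # per-core computation, from the slide
OVERHEAD_PER_STAGE_S = 8.04e-6   # average overhead per stage, from the slide


def compute_per_stage(stages):
    """Computation time per stage when the fixed work is split into `stages`."""
    return TOTAL_COMPUTE_S / stages


def total_time(stages, overhead=OVERHEAD_PER_STAGE_S):
    """Modelled run time: fixed computation plus a constant cost per stage
    (the constant-overhead assumption is a simplification)."""
    return TOTAL_COMPUTE_S + stages * overhead


def break_even_stages(overhead=OVERHEAD_PER_STAGE_S):
    """Stage count at which total overhead equals total computation."""
    return TOTAL_COMPUTE_S / overhead


if __name__ == "__main__":
    for stages in (1, 1_000_000, 10_000_000):
        print(stages, compute_per_stage(stages), total_time(stages))
    print(break_even_stages())
```

In this model, 10^7 stages give 1.4 microseconds of computation per stage, and overhead matches computation at roughly 1.7 million stages, where each stage computes for about as long as it spends switching.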
[Figure: Time (seconds) versus Stages (millions). 4-way Pipeline Pattern, 4 Dispatcher Threads, Dell Xeon. Annotations: 12.40 microseconds per stage averaged from 1 to 10 million stages; point where overhead = computation; stage computation marked at 14 seconds]
Fixed amount of computation (4 × 10^7 units) divided into 4 cores and from 1 to 10^7 stages on Dell 2-processor, 2-core-each Xeon Multicore. Each stage is separated by reading and writing CCR ports in Pipeline mode
Summary of Stage Overheads for AMD 2-core 2-processor Machine
Summary of Stage Overheads for Intel 2-core 2-processor Machine
These are stage switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 30 microseconds. AMD overheads in parentheses
Summary of Stage Overheads for Intel 4-core 2-processor Machine
These are stage switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 30 microseconds. 2-core 2-processor Xeon overheads in parentheses
8-way Parallel Pipeline on two 4-core Xeon
• Histogram of 100 runs; each run has 500,000 synchronizations following a thread execution that takes 33.92 microseconds
– So overhead of 6.1 microseconds is modest
• Message size is just one integer
• Choose a computation unit that is appropriate for a few-microsecond stage overhead
[Figure: histogram labeled XP-Pro; inset: AMD 4-way, 27.94 microsecond Computation Unit]
8-way Parallel Shift on two 4-core Xeon
• Histogram of 100 runs; each run has 500,000 synchronizations following a thread execution that takes 33.92 microseconds
– So overhead of 8.2 microseconds is modest
• Shift versus pipeline adds a microsecond to cost
• Unclear what causes second peak
[Figure: histograms labeled XP-Pro and VISTA; inset: AMD 4-way, 27.94 microsecond Computation Unit]
8-way Parallel Double Shift on two 4-core Xeon
• Histogram of 100 runs; each run has 500,000 synchronizations following a thread execution that takes 33.92 microseconds
– So overhead of 22.3 microseconds is significant
– Unclear why double shift is slow compared to shift
• Exchange performance partly reflects number of messages
• Opteron overheads significantly lower than Intel
[Figure: histogram labeled XP-Pro; inset: AMD 4-way, 27.94 microsecond Computation Unit]
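To put "modest" and "significant" on a common scale, the three quoted 8-way Xeon overheads can be compared with the 33.92 microsecond computation unit. The efficiency definition used here, computation divided by computation plus overhead, is an assumption made for illustration, and the names are hypothetical; only the four input numbers come from the slides.

```python
COMPUTE_US = 33.92  # thread execution time per stage, from the slides

# Stage overheads in microseconds quoted on the three 8-way Xeon slides.
OVERHEAD_US = {
    "pipeline": 6.1,
    "shift": 8.2,
    "double shift": 22.3,
}


def efficiency(overhead_us, compute_us=COMPUTE_US):
    """Fraction of each stage spent computing (assumed definition)."""
    return compute_us / (compute_us + overhead_us)


if __name__ == "__main__":
    for pattern, overhead in OVERHEAD_US.items():
        print(f"{pattern}: {efficiency(overhead):.3f}")
```

Under this definition the pipeline and shift patterns keep roughly 80 to 85 percent of each stage for computation, while the double shift drops to about 60 percent, so a larger computation unit would be needed to hide its overhead.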
AMD 2-core 2-processor Bandwidth Measurements
• The previous measurements used small messages and so measured latency. We did a further set of bandwidth measurements by exchanging larger messages of different sizes between threads
• We used three types of data structures for receiving data
– Array in thread equal to message size
– Array outside thread equal to message size
– Data stored sequentially in a large array (“stepped” array)
Intel 2-core 2-processor Bandwidth Measurements
• For bandwidth, the Intel did better than AMD especially when one exploited cache on chip with small transfers
• For both AMD and Intel, each stage executed a computational task after copying data arrays of size 10^5 (labeled small), 10^6 (labeled large) or 10^7 double words. The last column is an approximate value in microseconds of the compute time for each stage. Note that copying 100,000 double precision words per core at a
Typical Bandwidth measurements showing effect of cache with slope change
5,000 stages with run time plotted against size of double array copied in each stage from thread to stepped locations in a large array on Dell Xeon Multicore
[Figure: Time (seconds) versus Array Size (millions of double words). 4-way Pipeline Pattern, 4 Dispatcher Threads, Dell Xeon. Annotations: total bandwidth 1.0 Gigabytes/Sec up to one million double words and 1.75 Gigabytes/Sec up to 100,000 double words; slope change marked as a cache effect]
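The quoted bandwidths can be turned into rough copy-time estimates. This is a sketch under several assumptions that the slide does not confirm: a double word is 8 bytes, a gigabyte is 10^9 bytes, each of the 4 threads copies one array per stage, and "total bandwidth" is summed over all threads. The function name is hypothetical and the result is a model estimate, not a measured value.

```python
BYTES_PER_DOUBLE = 8  # assumed size of one double word


def copy_time_seconds(stages, threads, words_per_copy, total_gbytes_per_s):
    """Estimated time to perform all copies at the given aggregate bandwidth.

    Assumes every thread copies `words_per_copy` doubles in every stage and
    that 1 gigabyte = 1e9 bytes.
    """
    total_bytes = stages * threads * words_per_copy * BYTES_PER_DOUBLE
    return total_bytes / (total_gbytes_per_s * 1e9)


if __name__ == "__main__":
    # 5,000 stages, 4 threads, 100,000 doubles per copy at 1.75 GB/s
    print(copy_time_seconds(5000, 4, 100_000, 1.75))
    # Same run with one million doubles per copy at 1.0 GB/s
    print(copy_time_seconds(5000, 4, 1_000_000, 1.0))
```

Because the larger arrays no longer fit in cache, the time per word rises, which is what shows up in the plot as a change of slope.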
Timing of HP Opteron Multicore as a function of number of simultaneous two-way service messages processed (November 2006 DSS Release)
• CGL measurements of Axis 2 show about 500 microseconds; DSS is 10 times better