OVERVIEW OF
MULTICORE, PARALLEL COMPUTING,
AND DATA MINING
Indiana University
Computer Science Dept.
Seung-Hee Bae
OUTLINE
Multicore
Parallel Computing & MPI
Data Mining
Multicore
Toward Concurrency
What is Multicore?
Shared cache architecture
Recognition, Mining, and Synthesis (RMS)
Parallel Computing & MPI
Data Mining
TOWARD CONCURRENCY IN SOFTWARE
Exponential growth (Moore’s Law) can’t continue
Previous CPU performance gains
Clock speed: getting more cycles
It has become harder to exploit higher clock speeds because of physical issues such as heat, power consumption, and current leakage. (2GHz: 2001, 3.4GHz: 2004, now?)
Execution optimization: more work per cycle
Pipelining, branch prediction, executing multiple instructions in the same clock cycle
Reordering the instruction stream, at the risk of changing the meaning of programs.
Cache
Increasing the size of on-chip cache: main memory is much
TOWARD CONCURRENCY IN SOFTWARE 2
Current CPU performance gains
Moore’s law is over? Not yet (# of transistors ↑)
Hyperthreading
Running two or more threads in parallel inside a single CPU
Runs some instructions in parallel
Still one each of most basic CPU features (except for extra registers)
Performance gain: 5% ~ 15%, 40% under ideal conditions
It doesn’t help single-threaded applications
Multicore
Running two or more actual CPUs on one chip. Less than double the speed even in the ideal case.
It will boost reasonably well-written multi-thread applications, but not
single-threaded applications.
2 * 3GHz < 6 GHz
Coordination overhead between the cores to ensure cache coherency.
Cache
Only this will broadly benefit most existing applications.
WHAT IS MULTICORE?
Single Chip
Multiple distinct processing engines
E.g.) Shared-cache Dual Core Architecture
[Figure: shared-cache dual-core architecture. Core 0 and Core 1, each with its own CPU and L1 cache.]
SHARED-CACHE ARCHITECTURE
Options for the last-level cache
private to each core
sharing the last-level cache among diff. cores
Benefits of the Shared-Cache Architecture
Efficient use of the last-level cache: reduces resource underutilization.
Reduces cache-coherence complexity
Reduced false sharing because of the shared cache.
Reduces data-storage redundancy
Same data only needs to be stored once.
Reduces front-side bus traffic
Data requests can be resolved at the shared-cache level instead of system memory.
SOFTWARE TECHNIQUES FOR
SHARED-CACHE MULTICORE SYSTEMS
Cache blocking (Data Tiling)
Allow data to stay in the cache while being processed by data loops.
Reducing unnecessary cache traffic. (Better cache hit ratio.)
Hold approach (Late update)
Each thread maintains its own private copy of data.
Updating the shared copy only when it is necessary.
Reducing the frequency of access to the shared data.
Avoid false sharing
What is false sharing? (unnecessary cache line update.)
How to avoid false sharing?
To allocate non-shared data to different cache lines. (padding)
To copy the global variable to a local function variable, then copy the data
RECOGNITION, MINING, AND
SYNTHESIS (RMS)
Era of Tera is coming quickly
Teraflops (computing power), Terabits (comm.), Terabytes (storage)
World data is doubling every three years and is now measured in exabytes (a billion billion bytes)
Need a computing model to deal with this enormous sea of information
Working with Models
Recognition (What is ?)
Identifying that a set of data constitutes a model and then constructing that model.
Mining (Is it ?)
Search for instances of the model.
Synthesis (What if ?)
RMS 2
(from P.Dubey, “Recognition, Mining and Synthesis Moves Computers to the Era of Tera,” Technology@Intel Magazine, Feb. 2005.)
Examples
Medicine (a tumor)
Business (hiring)
Investment
Multicore
Parallel Computing & MPI
Parallel architectures (Shared-Memory vs. Distributed-Memory)
Decomposing Program (Data Parallelism vs. Task Parallelism)
MPI and OpenMP
Data Mining
PARALLEL COMPUTING: INTRODUCTION
Parallel computing
More than just a strategy for achieving good performance
Vision for how computation can seamlessly scale from a single processor to virtually limitless computing power
Parallel computing software systems
Goal: to make parallel programming easier and the resulting applications more portable and scalable while achieving good performance.
Difficulty
Explicit parallel programming is difficult
e.g.) computation, partitioning, synchronization, and data movement (correct answer & high performance)
Must be machine-independent (portability)
Complexity of the problems being attacked.
Parallel Computing Challenges
Concurrency & Communication
Need for high performance
PARALLEL ARCHITECTURE 1
Shared-memory machines
Have a single shared address space that can be accessed by any processor.
Examples
Multicore
Symmetric multiprocessor (SMP)
Uniform Memory Access (UMA)
Access time is independent of the location.
Use bus or completely connected network.
Not scalable
Shared-memory programming model
Need for synchronization to preserve the integrity of shared data
E.g.) Open Specifications for MultiProcessing (OpenMP)
Distributed-memory machines
The system memory is packaged with individual nodes of one or more processors (c.f. use separate computers connected by a network)
E.g.) Cluster
Communication is required to provide data from a processor to a different processor.
Support message-passing programming model
Send-receive communication steps.
E.g.) Message Passing Interface (MPI)
PARALLEL ARCHITECTURE 2
Shared-Memory
Pros
• Lower latency and higher BW
• Data are available to all of the CPUs through load and store instructions
• Single address space
Cons
• Cache coherency issue
• Synchronization is explicitly needed to access shared data.
• Scalability issue
Distributed-Memory
Pros
• Scalable, if a scalable interconnection network is used.
• Quite fast local data access.
Cons
• Communication required to access data in a diff. processor.
• Communication management problem
1. Long latency (consolidation of messages btwn the same pair of processors)
2. Long transmission time
PARALLEL ARCHITECTURE 3
Hybrid systems
Distributed shared-memory (DSM)
Distributed-memory machine which allows a processor to directly access a datum in a remote memory.
Latency varies with the distance to the remote memory.
Emphasize the Non-Uniform Memory Access (NUMA) characteristics.
SMP clusters
distributed-memory system with SMP as a unit.
PARALLEL PROGRAM: Decomposition 1
Decomposing Programs
Decomposition: identifying the portions of the program that can run in parallel.
Decomposition strategy
Task (Functional) parallelism
Different processors carry out different functions.
Data parallelism
Subdivides the data domain of a problem into multiple regions and assigns different processors to compute the results for each region.
More commonly used in scientific problems.
Natural form of scalability
Programming models
Shared-memory programming model
Need for synchronization to preserve the integrity of shared data
Message-passing model
Communication is required to access a remote data location.
PARALLEL PROGRAM:
DECOMPOSITION 2
Data Parallelism
Exploit the parallelism inherent in many large data structures.
Same task on diff. data (SPMD).
Can be expressed by ALL parallel programming models (i.e. MPI, HPF like, OpenMP like)
Features
Scalable
Hard to express when geometry irregular or dynamic
Functional Parallelism
Coarse grain parallelism
Parallelism btwn the parts of many systems.
Diff. task on the same or diff. data.
Features
Parallelism limited in size: tens, not millions
Synchronization probably good as parallelism
Decomposition natural
PARALLEL PROGRAM:
DECOMPOSITION 3
Load balance and scalability
Scalable: running time is inversely proportional to the number of
processors used.
Speedup(n) = T(1)/T(n)
Scalable if speedup(n) ≈ n
Second definition of scalability: scaled speedup
Scalable if the running time remains the same when the number of
processors and the problem size are increased by a factor of n.
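Why is scalability not achieved?
A region that must be run sequentially: total speedup ≤ T(1)/T_s, where T_s is the running time of the sequential region (Amdahl’s Law)
Requirement for a high degree of communication or coordination.
Poor load balance (good load balance is a major goal of parallel programming)
If one of the processors takes half of the parallel work, speedup will be limited to a factor of two, no matter how many processors are used.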
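As a rough illustration of the speedup definition and the Amdahl bound on this slide, the sketch below assumes an idealized program whose single-processor time splits into a sequential region and a perfectly parallelizable region. The function names and the 10 s / 90 s split are made up for the example; real programs also pay communication and load-balance costs that this model ignores.

```python
def speedup(t1, tn):
    """Speedup(n) = T(1) / T(n)."""
    return t1 / tn


def amdahl_time(t_seq, t_par, n):
    """Idealized running time on n processors: the sequential region
    is not sped up, the parallel region divides evenly."""
    return t_seq + t_par / n


# Hypothetical workload: 10 s sequential, 90 s parallelizable.
T_SEQ, T_PAR = 10.0, 90.0
T1 = T_SEQ + T_PAR

for n in (1, 2, 8, 64, 1024):
    s = speedup(T1, amdahl_time(T_SEQ, T_PAR, n))
    # The speedup approaches, but never exceeds, T(1) / T_s.
    print(f"n={n:5d}  speedup={s:6.2f}  bound={T1 / T_SEQ:.2f}")
```

Under these assumptions the speedup is far from n once n is large, which is why such a program is not scalable in the first sense defined above.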
PARALLEL PROGRAM
Memory-Hierarchy Management
Blocking
Ensuring that data remains in cache between subsequent accesses to the
same memory location.
Elimination of False Sharing
False sharing: When two diff. processors are accessing distinct data
items that reside on the same cache block.
Ensure that data used by diff. processors reside on diff. cache blocks.
(by padding: inserting empty bytes in a data structure.)
Communication Minimization and Placement
Move send and receive commands far enough apart so that time spent on communication can be overlapped with computation.
Stride-one access
Programs in which the loops access contiguous data items are much
more efficient than those that do not.
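The blocking and stride-one ideas on this slide can be sketched with a tiled matrix multiply. This is only an illustration of the loop structure: the function names and the tile size are assumptions, and in CPython a list of lists is not a contiguous array, so the cache benefit itself would show up only when the same loop nest is written in C or Fortran.

```python
def matmul_naive(a, b):
    """Plain triple loop; the inner loop walks down a column of b,
    which is not a stride-one access in row-major storage."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, tile=2):
    """Same product, computed tile by tile so that a small block of
    a, b and c is reused many times before moving on (blocking)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                for i in range(ii, min(ii + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + tile, m)):
                        aik, row_b = row_a[k], b[k]
                        # Innermost loop runs along rows of b and c:
                        # consecutive elements, i.e. stride-one access.
                        for j in range(jj, min(jj + tile, p)):
                            row_c[j] += aik * row_b[j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_blocked(a, b) == matmul_naive(a, b))
```

In a compiled language the tile size would be chosen so that the tiles in use fit in the cache level being targeted.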
MESSAGE PASSING INTERFACE
(MPI) 1
Message Passing Interface (MPI)
A specification for a set of functions for managing movement of
data among sets of communicating processes.
The dominant scalable parallel computing paradigm for scientific problems.
Explicit message send and receive using rendezvous model.
Point-to-point communication
Collective communication
Commonly implemented in terms of an SPMD model
All processes execute essentially the same logic.
Pros:
scalable and portable
Race condition avoided (implicit synch. w/ the copy)
Cons:
MPI
6 Key Functions
MPI_INIT
MPI_COMM_RANK
MPI_COMM_SIZE
MPI_SEND
MPI_RECV
MPI_FINALIZE
Collective Communications
Barrier, Broadcast, Gather, Scatter, All-to-all, Exchange
General reduction operation (sum, minimum, scan)
Blocking, nonblocking, buffered, synchronous messaging
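To make the roles of rank, size, send and receive concrete, here is a small emulation of the SPMD message-passing pattern using Python threads and queues. It is not MPI: `ToyComm`, `worker` and `run` are hypothetical names, threads stand in for processes, and `send` here is buffered rather than a rendezvous. The point is only that every rank runs the same function, learns its rank and the communicator size, and exchanges data solely through explicit messages.

```python
import queue
import threading


class ToyComm:
    """Hypothetical stand-in for a communicator: one mailbox per rank."""

    def __init__(self, size):
        self.size = size  # plays the role of MPI_COMM_SIZE
        self._mailboxes = [queue.Queue() for _ in range(size)]

    def send(self, value, dest):
        # Buffered send: returns as soon as the message is queued.
        self._mailboxes[dest].put(value)

    def recv(self, rank):
        # Blocking receive: waits until a message for this rank arrives.
        return self._mailboxes[rank].get()


def worker(comm, rank, data, results):
    # SPMD: every rank executes this same logic on its own slice.
    local = sum(data[rank::comm.size])
    if rank != 0:
        comm.send(local, dest=0)
    else:
        total = local
        for _ in range(comm.size - 1):
            total += comm.recv(rank=0)
        results["total"] = total


def run(size, data):
    comm = ToyComm(size)
    results = {}
    threads = [threading.Thread(target=worker, args=(comm, r, data, results))
               for r in range(size)]
    for t in threads:   # loosely analogous to starting the MPI job
        t.start()
    for t in threads:   # loosely analogous to waiting for all ranks to finish
        t.join()
    return results["total"]


print(run(4, list(range(100))))  # 4950
```

The gather-and-add loop on rank 0 is a hand-written version of what a collective reduction (sum) would do in one call.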
OPEN SPECIFICATIONS FOR MULTIPROCESSING (OPENMP) 1
Appropriate to Shared-Memory.
A sophisticated set of annotations (compiler directives) for traditional C, C++, or Fortran codes to aid compilers in producing parallel codes.
It provides parallel loops and collective operations such as summation over loop indices.
Provides lock variables to allow fine-grain synchronization btwn threads.
OPENMP 2
Directives: instruct the compiler to
Create threads
Perform synchronization operations.
Manage shared memory.
Examples
PARALLEL DO ~ END PARALLEL DO: explicit parallel loop.
SCHEDULE (STATIC): assign contiguous blocks at compile time.
SCHEDULE (DYNAMIC): assign contiguous blocks at run-time.
REDUCTION(+: x): the final value of variable x is determined by a global sum.
PARALLEL SECTIONS: task parallelism.
OpenMP synchronization primitives
Critical sections
Atomic updates
Barriers
Master selection
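What SCHEDULE (STATIC) and REDUCTION(+: x) ask the compiler to arrange can be imitated by hand. The sketch below is Python, not OpenMP: `static_blocks` and `parallel_sum` are hypothetical helpers, the exact block boundaries a real OpenMP implementation picks may differ, and CPython threads will not speed this loop up. It only shows the structure: contiguous blocks of iterations fixed in advance, one private partial sum per thread, and a final global sum.

```python
from concurrent.futures import ThreadPoolExecutor


def static_blocks(n_iters, n_threads):
    """Split range(n_iters) into contiguous blocks, one per thread,
    in the spirit of SCHEDULE (STATIC)."""
    base, extra = divmod(n_iters, n_threads)
    blocks, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def parallel_sum(values, n_threads=4):
    """Each thread sums its own block into a private partial result;
    the partial results are then combined, like REDUCTION(+: x)."""
    blocks = static_blocks(len(values), n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = pool.map(lambda blk: sum(values[i] for i in blk), blocks)
        return sum(partials)


print([list(b) for b in static_blocks(10, 3)])
print(parallel_sum(list(range(1, 101))))
```

A dynamic schedule would instead hand out blocks to whichever thread becomes free, which helps when iterations take unequal time.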
OPENMP 3
Summary
Work decomposition
Ideal target system: uniform-access, shared-memory.
Specify where multiple threads should be applied, and how
to assign work to those threads.
Pros:
Excellent programming interface for uniform-access,
shared-memory machines.
Cons:
No way to specify locality in machines w/ non-uniform
shared-memory or distributed shared-memory.
Multicore
Parallel Computing & MPI
Data Mining
Expectation Maximization (EM)
Deterministic Annealing (DA)
Hidden Markov Model (HMM)
Support Vector Machine (SVM)
EXPECTATION MAXIMIZATION
(EM)
Expectation Maximization (EM)
A general algorithm for maximum-likelihood (ML) estimation where the data are “incomplete” or the likelihood function involves latent variables.
An efficient iterative procedure
Goal: estimate unknown parameters, given measurement.
Hill-climbing approach, guaranteed to reach a local maximum (not necessarily the global one).
Two Steps
E-step (Expectation): the missing data are estimated given the
observed data and current estimate of the model parameters.
M-step (Maximization): the likelihood function is maximized
under the assumption that the missing data are known. (The
estimated missing data from the E-step are used in lieu of the actual missing data.)
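The two steps can be sketched on a toy problem. The example below assumes a mixture of two 1-D Gaussians with a known, shared standard deviation, where the "missing data" are the component labels of the points; the function name, the sample values and the starting means are all made up for illustration.

```python
import math


def em_two_gaussians(xs, mu, n_iter=50, sigma=1.0):
    """EM for a two-component 1-D Gaussian mixture with known sigma.
    Returns the estimated means and the weight of component 0."""
    mu0, mu1 = mu
    w0 = 0.5  # initial mixing weight of component 0
    for _ in range(n_iter):
        # E-step: estimate the missing labels as responsibilities,
        # given the data and the current parameter estimates.
        resp = []
        for x in xs:
            p0 = w0 * math.exp(-0.5 * ((x - mu0) / sigma) ** 2)
            p1 = (1.0 - w0) * math.exp(-0.5 * ((x - mu1) / sigma) ** 2)
            resp.append(p0 / (p0 + p1))
        # M-step: maximize the likelihood as if those responsibilities
        # were the true labels (weighted means and weight).
        n0 = sum(resp)
        n1 = len(xs) - n0
        mu0 = sum(r * x for r, x in zip(resp, xs)) / n0
        mu1 = sum((1.0 - r) * x for r, x in zip(resp, xs)) / n1
        w0 = n0 / len(xs)
    return mu0, mu1, w0


# Hypothetical measurements drawn around -2 and +3.
SAMPLE = [-2.1, -1.9, -2.0, -1.8, -2.2, 3.0, 3.2, 2.8, 3.1, 2.9]
print(em_two_gaussians(SAMPLE, mu=(-1.0, 1.0)))
```

With a poor starting point the same loop can settle on a different local maximum, which is the limitation that motivates annealing-based methods.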
DETERMINISTIC ANNEALING (DA)
Purpose: avoid local minima (optimization)
Clustering: an example of unsupervised learning
Simulated Annealing (SA)
A sequence of random moves is generated and the random decision to
accept a move depends on the cost of resulting configuration relative to the current state cost (Monte Carlo Method)
Deterministic Annealing (DA)
Deterministic:
does not wander randomly
(minimizes the free energy directly)
Annealing:
still want to avoid local minima with a certain level of uncertainty.
maintain the free energy at its minimum.
eq) F = D - TH (T: temperature, H: Shannon entropy, D: cost)
At large T, entropy (H) dominates while at small T cost dominates.
Annealing lowers temperature so solution tracks continuously
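The balance between D and H in F = D - TH can be seen numerically on a tiny clustering problem. The slide does not say how points are associated with clusters; the sketch assumes the Gibbs form p(j|x) ∝ exp(-d(x, y_j)/T) with squared distance as cost, which is the usual choice in deterministic-annealing clustering. Function names and data are hypothetical.

```python
import math


def gibbs_assignments(xs, centers, T):
    """Soft assignment p(j|x) proportional to exp(-d(x, y_j) / T),
    with squared distance as the cost d (an assumption of this sketch)."""
    probs = []
    for x in xs:
        d = [(x - y) ** 2 for y in centers]
        dmin = min(d)  # shift by the minimum for numerical stability
        w = [math.exp(-(dj - dmin) / T) for dj in d]
        z = sum(w)
        probs.append([wj / z for wj in w])
    return probs


def free_energy(xs, centers, T):
    """Return (F, D, H) with F = D - T * H."""
    probs = gibbs_assignments(xs, centers, T)
    D = sum(p * (x - y) ** 2
            for x, row in zip(xs, probs)
            for p, y in zip(row, centers))
    H = -sum(p * math.log(p) for row in probs for p in row if p > 0.0)
    return D - T * H, D, H


xs = [0.0, 0.2, 4.0, 4.2]
centers = [0.1, 4.1]
for T in (1e6, 10.0, 1.0, 1e-3):
    F, D, H = free_energy(xs, centers, T)
    print(f"T={T:g}  D={D:.4f}  H={H:.4f}  F={F:.4f}")
```

At the largest T every point is shared almost equally between the two centers (H near its maximum), while at the smallest T the assignments are effectively hard and F is essentially the cost D.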
DA FOR CLUSTERING
Start with a single cluster giving as solution Y1 as centroid
For some annealing schedule for T, iterate above algorithm testing covariance matrix in Xi about each cluster center to see if “elongated”
Split cluster if elongation “long enough”
You do not need to assume number of clusters but rather a final resolution T or equivalent
HIDDEN MARKOV MODEL (HMM) 1
Markov model
A system which may be described at any time as being in one
of a set of N distinct states, S1, S2, …, SN.
State transition probability
The special case of a discrete, first order Markov chain:
P[qt = Sj | qt-1 = Si, qt-2 = Sk, …] = P[qt = Sj | qt-1 = Si] (1)
Furthermore, consider those processes in which the right-hand side of (1) is independent of time, thereby leading to the set of state transition probabilities aij of the form
aij = P[qt = Sj | qt-1 = Si], 1 ≤ i, j ≤ N, with aij ≥ 0 and ∑j aij = 1
Initial state probability
HIDDEN MARKOV MODEL (HMM) 2
Hidden Markov Model
Observation is a probabilistic function of the state.
State is hidden.
Elements of an HMM
N, the number of states in the model. (Although the states are hidden)
M, the number of distinct observation symbols per state, i.e. the discrete
alphabet size.
The state transition probability distribution A = {aij},
where aij = P[qt = Sj|qt-1 = Si], 1 ≤ i, j ≤ N, with aij ≥ 0 and ∑j aij = 1
The observation symbol probability distribution (emission probability) in
state j, B = {bj(k)}, where
bj(k) = P[vk at t| qt = Sj], 1 ≤ j ≤ N, 1 ≤ k ≤ M
The initial state distribution π = {πi} where
πi = P[q1 = Si], 1 ≤ i ≤ N
HIDDEN MARKOV MODEL (HMM) 3
Three Basic Problems for HMMs
Prob(observation seq | model): Given the observation sequence O =
O1O2 … OT, and a model λ = (A, B, π), how do we efficiently
compute P(O| λ), the probability of the observation sequence, given the model?
Finding Optimal State Sequence: Given the observation sequence
O = O1O2 … OT, and a model λ = (A, B, π), how do we choose a
corresponding state sequence Q = q1q2 … qT which is optimal in
some meaningful sense (i.e. best “explains” the observations)?
Finding Optimal Model Parameters: How do we adjust the model parameters λ = (A, B, π) to maximize P(O| λ)?
HIDDEN MARKOV MODEL (HMM) 4
Solution to the three basic problems for HMMs
Solution to the problem 1 (Forward-Backward procedure)
Enumeration (straightforward way): computationally infeasible.
Forward Procedure
Consider forward variable αt(i) = P(O1O2 … Ot, qt = Si| λ) i.e.,
the probability of the partial observation sequence, O1O2 …
Ot, (until time t) and state Si at time t, given the model λ.
Solution to the problem 2 (Viterbi algorithm)
Optimality criterion: to find the single best state sequence (path), i.e., to maximize P(Q|O, λ) which is equivalent to maximizing P(Q, O| λ).
A formal technique for finding this single best state sequence exists, based on dynamic programming methods, and is called the Viterbi algorithm.
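Both solutions on this slide fit in a few lines for a discrete HMM. The sketch below uses 0-based state and symbol indices and a made-up two-state model; `forward` accumulates the forward variable αt(i) to obtain P(O| λ), and `viterbi` replaces the sum by a max and keeps back-pointers to recover the single best path. The recursion details follow the standard textbook formulation rather than anything spelled out in the slides.

```python
def forward(obs, A, B, pi):
    """Return P(O | lambda) using the forward variable alpha_t(i)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)


def viterbi(obs, A, B, pi):
    """Return (best state sequence Q, P(Q, O | lambda))."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backptrs = []
    for o in obs[1:]:
        # Best predecessor of each state j at this time step.
        prev = [max(range(n), key=lambda i, j=j: delta[i] * A[i][j])
                for j in range(n)]
        delta = [delta[prev[j]] * A[prev[j]][j] * B[j][o] for j in range(n)]
        backptrs.append(prev)
    state = max(range(n), key=lambda i: delta[i])
    best_prob = delta[state]
    path = [state]
    for prev in reversed(backptrs):  # trace the back-pointers
        state = prev[state]
        path.append(state)
    path.reverse()
    return path, best_prob


# Hypothetical model lambda = (A, B, pi): 2 states, 2 symbols.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
PI = [0.6, 0.4]
OBS = [0, 0, 1]

print(forward(OBS, A, B, PI))
print(viterbi(OBS, A, B, PI))
```

Each call costs on the order of N²T operations, compared with the N^T state sequences that direct enumeration would have to visit. For long sequences one would work with scaled or log probabilities to avoid underflow.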
HIDDEN MARKOV MODEL (HMM) 5
Solution to the Problem 3. (Baum-Welch Algorithm)
The third problem of HMMs is to determine a method to adjust
the model parameters (A, B, π) to maximize the probability of
the observation sequence given the model.
Choose λ = (A, B, π) such that P(O| λ) is locally maximized using an iterative procedure such as the Baum-Welch method (or equivalently the EM (expectation-maximization) method) or using gradient techniques.
Reestimation (iterative update and improvement): define ξt(i, j), the probability of being in state Si at time t and state Sj at time t+1, given the model and the observation sequence, i.e. ξt(i, j) = P(qt = Si, qt+1 = Sj | O, λ).
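A minimal sketch of how ξt(i, j) can be computed and used, assuming the standard forward-backward expression ξt(i, j) = αt(i) aij bj(Ot+1) βt+1(j) / P(O| λ), which the slide text does not spell out. Indices are 0-based, the two-state model is hypothetical, and only the transition matrix is re-estimated here; a full Baum-Welch iteration would update B and π as well.

```python
def forward_vars(obs, A, B, pi):
    """alpha[t][i] = P(O_1..O_t, q_t = S_i | lambda)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward_vars(obs, A, B):
    """beta[t][i] = P(O_{t+1}..O_T | q_t = S_i, lambda)."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def xi(obs, A, B, pi):
    """xi[t][i][j]: probability of S_i at time t and S_j at time t+1,
    given the observations and the model."""
    n = len(pi)
    alpha = forward_vars(obs, A, B, pi)
    beta = backward_vars(obs, A, B)
    prob_obs = sum(alpha[-1])
    return [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
              / prob_obs
              for j in range(n)] for i in range(n)]
            for t in range(len(obs) - 1)]


def reestimate_A(obs, A, B, pi):
    """One re-estimation of A: expected number of i -> j transitions
    divided by the expected number of transitions out of i."""
    x = xi(obs, A, B, pi)
    n = len(pi)
    new_A = []
    for i in range(n):
        out_i = sum(x[t][i][j] for t in range(len(x)) for j in range(n))
        new_A.append([sum(x[t][i][j] for t in range(len(x))) / out_i
                      for j in range(n)])
    return new_A


# Hypothetical model and observation sequence.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
PI = [0.6, 0.4]
OBS = [0, 0, 1, 1, 0]

print(reestimate_A(OBS, A, B, PI))
```

Repeating the update with the new parameters gives the iterative improvement described above; each ξt sums to one over all (i, j), which is a convenient sanity check.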