OVERVIEW OF
MULTICORE, PARALLEL COMPUTING,
AND DATA MINING
Indiana University
Computer Science Dept.
Seung-Hee Bae
OUTLINE
Motivation
Multicore
Parallel Computing
Data Mining
MOTIVATION
According to the "How Much Information?" project at UC Berkeley,
print, film, magnetic, and optical storage media produced about 5 exabytes
(a billion billion bytes) of new information in 2002.
5 exabytes ≈ 37,000 Libraries of Congress (17 million books each).
The rate of data increase will continue to accelerate through weblogs,
digital photos and video, surveillance monitors, scientific instruments
(sensors), instant messaging, etc.
Thus, we need more powerful computing platforms to deal with this much data.
To take advantage of multicore chips, it is critical to build software with
scalable parallelism.
To deal with a huge amount of data and utilize multicore, it is essential
to develop data mining tools with highly scalable parallelism.
RECOGNITION, MINING, AND
SYNTHESIS (RMS)
(from P. Dubey, "Recognition, Mining and Synthesis Moves Computers to the Era of Tera," Technology@Intel Magazine, Feb. 2005.)
Motivation
Multicore
Toward Concurrency
What is Multicore?
Parallel Computing
Data Mining
TOWARD CONCURRENCY IN SOFTWARE
The drivers of exponential growth (Moore's Law) are changing:
Clock speed: getting more cycles.
It has become harder to exploit higher clock speeds (2 GHz in 2001, 3.4 GHz in 2004, now?).
Execution optimization: more work per cycle.
Pipelining, branch prediction, multiple instructions per clock.
Is Moore's Law over? Not yet (the number of transistors keeps increasing).
Hyperthreading
Running two or more threads in parallel inside a single CPU.
It does not help single-threaded applications.
Multicore
Running two or more actual CPUs on one chip.
It will boost reasonably well-written multithreaded applications, but not
single-threaded ones.
WHAT IS MULTICORE?
A single chip with multiple distinct processing engines.
E.g., a shared-cache dual-core architecture:
Core 0 (CPU + L1 cache) and Core 1 (CPU + L1 cache), sharing a common cache on one die.
Motivation
Multicore
Parallel Computing
Parallel architectures (Shared-Memory vs. Distributed-Memory)
Decomposing Program (Data Parallelism vs. Task Parallelism)
MPI and OpenMP
PARALLEL COMPUTING: INTRODUCTION
Parallel computing
More than just a strategy for achieving good performance.
A vision for how computation can seamlessly scale from a single processor
to virtually limitless computing power.
Parallel computing software systems
Goal: to make parallel programming easier and the resulting applications
more portable and scalable, while achieving good performance.
Component Parallel Paradigm (explicit parallelism)
One explicitly programs the different parts of a parallel application.
E.g., MPI, PGAS, CCR & DSS, Workflow, DES.
Program Parallel Paradigm (implicit parallelism)
One writes a single program to describe the whole application; the compiler
and runtime break the program into multiple parts that execute in parallel.
E.g., OpenMP, HPF, HPCS, MapReduce.
Parallel Computing Challenges
Concurrency and communication.
Scalability and portability are difficult to achieve.
Diversity of architectures.
PARALLEL ARCHITECTURE 1
Shared-memory machines
Have a single shared address space that can be accessed by any processor.
Examples: multicore, symmetric multiprocessor (SMP).
Uniform Memory Access (UMA): access time is independent of the location.
Use a bus or a fully connected network; hard to achieve scalability.
Distributed-memory machines
The system memory is packaged with individual nodes of one or more
processors (i.e., separate computers connected by a network).
E.g., clusters.
PARALLEL ARCHITECTURE 2
Shared-Memory
Pros • Lower latency and higher bandwidth.
• Data are available to all of the CPUs through load and store instructions.
• Single address space.
Cons • Cache coherency must be dealt with carefully.
• Synchronization is explicitly needed to access shared data.
• Scalability issues.
Distributed-Memory
Pros • Scalable, if a scalable interconnection network is used.
• Quite fast local data access.
Cons • Communication is required to access data on a different processor.
• Communication management problems (e.g., long latency).
PARALLEL ARCHITECTURE 3
Hybrid systems
Distributed shared-memory (DSM)
A distributed-memory machine that allows a processor to directly access
a datum in a remote memory.
Latency varies with the distance to the remote memory.
Emphasizes the Non-Uniform Memory Access (NUMA) characteristics.
SMP clusters
PARALLEL PROGRAMMING MODEL
Shared-memory programming model
Needs synchronization to preserve data integrity.
More appropriate for shared-memory machines.
E.g., Open Specifications for MultiProcessing (OpenMP).
Message-passing programming model
Send-receive communication steps.
Communication is used to access a remote data location.
More appropriate for distributed-memory machines.
E.g., Message Passing Interface (MPI).
The shared-memory programming model can be used on distributed-memory
machines, just as the message-passing programming model can be used on
shared-memory architectures.
However, the efficiency of each programming model differs across architectures.
PARALLEL PROGRAM: DECOMPOSITION 1
Data parallelism
Subdivides the data domain of a problem into multiple regions and assigns
different processors to different regions.
Exploits the parallelism inherent in many large data structures.
Same task on different data (SPMD).
More commonly used in scientific problems.
Features: a natural form of scalability.
Task parallelism
Different processors carry out different functions.
Coarse-grain parallelism.
Different tasks on the same or different data.
Features: parallelism limited in size (tens, not millions of tasks);
synchronization behavior is probably good.
Data parallelism and task parallelism can be combined in one decomposition.
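The data-parallel (SPMD) pattern can be sketched in Python: split the data domain into regions and run the same task on each. A minimal sketch with made-up names, thread-based for simplicity; a real data-parallel run would use processes or MPI, since CPython threads share one interpreter:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum_of_squares(chunk):
    # the same task, applied to one region of the data (SPMD style)
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, n_workers=4):
    # data decomposition: subdivide the domain into one region per worker
    size = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # each worker computes a partial result; a final reduction combines them
        return sum(pool.map(partial_sum_of_squares, chunks))
```

The final reduction over partial results is the same shape a distributed-memory code would express with a collective operation.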
PARALLEL PROGRAM: DECOMPOSITION 2
Load balance and scalability
Scalable: running time is inversely proportional to the number of processors used.
Speedup(n) = T(1)/T(n); scalable if Speedup(n) ≈ n.
Second definition of scalability: scaled speedup.
Scalable if the running time remains the same when the number of processors
and the problem size are both increased by a factor of n.
Why is scalability not achieved?
A region that must be run sequentially: total speedup ≤ T(1)/T_s, where T_s
is the time spent in the sequential region (Amdahl's Law).
A requirement for a high degree of communication or coordination.
Poor load balance (balancing it is a major goal of parallel programming).
If one of the processors takes half of the parallel work, speedup is limited to 2.
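Amdahl's Law can be checked numerically. A small sketch (the function name is mine) of the speedup bound for a program in which a fraction of the work must run sequentially:

```python
def amdahl_speedup(serial_frac, n):
    """Upper bound on speedup with n processors when a fraction
    serial_frac of the total work must run sequentially."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

# with a 10% sequential region, even 1000 processors stay below 10x speedup
print(amdahl_speedup(0.10, 1000))
```

As n grows, the bound approaches 1/serial_frac, which is why even a small sequential region caps scalability.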
MEMORY MANAGEMENT
Memory-Hierarchy Management
Blocking
Ensuring that data remain in cache between subsequent accesses to the same
memory location.
Elimination of False Sharing
False sharing: when two different processors access distinct data items
that reside on the same cache line.
Ensure that data used by different processors reside on different cache lines
(by padding: inserting empty bytes in a data structure).
Communication Minimization and Placement
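Blocking (loop tiling) can be sketched as follows. This illustrates only the access pattern; in pure Python the cache effect is not observable, but the same tiling in C keeps each b×b block resident in cache between accesses:

```python
def blocked_transpose(a, n, b):
    # visit the n x n matrix in b x b tiles, so a tile's rows and columns
    # are reused while they are still in cache
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            for i in range(ii, min(ii + b, n)):
                for j in range(jj, min(jj + b, n)):
                    out[j][i] = a[i][j]
    return out
```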
MESSAGE PASSING INTERFACE (MPI)
Message Passing Interface (MPI)
A specification for a set of functions for managing movement of data among
sets of communicating processes.
The dominant scalable parallel computing paradigm for scientific problems.
Explicit message send and receive using a rendezvous model.
Point-to-point communication.
Collective communication.
Commonly implemented in terms of an SPMD model:
all processes execute essentially the same logic.
Pros: scalable and portable;
race conditions are avoided (implicit synchronization with the completion of the copy).
Cons:
MPI
6 Key Functions
MPI_INIT
MPI_COMM_SIZE
MPI_COMM_RANK
MPI_SEND
MPI_RECV
MPI_FINALIZE
Collective Communications
OPEN SPECIFICATIONS FOR MULTIPROCESSING (OpenMP) 1
Appropriate for uniform-access, shared-memory machines.
A sophisticated set of annotations (compiler directives) for traditional
C, C++, or Fortran codes, to aid compilers in producing parallel code.
Provides parallel loops and collective operations, such as summation over
loop indices.
Provides lock variables to allow fine-grain synchronization between threads.
Specifies where multiple threads should be applied, and how to assign work
to those threads.
Pros:
An excellent programming interface for uniform-access, shared-memory machines.
Cons:
No way to specify locality in machines with non-uniform shared memory or
distributed memory.
OpenMP 2
Directives: instruct the compiler to
create threads, perform synchronization operations, and manage shared memory.
Examples
PARALLEL DO ~ END PARALLEL DO
SCHEDULE (STATIC)
SCHEDULE (DYNAMIC)
REDUCTION(+: x)
PARALLEL SECTIONS
OpenMP synchronization primitives
Motivation
Multicore
Parallel Computing
Data Mining
Expectation Maximization (EM)
Deterministic Annealing (DA)
Hidden Markov Model (HMM)
Other Important Algorithms
EXPECTATION MAXIMIZATION (EM)
Expectation Maximization (EM)
A general algorithm for maximum-likelihood (ML) estimation where the data
are "incomplete" or the likelihood function involves latent variables.
An efficient iterative procedure.
Goal: estimate unknown parameters, given measurements.
A hill-climbing approach: guaranteed to reach a maximum (possibly a local maximum).
Two steps: the E-step (compute the expected log-likelihood under the current
parameter estimates) and the M-step (re-estimate the parameters to maximize
that expectation).
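As a concrete illustration of the two steps, here is a minimal EM sketch for a two-component 1-D Gaussian mixture with known, equal variances (the data, initial means, and function name are made up for the example):

```python
import math

def em_gmm_1d(data, mu_init, n_iter=50, sigma=1.0):
    """EM for a mixture of two 1-D Gaussians with known, equal variance."""
    mu1, mu2 = mu_init
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each point
        resp = []
        for x in data:
            p1 = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
            p2 = math.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))
            resp.append(p1 / (p1 + p2))
        # M-step: re-estimate each mean as a responsibility-weighted average
        mu1 = sum(r * x for r, x in zip(resp, data)) / sum(resp)
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / sum(1 - r for r in resp)
    return mu1, mu2

# two loose clusters around 0 and 5; EM recovers the component means
print(em_gmm_1d([-0.5, 0.0, 0.5, 4.5, 5.0, 5.5], mu_init=(1.0, 4.0)))
```

Each iteration increases the likelihood, which is the hill-climbing behavior noted above.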
DETERMINISTIC ANNEALING (DA)
Purpose: avoid local minima (optimization).
Simulated Annealing (SA)
A sequence of random moves is generated, and the random decision to accept
a move depends on the cost of the resulting configuration relative to the
cost of the current state (a Monte Carlo method).
Deterministic Annealing (DA)
Uses expectations instead of stochastic simulations (random moves).
Deterministic: makes incremental progress on the average
(minimizes the free energy F directly).
Annealing: still wants to avoid local minima, with a certain level of uncertainty;
minimizes the cost at a prescribed level of randomness (Shannon entropy).
F = D − TH   (D: cost, T: temperature, H: Shannon entropy)
At large T the entropy term (H) dominates, while at small T the cost dominates.
Annealing lowers the temperature so that the solution tracks continuously.
DA FOR CLUSTERING
This is an extended K-means algorithm.
Start with a single cluster, whose centroid Y1 is the solution.
For some annealing schedule for T, iterate the above algorithm,
testing whether clusters should be split as T decreases.
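A minimal sketch of the idea in Python (1-D points, squared-distance cost; the data, schedule, and names are made up). At each temperature, points are softly assigned to centroids by a Gibbs distribution and the centroids are updated, which lowers the free energy F = D − TH; lowering T hardens the assignments toward K-means:

```python
import math

def da_cluster_1d(points, centroids, t_init=10.0, t_min=0.05, alpha=0.9, sweeps=20):
    temp, y = t_init, list(centroids)
    while temp > t_min:
        for _ in range(sweeps):
            # soft assignment: p(j|x) proportional to exp(-d(x, y_j)/T)
            probs = []
            for x in points:
                w = [math.exp(-((x - yj) ** 2) / temp) for yj in y]
                s = sum(w)
                probs.append([wj / s for wj in w])
            # centroid update: weighted means minimize the free energy at this T
            for j in range(len(y)):
                den = sum(p[j] for p in probs)
                if den > 0:
                    y[j] = sum(p[j] * x for p, x in zip(probs, points)) / den
        temp *= alpha  # annealing schedule: lower the temperature
    return sorted(y)
```

At high T the assignments are nearly uniform (entropy dominates); as T falls the centroids separate and track the data's cluster structure continuously.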
DA CLUSTERING RESULTS (GIS)
(Plots of census data: age under 5 vs. age 25 to 34, and age under 5 vs. age 75 and up.)
HIDDEN MARKOV MODEL (HMM) 1
A system is in one of a set of N distinct states, S1, S2, ..., SN, at any time.
State transition probability
The special case of a discrete, first-order Markov chain:
P[q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, ...] = P[q_t = S_j | q_{t-1} = S_i]   (1)
Consider the right-hand side of (1) to be independent of time, leading to
the set of state transition probabilities a_ij of the form
a_ij = P[q_t = S_j | q_{t-1} = S_i].
The observation is a probabilistic function of the state.
The state is hidden.
Applications: speech recognition, bioinformatics, etc.
Elements of an HMM
N, the number of states.
M, the number of symbols.
A = {a_ij}, the state transition probability distribution.
B = {b_j(k)}, the symbol emission probability distribution in state j.
π = {π_i}, the initial state distribution.
HIDDEN MARKOV MODEL (HMM) 2
Three Basic Problems
1. Probability of an observation sequence given the model:
given the observation sequence O = O1 O2 ... OT and a model λ = (A, B, π),
how do we efficiently compute P(O | λ)?
2. Finding the optimal state sequence:
given O = O1 O2 ... OT and λ = (A, B, π), how do we choose a corresponding
state sequence Q = q1 q2 ... qT that is optimal in some meaningful sense
(i.e., best "explains" the observations)?
3. Finding the optimal model parameters:
how do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?
Solutions to these Problems
1. Probability of an observation sequence:
enumeration is computationally infeasible.
Forward procedure: α_t(i) = P(O1 O2 ... Ot, q_t = S_i | λ).
2. Finding the optimal state sequence (path):
Viterbi algorithm, a dynamic programming method:
δ_t(i) = max over q1 ... q_{t-1} of P[q1 q2 ... q_{t-1}, q_t = S_i, O1 O2 ... Ot | λ],
followed by path backtracking.
3. Finding the optimal model parameters:
Baum-Welch method: choose λ = (A, B, π) such that P(O | λ) is locally maximized.
Essentially an EM method (iterative), using
ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, λ).
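The Viterbi recursion can be sketched directly. A minimal implementation with a made-up two-state example (the states, observations, and probabilities are illustrative, not from the slides):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # delta[s]: max probability of any state path ending in s after the
    # observations seen so far; psi records the backpointers
    delta = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    psi = []
    for o in obs[1:]:
        prev_delta, delta, back = delta, {}, {}
        for s in states:
            p, best = max((prev_delta[r] * trans_p[r][s], r) for r in states)
            delta[s] = p * emit_p[s][o]
            back[s] = best
        psi.append(back)
    # backtrack from the most probable final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return list(reversed(path)), delta[last]

states = ("Healthy", "Fever")
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever":   {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever":   {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}
print(viterbi(("normal", "cold", "dizzy"), states, start_p, trans_p, emit_p))
```

The same recursion with a sum in place of the max gives the forward procedure α_t(i).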
OTHER IMPORTANT ALGS.
Other Data Mining Algorithms
Support Vector Machine (SVM)
K-means (special case of DA clustering), Nearest-neighbor
Decision Tree, Neural network, etc.
Dimension Reduction
GTM (Generative Topographic Map)
MDS (MultiDimensional Scaling)
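Since K-means is the T → 0 limit of DA clustering (hard assignments), a minimal 1-D sketch for comparison (toy data and names are made up):

```python
def kmeans_1d(points, centroids, n_iter=25):
    y = list(centroids)
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid (hard)
        clusters = [[] for _ in y]
        for x in points:
            j = min(range(len(y)), key=lambda k: (x - y[k]) ** 2)
            clusters[j].append(x)
        # update step: each centroid moves to the mean of its cluster
        y = [sum(c) / len(c) if c else y[j] for j, c in enumerate(clusters)]
    return sorted(y)
```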
SUMMARY
The era of multicore: parallelism is essential.
Explosion of information from many kinds of sources.
We are interested in scalable parallel data-mining algorithms.
Clustering algorithms (DA clustering):
GIS (demographic (census) data), where visualization is natural;
cheminformatics, where dimension reduction is necessary for visualization.
Visualization (dimension reduction).