OVERVIEW OF
MULTICORE, PARALLEL COMPUTING,
AND DATA MINING
Indiana University
Computer Science Dept.
Seung-Hee Bae
OUTLINE
Motivation
Multicore
Parallel Computing
Data Mining
MOTIVATION
According to the "How Much Information?" project at UC Berkeley,
print, film, magnetic, and optical storage media produced about 5 exabytes
(a billion billion bytes) of new information in 2002.
5 exabytes ≈ 37,000 Libraries of Congress (17 million books each).
The rate of data increase will continue to accelerate through weblogs,
digital photos and video, surveillance monitors, scientific instruments
(sensors), instant messaging, etc.
Thus, we need more powerful computing platforms to deal with this much data.
To take advantage of multicore chips, it is critical to build software with
scalable parallelism.
To deal with a huge amount of data and utilize multicore, it is essential
to develop data mining tools with highly scalable parallelism.
RECOGNITION, MINING, AND
SYNTHESIS (RMS)
(from P. Dubey, "Recognition, Mining and Synthesis Moves Computers to the Era of Tera," Technology@Intel Magazine, Feb. 2005.)
Motivation
Multicore
Toward Concurrency
What is Multicore?
Parallel Computing
Data Mining
TOWARD CONCURRENCY IN SOFTWARE
The drivers of exponential growth (Moore's Law) are changing:
Clock speed: getting more cycles.
It has become harder to exploit higher clock speeds (2 GHz in 2001, 3.4 GHz in 2004, now?).
Execution optimization: more work per cycle.
Pipelining, branch prediction, multiple instructions per clock.
Is Moore's Law over? Not yet (the number of transistors keeps increasing).
Hyperthreading
Running two or more threads in parallel inside a single CPU.
It does not help single-threaded applications.
Multicore
Running two or more actual CPUs on one chip.
It will boost reasonably well-written multithreaded applications, but not
single-threaded ones.
WHAT IS MULTICORE?
A single chip with multiple distinct processing engines.
E.g., a shared-cache dual-core architecture:
Core 0 (CPU + L1 cache) and Core 1 (CPU + L1 cache), sharing a common cache on one die.
Motivation
Multicore
Parallel Computing
Parallel architectures (Shared-Memory vs. Distributed-Memory)
Decomposing Program (Data Parallelism vs. Task Parallelism)
MPI and OpenMP
PARALLEL COMPUTING: INTRODUCTION
Parallel computing
More than just a strategy for achieving good performance.
A vision for how computation can seamlessly scale from a single processor
to virtually limitless computing power.
Parallel computing software systems
Goal: to make parallel programming easier and the resulting applications
more portable and scalable, while achieving good performance.
Component Parallel Paradigm (explicit parallelism)
One explicitly programs the different parts of a parallel application.
E.g., MPI, PGAS, CCR & DSS, Workflow, DES.
Program Parallel Paradigm (implicit parallelism)
One writes a single program to describe the whole application; the compiler
and runtime break the program into multiple parts that execute in parallel.
E.g., OpenMP, HPF, HPCS, MapReduce.
Parallel Computing Challenges
Concurrency and communication.
Scalability and portability are difficult to achieve.
Diversity of architectures.
PARALLEL ARCHITECTURE 1
Shared-memory machines
Have a single shared address space that can be accessed by any processor.
Examples: multicore, symmetric multiprocessor (SMP).
Uniform Memory Access (UMA): access time is independent of the location.
Use a bus or a fully connected network; hard to achieve scalability.
Distributed-memory machines
The system memory is packaged with individual nodes of one or more
processors (i.e., separate computers connected by a network).
E.g., clusters.
PARALLEL ARCHITECTURE 2
Shared-Memory
Pros • Lower latency and higher bandwidth.
• Data are available to all of the CPUs through load and store instructions.
• Single address space.
Cons • Cache coherency must be dealt with carefully.
• Synchronization is explicitly needed to access shared data.
• Scalability issues.
Distributed-Memory
Pros • Scalable, if a scalable interconnection network is used.
• Quite fast local data access.
Cons • Communication is required to access data on a different processor.
• Communication management problems (e.g., long latency).
PARALLEL ARCHITECTURE 3
Hybrid systems
Distributed shared-memory (DSM)
A distributed-memory machine that allows a processor to directly access
a datum in a remote memory.
Latency varies with the distance to the remote memory.
Emphasizes the Non-Uniform Memory Access (NUMA) characteristics.
SMP clusters
PARALLEL PROGRAMMING MODEL
Shared-memory programming model
Needs synchronization to preserve data integrity.
More appropriate for shared-memory machines.
E.g., Open Specifications for MultiProcessing (OpenMP).
Message-passing programming model
Send-receive communication steps.
Communication is used to access a remote data location.
More appropriate for distributed-memory machines.
E.g., Message Passing Interface (MPI).
The shared-memory programming model can be used on distributed-memory
machines, just as the message-passing programming model can be used on
shared-memory architectures.
However, the efficiency of each programming model differs across architectures.
PARALLEL PROGRAM: DECOMPOSITION 1
Data parallelism
Subdivides the data domain of a problem into multiple regions and assigns
different processors to different regions.
Exploits the parallelism inherent in many large data structures.
Same task on different data (SPMD).
More commonly used in scientific problems.
Features: a natural form of scalability.
Task parallelism
Different processors carry out different functions.
Coarse-grain parallelism.
Different tasks on the same or different data.
Features: parallelism limited in size (tens, not millions of tasks);
synchronization behavior is probably good.
Data parallelism and task parallelism can be combined in one decomposition.
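The data-parallel (SPMD) pattern can be sketched in Python: split the data domain into regions and run the same task on each. A minimal sketch with made-up names, thread-based for simplicity; a real data-parallel run would use processes or MPI, since CPython threads share one interpreter:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum_of_squares(chunk):
    # the same task, applied to one region of the data (SPMD style)
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, n_workers=4):
    # data decomposition: subdivide the domain into one region per worker
    size = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # each worker computes a partial result; a final reduction combines them
        return sum(pool.map(partial_sum_of_squares, chunks))
```

The final reduction over partial results is the same shape a distributed-memory code would express with a collective operation.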
PARALLEL PROGRAM: DECOMPOSITION 2
Load balance and scalability
Scalable: running time is inversely proportional to the number of processors used.
Speedup(n) = T(1)/T(n); scalable if Speedup(n) ≈ n.
Second definition of scalability: scaled speedup.
Scalable if the running time remains the same when the number of processors
and the problem size are both increased by a factor of n.
Why is scalability not achieved?
A region that must be run sequentially: total speedup ≤ T(1)/T_s, where T_s
is the time spent in the sequential region (Amdahl's Law).
A requirement for a high degree of communication or coordination.
Poor load balance (balancing it is a major goal of parallel programming).
If one of the processors takes half of the parallel work, speedup is limited to 2.
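Amdahl's Law can be checked numerically. A small sketch (the function name is mine) of the speedup bound for a program in which a fraction of the work must run sequentially:

```python
def amdahl_speedup(serial_frac, n):
    """Upper bound on speedup with n processors when a fraction
    serial_frac of the total work must run sequentially."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

# with a 10% sequential region, even 1000 processors stay below 10x speedup
print(amdahl_speedup(0.10, 1000))
```

As n grows, the bound approaches 1/serial_frac, which is why even a small sequential region caps scalability.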
MEMORY MANAGEMENT
Memory-Hierarchy Management
Blocking
Ensuring that data remain in cache between subsequent accesses to the same
memory location.
Elimination of False Sharing
False sharing: when two different processors access distinct data items
that reside on the same cache line.
Ensure that data used by different processors reside on different cache lines
(by padding: inserting empty bytes in a data structure).
Communication Minimization and Placement
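Blocking (loop tiling) can be sketched as follows. This illustrates only the access pattern; in pure Python the cache effect is not observable, but the same tiling in C keeps each b×b block resident in cache between accesses:

```python
def blocked_transpose(a, n, b):
    # visit the n x n matrix in b x b tiles, so a tile's rows and columns
    # are reused while they are still in cache
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            for i in range(ii, min(ii + b, n)):
                for j in range(jj, min(jj + b, n)):
                    out[j][i] = a[i][j]
    return out
```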
MESSAGE PASSING INTERFACE (MPI)
Message Passing Interface (MPI)
A specification for a set of functions for managing movement of data among
sets of communicating processes.
The dominant scalable parallel computing paradigm for scientific problems.
Explicit message send and receive using a rendezvous model.
Point-to-point communication.
Collective communication.
Commonly implemented in terms of an SPMD model:
all processes execute essentially the same logic.
Pros: scalable and portable;
race conditions are avoided (implicit synchronization with the completion of the copy).
Cons:
MPI
6 Key Functions
MPI_INIT
MPI_COMM_SIZE
MPI_COMM_RANK
MPI_SEND
MPI_RECV
MPI_FINALIZE
Collective Communications
OPEN SPECIFICATIONS FOR MULTIPROCESSING (OpenMP) 1
Appropriate for uniform-access, shared-memory machines.
A sophisticated set of annotations (compiler directives) for traditional
C, C++, or Fortran codes, to aid compilers in producing parallel code.
Provides parallel loops and collective operations, such as summation over
loop indices.
Provides lock variables to allow fine-grain synchronization between threads.
Specifies where multiple threads should be applied, and how to assign work
to those threads.
Pros:
An excellent programming interface for uniform-access, shared-memory machines.
Cons:
No way to specify locality in machines with non-uniform shared memory or
distributed memory.
OpenMP 2
Directives: instruct the compiler to
create threads, perform synchronization operations, and manage shared memory.
Examples
PARALLEL DO ~ END PARALLEL DO
SCHEDULE (STATIC)
SCHEDULE (DYNAMIC)
REDUCTION(+: x)
PARALLEL SECTIONS
OpenMP synchronization primitives
Motivation
Multicore
Parallel Computing
Data Mining
Expectation Maximization (EM)
Deterministic Annealing (DA)
Hidden Markov Model (HMM)
Other Important Algorithms
EXPECTATION MAXIMIZATION (EM)
Expectation Maximization (EM)
A general algorithm for maximum-likelihood (ML) estimation where the data
are "incomplete" or the likelihood function involves latent variables.
An efficient iterative procedure.
Goal: estimate unknown parameters, given measurements.
A hill-climbing approach: guaranteed to reach a maximum (possibly a local maximum).
Two steps: the E-step (compute the expected log-likelihood under the current
parameter estimates) and the M-step (re-estimate the parameters to maximize
that expectation).
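As a concrete illustration of the two steps, here is a minimal EM sketch for a two-component 1-D Gaussian mixture with known, equal variances (the data, initial means, and function name are made up for the example):

```python
import math

def em_gmm_1d(data, mu_init, n_iter=50, sigma=1.0):
    """EM for a mixture of two 1-D Gaussians with known, equal variance."""
    mu1, mu2 = mu_init
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each point
        resp = []
        for x in data:
            p1 = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
            p2 = math.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))
            resp.append(p1 / (p1 + p2))
        # M-step: re-estimate each mean as a responsibility-weighted average
        mu1 = sum(r * x for r, x in zip(resp, data)) / sum(resp)
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / sum(1 - r for r in resp)
    return mu1, mu2

# two loose clusters around 0 and 5; EM recovers the component means
print(em_gmm_1d([-0.5, 0.0, 0.5, 4.5, 5.0, 5.5], mu_init=(1.0, 4.0)))
```

Each iteration increases the likelihood, which is the hill-climbing behavior noted above.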
DETERMINISTIC ANNEALING (DA)
Purpose: avoid local minima (optimization).
Simulated Annealing (SA)
A sequence of random moves is generated, and the random decision to accept
a move depends on the cost of the resulting configuration relative to the
cost of the current state (a Monte Carlo method).
Deterministic Annealing (DA)
Uses expectations instead of stochastic simulations (random moves).
Deterministic: makes incremental progress on the average
(minimizes the free energy F directly).
Annealing: still wants to avoid local minima, with a certain level of uncertainty;
minimizes the cost at a prescribed level of randomness (Shannon entropy).
F = D − TH   (D: cost, T: temperature, H: Shannon entropy)
At large T the entropy term (H) dominates, while at small T the cost dominates.
Annealing lowers the temperature so that the solution tracks continuously.
DA FOR CLUSTERING
This is an extended K-means algorithm.
Start with a single cluster, whose centroid Y1 is the solution.
For some annealing schedule for T, iterate the above algorithm,
testing whether clusters should be split as T decreases.
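A minimal sketch of the idea in Python (1-D points, squared-distance cost; the data, schedule, and names are made up). At each temperature, points are softly assigned to centroids by a Gibbs distribution and the centroids are updated, which lowers the free energy F = D − TH; lowering T hardens the assignments toward K-means:

```python
import math

def da_cluster_1d(points, centroids, t_init=10.0, t_min=0.05, alpha=0.9, sweeps=20):
    temp, y = t_init, list(centroids)
    while temp > t_min:
        for _ in range(sweeps):
            # soft assignment: p(j|x) proportional to exp(-d(x, y_j)/T)
            probs = []
            for x in points:
                w = [math.exp(-((x - yj) ** 2) / temp) for yj in y]
                s = sum(w)
                probs.append([wj / s for wj in w])
            # centroid update: weighted means minimize the free energy at this T
            for j in range(len(y)):
                den = sum(p[j] for p in probs)
                if den > 0:
                    y[j] = sum(p[j] * x for p, x in zip(probs, points)) / den
        temp *= alpha  # annealing schedule: lower the temperature
    return sorted(y)
```

At high T the assignments are nearly uniform (entropy dominates); as T falls the centroids separate and track the data's cluster structure continuously.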
DA CLUSTERING RESULTS (GIS)
(Plots of census data: age under 5 vs. age 25 to 34, and age under 5 vs. age 75 and up.)
HIDDEN MARKOV MODEL (HMM) 1
A system is in one of a set of N distinct states, S1, S2, ..., SN, at any time.
State transition probability
The special case of a discrete, first-order Markov chain:
P[q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, ...] = P[q_t = S_j | q_{t-1} = S_i]   (1)
Consider the right-hand side of (1) to be independent of time, leading to
the set of state transition probabilities a_ij of the form
a_ij = P[q_t = S_j | q_{t-1} = S_i].
The observation is a probabilistic function of the state.
The state is hidden.
Applications: speech recognition, bioinformatics, etc.
Elements of an HMM
N, the number of states.
M, the number of symbols.
A = {a_ij}, the state transition probability distribution.
B = {b_j(k)}, the symbol emission probability distribution in state j.
π = {π_i}, the initial state distribution.
HIDDEN MARKOV MODEL (HMM) 2
Three Basic Problems
1. Probability of an observation sequence given the model:
given the observation sequence O = O1 O2 ... OT and a model λ = (A, B, π),
how do we efficiently compute P(O | λ)?
2. Finding the optimal state sequence:
given O = O1 O2 ... OT and λ = (A, B, π), how do we choose a corresponding
state sequence Q = q1 q2 ... qT that is optimal in some meaningful sense
(i.e., best "explains" the observations)?
3. Finding the optimal model parameters:
how do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?
Solutions to these Problems
1. Probability of an observation sequence:
enumeration is computationally infeasible.
Forward procedure: α_t(i) = P(O1 O2 ... Ot, q_t = S_i | λ).
2. Finding the optimal state sequence (path):
Viterbi algorithm, a dynamic programming method:
δ_t(i) = max over q1 ... q_{t-1} of P[q1 q2 ... q_{t-1}, q_t = S_i, O1 O2 ... Ot | λ],
followed by path backtracking.
3. Finding the optimal model parameters:
Baum-Welch method: choose λ = (A, B, π) such that P(O | λ) is locally maximized.
Essentially an EM method (iterative), using
ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, λ).
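The Viterbi recursion can be sketched directly. A minimal implementation with a made-up two-state example (the states, observations, and probabilities are illustrative, not from the slides):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # delta[s]: max probability of any state path ending in s after the
    # observations seen so far; psi records the backpointers
    delta = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    psi = []
    for o in obs[1:]:
        prev_delta, delta, back = delta, {}, {}
        for s in states:
            p, best = max((prev_delta[r] * trans_p[r][s], r) for r in states)
            delta[s] = p * emit_p[s][o]
            back[s] = best
        psi.append(back)
    # backtrack from the most probable final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return list(reversed(path)), delta[last]

states = ("Healthy", "Fever")
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever":   {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever":   {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}
print(viterbi(("normal", "cold", "dizzy"), states, start_p, trans_p, emit_p))
```

The same recursion with a sum in place of the max gives the forward procedure α_t(i).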
OTHER IMPORTANT ALGS.
Other Data Mining Algorithms
Support Vector Machine (SVM)
K-means (special case of DA clustering), Nearest-neighbor
Decision Tree, Neural network, etc.
Dimension Reduction
GTM (Generative Topographic Map)
MDS (MultiDimensional Scaling)
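Since K-means is the T → 0 limit of DA clustering (hard assignments), a minimal 1-D sketch for comparison (toy data and names are made up):

```python
def kmeans_1d(points, centroids, n_iter=25):
    y = list(centroids)
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid (hard)
        clusters = [[] for _ in y]
        for x in points:
            j = min(range(len(y)), key=lambda k: (x - y[k]) ** 2)
            clusters[j].append(x)
        # update step: each centroid moves to the mean of its cluster
        y = [sum(c) / len(c) if c else y[j] for j, c in enumerate(clusters)]
    return sorted(y)
```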
SUMMARY
The era of multicore: parallelism is essential.
Explosion of information from many kinds of sources.
We are interested in scalable parallel data-mining algorithms.
Clustering algorithms (DA clustering):
GIS (demographic (census) data), where visualization is natural;
cheminformatics, where dimension reduction is necessary for visualization.
Visualization (dimension reduction).