OVERVIEW OF MULTICORE, PARALLEL COMPUTING, AND DATA MINING


Indiana University

Computer Science Dept.

Seung-Hee Bae


OUTLINE



Multicore



Parallel Computing & MPI



Data Mining




Multicore
 • Toward Concurrency
 • What is Multicore?
 • Shared cache architecture
 • Recognition, Mining, and Synthesis (RMS)



Parallel Computing & MPI



Data Mining


TOWARD CONCURRENCY IN SOFTWARE



Exponential growth (Moore’s Law) can’t continue

Previous CPU performance gains
 • Clock speed: getting more cycles
   • Higher clock speeds have become harder to exploit because of physical issues such as heat, power consumption, and current leakage. (2 GHz: 2001, 3.4 GHz: 2004, now?)
 • Execution optimization: more work per cycle
   • Pipelining, branch prediction, executing multiple instructions in the same clock cycle
   • Reordering the instruction stream: this may change the meaning of programs.
 • Cache
   • Increasing the size of on-chip cache: main memory is much slower than on-chip cache.


TOWARD CONCURRENCY IN SOFTWARE 2

Current CPU performance gains
 • Is Moore’s law over? Not yet (the number of transistors is still rising)
 • Hyperthreading
   • Runs two or more threads in parallel inside a single CPU
   • Runs some instructions in parallel
   • Has only one each of most basic CPU features (except for extra registers)
   • Gains of 5% to 15%, up to 40% under ideal conditions
   • It doesn’t help single-threaded applications
 • Multicore
   • Runs two or more actual CPUs on one chip.
   • Less than double the speed even in the ideal case.
   • It will boost reasonably well-written multithreaded applications, but not single-threaded applications.
   • 2 × 3 GHz < 6 GHz
   • Coordination overhead between the cores to ensure cache coherency.
 • Cache
   • Only this will broadly benefit most existing applications.


WHAT IS MULTICORE?



Single chip

Multiple distinct processing engines

E.g., shared-cache dual-core architecture

[Figure: shared-cache dual-core diagram showing Core 0 and Core 1, each with its own CPU and L1 cache]


SHARED-CACHE ARCHITECTURE



Options for the last-level cache
 • Private to each core
 • Shared among different cores

Benefits of the Shared-Cache Architecture
 • Efficient use of the last-level cache: reduces resource underutilization.
 • Reduced cache-coherence complexity
   • Less false sharing because of the shared cache.
 • Reduced data-storage redundancy
   • The same data only needs to be stored once.
 • Reduced front-side bus traffic
   • Data requests can be resolved at the shared-cache level instead of going to system memory.


SOFTWARE TECHNIQUES FOR SHARED-CACHE MULTICORE SYSTEMS

Cache blocking (data tiling)
 • Allows data to stay in the cache while being processed by data loops.
 • Reduces unnecessary cache traffic (better cache hit ratio).

Hold approach (late update)
 • Each thread maintains its own private copy of the data.
 • The shared copy is updated only when necessary.
 • Reduces the frequency of access to the shared data.

Avoid false sharing
 • What is false sharing? (Unnecessary cache-line updates.)
 • How to avoid false sharing?
   • Allocate non-shared data to different cache lines (padding).
   • Copy the global variable to a local function variable, then copy the data
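The cache-blocking idea can be sketched as a tiled matrix multiplication. This is only an illustration of the loop structure: the function name `blocked_matmul` and the `tile` parameter are hypothetical, and in pure Python the cache benefit itself is not measurable. In a compiled language, the same loop nest keeps each tile cache-resident while it is reused.

```python
def blocked_matmul(a, b, tile=2):
    """Multiply two square matrices (lists of lists) one tile at a time."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    # Outer loops walk over tiles; inner loops stay inside one tile,
    # so the same small block of a, b, and c is reused before moving on.
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

The result is the same as the untiled triple loop; only the order in which elements are visited changes.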


RECOGNITION, MINING, AND SYNTHESIS (RMS)

The Era of Tera is coming quickly
 • Teraflops (computing power), terabits (communication), terabytes (storage)
 • World data is doubling every three years and is now measured in exabytes (a billion billion bytes)
 • Need a computing model to deal with this enormous sea of information

Working with Models
 • Recognition (What is ?)
   • Identifying that a set of data constitutes a model and then constructing that model.
 • Mining (Is it ?)
   • Searching for instances of the model.
 • Synthesis (What if ?)


RMS 2

(From P. Dubey, “Recognition, Mining and Synthesis Moves Computers to the Era of Tera,” Technology@Intel Magazine, Feb. 2005.)



Examples

 Medicine (a tumor)

 Business (hiring)

 Investment




Multicore



Parallel Computing & MPI
 • Parallel architectures (Shared-Memory vs. Distributed-Memory)
 • Decomposing Program (Data Parallelism vs. Task Parallelism)
 • MPI and OpenMP



Data Mining


PARALLEL COMPUTING: INTRODUCTION



Parallel computing
 • More than just a strategy for achieving good performance
 • A vision for how computation can seamlessly scale from a single processor to virtually limitless computing power

Parallel computing software systems
 • Goal: to make parallel programming easier and the resulting applications more portable and scalable while achieving good performance.
 • Difficulty
   • Explicitly parallel programming is difficult,
     e.g., computation, partitioning, synchronization, and data movement (correct answer and high performance)
   • Must be machine-independent (portability)
   • Complexity of the problems being attacked.

Parallel Computing Challenges
 • Concurrency & Communication
 • Need for high performance


PARALLEL ARCHITECTURE 1




Shared-memory machines
 • Have a single shared address space that can be accessed by any processor.
 • Examples
   • Multicore
   • Symmetric multiprocessor (SMP)
 • Uniform Memory Access (UMA)
   • Access time is independent of the location.
   • Uses a bus or a completely connected network.
   • Not scalable
 • Shared-memory programming model
   • Need for synchronization to preserve the integrity of shared data
   • E.g., Open Specifications for MultiProcessing (OpenMP)

Distributed-memory machines
 • The system memory is packaged with individual nodes of one or more processors (cf. using separate computers connected by a network)
 • E.g., cluster
 • Communication is required to provide data from one processor to a different processor.
 • Support the message-passing programming model
   • Send-receive communication steps.
   • E.g., Message Passing Interface (MPI)


PARALLEL ARCHITECTURE 2


Shared-Memory
 Pros
 • Lower latency and higher bandwidth
 • Data are available to all of the CPUs through load and store instructions
 • Single address space
 Cons
 • Cache coherency issue
 • Synchronization is explicitly needed to access shared data.
 • Scalability issue

Distributed-Memory
 Pros
 • Scalable, if a scalable interconnection network is used.
 • Quite fast local data access.
 Cons
 • Communication is required to access data in a different processor.
 • Communication management problem
   1. Long latency: addressed by consolidating messages between the same pair of processors
   2. Long transmission time


PARALLEL ARCHITECTURE 3



Hybrid systems
 • Distributed shared-memory (DSM)
   • A distributed-memory machine that allows a processor to directly access a datum in a remote memory.
   • Latency varies with the distance to the remote memory.
   • Emphasizes the Non-Uniform Memory Access (NUMA) characteristics.
 • SMP clusters
   • A distributed-memory system with an SMP as the unit.


PARALLEL PROGRAM: DECOMPOSITION 1

Decomposing Programs
 • Decomposition: identifying the portions of a program that can run in parallel.
 • Decomposition strategy
   • Task (functional) parallelism
     • Different processors carry out different functions.
   • Data parallelism
     • Subdivides the data domain of a problem into multiple regions and assigns different processors to compute the results for each region.
     • More commonly used in scientific problems.
     • Natural form of scalability
 • Programming models
   • Shared-memory programming model
     • Need for synchronization to preserve the integrity
   • Message-passing model
     • Communication is required to access a remote data location.


PARALLEL PROGRAM: DECOMPOSITION 2

Data Parallelism
 • Exploits the parallelism inherent in many large data structures.
 • Same task on different data (SPMD).
 • Can be expressed by ALL parallel programming models (i.e., MPI, HPF-like, OpenMP-like)
 • Features
   • Scalable
   • Hard to express when the geometry is irregular or dynamic

Functional Parallelism
 • Coarse-grain parallelism
 • Parallelism between the parts of many systems.
 • Different tasks on the same or different data.
 • Features
   • Parallelism limited in size
     • Tens, not millions
   • Synchronization probably good as parallelism
   • Decomposition natural


PARALLEL PROGRAM: DECOMPOSITION 3

Load balance and scalability
 • Scalable: running time is inversely proportional to the number of processors used.
   • Speedup(n) = T(1)/T(n)
   • Scalable if Speedup(n) ≈ n
 • Second definition of scalability: scaled speedup
   • Scalable if the running time remains the same when the number of processors and the problem size are both increased by a factor of n.
 • Why is scalability not achieved?
   • A region that must be run sequentially: total speedup ≤ T(1)/T_s, where T_s is the time spent in the sequential region (Amdahl’s Law)
   • A requirement for a high degree of communication or coordination.
   • Poor load balance (avoiding it is a major goal of parallel programming)
     • If one of the processors takes half of the parallel work, speedup will be limited to a factor of two.
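The speedup definitions and limits on this slide can be checked numerically. The sketch below is a minimal illustration; the function names and the `serial_fraction` parameter (the ratio T_s/T(1)) are assumptions introduced here, not part of the slides.

```python
def speedup(t1, tn):
    """Speedup(n) = T(1) / T(n)."""
    return t1 / tn

def amdahl_speedup(serial_fraction, n):
    """Amdahl's Law: the serial part is not sped up, the rest is divided by n."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

def amdahl_limit(serial_fraction):
    """Upper bound T(1)/T_s, written in terms of serial_fraction = T_s/T(1)."""
    return 1.0 / serial_fraction

def load_balance_speedup(shares):
    """Speedup of fully parallel work when processor p receives shares[p] of it.

    The run finishes only when the most heavily loaded processor finishes.
    """
    return sum(shares) / max(shares)
```

For example, `amdahl_speedup(0.1, 10)` stays well below 10, and `load_balance_speedup([0.5, 0.25, 0.25])` is bounded by the processor that holds half of the work, regardless of how many other processors share the remainder.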


PARALLEL PROGRAM



Memory-Hierarchy Management
 • Blocking
   • Ensuring that data remains in cache between subsequent accesses to the same memory location.
 • Elimination of False Sharing
   • False sharing: when two different processors access distinct data items that reside on the same cache block.
   • Ensure that data used by different processors reside on different cache blocks (by padding: inserting empty bytes in a data structure).
 • Communication Minimization and Placement
   • Move send and receive commands far enough apart so that time spent on communication can be overlapped with computation.
 • Stride-one access
   • Programs in which the loops access contiguous data items are much more efficient than those that do not.


MESSAGE PASSING INTERFACE (MPI) 1

Message Passing Interface (MPI)
 • A specification for a set of functions for managing the movement of data among sets of communicating processes.
 • The dominant scalable parallel computing paradigm for scientific problems.
 • Explicit message send and receive using a rendezvous model.
 • Point-to-point communication
 • Collective communication
 • Commonly implemented in terms of an SPMD model
   • All processes execute essentially the same logic.
 • Pros:
   • Scalable and portable
   • Race conditions avoided (implicit synchronization with the copy)
 • Cons:


MPI



6 Key Functions

 MPI_INIT

 MPI_COMM_RANK

 MPI_COMM_SIZE

 MPI_SEND

 MPI_RECV

 MPI_FINALIZE



Collective Communications

 Barrier, Broadcast, Gather, Scatter, All-to-all, Exchange

 General reduction operation (sum, minimum, scan)



Blocking, nonblocking, buffered, synchronous messaging
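The way rank, size, send, and receive fit together in an SPMD program can be imitated without MPI. The sketch below does not use any MPI library: it mimics the pattern with Python threads and `queue.Queue` mailboxes, and every name in it (`spmd_sum`, `worker`, `send`, `recv`, `mailboxes`) is hypothetical. The `send` here behaves like a buffered send, since the queue is unbounded.

```python
import queue
import threading

def spmd_sum(values, size=4):
    """Sum `values` using `size` workers that all run the same function (SPMD)."""
    mailboxes = [queue.Queue() for _ in range(size)]  # one inbox per rank
    result = []

    def send(dest, msg):
        # Stand-in for a buffered point-to-point send to rank `dest`.
        mailboxes[dest].put(msg)

    def recv(rank):
        # Stand-in for a blocking receive on rank `rank`.
        return mailboxes[rank].get()

    def worker(rank):
        # Every rank sums its own strided slice of the data.
        partial = sum(values[rank::size])
        if rank == 0:
            total = partial
            for _ in range(size - 1):
                total += recv(0)  # collect one partial sum per other rank
            result.append(total)
        else:
            send(0, partial)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

In real MPI code the same roles would be played by the six key functions listed above, with rank 0 typically gathering results, or by a single collective reduction.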


OPEN SPECIFICATIONS FOR MULTIPROCESSING (OPENMP) 1

Appropriate for shared-memory systems.

A sophisticated set of annotations (compiler directives) for traditional C, C++, or Fortran codes to aid compilers in producing parallel code.

It provides parallel loops and collective operations such as summation over loop indices.

It provides lock variables to allow fine-grain synchronization between threads.


OPENMP 2



Directives: instruct the compiler to
 • Create threads
 • Perform synchronization operations.
 • Manage shared memory.
 • Examples
   • PARALLEL DO ~ END PARALLEL DO: explicit parallel loop.
   • SCHEDULE (STATIC): assigns contiguous blocks at compile time.
   • SCHEDULE (DYNAMIC): assigns contiguous blocks at run time.
   • REDUCTION(+: x): the final value of variable x is determined by a global sum.
   • PARALLEL SECTIONS: task parallelism.

OpenMP synchronization primitives
 • Critical sections
 • Atomic updates
 • Barriers
 • Master selection
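The combination of a static schedule and a sum reduction can be sketched as follows. This is a Python analogy, not OpenMP: `static_blocks` and `parallel_sum_of_squares` are hypothetical names, and because of Python's global interpreter lock the threads illustrate the work division rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def static_blocks(n, num_threads):
    """Split range(n) into contiguous blocks, one per thread (static schedule)."""
    base, extra = divmod(n, num_threads)
    blocks, start = [], 0
    for t in range(num_threads):
        stop = start + base + (1 if t < extra else 0)
        blocks.append(range(start, stop))
        start = stop
    return blocks

def parallel_sum_of_squares(n, num_threads=4):
    """Each thread reduces its own block; the partial sums are then combined."""
    def partial(block):
        return sum(i * i for i in block)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Combining the per-thread partial results plays the role of REDUCTION(+: x).
        return sum(pool.map(partial, static_blocks(n, num_threads)))
```

Each thread accumulates privately and the shared result is produced only once at the end, which is also why a reduction needs no lock around every update.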


OPENMP 3



Summary
 • Work decomposition
 • Ideal target system: uniform-access, shared-memory.
 • Specify where multiple threads should be applied, and how to assign work to those threads.
 • Pros:
   • Excellent programming interface for uniform-access, shared-memory machines.
 • Cons:
   • No way to specify locality in machines with non-uniform shared memory or distributed shared memory.




Multicore



Parallel Computing & MPI



Data Mining
 • Expectation Maximization (EM)
 • Deterministic Annealing (DA)
 • Hidden Markov Model (HMM)
 • Support Vector Machine (SVM)


EXPECTATION MAXIMIZATION (EM)

Expectation Maximization (EM)
 • A general algorithm for maximum-likelihood (ML) estimation where the data are “incomplete” or the likelihood function involves latent variables.
 • An efficient iterative procedure
 • Goal: estimate unknown parameters, given measurements.
 • Hill-climbing approach: guaranteed to reach a local maximum (not necessarily the global maximum).
 • Two steps
   • E-step (Expectation): the missing data are estimated given the observed data and the current estimate of the model parameters.
   • M-step (Maximization): the likelihood function is maximized under the assumption that the missing data are known. (The estimated missing data from the E-step are used in lieu of the actual missing data.)
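The two steps can be illustrated on a deliberately small problem. The sketch below assumes a mixture of two one-dimensional Gaussians with unit variance and equal weights, so that only the two means are unknown; this model, the function name `em_two_gaussians`, and its parameters are assumptions chosen for illustration, not something the slides specify.

```python
import math

def em_two_gaussians(data, mu_init, n_iter=50):
    """Estimate the two means of an equal-weight, unit-variance Gaussian mixture."""
    mu = list(mu_init)
    for _ in range(n_iter):
        # E-step: estimate the missing data, i.e. the probability that each
        # point came from component 0, given the current means.
        resp0 = []
        for x in data:
            p0 = math.exp(-0.5 * (x - mu[0]) ** 2)
            p1 = math.exp(-0.5 * (x - mu[1]) ** 2)
            resp0.append(p0 / (p0 + p1))
        # M-step: maximize the likelihood as if those memberships were known,
        # which for this model gives responsibility-weighted averages.
        w0 = sum(resp0)
        w1 = len(data) - w0
        mu[0] = sum(r * x for r, x in zip(resp0, data)) / w0
        mu[1] = sum((1.0 - r) * x for r, x in zip(resp0, data)) / w1
    return mu
```

With data clustered around -2 and 2 and starting means of -1 and 1, the estimates move toward the two cluster centers. The sketch omits safeguards (for example against a component receiving zero total weight) that a practical implementation would need.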


DETERMINISTIC ANNEALING (DA)

 • Purpose: avoid local minima (optimization)
 • Clustering
   • An example of unsupervised learning
 • Simulated Annealing (SA)
   • A sequence of random moves is generated, and the random decision to accept a move depends on the cost of the resulting configuration relative to the current state cost (Monte Carlo method)
 • Deterministic Annealing (DA)
   • Deterministic:
     • Doesn’t wander randomly
     • (Minimizes the free energy directly)
   • Annealing:
     • Still wants to avoid local minima, with a certain level of uncertainty.
     • Maintains the free energy at its minimum.
       Eq.) F = D - TH (T: temperature, H: Shannon entropy, D: cost)
     • At large T, entropy (H) dominates, while at small T, cost dominates.
     • Annealing lowers the temperature so that the solution tracks continuously
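The effect of the temperature in F = D - TH can be seen on a single data point with fixed cluster centers. The sketch below assumes a squared-distance cost and uses the Gibbs distribution p_k ∝ exp(-d_k / T), which is the standard minimizer of F for fixed centers; both choices, and the function names, are assumptions made here for illustration.

```python
import math

def soft_assignments(x, centers, T):
    """Association probabilities p_k proportional to exp(-d_k / T)."""
    d = [(x - c) ** 2 for c in centers]
    d_min = min(d)
    # Subtracting d_min leaves the probabilities unchanged and avoids underflow.
    w = [math.exp(-(dk - d_min) / T) for dk in d]
    z = sum(w)
    return [wk / z for wk in w]

def free_energy_terms(x, centers, T):
    """Return (D, H, F) with D the expected cost, H the entropy, F = D - T*H."""
    p = soft_assignments(x, centers, T)
    D = sum(pk * (x - c) ** 2 for pk, c in zip(p, centers))
    H = -sum(pk * math.log(pk) for pk in p if pk > 0.0)
    return D, H, D - T * H
```

At a large T the assignments are close to uniform (entropy dominates), while at a small T the point is assigned almost entirely to its nearest center (cost dominates).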


DA FOR CLUSTERING



Start with a single cluster, giving as solution Y_1 as the centroid

For some annealing schedule for T, iterate the above algorithm, testing the covariance matrix in X_i about each cluster center to see if it is “elongated”

Split the cluster if the elongation is “long enough”

You do not need to assume the number of clusters but rather a final resolution T or equivalent


HIDDEN MARKOV MODEL (HMM) 1



Markov model
 • A system which may be described at any time as being in one of a set of N distinct states, S_1, S_2, …, S_N.
 • State transition probability
   • The special case of a discrete, first-order Markov chain:
     P[q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, …] = P[q_t = S_j | q_{t-1} = S_i]   (1)
   • Furthermore, consider those processes in which the right-hand side of (1) is independent of time, thereby leading to the set of state transition probabilities a_ij of the form
     a_ij = P[q_t = S_j | q_{t-1} = S_i], 1 ≤ i, j ≤ N, with a_ij ≥ 0 and ∑_j a_ij = 1
 • Initial state probability
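The constraints on a_ij and the role of the initial state probability can be checked with a few lines of code. This is a minimal sketch; the function names `is_row_stochastic` and `step` are hypothetical, and states are indexed from 0 rather than 1.

```python
def is_row_stochastic(A, tol=1e-12):
    """Check a_ij >= 0 and sum_j a_ij = 1 for every row i."""
    return all(
        all(a >= 0 for a in row) and abs(sum(row) - 1.0) <= tol
        for row in A
    )

def step(dist, A):
    """Advance a state distribution by one time step: new_j = sum_i dist_i * a_ij."""
    n = len(A)
    return [sum(dist[i] * A[i][j] for i in range(n)) for j in range(n)]
```

Starting from an initial state distribution and applying `step` repeatedly gives the state distribution at any later time, which is all a first-order chain with time-independent transitions requires.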


HIDDEN MARKOV MODEL (HMM) 2

 • Hidden Markov Model
   • The observation is a probabilistic function of the state.
   • The state is hidden.
 • Elements of an HMM
   • N, the number of states in the model (although the states are hidden).
   • M, the number of distinct observation symbols per state, i.e., the discrete alphabet size.
   • The state transition probability distribution A = {a_ij}, where
     a_ij = P[q_t = S_j | q_{t-1} = S_i], 1 ≤ i, j ≤ N, with a_ij ≥ 0 and ∑_j a_ij = 1
   • The observation symbol probability distribution (emission probability) in state j, B = {b_j(k)}, where
     b_j(k) = P[v_k at t | q_t = S_j], 1 ≤ j ≤ N, 1 ≤ k ≤ M
   • The initial state distribution π = {π_i}, where
     π_i = P[q₁ = S_i], 1 ≤ i ≤ N


HIDDEN MARKOV MODEL (HMM) 3



Three Basic Problems for HMMs
 • Prob(observation seq | model): Given the observation sequence O = O₁O₂ … O_T and a model λ = (A, B, π), how do we efficiently compute P(O | λ), the probability of the observation sequence, given the model?
 • Finding Optimal State Sequence: Given the observation sequence O = O₁O₂ … O_T and a model λ = (A, B, π), how do we choose a corresponding state sequence Q = q₁q₂ … q_T which is optimal in some meaningful sense (i.e., best “explains” the observations)?
 • Finding Optimal Model Parameters: How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?


HIDDEN MARKOV MODEL (HMM) 4



Solutions to the three basic problems for HMMs
 • Solution to problem 1 (Forward-Backward procedure)
   • Enumeration (the straightforward way): computationally infeasible.
   • Forward procedure
     • Consider the forward variable α_t(i) = P(O₁O₂ … O_t, q_t = S_i | λ), i.e., the probability of the partial observation sequence O₁O₂ … O_t (until time t) and state S_i at time t, given the model λ.
 • Solution to problem 2 (Viterbi algorithm)
   • Optimality criterion: to find the single best state sequence (path), i.e., to maximize P(Q | O, λ), which is equivalent to maximizing P(Q, O | λ).
   • A formal technique for finding this single best state sequence exists, based on dynamic programming methods, called the Viterbi algorithm.
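Both procedures can be sketched compactly for a discrete HMM. In the code below, `A`, `B`, and `pi` follow the notation of the slides, while the function names, the 0-based state and symbol indices, and the list-of-lists representation are assumptions made for illustration. The sketch multiplies raw probabilities, so long sequences would need scaling or log-probabilities.

```python
def forward(A, B, pi, obs):
    """Return P(O | lambda) using the forward variable alpha_t(i)."""
    n = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(O_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    # Termination: sum over the final states
    return sum(alpha)

def viterbi(A, B, pi, obs):
    """Return (best state path, P(Q, O | lambda) for that path)."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers = []
    for o in obs[1:]:
        prev = delta
        # For each state j, remember the best predecessor state.
        ptr = [max(range(n), key=lambda i: prev[i] * A[i][j]) for j in range(n)]
        delta = [prev[ptr[j]] * A[ptr[j]][j] * B[j][o] for j in range(n)]
        backpointers.append(ptr)
    # Backtrack from the most probable final state.
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]
```

The forward procedure costs on the order of N²T operations, compared with the exponential cost of enumerating every state sequence, and Viterbi has the same structure with the sum replaced by a maximum.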


HIDDEN MARKOV MODEL (HMM) 5



Solution to problem 3 (Baum-Welch algorithm)
 • The third problem of HMMs is to determine a method to adjust the model parameters (A, B, π) to maximize the probability of the observation sequence given the model.
 • Choose λ = (A, B, π) such that P(O | λ) is locally maximized using an iterative procedure such as the Baum-Welch method (or, equivalently, the EM (expectation-maximization) method) or using gradient techniques.
 • Reestimation (iterative update and improvement): define ξ_t(i, j), the probability of being in state S_i at time t and state S_j at time t+1, given the model and the observation sequence, i.e.,
   ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, λ)