OVERVIEW OF MULTICORE, PARALLEL COMPUTING, AND DATA MINING

Academic year: 2020

Seung-Hee Bae

Computer Science Dept., Indiana University


OUTLINE

Multicore

Parallel Computing & MPI

Data Mining

Multicore

  Toward Concurrency

  What is Multicore?

  Shared-cache architecture

  Recognition, Mining, and Synthesis (RMS)

Parallel Computing & MPI

Data Mining


TOWARD CONCURRENCY IN SOFTWARE

Exponential growth (Moore’s Law) can’t continue

Previous CPU performance gains

Clock speed: getting more cycles

It has become harder to exploit higher clock speeds because of several physical issues, such as heat, power consumption, and current leakage. (2 GHz in 2001, 3.4 GHz in 2004, now?)

Execution optimization: more work per cycle

Pipelining, branch prediction, executing multiple instructions in the same clock cycle

Reordering the instruction stream, which can change the meaning of programs.

Cache

Increasing the size of on-chip cache: main memory is much

TOWARD CONCURRENCY IN SOFTWARE 2

Current CPU performance gains

Is Moore's law over? Not yet (the number of transistors is still rising).

Hyperthreading

Runs two or more threads in parallel inside a single CPU.

Runs some instructions in parallel.

Has only one each of most basic CPU features (except for extra registers).

Typical gain of 5% to 15%, up to 40% under ideal conditions.

It doesn't help single-threaded applications.

Multicore

Runs two or more actual CPUs on one chip.

Less than double the speed even in the ideal case.

It will boost reasonably well-written multithreaded applications, but not single-threaded applications.

2 × 3 GHz < 6 GHz

Coordination overhead between the cores to ensure cache coherency.

Cache

Only this will broadly benefit most existing applications.

WHAT IS MULTICORE?

Single chip

Multiple distinct processing engines

E.g., shared-cache dual-core architecture

[Figure: dual-core chip in which Core 0 and Core 1 each have a CPU and an L1 cache]

SHARED-CACHE ARCHITECTURE

Options for the last-level cache

Private to each core

Shared among different cores

Benefits of the Shared-Cache Architecture

Efficient use of the last-level cache: reduces resource underutilization.

Reduced cache-coherence complexity: less false sharing because the cache is shared.

Reduced data-storage redundancy: the same data only needs to be stored once.

Reduced front-side bus traffic: data requests can be resolved at the shared-cache level instead of going to system memory.

SOFTWARE TECHNIQUES FOR SHARED-CACHE MULTICORE SYSTEMS

Cache blocking (Data Tiling)

Allow data to stay in the cache while being processed by data loops.

Reduces unnecessary cache traffic. (Better cache hit ratio.)

Hold approach (Late update)

Each thread maintains its own private copy of data.

Update the shared copy only when it is necessary.

Reduces the frequency of access to the shared data.

Avoid false sharing

What is false sharing? (An unnecessary cache line update.)

How to avoid false sharing?

Allocate non-shared data to different cache lines (padding).

Copy the global variable to a local function variable, then copy the data
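A minimal sketch of cache blocking (data tiling), using matrix multiplication as an assumed example workload. The names `naive_matmul`, `blocked_matmul`, and the `tile` parameter are illustrative and do not come from the slides. In pure Python the interpreter overhead hides any cache effect, so the sketch only shows the loop structure that one would carry over to C or Fortran.

```python
def naive_matmul(a, b):
    """Reference triple loop: C = A * B for square lists of lists."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def blocked_matmul(a, b, tile=2):
    """Same product, computed one tile-by-tile block at a time.

    The three outer loops pick a block; the three inner loops stay inside it,
    so the same small pieces of A, B, and C are reused while still cached.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

The tile size would normally be chosen so that the blocks being worked on fit in the shared last-level cache; the default of 2 here is only for readability.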

RECOGNITION, MINING, AND SYNTHESIS (RMS)

The Era of Tera is coming quickly

Teraflops (computing power), terabits (communication), terabytes (storage)

World data is doubling every three years and is now measured in exabytes (a billion billion bytes).

A computing model is needed to deal with this enormous sea of information.

Working with Models

Recognition (What is ?)

Identifying that a set of data constitutes a model and then constructing that model.

Mining (Is it ?)

Searching for instances of the model.

Synthesis (What if ?)

RMS 2

(From P. Dubey, "Recognition, Mining and Synthesis Moves Computers to the Era of Tera," Technology@Intel Magazine, Feb. 2005.)

Examples

Medicine (a tumor)

Business (hiring)

Investment

Multicore

Parallel Computing & MPI

  Parallel architectures (Shared-Memory vs. Distributed-Memory)

  Decomposing Program (Data Parallelism vs. Task Parallelism)

  MPI and OpenMP

Data Mining

PARALLEL COMPUTING: INTRODUCTION

Parallel computing

More than just a strategy for achieving good performance.

A vision for how computation can seamlessly scale from a single processor to virtually limitless computing power.

Parallel computing software systems

Goal: to make parallel programming easier and the resulting applications more portable and scalable while achieving good performance.

Difficulty

Explicitly parallel programming is difficult.

E.g., computation partitioning, synchronization, and data movement (for both a correct answer and high performance).

Programs must be machine-independent (portability).

Complexity of the problems being attacked.

Parallel Computing Challenges

Concurrency and communication

Need for high performance

PARALLEL ARCHITECTURE 1

Shared-memory machines

Have a single shared address space that can be accessed by any processor.

Examples: multicore, symmetric multiprocessor (SMP)

Uniform Memory Access (UMA): access time is independent of the location.

Use a bus or a completely connected network.

Not scalable.

Shared-memory programming model

Needs synchronization to preserve the integrity of shared data.

E.g., Open Specifications for MultiProcessing (OpenMP)

Distributed-memory machines

The system memory is packaged with individual nodes of one or more processors (cf. using separate computers connected by a network).

E.g., cluster

Communication is required to provide data from one processor to a different processor.

Supports the message-passing programming model.

Send-receive communication steps.

E.g., Message Passing Interface (MPI)

PARALLEL ARCHITECTURE 2

Shared memory, pros

• Lower latency and higher bandwidth

• Data are available to all of the CPUs through load and store instructions

• Single address space

Shared memory, cons

• Cache coherency issue

• Synchronization is explicitly needed to access shared data

• Scalability issue

Distributed memory, pros

• Scalable, if a scalable interconnection network is used

• Quite fast local data access

Distributed memory, cons

• Communication is required to access data in a different processor

• Communication management problem

1. Long latency: consolidate messages between the same pair of processors

2. Long transmission time

PARALLEL ARCHITECTURE 3

Hybrid systems

Distributed shared-memory (DSM)

A distributed-memory machine that allows a processor to directly access a datum in a remote memory.

Latency varies with the distance to the remote memory.

Emphasizes the Non-Uniform Memory Access (NUMA) characteristics.

SMP clusters

A distributed-memory system with an SMP as the unit.

PARALLEL PROGRAM: DECOMPOSITION 1

Decomposing Programs

Decomposition: identifying the portions of the program that can run in parallel.

Decomposition strategy

Task (functional) parallelism

Different processors carry out different functions.

Data parallelism

Subdivides the data domain of a problem into multiple regions and assigns different processors to compute the results for each region.

More commonly used in scientific problems.

Natural form of scalability.

Programming models

Shared-memory programming model

Needs synchronization to preserve the integrity of shared data.

Message-passing model

Communication is required to access a remote data location.

PARALLEL PROGRAM: DECOMPOSITION 2

Data Parallelism

Exploits the parallelism inherent in many large data structures.

Same task on different data (SPMD).

Can be expressed by all parallel programming models (i.e., MPI, HPF-like, OpenMP-like).

Features

Scalable.

Hard to express when the geometry is irregular or dynamic.

Functional Parallelism

Coarse-grain parallelism: parallelism between the parts of many systems.

Different tasks on the same or different data.

Features

Parallelism is limited in size: tens, not millions.

Synchronization is probably good, as the parallelism and decomposition are natural.
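The two decomposition strategies can be contrasted with a small sketch. It uses Python's standard `concurrent.futures` thread pool purely to show the structure; the function names `data_parallel_sum` and `task_parallel_stats` are illustrative assumptions, and in standard CPython threads do not speed up this kind of arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def data_parallel_sum(values, workers=4):
    """Data parallelism: the same task (sum) runs on different regions of the data."""
    size = max(1, -(-len(values) // workers))  # ceiling division, at least 1
    regions = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, regions))


def task_parallel_stats(values):
    """Task parallelism: different tasks (min, max, sum) run on the same data."""
    tasks = (("min", min), ("max", max), ("sum", sum))
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, values) for name, fn in tasks}
        return {name: future.result() for name, future in futures.items()}
```

Note how the data-parallel version grows naturally with the number of regions, while the task-parallel version is limited by the number of distinct tasks (three here).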

PARALLEL PROGRAM: DECOMPOSITION 3

Load balance and scalability

Scalable: the running time is inversely proportional to the number of processors used.

$\mathrm{Speedup}(n) = T(1)/T(n)$

Scalable if $\mathrm{Speedup}(n) \approx n$.

Second definition of scalability (scaled speedup): scalable if the running time remains the same when the number of processors and the problem size are both increased by a factor of n.

Why is scalability not achieved?

A region must be run sequentially: total speedup $\le T(1)/T_s$ (Amdahl's Law), where $T_s$ is the time spent in the sequential region.

A high degree of communication or coordination is required.

Poor load balance (avoiding it is a major goal of parallel programming).

If one of the processors takes half of the parallel work, speedup will be limited to a factor of two, no matter how many processors are used.
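A short numerical sketch of the two limits on speedup discussed on this slide. Time is normalized so that T(1) = 1; the function names and the example fractions are illustrative assumptions, not values from the slides.

```python
def amdahl_speedup(serial_fraction, n):
    """Speedup(n) = T(1)/T(n) with T(n) = s + (1 - s)/n and T(1) = 1."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


def imbalanced_speedup(shares):
    """Speedup of fully parallel work split into the given per-processor shares.

    The shares must sum to 1; the run ends when the most loaded processor ends.
    """
    if abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return 1.0 / max(shares)


# With 10% sequential work, speedup can never exceed 1/0.1 = 10.
few = amdahl_speedup(0.1, 10)        # about 5.26
many = amdahl_speedup(0.1, 10**6)    # just under 10

# One processor holds half of the work, 99 others share the rest.
lopsided = imbalanced_speedup([0.5] + [0.5 / 99] * 99)  # 2.0
```

The first function shows the bound T(1)/T_s as n grows; the second shows why load balance matters even when nothing is sequential.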

PARALLEL PROGRAM

Memory-Hierarchy Management

Blocking

Ensuring that data remains in cache between subsequent accesses to the same memory location.

Elimination of false sharing

False sharing: two different processors access distinct data items that reside on the same cache block.

Ensure that data used by different processors reside on different cache blocks (by padding: inserting empty bytes in a data structure).

Communication minimization and placement

Move send and receive commands far enough apart so that time spent on communication can be overlapped with computation.

Stride-one access

Programs in which the loops access contiguous data items are much more efficient than those that do not.

MESSAGE PASSING INTERFACE (MPI) 1

Message Passing Interface (MPI)

A specification for a set of functions for managing the movement of data among sets of communicating processes.

The dominant scalable parallel computing paradigm for scientific problems.

Explicit message send and receive using a rendezvous model.

Point-to-point communication

Collective communication

Commonly implemented in terms of an SPMD model: all processes execute essentially the same logic.

Pros:

Scalable and portable

Race conditions are avoided (implicit synchronization with the copy)

Cons:

MPI

6 Key Functions

MPI_INIT

MPI_COMM_RANK

MPI_COMM_SIZE

MPI_SEND

MPI_RECV

MPI_FINALIZE

Collective Communications

Barrier, Broadcast, Gather, Scatter, All-to-all, Exchange

General reduction operation (sum, minimum, scan)

Blocking, nonblocking, buffered, synchronous messaging
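The SPMD pattern behind the six key functions can be imitated with the Python standard library. The sketch below is not MPI and uses no MPI bindings: threads stand in for processes, queues stand in for messages, and the helper names `run_spmd`, `send`, `recv`, and `partial_sum_program` are hypothetical. The `rank` and `size` arguments play the role that MPI_COMM_RANK and MPI_COMM_SIZE play in a real program.

```python
import queue
import threading


def run_spmd(size, program):
    """Run `program(rank, size, send, recv)` once per rank and collect results."""
    # mailboxes[dest][source] holds messages sent from `source` to `dest`.
    mailboxes = [[queue.Queue() for _ in range(size)] for _ in range(size)]
    results = [None] * size

    def worker(rank):
        def send(value, dest):
            # Buffered send: returns at once (a rendezvous send would wait).
            mailboxes[dest][rank].put(value)

        def recv(source):
            # Blocking receive: waits until a message from `source` arrives.
            return mailboxes[rank][source].get()

        results[rank] = program(rank, size, send, recv)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(size)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def partial_sum_program(rank, size, send, recv):
    """Every rank runs this same logic on its own slice of the data."""
    data = range(100)
    local = sum(data[rank::size])
    if rank != 0:
        send(local, dest=0)
        return None
    total = local
    for source in range(1, size):
        total += recv(source)  # hand-written reduction to rank 0
    return total
```

Calling `run_spmd(4, partial_sum_program)` returns a list whose first entry is the global sum held by rank 0. In real MPI the gather loop on rank 0 would typically be replaced by a collective reduction.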

OPEN SPECIFICATIONS FOR MULTIPROCESSING (OPENMP) 1

Appropriate to shared memory.

A sophisticated set of annotations (compiler directives) for traditional C, C++, or Fortran codes to aid compilers in producing parallel codes.

It provides parallel loops and collective operations such as summation over loop indices.

It provides lock variables to allow fine-grain synchronization between threads.

OPENMP 2

Directives: instruct the compiler to

Create threads.

Perform synchronization operations.

Manage shared memory.

Examples

PARALLEL DO ~ END PARALLEL DO: explicit parallel loop.

SCHEDULE (STATIC): assign contiguous blocks at compile time.

SCHEDULE (DYNAMIC): assign contiguous blocks at run time.

REDUCTION(+: x): the final value of variable x is determined by a global sum.

PARALLEL SECTIONS: task parallelism.

OpenMP synchronization primitives

Critical sections

Atomic updates

Barriers

Master selection
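What a static schedule combined with a sum reduction does can be sketched without OpenMP. The code below is an analogy written with Python's standard thread pool, not OpenMP itself; `static_blocks` and `parallel_sum_of_squares` are hypothetical names, and in standard CPython the threads will not actually run this arithmetic in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def static_blocks(n, num_threads):
    """Split range(n) into contiguous blocks fixed in advance, one per thread."""
    base, extra = divmod(n, num_threads)
    blocks, start = [], 0
    for t in range(num_threads):
        stop = start + base + (1 if t < extra else 0)
        blocks.append(range(start, stop))
        start = stop
    return blocks


def parallel_sum_of_squares(n, num_threads=4):
    """Each thread reduces its own block privately; the partial sums are
    combined once at the end, as a REDUCTION(+: x) clause would do."""
    def partial(block):
        return sum(i * i for i in block)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(partial, static_blocks(n, num_threads)))
    return sum(partials)
```

Because every thread accumulates into its own private partial sum, no critical section or atomic update is needed inside the loop; the only shared step is the final combination.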

OPENMP 3

Summary

Work decomposition

Ideal target system: uniform-access, shared-memory.

Specify where multiple threads should be applied, and how to assign work to those threads.

Pros:

Excellent programming interface for uniform-access, shared-memory machines.

Cons:

No way to specify locality in machines with non-uniform shared memory or distributed shared memory.

Multicore

Parallel Computing & MPI

Data Mining

  Expectation Maximization (EM)

  Deterministic Annealing (DA)

  Hidden Markov Model (HMM)

  Support Vector Machine (SVM)

EXPECTATION MAXIMIZATION (EM)

A general algorithm for maximum-likelihood (ML) estimation where the data are "incomplete" or the likelihood function involves latent variables.

An efficient iterative procedure.

Goal: estimate unknown parameters, given measurements.

Hill-climbing approach: guaranteed to reach a local maximum.

Two steps

E-step (Expectation): the missing data are estimated given the observed data and the current estimate of the model parameters.

M-step (Maximization): the likelihood function is maximized under the assumption that the missing data are known. (The estimated missing data from the E-step are used in lieu of the actual missing data.)
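As a concrete instance of the two steps, here is a minimal sketch of EM for a mixture of two one-dimensional Gaussians. The mixture model, the function names, and the starting values are assumptions chosen for illustration; the "missing data" are the unknown component memberships of each point.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(data, mean_a, mean_b, iterations=50):
    """Fit a two-component 1-D Gaussian mixture; returns (weights, means, variances)."""
    weights = [0.5, 0.5]
    means = [mean_a, mean_b]
    variances = [1.0, 1.0]
    for _ in range(iterations):
        # E-step: estimate the missing memberships as responsibilities.
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total])
        # M-step: maximize the likelihood as if the memberships were known.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(spread, 1e-6)  # floor keeps the density finite
    return weights, means, variances
```

Since EM is a hill-climbing method, the result depends on the starting means; a poor start can end in a different local maximum.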

DETERMINISTIC ANNEALING (DA)

Purpose: avoid local minima (optimization)

Clustering: an example of unsupervised learning

Simulated Annealing (SA)

A sequence of random moves is generated, and the random decision to accept a move depends on the cost of the resulting configuration relative to the current state cost (Monte Carlo method).

Deterministic Annealing (DA)

Deterministic: does not wander randomly (minimizes the free energy directly).

Annealing: still wants to avoid local minima with a certain level of uncertainty; maintains the free energy at its minimum.

Free energy: F = D - TH (T: temperature, H: Shannon entropy, D: cost)

At large T, the entropy (H) dominates, while at small T the cost dominates.

Annealing lowers the temperature so that the solution tracks continuously.
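A simplified sketch of deterministic annealing for one-dimensional clustering. It is an illustration under stated assumptions, not the full method: the number of centers is fixed at two (no cluster splitting test), the cost D is the squared distance, the soft assignments are taken as proportional to exp(-d/T), and the schedule parameters are arbitrary. The function name `da_cluster_1d` is hypothetical.

```python
import math


def da_cluster_1d(data, t_start=20.0, t_stop=0.01, cooling=0.8, inner_iters=30):
    """Anneal two centers from a single merged cluster down to low temperature."""
    mean = sum(data) / len(data)
    # Two slightly perturbed copies of the single high-temperature cluster.
    centers = [mean - 0.1, mean + 0.1]
    t = t_start
    while t > t_stop:
        for _ in range(inner_iters):
            sums = [0.0, 0.0]
            masses = [0.0, 0.0]
            for x in data:
                d = [(x - y) ** 2 for y in centers]
                d_min = min(d)
                # Soft assignment; subtracting d_min avoids dividing by zero.
                w = [math.exp(-(dj - d_min) / t) for dj in d]
                total = w[0] + w[1]
                for j in range(2):
                    p = w[j] / total
                    sums[j] += p * x
                    masses[j] += p
            # Deterministic update: each center is a probability-weighted mean.
            centers = [sums[j] / masses[j] for j in range(2)]
        t *= cooling  # annealing: lower the temperature gradually
    return sorted(centers)
```

At high temperature both centers sit near the overall mean because the entropy term dominates; as the temperature falls they separate and settle on the two groups, with no random moves involved.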

DA FOR CLUSTERING

Start with a single cluster, giving as the solution $Y_1$ as centroid.

For some annealing schedule for T, iterate the above algorithm, testing the covariance matrix in $X_i$ about each cluster center to see if it is "elongated".

Split the cluster if the elongation is "long enough".

You do not need to assume the number of clusters, but rather a final resolution T or equivalent.

HIDDEN MARKOV MODEL (HMM) 1

Markov model

A system which may be described at any time as being in one of a set of N distinct states, $S_1, S_2, \ldots, S_N$.

State transition probability

The special case of a discrete, first-order Markov chain:

$P[q_t = S_j \mid q_{t-1} = S_i, q_{t-2} = S_k, \ldots] = P[q_t = S_j \mid q_{t-1} = S_i]$  (1)

Furthermore, consider those processes in which the right-hand side of (1) is independent of time, thereby leading to the set of state transition probabilities $a_{ij}$ of the form

$a_{ij} = P[q_t = S_j \mid q_{t-1} = S_i], \quad 1 \le i, j \le N$, with $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$.

Initial state probability
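A small sketch of how the transition probabilities and the initial state probabilities combine for a first-order Markov chain. The two-state numbers and the function names are made up for illustration; states are indexed from 0 rather than 1.

```python
def is_stochastic(transition, tol=1e-12):
    """Check a_ij >= 0 and that every row of the transition matrix sums to 1."""
    return all(
        all(a >= 0.0 for a in row) and abs(sum(row) - 1.0) <= tol
        for row in transition
    )


def sequence_probability(states, initial, transition):
    """P(q_1, ..., q_T) = pi[q_1] * a[q_1][q_2] * ... * a[q_(T-1)][q_T]."""
    prob = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= transition[prev][cur]
    return prob


# Hypothetical two-state chain.
initial = [0.8, 0.2]
transition = [[0.9, 0.1],
              [0.5, 0.5]]
p = sequence_probability([0, 0, 1, 1], initial, transition)  # 0.8 * 0.9 * 0.1 * 0.5
```

The first-order property is what lets the joint probability factor into one term per step, each depending only on the previous state.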

HIDDEN MARKOV MODEL (HMM) 2

Hidden Markov Model

The observation is a probabilistic function of the state.

The state is hidden.

Elements of an HMM

N, the number of states in the model (although the states are hidden).

M, the number of distinct observation symbols per state, i.e., the discrete alphabet size.

The state transition probability distribution $A = \{a_{ij}\}$, where $a_{ij} = P[q_t = S_j \mid q_{t-1} = S_i]$, $1 \le i, j \le N$, with $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$.

The observation symbol probability distribution (emission probability) in state j, $B = \{b_j(k)\}$, where $b_j(k) = P[v_k \text{ at } t \mid q_t = S_j]$, $1 \le j \le N$, $1 \le k \le M$.

The initial state distribution $\pi = \{\pi_i\}$, where $\pi_i = P[q_1 = S_i]$, $1 \le i \le N$.

HIDDEN MARKOV MODEL (HMM) 3

Three Basic Problems for HMMs

1. Prob(observation seq | model): Given the observation sequence $O = O_1 O_2 \ldots O_T$ and a model $\lambda = (A, B, \pi)$, how do we efficiently compute $P(O \mid \lambda)$, the probability of the observation sequence, given the model?

2. Finding the optimal state sequence: Given the observation sequence $O = O_1 O_2 \ldots O_T$ and a model $\lambda = (A, B, \pi)$, how do we choose a corresponding state sequence $Q = q_1 q_2 \ldots q_T$ which is optimal in some meaningful sense (i.e., best "explains" the observations)?

3. Finding the optimal model parameters: How do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$?

HIDDEN MARKOV MODEL (HMM) 4

Solution to the three basic problems for HMMs

Solution to problem 1 (Forward-Backward procedure)

Enumeration (the straightforward way) is computationally infeasible.

Forward procedure: consider the forward variable $\alpha_t(i) = P(O_1 O_2 \ldots O_t, q_t = S_i \mid \lambda)$, i.e., the probability of the partial observation sequence $O_1 O_2 \ldots O_t$ (until time t) and state $S_i$ at time t, given the model $\lambda$.

Solution to problem 2 (Viterbi algorithm)

Optimality criterion: find the single best state sequence (path), i.e., maximize $P(Q \mid O, \lambda)$, which is equivalent to maximizing $P(Q, O \mid \lambda)$.

A formal technique for finding this single best state sequence exists, based on dynamic programming methods, and is called the Viterbi algorithm.
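Both solutions can be sketched in a few lines for a discrete HMM. This is a minimal version without the scaling that long sequences need, with states and symbols indexed from 0; the function names and the argument layout (`pi`, `a`, `b` as plain lists) are assumptions for illustration.

```python
def forward(obs, pi, a, b):
    """Return P(O | lambda) using the forward variable alpha_t(i)."""
    n = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * b[i][obs[0]] for i in range(n)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(O_{t+1})
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * a[i][j] for i in range(n)) * b[j][o]
            for j in range(n)
        ]
    # Termination: sum over the final states.
    return sum(alpha)


def viterbi(obs, pi, a, b):
    """Return (best state path, P(Q, O | lambda)) by dynamic programming."""
    n = len(pi)
    delta = [pi[i] * b[i][obs[0]] for i in range(n)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * a[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * a[best_i][j] * b[j][o])
        back.append(step)
        delta = new_delta
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for step in reversed(back):
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]
```

The two functions share the same recursion; the forward procedure sums over predecessor states where the Viterbi algorithm takes the maximum and remembers which predecessor achieved it. Each costs on the order of N²T operations instead of the N^T terms of direct enumeration.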

HIDDEN MARKOV MODEL (HMM) 5

Solution to problem 3 (Baum-Welch algorithm)

The third problem of HMMs is to determine a method to adjust the model parameters $(A, B, \pi)$ to maximize the probability of the observation sequence given the model.

Choose $\lambda = (A, B, \pi)$ such that $P(O \mid \lambda)$ is locally maximized, using an iterative procedure such as the Baum-Welch method (or equivalently the EM (expectation-maximization) method) or using gradient techniques.

Reestimation (iterative update and improvement): define $\xi_t(i, j)$, the probability of being in state $S_i$ at time t and state $S_j$ at time t+1, given the model and the observation sequence, i.e., $\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda)$.
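One re-estimation pass can be sketched as follows for a single, short observation sequence. This is a minimal, unscaled version with 0-based indices; the function name `baum_welch_step` and the extra quantity `gamma` (the probability of being in a state at time t, obtained from the forward and backward variables) are introduced here for illustration and are not defined on the slides.

```python
def baum_welch_step(obs, pi, a, b):
    """Return (new_pi, new_a, new_b, likelihood of obs under the OLD model)."""
    n, m, length = len(pi), len(b[0]), len(obs)
    # Forward variables alpha[t][i].
    alpha = [[pi[i] * b[i][obs[0]] for i in range(n)]]
    for t in range(1, length):
        alpha.append([
            sum(alpha[t - 1][i] * a[i][j] for i in range(n)) * b[j][obs[t]]
            for j in range(n)
        ])
    # Backward variables beta[t][i], with beta at the final time set to 1.
    beta = [[1.0] * n for _ in range(length)]
    for t in range(length - 2, -1, -1):
        beta[t] = [
            sum(a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
            for i in range(n)
        ]
    likelihood = sum(alpha[-1])
    # xi[t][i][j]: in state i at time t and in state j at time t + 1.
    xi = [[[alpha[t][i] * a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(length - 1)]
    # gamma[t][i]: in state i at time t.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(length)]
    new_pi = gamma[0][:]
    new_a = [[sum(xi[t][i][j] for t in range(length - 1))
              / sum(gamma[t][i] for t in range(length - 1))
              for j in range(n)] for i in range(n)]
    new_b = [[sum(gamma[t][j] for t in range(length) if obs[t] == k)
              / sum(gamma[t][j] for t in range(length))
              for k in range(m)] for j in range(n)]
    return new_pi, new_a, new_b, likelihood
```

Calling the function repeatedly, each time with the parameters it returned, gives the iterative update; the likelihood it reports should never decrease from one pass to the next, which is the hill-climbing behavior of EM.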
