
OVERVIEW OF MULTICORE, PARALLEL COMPUTING, AND DATA MINING


(1)


OVERVIEW OF

MULTICORE, PARALLEL COMPUTING,

AND DATA MINING

Indiana University

Computer Science Dept.

Seung-Hee Bae


(2)

OUTLINE

Motivation

Multicore

Parallel Computing

Data Mining

(3)

MOTIVATION

According to the “How Much Information?” project at UC Berkeley, print, film, and magnetic & optical storage media produced about 5 exabytes (an exabyte is a billion billion bytes) of new information in 2002.

5 exabytes ≈ 37,000 times the Library of Congress print collection (17 million books).

The rate of data growth will continue to accelerate through weblogs, digital photos & video, surveillance monitors, scientific instruments (sensors), instant messaging, etc.

Thus, we need more powerful computing platforms to deal with this much data.

To take advantage of multicore chips, it is critical to build software with scalable parallelism.

To deal with huge amounts of data and to utilize multicore, it is essential to develop data mining tools with highly scalable parallel algorithms.

(4)

RECOGNITION, MINING, AND

SYNTHESIS (RMS)

(from P. Dubey, “Recognition, Mining and Synthesis Moves Computers to the Era of Tera,” Technology@Intel Magazine, Feb. 2005.)

(5)


Motivation

Multicore

Toward Concurrency

What is Multicore?

Parallel Computing

Data Mining

(6)

TOWARD CONCURRENCY IN SOFTWARE

Exponential growth (Moore’s Law) is changing in character:

Clock speed (getting more cycles): it has become harder to exploit higher clock speeds (2 GHz in 2001, 3.4 GHz in 2004; where are we now?).

Execution optimization (more work per cycle): pipelining, branch prediction, multiple instructions per clock.

Is Moore’s Law over? Not yet: the number of transistors keeps increasing.

Hyperthreading: running two or more threads in parallel inside a single CPU. It does not help single-threaded applications.

Multicore: running two or more actual CPUs on one chip. It will boost reasonably well-written multithreaded applications, but not single-threaded ones.
(7)

7

WHAT IS MULTICORE?

A single chip containing multiple distinct processing engines.

E.g., a shared-cache dual-core architecture:

[Diagram: Core 0 and Core 1 on one chip, each with its own CPU and L1 cache, sharing a common cache.]

(8)

Motivation

Multicore

Parallel Computing

Parallel architectures (Shared-Memory vs. Distributed-Memory)

Decomposing Program (Data Parallelism vs. Task Parallelism)

MPI and OpenMP

(9)


PARALLEL COMPUTING: INTRODUCTION

Parallel computing
- More than just a strategy for achieving good performance: a vision for how computation can seamlessly scale from a single processor to virtually limitless computing power.

Parallel computing software systems
- Goal: to make parallel programming easier and the resulting applications more portable and scalable, while achieving good performance.

Component Parallel paradigm (explicit parallelism)
- One explicitly programs the different parts of a parallel application. E.g., MPI, PGAS, CCR & DSS, Workflow, DES.

Program Parallel paradigm (implicit parallelism)
- One writes a single program to describe the whole application; the compiler and runtime break the program into multiple parts that execute in parallel. E.g., OpenMP, HPF, HPCS, MapReduce.

Parallel computing challenges
- Concurrency & communication.
- Scalability and portability are difficult to achieve.
- Diversity of architectures.

(10)

PARALLEL ARCHITECTURE 1

Shared-memory machines
- Have a single shared address space that can be accessed by any processor.
- Examples: multicore, symmetric multiprocessor (SMP).
- Uniform Memory Access (UMA): access time is independent of the location.
- Use a bus or a fully connected network; hard to achieve scalability.

Distributed-memory machines
- The system memory is packaged with the individual nodes of one or more processors (i.e., separate computers connected by a network).
- E.g., clusters.

(11)


PARALLEL ARCHITECTURE 2

Shared-memory
- Pros: lower latency and higher bandwidth; data are available to all of the CPUs through load and store instructions; single address space.
- Cons: cache coherency must be handled carefully; synchronization is explicitly needed to access shared data; scalability issues.

Distributed-memory
- Pros: scalable, if a scalable interconnection network is used; quite fast local data access.
- Cons: communication is required to access data on a different processor; communication management problems (e.g., long latency).

(12)

PARALLEL ARCHITECTURE 3

Hybrid systems
- Distributed shared memory (DSM): a distributed-memory machine that allows a processor to directly access a datum in a remote memory. Latency varies with the distance to the remote memory, emphasizing the Non-Uniform Memory Access (NUMA) characteristics.
- SMP clusters.

(13)

PARALLEL PROGRAMMING MODEL

Shared-memory programming model
- Needs synchronization to preserve data integrity.
- More appropriate for shared-memory machines.
- E.g., Open Specifications for MultiProcessing (OpenMP).

Message-passing programming model
- Send-receive communication steps; communication is used to access a remote data location.
- More appropriate for distributed-memory machines.
- E.g., Message Passing Interface (MPI).

A shared-memory programming model can be used on distributed-memory machines, just as a message-passing programming model can be used on shared-memory architectures; however, the efficiency of the programming model differs.

(14)

PARALLEL PROGRAM: DECOMPOSITION 1

Data parallelism
- Subdivides the data domain of a problem into multiple regions and assigns different regions to different processors.
- Exploits the parallelism inherent in many large data structures.
- Same task on different data (SPMD).
- More commonly used in scientific problems.
- Features: a natural form of scalability.

Task parallelism
- Different processors carry out different functions.
- Coarse-grain parallelism: different tasks on the same or different data.
- Features: parallelism limited in size (tens, not millions); synchronization is probably good.

Parallelism and decomposition styles can be combined in a single application (a minimal sketch of block data decomposition follows below).
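
As an illustration of data decomposition (not from the original slides), the following C sketch assigns each worker a contiguous block of an N-element domain, SPMD style; the function name block_range and the worker count are hypothetical.

```c
#include <stdio.h>

/* Block decomposition: compute the contiguous [start, end) range that
 * worker `rank` out of `nworkers` owns when n items are split as evenly
 * as possible (SPMD style: every worker runs the same code on its block). */
static void block_range(long n, int rank, int nworkers, long *start, long *end)
{
    long base = n / nworkers;      /* minimum block size          */
    long rem  = n % nworkers;      /* first `rem` workers get +1  */
    *start = rank * base + (rank < rem ? rank : rem);
    *end   = *start + base + (rank < rem ? 1 : 0);
}

int main(void)
{
    const long N = 1000000;        /* hypothetical problem size      */
    const int  P = 4;              /* hypothetical number of workers */
    for (int rank = 0; rank < P; ++rank) {
        long lo, hi;
        block_range(N, rank, P, &lo, &hi);
        printf("worker %d owns [%ld, %ld)\n", rank, lo, hi);
    }
    return 0;
}
```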

(15)


PARALLEL PROGRAM: DECOMPOSITION 2

Load balance and scalability
- Scalable: the running time is inversely proportional to the number of processors used.
- Speedup(n) = T(1)/T(n); scalable if speedup(n) ≈ n.
- Second definition of scalability (scaled speedup): scalable if the running time remains the same when the number of processors and the problem size are increased by a factor of n.

Why is scalability not achieved?
- A region that must be run sequentially: total speedup ≤ T(1)/T_s, where T_s is the time of the sequential region (Amdahl’s Law; see the formulation below).
- A requirement for a high degree of communication or coordination.
- Poor load balance (balancing the load is a major goal of parallel programming): if one of the processors takes half of the parallel work, the speedup is limited to about 2.
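
For reference, a generic restatement of the Amdahl's Law bound above (not taken from the slides), assuming a fraction s of the total work is inherently sequential, so that T_s = s·T(1):

```latex
% Amdahl's Law: with a sequential fraction s, n processors give
% T(n) >= s*T(1) + (1 - s)*T(1)/n, hence
\[
  \mathrm{speedup}(n) = \frac{T(1)}{T(n)}
  \;\le\; \frac{1}{\,s + \frac{1-s}{n}\,}
  \;\le\; \frac{1}{s} = \frac{T(1)}{T_s}.
\]
```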

(16)

MEMORY MANAGEMENT

Memory-hierarchy management
- Blocking: ensuring that data remain in cache between subsequent accesses to the same memory location.

Elimination of false sharing
- False sharing: when two different processors access distinct data items that reside on the same cache line.
- Ensure that data used by different processors reside on different cache lines, by padding (inserting empty bytes in a data structure); see the sketch below.

Communication minimization and placement
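
A minimal C sketch of the padding idea (illustrative, not from the original slides): each thread's counter is padded out to a full cache line so that updates by different threads never touch the same line. The 64-byte line size and the use of OpenMP threads are assumptions.

```c
#include <omp.h>
#include <stdio.h>

#define CACHE_LINE 64            /* assumed cache-line size in bytes */
#define NTHREADS   4

/* Each counter occupies its own cache line, so concurrent updates by
 * different threads do not invalidate each other's cached copies. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

int main(void)
{
    struct padded_counter counts[NTHREADS] = {{0}};

    #pragma omp parallel num_threads(NTHREADS)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < 10000000L; ++i)
            counts[tid].value++;     /* no false sharing across threads */
    }

    long total = 0;
    for (int t = 0; t < NTHREADS; ++t)
        total += counts[t].value;
    printf("total = %ld\n", total);
    return 0;
}
```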

(17)


MESSAGE PASSING INTERFACE (MPI)

Message Passing Interface (MPI)
- A specification for a set of functions for managing the movement of data among sets of communicating processes.
- The dominant scalable parallel computing paradigm for scientific problems.
- Explicit message send and receive using a rendezvous model.
- Point-to-point communication and collective communication.
- Commonly implemented in terms of an SPMD model: all processes execute essentially the same logic.
- Pros: scalable and portable; race conditions are avoided (implicit synchronization with completion of the copy).
- Cons: explicit, relatively low-level programming; the programmer must manage communication and data decomposition.

(18)

MPI

Six key functions (used together in the sketch after this list):

MPI_INIT

MPI_COMM_SIZE

MPI_COMM_RANK

MPI_SEND

MPI_RECV

MPI_FINALIZE

Collective Communications
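
A minimal SPMD sketch (illustrative, not from the original slides) that uses exactly the six functions listed above: every nonzero rank sends its rank number to rank 0. Compile with mpicc and launch with mpirun.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal point-to-point example: ranks 1..size-1 send their rank to
 * rank 0, which receives and prints the values. */
int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int src = 1; src < size; ++src) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received %d from rank %d\n", value, src);
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```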

(19)


OPEN SPECIFICATIONS FOR MULTIPROCESSING (OpenMP) 1

- Appropriate for uniform-access, shared-memory machines.
- A sophisticated set of annotations (compiler directives) for traditional C, C++, or Fortran codes to aid compilers in producing parallel code.
- Provides parallel loops and collective operations such as summation over loop indices.
- Provides lock variables to allow fine-grain synchronization between threads.
- Specifies where multiple threads should be applied, and how to assign work to those threads.
- Pros: an excellent programming interface for uniform-access, shared-memory machines.
- Cons: no way to specify locality in machines with non-uniform shared memory or distributed memory.

(20)

OpenMP 2

Directives instruct the compiler to create threads, perform synchronization operations, and manage shared memory.

Examples (see the C sketch below):
- PARALLEL DO ~ END PARALLEL DO
- SCHEDULE(STATIC)
- SCHEDULE(DYNAMIC)
- REDUCTION(+: x)
- PARALLEL SECTIONS

OpenMP synchronization primitives
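
A small C analogue of the directives listed above (illustrative, not from the original slides): a parallel loop with a static schedule and a sum reduction. The harmonic-sum computation is just a placeholder workload.

```c
#include <omp.h>
#include <stdio.h>

/* Parallel loop with a static schedule and a reduction over the loop
 * index, corresponding to PARALLEL DO, SCHEDULE(STATIC), REDUCTION(+: x). */
int main(void)
{
    const int n = 1000000;
    double x = 0.0;

    #pragma omp parallel for schedule(static) reduction(+: x)
    for (int i = 0; i < n; ++i)
        x += 1.0 / (i + 1.0);   /* per-thread partial sums combined by the reduction */

    printf("sum = %f (up to %d threads)\n", x, omp_get_max_threads());
    return 0;
}
```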

(21)


Motivation

Multicore

Parallel Computing

Data Mining

Expectation Maximization (EM)

Deterministic Annealing (DA)

Hidden Markov Model (HMM)

Other Important Algorithms

(22)

EXPECTATION MAXIMIZATION (EM)

Expectation Maximization (EM)
- A general algorithm for maximum-likelihood (ML) estimation when the data are “incomplete” or the likelihood function involves latent variables.
- An efficient iterative procedure.
- Goal: estimate unknown parameters, given measurements.
- A hill-climbing approach: guaranteed to reach a maximum (possibly only a local maximum).

Two steps: the Expectation (E) step and the Maximization (M) step, iterated until convergence (see the formulation below).
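
A standard statement of the two EM steps (generic, not tied to any particular model in the slides), where X is the observed data, Z the latent variables, and θ the unknown parameters:

```latex
% E-step: expected complete-data log-likelihood under the current estimate
\[
  Q\bigl(\theta \mid \theta^{(t)}\bigr)
  = \mathbb{E}_{Z \mid X,\, \theta^{(t)}}\!\left[\log p(X, Z \mid \theta)\right]
\]
% M-step: choose the parameters that maximize that expectation
\[
  \theta^{(t+1)} = \arg\max_{\theta}\; Q\bigl(\theta \mid \theta^{(t)}\bigr)
\]
% Each iteration never decreases p(X | theta), which gives the
% hill-climbing behaviour described above.
```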

(23)


DETERMINISTIC ANNEALING (DA)

Purpose: avoid local minima (optimization).

Simulated Annealing (SA)
- A sequence of random moves is generated, and the random decision to accept a move depends on the cost of the resulting configuration relative to the current state cost (a Monte Carlo method).

Deterministic Annealing (DA)
- Uses expectations instead of stochastic simulations (random moves).
- Deterministic: makes incremental progress on the average (minimizes the free energy F directly).
- Annealing: still wants to avoid local minima at a certain level of uncertainty; minimizes the cost at a prescribed level of randomness (Shannon entropy).
- Free energy: F = D - TH (D: cost, T: temperature, H: Shannon entropy).
- At large T the entropy term (H) dominates, while at small T the cost dominates; annealing lowers the temperature so that the solution is tracked continuously.

(24)

DA FOR CLUSTERING

This is an extended K-means algorithm.
- Start with a single cluster, whose solution is the centroid Y_1.
- For some annealing schedule for T, iterate the above algorithm, testing whether clusters split as T is lowered (the soft assignment used at each T is sketched below).

(25)

DA CLUSTERING RESULTS (GIS)

[Figures: DA clustering of GIS demographic (census) data; panels show age under 5 vs. ages 25 to 34, and age under 5 vs. ages 75 and up.]

(26)

HIDDEN MARKOV MODEL (HMM) 1

A system is in one of a set of N distinct states, S_1, S_2, ..., S_N, at any time.
- State transition probability: the special case of a discrete, first-order Markov chain is
  P[q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, ...] = P[q_t = S_j | q_{t-1} = S_i]   (1)
- Considering the right-hand side of (1) to be independent of time leads to the set of state transition probabilities a_ij of the form a_ij = P[q_t = S_j | q_{t-1} = S_i].
- The observation is a probabilistic function of the state; the state itself is hidden.
- Applications: speech recognition, bioinformatics, etc.

Elements of an HMM
- N, the number of states
- M, the number of symbols
- A = {a_ij}, the state transition probability distribution
- B = {b_j(k)}, the symbol emission probability distribution in state j
- π = {π_i}, the initial state distribution (completing λ = (A, B, π))

(27)

HIDDEN MARKOV MODEL (HMM) 2

Three basic problems
1. Probability of the observation sequence given the model: given the observation sequence O = O_1 O_2 ... O_T and a model λ = (A, B, π), how do we efficiently compute P(O | λ)?
2. Finding the optimal state sequence: given O = O_1 O_2 ... O_T and λ = (A, B, π), how do we choose a corresponding state sequence Q = q_1 q_2 ... q_T that is optimal in some meaningful sense (i.e., that best “explains” the observations)?
3. Finding the optimal model parameters: how do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?

Solutions of those problems
1. P(O | λ): direct enumeration is computationally infeasible; use the Forward procedure, with α_t(i) = P(O_1 O_2 ... O_t, q_t = S_i | λ) (the recursion is sketched below).
2. Optimal state sequence (path): the Viterbi algorithm, a dynamic programming method with δ_t(i) = max over q_1, ..., q_{t-1} of P[q_1 q_2 ... q_{t-1}, q_t = S_i, O_1 O_2 ... O_t | λ], followed by path backtracking.
3. Optimal model parameters: the Baum-Welch method; choose λ = (A, B, π) such that P(O | λ) is locally maximized. This is essentially an iterative EM method, using ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, λ).
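
A standard statement of the Forward procedure recursion referred to above (generic HMM formulation, not reproduced from the slides):

```latex
% Initialization
\[
  \alpha_1(i) = \pi_i\, b_i(O_1), \qquad 1 \le i \le N
\]
% Induction
\[
  \alpha_{t+1}(j) = \Bigl[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Bigr] b_j(O_{t+1}),
  \qquad 1 \le t \le T - 1
\]
% Termination
\[
  P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)
\]
% Cost is O(N^2 T), versus the N^T state sequences of direct enumeration.
```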

(28)

OTHER IMPORTANT ALGS.

Other Data Mining Algorithms

Support Vector Machine (SVM)

K-means (special case of DA clustering), Nearest-neighbor

Decision Tree, Neural network, etc.

Dimension Reduction

GTM (Generative Topographic Map)

MDS (MultiDimensional Scaling)

(29)

SUMMARY

Era of multicore: parallelism is essential.

Explosion of information from many kinds of sources.

We are interested in scalable parallel data-mining algorithms:
- Clustering algorithms (DA clustering)
  - GIS (demographic/census data): visualization is natural.
  - Cheminformatics: dimension reduction is necessary for visualization.
- Visualization (dimension reduction)
- Hidden Markov Models, ...

(30)

THANK YOU!
