UNIVERSITY OF TUNIS EL MANAR
FACULTY OF SCIENCES OF TUNIS
Mining Association Rules on Grid Platforms
Raja Tlili
Yahya Slimani
Outline
Introduction
Association rules
The need for parallel computing
Workload balancing: Problem description
Workload balancing in association rule mining algorithms
Workload balancing in Grid computing
Introduction (1)
Data vs. Knowledge
Databases contain data (explicit) and knowledge (hidden)
Knowledge is more important than data
Decision making: to increase revenues and reduce costs
Introduction (2)
What is data mining?
Extracting knowledge from a large volume of data:
• Non-trivial
• Implicit
• Previously unknown
Association rules (1)
The use of knowledge
Finding the rules A ⇒ B with support ≥ minsup and confidence ≥ minconf
Association rules (2)
Example: clients buying milk; clients buying sugar; clients buying both.
• Support, s: probability that a transaction contains {A, B}
• Confidence, c: conditional probability that a transaction containing A will also contain B

Transaction | Items
T1          | A B C D E F G H I
T2          | ..
T3          | ..
T4          | ..

The support and confidence thresholds, MinSup and MinConf, are fixed by the user.
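The two measures can be computed directly from a transaction list; a minimal sketch (transaction data and item names are hypothetical, not from the slides):

```python
# Minimal sketch: computing support and confidence for a rule A => B.
# Transactions are modeled as Python sets of item labels.

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(consequent in t | antecedent in t)."""
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))

transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"milk", "eggs"},
    {"sugar", "flour"},
]

# Rule {milk} => {sugar}: support = 2/4, confidence = 2/3
print(support(transactions, {"milk", "sugar"}))           # 0.5
print(confidence(transactions, {"milk"}, {"sugar"}))
```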
Extracting association rules : how ?
Finding all association rules respecting MinSup and MinConf.
Objective (problem decomposition):
1. Finding all frequent itemsets (support ≥ MinSup)
2. Generating rules from the frequent itemsets (confidence ≥ MinConf)
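Step 1 dominates the cost. A minimal level-wise sketch in the spirit of Apriori (this is not the authors' implementation; candidate generation is deliberately naive, and the data is hypothetical):

```python
# Illustrative level-wise (Apriori-style) frequent-itemset mining sketch.
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Return {itemset: count} for itemsets with absolute support >= minsup."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    k = 1
    current = [frozenset([i]) for i in items]
    while current:
        # One pass over the database per level -- the repeated scans are
        # what makes these algorithms iterative and I/O intensive.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(survivors)
        # Naive (k+1)-candidate generation by joining frequent k-itemsets.
        keys = list(survivors)
        current = list({a | b for a, b in combinations(keys, 2)
                        if len(a | b) == k + 1})
        k += 1
    return frequent

tx = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(sorted(map(sorted, frequent_itemsets(tx, minsup=3))))
# [['A'], ['A', 'B'], ['A', 'C'], ['B'], ['B', 'C'], ['C']]
```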
The need for parallel computing
• Databases to be mined are often very large (in GB and TB)
• Transactional databases have to be scanned repeatedly (iteratively)
• The need for fast algorithms for discovering association rules
Workload balancing
Main challenges facing parallelism:
• Synchronization & communication minimization
• Finding good data layout & data decomposition
• Workload balancing
• Disk I/O minimization
Load balancing: Problem description
Workload balancing is the assignment of work to processors in a way that maximizes application performance, minimizing the overall execution time.
Causes of load imbalance
• Homogeneous environment: even if we equally partition the DB, imbalance would occur due to the differences in data correlation.
• Heterogeneous platforms: have different processor capacities and network speeds.
Related work
The majority of current approaches use static load balancing, based on finding some intelligent way of partitioning the database.
Taxonomy of load balancing policies
• Static: one-time assignment
• Dynamic: reassignment during execution
  - Centralized vs. Distributed
  - Local vs. Global
  - Adaptive vs. Non-Adaptive
  - Cooperative vs. Non-Cooperative

Proposed Load Balancing Approach: Characteristics
Proposed Load Balancing Approach:
Goals
Improving the efficiency and the scalability of ARM algorithms under Grid platforms:
• Exploiting parallelism at various levels;
• Considering the particular features of the target platform.
Proposed load balancing model

Let G = (S_1, S_2, …, S_T), where each site
S_i = (M_i, Coord(S_i), Mem_i, Stor_i, Band_i):
• M_i: total number of clusters in S_i
• Coord(S_i): coordinator node of the site S_i
• Mem_i: memory size
• Stor_i: capacity of the storage subsystem
• Band_i: bandwidth size of the network
• Cl_ij: cluster j of S_i
• Coord(Cl_ij): cluster coordinator
• nd_ijk: node k of Cl_ij

Site capacities aggregate over the site's clusters:
Mem_i = Σ_{j=1..M_i} Mem_ij,  Stor_i = Σ_{j=1..M_i} Stor_ij

[Figure: network of sites with coordinators, clusters, and local database partitions (DB1, DB3, …)]
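The model above can be mirrored as plain data structures; a sketch using the slides' notation (field names follow the slides, but the values and node names are hypothetical):

```python
# Sketch of the Grid model: G = (S_1, ..., S_T), with each site
# S_i = (M_i, Coord(S_i), Mem_i, Stor_i, Band_i).
from dataclasses import dataclass, field

@dataclass
class Cluster:
    coordinator: str   # Coord(Cl_ij): cluster coordinator node
    nodes: list        # nd_ijk: nodes of cluster Cl_ij
    mem: int           # memory contributed by the cluster (MB)
    stor: int          # storage contributed by the cluster (GB)

@dataclass
class Site:
    coordinator: str   # Coord(S_i): site coordinator node
    band: int          # Band_i: network bandwidth
    clusters: list = field(default_factory=list)

    @property
    def M(self):       # M_i: total number of clusters in S_i
        return len(self.clusters)

    @property
    def mem(self):     # Mem_i: sum of the clusters' memories
        return sum(c.mem for c in self.clusters)

    @property
    def stor(self):    # Stor_i: sum of the clusters' storage
        return sum(c.stor for c in self.clusters)

grid = [  # G = (S_1, ..., S_T)
    Site("coord-1", band=1000,
         clusters=[Cluster("cl-coord-11", ["nd-111", "nd-112"], 8192, 500),
                   Cluster("cl-coord-12", ["nd-121"], 4096, 250)]),
]
print(grid[0].M, grid[0].mem, grid[0].stor)  # 2 12288 750
```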
DB
Load balancing strategy: (1) Before execution
Steps:
Step I: k = 1
• Partitioning the database D between sites according to their capacities
• Every processor has its local database partition
[Figure: D split into DB partitions 1..n assigned to sites S1..Sn; within a site, Coord(S_i) distributes the partition to processors P0..P3]
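The capacity-proportional split can be sketched as follows (how a site's capacity is scored is an assumption here; the slides only state that D is partitioned between sites according to their characteristics):

```python
# Sketch: splitting a transaction database between sites proportionally
# to a per-site capacity score (hypothetical scoring).

def partition_db(transactions, capacities):
    """Assign contiguous chunks of `transactions`, sized in proportion
    to each site's capacity score."""
    total = sum(capacities)
    n = len(transactions)
    partitions, start = [], 0
    for i, cap in enumerate(capacities):
        # The last site takes the remainder, so rounding loses nothing.
        end = n if i == len(capacities) - 1 else start + round(n * cap / total)
        partitions.append(transactions[start:end])
        start = end
    return partitions

db = list(range(8))                  # 8 dummy transactions
parts = partition_db(db, [2, 1, 1])  # site 1 is twice as capable
print([len(p) for p in parts])       # [4, 2, 2]
```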
Load balancing strategy: (2) During execution
From the intra-site level:
[Figure: each processing node periodically sends a state vector to its site coordinator over the network]
From the Grid level:
The coordinators of different sites periodically exchange global state information.
[Figure: site coordinators exchanging global state information over the network]
Intra-site candidates migration:
[Figure: candidate itemsets {A, B, C, ..} migrated between the nodes of a site]
Inter-site transactions migration:
[Figure: transactions (e.g., T: A,B,C,I,J) migrated between sites over the network]
Load balancing strategy: (2) During execution
The coordinator sends a migration plan to all processing nodes and instructs them to reallocate the workload.
The previously mentioned process is periodically invoked: coordinators check the workload.
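The periodic coordinator check can be sketched as follows (the imbalance threshold, the plan format, and the node names are all hypothetical; the slides only state that coordinators periodically check workloads and send a migration plan):

```python
# Sketch of the coordinator's periodic check: compare node workloads and
# build a migration plan moving work from overloaded to underloaded nodes.

def migration_plan(loads, tolerance=0.25):
    """Return (donor, receiver, amount) moves that bring each node's
    load within `tolerance` of the mean."""
    mean = sum(loads.values()) / len(loads)
    hi = {n: l - mean for n, l in loads.items() if l > mean * (1 + tolerance)}
    lo = {n: mean - l for n, l in loads.items() if l < mean * (1 - tolerance)}
    plan = []
    for donor, surplus in hi.items():
        for receiver in list(lo):
            if surplus <= 0:
                break
            amount = min(surplus, lo[receiver])
            plan.append((donor, receiver, amount))
            surplus -= amount
            lo[receiver] -= amount
            if lo[receiver] == 0:
                del lo[receiver]
    return plan

# Node "nd1" is overloaded, "nd3" nearly idle.
print(migration_plan({"nd1": 90, "nd2": 50, "nd3": 10}))
# [('nd1', 'nd3', 40.0)]
```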
Experimentation under a Grid
Experimental results under a Grid computing environment: Grid'5000, a grid constituted of 5000 CPUs distributed over 9 sites: Lille, Rennes, Orsay, Nancy, Lyon, Bordeaux, Grenoble, Toulouse, Sophia.

Database  | Size   | Transactions | Items | Avg. transaction size
DB100T13M | 100 MB | 1 300 000    | 4000  | 25
Experimental results
[Figure (b): runtime (sec) on DB100T13M — sequential vs. parallel without load balancing vs. parallel with load balancing, on 2 sites]
Each site contains 2 clusters; 16 computational nodes: 3 nodes/cluster 1, 2 nodes/cluster 2, …
Experimental results
There is no fixed optimal number of processors that could be used for execution. The number of processors used should be proportional to the size of the data sets to be mined. The easiest way to determine …
Conclusion and future works
Association rule mining algorithms have a simple statement, but they are computationally and I/O intensive (performance problem).
Parallel & distributed computing is essential for providing scalable mining solutions, and can play an important role in improving performance.
The dynamic nature of association rule mining algorithms causes workload imbalance.
We developed a distributed dynamic load balancing strategy under a Grid Computing environment.
Conclusion and future works
Experimentations showed that our strategy succeeded in reducing the execution time of iterative association rule mining algorithms (good distribution of workload among the processors of the Grid).
Work migration has long been known in "task scheduling"; we adapted it to ARM algorithms.
Executing ARM algorithms under Grid platforms and obtaining good results, even with the various phases of synchronization.