UNIVERSITY OF TUNIS EL MANAR
FACULTY OF SCIENCES OF TUNIS
Mining Association Rules on Grid Platforms
Raja Tlili
Yahya Slimani
Outline
Introduction
Association rules
The need for parallel computing
Workload balancing: Problem description
Workload balancing in association rule mining algorithms
Workload balancing in Grid computing
Introduction (1)
Data vs. Knowledge
Databases contain data (explicit) and knowledge (hidden)
Knowledge is more important than data
Decision making: to increase revenues and reduce costs
Introduction (2)
What is data mining?
Extracting knowledge from a large volume of data:
• Non-trivial
• Implicit
• Previously unknown
Association rules (1)
The use of knowledge
Finding the rules A ⇒ B with support ≥ minsup and confidence ≥ minconf
Association rules (2)
Example: clients buying milk; clients buying sugar; clients buying both.
• Support, s: probability that a transaction contains {A, B}
• Confidence, c: conditional probability that a transaction containing A will also contain B

Transaction | Items
T1          | A B C D E F G H I
T2          | ..
T3          | ..
T4          | ..

The support and confidence thresholds, MinSup and MinConf, are fixed by the user.
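The two measures can be computed directly from a transaction list; a minimal sketch (transaction data and item names are hypothetical, not from the slides):

```python
# Minimal sketch: computing support and confidence for a rule A => B.
# Transactions are modeled as Python sets of item labels.

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(consequent in t | antecedent in t)."""
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))

transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"milk", "eggs"},
    {"sugar", "flour"},
]

# Rule {milk} => {sugar}: support = 2/4, confidence = 2/3
print(support(transactions, {"milk", "sugar"}))           # 0.5
print(confidence(transactions, {"milk"}, {"sugar"}))
```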
Extracting association rules : how ?
Finding all association rules respecting MinSup and MinConf.
Objective (problem decomposition):
1. Finding all frequent itemsets (support ≥ MinSup)
2. Generating rules from the frequent itemsets (confidence ≥ MinConf)
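Step 1 dominates the cost. A minimal level-wise sketch in the spirit of Apriori (this is not the authors' implementation; candidate generation is deliberately naive, and the data is hypothetical):

```python
# Illustrative level-wise (Apriori-style) frequent-itemset mining sketch.
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Return {itemset: count} for itemsets with absolute support >= minsup."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    k = 1
    current = [frozenset([i]) for i in items]
    while current:
        # One pass over the database per level -- the repeated scans are
        # what makes these algorithms iterative and I/O intensive.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(survivors)
        # Naive (k+1)-candidate generation by joining frequent k-itemsets.
        keys = list(survivors)
        current = list({a | b for a, b in combinations(keys, 2)
                        if len(a | b) == k + 1})
        k += 1
    return frequent

tx = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(sorted(map(sorted, frequent_itemsets(tx, minsup=3))))
# [['A'], ['A', 'B'], ['A', 'C'], ['B'], ['B', 'C'], ['C']]
```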
The need for parallel computing
• Databases to be mined are often very large (in GB and TB)
• Transactional databases have to be scanned repeatedly (iteratively)
• The need for fast algorithms for discovering association rules
Workload balancing
Main challenges facing parallelism:
• Synchronization & communication minimization
• Finding good data layout & data decomposition
• Workload balancing
• Disk I/O minimization
Load balancing: Problem description
Workload balancing is the assignment of work to processors in a way that maximizes application performance, minimizing the overall execution time.
Causes of load imbalance
• Homogeneous environment: even if we equally partition the DB, imbalance would occur due to the differences in data correlation.
• Heterogeneous platforms: have different processor capacities and network speeds.
Related work
The majority of current approaches use static load balancing, based on finding some intelligent way of partitioning the database.
Taxonomy of load balancing policies
• Static: one-time assignment
• Dynamic: reassignment during execution
  - Centralized vs. Distributed
  - Local vs. Global
  - Adaptive vs. Non-Adaptive
  - Cooperative vs. Non-Cooperative

Proposed Load Balancing Approach: Characteristics
Proposed Load Balancing Approach:
Goals
Improving the efficiency and the scalability of ARM algorithms under Grid platforms:
• Exploiting parallelism at various levels;
• Considering the particular features of the target platform.
Proposed load balancing model

Let G = (S_1, S_2, …, S_T), where each site
S_i = (M_i, Coord(S_i), Mem_i, Stor_i, Band_i):
• M_i: total number of clusters in S_i
• Coord(S_i): coordinator node of the site S_i
• Mem_i: memory size
• Stor_i: capacity of the storage subsystem
• Band_i: bandwidth size of the network
• Cl_ij: cluster j of S_i
• Coord(Cl_ij): cluster coordinator
• nd_ijk: node k of Cl_ij

Site capacities aggregate over the site's clusters:
Mem_i = Σ_{j=1..M_i} Mem_ij,  Stor_i = Σ_{j=1..M_i} Stor_ij

[Figure: network of sites with coordinators, clusters, and local database partitions (DB1, DB3, …)]
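The model above can be mirrored as plain data structures; a sketch using the slides' notation (field names follow the slides, but the values and node names are hypothetical):

```python
# Sketch of the Grid model: G = (S_1, ..., S_T), with each site
# S_i = (M_i, Coord(S_i), Mem_i, Stor_i, Band_i).
from dataclasses import dataclass, field

@dataclass
class Cluster:
    coordinator: str   # Coord(Cl_ij): cluster coordinator node
    nodes: list        # nd_ijk: nodes of cluster Cl_ij
    mem: int           # memory contributed by the cluster (MB)
    stor: int          # storage contributed by the cluster (GB)

@dataclass
class Site:
    coordinator: str   # Coord(S_i): site coordinator node
    band: int          # Band_i: network bandwidth
    clusters: list = field(default_factory=list)

    @property
    def M(self):       # M_i: total number of clusters in S_i
        return len(self.clusters)

    @property
    def mem(self):     # Mem_i: sum of the clusters' memories
        return sum(c.mem for c in self.clusters)

    @property
    def stor(self):    # Stor_i: sum of the clusters' storage
        return sum(c.stor for c in self.clusters)

grid = [  # G = (S_1, ..., S_T)
    Site("coord-1", band=1000,
         clusters=[Cluster("cl-coord-11", ["nd-111", "nd-112"], 8192, 500),
                   Cluster("cl-coord-12", ["nd-121"], 4096, 250)]),
]
print(grid[0].M, grid[0].mem, grid[0].stor)  # 2 12288 750
```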
DB
Load balancing strategy: (1) Before execution
Steps:
Step I: k = 1
• Partitioning the database D between sites according to their capacities
• Every processor has its local database partition
[Figure: D split into DB partitions 1..n assigned to sites S1..Sn; within a site, Coord(S_i) distributes the partition to processors P0..P3]
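The capacity-proportional split can be sketched as follows (how a site's capacity is scored is an assumption here; the slides only state that D is partitioned between sites according to their characteristics):

```python
# Sketch: splitting a transaction database between sites proportionally
# to a per-site capacity score (hypothetical scoring).

def partition_db(transactions, capacities):
    """Assign contiguous chunks of `transactions`, sized in proportion
    to each site's capacity score."""
    total = sum(capacities)
    n = len(transactions)
    partitions, start = [], 0
    for i, cap in enumerate(capacities):
        # The last site takes the remainder, so rounding loses nothing.
        end = n if i == len(capacities) - 1 else start + round(n * cap / total)
        partitions.append(transactions[start:end])
        start = end
    return partitions

db = list(range(8))                  # 8 dummy transactions
parts = partition_db(db, [2, 1, 1])  # site 1 is twice as capable
print([len(p) for p in parts])       # [4, 2, 2]
```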
Load balancing strategy: (2) During execution
From the intra-site level:
[Figure: each processing node periodically sends a state vector to its site coordinator over the network]
From the Grid level:
The coordinators of different sites periodically exchange global state information.
[Figure: site coordinators exchanging global state information over the network]
Intra-site candidates migration:
[Figure: candidate itemsets {A, B, C, ..} migrated between the nodes of a site]
Inter-site transactions migration:
[Figure: transactions (e.g., T: A,B,C,I,J) migrated between sites over the network]
Load balancing strategy: (2) During execution
The coordinator sends a migration plan to all processing nodes and instructs them to reallocate the workload.
The previously mentioned process is periodically invoked: coordinators check the workload.
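The periodic coordinator check can be sketched as follows (the imbalance threshold, the plan format, and the node names are all hypothetical; the slides only state that coordinators periodically check workloads and send a migration plan):

```python
# Sketch of the coordinator's periodic check: compare node workloads and
# build a migration plan moving work from overloaded to underloaded nodes.

def migration_plan(loads, tolerance=0.25):
    """Return (donor, receiver, amount) moves that bring each node's
    load within `tolerance` of the mean."""
    mean = sum(loads.values()) / len(loads)
    hi = {n: l - mean for n, l in loads.items() if l > mean * (1 + tolerance)}
    lo = {n: mean - l for n, l in loads.items() if l < mean * (1 - tolerance)}
    plan = []
    for donor, surplus in hi.items():
        for receiver in list(lo):
            if surplus <= 0:
                break
            amount = min(surplus, lo[receiver])
            plan.append((donor, receiver, amount))
            surplus -= amount
            lo[receiver] -= amount
            if lo[receiver] == 0:
                del lo[receiver]
    return plan

# Node "nd1" is overloaded, "nd3" nearly idle.
print(migration_plan({"nd1": 90, "nd2": 50, "nd3": 10}))
# [('nd1', 'nd3', 40.0)]
```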
Experimentation under a Grid
Experimental results under a Grid computing environment: Grid'5000, a grid constituted of 5000 CPUs distributed over 9 sites: Lille, Rennes, Orsay, Nancy, Lyon, Bordeaux, Grenoble, Toulouse, Sophia.

Database  | Size   | Transactions | Items | Avg. transaction size
DB100T13M | 100 MB | 1 300 000    | 4000  | 25
Experimental results
[Figure (b): runtime (sec) on DB100T13M — sequential vs. parallel without load balancing vs. parallel with load balancing, on 2 sites]
Each site contains 2 clusters; 16 computational nodes: 3 nodes/cluster 1, 2 nodes/cluster 2, …
Experimental results
There is no fixed optimal number of processors that could be used for execution. The number of processors used should be proportional to the size of the data sets to be mined. The easiest way to determine …
Conclusion and future works
Association rule mining algorithms have a simple statement, but they are computationally and I/O intensive (performance problem).
Parallel & distributed computing is essential for providing scalable mining solutions, and can play an important role in improving performance.
The dynamic nature of association rule mining algorithms causes workload imbalance.
We developed a distributed dynamic load balancing strategy under a Grid Computing environment.
Conclusion and future works
Experimentations showed that our strategy succeeded in reducing the execution time of iterative association rule mining algorithms (good distribution of workload among the processors of the Grid).
Work migration has long been known in "task scheduling"; we adapted it to ARM algorithms.
Executing ARM algorithms under Grid platforms and obtaining good results, even with the various phases of synchronization.