6.5 The Parallel Task with Vertical Partitioning (DATA-VP) Algorithm
6.6.1 Messaging
Parallel ARM tends to entail a considerable exchange of messages as the task proceeds. Messaging represents a significant computational overhead, in some cases outweighing any other advantage gained. Usually the number of messages sent and the size of their content are significant factors affecting performance. It is therefore expedient, in the context of the techniques described here, to minimise both the number of messages that must be sent and their size.
The technique described here is a One-to-Many approach, where only the task agent can send/receive messages to/from DM agents. This involves fewer operations, although the significance of this advantage decreases as the number of agents used increases.
6.7 Experimentation and Analysis
To evaluate the two approaches, in the context of the EMADS vision, a number of experiments were conducted. These are described and analysed in this section. The experiments presented here were run on a single machine with an Intel Core 2 Duo E6400 CPU (2.13 GHz) and 3 GB of main memory (DDR2 800 MHz), running Fedora Core 6 (Linux kernel version 2.6.18). Up to six data partitions were used, together with two artificial datasets: (i) T20.D100K.N250.num, and (ii) T20.D500K.N500.num, where T = 20 (average number of items per transaction), D = 100K or D = 500K (number of transactions), and N = 250 or N = 500 (number of attributes). The datasets were generated using the IBM Quest generator used in Agrawal and Srikant [2].
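The dataset naming convention above can be decoded mechanically. The following is a minimal sketch, assuming that a "K" suffix denotes thousands; the helper name `parse_dataset_name` and the support for an "M" suffix are illustrative assumptions and are not part of EMADS or of the Quest generator.

```python
import re

# Hypothetical helper. Field meanings follow the convention stated in the text:
# T = average items per transaction, D = number of transactions,
# N = number of attributes. The "M" suffix is an assumption for illustration.
_FIELD = re.compile(r"^([TDN])(\d+)([KM]?)$")
_SCALE = {"": 1, "K": 1_000, "M": 1_000_000}

def parse_dataset_name(name):
    """Return a dict with keys 'T', 'D' and 'N' decoded from a dataset file name."""
    params = {}
    for part in name.split("."):
        match = _FIELD.match(part)
        if match:  # parts such as the 'num' extension do not match and are skipped
            key, digits, suffix = match.groups()
            params[key] = int(digits) * _SCALE[suffix]
    return params

print(parse_dataset_name("T20.D100K.N250.num"))
print(parse_dataset_name("T20.D500K.N500.num"))
```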
Figure 6.3: Average of Execution Time for Dataset T20.D100K.N250.num. Panels: (a) Number of Data Partitions, (b) Support Threshold
Figure 6.4: Average of Execution Time for Dataset T20.D500K.N500.num. Panels: (a) Number of Data Partitions, (b) Support Threshold
As noted above, the most significant overhead of any parallel system is the number and size of messages sent and received between agents. For the DATA-VP EMADS approach, the number of messages sent is independent of the number of levels in the T-tree; communication takes place only at the end of the tree construction. DATA-VP passes entire pruned sub (local) T-tree branches. Therefore, DATA-VP has a clear advantage in terms of the number of messages sent.
Figure 6.3 and Figure 6.4 show the effect of increasing the number of data partitions with respect to a range of support thresholds. As shown in Figure 6.3, the DATA-VP algorithm performs better than the DATA-HS algorithm. This is largely due to the smaller size of the dataset and to the T-tree data structure, which: (i) facilitates vertical distribution of the input dataset, and (ii) readily lends itself to parallelisation/distribution. However, when the data size is increased, as in the second experiment, and further DM (worker) agents are added (increasing the number of data partitions), the results in Figure 6.4 show that the increasing overhead of message size outweighs any gain from using additional agents, so that parallelisation/distribution becomes counterproductive. DATA-HS therefore benefited more from the addition of further data agents than the DATA-VP approach.
6.8 Discussion
MADM can be viewed as an effective distributed and parallel environment where the constituent agents function autonomously and (occasionally) exchange information with each other. EMADS is designed with asynchronous, distributed communication protocols that enable the participating agents to operate independently and collaborate with other peer agents as necessary, thus eliminating centralised control and synchronisation barriers.
Distributed and parallel DM can improve both efficiency and scalability: first, by executing the DM processes in parallel, which improves run-time efficiency; and second, by applying the DM processes to smaller subsets of data that are properly partitioned and distributed to fit in main memory (a data reduction technique). The scenario described in this chapter demonstrated that MADM provides suitable mechanisms for exploiting the benefits of parallel computing, particularly parallel data processing. The scenario also demonstrated that MADM is suitable for re-usability, and illustrated how re-usability is supported by re-employing the meta ARM task agent, described in the previous chapter, with the DATA-HS task.
6.9 Summary
In this chapter a MADM method for parallel ARM has been described so as to explore the MADM issues of scalability and re-usability. Scalability is explored by parallel processing of the data and re-usability is explored by re-employing the meta ARM task agent with the DATA-HS task.
The solution to the scenario considered in this chapter made use of a vertical data partitioning or a horizontal data segmentation technique to distribute the input data amongst a number of agents.
In the horizontal data segmentation (DATA-HS) method, the dataset was simply divided into segments, each comprising an equal number of records. Each segment was then assigned to a data agent, which allowed the meta ARM task to be used when employed on EMADS. Each DM agent then used its local data agent to generate a complete local T-tree for its allocated segment. Finally, the local T-trees were collated into a single T-tree which contained the overall frequent itemsets.
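The horizontal segmentation steps can be illustrated with a small sketch. It is a simplification under stated assumptions, not the EMADS implementation: a plain `Counter` of itemset supports stands in for each local T-tree, no local pruning is performed before collation, and the function names (`split_horizontally`, `local_counts`, `merge_and_prune`) and the toy records are hypothetical.

```python
from collections import Counter
from itertools import combinations

def split_horizontally(records, n_segments):
    """Divide the records into contiguous segments of (near) equal size."""
    size, extra = divmod(len(records), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        end = start + size + (1 if i < extra else 0)
        segments.append(records[start:end])
        start = end
    return segments

def local_counts(segment, max_size=2):
    """Stand-in for a local T-tree: support counts for itemsets up to max_size."""
    counts = Counter()
    for record in segment:
        items = sorted(set(record))
        for k in range(1, max_size + 1):
            counts.update(combinations(items, k))
    return counts

def merge_and_prune(all_local_counts, min_support_count):
    """Collate the local counts and keep itemsets meeting the global threshold."""
    total = Counter()
    for counts in all_local_counts:
        total.update(counts)
    return {itemset: c for itemset, c in total.items() if c >= min_support_count}

# Toy data (illustrative only): six records split across three agents.
records = [[1, 2, 3], [1, 2], [2, 3], [1, 3], [1, 2, 3], [2]]
segments = split_horizontally(records, 3)
frequent = merge_and_prune([local_counts(s) for s in segments], min_support_count=3)
print(frequent)
```

Because support counts are additive over disjoint segments, collating the unpruned local counts gives the same result as counting over the whole dataset. A method that prunes locally before collation needs extra care to recover counts discarded by some segments.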
The proposed vertical partitioning (DATA-VP) was facilitated by the T-tree data structure, and an associated mining algorithm (Apriori-T), that allowed for computationally effective parallel ARM when employed on EMADS.
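A vertical partitioning of this kind can be sketched as follows. The allocation rule used here (each agent owns a contiguous block of attributes and counts only those itemsets whose highest-numbered attribute lies in its block) is an assumption made for illustration. A `Counter` stands in for the T-tree branches, and the names `split_attributes` and `count_block` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def split_attributes(n_attributes, n_agents):
    """Divide attributes 1..n_attributes into contiguous (start, end) blocks."""
    size, extra = divmod(n_attributes, n_agents)
    blocks, start = [], 1
    for i in range(n_agents):
        end = start + size - 1 + (1 if i < extra else 0)
        blocks.append((start, end))
        start = end + 1
    return blocks

def count_block(records, block, max_size=2):
    """Count itemsets whose highest attribute falls in the block (assumed rule)."""
    start, end = block
    counts = Counter()
    for record in records:
        # Attributes above the block can never appear in this agent's itemsets.
        items = sorted(a for a in set(record) if a <= end)
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                if itemset[-1] >= start:
                    counts[itemset] += 1
    return counts

# Toy data (illustrative only): four attributes shared between two agents.
records = [[1, 2, 3, 4], [1, 2], [2, 3], [1, 3, 4], [1, 2, 3], [2, 4]]
blocks = split_attributes(4, 2)
partials = [count_block(records, b) for b in blocks]

combined = Counter()
for partial in partials:
    combined.update(partial)
print(blocks)
print(combined)
```

Under this rule the agents' results are disjoint, so they can be combined in a single exchange once local counting has finished. Each agent only needs the columns up to the end of its block.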
The reported experimental results showed that the data partitioning methods described are extremely effective in limiting the maximal memory requirements of the algorithm, while their execution time scales only slowly and linearly with increasing data dimensions. Their overall performance, both in execution time
Chapter 7
Classifier Generation
To further explore the issues associated with the concept of a generic MADM, a classifier generation scenario is considered in this chapter. The point of this scenario is to investigate how MADM can be used to support the sharing of resources and expertise while at the same time preserving intellectual property rights that may be attached to Data Mining (DM) software contributed by MADM users. The scenario is also intended to further demonstrate how extendibility is supported. The scenario considered is where a user wishes to generate a classifier for a specified dataset. In this scenario, a collection of classification (DM) agents is applied to a number of standard benchmark datasets held by data agents. A number of classifiers are generated, from which the best can be selected.
The chapter is organised as follows. In Section 7.1 the motivation for the scenario is described. Some background on data classification is presented in Section 7.2. The operation of EMADS with respect to classification is then described in Section 7.3. Extendibility considerations are described in Section 7.4. The evaluation of the MADM classification process is then described in Section 7.5. A summary is given in Section 7.6.
7.1 Motivation
Multi-Agent Systems (MAS) have some particular potential advantages to offer with respect to KDD in the context of sharing resources and expertise; namely, the MADM approach provides the possibility of greater end user access to Data Mining (DM) techniques. MADM can make use of algorithms without necessitating their transfer to users, thus contributing to the preservation of any intellectual property rights over the algorithms. This is a significant issue with respect to the EMADS vision espoused in this thesis. Recall that the vision is that of an anarchic collection of agents that can organically grow (using the EMADS extendibility feature). For this to happen, contributors must have confidence in the “security” of any software they may contribute (i.e. the intellectual property rights in the software). This feature of MADM may be illustrated in a number of ways; however, the scenario considered in this chapter is in the context of classification.
In the context of MADM, the classifier generation scenario allowed the investigation of a number of the MADM issues identified in Chapter 1, namely: (i) the sharing of resources and expertise through the use of the facilitator; and (ii) the protection of data privacy and intellectual property rights.
Although the preservation of the intellectual property in DM algorithms is the primary motivation for the work described in this chapter, there are also secondary motivations, namely:
• To further illustrate the EMADS extendibility feature.
• To demonstrate how EMADS agents can cooperate to produce an “optimum” solution to a DM problem.
In the scenario considered in this chapter, in addition to the preservation of the intellectual property in DM algorithms, EMADS is used to illustrate how MADM can provide a dynamic solution to the problem of selecting the best classifier for a given dataset. “Best” in this context is measured in terms of classification accuracy (although some other metric, such as precision, could equally well be used). The experiments reported in this chapter thus not only serve to illustrate the advantages of MADM, but also provide an interesting comparison of a variety of classification techniques and algorithms.
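Selecting the “best” classifier by accuracy can be sketched in a few lines. This is a minimal illustration, assuming that each classification agent returns its predictions for a common test set; the classifier names, labels, and the functions `accuracy` and `select_best` are hypothetical and do not represent results or interfaces from EMADS.

```python
def accuracy(predicted, actual):
    """Fraction of test records whose predicted class matches the actual class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def select_best(predictions_by_classifier, actual):
    """Score every classifier and return the name of the most accurate one."""
    scores = {
        name: accuracy(predicted, actual)
        for name, predicted in predictions_by_classifier.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Toy labels (illustrative only, not experimental data).
actual = ["a", "b", "a", "a"]
predictions = {
    "classifier_A": ["a", "b", "b", "a"],
    "classifier_B": ["a", "a", "b", "a"],
}
best, scores = select_best(predictions, actual)
print(best, scores)
```

Swapping `accuracy` for another scoring function, such as precision, changes the selection criterion without altering the selection step.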