2. Scientific Data Analysis, Data Mining and Data Analysis Environments
3.4. Case Study DataMiningGrid
3.4.5. Discussion
In the first case study (Section 3.4.1) we have shown that it is possible to implement the architecture of our approach, based on the ADS, in the DataMiningGrid system.
In the case study on the Application Enabler (Section 3.4.2), we presented an implementation of the tool for grid-enabling data mining components based on a web-based GUI. This showed that users are able to integrate and reuse components easily. They do not need deep grid knowledge or know-how in parallel processing to integrate and execute their own data mining components in a grid environment. The details of the underlying grid technology are hidden. The implementation of our approach in the DataMiningGrid system addresses the users’ needs by providing a reuse mechanism that makes it possible to grid-enable their favourite component without writing any code, simply by providing metadata that can be specified via a web page.
In the case study on grid-enabling Weka (Section 3.4.3) we have shown how data mining components from the Weka toolkit, which was developed for a single-computer environment, can be integrated into a grid environment. In detail, we integrated the Weka components IBK, LWL and M5P into the DataMiningGrid system. This case study showed that common data mining components can be integrated and reused without modification of the component or the grid system. Our experimental evaluation showed that the requirements were met. The system is capable of handling even more complex scenarios, e.g. where algorithms or the data should not be moved. There is no need to integrate new workflow operations or new grid services. Because the scenarios can be adapted to run in the DataMiningGrid grid environment, the users of data mining benefit from all the advantages that grid technology provides, among them a user-friendly set-up of experiments, a decrease in runtime, and other benefits such as massive storage volume and an integrated security concept. The evaluation of the different scenarios emphasizes the easy set-up and submission of data mining tasks. As the runtime analysis shows, the flexibility of the system does not result in a large performance overhead. The runtime of the experiments depends on the speed and the number of available machines. Grid-enabled components in the DataMiningGrid system can achieve very good scalability.
The study on data mining scenarios (Section 3.4.4) demonstrated that the capability of our approach in the context of the DataMiningGrid system, extended by new grid-enabled components, is sufficient to carry out a variety of data mining scenarios. In detail, we presented how to set up scenarios for data partitioning, classifier comparison and parameter optimization based on grid-enabled components. Additional helper components were needed only for splitting a large data set into pieces, gathering results, or performing a simple vote.
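The three helper tasks named above (splitting a data set, gathering results, and a simple vote) are small enough to sketch in a few lines. The following Python code is only an illustration under our own assumptions: the function names `split_into_pieces`, `gather_results` and `majority_vote` are hypothetical and do not reproduce the helper components actually used in the DataMiningGrid system.

```python
from collections import Counter
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_pieces(records: Sequence[T], n_pieces: int) -> List[List[T]]:
    """Split records into n_pieces contiguous chunks of near-equal size.

    Hypothetical helper: each chunk could be processed by one grid job.
    """
    if n_pieces < 1:
        raise ValueError("n_pieces must be at least 1")
    base, extra = divmod(len(records), n_pieces)
    pieces, start = [], 0
    for i in range(n_pieces):
        # The first `extra` chunks receive one additional record.
        size = base + (1 if i < extra else 0)
        pieces.append(list(records[start:start + size]))
        start += size
    return pieces


def gather_results(partial_results: Sequence[Sequence[T]]) -> List[T]:
    """Concatenate the per-job result lists in job order."""
    return [item for part in partial_results for item in part]


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the smallest label."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)
```

In such a sketch, each piece would be handed to an independently executed job, and the per-job outputs would be merged with `gather_results` or, for an ensemble of classifiers, combined per instance with `majority_vote`.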
The potential of using grid technology for data mining lies in the fact that it is the simple distribution schemes that seem to give the biggest benefits in terms of performance. Our approach and its implementation in the DataMiningGrid system emerge not as a specialist technology, but as a general solution for integrating data mining components into grid environments. The system is user-friendly in the sense that a data miner is able to use it, from the inclusion of new data mining components to their execution in the context of complex experiments, without any specific knowledge of the system details.
3.5. Wrap-up
In this chapter we presented an approach for the integration of existing data mining components into OGSA-based grid environments. The approach allows for integrating data mining components that have been developed as executable files in a single-computer environment into grid environments. It is based on the Application Description Schema (ADS), which is an XML-based metadata schema that is used to manage user interaction with grid system components, and associated client-side components, which provide user interfaces and use the ADS for information exchange. The schema makes it possible to grid-enable existing data mining components, to register and search for available data mining components on the grid, to match requests for job executions with suitable computational resources, and to dynamically create user interfaces. We have shown that it is possible to cover all information necessary for the execution of data mining components in OGSA-based grid environments with a single XML schema and that it is possible to create a technical system for the execution of data mining components based on data exchange via the XML schema. Our approach allows for the reuse of a wide range of components existing in the field of data mining and addresses the requirements for reuse without component or platform modification, for transparency of the grid technology, for generality and for efficiency. We implemented and evaluated our approach in the DataMiningGrid system based on GT4 grid services and extensions to the Triana workflow environment.
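To make the idea of a single metadata description driving user interface generation, resource matching and execution more concrete, the following Python sketch may help. It is purely illustrative: the XML element and attribute names (`application`, `executable`, `requirements`, `option`), the flags and all function names are assumptions invented for this example and are not the actual ADS.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified description; NOT the real ADS schema.
EXAMPLE_DESCRIPTION = """
<application name="example-classifier">
  <executable>classifier.jar</executable>
  <requirements minMemoryMB="512" os="linux"/>
  <option name="k" flag="-K" type="int" default="1" label="Number of neighbours"/>
  <option name="input" flag="-t" type="file" label="Training data"/>
</application>
"""


def parse_description(xml_text):
    """Read the hypothetical description into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    requirements = root.find("requirements")
    return {
        "name": root.get("name"),
        "executable": root.findtext("executable"),
        "requirements": dict(requirements.attrib) if requirements is not None else {},
        "options": [dict(option.attrib) for option in root.findall("option")],
    }


def form_fields(description):
    """Derive (label, type, default) triples from which a GUI form could be built."""
    return [(o["label"], o["type"], o.get("default")) for o in description["options"]]


def matches(description, resource):
    """Check whether a resource satisfies the stated requirements."""
    req = description["requirements"]
    os_ok = "os" not in req or resource.get("os") == req["os"]
    memory_ok = resource.get("memoryMB", 0) >= int(req.get("minMemoryMB", 0))
    return os_ok and memory_ok


def build_arguments(description, values):
    """Combine user-supplied values and defaults into an argument list."""
    args = [description["executable"]]
    for option in description["options"]:
        value = values.get(option["name"], option.get("default"))
        if value is None:
            raise ValueError(f"missing value for option {option['name']!r}")
        args += [option["flag"], str(value)]
    return args
```

The point of the sketch is that the same parsed description serves all three purposes, so a component can be added by supplying metadata alone, without changing the component or the surrounding system.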
In addition to reusing existing data mining components, users need to compose and develop new data mining components. In the next chapter, we will present an approach for the flexible and interactive development of data mining scripts, which allow for the development and composition of data mining components.