2. Scientific Data Analysis, Data Mining and Data Analysis Environments
• Transparency: The details of the underlying technology of the grid system should be hidden from the users, since they usually do not have expertise in distributed systems.
Data mining processes include data-intensive and complex tasks that require many computational resources and considerable domain knowledge. It follows from CRISP [121], already described in Section 2.1.2, that many very different data mining processes exist in the form of iterative, complex workflows. Such workflows are published, e.g., in repositories such as myExperiment [58]. Thus, the following requirements have to be met [123]:
• Generality of the extensibility mechanism: The integration method should cover a variety of different data mining components and should support a wide range of data mining processes. Thus, it has to support the common steps of data mining processes that can be seen as standardizable as described in Section 2.1.3: chaining, looping, branching, parameter variation (parameter sweep), cross-validation, and data partitioning.
• Efficiency: Data mining components should run efficiently in the grid system, based on batch job execution and parallel processing of the standard tasks parameter sweep and cross-validation (see Section 2.1.3), to save execution time. This also requires shipping of data, i.e. sending the data to the machine where the data mining component is executed, and shipping of components, i.e. sending the component to the machine where the data it operates on is located. In addition, the integration into the grid environment should introduce only little overhead.
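The parallel processing of parameter sweep and cross-validation named in the efficiency requirement can be illustrated with a minimal Python sketch. It is not taken from any of the systems discussed here: the toy threshold classifier, the function names `k_fold_indices`, `run_job` and `sweep`, and the local thread pool standing in for grid batch jobs are all assumptions made for illustration. The point is only that every (parameter, fold) pair forms an independent job.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from statistics import mean


def k_fold_indices(n_items, n_folds):
    """Split range(n_items) into n_folds (train, test) index pairs."""
    folds = [list(range(i, n_items, n_folds)) for i in range(n_folds)]
    pairs = []
    for i, test in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        pairs.append((train, test))
    return pairs


def run_job(job):
    """One independent job: evaluate a toy threshold classifier on one fold."""
    threshold, train, test, data = job
    # Hypothetical component: predict label 1 when the value exceeds the
    # threshold. Training is a no-op here; a real component would fit on `train`.
    correct = sum((data[i][0] > threshold) == bool(data[i][1]) for i in test)
    return threshold, correct / len(test)


def sweep(data, thresholds, n_folds=3, max_workers=4):
    """Parameter sweep with cross-validation; each (parameter, fold) pair is a job."""
    folds = k_fold_indices(len(data), n_folds)
    jobs = [(t, train, test, data) for t, (train, test) in product(thresholds, folds)]
    # The thread pool is a local stand-in for submitting batch jobs to a grid.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_job, jobs))
    scores = {}
    for threshold, accuracy in results:
        scores.setdefault(threshold, []).append(accuracy)
    return {t: mean(a) for t, a in scores.items()}
```

Calling `sweep(data, [0.0, 0.5, 1.0])` on a list of `(value, label)` pairs returns the mean cross-validation accuracy per threshold. Because the jobs share no state, the number of jobs that can run concurrently grows with the product of parameter settings and folds.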
Shipping of data is important in cases where either the full data set is partitioned to facilitate distributed computation (e.g. for k-NN, where each object is assigned to the class most common among its k nearest neighbours [156]), or where the same source data set is moved to different machines and repeatedly analysed (e.g. for ensemble learning, where different data mining components are executed and the results are combined afterwards). Shipping components to the location of the data for execution helps to save data transfer time. This is one of the major options in setting up a distributed data mining process. It is required when, as is often the case, no pre-configured pool of machines is available that already has the data mining functionality installed. The option to ship components to data allows for flexibility in the selection of machines and reduces the overhead in setting up the data mining environment. It is especially important when the data is naturally distributed and cannot be merged. This may be the case when data sets are too large to be transferred without significant overhead or when, e.g., security policies prevent the data from being moved.
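The k-NN case over partitioned data can be sketched as follows. This is a hedged illustration, not the implementation of any cited system: it assumes one-dimensional numeric points, and the names `local_neighbours` and `classify` are hypothetical. Each partition reports only its own k nearest candidates, so just a short candidate list has to travel to the coordinator, which merges the lists and takes the majority vote.

```python
import heapq
from collections import Counter


def local_neighbours(partition, query, k):
    """Runs where one partition lives: return its k nearest (distance, label) pairs."""
    scored = ((abs(x - query), label) for x, label in partition)
    return heapq.nsmallest(k, scored)


def classify(partitions, query, k):
    """Coordinator: merge the local candidates and take a majority vote."""
    candidates = []
    for partition in partitions:  # each call could be a separate remote job
        candidates.extend(local_neighbours(partition, query, k))
    # The global k nearest neighbours are always among the local candidates.
    nearest = heapq.nsmallest(k, candidates)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

The merge step is exact because any point among the global k nearest neighbours is necessarily among the k nearest of its own partition.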
3.2. Related Work
In recent years, a number of environments for grid-enabling data mining tools have been described. The importance of extensibility for data mining platforms has already been argued in [158]. Today, many systems are capable of distributed data mining. While [123] gives a general comparison of a variety of systems, we focus here on grid-based systems that are related to OGSA.
GridMiner [22] is designed to support data mining and online analytical processing (OLAP) in distributed computing environments. The system is based on a service-oriented architecture (SOA) supporting OGSA grid services and OGSA-DAI database access. GridMiner implements a number of common data mining algorithms, including parallel versions. In the GridMiner system, each data mining component is integrated by wrapping it in a single OGSA-based grid service. This means that a new service has to be developed for each data mining component to be integrated. Thus, the approach does not fully support reuse by the users themselves, as integrating components by creating new grid services is complex for users. Furthermore, the approach does not support extensibility without component modification, as the components have to be developed into grid services.
The Federated Analysis Environment for Heterogeneous Intelligent Mining [8] (FAEHIM) implements a toolkit for grid-based data mining. It consists of grid services for data mining and a workflow engine for service composition. Based on algorithms taken from Weka, the grid services are divided into the types classification, clustering and association rules. The services are not limited to algorithms from Weka, but a new service has to be developed for each data mining component or set of components. Thus, similar to GridMiner, the approach does not fully support reuse by the users themselves, as integrating components by creating new grid services is complex for users. In addition, the approach does not foresee that the components can be executed as grid jobs, which limits the ability for efficient execution.
Weka4WS [127] is a framework for supporting distributed data mining in grid environments, designed using the Web Service Resource Framework (WSRF) to achieve integration and interoperability with standard grid environments. The Weka4WS system is based on the data mining toolkit Weka. A single web service interface is used to provide access to the data mining algorithms implemented in Weka. Thus, the extensibility of the system is restricted to algorithms that are contained in the Weka toolkit, which constrains the generality. Similar to FAEHIM, the approach does not foresee that the components can be executed as grid jobs, which limits the ability for efficient execution.
Knowledge Grid (K-Grid) [26] is a service-oriented framework that has been designed to provide grid-based data mining tools and services. The system facilitates data mining in distributed grid systems and related tasks such as data management and knowledge representation. The system architecture is organized in different layers: The Core K-Grid Services handle the publication and discovery of data sources, data mining and visualization tools, and mining results as well as the management of abstract execution plans that describe complex data mining processes. The High-level K-Grid Services are responsible for searching resources, the mapping of resource requests from the execution plans to the available resources in the grid, and the task execution. The Knowledge Directory Service (KDS) is responsible for maintaining the descriptions of components that can be used in the Knowledge Grid. The description of components is based on XML documents, which are stored in a Knowledge Metadata Repository (KMR). The metadata about data mining components includes information about the implemented