2. Scientific Data Analysis, Data Mining and Data Analysis Environments
3.3. Integration of Data Mining Components
3.3.3. Process of Registering and Executing Data Mining Components
Bioinformaticians who want to grid-enable their data mining components have to describe the components according to the ADS. This means that they have to create an instance of the ADS for their component. Instances of the ADS are XML documents at various levels of specification, which are passed among system components. The documents can differ in the specification of the values for the options, inputs, requirements, etc. For an execution, the ADS instance must be fully specified. This means that all non-optional values have to be set. The ADS is expressive enough to accommodate data mining components from a wide range of platforms, technologies, application domains, and sectors, and has been tested with eight different application domains [124]. Code listing 3.2 gives an abstract example. Detailed examples will be given as part of the case studies presented in Section 3.4.
Listing 3.2: An ADS instance (as pseudo code)

<?xml version="1.0"?>
<app:application ...>
  <generalInformation ...>
    <longDescription>...</longDescription>
    <comment .../>
  </generalInformation>
  <dataminingInformation .../>
  <executionInformation>
    <javaExecution ...>
      <interpreterArguments>...</interpreterArguments>
      <applicationRunFile .../>
    </javaExecution>
  </executionInformation>
  <applicationInformation>
    <options .../>
    <dataInputs .../>
    <dataOutputs .../>
    <requirements>
      <minMemory .../>
      <operatingSystem .../>
      <architecture ...>
        <value>...</value>
      </architecture>
    </requirements>
  </applicationInformation>
</app:application>
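The notion of a fully specified ADS instance can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that options and data inputs carry the attributes name, optional and value; these attribute names, the child elements option and input, and the function names are hypothetical and are not taken from the actual ADS schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical ADS fragment: the child elements and the attributes
# "name", "optional" and "value" are illustrative assumptions only.
EXAMPLE = """
<application>
  <applicationInformation>
    <options>
      <option name="minSupport" optional="false" value="0.1"/>
      <option name="maxRules" optional="true"/>
      <option name="algorithm" optional="false"/>
    </options>
    <dataInputs>
      <input name="trainingData" optional="false"/>
    </dataInputs>
  </applicationInformation>
</application>
"""


def missing_values(xml_text):
    """Return the names of all non-optional entries that have no value yet."""
    root = ET.fromstring(xml_text)
    missing = []
    for tag in ("option", "input"):
        for elem in root.iter(tag):
            optional = elem.get("optional", "false") == "true"
            if not optional and not elem.get("value"):
                missing.append(elem.get("name"))
    return missing


def is_fully_specified(xml_text):
    """An instance is fully specified when no non-optional value is missing."""
    return not missing_values(xml_text)


print(missing_values(EXAMPLE))
print(is_fully_specified(EXAMPLE))
```

In this sketch, an instance that still reports missing names would correspond to a partially specified ADS instance that can be passed among system components but cannot yet be executed.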
The Application Enabler, which is part of the architecture described in Section 3.3.1, is a tool that is used to create an ADS instance for a given data mining component. The process of grid-enabling data mining components is as follows:
1. ADS instance. Based on the ADS, the different types of information needed for a successful integration of the component in the grid are collected and formalized. The result is an instance of the ADS specifically created for the component.
2. Registration. After its creation, the ADS instance is stored in the grid registry via the Information Services of the grid middleware, and the executable is transferred to a storage component attached to the grid via the data services of the grid middleware.
Figure 3.4 illustrates how data mining components are grid-enabled. An implementation of the Application Enabler as a web application in the DataMiningGrid system will be described in Section 3.4.2.
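The two registration steps can be sketched as follows. This is a minimal Python sketch with in-memory stand-ins; the classes GridRegistry and GridStorage, the function grid_enable and the storage:// location scheme are hypothetical placeholders for the Information Services and data services of the grid middleware, not part of any real middleware API.

```python
from dataclasses import dataclass, field


@dataclass
class GridRegistry:
    """In-memory stand-in for the grid registry (hypothetical)."""
    entries: dict = field(default_factory=dict)

    def store(self, name, ads_xml):
        self.entries[name] = ads_xml


@dataclass
class GridStorage:
    """In-memory stand-in for a storage component on the grid (hypothetical)."""
    files: dict = field(default_factory=dict)

    def upload(self, name, payload):
        location = f"storage://{name}"  # made-up location scheme
        self.files[location] = payload
        return location


def grid_enable(name, ads_xml, executable, registry, storage):
    """Register a component: store its ADS instance, transfer its executable."""
    registry.store(name, ads_xml)
    return storage.upload(name, executable)


registry, storage = GridRegistry(), GridStorage()
location = grid_enable("rule-miner", "<app:application .../>", b"\x00\x01",
                       registry, storage)
print(location, sorted(registry.entries))
```

The sketch keeps the metadata (the ADS instance) and the executable in separate places, mirroring the separation between the grid registry and the storage component described above.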
When the component is successfully registered, it is ready to be found and executed. The tools for the execution of grid-enabled data mining components consist of the client tools Explorer, Control, Execution and Provenance (see Section 3.3.1).
Figure 3.4.: Reusing data mining components by registering them in the grid environment with the help of the ADS.
In the following, we describe how the ADS interacts with the grid system components during the execution of a data mining component. Figure 3.5 visualizes the interactions.
1. Search. With the Explorer tool the user searches the grid registry for the registered data mining component to execute by specifying search terms for metadata fields of the ADS. The Explorer tool accesses the Information Services of the grid middleware to search for information on the registered components based on the metadata from the ADS instances. By selecting a component, the respective instance of the ADS is fetched from the registry and is transferred to the Explorer tool. From there, it can be passed to the next tool (Control tool). Note that the ADS instance is not necessarily fully specified at this stage.
2. Parameters. The Control tool dynamically creates a GUI for specifying the options, inputs and requirements of the data mining component based on the information from the ADS instance. If the ADS instance is already fully specified, which means that all necessary parameters are set, it can be passed directly to the Execution tool. Otherwise, the user specifies the parameters and data inputs for the selected data mining component via this GUI. In addition, file or parameter sweeps can be configured. The ADS instance is then fully specified and ready for execution with the grid middleware.
3. Execution & Monitoring. The Execution tool is responsible for transforming the ADS instance into a format that can be passed to the Resource and Execution Management Services of the grid middleware. According to the specified requirements and parameters, the job is scheduled and executed on the hardware resources of the grid. During the execution, the job status can be monitored via the Execution tool, which acts as a client for the execution services of the grid middleware. When the execution is finished, further information about the execution is added to the ADS instance as provenance information. This information can be visualized by the Provenance tool.
4. Provenance. After execution, the Provenance tool provides a user interface for inspecting the ADS instance and other relevant information on the execution, and for storing them as provenance information for later analysis.
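The four steps above can be sketched as a small pipeline. This is a minimal Python sketch under strong simplifying assumptions: ADS instances are modelled as plain dictionaries rather than XML documents, the registry is an in-memory list, no job is actually submitted, and all names (REGISTRY, search, set_parameters, expand_sweep, execute, the component names and their options) are hypothetical.

```python
from itertools import product

# Hypothetical registry content; None marks a value that is not yet set.
REGISTRY = [
    {"name": "tree-learner", "description": "decision tree classifier",
     "options": {"depth": None, "criterion": "gini"}},
    {"name": "rule-miner", "description": "association rule mining",
     "options": {"minSupport": None}},
]


def search(registry, term):
    """Step 1: match a search term against metadata fields."""
    term = term.lower()
    return [ads for ads in registry
            if term in ads["name"].lower()
            or term in ads["description"].lower()]


def set_parameters(ads, values):
    """Step 2: return a copy with user values; fail if values are still unset."""
    options = {**ads["options"], **values}
    unset = [key for key, value in options.items() if value is None]
    if unset:
        raise ValueError(f"not fully specified: {unset}")
    return {**ads, "options": options}


def expand_sweep(ads, sweep):
    """Step 2 (sweep): one fully specified instance per value combination."""
    names = sorted(sweep)
    return [set_parameters(ads, dict(zip(names, combo)))
            for combo in product(*(sweep[name] for name in names))]


def execute(ads):
    """Steps 3 and 4: build a job description and attach it as provenance.

    Nothing is run here; the status value is a placeholder.
    """
    job = {"component": ads["name"], "arguments": dict(ads["options"])}
    return {**ads, "provenance": {"job": job, "status": "simulated"}}


selected = search(REGISTRY, "tree")[0]
results = [execute(ads) for ads in expand_sweep(selected, {"depth": [2, 4, 8]})]
for result in results:
    print(result["provenance"]["job"])
```

In this sketch a parameter sweep simply multiplies one partially specified instance into several fully specified ones, each of which yields its own job description and provenance record.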
An implementation of these tools as extensions to the Triana workflow environment in the DataMiningGrid system will be presented in Section 3.4.1.
Figure 3.5.: Executing a grid-enabled data mining component with the help of the ADS in the grid environment.