2. Scientific Data Analysis, Data Mining and Data Analysis Environments
3.4. Case Study DataMiningGrid
3.4.3. Grid-enabling Weka
This case study describes the grid-enabling and execution of data mining components based on our approach presented in Section 3.3. It is based on [142, 139]. We use the Weka data mining toolkit (see Section 2.1.3) for our case study, as it represents a standard toolkit in the area of data mining. It includes a wide range of data mining components which cover the data mining methods presented in Section 2.1. The following subsections introduce the Weka algorithms that are grid-enabled as data mining components and the process of grid-enabling them.
Weka
Weka [156] is a comprehensive data mining toolkit written in Java. It is available as Open Source and is in widespread use especially in the academic community. It includes components for pre-processing, classification, regression, clustering, feature selection and visualization. Weka is equipped with a set of user interfaces, but the individual components can also be executed via command-line. We use Weka 3.5.5 in our experiments.
Figure 3.15.: DM Application Enabler - Confirmation (from [139]).
In our case study, we focus on a regression problem. We use the following Weka algorithms for our experiments:
• K-Nearest Neighbours classifier (IBK): K-Nearest-Neighbours (kNN) [156] is a well-known, simple yet powerful method both for classification and regression. For predicting an unknown instance the k nearest instances according to some distance function are selected and, for regression, the target value is calculated using some possibly distance-weighted mean of the nearest neighbours. Crucial parameters are the correct choice of the distance function, the weighting function and the number of neighbours.
• Locally Weighted Learning (LWL): Locally Weighted Learning [156] is an instance-based algorithm that assigns weights to instances according to their distance to the test instance. This is similar to kNN, but not limited to a fixed number of neighbours. LWL then performs a linear or non-linear regression on the weighted instances.
• M5P: M5P [156] is a tree-based algorithm. In contrast to a regression tree, which uses the mean value in the leaves of the tree for prediction, M5P fits a linear model for each leaf.
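The kNN regression step described in the first item can be sketched in a few lines of Python. This is only an illustration of the idea, not Weka's IBk implementation: the Euclidean distance, the inverse-distance weighting and the function name knn_regress are assumptions chosen for the example.

```python
import math

def knn_regress(train, query, k=3, weighted=True):
    """Predict a numeric target for `query` from its k nearest training instances.

    `train` is a list of (feature_vector, target) pairs.
    """
    # Euclidean distance is an assumption; any distance function could be used.
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Select the k instances closest to the query.
    neighbours = sorted(train, key=lambda inst: distance(inst[0], query))[:k]
    if not weighted:
        return sum(target for _, target in neighbours) / len(neighbours)

    # Inverse-distance weighted mean (assumed weighting function).
    # An exact match returns its own target to avoid division by zero.
    num = den = 0.0
    for features, target in neighbours:
        d = distance(features, query)
        if d == 0.0:
            return target
        num += target / d
        den += 1.0 / d
    return num / den

train = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
print(knn_regress(train, (2.0,), k=3, weighted=False))  # mean of 10, 20 and 30
```

The three parameters named in the text map directly onto the sketch: the distance function, the weighting function, and k.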
Process of grid enabling
The user’s component to be integrated into the grid environment has to be compliant with the definition of a component from Section 2.1.4. In our case, the jar file of the Weka distribution is the executable file for all Weka components to be grid-enabled.
In order to make the three components IBK, LWL and M5P available in the grid, we have to follow the grid-enabling procedure and create the application description files. These files are instances of the ADS and contain the following description for each component. The element dataminingInformation holds data mining specific metadata about the component, such as the component’s name (which is the algorithm’s name), the group (Weka), the application domain (Data Mining), and the CRISP phase (Modelling). The element generalInformation contains further metadata such as version, id, a description, the upload date and so on. In the execution element we have to specify the execution type (java), the main class (e.g. weka.classifiers.lazy.IBk), interpreter arguments (e.g. the maximum Java heap size -Xmx1000m) and the component’s executable file (path to the jar file). The element applicationInformation contains information about the options of the component (the options which can be specified in the Weka GUI) as well as the class attribute to predict. Each of these options is specified by data type, default value, a tool-tip, the flag, a label shown in the GUI, etc. Additionally, it specifies the data input, which is a single file in the ARFF format, and the data output, which is a text file containing the textual output Weka creates. The last step is to upload the executable file and the application description files to the grid. Once the component is grid-enabled, it appears in the grid-wide component registry.
Examples of the ADS instances for the Weka components IBK, LWL and M5P can be found in the Appendix (Section A.2).
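To illustrate how the fields of such a description relate to the resulting Weka call, the following Python sketch assembles a Java command line from a simplified description. It is not the ADS format itself: only the four top-level element names come from the text, while the nested key names, the jar path weka.jar, the file name house.arff and the helper build_command are assumptions made for the example. The flags -t (training file) and -K (number of neighbours for IBk) are standard Weka command-line options.

```python
# Hypothetical, simplified stand-in for an application description.
# Top-level keys follow the element names in the text; nested keys are assumed.
description = {
    "dataminingInformation": {
        "name": "IBk", "group": "Weka",
        "domain": "Data Mining", "crispPhase": "Modelling",
    },
    "generalInformation": {"version": "3.5.5"},
    "execution": {
        "type": "java",
        "mainClass": "weka.classifiers.lazy.IBk",
        "interpreterArguments": ["-Xmx1000m"],
        "executable": "weka.jar",  # placeholder path to the jar file
    },
    "applicationInformation": {
        "options": [
            {"flag": "-K", "label": "Number of neighbours",
             "type": "int", "default": 1},
        ],
        "dataInput": {"flag": "-t", "format": "ARFF"},
    },
}

def build_command(desc, input_file, option_values=None):
    """Assemble the command line for one job from the description."""
    option_values = option_values or {}
    execution = desc["execution"]
    app = desc["applicationInformation"]

    # Interpreter, interpreter arguments, jar on the classpath, main class.
    cmd = ["java", *execution["interpreterArguments"],
           "-cp", execution["executable"], execution["mainClass"]]
    # Data input: a single ARFF file.
    cmd += [app["dataInput"]["flag"], input_file]
    # Options: use the supplied value, else the default from the description.
    for opt in app["options"]:
        value = option_values.get(opt["flag"], opt["default"])
        cmd += [opt["flag"], str(value)]
    return cmd

print(" ".join(build_command(description, "house.arff", {"-K": 5})))
```

Keeping flags, defaults and labels in the description rather than in code is what allows a generic GUI to present the options of any registered component.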
Example Workflow
Figure 3.16.: Weka Workflow (from [142]).
In the DataMiningGrid system, Triana workflows are used to specify data mining processes. A workflow for running a grid-enabled Weka component consists, e.g., of the following units (see Figure 3.16):
• ApplicationExplorer
• GridURI
• ParameterControl
• Provenance
• GridExplorer
When the workflow is started in Triana, the following steps are executed:
• Step 1: The GUI of the ApplicationExplorer unit is shown (see Figure 3.17), which allows the user to search the grid registry for a data mining component.
• Step 2: The GUI of the GridURI unit is shown (see Figure 3.18), which allows the user to specify the URI of the file in the grid to be used as input.
• Step 3: In the ParameterControl unit (see Figure 3.19), the user is able to specify the parameters of the component.
• Step 4: During execution, the Execution unit (see Figure 3.20) shows information on the execution of the job in the grid.
• Step 5: In the GUI of the Provenance unit (see Figure 3.21), the user can access information and statistics on the execution of the component.
• Step 6: The GUI of the GridExplorer unit (see Figure 3.22) allows the user to browse the result directory of the executed job.
Figure 3.18.: The GridURI Triana Unit (from [139]).
Figure 3.19.: ParameterControl Triana Unit (from [139]).
Figure 3.20.: Execution Triana Unit with 8 jobs executing (from [139]).
Runtime Experiments
In the following we present runtime experiments based on the grid-enabled Weka algorithms. We show two experiments, each of which executes a single component with different parameter settings.
The workflow which is set up for running one of the Weka components is the one already shown in Figure 3.16. As described in the previous section, it consists of the six units LoadDescription, GridURI, ParameterControl, Execution, Provenance and GridExplorer. In the following we describe the settings during the execution of the workflow for the Weka components. When starting the workflow, the data mining components to execute are selected. After that, the workflow passes on to the ParameterControl unit, where the component’s parameters and options can be specified (e.g., it is specified that each job shall perform 10-fold cross-validation). We want to execute jobs which run the same component with different parameter settings, so we have to select the options on which we will perform the sweep. At the Options panel of the ParameterControl unit we can set the details for the sweep by choosing either a list or a loop for the parameter (as already shown in Figure 3.8). From these settings the system generates a (multi-)job description. Additionally, we have to specify the component’s data input. This is done by selecting the URI which was passed from the GridURI unit in the input data drop-down box on the Data mappings tab.
Figure 3.21.: Provenance Triana Unit displaying general information (from [139]).
Figure 3.22.: GridExplorer Triana Unit (from [139]).
Number of machines                 1   2   3   4   5   6
Max. number of jobs per machine   10   5   4   3   2   2
Table 3.1.: Job distribution for the M5P experiment.
We used the dataset House(16L) from a database which was designed on the basis of data provided by the US Census Bureau and is concerned with predicting the median price of houses in a region based on the demographic composition and the state of the housing market in that region. The dataset, which was taken from the UCI Machine Learning Repository [51], contains 22784 examples and 17 continuous attributes. This size of data is justified because we are mainly interested in measuring the overhead caused by grid computing in the DataMiningGrid environment, which becomes clearer when using smaller datasets.
The next step of the workflow is the Execution unit, which submits the jobs to the resource broker. The jobs will be executed, and after all jobs are finished the Provenance unit and the GridExplorer show the provenance information about the execution and the result directory.
The test environment on which the jobs will be executed contains 2 GT4 GRAMs (Intel Pentium 4 2.40GHz, 2GB memory) and 6 Condor machines (AMD Opteron 244 1.80GHz, 4GB memory). For the evaluation we will vary the number of machines and/or the number of jobs and we will look at and compare the runtime.
Experiment M5P
During the M5P experiment we submit jobs to the grid which execute the Weka M5P algorithm with different parameter settings. The execution mode is Condor, which means that all jobs are submitted to the Condor pool which is connected to the GRAMs. In this experiment we keep the number of jobs fixed and vary the number of machines in the grid. We generate 10 jobs in total by using a list for the option BuildRegressionTree (true/false) and a loop for the option MinNumInstances (from 2 to 10 in steps of 2), i.e. 2 × 5 = 10 combinations. These jobs are submitted to 1 to 6 machines. Figure 3.23 visualizes the results of the M5P experiment. The graph shows the relation between the number of machines in the grid and the runtime of all 10 jobs. As expected, the runtime in general decreases as machines are added, until the number of machines in the grid reaches the number of jobs. The jobs are distributed evenly over the Condor machines. Table 3.1 shows the maximum number of jobs which any one machine has to compute (e.g. with 10 jobs on 3 machines, 2 machines take 3 jobs each and one takes 4). Since this maximum is 2 for both 5 and 6 machines, there is no decrease in total runtime from 5 to 6 machines.
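The sweep and the resulting job distribution can be reproduced with a short Python sketch. The option names are taken from the text; the dictionary layout of a job and the helper max_jobs_per_machine are assumptions for illustration and do not reflect the actual (multi-)job description format or the scheduling logic of Condor.

```python
import itertools
import math

# "List" sweep over BuildRegressionTree and "loop" sweep over MinNumInstances.
build_regression_tree = [True, False]
min_num_instances = list(range(2, 11, 2))  # from 2 to 10 in steps of 2

# One job per combination of the swept option values.
jobs = [
    {"BuildRegressionTree": b, "MinNumInstances": m}
    for b, m in itertools.product(build_regression_tree, min_num_instances)
]
print(len(jobs))  # 2 * 5 = 10 jobs

def max_jobs_per_machine(n_jobs, n_machines):
    """Largest share of jobs on one machine under an even distribution."""
    return math.ceil(n_jobs / n_machines)

# Second row of Table 3.1 for 1 to 6 machines.
print([max_jobs_per_machine(len(jobs), m) for m in range(1, 7)])
```

If all jobs take roughly the same time, the total runtime is governed by the machine with the most jobs, which is why the value ceil(10 / m) is the relevant quantity.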
Experiment IBK
In the IBK experiment we submit jobs to the grid which execute the Weka IBK algorithm. We run several experiment series, each on a different kind of machine or pool, and compare them afterwards. The jobs are generated by varying the parameter k (from 1 to at most 16), so that we have up to 16 jobs in total. These jobs run a) in fork mode on the Globus machine, b) on a single machine inside the Condor pool and c) on the whole grid (which consists of 6 Condor machines). In each experiment series we keep the number of machines fixed and vary the number of jobs. The result (Figure 3.24) is as expected. In a) and b) the jobs are all executed on a single machine, and the runtime increases linearly with the number of jobs. The fork execution on the Globus machine performs worse than the execution on the Condor machine. This may look surprising, because the submission from the Globus machine to Condor takes some time, so the Condor execution would be expected to take longer. The reason for this result is that the Globus machine has older hardware. When executing the jobs in c) on 6 Condor machines, the runtime also increases linearly, but in comparison to the Condor execution on a single machine it decreases by a factor of about 6.
Figure 3.24.: IBK results (from [142]).