Comparison of different internal data structures

6.5 Experiments

6.5.4 Comparison of different internal data structures

In another experiment we analyzed the effect of different internal data structures. We trained a perceptron on the SQL dataset (see Section 6.5.1) and varied the data structure that stores the set of already misclassified examples, each of which contains a tree structure. The trivial approach stores each tree separately and calculates the sum over the set of these trees. This approach comes in two variants, as the calculation is possible using the approach by [Collins and Duffy, 2001] or the faster approach by [Moschitti, 2006a, Moschitti, 2006b]. The third method stores the set of trees in one list so that the approach by [Moschitti, 2006a, Moschitti, 2006b] can be applied to it afterwards; it is presented in Section 6.4.2. The fourth approach stores the set of trees in a DAG. Table 6.13 shows the results of the perceptron experiments on the SQL dataset. It becomes obvious that the quality of the results in terms of recall, precision and accuracy does not depend on the internally used data structure. Figure 6.6 shows the runtimes for one 10-fold cross-validation using the four different data structures internally in a perceptron on the SQL dataset.

Table 6.13: Results of the perceptron experiments on the SQL dataset

Method                 Accuracy      Recall        Precision     Time (in s)
One DAG                98.9 ± 0.6%   87.3 ± 3.1%   82.4 ± 6.8%    3.58
Tree List Perceptron   99.0 ± 0.6%   87.3 ± 3.6%   83.2 ± 6.4%   18.6
FTK                    99.0 ± 0.6%   87.3 ± 3.6%   83.2 ± 6.5%   23.89
QTK                    98.9 ± 0.5%   87.6 ± 3.3%   82.3 ± 6.6%   31.34

Figure 6.6: Runtime for a cross-validation using different data structures using a perceptron on SQL (bar chart; x-axis: data structure, with DAG, Tree List Perceptron, FTK and CollinsDuffy; y-axis: time in s, from 0 to 30)
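The setting of this experiment can be illustrated with a minimal sketch of a kernel perceptron that keeps its misclassified trees in a plain list, which corresponds to the trivial storage approach. The tree encoding (nested tuples with string leaves), the function names and the toy data are assumptions made for illustration; the kernel follows the usual recursive formulation of [Collins and Duffy, 2001] and is not the implementation used in the experiments.

```python
# Sketch only: trees are nested tuples (label, child, ...); leaves are strings.

def label(node):
    """Return the label of a node; leaf words are plain strings."""
    return node if isinstance(node, str) else node[0]

def production(node):
    """The grammar production at a node: its label plus its child labels."""
    return (node[0], tuple(label(child) for child in node[1:]))

def internal_nodes(tree):
    """All non-leaf nodes of a tree, each represented by its subtree."""
    if isinstance(tree, str):
        return []
    nodes = [tree]
    for child in tree[1:]:
        nodes.extend(internal_nodes(child))
    return nodes

def c_value(n1, n2, lam=1.0):
    """C(n1, n2): weighted number of common fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    result = lam
    for child1, child2 in zip(n1[1:], n2[1:]):
        if isinstance(child1, tuple) and isinstance(child2, tuple):
            result *= 1.0 + c_value(child1, child2, lam)
    return result

def tree_kernel(t1, t2, lam=1.0):
    """Sum C over all pairs of internal nodes of the two trees."""
    return sum(c_value(a, b, lam)
               for a in internal_nodes(t1)
               for b in internal_nodes(t2))

def predict(support, tree):
    """Sign of the kernel sum over the stored misclassified examples."""
    score = sum(y * tree_kernel(stored, tree) for stored, y in support)
    return 1 if score > 0 else -1

def train(data, epochs=5):
    """Kernel perceptron: every misclassified example is added to the store."""
    support = []
    for _ in range(epochs):
        for tree, y in data:
            if predict(support, tree) != y:
                support.append((tree, y))
    return support

# Toy data (hypothetical, not the SQL dataset).
POS = ("S", ("NP", "a"), ("VP", "b"))
NEG = ("S", ("X", "c"), ("Y", "d"))
support = train([(POS, 1), (NEG, -1)])
```

Every call to `predict` iterates over the whole store, so the way the store is organized affects the runtime but not the predictions, which matches the observation from Table 6.13.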

6.6 Summary

In this chapter we have shown that a crucial point to be respected for efficient tree kernel usage is the data structure storing the set of trees.

We presented two data structures, both of which can handle sets of trees. The first is based on our idea of storing the set of trees in a list of productions. This idea is inspired by the FTK of [Moschitti, 2006b], which uses lists of productions to store trees. The second is a directed acyclic graph (DAG) which, during tree kernel calculation, acts like a tree but contains a whole set of trees. The advantage of using a DAG is that the nodes contain frequencies in addition to the labels. This results in a smaller number of nodes, which in turn leads to a less complex calculation of the tree kernel values. Both data structures lead to a more efficient tree kernel calculation compared to not storing the set of trees in a single data structure.
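The benefit of storing frequencies can be sketched in a few lines. The code below is a simplified stand-in for the DAG, assuming trees encoded as nested tuples with string leaves: identical subtrees are stored once in a table together with their frequency, and the kernel sum over the whole set is computed from the unique entries only. All names are hypothetical, and the table is not the DAG implementation described in this chapter.

```python
from collections import Counter

def label(node):
    """Return the label of a node; leaf words are plain strings."""
    return node if isinstance(node, str) else node[0]

def production(node):
    return (node[0], tuple(label(child) for child in node[1:]))

def internal_nodes(tree):
    if isinstance(tree, str):
        return []
    nodes = [tree]
    for child in tree[1:]:
        nodes.extend(internal_nodes(child))
    return nodes

def c_value(n1, n2):
    """C(n1, n2) in the usual recursive form (decay factor fixed to 1)."""
    if production(n1) != production(n2):
        return 0.0
    result = 1.0
    for child1, child2 in zip(n1[1:], n2[1:]):
        if isinstance(child1, tuple) and isinstance(child2, tuple):
            result *= 1.0 + c_value(child1, child2)
    return result

def build_node_table(trees):
    """Map each distinct subtree to the number of times it occurs in the set."""
    table = Counter()
    for tree in trees:
        table.update(internal_nodes(tree))
    return table

def sum_kernel_naive(trees, x):
    """Kernel sum over the set, visiting every node of every stored tree."""
    return sum(c_value(n, m)
               for tree in trees
               for n in internal_nodes(tree)
               for m in internal_nodes(x))

def sum_kernel_table(table, x):
    """Same sum, but each distinct subtree is evaluated once and weighted."""
    x_nodes = internal_nodes(x)
    return sum(freq * c_value(n, m)
               for n, freq in table.items()
               for m in x_nodes)

# Toy set with repeated subtrees (hypothetical data).
T1 = ("S", ("NP", "a"), ("VP", "b"))
T2 = ("S", ("NP", "a"), ("VP", "c"))
trees = [T1, T2, T1]
table = build_node_table(trees)
naive_sum = sum_kernel_naive(trees, T1)
table_sum = sum_kernel_table(table, T1)
```

In this toy set the three trees contain nine internal nodes but only five distinct subtrees, so the table-based sum evaluates fewer node pairs while yielding the same value.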

Our approach speeds up the calculation over sets of trees by making the access to them more efficient. The internal calculation of C(n_i, n_j) is not affected by it. The approximate tree kernel presented by [Rieck et al., 2010], in contrast, aims at speeding up the calculation of C(n_i, n_j) itself. Combining our approach with the approximate tree kernel could therefore lead to an even more efficient tree kernel calculation.
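For orientation, the distinction can be written out using the common formulation of the tree kernel as a double sum over node pairs. The symbols N_T (node set of a tree T), S (the stored set of trees) and T' (the tree to be classified) are notation assumed here for illustration.

```latex
\begin{align}
K(T_1, T_2) &= \sum_{n_i \in N_{T_1}} \sum_{n_j \in N_{T_2}} C(n_i, n_j) \\
\sum_{T \in S} K(T, T') &= \sum_{T \in S} \sum_{n_i \in N_T} \sum_{n_j \in N_{T'}} C(n_i, n_j)
\end{align}
```

Read this way, the data structures of this chapter reduce the number of terms in the outer sums over S and N_T, whereas an approximation of the kernel targets the cost of each individual term C(n_i, n_j).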

Additionally, we presented two machine learning approaches which by definition have a shorter runtime than, for instance, support vector machines (SVMs). The first approach is a perceptron, which we evaluated with all presented data structures. The second approach, our main contribution in this chapter, is a tree kernel naïve Bayes classifier. On the one hand, this classifier is significantly better than naïve Bayes classifiers applied to flattened structured features with respect to precision and accuracy. On the other hand, our approach is significantly faster on particular datasets than comparable tree kernel methods based on optimized models like SVMs. We presented experiments on three real-world datasets which show that our tree kernel naïve Bayes approach is a fast and efficient alternative to tree kernel methods based on kernel machines.

Chapter 7

The Information Extraction Plugin for RapidMiner

In this chapter we will present the plugin we developed for the open source framework RapidMiner [Mierswa et al., 2006]. Relevant publications concerning the extension are [Jungermann, 2009, Jungermann, 2010, Jungermann, 2011b, Jungermann, 2011c, Jungermann, 2011a].

Our plugin allows the combination of Information Extraction and Data Mining methods. In addition, it is possible to use all techniques which are already available in RapidMiner. These techniques are not limited to models like decision trees or support vector machines: a great benefit of RapidMiner is the possibility to easily validate machine learning tasks. With our plugin the validation process can also be used for Information Extraction processes, making the results more significant. Additionally, we are able to evaluate and validate several different parameter settings for multiple techniques. This makes our plugin a toolbox for the comparison and development of new machine learning approaches for Information Extraction. The plugin is open source and easy to extend.

RapidMiner, which is briefly presented in Section 7.1, supports a certain data structure for storing datasets. This data structure has to be respected by extensions and by the operators of those extensions. These circumstances lead to particular requirements which have to be fulfilled by our extension. These requirements, in addition to the data structure used in RapidMiner, are presented in Section 7.1.1.

The process of a particular Data Mining task can be separated into four distinct parts in RapidMiner. The first part is the retrieval of the data. We present the possible ways to retrieve data for Information Extraction purposes in RapidMiner in Section 7.2.1. After the retrieval the data has to be prepared for future use. This preparation is often called preprocessing. Although the task of preprocessing sometimes contains the process of data preparation (see Section 2.2), in this work we just focus on the enrichment of the data by features to allow more precise analyses. The preprocessing of the datasets is presented in Section 7.2.2. The preprocessed datasets can finally be used to create models, which in turn can be used to analyze formerly unknown datasets. The process of creating models is called modeling and it is presented in Section 7.2.3. After the process of modeling, and sometimes also during it, the models have to be evaluated to get the optimal model for a given dataset. The task of evaluation is presented in Section 7.2.4.

In Section 7.3 we present frameworks which are comparable to our plugin. We will show that state-of-the-art frameworks for Information Extraction are based on an architecture comparable to that of our plugin. It will become obvious that our plugin is superior to those frameworks because it enables the close collaboration of Information Extraction and Data Mining.

Section 7.4 summarizes this chapter. The particular reference for each operator is presented in Appendix A.

7.1 RapidMiner

RapidMiner is an open source framework for Data Mining purposes. It offers many Data Mining methods which can be plugged together to form a Data Mining analysis. The functional objects in RapidMiner are called operators, and a set of operators plugged together is called a process. The major function of a process is the analysis of the data which is retrieved at the beginning of the process. The framework offers a graphical user interface (GUI) in which operators can be connected with each other to form the process. The panel visualizing the process is called the process view.

Operators have interfaces for receiving and delivering data. These interfaces are called ports. Input ports receive the data which is presented to the operator, and output ports deliver the data which has been processed by the operator. Most operators have at least one input and one output port. Data that is passed to an input port of an operator is processed internally and finally presented at an output port. From there it is passed to any operator which is connected to that output port.

Other types of objects may be created during the process, and these are also presented at output ports. The data can be used to create models, for instance (see Section 2.4 for more information concerning models). These models can be evaluated, and performance results can be generated from this evaluation, and so on. All of these kinds of objects can be passed as data objects from operator to operator by connecting the operators. The complete process has global output ports. Data, results or models which are passed to these ports are displayed in the result view panel after the process has finished.
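The operator and port model described above can be pictured with a small, purely illustrative sketch. The classes `Operator` and `Process` below are hypothetical and deliberately simplified (one input and one output port per operator, a linear chain); they are not part of the RapidMiner API.

```python
class Operator:
    """Hypothetical operator with one input port and one output port."""

    def __init__(self, name, transform):
        self.name = name
        self.transform = transform      # what the operator does internally
        self.next_operator = None       # operator connected to the output port

    def connect_to(self, other):
        """Connect this operator's output port to another operator's input port."""
        self.next_operator = other
        return other


class Process:
    """Hypothetical process: a chain of connected operators."""

    def __init__(self, first_operator):
        self.first_operator = first_operator

    def run(self, data=None):
        """Pass the data from operator to operator and return the final result."""
        operator = self.first_operator
        while operator is not None:
            data = operator.transform(data)
            operator = operator.next_operator
        return data  # corresponds to what arrives at the global output port


# Toy usage: retrieve a list, sort it, and deliver the smallest value.
retrieve = Operator("Retrieve", lambda _: [3, 1, 2])
sort_rows = Operator("Sort", sorted)
first_row = Operator("First", lambda rows: rows[0])
retrieve.connect_to(sort_rows).connect_to(first_row)
result = Process(retrieve).run()
```

A real process may branch and pass several kinds of objects (data, models, performance results) between ports; the linear chain is only meant to show how data flows along the connections.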

The GUI of RapidMiner is shown in Figure 7.1. Six main areas of the GUI, which can be rearranged by the user, are to be distinguished:

Figure 7.1: RapidMiner graphical user interface

1. Overview

If the process is too large to be displayed in the process window, the overview window will help to navigate to certain positions in the process window.

2. Operators and Repositories

These tabs allow accessing operators or repositories of RapidMiner. Operators are the basic elements for building a process. Repositories store datasets so that files do not have to be loaded and converted again for each run of a process. This leads to faster access to datasets.

3. Process

The Process window makes the whole process of connected operators accessible. An overview of this window, which could become very large, is available in the Overview tab.

4. Problems, Log and System Monitor

This tab contains possible log messages, information about problems and about the system load.

5. Parameters

The Parameters tab shows the parameters of the operator which is currently focused. Parameters are very important because the results of Data Mining tasks often depend on the right choice of the particular parameters.

6. Help

The Help tab contains information about the operators which are currently focused.

Each RapidMiner process can be split into four distinct phases. These phases are shown in Figure 7.2:

1. Retrieve

The leftmost operator in Figure 7.2 is a Retrieve operator. During the Retrieve phase the data which is processed later on is loaded from specific data sources.

2. Preprocessing

The retrieved data has to be prepared or enriched in the Preprocessing phase. The second operator shown in Figure 7.2 (the purple one) is a particular preprocessing operator which converts nominal values to numerical ones.

3. Modeling

The prepared data is used in the Modeling phase to extract or create models which can be used for the analysis of unlabeled data. The third operator shown in Figure 7.2 is creating an SVM model.

4. Evaluation

The two rightmost operators shown in Figure 7.2 are used to apply the learned model to a dataset and to evaluate the performance achieved by the applied model. The expected or real performance of the created models is evaluated during the Evaluation phase.

Figure 7.2: Exemplary process in RapidMiner

These phases are similar for every RapidMiner process. Therefore, we will define the particular phases and the corresponding specialties using the Information Extraction Plugin in Section 7.2.
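The four phases can be mirrored in a short, self-contained sketch. Everything in it is an assumption made for illustration: the in-memory toy dataset, the function names, and in particular the model, which is a simple nearest-class-mean classifier swapped in for the SVM of Figure 7.2. It does not use RapidMiner itself.

```python
def retrieve():
    """Retrieve phase: load the data (here a hypothetical in-memory dataset)."""
    return [("low", "no"), ("medium", "no"), ("high", "yes"),
            ("high", "yes"), ("low", "no"), ("high", "yes")]

def preprocess(rows):
    """Preprocessing phase: convert nominal values to numerical ones."""
    mapping = {"low": 0.0, "medium": 1.0, "high": 2.0}
    return [(mapping[value], target) for value, target in rows]

def train(rows):
    """Modeling phase: store the mean attribute value of each class."""
    sums = {}
    for x, y in rows:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}

def apply_model(model, x):
    """Predict the class whose mean is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))

def evaluate(model, rows):
    """Evaluation phase: accuracy of the applied model on held-out rows."""
    correct = sum(apply_model(model, x) == y for x, y in rows)
    return correct / len(rows)

data = preprocess(retrieve())
train_rows, test_rows = data[:4], data[4:]   # simple hold-out split
model = train(train_rows)
accuracy = evaluate(model, test_rows)
```

Each function corresponds to one phase, and the value returned by one phase is the input of the next, just as data objects are passed along connected operators.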
