
4.2 Experiments

4.2.5 Features Increase Rate

Finally, we conduct a scalability experiment, where we examine how the number of instances affects the number of features generated by each feature generation strategy. For this purpose we use the Metacritic Movies dataset. We start with a random sample of 100 instances, and in each subsequent step we add 200 (or 300) unused instances, until the complete dataset is used, i.e., 2,000 instances. The number of generated features for each sub-sample of the dataset using each of the feature generation strategies is shown in Figure 8.3. We can observe that in the beginning the curves for all strategies sharply increase. After the sub-sample reaches half of the complete sample, the strategies based on generic relations stabilize, as only a few new relations are discovered when adding new instances. The curves for the strategies based on generic relation-values and on specific relations increase steadily as new instances are added. On the other hand, the curves for the strategies based on graph substructures increase more rapidly than those of the other strategies, without a sign of convergence. From the chart, we can also observe that the strategies based on graph substructures generate feature sets three orders of magnitude larger than the strategies based on generic relations, and two orders of magnitude larger than the strategies based on specific relations.

Figure 4.1: Features increase rate per strategy (log scale).
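The sampling procedure of this experiment can be sketched as follows. This is only an illustration under assumptions, not the code used for the experiment: the function name `feature_growth`, the `featurize` callback (which stands in for any feature generation strategy and returns the set of feature names for one instance), and the toy data are all hypothetical.

```python
import random


def feature_growth(instances, featurize, start=100, step=200, seed=0):
    """Return (sample_size, n_distinct_features) pairs for growing samples.

    `featurize` is a hypothetical callback mapping one instance to the set
    of feature names that a generation strategy would create for it.
    """
    rng = random.Random(seed)
    order = list(instances)
    rng.shuffle(order)  # random sample order, fixed by the seed

    seen = set()   # distinct features generated so far
    curve = []
    size = 0
    next_size = min(start, len(order))
    while size < len(order):
        # Add only the instances that have not been used yet.
        for inst in order[size:next_size]:
            seen.update(featurize(inst))
        size = next_size
        curve.append((size, len(seen)))
        next_size = min(size + step, len(order))
    return curve


# Toy stand-in for a strategy with a small, quickly exhausted vocabulary.
toy_curve = feature_growth(range(2000), lambda i: {f"rel_{i % 50}"})
```

A strategy with a bounded vocabulary, like the toy one above, yields a curve that flattens once all of its features have been seen, which corresponds to the stabilizing behavior described for the generic-relation strategies.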

4.3 Conclusion and Outlook

In this chapter, we have introduced a collection of 22 benchmark datasets for machine learning on the Semantic Web. We have shown how they can be used to set up experiments which allow for making statistically significant comparisons between different learning approaches. So far, we have concentrated on classification and regression tasks. There are methods to derive clustering and outlier detection benchmarks from classification and regression datasets [77, 84], so that extending the dataset collection for such unsupervised tasks is possible as well. Furthermore, as many datasets on the Semantic Web use extensive hierarchies in the form of ontologies, building benchmark datasets for tasks like hierarchical multi-label classification [279] would also be an interesting extension.

At the moment, the dataset collection has a certain bias towards datasets linked to DBpedia. This has two main reasons: (1) DBpedia is a cross-domain knowledge base usable in datasets from very different topical domains, and (2) tools like DBpedia Lookup and DBpedia Spotlight make it easy to link external datasets to DBpedia. However, DBpedia can be seen as an entry point to the Web of Linked Data, with many datasets linking to and from DBpedia. In fact, some Semantic Web mining tools, such as the RapidMiner Linked Open Data extension, are capable of exploiting such links automatically and combining information from various Linked Data sets [246].

In summary, this chapter presents the first attempt at creating a universal benchmark collection for Semantic Web mining, an area in which much research is conducted, but an accepted benchmark set is missing. We believe that this benchmark set, as it is successively extended, will provide a useful cornerstone for research at the crossroads of Semantic Web and machine learning.

Propositionalization Strategies for Creating Features from Linked Open Data

As shown in Chapter 2, Semantic Web knowledge graphs have been recognized as a valuable source of background knowledge in many data mining tasks. Augmenting a dataset with features taken from Semantic Web knowledge graphs can, in many cases, improve the results of the data mining problem at hand, while externalizing the cost of maintaining that background knowledge [221].

Most data mining algorithms work with a propositional feature vector representation of the data, i.e., each instance is represented as a vector of features $\langle f_1, f_2, \ldots, f_n \rangle$, where the features are either binary (i.e., $f_i \in \{true, false\}$), numerical (i.e., $f_i \in \mathbb{R}$), or nominal (i.e., $f_i \in S$, where $S$ is a finite set of symbols). Linked Open Data, however, comes in the form of graphs, connecting resources with types and relations, backed by a schema or ontology.

Thus, for accessing Semantic Web knowledge graphs with existing data mining tools, transformations have to be performed which create propositional features from the graphs in Linked Open Data, a process called propositionalization [154]. Usually, binary features (e.g., true if a type or relation exists, false otherwise) or numerical features (e.g., counting the number of relations of a certain type) are used [225]. Other variants, e.g., computing the fraction of relations of a certain type, are possible, but rarely used.
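The three variants mentioned above (binary, count, and fraction) can be illustrated with a minimal sketch. It assumes the graph is given as a plain list of (subject, predicate, object) tuples; the function name `propositionalize`, the strategy labels, and the `ex:` identifiers are hypothetical and chosen only for this example.

```python
from collections import Counter


def propositionalize(triples, subject, relations, strategy="binary"):
    """Build a feature vector for `subject` over a fixed list of relations.

    Assumption: `triples` is an iterable of (subject, predicate, object)
    tuples, and `relations` fixes the order of the vector's dimensions.
    """
    counts = Counter(p for s, p, _ in triples if s == subject)
    total = sum(counts.values())  # all outgoing relations of the subject

    if strategy == "binary":
        return [counts[r] > 0 for r in relations]
    if strategy == "count":
        return [counts[r] for r in relations]
    if strategy == "relative":
        # Fraction of the subject's relations that are of each type.
        return [counts[r] / total if total else 0.0 for r in relations]
    raise ValueError(f"unknown strategy: {strategy}")


# Toy graph with made-up identifiers.
toy_triples = [
    ("ex:Film1", "ex:actor", "ex:PersonA"),
    ("ex:Film1", "ex:actor", "ex:PersonB"),
    ("ex:Film1", "ex:director", "ex:PersonC"),
    ("ex:Film2", "ex:director", "ex:PersonC"),
]
toy_relations = ["ex:actor", "ex:director", "ex:award"]

film1_binary = propositionalize(toy_triples, "ex:Film1", toy_relations, "binary")
film1_count = propositionalize(toy_triples, "ex:Film1", toy_relations, "count")
film1_relative = propositionalize(toy_triples, "ex:Film1", toy_relations, "relative")
```

For `ex:Film1`, the binary vector only records that actor and director relations exist, the count vector distinguishes two actors from one director, and the relative vector normalizes those counts by the three relations the resource has in total.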

Our hypothesis is that the strategy of creating propositional features from Linked Open Data may have an influence on the data mining result. For example, proximity-based algorithms like k-NN will behave differently depending on the strategy used to create numerical features, as that strategy has a direct influence on most distance functions.
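A small constructed example makes this effect concrete. The three instances and their relation counts below are invented for illustration, and Euclidean distance is assumed as the distance function; the helper `nearest_neighbor` is hypothetical.

```python
import math

# Invented relation counts for three instances over two relation types.
count_vectors = {"a": [1, 1], "b": [10, 1], "c": [1, 0]}

# Binary version of the same data: 1 if the relation exists, 0 otherwise.
binary_vectors = {
    name: [1 if value > 0 else 0 for value in vector]
    for name, vector in count_vectors.items()
}


def nearest_neighbor(query, vectors):
    """Return the name of the instance closest to `query` (Euclidean)."""
    others = {name: v for name, v in vectors.items() if name != query}
    return min(others, key=lambda name: math.dist(vectors[query], others[name]))


neighbor_binary = nearest_neighbor("a", binary_vectors)
neighbor_count = nearest_neighbor("a", count_vectors)
```

With binary features, `a` and `b` are indistinguishable, so `b` is the nearest neighbor of `a`. With count features, the large count in `b` pushes it away and `c` becomes the nearest neighbor, so a 1-NN prediction for `a` could change purely because of the feature creation strategy.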

In this chapter, we compare a set of different strategies for creating features from types and relations in Linked Open Data. We compare those strategies on a number of different datasets and across different tasks, i.e., classification, regression, and outlier detection.

The work presented in this chapter has been published before as: “Petar Ristoski, Heiko Paulheim: Feature selection in hierarchical feature spaces. Proceedings of the 17th International Conference on Discovery Science, Bled, Slovenia, October, 2014.” [253].

5.1 Strategies

When creating features for a resource, we take into account its relations to other resources. We distinguish strategies that use the object of specific relations, and strategies that only take into account the presence of relations as such.
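The distinction between the two families can be sketched as follows. This is an illustration under assumptions rather than the implementation evaluated in this chapter: the graph is again assumed to be a list of (subject, predicate, object) tuples, `rdf:type` is assumed as the example of a specific relation, and the function names and feature naming scheme are hypothetical.

```python
def specific_relation_features(triples, subject, specific=("rdf:type",)):
    """Features built from the objects of selected (specific) relations."""
    return {
        f"{p}={o}"
        for s, p, o in triples
        if s == subject and p in specific
    }


def relation_presence_features(triples, subject):
    """Features that only record which relations the resource has."""
    return {f"rel:{p}" for s, p, _ in triples if s == subject}


# Toy graph with made-up identifiers.
toy_graph = [
    ("ex:Film1", "rdf:type", "ex:Movie"),
    ("ex:Film1", "rdf:type", "ex:Work"),
    ("ex:Film1", "ex:director", "ex:PersonC"),
]

by_object = specific_relation_features(toy_graph, "ex:Film1")
by_presence = relation_presence_features(toy_graph, "ex:Film1")
```

In the first case each distinct object of the chosen relation becomes its own feature, while in the second case the objects are ignored and only the relation names contribute features.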