5.5 Evaluations of Language–Independent Web Genre Detection
5.5.6 Evaluation using restricted linguistic features
As related work (e.g. [168]) shows, linguistic features boost the performance of web genre detection significantly. In a multilingual corpus, however, this is problematic because the resources are written in different languages. Without resorting to an approach that maps genre–typical feature terms across languages, linguistic features are not applicable to such a corpus. However, if the language used in the content of a web resource is distinguished from the language used in its technical framework, there is a chance to improve the results presented above.
Forum, blog and wiki software is often open source and is therefore developed by a community. Although the developers are often not native English speakers, they collaborate in the lingua franca of the web, which is English. Consequently, the source code and the documentation are often written in English. In desktop applications, this is not necessarily visible to the user of the software, but in web applications, some of the internals are exposed through the HTML markup served to the browser. In particular, the classes and IDs of HTML elements (cf. section 4.2) often expose semantic information about the content they mark up. For example, a common characteristic of blog applications is that the class names for marking up comments contain the terms comment, post or author. Further, the URLs of web resources are often self–describing with respect to the genre. This can be utilized to provide restricted linguistic features to an approach for detecting the web genre of such a web resource.
The following closed word sets were identified on exemplary web resources that are not in the corpus and subsequently added as feature groups to the feature set described above:
Class and ID names are used to apply style or functionality to HTML tags. Examples for this feature group are comment, header, footer, navbar, blogroll, edit and archive that designate certain functionality in a page (e.g. denoting the place for a list of friends’ blogs like blogroll or marking paragraphs in a wiki as editable).
Domain names often hint at the web genre of a resource. For example, board and forum are quite distinctive for forums, whereas the sub–string wiki usually indicates that the web genre is a wiki.
URL sub–strings may indicate the use of a certain web genre. The string thread, for example, hints towards forums, whereas the string post is usually found in a blog.
Link sub–strings are similar to URL sub–strings, except that they denote the genre a web resource links to instead of the web genre of the resource itself. This feature group is motivated by the hypothesis that a blog will link to other blog pages more often than to e.g. forum or wiki pages.

In total, these feature groups add 196 features to the classification, which is approximately 1.2 times the size of the original feature set. As they are not computationally expensive to collect, they represent a light–weight addition to the features in comparison to full–fledged, heavy–weight linguistic features like POS tagging.

   a     b     c     d     e    ← classified as
 190     6     2     2     0    a = BSP
  19   177     2     1     1    b = BSP
   9     2   188     0     1    c = WP
   7     0     2   187     4    d = FSP
   6     3     0     4   187    e = FTP
0.82  0.94  0.97  0.96  0.97    Precision
0.95  0.89  0.95  0.94  0.94    Recall
0.88  0.91  0.95  0.95  0.95    F–Measure
Accuracy: 92.9%
3G Accuracy: 96.2%

Table 5.9: Confusion Matrix for Evaluation including selected linguistic features
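The four feature groups described above could be collected roughly as in the following sketch. It is not the implementation used in this thesis: the word sets contain only the example terms named in the text (the full closed word sets are not reproduced here), the function and feature names are hypothetical, and counting sub–string matches is an assumed feature encoding.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Abbreviated word sets taken from the examples in the text only;
# the complete closed word sets (196 features) are not known here.
CLASS_ID_TERMS = ["comment", "header", "footer", "navbar", "blogroll", "edit", "archive"]
DOMAIN_TERMS = ["board", "forum", "wiki"]
URL_TERMS = ["thread", "post"]
# Assumption: link sub-strings reuse the URL and domain terms.
LINK_TERMS = ["thread", "post", "wiki", "forum"]


class MarkupCollector(HTMLParser):
    """Collects class/id attribute values and link targets from HTML markup."""

    def __init__(self):
        super().__init__()
        self.names = []  # individual class and id names
        self.links = []  # href values of <a> elements

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if value is None:
                continue
            if key in ("class", "id"):
                self.names.extend(value.lower().split())
            elif key == "href" and tag == "a":
                self.links.append(value.lower())


def count_matches(terms, strings, prefix):
    """For each term, count how many of the strings contain it as a sub-string."""
    return {f"{prefix}:{term}": sum(term in s for s in strings) for term in terms}


def restricted_linguistic_features(url, html):
    """Return a dict of feature name -> count for one web resource."""
    collector = MarkupCollector()
    collector.feed(html)
    parsed = urlparse(url.lower())
    features = {}
    features.update(count_matches(CLASS_ID_TERMS, collector.names, "class_id"))
    features.update(count_matches(DOMAIN_TERMS, [parsed.netloc], "domain"))
    features.update(count_matches(URL_TERMS, [parsed.path + "?" + parsed.query], "url"))
    features.update(count_matches(LINK_TERMS, collector.links, "link"))
    return features
```

Because only the markup and the URLs are inspected, the content language of the page does not influence these features, which is the property the restricted linguistic features rely on.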
Table 5.9 presents the confusion matrix. The results are considerable, with 92.9% accuracy and 96.2% 3G accuracy. These results are significantly better than the results without the restricted linguistic features (t(0.99; 198) = 99.13). When the newly introduced features are ranked by Information Gain, the link sub–string feature group in particular proves promising, supporting the hypothesis that web resources primarily link to other web resources of the same web genre.
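The relation between the confusion matrix and the two reported accuracy values can be retraced with a short calculation. The sketch below rests on one assumption that is not stated in this section: the 3G accuracy is obtained by merging the two blog sub–genres (classes a and b) and the two forum sub–genres (classes d and e) into their superordinate genres, leaving three genres. The function names are hypothetical.

```python
# Confusion matrix from Table 5.9 (rows: actual class, columns: predicted class).
MATRIX = [
    [190, 6, 2, 2, 0],
    [19, 177, 2, 1, 1],
    [9, 2, 188, 0, 1],
    [7, 0, 2, 187, 4],
    [6, 3, 0, 4, 187],
]

# Assumption: a and b are blog sub-genres, c is wiki, d and e are forum sub-genres.
GROUPS = [0, 0, 1, 2, 2]


def accuracy(matrix):
    """Fraction of instances on the main diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def merge_classes(matrix, groups):
    """Collapse rows and columns of sub-genres into their superordinate genre."""
    size = max(groups) + 1
    merged = [[0] * size for _ in range(size)]
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            merged[groups[i]][groups[j]] += count
    return merged


def precision(matrix):
    """Per-class precision: diagonal entry divided by its column sum."""
    return [matrix[j][j] / sum(row[j] for row in matrix) for j in range(len(matrix))]


def recall(matrix):
    """Per-class recall: diagonal entry divided by its row sum."""
    return [matrix[i][i] / sum(matrix[i]) for i in range(len(matrix))]


if __name__ == "__main__":
    print(f"Accuracy:    {accuracy(MATRIX):.1%}")
    print(f"3G Accuracy: {accuracy(merge_classes(MATRIX, GROUPS)):.1%}")
```

Under this grouping, 929 of the 1,000 instances lie on the diagonal of the original matrix and 962 on the diagonal of the merged matrix, which matches the reported 92.9% and 96.2%.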
If the MP genre is taken into account, the accuracy drops to 87.4% (89.6% for 3G accuracy). This is still a 10% improvement in comparison with the evaluation presented in section 5.5.4 and shows that LIGD is applicable to real–world scenarios.
5.6 Conclusions
In this chapter, an approach to automatically detect the genre of a web resource has been presented in order to recognize the genres blog, wiki and forum. In ELWMS.KOM this information can be used as metadata, which helps learners to create a consistent vocabulary in their resource organization and thereby facilitates both the structuring and the retrieval process. LIGD draws on traditional features from related work, but also introduces novel features that serve to distinguish the web genres by their structure and not by the terminology used. The latter are based on the hierarchy of the HTML markup and do not demand knowledge of the language of the HTML content. Therefore, LIGD is language independent. Further, a corpus has been presented that encompasses 1,000 instances of multilingual resources of the above–mentioned genres, including pages from major providers like Blogspot¹⁴ as well as non–standard and custom blog applications. This shows that LIGD works with different web applications and systems. In the evaluation, accuracies of up to 94.3% of correctly classified instances were obtained (89.6% if the exact sub–genres of the superordinate web genres blog and forum are of interest). Further, these results can be enhanced by introducing linguistic features that depend on the language of the technical scaffold of the respective system, resulting in accuracies of up to 96.2%. Additionally, several other evaluations have been performed to show the benefits of LIGD.
LIGD has some advantages over related work, such as the independence of the web resource’s language. Further, reasonable results were obtained with only a small set of 144 features. Other approaches, particularly those making use of linguistic analysis, often have several thousand features [118], as they use a possibly large number of closed–class word sets. Thus, the limited number of features in LIGD reduces the computational complexity of the actual classification task significantly.
With the presented accuracies, LIGD is reliable enough to be used in a system like ELWMS.KOM for determining whether a web resource belongs to one of the targeted genres. However, the use of LIGD is not restricted to ELWMS.KOM: it has also been applied in a Community Mining scenario [61] for identifying a resource’s web genre, segmenting the web resource and classifying the content types of the fragments. In such a setting, it is most useful as a complementary pre–processing step to the segmentation approach presented in chapter 4.
14 http://blogspot.com/, retrieved 2011-02-17
6 Supporting Self–Regulated Learning
In self–directed RBL settings using web resources, learners are usually not guided by a teacher or a tutor. Further, as web resources are usually not intended to be used for learning (e.g. weblog posts, wiki articles or community pages), they are not didactically structured and thus rarely provide the guidance that learners need. Additionally, the availability of a dedicated LO that covers the learner’s specific information need cannot be guaranteed. Hence, an application like ELWMS.KOM that aims at supporting RBL needs to compensate for this lack of direction by enabling the learners themselves to assume the role of the organizer of their learning processes. This involves supporting goal setting, planning the learning process, self–monitoring and reflection, and eventually the modification of a sub–optimal process step. Such support needs to affect all processes of RBL that are executed in the personal context of a learner (cf. figure 6.1).

[Diagram labels: SRL: Goal Setting, Planning, Self-monitoring and Reflecting, Modification. RBL: Task / Information Need, Searching, Annotating and Organizing, Utilization, Sharing and Distributing.]

Figure 6.1: Supporting the principles of Self–Regulated Learning benefits all processes of Resource–Based Learning in the learner’s personal context.
The theory of Self–Regulated Learning provides a framework for giving exactly this support, postulating that learners have to execute the metacognitive processes of setting learning goals, planning and monitoring their learning process and finally reflecting on it in order to readjust their procedure for the next learning episodes. This chapter describes an extension to ELWMS.KOM that accommodates principles of Self–Regulated Learning (SRL) by supporting the above–mentioned learner processes.
6.1 Introduction
Major challenges for self–directed learners consist of stating their information needs, formulating search queries, estimating the relevance of found resources, filtering irrelevant resources and keeping track of the state of the search process, i.e. monitoring their progress [7, 13]. These processes require a high level of learner competence in self–organization and self–motivation, as a deep information search in the context of learning is not trivial. They are covered by the theory of Self–Regulated Learning (SRL). Central to this theory is the notion that learning is a process that is self–directed and needs regulation on the learner’s side [6].
In the context of this thesis, RBL encompasses this style of learning. As shown in chapter 2, self–directed learners usually identify their information need autonomously and proceed to cover relevant information by searching on the web or in dedicated digital libraries. Thus, SRL is applicable to learning settings like the one presented and should be supported in such a self–directed learning process.
6.1.1 Structure of this Chapter
In this chapter, additions to ELWMS.KOM that address the above–mentioned challenges are presented. Section 6.2 presents a basic overview of the theory of SRL, which adequately reflects this self–directed process of learning with web resources. Further, the term scaffolds, which denotes support of this process, is explained. The design and implementation of additions to ELWMS.KOM that enable learners to set learning goals prior to an internet search and to assign relevant web resources to these goals are described in section 6.3. The goal–setting component has been implemented for ELWMS.KOM, which is an add–on for the web browser Firefox, as web browsers are the gateway to most information on the web. Section 6.4 presents two studies and evaluations of ELWMS.KOM showing the benefits of supporting the process phases of SRL, and section 6.5 concludes with a short summary and an outlook.