Training a classifier for web genres requires a corpus of correctly (manually) classified example instances, both to learn from and to validate the selected features, i.e. to check whether they are appropriate and distinctive.
To evaluate LIGD, a corpus is needed that contains samples of all web genres taken into account (blogs, forums, wikis). To emulate realistic conditions of use, this corpus is required to include
1. up–to–date, real–world HTML (as opposed to the corpora of older approaches, some of which include only table–based layouts).
2. web resources composed in different languages.
3. content from different authoring applications (e.g. in the case of blogs, Wordpress⁴ and Blogger.com⁵ as well as less–used applications), as they represent and structure content in different ways.
Although the corpora of some related approaches (notably Meyer zu Eissen et al. [132] and Santini [168]) are available for research, they do not meet the requirements for LIGD: they are outdated, feature only English web resources, or do not match the selection of web genres. Thus, a novel corpus has been built⁶.
Superordinate genre   Sub–genre                 Description
-------------------   -----------------------   -------------------------------------------------------------------
Blog                  Blog Page (BSP)           The start page of a blog, typically featuring multiple post teasers
Blog                  Blog Post (BPP)           The view of a single post with accompanying comments
Forum                 Forum Page (FSP)          Typically an overview of topics or threads
Forum                 Forum Thread (FTP)        A thread with multiple comments to the lead post
Wiki                  Wiki Page (WP)            A wiki page containing one article
Miscellaneous         Miscellaneous Page (MP)   A page that belongs to none of the web genres above
Table 5.1: Overview of all sampled genres and their respective sub–genres. Note that Miscellaneous does not constitute a genre of its own.
After a preliminary analysis of the targeted web genres, the blog and forum corpora were split into sub–genres to reflect the structural diversity of the page types within each genre. Table 5.1 gives an overview of the resulting web genres. The following sections describe in detail how the web resources were acquired for each web genre.
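The genre scheme of table 5.1 can be written down as a small lookup structure. The following is a minimal sketch, not the implementation used for LIGD; the names `SubGenre`, `SUPERORDINATE`, `superordinate_genre` and `is_targeted_genre` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class SubGenre(Enum):
    """Sub-genre codes and names as listed in table 5.1."""
    BSP = "Blog Page"
    BPP = "Blog Post"
    FSP = "Forum Page"
    FTP = "Forum Thread"
    WP = "Wiki Page"
    MP = "Miscellaneous Page"


# Mapping from each sub-genre to its superordinate genre (table 5.1).
SUPERORDINATE = {
    SubGenre.BSP: "Blog",
    SubGenre.BPP: "Blog",
    SubGenre.FSP: "Forum",
    SubGenre.FTP: "Forum",
    SubGenre.WP: "Wiki",
    SubGenre.MP: "Miscellaneous",
}


def superordinate_genre(code: str) -> str:
    """Return the superordinate genre for a sub-genre code such as 'FTP'."""
    return SUPERORDINATE[SubGenre[code]]


def is_targeted_genre(code: str) -> bool:
    """Miscellaneous is not a genre of its own, so it is not a target genre."""
    return superordinate_genre(code) != "Miscellaneous"


if __name__ == "__main__":
    for sub in SubGenre:
        print(sub.name, "->", superordinate_genre(sub.name))
```

Keeping the mapping in one place makes it easy to evaluate a classifier either at sub–genre level or at the level of the superordinate genres.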
4 http://wordpress.org/, retrieved 2011-01-10
5 https://www.blogger.com/, retrieved 2011-01-10
6 However, the Meyer zu Eissen corpus is used for an evaluation in section 5.5.5 to show the limitations of LIGD concerning relatively unstructured web genres.
5.4.1 Blog Pages
There is a considerable structural difference between the start page of a blog and a page containing a specific blog post. Most Blog Start Pages (BSPs) display the most recent blog posts (often abbreviated or truncated as teasers) without comments and additional menus and sidebars, whereas the typical Blog Post Page (BPP) provides a single, full blog post, possibly with additional comments.
Initially, example BSPs were gathered by extracting the appropriate categories from the ODP⁷. This multilingual web site directory (founded in 1998, then bought by Netscape) follows the Open Content paradigm⁸ and is maintained by volunteers who monitor the quality of submitted links, thus avoiding spam sites and broken web links. ODP provides an RDF dump⁹ that contains all directory data and is freely available for download. The Resource Description Framework (RDF) [157] dump was parsed and all web links in categories like weblog were selected, resulting in 15,000 instances of BSPs composed in different languages and produced by different authoring applications (e.g. Wordpress, Blogger and others). These web resources were downloaded, and a link was automatically extracted from each page by inspecting its RSS–feed. Exactly one link was taken per page in order not to skew the representativeness of the BPP corpus. Finally, the linked HTML documents were downloaded, resulting in 11,800 BPP instances (77% of BSP instances). Some blogs did not have an RSS–feed, so their BPPs could not be extracted automatically. Nevertheless, great care was taken to include the major blog engines that were in the BSP corpus.
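The two automated steps described above (selecting links by category from the directory dump, then taking exactly one post link from each blog's feed) can be sketched as follows. This is an illustration under assumptions, not the tooling actually used: `SAMPLE_DUMP` is a simplified, hypothetical excerpt that only loosely imitates the structure of the ODP dump, `SAMPLE_FEED` is a made-up RSS 2.0 feed, and the function names are invented. Only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified dump excerpt (assumption: each page entry carries
# its URL in an "about" attribute and its category path in a <topic> child).
SAMPLE_DUMP = """<RDF xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://blog.example.org/">
    <topic>Top/Computers/Internet/On_the_Web/Weblogs/Personal</topic>
  </ExternalPage>
  <ExternalPage about="http://shop.example.com/">
    <topic>Top/Shopping/Books</topic>
  </ExternalPage>
  <ExternalPage about="http://weblog.example.net/">
    <topic>Top/World/Deutsch/Computer/Internet/Weblogs</topic>
  </ExternalPage>
</RDF>"""

# Made-up RSS 2.0 feed with two posts, newest first.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example blog</title>
  <link>http://blog.example.org/</link>
  <item><title>Newest</title><link>http://blog.example.org/2008/05/newest</link></item>
  <item><title>Older</title><link>http://blog.example.org/2008/04/older</link></item>
</channel></rss>"""


def local_name(tag):
    """Strip an XML namespace prefix of the form '{uri}name'."""
    return tag.rsplit("}", 1)[-1]


def select_links_by_category(dump_xml, keyword="weblog"):
    """Return URLs of pages whose category path contains the keyword."""
    links = []
    for element in ET.fromstring(dump_xml).iter():
        if local_name(element.tag) != "ExternalPage":
            continue
        topics = [child.text or "" for child in element
                  if local_name(child.tag) == "topic"]
        if any(keyword in topic.lower() for topic in topics):
            links.append(element.get("about"))
    return links


def first_post_link(feed_xml):
    """Return the link of the first feed item, or None if there is none."""
    item = ET.fromstring(feed_xml).find("./channel/item")
    if item is None:
        return None
    link = item.findtext("link")
    return link.strip() if link else None


if __name__ == "__main__":
    print(select_links_by_category(SAMPLE_DUMP))
    print(first_post_link(SAMPLE_FEED))
```

Taking only the first item per feed mirrors the one–link–per–blog rule, so that no single blog contributes more than one BPP. The sketch handles RSS 2.0 only; Atom feeds would need separate handling.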
5.4.2 Forum Pages
Depending on the specific forum application used, there are two distinctly structured page types in a forum. Forum Start Pages (FSPs) are the entry point, giving an overview of all the different forums, whereas Forum Thread Pages (FTPs) contain a first post and the following thread of discussion written by different members of the forum.
Scraping the ODP RDF dump for forum pages proved of little value due to a lack of reliable categorization. Therefore, FSPs were gathered by scraping the Big Boards website¹⁰ (an edited website directory tracking the most active message boards on the web in several languages), ensuring that different forum authoring applications and forums in different languages were included. In this way, 1,800 FSP instances were obtained. Taking these as starting points, and using an approach similar to the one used to extract the BPP corpus, 1,400 FTPs were gathered.
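Scraping a directory website for candidate start pages amounts to collecting the outbound links of its listing pages. The following minimal sketch shows one way to do this with the Python standard library; the sample HTML, the base URL and the names `LinkCollector` and `external_links` are hypothetical and do not reflect the actual markup of the directory used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href targets of all <a> elements, resolved to absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def external_links(html, base_url):
    """Return unique http(s) links that point away from the directory host."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    own_host = urlparse(base_url).netloc
    seen, result = set(), []
    for url in collector.links:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar targets
        if parsed.netloc == own_host or url in seen:
            continue  # skip internal navigation and duplicates
        seen.add(url)
        result.append(url)
    return result


# Hypothetical listing page of a forum directory.
SAMPLE_LISTING = """<ul>
  <li><a href="http://forum.example.org/">Forum A</a></li>
  <li><a href="/about.html">About this directory</a></li>
  <li><a href="http://board.example.net/index.php">Board B</a></li>
  <li><a href="http://forum.example.org/">Forum A again</a></li>
  <li><a href="mailto:admin@example.com">Contact</a></li>
</ul>"""

if __name__ == "__main__":
    print(external_links(SAMPLE_LISTING, "http://directory.example.com/list"))
```

Filtering out links to the directory's own host keeps navigation and help pages out of the candidate list.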
5.4.3 Wiki Pages
Wikis most often do not differentiate between a start page and arbitrary wiki pages, so the Wiki Page (WP) genre was not split up. As building a corpus from ODP data was not possible due to the lack of an appropriate category system, WPs were obtained by scraping the Wikiindex website¹¹ (3,500 wiki links) and the Wikiservice.at website¹² (650 wiki links). These two sites provide directories of known wiki communities. After all linked wiki pages had been extracted and inaccessible and obviously erroneous pages had been sorted out, 3,100 valid WP instances in different languages and provided by different applications were obtained.
7 http://www.dmoz.org, retrieved 2008-05-23
8 http://www.opencontent.org/opl.shtml, retrieved 2011-01-11
9 http://rdf.dmoz.org, retrieved 2008-04-12
10 http://rankings.big-boards.com/, retrieved 2008-04-20
11 http://www.wikiindex.org/index.php?title=Category:All, retrieved 2008-04-23
12 http://www.wikiservice.at/gruender/wiki.cgi?WikiVerzeichnis, retrieved 2008-04-28
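Sorting out inaccessible and erroneous pages can be illustrated with a simple filter over download results. The text does not state which criteria were applied, so everything in this sketch is an assumption made for illustration: the `FetchResult` record, the threshold `MIN_BODY_CHARS` and the three checks (successful status code, HTML content type, non-trivial body length).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FetchResult:
    """Outcome of one download attempt (hypothetical record layout)."""
    url: str
    status: Optional[int]  # HTTP status code, or None if the request failed
    content_type: str
    body: str


# Assumed threshold below which a page counts as obviously erroneous.
MIN_BODY_CHARS = 200


def is_valid_instance(result: FetchResult) -> bool:
    """Keep only pages that were reachable, are HTML and are not near-empty."""
    if result.status != 200:
        return False
    if not result.content_type.lower().startswith("text/html"):
        return False
    return len(result.body.strip()) >= MIN_BODY_CHARS


def filter_valid(results: List[FetchResult]) -> List[FetchResult]:
    """Return the subset of download results accepted as corpus instances."""
    return [r for r in results if is_valid_instance(r)]


if __name__ == "__main__":
    page = "<html><body>" + "wiki article text " * 20 + "</body></html>"
    results = [
        FetchResult("http://wiki.example.org/Main", 200, "text/html; charset=utf-8", page),
        FetchResult("http://gone.example.org/", 404, "text/html", page),
        FetchResult("http://down.example.org/", None, "", ""),
        FetchResult("http://files.example.org/a.pdf", 200, "application/pdf", page),
        FetchResult("http://empty.example.org/", 200, "text/html", "<html></html>"),
    ]
    print([r.url for r in filter_valid(results)])
```

In practice, a manual check would still be needed for pages that pass such automatic filters but show only an error message or a parked domain.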
5.4.4 Miscellaneous Pages
In real–world settings, it is important to know not only which of the targeted web genres a web resource follows but also whether it belongs to any of them at all. Therefore, another class of web resources was added to the corpus: Miscellaneous Pages (MPs), which do not belong to any of the three targeted web genres. Kennedy and Shepherd [100] call this genre noise, as it differs from all other genres they take into account. For this collection, 347 web resources belonging to multiple genres were manually sampled based on a list of genres (see appendix section C.2). The obtained web resources are very heterogeneous in length, structure and style of language. Additionally, the selected web resources are written in different languages.
These MP instances are not included in the web genre corpus but are added for separate evaluations presented in sections 5.5.4 and 5.5.6.