There is no single strategy for identifying the complete set of cleanup tags. Cleanup tags are realized by means of templates, which are special Wikipedia pages that can be included in other pages. Although templates can be distinguished from other pages by their namespace (the prefix “Template:” in the page title), there is no dedicated qualifier that separates templates implementing cleanup tags from other templates. A complete manual inspection is infeasible, as Wikipedia contains more than 450 000 different templates.6 We hence employ a two-step approach to compile the set of cleanup tags automatically:
1. an initial set of cleanup tags is extracted from two meta sources within Wikipedia, and
2. the initial set is further refined by applying several filtering substeps.
6Wikistats: Wikimedia Statistics, “Database records per namespace,” last modified September 21, 2012, http://stats.wikimedia.org/EN/TablesWikipediaEN.htm#namespaces.
Step 1: Extraction
We exploit two sources within Wikipedia containing meta information about cleanup tags. The first source that we employ is the Wikipedia administration category “Category:Cleanup templates”, which comprises templates that are used for tagging pages as requiring cleanup. The category also has several subcategories to further organize the cleanup tags by their usage, e.g., inline cleanup templates or cleanup templates for WikiProjects. The page titles of those templates linking to the category or some subcategory are obtained from the local Wikipedia database, using the tables categorylinks and page (cf. Table 2.2), which results in 437 different cleanup tags.
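The category-based lookup can be illustrated with a small sketch. It is not the query used in this work: it builds a toy in-memory SQLite database whose tables mimic categorylinks and page, and the column names (cl_from, cl_to, page_id, page_namespace, page_title), the namespace numbers 10 and 14, and all sample rows are assumptions made for illustration.

```python
import sqlite3

TEMPLATE_NS = 10   # assumed MediaWiki namespace number for "Template:"
CATEGORY_NS = 14   # assumed MediaWiki namespace number for "Category:"


def build_toy_db():
    """Create a tiny in-memory stand-in for the local Wikipedia database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (page_id INTEGER PRIMARY KEY,
                           page_namespace INTEGER, page_title TEXT);
        CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
    """)
    # Hypothetical sample rows, not real database content.
    conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
        (1, CATEGORY_NS, "Cleanup_templates"),
        (2, CATEGORY_NS, "Inline_cleanup_templates"),
        (3, TEMPLATE_NS, "Unreferenced"),
        (4, TEMPLATE_NS, "Citation_needed"),
        (5, TEMPLATE_NS, "Infobox_person"),
        (6, 0, "Some_article"),
    ])
    conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", [
        (2, "Cleanup_templates"),         # subcategory of the root category
        (3, "Cleanup_templates"),         # template directly in the root
        (4, "Inline_cleanup_templates"),  # template in a subcategory
        (5, "People_infobox_templates"),  # unrelated template
        (6, "Cleanup_templates"),         # an article, not a template
    ])
    return conn


def templates_in_category_tree(conn, root):
    """Collect template titles linked to `root` or any of its subcategories."""
    seen, queue, templates = {root}, [root], set()
    while queue:
        category = queue.pop()
        rows = conn.execute(
            "SELECT p.page_namespace, p.page_title "
            "FROM categorylinks cl JOIN page p ON p.page_id = cl.cl_from "
            "WHERE cl.cl_to = ?", (category,)).fetchall()
        for namespace, title in rows:
            if namespace == TEMPLATE_NS:
                templates.add(title)
            elif namespace == CATEGORY_NS and title not in seen:
                seen.add(title)       # guard against category cycles
                queue.append(title)
    return templates


tags = templates_in_category_tree(build_toy_db(), "Cleanup_templates")
print(sorted(tags))
```

The traversal keeps a set of visited categories because category graphs may contain cycles. Pages outside the template namespace are ignored even if they link to the category.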
The second source is the meta page “Wikipedia:Template messages/Cleanup”, which comprises a manually maintained listing of templates that may be used to tag pages as needing cleanup. From a technical point of view, the page is a composition of several pages (transclusion principle). For each of these pages, the content of the revision from the snapshot time is retrieved using the MediaWiki API7. A total of 286 different cleanup tags are extracted from the wiki markup of the retrieved pages using regular expressions. Merging the findings from both sources gives 530 different cleanup tags.
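The extraction from wiki markup and the subsequent merge can be sketched as follows. The regular expressions actually used are not given in the text, so the pattern below is an assumption: it only recognizes template links written as {{tl|Name}} or {{tlx|Name|...}}, and the sample markup and the function names are hypothetical.

```python
import re

# Assumed pattern: template links of the form {{tl|Name}} or {{tlx|Name|...}}.
TEMPLATE_LINK = re.compile(
    r"\{\{\s*tlx?\s*\|\s*([^|{}]+?)\s*(?:\|[^{}]*)?\}\}", re.IGNORECASE)


def normalize(title):
    """Normalize a template title: underscores to spaces, first letter upper case."""
    title = title.strip().replace("_", " ")
    return title[:1].upper() + title[1:]


def extract_tags(markup):
    """Return the set of template titles referenced in the given wiki markup."""
    return {normalize(m.group(1)) for m in TEMPLATE_LINK.finditer(markup)}


# Hypothetical markup of a listing page.
sample_markup = """
* {{tl|unreferenced}} for articles without sources
* {{tlx|Citation needed|date=May 2012}}
* {{Tl| Orphan }}
[[Not a template]]
"""

listing_tags = extract_tags(sample_markup)
category_tags = {"Unreferenced", "Cleanup"}   # stand-in for the first source

# Merging both sources is a set union, so shared tags are counted once.
initial_set = category_tags | listing_tags
print(sorted(initial_set))
```

Because the merge is a union, the size of the merged set is smaller than the sum of both sources whenever they overlap, which is consistent with 437 and 286 tags merging into 530.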
Step 2: Refinement
We apply the following filtering substeps to the initial set of cleanup tags:
Redirect resolving. A cleanup tag may have several alternative titles linking to it through redirects. For example, the tag Unreferenced has the redirects Unref, Noreferences, and No refs among others. We resolve all redirects using the tables redirect and page of the local Wikipedia database (cf. Table 2.2).
Subtemplate removal. We discard particular subtemplates, namely experimental pages and documentation pages. Experimental pages are identified by the suffixes “/sandbox” and “/testcases” in the page title and are used for testing purposes only. Documentation pages are identified by the suffix “/doc” and provide a template description.
Meta-template removal. We discard templates that are solely used as building blocks, for instance, to instantiate other templates with a particular parameterization. The two Wikipedia categories “Category:Wikipedia metatemplates” and “Category:Wikipedia substituted templates” are used to identify these templates. Moreover, we discard templates that implement technical features (categories “Category:Search templates” and “Category:Maintenance navigation”) as well as templates that are used for documentation and testing purposes (category “Category:Template namespace templates”).
7The MediaWiki Web service API provides direct access to the Wikipedia databases. For further information, refer to Section 3.1.1 or to http://www.mediawiki.org/wiki/API.
Altogether we collect a set of 458 cleanup tags.
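The three filtering substeps can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the implementation used in this work: redirects are given as a plain dictionary rather than read from the redirect and page tables, the members of the excluded categories are given as a precomputed set, and the function names and all sample titles other than Unreferenced and its redirects are hypothetical.

```python
# Suffixes of experimental and documentation subtemplates, as named in the text.
EXCLUDED_SUFFIXES = ("/sandbox", "/testcases", "/doc")


def resolve_redirects(titles, redirects):
    """Replace each title by its redirect target, following redirect chains."""
    resolved = set()
    for title in titles:
        seen = set()
        # Follow the chain; `seen` guards against redirect loops.
        while title in redirects and title not in seen:
            seen.add(title)
            title = redirects[title]
        resolved.add(title)
    return resolved


def refine(titles, redirects, excluded_members):
    """Apply redirect resolving, subtemplate removal, and category exclusion."""
    resolved = resolve_redirects(titles, redirects)
    return {
        title for title in resolved
        if not title.endswith(EXCLUDED_SUFFIXES)
        and title not in excluded_members
    }


# Hypothetical initial set; only the Unreferenced redirects come from the text.
initial_set = {"Unref", "Noreferences", "Unreferenced", "Unreferenced/doc",
               "Cleanup/sandbox", "Example metatemplate", "Orphan"}
redirects = {"Unref": "Unreferenced", "Noreferences": "Unreferenced",
             "No refs": "Unreferenced"}
# Stand-in for the members of the excluded categories.
excluded_members = {"Example metatemplate"}

print(sorted(refine(initial_set, redirects, excluded_members)))
```

Resolving redirects first means that alternative titles collapse to a single canonical tag before the remaining filters are applied, so each tag is counted once in the final set.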
Discussion
To evaluate our mining approach, we manually inspected the documentation pages of the 458 templates that have been identified as cleanup tags. A documentation page gives information about purpose, usage, and scope of a template. Consider for example the tag Unreferenced (shown in Figure 2.1). The respective documentation page states that a tagged article “does not cite any references or sources” and that citations to reliable sources should be added in order to improve the article. This tag indeed relates to a particular cleanup task, and, since the verifiability of information is one of Wikipedia’s core content policies (cf. Section 1.4.2), it defines a quality flaw. Our evaluation reveals that of the 458 templates 445 are actually cleanup tags, and hence, define a particular quality flaw. The analyses in this thesis are based on the 445 quality flaws, which are listed in Appendix A.
The remaining twelve templates are listed in Table 2.3. None of them can be considered a proper cleanup tag. The first ten templates in Table 2.3 are specific meta-templates that implement technical features. The template Geodata-check does not produce any output, and hence, it cannot be considered a tag. The last template in Table 2.3 is a kind of placeholder that needs to be replaced by the respective English cleanup tags. The twelve templates are identified by our mining approach because they are assigned to the category “Category:Cleanup templates” (or to some subcategory respectively; see Step 1). However, the category’s documentation page states: “This is a category of templates used for marking articles as requiring cleanup.”8 The twelve templates are not cleanup tags, and hence, their assignment to this category is incorrect. We initiated discussions on the respective talk pages of the twelve templates in order to correct the wrong assignments in Wikipedia.9 Note that once the miscategorizations are corrected, our approach is able to identify the set of cleanup tags without any manual intervention.
Our mining approach does not guarantee completeness though, since the true set of cleanup tags is unknown in general. However, from a quantitative point of view we are confident that we identify the most common cleanup tags, and hence, the most important quality flaws.
8Wikipedia, “Category:Cleanup templates,” last modified November 7, 2012, http://en.wikipedia.org/wiki/Category:Cleanup_templates.
9
Table 2.3: The twelve templates that are identified by our mining approach but are not actually cleanup tags. This issue is due to the incorrect assignment of the templates to the Wikipedia category “Category:Cleanup templates”.
| Template name | Template description |
|---|---|
| Reflist-talk | Shows a reference section for a talk page discussion within a bordered box. |
| Cleanup template documentation see also section generic list | Produces a section containing a list of certain cleanup tags. |
| Edit | Creates an “edit” link. |
| Editlink | Creates an “edit this section” link. |
| Editlink-right | Creates a right-aligned “edit” link. |
| Tagged | Creates a tag for the user page of a user who tagged a page but left no justification on the respective talk page. |
| Introduction cleanup maintenance templates | Produces a navbox containing a listing of certain cleanup tags. |
| Details removed | Produces an inline statement stating that personal information has been removed to protect the user’s privacy. |
| Postchronicle | Produces an inline statement stating that a link to a Post Chronicle has been removed. |
| Moveoptions | Is used to substitute other content. |
| Geodata-check | Tagging of geodata which needs further checking or correction. No output other than adding the tagged page to the category “Category:Pages requiring geodata verification”. |
| ... | Originates from the French Wikipedia and is replaced by Empty section or Expand section in the English language edition. (The template name is composed of three dots.) |