2.2 Website Boundary Detection (WBD)
2.2.2 Web page features
This subsection reviews related work on the features that have been used to model web pages. This is important with respect to this thesis because the features used to model web pages directly affect the measure of similarity between two pages. The notion of similarity used will, in turn, influence which pages are identified as lying within a website boundary.
A common approach to representing web pages is to borrow methods from text mining: a web page is modelled as a text document, and this document model is then used to determine the similarity between a set of web pages [50, 49, 135, 128]. Despite the application of text mining approaches, web pages can be distinguished from traditional text collections [156] for the following reasons:
• The web contains pages that are unstructured or semi-structured, with inconsistent formatting.
• Web pages usually contain markup code that is rendered visually for users.
• Pages on the web exist within a web graph of rich hyperlinks pointing to and from various pages.
The points outlined above distinguish web pages from traditional text documents. In summary, there are many more elements that make up a web page than are contained in a text document. Therefore a further investigation into additional features to represent web pages is justified [156]. The additional attributes that distinguish web pages are usually contained “on-page” and encoded in the web page's content in some way. A review of features that have been used in the literature for web page modelling is presented below.
Text The text of a web page is the most straightforward feature to use when modelling web pages, hence the common application of traditional text mining techniques such as the bag of words model. The noise that is associated with data from the web can often hinder these techniques [156]. The n-gram technique, which can be used to capture concepts and/or phrases, has also been applied to web pages [141]. A drawback is that the n-gram method usually creates a higher dimensional feature space than the bag of words method. High dimensional data can introduce additional problems, particularly when using traditional similarity measures with clustering algorithms [110]. Therefore further techniques, for example feature selection [156], need to be applied to reduce the dimensions of the feature space.
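The difference in feature space size between the two text models can be illustrated with a minimal sketch. The two sample "pages", the whitespace tokenizer and the function names below are assumptions made for illustration only; they are not taken from the cited work.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(text):
    """Count how often each single word occurs in the text."""
    return Counter(tokenize(text))


def word_ngrams(text, n=2):
    """Count contiguous word n-grams, which preserve short phrases."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical page texts, used only to compare vocabulary sizes.
pages = [
    "web page features for web mining",
    "web mining uses page features",
]

unigram_vocab = set()
bigram_vocab = set()
for page in pages:
    unigram_vocab.update(bag_of_words(page))
    bigram_vocab.update(word_ngrams(page, 2))

print(len(unigram_vocab), len(bigram_vocab))
```

Even on this tiny sample the bigram vocabulary is larger than the word vocabulary, and the gap typically widens as the collection grows, which is why some form of dimensionality reduction is usually applied afterwards.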
URL Arguably the most prominent feature of a web page is its Uniform Resource Locator (URL). The feature is typically used for web navigation purposes [34]. This feature is not strictly encoded in the web page’s content, but has to be known in order to request web page content. The main elements making up a standard URL are: scheme, domain, subdomains, directory path and query. For example, the URL http://www.my.examples.com/example?query_string is made up of an http scheme, the domain examples.com, the subdomain my, the path example and the query query_string.
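The decomposition of the example URL can be reproduced with Python's standard `urllib.parse` module. This is a sketch under a stated assumption: the registered domain is taken to be the last two host labels, which holds for examples.com but not for suffixes such as .co.uk. The function name `url_elements` is hypothetical. Note that the sketch reports `www` as a host label alongside `my`.

```python
from urllib.parse import urlsplit


def url_elements(url):
    """Split a URL into the elements discussed in the text."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".") if parts.hostname else []
    # Assumption: the last two labels form the registered domain.
    domain = ".".join(labels[-2:])
    subdomains = labels[:-2]
    return {
        "scheme": parts.scheme,
        "domain": domain,
        "subdomains": subdomains,
        "path": parts.path,
        "query": parts.query,
    }


elements = url_elements("http://www.my.examples.com/example?query_string")
print(elements)
```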
Web pages have been classified successfully using features constructed from the various elements making up a URL. Features based on the domain, subdomain and directory structure have been used [28]. There are also methods that extract features by segmenting the URL into its elements using delimiters.
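Delimiter-based segmentation can be sketched as follows. The choice of delimiter set (every non-alphanumeric character) and the function name `url_tokens` are assumptions for illustration; the cited methods may use different delimiters or additional processing.

```python
import re


def url_tokens(url):
    """Segment a URL into lowercase tokens at every non-alphanumeric character."""
    return [token for token in re.split(r"[^A-Za-z0-9]+", url.lower()) if token]


tokens = url_tokens("http://www.my.examples.com/example?query_string")
print(tokens)
```

The resulting tokens can then be treated like words in a bag of words model, so that URLs sharing path or host tokens yield similar feature vectors.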
Meta data In [90] the meta data that is sometimes embedded in web pages was used as a feature. Web pages were represented using title, headings and meta data alongside the main body text. Each of these features can be extracted, if available, from the HTML markup of a web page. The meta data features can include keywords or even a description of the content of a web page. The authors conclude that a combination of the features should be used in order to gain the best results in terms of web page classification [90].
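A minimal sketch of extracting the title, headings and meta data from HTML markup is given below, using Python's standard `html.parser` module. The class name `PageFeatureParser`, the feature dictionary layout and the sample page are assumptions for illustration; this is not the extraction procedure used in [90].

```python
from html.parser import HTMLParser


class PageFeatureParser(HTMLParser):
    """Collect title, headings and meta keywords/description from a page."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.features = {"title": "", "headings": [], "keywords": [], "description": ""}
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._current = tag
        elif tag in self.HEADING_TAGS:
            self._current = tag
            self.features["headings"].append("")
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content") or ""
            if name == "keywords":
                self.features["keywords"] = [k.strip() for k in content.split(",") if k.strip()]
            elif name == "description":
                self.features["description"] = content

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.features["title"] += data
        elif self._current in self.HEADING_TAGS:
            self.features["headings"][-1] += data


# Hypothetical page used only to exercise the parser.
html = """<html><head><title>Example page</title>
<meta name="keywords" content="web, mining, features">
<meta name="description" content="A small example page.">
</head><body><h1>Main heading</h1><p>Body text.</p><h2>Sub heading</h2></body></html>"""

parser = PageFeatureParser()
parser.feed(html)
parser.close()
features = parser.features
print(features)
```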
Styling tags In [116, 115] web pages are represented according to the presence of certain tags in the content. Some common features (title, headings, and meta data) were used in this work. However, additional features based on tags in the HTML content were also used. Web pages are segmented into features representing certain tags in the textual content. For example textual styling tags like bold, underline and strong may be extracted from a web page [116]. A weighting scheme is used to distinguish the significance of the tags used. If a tag is used very often, the assumption is that it is not used for emphasis, thus it is deemed unimportant, and the weighting is reduced [115].
Content and Structure The work described in [69, 68, 67] uses a hybrid approach to representing features from a web page which draws upon both the content and the internal structure of web pages. The content is represented by features constructed using keywords from the text. The structural properties are used to create features using page anchors. Page anchors are used as links “jumping” the user to various points in the same web page. This type of feature representation produces a small dimensional feature space, and does not involve the complexity associated with modelling the hyperlink structure [67].
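The page anchors mentioned above can be collected with a short sketch. It assumes that an in-page anchor is an `a` element whose `href` starts with `#`; the class name `AnchorCollector` and the sample markup are hypothetical, and the cited work may define or weight anchors differently.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the targets of links that jump within the same page."""

    def __init__(self):
        super().__init__()
        self.in_page_anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Assumption: "#name" links point to a location in the same page.
            if href.startswith("#") and len(href) > 1:
                self.in_page_anchors.append(href[1:])


collector = AnchorCollector()
collector.feed(
    '<a href="#intro">Intro</a>'
    '<a href="http://www.example.com/">Elsewhere</a>'
    '<a href="#results">Results</a>'
)
anchors = collector.in_page_anchors
print(anchors)
```

Because only same-page links are kept, the feature set stays small and no crawl of the surrounding hyperlink graph is needed.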
User behaviour In [169] numerous sets of common features are used to model web pages, in particular in combination with users' navigation behaviour. URL-based features are used as in previous work, but are constructed based on the depth of URL paths. The “type” of a web page is also used: a feature is constructed using the file extension, which can indicate the type of resource at a URL, for example .pdf or .jpg. Features are constructed using the textual content in a page anchor (named entities, nouns and verbs). The web page is used to construct a DOM, which is then used to represent block-based features. Link-based features (incoming links) are also used. User behaviour features collected from browser toolbars are also used; examples include the number of page visits over a certain time period, the number of links clicked on each page, and pages visited from bookmarks or browser history. The features used, and collected, are therefore quite extensive. The research was conducted in part by Yahoo!, a major search engine based in the US. This work performs clustering on the features using a term-term co-occurrence model, which is created using a bag of words.
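Two of the URL-derived features described here, path depth and resource type, can be sketched with the standard library. The function names, the example URLs and the convention of counting every non-empty path segment (including the file name) towards the depth are assumptions for illustration, not the definitions used in [169].

```python
import posixpath
from urllib.parse import urlsplit


def url_depth(url):
    """Number of non-empty segments in the URL path."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


def resource_type(url):
    """Lowercase file extension of the URL path, or '' if there is none."""
    path = urlsplit(url).path
    _, extension = posixpath.splitext(path)
    return extension.lower().lstrip(".")


# Hypothetical URLs used only to exercise the two functions.
report_url = "http://www.example.com/docs/reports/2010/summary.pdf"
home_url = "http://www.example.com/"

print(url_depth(report_url), resource_type(report_url))
print(url_depth(home_url), resource_type(home_url))
```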
Other work There is other work that uses features based on the HTML structure of web pages [183]. These techniques concentrate on creating trees from the internal nested structure of web pages, and are less focussed on expressing similarity relationships between content in the pages. They have been used successfully in identifying spam on the web [181, 182]. There is also an area of research that represents web pages based on visual properties [109, 29]. These techniques are said to be expressed from a user-oriented point of view, as they derive features based on what is visually rendered. This is in contrast to the majority of the features presented above, where the objective is to encode content [156].