Web Mining - Website boundary detection via machine learning

Web mining is the task of discovering knowledge from web hyperlink structure, web content and web usage data [128]. Web mining can be thought of as the data mining step of the KDD process model when the knowledge discovery problem concentrates on data from the WWW [50, 49, 135, 128]. The specific techniques used to discover non-trivial patterns in web data are known as web mining techniques. These are commonly based on traditional data mining techniques, but not exclusively so, because of the specific challenges the WWW brings [128]. Web mining tasks can be categorised into three main topics:

Web structure mining techniques exploit the hyperlink (or link) structure between web pages to discover useful knowledge about the web.

Web content mining draws upon the information in the content of web pages.

Web usage mining typically discovers patterns from the user access data associated with a particular web service.

Each of these tasks has an associated set of techniques. It has been suggested that the three category view of web mining techniques could be represented by a merged list representing just two main categories; these are (1) web usage mining techniques, and (2) web content/structure mining techniques [58]. The two category representation is suggested because for many applications the techniques used draw upon more than one category to provide a solution.

The approaches presented in this thesis also draw on more than one category of web mining technique. The approaches to the WBD problem first draw upon isolated techniques from the web content (chapter 5) and web structure (chapter 6) mining categories. An approach is then presented which utilises techniques drawn from both the web structure and content mining categories (chapter 7).

Two sub-problems of web mining [105] that are related to the structures of the web can be broadly described as the problems of:

1. Generating new structures
2. Mining existing structures

With respect to the generation of new structures of the web, techniques use existing structural properties and content relationships. Examples include ranking algorithms such as PageRank [43], HITS [106], site rank and variations [186, 185]. These algorithms aim to supplement the existing structure with models that provide previously unknown knowledge. Mining of existing structures is concerned with extracting models that are known to exist in the web but are hidden by navigation or by the lack of an explicit description. The work presented in this thesis falls into this latter category: a website boundary is known to exist, but the boundaries are not explicitly described in the web.
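To illustrate how a ranking algorithm generates a new structure (a score per page) from the existing link structure, the following is a minimal sketch of a PageRank-style power iteration. The toy graph `toy_web`, the damping factor of 0.85, the iteration count and the handling of pages without out-links are assumptions made for illustration; they are not taken from the cited works.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a PageRank-style score per page for a dict {page: [out-links]}."""
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Each page passes an equal share of its rank along every out-link.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "d" has no in-links, "c" is linked from three pages.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
```

In this toy graph the scores sum to one, the page with the most in-links ("c") receives the highest score, and the page with no in-links ("d") receives the lowest.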

2.4.1 Web Mining Challenges

The WWW as a data source has many unique characteristics in comparison to the traditional data that is mined for knowledge [128]. These differences contribute to making web mining both a challenging and a rewarding process in terms of the new knowledge discovered. This section discusses the issues and challenges associated with web mining applications in general. For a discussion of web mining with respect to the WBD problem see section 3.3.1.

Volume The volume of data and information on the web is huge and is growing very quickly [128, 95, 49, 135, 30]. This exponential growth poses scaling issues that are becoming increasingly difficult to deal with [30], which makes discovering new knowledge from web data with traditional mining techniques very difficult. Work is ongoing to develop more scalable methods of data storage and analysis to address the increasing volume of data on the web.

Diversity The availability of low-cost storage and networking has led the WWW into a liberal information culture; content generation and dissemination is extremely diverse [49]. Personal corners of the web, provided by social or corporate hosts, can hold a diverse range of information about a user. Weblogs can hold thoughts on topics of interest, and social networking sites can hold information on a user's day-to-day activities. Data of almost all conceivable types is available on the internet, and the range is ever increasing. Images, videos and audio files in many formats are just a few common types of data that exist on the web today [128]. The diverse nature of the web can cause problems with the integration of information; multiple formats can cause information to be duplicated, which can lead to inconsistencies [128].

The diversity of the web and of its pages adds data mining complexities that are far greater than those encountered when mining traditional data. Web pages do not have a clear unifying structure associated with their content, which can make diverse information seem unrelated. Web pages can contain many different authoring styles and the content can vary dramatically, covering any subject area [95]. This vast range of formats and structures can cause issues that need to be carefully considered with respect to mining activities.

Semi-Structured data As originally proposed, the web was intended to improve the management of general information at CERN [38]. The overall structure provided flexibility and convenience for the management of very large amounts of data [135]. The web today can be considered as a huge digital library [95]. However, the tremendous number of documents that it holds are not arranged or sorted into a particular order, as the intuitive notion of a library would suggest [95]. The same is true of the majority of individual web pages on the WWW, which are quite often unstructured [30]. The web is therefore an abundant collection of data that comprises both structured and unstructured, well-formed and non-well-formed, content [128], and this makes it a difficult medium to gather information from [50]. Specific methods have to be employed to extract information from the semi-structured data on the web, in sharp contrast to the processes required to extract information from traditional data sources.

Authority The WWW spans many nations and features users from a wide range of cultures and backgrounds. There are currently no methods for the verification of information, no editorial guidelines and no approval by any authority of the content on the web [49]. Data can be false or poorly written. Data can be inconsistent and can quickly become out of date. These are all facets of the publishing medium that the web has become [30]. Traditional publication media, such as print or broadcast, do not have the same shortcomings as the internet [49]. The problem of finding authority within the WWW can be difficult to solve completely, as some of the problems are intrinsic to human nature [30]. One method that can be used to gain some sense of authority is to look at the structure of the web. Pages that are pointed to by many other pages can carry an increased sense of trust. These web pages can be deemed to be of high quality, or to have a heightened sense of authoritativeness, because of the trust invested by the hyperlinks from other pages [128].
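The idea that pages pointed to by many other pages carry more trust can be sketched by counting in-links. This is a minimal illustration only: the page names in `toy_web` are hypothetical, and counting each linking page once while ignoring self-links is an assumption rather than a method described in the cited work.

```python
from collections import Counter


def inlink_counts(links):
    """Count the distinct pages that link to each target in {page: [out-links]}."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):   # count each linking page at most once
            if target != source:      # a page cannot vouch for itself
                counts[target] += 1
    return counts


# Hypothetical site: "home" is linked from three other pages, "blog" from none.
toy_web = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home", "about", "news"],
    "blog": ["home"],
}
counts = inlink_counts(toy_web)
```

Here `counts["home"]` is 3 and `counts["blog"]` is 0, so "home" would be treated as the most authoritative page under this crude measure. Algorithms such as PageRank and HITS refine the idea by also weighting who is doing the linking.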

Noise The WWW is full of irrelevant or noisy data that is not useful to many applications of web mining [128]. The noise in web data can be a feature of either (i) content or (ii) quality. The content often includes information that is not needed or is irrelevant for a task. When mining a web page for a specific element of its content there is usually additional information on the page that is not needed. Examples include navigation links, advertisements and copyright notices. Only the content that is targeted is deemed useful; everything else is considered noise [128]. The quality can often be misleading, which relates to the issue of authority discussed above. There is no quality control regarding the information on a web page, therefore any information can be published. This can make a lot of information misleading or incorrect [128]. Spam content also contributes substantially to the problem of controlling the quality of web content. Finding the exact information that is needed from the web is essential to getting the most accurate results possible when applying mining techniques. “The clearer the data set the better the information and knowledge that can be extracted from it” [50].

Dynamic and Distributed The WWW is highly dynamic and distributed in its nature [128, 95]. The content of the web is dynamic as it is subject to constant change [128]. Changes of content can be very frequent, as is the case for pages carrying current affairs, news and stock market information, to name just a few examples [95]. Keeping up with changes is very important for many applications of the web [128]; out-of-date or redundant data is not useful for such time-critical data mining applications.

Data can be spread across many locations of the WWW. However, the requesting and provision of this data is often seamless from the user’s perspective. Redirection and load balancing of server resources across many geographical sites are very common, and can create a large distributed network of information for a single corporation.

This dynamic and distributed nature is further increased by the rapid growth of the web [95]. A large number of pages are added to the web each day [135]. A great number of computers can also be connected to the network at any time; these can be of varying platforms and in various geographical locations. New computers can very easily be added or removed, and can be intermittently connected or reconnected at varying times [30]. This adds to the volatility of the dynamic and distributed web. In addition, broken links and relocation problems can occur when computers are moved, or when domains and files change names or disappear [30]. Multiple pages across the web may present the same information and could be inconsistent due to translations or mistakes; the integration of such distributed data can pose a problem for mining tasks [128].

Virtual Society The WWW does not only contain information and services; it also hosts virtual societies and communities that are conceptualised on the web [128, 95]. These societies feature a multitude of interactions between people, organisations and automated systems [128]. They can be used to communicate instantly and to express a user’s personal view on many different subjects [128]. These interactions are important aspects of the characteristics of the web. For example, they can encapsulate likes or dislikes for certain products or services. This information can be very valuable with respect to web mining applications.

Persistence Information that is available on the web has some similarities to published works, but it does not share the same immutable properties. The information does not disappear when it is first broadcast, nor does it remain in an imperishable state. The nature of the WWW provides a third model that sits between the recorded and the unrecorded [107]. This model reflects the nature of the information on the web and the persistence it exhibits. Data can be created and maintained forever, can be destroyed in the same instant, or can persist for any time frame in between. This property means that newly discovered knowledge may become out of date or irrelevant very quickly.

Deep web The WWW contains data that is not accessible using standard means of web navigation. The traditional method of navigation is to move from one web page to the next by following the available hyperlinks; the content reachable in this way is known as the surface web. There is, however, information that cannot be accessed via hyperlinks, or that can only be accessed using form submission: information that is dynamically generated as a result of web interactions. This is known as the “deep web”. It has been estimated that the deep web is 500 times larger than the surface web [97].
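The distinction between surface and deep web can be sketched as a reachability question: a crawler that only follows hyperlinks sees exactly the pages reachable from its seed. The example below is a minimal sketch over an in-memory link graph; the page names in `toy_site`, including the form-generated results page, are hypothetical and used only for illustration.

```python
from collections import deque


def surface_reachable(links, seed):
    """Return the pages reachable from seed by following hyperlinks only."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical site: the results page exists, but no hyperlink points to it.
toy_site = {
    "index": ["products", "search_form"],
    "products": ["index"],
    "search_form": [],
    "results?q=shoes": ["products"],  # only produced by submitting the form
}
reachable = surface_reachable(toy_site, "index")
hidden = set(toy_site) - reachable
```

In this sketch `hidden` contains only the form-generated results page, which a link-following crawler starting at "index" never visits.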
