
6.2 Web mining

6.2.5 Web mining components

There are certain tasks that are standard to all the different types of web mining regardless of the goals or approaches followed in mining the web. Etzioni describes three subtasks that form the core of web mining [58]. These tasks are resource discovery, information selection/extraction and generalization. Some authors also identify a fourth subtask called Analysis which also forms an integral part of the web mining process [57]. The web mining process is summarized in figure 6.3 below.

[Figure: WWW → Resource discovery → Information selection and preprocessing → Generalization → Analysis → Knowledge]

Figure 6.3: Web mining subtasks adapted from [57, 58].

Resource discovery

The resource discovery subtask in web mining is the process of automatically retrieving relevant documents or services on the web whilst minimizing the retrieval of unrelated/irrelevant material.

The information can be available either online or offline from a multitude of sources including newsgroup postings, HTML documents or other manually selected sources on the web.

This subtask of web mining relates directly to that of information retrieval on the web. As was discussed in the previous chapter, web IR has the primary goals of indexing text and searching for relevant documents in a collection (called the corpus). In their paper, Kosala and Blockeel note that the task of web document classification or categorization can be used for the indexing task in web IR. This can be seen as an instance of web mining [57]. This indexing task in web IR is currently serviced by a host of search engines available on the web itself. Viewed in this respect, web IR can be seen as part of the web mining process.

With the above in mind, Etzioni predicts that future resource discovery systems will employ automatic text categorization technology to classify web documents into categories [58]. This could facilitate the automatic construction of web directories like Yahoo (http://www.yahoo.com) or aid in filtering the query results received from other search engines.
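As an illustration of what automatic text categorization could look like, the following minimal sketch trains a multinomial naive Bayes classifier on a handful of labelled pages. Naive Bayes is chosen here only because it is compact; the source does not say which technique Etzioni had in mind. The class name, the categories and the training texts are all invented for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


class NaiveBayesCategorizer:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()

    def train(self, text, category):
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[category].values())
            denominator = total_words + len(self.vocabulary)
            for word in tokenize(text):
                count = self.word_counts[category][word]
                score += math.log((count + 1) / denominator)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical training pages (invented for illustration).
categorizer = NaiveBayesCategorizer()
categorizer.train("football match score league goal", "sports")
categorizer.train("tennis player wins tournament final", "sports")
categorizer.train("stock market shares fall bank", "finance")
categorizer.train("bank raises interest rates economy", "finance")

label = categorizer.classify("league final goal")
print(label)
```

A directory builder could run such a classifier over newly discovered pages and file each one under the predicted category. A realistic system would need far more training data and a better tokenizer.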

Information selection/extraction

The information selection/extraction (IE) subtask in web mining can be seen as the automatic selection and pre-processing of specific information from web documents discovered in the resource discovery subtask. Information extraction has the goal of transforming a collection of documents, usually with the aid of an IR system, into information that is more readily digested and analyzed [59]. In a data mining context, the information is “prepared” for the next step.

In essence, there exist two fundamental types of IE systems: those that attempt to extract information from unstructured text and those that attempt to extract information from semi-structured data [57]. Systems that operate in an unstructured environment typically use linguistic pre-processing techniques before commencing with data mining activities. These techniques can include syntactic analysis, semantic analysis and discourse analysis, which are more language-driven technologies [57].

Systems that operate in a semi-structured (or vaguely-structured²) environment utilize meta-information (e.g. HTML markup) that is available inside the semi-structured data. Approaches that do not utilize linguistic constraints are commonly termed wrapper induction approaches [60]. As noted by Etzioni, the bulk of today’s information extraction systems identify a fixed set of web resources and then rely on hand-coded, customized wrappers to access each resource and process the responses received from it [58].
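To make the idea of a hand-coded wrapper concrete, the sketch below uses Python's standard `html.parser` module to pull records out of one specific, hypothetical page layout. The layout (`<li><b>NAME</b> <i>PRICE</i></li>`), the class name and the sample page are assumptions made for the example and are not taken from any system cited here.

```python
from html.parser import HTMLParser


class PriceListWrapper(HTMLParser):
    """Hand-coded wrapper for one hypothetical layout:
    every item looks like <li><b>NAME</b> <i>PRICE</i></li>."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None   # field the parser is currently inside, if any
        self._current = {}

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._current = {}
        elif tag == "b":
            self._field = "name"
        elif tag == "i":
            self._field = "price"

    def handle_data(self, data):
        # Text outside <b> or <i> is ignored.
        if self._field is not None:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag in ("b", "i"):
            self._field = None
        elif tag == "li" and self._current:
            self.records.append(self._current)


# Invented sample page.
page = (
    "<ul>"
    "<li><b>Kettle</b> <i>19.99</i></li>"
    "<li><b>Toaster</b> <i>24.50</i></li>"
    "</ul>"
)
wrapper = PriceListWrapper()
wrapper.feed(page)
wrapper.close()
records = wrapper.records
print(records)
```

The wrapper relies entirely on the markup and uses no linguistic analysis. It also stops working as soon as the site changes its tags, which is the brittleness that hand-coded wrappers are known for.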

² See chapter 4, page 65.

Unfortunately, due to the highly dynamic and diverse nature of the medium, building customized IE solutions like this is neither a feasible nor a flexible solution for the web environment [57]. In an attempt to scale with the growth of the web and provide a more flexible service, web IE systems are increasingly utilizing machine learning and data mining techniques. Most of these systems try to automatically discover extraction rules from annotated corpora.
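One simple way to picture the discovery of extraction rules from an annotated corpus is to learn a pair of delimiters: the longest string that always precedes the labelled value and the longest string that always follows it. The sketch below is a toy version of that idea under strong assumptions (one value per page, exact string matching); it is not the algorithm of any system cited in this chapter. The function names and sample pages are invented.

```python
import os


def common_prefix(strings):
    """Longest prefix shared by all strings (character-wise)."""
    return os.path.commonprefix(list(strings))


def common_suffix(strings):
    """Longest suffix shared by all strings."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce_rule(annotated_pages):
    """Learn (left, right) delimiters from (page, labelled_value) pairs."""
    befores, afters = [], []
    for page, value in annotated_pages:
        start = page.index(value)  # assumes the value occurs in the page
        befores.append(page[:start])
        afters.append(page[start + len(value):])
    left, right = common_suffix(befores), common_prefix(afters)
    if not left or not right:
        raise ValueError("no shared delimiters found in the examples")
    return left, right


def apply_rule(rule, page):
    """Return the text between the learned delimiters, or None."""
    left, right = rule
    start = page.find(left)
    if start == -1:
        return None
    start += len(left)
    end = page.find(right, start)
    if end == -1:
        return None
    return page[start:end]


# Invented annotated corpus: each page is paired with its labelled price.
annotated = [
    ("<html><h1>Offers</h1><p>Price: <b>19.99</b> EUR</p></html>", "19.99"),
    ("<html><h1>Today only</h1><p>Price: <b>5.00</b> EUR, while stocks last</p></html>", "5.00"),
]
rule = induce_rule(annotated)

new_page = "<html><h1>Sale</h1><p>Price: <b>7.25</b> EUR</p></html>"
extracted = apply_rule(rule, new_page)
print(rule, extracted)
```

Adding more annotated pages can only shorten the learned delimiters, so the rule becomes more general as the corpus grows. Real systems handle multiple fields, multiple records per page and noisy labels, none of which this sketch attempts.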

Generalization

In the generalization phase, the objective is to automatically discover general patterns at individual web sites as well as across multiple sites. Machine learning and data mining techniques are most frequently employed to fulfil this task. Etzioni remarks that a major problem when learning about the web is the labelling problem [58]. Data is readily available on the web, but it is unlabelled; in other words, there is no easy way to distinguish between different types of data. Many data mining algorithms require labelled binary input to act as training data, with each item marked as either a “positive” or a “negative” example of some concept.

Techniques have been devised to deal with the labelling problem (such as uncertainty sampling [61]), but they do not eliminate it completely. Clustering methods do not require labelled input and have been applied successfully to large collections of documents. The goal of clustering, in the context of web data mining, is to find groups of similar hypertext documents. There are a number of algorithms that can be applied to the clustering of textual data (e.g. agglomerative clustering or k-means clustering [62]). The web offers a promising arena for document clustering research.
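The sketch below shows how k-means could group unlabelled documents, using plain word-count vectors and squared Euclidean distance. To keep the result reproducible, the centroids are seeded with the first two documents; this is an assumption of the example, and k-means is more commonly seeded at random. The documents and function names are invented.

```python
from collections import Counter


def vectorize(docs):
    """Turn each document into a word-count vector over a shared vocabulary."""
    vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vectors.append([float(counts[word]) for word in vocabulary])
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(vectors, initial_centroids, max_iterations=100):
    """Return a cluster index for every vector."""
    centroids = [list(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: attach each vector to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(vector, centroids[j]))
            for vector in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no vector changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:
                centroids[j] = mean(members)
    return assignment


# Invented, unlabelled documents.
docs = [
    "football league goal match",
    "stock market bank shares",
    "football goal score league",
    "bank shares market fall",
]
vectors = vectorize(docs)
clusters = k_means(vectors, initial_centroids=vectors[:2])
print(clusters)
```

No labels are supplied at any point: the two groups emerge purely from the word overlap between documents. Interpreting what each cluster means is left to the analysis phase.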

Analysis

The analysis phase is concerned with the validation and/or interpretation of patterns mined in the previous phase. In this phase, more so than in the others, it is appropriate to note that humans play an important role in the information and knowledge discovery on the web since the medium is an interactive one. This is of particular importance for the validation and/or interpretation of the patterns identified in the previous steps. This step also involves the application of appropriate tools for the understanding, visualization and interpretation of the mined patterns [57, 58].

Following the above discussion, attention must now be given to various web mining agents utilizing the components of the web mining process. These agents are discussed in the next section.