
Web Usage Mining

Structuring semantically enriched clickstream data

by

Peter I. Hofgesang

Stud.nr. 1421247

A thesis submitted to the

Department of Computer Science

in partial fulfilment of the requirements for the degree of Master of Computer Science at the

Vrije Universiteit Amsterdam, The Netherlands


supervisor

Dr. Wojtek Kowalczyk

Faculty of Sciences, Vrije Universiteit Amsterdam Department of Computer Science

second reader Dr. Elena Marchiori

Faculty of Sciences, Vrije Universiteit Amsterdam Department of Computer Science


Abstract

Web servers worldwide generate a vast amount of information on web users’ browsing activities. Several researchers have studied these so-called clickstream or web access log data to better understand and characterize web users.

Clickstream data can be enriched with information about the content of visited pages and the origin (e.g., geographic, organizational) of the requests. The goal of this project is to analyse user behaviour by mining enriched web access log data. We discuss techniques and processes required for preparing, structuring and enriching web access logs. Furthermore we present several web usage mining methods for extracting useful features. Finally we employ all these techniques to cluster the users of the domain www.cs.vu.nl and to study their behaviours comprehensively.

The contributions of this thesis are a content- and origin-based data enrichment and a tree-like visualization of frequent navigational sequences. This visualization provides an easily interpretable view of the patterns with the relevant information highlighted.

The results of this project can be applied for diverse purposes, including marketing, web content advising, (re-)structuring of web sites and several other e-business processes, such as recommendation and advertising systems.


Contents

1 Introduction
2 Related research
3 Data preparation
 3.1 Data description
 3.2 Cleaning access log data
 3.3 Data integration
 3.4 Storing the log entries
 3.5 An overall picture
4 Data structuring
 4.1 User identification
 4.2 User groups
 4.3 Session identification
 4.4 An overall picture
5 Profile mining models
 5.1 Mining frequent itemsets
 5.2 The mixture model
 5.3 The global tree model
6 Analysing log files of the www.cs.vu.nl web server
 6.1 Input data
 6.2 Distribution of content-types within the VU-pages and access log entries
 6.3 Experiments on data structuring
 6.4 Mining frequent itemsets
 6.5 The mixture model
 6.6 The global tree model
7 Conclusion and future work
Acknowledgements
Bibliography
APPENDIX
 APPENDIX A. The uniform resource locator (URL)
 APPENDIX B. Input file structures
 APPENDIX C. Experimental details
 APPENDIX D. Implementation details


Structure

This Master's thesis is organized as follows:

Chapter 1, “Introduction”

This chapter provides a high-level overview of the related research and main goals of this project.

Chapter 2, “Related research”

Chapter 2 gives a comprehensive overview of the related research known so far.

Chapter 3, “Data preparation”

This chapter follows all steps of the data preparation process. It starts by describing the main characteristics of the input data, followed by a description of the data cleaning process. The section on data integration explains how the different data sources are merged for data enrichment, while the next section concerns data loading. Finally, an overall scheme and an experiments section are laid out.

Chapter 4, “Data structuring”

In chapter 4 we explain how the semantically enriched data are combined to form user sessions. It also discusses the process of user identification and describes the groups of users, both of which are prerequisites for the identification of sessions. The chapter ends with an overall scheme of data structuring followed by a section on experiments.

Chapter 5, “Profile mining models”

This chapter provides an overview of the theoretical background of the applied data mining models. First it explains the widely used frequent itemset mining algorithm. The following section describes the recently proposed mixture model architecture. Finally, a tree model is proposed for exploiting the hierarchical structure of session data.

Chapter 6, “Analysing log files of the www.cs.vu.nl web server”

Chapter 6 discusses experimental results of the mining models applied to the semantically enriched data. All the input data relate to a specific web domain: www.cs.vu.nl.

Chapter 7, “Conclusion and future work”

Finally in chapter 7 we present the conclusions of our research and explore avenues of future work.


1 Introduction

The extensive growth of the information reachable via the Internet makes that information increasingly difficult to manage. Publishing a product range or other information online in an efficient, easily manageable way is a problem for numerous companies. The exploration of web users’ habits and behaviours plays a key role in dissecting and understanding this problem.

Web mining is an application of data mining techniques to web data sets. Three major web mining methods are web content mining, web structure mining and web usage mining. Content mining applies methods to web documents. Structure mining reveals hidden relations in web site and web document structures. In this thesis we employ web usage mining which presents methods to discover useful usage patterns from web data.

Web servers are responsible for providing the available web content on user requests. They collect all the information on request activities into so-called log files. Log data are a rich source for web usage mining.

Much scientific research targets the field of web usage mining, and especially the exploration of user behaviour. Besides, there is great demand in the business sector for personalized, custom-designed systems that conform closely to the requirements of users.

There is also a substantial amount of prior scientific work on modelling web user characteristics. Some of it presents a complete framework for the whole web usage mining task (e.g., Mobasher et al. (1996) [18] proposed WEBMINER).

Many of these works present page access frequency based models and modified association rule mining algorithms, such as [1, 31, 23]. Xing and Shen (2003) [30] proposed two algorithms (UAM and PNT) for predicting user navigational preferences, both based on page visit frequency and page viewing time. UAM is a URL-URL matrix providing page-to-page transition probabilities computed over all users’ statistics, while PNT is a tree based algorithm for mining preferred navigation paths. Nanopoulos and Manolopoulos (2001) [21] present a graph based model for finding traversal patterns in web page access sequences. They introduce one level-wise and two non-level-wise algorithms for large paths exploiting the graph structure.

While most of the models work on global “session levels”, an increasing number of studies show that the exploration of user groups or clusters is essential for better characterisation. Hay et al. (2003) [14] suggest the Sequence Alignment Method (SAM) for measuring the distance of sessions while taking their structural information into account. The proposed distance reflects the number of operations required to transform sessions into one another. SAM distance based clusters form the basis of further examinations. Chevalier et al. (2003) [8] suggest rich navigation patterns consisting of frequent page set groups and web user groups based on demographic patterns, and show the correlation between the two types of data.

Other research goes far beyond frequency based models: Cadez et al. (2003) [4] propose a finite mixture of Markov models on sequences of URL categories traversed by users. This complex probability based structure models the data generation process itself.

In this thesis we discuss techniques and processes required for further analysis. Furthermore we present several web usage mining methods for extracting useful features. An overall process workflow can be seen in figure 1.


[Figure 1: The overall process workflow. The input data (the web server’s access log data, the URL / content type mapping table, and geographical and organizational information) pass through data preparation (data filtering, data integration) and are stored in a database; user selection and session identification follow, and the identified sessions are written in AR, MM and GTM formats for the profile mining step (association rules, mixture model, tree model).]

This thesis considers three separate data sets as input. Access log data are generated by the web server of the specified domain and contain user access entries. The content-type mapping table contains relations between documents and their categories in the form of URL / content type pairs. Mapping tables can be generated either by classifier algorithms or by content providers; in the latter case, the contents of pages are given explicitly in the form of content categories (e.g., news, sport, weather, etc.). Geographical and organizational information makes it possible to determine different categories of users.

All data mining tasks start with data preparation, which prepares the input data for further examination. It consists of four main steps as it can be seen in figure 1. Data filtering strips out irrelevant entries, data integration enriches log data with content labels and the enriched data are stored in a database. The user selection process sorts out appropriate user entries of a specified group for session identification.

The following step in the whole process is session identification. Related log entries are identified as unique user navigational sequences. Finally these sequences are written to output files in different formats depending on the application.

The profile mining step applies several web usage mining methods to discover relevant patterns. It uses an association rules mining algorithm [1] for mining frequent page sets and for generating interesting rules. It also applies the mixture model proposed by Cadez et al. (2001) [5] to build a predictive model of navigational behaviours of users. Finally it presents a tree model for representing and visualizing visiting patterns in a nice and natural way.

In the experimental part of this thesis we employ all these techniques to address the problem of defining clusters on the users of the www.cs.vu.nl web domain and we study their behaviours comprehensively.

The contributions of this thesis are content based data enrichment and visualization of frequent navigational sequences. Data enrichment amplifies users’ transactional data with the content types of the visited pages and the origin of the requests.


2 Related research

There are numerous commercial software packages usable to obtain statistical patterns from web logs, such as [11, 22, 37]. They focus mostly on highlighting log data statistics and frequent navigation patterns but in most cases do not explore relationships among relevant features.

Some studies propose data structures to facilitate web log mining processes. Punin et al. (2001) [24] defined the XGMML and LOGML XML languages: XGMML describes graphs, while LOGML describes web logs. Other papers focus only (or mostly) on data preparation [6, 13, 15]. Furthermore, there are complete frameworks presented for the whole web usage mining task (e.g., Mobasher et al. (1996) [18] proposed WEBMINER).

Many studies, such as [1, 23, 31], present page access frequency based models and modified apriori [1] (frequent itemset mining) algorithms. Some papers (e.g., [32], [10], [9]) present online recommender systems to assist users’ browsing or purchasing activity. Yao et al. (2000) [32] use standard data mining and machine learning techniques (e.g., frequent itemset mining, the C4.5 classifier) combined with agent technologies to provide an agent based recommendation system for web pages, while Cho et al. (2002) [10] suggest a product recommendation method based on data mining techniques and product taxonomy. This method employs decision tree induction for selecting the users most likely to buy the recommended products.

Hay et al. (2003) [14] apply sequence alignment method (SAM) for clustering user navigational paths. SAM is a distance-based measuring technique that considers the order of sequences. The SAM distance of two sequences reflects the number of transformations (i.e., delete, insert, reorder) required to equalize them. A distance matrix is required for clustering which holds SAM distance scores for all session pairs. The analysis of the resulting clusters showed that the SAM based method outperforms the conventional association distance based measuring.

In their paper Runkler and Bezdek (2003) [27] use the relational alternating cluster estimation (RACE) algorithm for clustering web page sequences. RACE finds the centers of a specified number of clusters based on a page sequence distance matrix. The algorithm alternately computes the distance matrix and one of the cluster centers in each iteration. They propose the Levenshtein (a.k.a. edit) distance for measuring the distance between members (i.e., textual representations of the visited page number sequences within sessions). The Levenshtein distance counts the number of delete, insert or change steps necessary to transform one word into the other. Pei et al. (2000) [23] propose a data structure called the web access pattern tree (WAP-tree) for efficient mining of access patterns from web logs. WAP-trees store all the frequent candidate sequences that have a support higher than a preset threshold; the only information stored by the tree are the labels and frequency counts of its nodes. In order to mine useful patterns from WAP-trees they present the WAP-mine algorithm, which applies conditional search for finding frequent events. Together, the WAP-tree structure and the WAP-mine algorithm offer an alternative to apriori-like algorithms.

Smith and Ng (2003) [28] present a self-organizing map framework (LOGSOM) to mine web log data and present a visualization tool for user assistance.

Jenamani et al. (2003) [16] use a semi-Markov process model for understanding e-customer behaviour. The keys of the method are a transition probability matrix (P) and a mean holding time matrix (M). P is a stochastic matrix whose elements store the probabilities of transitions between states, and M stores the average length of time the process remains in state i before moving to state j. In this way the probabilistic model is able to model the time elapsed between transitions.

Some papers present methods based on content assumptions. Baglioni et al. (2003) [2] use URL syntax to determine page categories and to explore the relation between users’ sex and navigational behaviour. Cadez et al. (2003) [4] experiment on categorized data from Msnbc.com.

Visualization of frequent navigational patterns eases human perception. Cadez et al. (2003) [4] present the WebCanvas tool for visualizing Markov chain clusters. This tool displays all user navigational paths for each cluster, colour coded by page categories. Youssefi et al. (2003) [33] present a 3D visualization that superimposes web log patterns on extracted web structure graphs.


3 Data preparation

Preparing the input data is the first step of all data and web usage mining tasks. The data in this case are, as mentioned above, the access log files of the web server of the examined domain and the content types mapping table of the HTML pages within this domain.

Data preparation consists of three main steps: data cleaning/filtering, data integration and data storing. Data cleaning is the task of removing all irrelevant entries from the access log data set. Data integration establishes the relation between log entries and content mappings, and the last step is to store the enriched data in a convenient database. A comprehensive study of all these preprocessing tasks has been made by Cooley et al. (1999) [13].

This chapter starts with a description of the input data and their generation procedure, followed by the details of access log file cleaning and of the integration of log entries with the mapping data. Finally it presents the database scheme used for data storing and an overall picture and description of the data preparation process.

3.1 Data description

This section describes the details of the access log and content type mapping data.

3.1.1 Access log files

Visitors to a web site click on links and their browser in turn requests pages from the web server. Each request is recorded by the server in so-called access log files¹. Access logs contain requests for a given period of time; the time interval used is normally an attribute of the web server. There is a log file for each period, and the old ones are archived or erased depending on usage and importance.

Most web server log files are stored in the common log file format (CLFF) [34] or the extended log file format (ELFF) [35]. An extended log file contains a sequence of ASCII lines terminated by either LF or CRLF. Each entry consists of a sequence of fields relating to a single HTTP transaction. Fields are separated by white space; if a field is unused in a particular entry, a dash "-" marks the omitted field.

Web servers can be configured to write different fields into the log file in different formats. The most common fields used by web servers are the following: remotehost, rfc931, authuser, date, request, status, bytes, referer, user_agent.

¹ There are other types of log files generated by the web server as well, but this project does not consider them.


The meanings of all these fields are explained in the table below, with examples.

remotehost
  Remote hostname (or IP number if the DNS hostname is not available). Example: 82.168.4.229
rfc931
  The remote login name of the user. Example: -
authuser
  The username with which the user has authenticated himself. Example: -
[date]
  Date and time of the request, with the web server’s time zone. Example: [20/Jan/2004:23:17:37 +0100]
"request"
  The request line exactly as it came from the client. It consists of three subfields: the request method, the resource to be transferred, and the protocol used. Example: "GET / HTTP/1.1"
status
  The HTTP status code returned to the client. Example: 200
bytes
  The content-length of the document transferred. Example: 12079
"referer"
  The URL the client was on before requesting this URL. Example: "-"
"user_agent"
  The software the client claims to be using. Example: "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"

Table 1: The most commonly used fields of access log file entries by web servers
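To illustrate how such entries can be processed, the sketch below parses one combined-format log line into the fields of Table 1. The regular expression and the LogEntry record are illustrative assumptions and not part of the thesis implementation.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: parsing one combined-format access log line.
// The field names follow Table 1; the regex and the record are illustrative only.
public class LogLineParser {
    // remotehost rfc931 authuser [date] "request" status bytes "referer" "user_agent"
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    public record LogEntry(String remoteHost, String date, String request,
                           int status, String bytes, String referer, String userAgent) {}

    public static LogEntry parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null; // malformed entry, to be skipped by the cleaning step
        }
        return new LogEntry(m.group(1), m.group(4), m.group(5),
                Integer.parseInt(m.group(6)), m.group(7), m.group(8), m.group(9));
    }

    public static void main(String[] args) {
        String sample = "82.168.4.229 - - [20/Jan/2004:23:17:37 +0100] "
                + "\"GET / HTTP/1.1\" 200 12079 \"-\" \"Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)\"";
        System.out.println(parse(sample));
    }
}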

3.1.2 Content types mapping table

A content types mapping table is a table containing URL/content type pair entries. URLs are file locator paths referring to documents, and content types are labels giving the types of those documents.


We use an external algorithm [3], which attaches labels to all HTML documents in a collection of HTML pages based on their contents. The algorithm is based on the naive Bayes classifier supplemented by a smart example selector algorithm. It uses only the textual content of the HTML pages, stripping out the control tags; parts of the text enclosed within special tags (e.g., title or header tags) are given a higher weight. The algorithm chooses the first 100 pages randomly to be categorized by humans. This initialization step is followed by an “active learning” method, which chooses further examples by considering the ones already selected.

This thesis also deals with documents other than HTML (e.g., pdf, ps, doc, rtf, etc.). However, attaching labels to each of them based on their content would be difficult, because the structure of these files is format specific and often very complex, and their size is usually large. For these reasons a very simple technique is used to identify such documents. The label “documents” is attached to all pdf and ps files, which typically are scientific papers, e-books, documentation, etc., while the label “other documents” is attached to all other document types (e.g., doc, rtf, ppt, etc.), such as administrative papers, forms, etc. Following these remarks, the mapping table is completed with entries for these two labels.

The following table presents an example of a content types mapping table:

URL                              content type identifier
bi/courses-en.html               4
ci/DataMine/DIANA/index.html     6
…                                …

Table 2: An example of a content-type mapping table

3.2 Cleaning access log data

As described above, raw access log files contain a vast number of diverse request entries. Each log entry can be informative for some application, but this project excludes most of them. Processing certain types of requests would lead to wrong conclusions (e.g., requests generated by spider engines). Besides, stripping the data has a positive effect on processing time and the required storage space.

Since this project focuses only on documents themselves (like html, pdf, ps and doc files), all request entries for other file types should be stripped out. Furthermore, as the main goal is the characterization of users, robot transactions, i.e., web traffic generated automatically by robot programs, must also be filtered out. There are several other criteria for filtering; detailed descriptions of the filtering criteria and methods follow below.

3.2.1 Filtering unsupported extensions

A typical web page is made up of many individual files. Beyond the HTML page it consists of graphical elements, style files, mappings, etc., all in separate files. Each user request for an HTML file evokes hidden requests for all the files required to display that specific page. In this manner access log files contain the traces of all these hidden requests as well.

Extension filtering strips out all request entries for file types other than the predefined ones (for the structure of the extension list file refer to APPENDIX B4, Extension filter list file). The extension of the requested file can be extracted from the “request” field of a log entry.

An example of such request field: "GET /ai/kr/imgs/ibrow.jpg HTTP/1.0"

3.2.2 Filtering spider transactions

A significant portion of log file entries is generated by robot programs. These robots, also known as spider or crawler engines, automatically search through a specific range of the web. They index web content for search engines, prepare content for offline browsing or for several other purposes.

The common point in all crawlers’ activity is that, although they are mostly supervised by humans, they generate systematic, algorithmic requests. So without eliminating spider entries from log files, real users’ characteristics would be distorted by features of machines.

Spiders can be identified by searching for specific spider patterns in the "user_agent” field of log entries. Most of the well-disposed spiders put their name or some kind of pattern that identifies them into this field. Once a pattern has been identified, the filter method ignores the examined log entry.

Spider patterns can be collected by browsing the web: several pages track spider activities and patterns, and there are many professional forums on the subject (mostly discussing how to avoid them) [29].

Spider patterns are collected in a separate spider list file (refer to APPENDIX B5). An example of such a user_agent field:

"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"

3.2.3 Filtering dynamic pages

Web pages generated dynamically on user requests are called “dynamic pages”. These pages are not located on the web server as individual files, since they are built by a specific engine using several data sources. For this reason dynamic pages cannot be analyzed in a simple way, although with a few tricks it is still possible to obtain useful information from them.

There is no standard for the structure of URL requests for dynamic pages, except that parameters appear after the “?” (question mark) in the URL and consist of name/value pairs. Therefore, dynamic pages can basically be filtered out by searching for the question mark in the “request” fields of log entries. Note that requests for a single dynamic page without any parameters, thus without the delimiter question mark, would already be stripped out during extension filtering (e.g., *.jsp, *.php, *.asp pages).

An example of such a dynamic page’s request field:

"GET /obp/overview.php?lang=en HTTP/1.0"

3.2.4 Filtering HTTP request methods

HTTP/1.0 [25, 26] allows several methods to be used to indicate the purpose of a request. The most often used methods are GET, HEAD and POST. Since the GET method is the only way of requesting a document that is useful for this project, the request method filter ignores all other requests. The filter examines the “request” field of the log entry for the “GET” method identifier.

An example of such a request field:

"POST /modules/coppermine/themes/default/theme.php HTTP/1.0"

3.2.5 Filtering and replacing escape characters

URL escape characters are special character sequences made up of a leading “%” character and two hexadecimal characters. They substitute special characters in URL requests that could be problematic while transferring requests to different types of servers. Special characters are simply replaced by sequences of standard characters.

In most cases the task is only to replace these escape sequences with the characters they represent, but in certain instances URLs contain corrupted sequences that cannot be interpreted. In these cases the entries should be ignored. Corrupt sequences can be caused by users’ typing errors, automatically generated robot requests, etc.
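A minimal sketch of this step, assuming the standard java.net.URLDecoder is acceptable for the replacement: decodable sequences are replaced, while corrupted ones cause the entry to be dropped.

import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Minimal sketch: replacing URL escape sequences and rejecting corrupted ones.
// Entries whose request URL cannot be decoded would be dropped by the filter.
public class EscapeFilter {
    /** Returns the decoded URL, or null if the escape sequences are corrupted. */
    public static String decodeOrNull(String url) {
        try {
            return URLDecoder.decode(url, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException malformed) {
            return null; // e.g. a lone "%" or "%Z1": the log entry is ignored
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeOrNull("/ai/kr/course%20notes.html")); // /ai/kr/course notes.html
        System.out.println(decodeOrNull("/broken%Zsequence"));          // null
    }
}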

3.2.6 Filtering unsuccessful requests

If a user requests a page that does not exist, the server replies with the well-known “404, page not found” error message. In this case the user has to use the “back” button to navigate back to the previous page or type a different URL manually. Either way the user does not use the requested page to navigate further, since the error page does not provide any link to follow. For this reason log entries of erroneous requests should also be ignored. These entries can be filtered by examining the “status” field. The status of erroneous requests is mostly “404”, but in special cases the status field can take other values as well, such as “503”.


An example of such a log entry:

200.177.162.127 - - [16/May/2004:08:07:42 +0200] "POST /modules/coppermine/include/init.inc.php HTTP/1.0" 404 302 "-" "Mozilla 4.0 (Linux)"

3.2.7 Filtering request URLs for a domain name

A URL of a page request consists of a domain name and the path of the requested document relative to the public directory of the domain. Since the domain name is unambiguous for the responsible web server, it stores only the relative path of the request in the access log files, without the domain name. In a few cases, however, log file entries contain the whole absolute path. This leads to mapping errors during data integration, since the mapping table contains only relative paths and the comparison is based on path similarity. For these reasons such URLs in the “request” field have to be transformed to the relative format.

An example of such a request field:

"GET /www.cs.vu.nl/fb/generated/wrk_units/120.html HTTP/1.1"

3.2.8 Path completion

When a user requests a public directory instead of a specific file, the web server tries to find the default page in that directory. The default page is “index.html” in most cases, but it varies between web servers. Thus the task is to complete the URL with the name of the default page whenever a log entry contains a directory request. It is possible that the requested directory does not contain the default page; in that case the log entry will be filtered out when it is looked up in the content type mapping table (refer to section 3.1.2 Content types mapping table).

An example of such a request field:

original request field: "GET /pub/minix/ HTTP/1.1"

completed request field: "GET /pub/minix/index.html HTTP/1.1"
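A minimal sketch of this step; it assumes “index.html” as the default page, which in practice depends on the web server configuration.

// Minimal sketch of path completion: directory requests are completed with the
// name of the default page. "index.html" is the common default assumed here.
public class PathCompleter {
    public static String complete(String requestPath) {
        return requestPath.endsWith("/") ? requestPath + "index.html" : requestPath;
    }

    public static void main(String[] args) {
        System.out.println(complete("/pub/minix/"));           // /pub/minix/index.html
        System.out.println(complete("/pub/minix/index.html")); // unchanged
    }
}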

3.2.9 Filtering anchors

Anchors are special qualifiers for HTML link references. They act as reference points within a single web page. If a named anchor is placed somewhere in the HTML page’s body, following a link that refers to the HTML page completed with a hash mark and the name of the anchor (e.g., link + “#” + anchor name) will scroll directly to the place where the anchor is put.


We don’t filter frame pages. Frames are supported by the HTML specification and make it possible to split an HTML document into several “sub documents” (e.g., a frame for the navigation menu, a frame for the content, etc.). Each frame refers to a specific HTML document, resulting in a separate page request. The main frame page contains mostly special tags for controlling all the subframes. This page is either labelled miscellaneous or labelled the same as its subframes by the text mining algorithm [3]. Either way there is no need to pay special attention to such pages while preparing the data.

3.3 Data integration

A novel approach in this project is to use content types of the visited pages rather than URL references. Content types, as described earlier, are given in a special mapping table where each entry consists of an URL/content type pair (refer to section 3.1.2 Content types mapping table). Data integration in this context means that there should be a content type label attached to every single stored log entry.

The simplest and most convenient method is to attach content labels to transactions during data cleaning². This saves time, since it uses the same pass for both processes.

After cleaning and filtering a log entry, the data integration step looks up the entry’s request URL in the mapping table. If the URL is present, the corresponding type label is attached to the entry. Otherwise the extension of the URL is checked for a valid document type other than HTML (refer to section 3.2.1 Filtering unsupported extensions) and looked up in the table again. If the extension refers to an HTML page, the entry is deleted.
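The lookup-with-fallback logic could look roughly as follows; the class name, the document extensions and the numeric labels for “documents” and “other documents” are placeholders, not the values used in the thesis.

import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Minimal sketch of the data integration lookup described above.
// The content-type identifiers and the document extensions are illustrative.
public class ContentTypeLookup {
    private static final Set<String> DOCUMENT_EXTENSIONS = Set.of("pdf", "ps", "doc", "rtf", "ppt");
    private final Map<String, Integer> urlToContentType;

    public ContentTypeLookup(Map<String, Integer> urlToContentType) {
        this.urlToContentType = urlToContentType;
    }

    /** Returns the content-type label for a request URL, or empty if the entry should be dropped. */
    public Optional<Integer> lookup(String relativeUrl) {
        Integer type = urlToContentType.get(relativeUrl);
        if (type != null) {
            return Optional.of(type);
        }
        String ext = relativeUrl.substring(relativeUrl.lastIndexOf('.') + 1).toLowerCase();
        if (DOCUMENT_EXTENSIONS.contains(ext)) {
            // non-HTML documents get the generic "documents" / "other documents" labels
            return Optional.of(ext.equals("pdf") || ext.equals("ps") ? 1 : 2);
        }
        return Optional.empty(); // unknown HTML page: the entry is deleted
    }
}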

3.4 Storing the log entries

The final step of the data preparation is to store the data in a convenient database. MySQL was chosen as a database server in spite of the fact that the current version does not support stored procedures.

In most cases it would be easier and faster to use internal methods (stored procedures) for manipulating the data inside the database, but no insurmountable difficulties arose during the project in this respect. The advantages of MySQL are that it is fast, easy to maintain, free to use for research purposes and widely accepted. The database scheme for storing cleaned log entries can be seen in table 3.

² Depending on the application. For continuous streaming data, a better solution would be to attach labels to entries online, probably using the content identification model to identify unknown contents in addition to a preset mapping table.


Database scheme of the cslog table

column name     type name
id              bigint
remotehost      varchar
rfc931          varchar
authuser        varchar
transdate       datetime
request         text
content_type    tinyint
status          smallint
bytes           int
referer         text
user_agent      text

Table 3

The column names correspond to the log field names mentioned in section 3.1.1 Access log files, except for the content_type field, which stores the attached content type described above, and id, which is the unique identifier of the entries.

3.5 An overall picture

The following figure gives an overall picture of our data preparation scheme.

[Figure: Loading/filtering/mapping access log data. Raw log lines (cslog.txt) are read by the LogParser object into Transaction objects, passed through the TransactionFilter, enriched by the MappingTable object, and loaded into the database by Log2Database; extension.flt and datahandling.prop configure the process.]


The first step in the data preparation process is to load the raw log files into memory line by line using the LogParser object. This object transforms all entries into suitable Transaction objects, which contain all the fields of the log file. Once a Transaction has been parsed, it goes through the TransactionFilter, which filters out useless entries (by simply ignoring them). After this step a content-type label is attached to all transactions by the MappingTable object. Finally, Log2Database loads the filtered transactions into the specified database.
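The flow just described could be wired together roughly as in the sketch below. The interface names mirror the objects in the figure, but their methods are assumptions made for illustration, not the thesis implementation.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Rough sketch of the preparation pipeline: parse, filter, map, store.
// The nested interfaces mirror the object names in the figure; their methods are assumed.
public class DataPreparation {
    interface Transaction {}
    interface LogParser { Transaction parse(String line); }
    interface TransactionFilter { boolean accept(Transaction t); }
    interface MappingTable { void attachContentType(Transaction t); }
    interface Log2Database { void store(Transaction t); }

    public static void run(Path rawLog, LogParser parser, TransactionFilter filter,
                           MappingTable mapping, Log2Database database) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(rawLog)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Transaction t = parser.parse(line);   // raw line -> Transaction object
                if (t == null || !filter.accept(t)) {
                    continue;                          // useless entries are simply ignored
                }
                mapping.attachContentType(t);          // enrich with a content-type label
                database.store(t);                     // load into the cslog table
            }
        }
    }
}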


4 Data structuring

Sessions, a.k.a. transactions, constitute the basis of most web log mining processes. They are related to users and are composed of the pages visited during a single browsing activity.

This chapter starts with a description of user identification, which is essential for session identification. This is followed by details on the grouping of users, also a relevant topic since their characterization is the main goal of this project. The next section deals with session identification methods and types, and discusses how the selection method is restricted to groups of users. The final section presents a comprehensive overview of the data structuring process.

4.1 User identification

Identification of users is essential for efficient data mining. It makes it possible to distinguish user specific data within the whole data set.

It is straightforward to identify users in Intranet applications, since they are required to identify themselves by following a login process. It is much more complicated in the case of public domains. The reason is that Internet protocols (e.g., HTTP, TCP/IP) do not require user authorization from client applications (e.g., web browsers). The only private information exchanged is the IP address of the client’s machine, and identification based on this information is unreliable: multiple users may use the same machine (thus the same IP address) to connect to the Internet, while a single user may use several machines to access the same service. Besides, proxy servers and firewalls hide the true IP address of the client.

There are several ways to resolve this problem. Content providers can force users to register for their services; users then have to follow a login process each time they want to browse the content. To avoid explicit user authentication, servers can use so-called cookies. Cookies are user specific files stored on client machines. Each time a user visits the same service, the server can obtain user information from the stored cookies.

The most accurate identification based solely on access log files is to use both IP address and browser agent type as a unique user identification pair [13]. However some papers use IP/cookie pairs [2].

The identification procedure proposed in this thesis takes place inside the database as a select query, which fills the users table from the cslog table. Table 4 shows the data scheme of the users table.


Data scheme of the users table

column name     type name
id              bigint
remotehost      varchar
host_name       varchar
TLD             varchar
user_agent      text

Table 4

The remotehost and user_agent fields correspond to the above-mentioned identification pair, while host_name and TLD will be discussed in the next section (4.2 User groups).

4.2 User groups

Arranging users into specific user groups is essential for further examinations. All the statistics and models described later are based on sessions belonging to user groups.

The advantage of user authenticated systems is the availability of personal information on registered users. This would help to form the most exact and diverse groups for them.

For public domains, the possibilities are restricted to the information that can be mined from the access log files: groups can be formed based on user IP addresses (network ranges), geographical data, visiting frequency, etc.

Access log file entries contain either the IP address or the domain name in the remotehost field. Therefore, in both cases the corresponding IP address or domain name should be looked up and the users table updated accordingly. After this process the remotehost field refers to the IP address, while the host_name field holds the corresponding domain name in the users table.

Organizational groups

A natural grouping of users is present in most internal networks in the form of subnetwork address ranges. Subnetwork address ranges determine subnetwork domains within the whole network. There can be separate network ranges for user groups like staff, management, students, administration, etc. Using these ranges and the IP addresses of users, a variety of groups can be formed.

Geographical groups

Most network (IP) addresses or network ranges have a domain name registered to them. The domain name consists of level and sublevel names divided by dots. The right-most name of the whole string is the top level domain (TLD). TLDs can be country codes like nl, hu, uk, etc., or other reserved names for public organizations such as com, org, gov, etc. The rest of the domain name may be built of organization names followed by department names etc., all in a hierarchical structure (e.g., www.cs.vu.nl).

A geographical distinction among users can be set up using TLD names. A group can be formed, for example, based on the “nl” TLD; users are selected for this group by searching for the “nl” TLD in their corresponding domain names. No geographical observations can be obtained from organizational TLDs, such as the network infrastructure (net) and commercial (com) top level domains, because these domains can be registered worldwide and thus have no clear relationship to countries.
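For illustration, a minimal sketch of deriving the TLD from a resolved host name; hosts that are only available as IP addresses would first require a reverse DNS lookup.

// Minimal sketch: deriving the top-level domain used for geographical grouping.
// Hosts given only as IP addresses would first need a reverse DNS lookup.
public class TldExtractor {
    /** Returns the top-level domain of a host name, or null for bare IP addresses. */
    public static String topLevelDomain(String hostName) {
        int lastDot = hostName.lastIndexOf('.');
        if (lastDot < 0 || lastDot == hostName.length() - 1) {
            return null;
        }
        String tld = hostName.substring(lastDot + 1);
        return tld.chars().allMatch(Character::isDigit) ? null : tld.toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(topLevelDomain("www.cs.vu.nl"));   // nl
        System.out.println(topLevelDomain("82.168.4.229"));   // null (IP address, no domain name)
    }
}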


4.3 Session identification

Sessions constitute the basis of most web log mining processes. They are related to users and composed of pages visited during a separate browsing activity. Visited pages belong to a specific domain and form a sequence in visiting order.

It is worth mentioning that not all the requests are present in log files. Most of the browsers use cache technology that allows the usage of previously visited pages instead of downloading them again. Besides, proxy servers also use page caching. They collect all frequently visited pages within a company and store them to reduce bandwidth load.

As a result, some pages in a visiting sequence are visited in “offline” mode, meaning that no entry in the log files refers to these accesses. This problem can be solved by setting the expiration timestamp of pages to a minimum, which forces clients to download expired pages; however, this solution assumes that we can change the structure of the documents. Several methods have been proposed (e.g., [13]) that offer algorithmic solutions to this problem. We believe that the main characteristics can be observed without the need for such data preparation techniques.

There are several session identification methods described in the literature [6, 13, 20]. The most widely accepted methods are the so-called time frame (or time window) identification [13] and the maximal forward reference (MFR) identification [7].

Both methods work on pre-selected page accesses, i.e., on data grouped by user and ordered by access time. The data consist of the user identification number (id field), the date and time of the page access (transdate field) and the content type of the visited page (content_type field). In addition, MFR requires the request URL (request field).

The time frame identifier divides a user’s page accesses using a time window. This window or time interval is suggested to be approximately 30 minutes [13, 14, 30]; most commercial products set a 30 minute timeout interval for splitting. The identifier iterates through the entries and, whenever an entry’s access time (transdate) falls outside the time interval, it starts a new session and begins measuring the time interval from that entry again.

The maximal forward reference identifier adds page access entries to a session list up to the page before a backward reference is made. A backward reference is defined as a page that is already contained in the set of pages of the current transaction. In that case it starts a new session list and continues the iteration. For example, the access sequence A B C B D E E E F G would be broken into four transactions: A B C, B D E, E, and E F G. The drawback of this method is that it does not consider that some of the “backward” references may provide useful information; besides, it may place entries in the same session even if a week elapsed between them.
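A minimal sketch of the time frame identifier for a single user’s accesses, assuming they are already ordered by transdate; the PageAccess record is illustrative. A typical call would use a window of Duration.ofMinutes(30).

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of time-frame session identification for one user's accesses,
// already ordered by transdate. The window is measured from the first entry of
// the current session, as described above.
public class TimeFrameIdentifier {
    public record PageAccess(Instant transdate, int contentType) {}

    public static List<List<PageAccess>> identify(List<PageAccess> accesses, Duration window) {
        List<List<PageAccess>> sessions = new ArrayList<>();
        List<PageAccess> current = new ArrayList<>();
        Instant sessionStart = null;
        for (PageAccess access : accesses) {
            if (sessionStart == null
                    || Duration.between(sessionStart, access.transdate()).compareTo(window) > 0) {
                current = new ArrayList<>();        // out of the window: start a new session
                sessions.add(current);
                sessionStart = access.transdate();  // measure the interval from this entry again
            }
            current.add(access);
        }
        return sessions;
    }
}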


4.4 An overall picture

The figure below represents the functional model of session identification.

[Figure 3: Functional model of the session identification process. The TransactionMemoryIterator object reads page access entries from the cslog and users tables; a UserGroupSelector (UserIPGroupSelector or UserCountryGroupSelector) selects the entries of the chosen user group; an Identifier (TimeFrameIdentifier or MFRIdentifier) groups them into sessions; and the SessionFormatPrinter writes the identified sessions to the output in the appropriate data format. Settings are read from webmining.prop.]

At the beginning TransactionMemoryIterator object retrieves all the log entries from cslog table ordered by id and sub-ordered by transdate.

Note that although the number of log entries can be large, the memory requirement of the whole dataset is still manageable, because the only information needed for an entry is its id, content_type and transdate (and the URL for MFR identification).

After fetching the data, the TransactionMemoryIterator iterates through the user ids and for each id asks the UserGroupSelector to decide whether the given user belongs to the group or not. More specifically, the UserGroupSelector can be a subnetwork range selector (UserIPGroupSelector) or a geographical group selector (UserCountryGroupSelector), depending on the settings in the webmining.prop properties file (for more information on group selection refer to section 4.2 User groups).

When a user is selected by the group selector, his access entries are passed on to the Identifier, which groups them into user sessions.


Note again that an Identifier can more specifically be, as described earlier in the session identification section (4.3 Session identification), a time frame identifier (TimeFrameIdentifier) or a maximal forward reference identifier (MFRIdentifier).

Finally, the identified sessions for a user are appended to the output file by the SessionFormatPrinter in the appropriate format (e.g., association rule format, mixture model format, global tree model format, etc.).


5 Profile mining models

So far we discussed all techniques and steps required for data preparation and data enrichment. This chapter deals with the discussion of data mining models used in this project for pattern discovery on enriched data.

It starts with an explanation of the widely used association rules mining technique and follows with the discussion of a recent model called mixture model. Finally it presents the global tree model, which represents session data in a natural way and makes it easy to mine session-specific statistics on stored data. This model is also able to represent its structure in an easily interpretable graphical way.

Consider the following formal notion⁵ as the dataset representation for all the models described below.

Notion 5.1
Let $D = \{D_1, D_2, \ldots, D_N\}$ be a transaction (session) data set generated by $N$ individuals, where $D_i$ is the observed data on the $i$-th user, $1 \le i \le N$. Each individual data set $D_i$ consists of one or more transactions for that user, i.e., $D_i = \{y_{i1}, \ldots, y_{ij}, \ldots, y_{in_i}\}$, where $n_i$ is the total number of transactions observed for user $i$, and $y_{ij}$ is the $j$-th transaction of user $i$, $1 \le j \le n_i$.

An individual session $y_{ij}$ consists of the content-type references of the pages visited within a user session: $y_{ij} = \{n_{ij1}, \ldots, n_{ijk}, \ldots, n_{ijk_{ij}}\}$, where $k_{ij}$ is the length of the $i$-th user's $j$-th session, $1 \le k \le k_{ij}$. Each $n_{ijk}$ is a content-type reference, which can take values from the content-type reference range $1 \le n_{ijk} \le K$. Each reference in the range $1 \ldots K$ refers to a content group (refer to section 3.1.2 Content types mapping table).

⁵ Note that the notion is almost the same as that proposed in [9], with the difference that transactions are not considered as sets of items but rather as ordered lists of the content types of the pages visited within a session.

5.1 Mining frequent itemsets

One of the most well known and popular data mining techniques is the association rules (AR) or frequent itemset mining algorithm. The algorithm was originally proposed by Agrawal et al. [1] for market basket analysis. Because of its wide applicability many revised algorithms have been introduced since then, and AR mining is still a widely researched area.


The aim of association rule mining is to explore relations and important rules in large datasets, expressed as implications of the form “if premise then conclusion”, i.e., $X \Rightarrow Y$ with $X \cap Y = \emptyset$. A dataset is considered as a sequence of entries consisting of attribute values, also known as items. A set of such items is called an itemset (entries themselves are itemsets).

Using the notation introduced at the beginning of this chapter (Notion 5.1), items are the $n_{ijk}$ content-type references and an itemset is a user session $y_{ij}$, with the restriction that each item can occur at most once.

A problem with association rules is that for a given number of items $i$ there are $2^i$ itemsets, and for each $k$-itemset there are $2^k$ rules. This could result in an unacceptable number of rules. The solution is to consider only rules with support and confidence higher than the thresholds $s$ and $c$.

The problem of mining association rules can be decomposed into two major steps:
1. find all frequent itemsets that have support greater than the threshold s, and
2. for each frequent itemset, generate all the rules that have confidence greater than the threshold c.

“Apriori” was the first association rules mining algorithm. Lots of improved algorithms (most of them are “apriori”-based) have been introduced since it was published. In the following we give the pseudo code of the “apriori” algorithm [1].

Let $I = \{i_1, i_2, \ldots, i_n\}$ be the collection of all items, where $i_j$, $1 \le j \le n$, is an item. An itemset is a collection of items in which each item can occur at most once. A transaction or session is an itemset.

Let $X \Rightarrow Y$, with $X \cap Y = \emptyset$, be an association rule.
It has support $s$ (in $D$) if $s\%$ of the transactions in $D$ contain $X \cup Y$.
It has confidence $c$ if $c\%$ of the transactions in $D$ that contain $X$ also contain $Y$.
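For illustration, the small sketch below computes support and confidence of a candidate rule over sessions treated as itemsets of content-type identifiers; the class and its example data are hypothetical.

import java.util.List;
import java.util.Set;

// Small sketch: computing support and confidence of a rule X => Y over sessions
// treated as itemsets of content-type identifiers (duplicates ignored).
public class RuleMetrics {
    public static double support(List<Set<Integer>> sessions, Set<Integer> itemset) {
        long hits = sessions.stream().filter(s -> s.containsAll(itemset)).count();
        return 100.0 * hits / sessions.size();
    }

    public static double confidence(List<Set<Integer>> sessions, Set<Integer> x, Set<Integer> y) {
        long haveX = sessions.stream().filter(s -> s.containsAll(x)).count();
        long haveBoth = sessions.stream()
                .filter(s -> s.containsAll(x) && s.containsAll(y)).count();
        return haveX == 0 ? 0.0 : 100.0 * haveBoth / haveX;
    }

    public static void main(String[] args) {
        List<Set<Integer>> sessions = List.of(Set.of(1, 2, 3), Set.of(1, 2), Set.of(2, 3), Set.of(1, 3));
        System.out.println(support(sessions, Set.of(1, 2)));            // 50.0 (%)
        System.out.println(confidence(sessions, Set.of(1), Set.of(2))); // ~66.7 (%)
    }
}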


Rules can be generated incrementally, starting from 1-itemset conclusions, because of the following property of confidence:

Let $L$ be a frequent itemset and $A \subset L$ a subset. If the confidence of $(L - A) \Rightarrow A$ is $c$, then for any $B \subset A$ the confidence of $(L - B) \Rightarrow B$ is at least $c$.

Pseudo code of the apriori algorithm:

Initial conditions
  L_k : set of large (frequent) k-itemsets (itemsets with minimal support)
  C_k : set of candidate k-itemsets
  D   : set of transactions (as described above), t ∈ D
  s   : support threshold

Algorithm
  L_1 = {frequent 1-itemsets};
  for (k = 2; L_{k-1} <> empty; k++) {
      C_k = set of new candidates;
      for all transactions t ∈ D {
          for all k-subsets sub of t {
              if (sub ∈ C_k) sub.count++;
          }
      }
      L_k = {c ∈ C_k | c.count ≥ s};
  }
  Set of all frequent itemsets = ∪_k L_k

Generation of the set of new candidates C_k
  step 1: C_k ⇐ empty
  step 2: C_k ⇐ {p ∪ q | p, q ∈ L_{k-1} ∧ |p ∪ q| = k}
  step 3: C_k ⇐ C_k − {p ∈ C_k | ∃ a (k−1)-subset of p that is not in L_{k-1}}

5.2 The mixture model

In their paper Cadez et al. (2001) [5] proposed a generative mixture model for predicting user profiles and behaviours based on historical transaction data. A mixture model is a way of representing a more complex probability distribution in terms of simpler models. It uses a Bayesian framework for parameter estimation; moreover, the mixture model addresses the heterogeneity of page visits: even if a user has not visited a page before, the model can predict it with a low probability.

Cadez et al. (2001) presented both a global and an individual model; this thesis applies only the global mixture model. Transaction data in this thesis consistently mean web page visits or sessions, instead of the slightly different market basket data discussed in [5]. While sessions are ordered sequences of visited pages, market baskets are sets of purchased items. Session data can, however, simply be transformed into the market basket data structure for applying the mixture model:

Notion 5.2 (alteration of Notion 5.1)
For the mixture model approach the transaction notion is altered in the following way: an individual session $y_{ij}$ consists of the counts of content-type references of the pages visited within a user navigational sequence, $y_{ij} = \{n_{ij1}, \ldots, n_{ijk}, \ldots, n_{ijK}\}$, where $n_{ijk}$ indicates how many pages of content type $k$ occur in the $i$-th user's $j$-th session, $1 \le k \le K$.

The global mixture model consists of K components. Each component describes a prototype transaction forming a basis function: a component models a specific session prototype consisting of visited page types with counts relatively higher than for the other items. A K-component mixture model for modelling a user's site visit $y_{ij}$ is given below:

Notion 5.3 – K-component mixture model
$$p(y_{ij}) = \sum_{k=1}^{K} \alpha_k P_k(y_{ij}) \qquad (1)$$
where $\alpha_k > 0$ is the component weight of the $k$-th component, $\sum_k \alpha_k = 1$, and $P_k$, $1 \le k \le K$, is the $k$-th mixture component.

As for modelling the components, [5] proposed a simple memoryless multinomial model. For every component there is a multinomial distribution $\Theta_k = (\Theta_{k1}, \ldots, \Theta_{kC})$, conditioned on $n_{ij}$, the total number of pages visited in the $i$-th user's $j$-th session. The mixture model (Notion 5.3 – (1)) completed with multinomials can be written as

Notion 5.4 – Mixture model with multinomials
$$p(y_{ij}) = \sum_{k=1}^{K} \alpha_k \prod_{c=1}^{C} \Theta_{kc}^{n_{ijc}} \qquad (2)$$


The full data likelihood is presented below, under the assumption that individuals behave independently:

Notion 5.5 – Full data likelihood
$$p(D \mid \Theta) = \prod_{i=1}^{N} p(D_i \mid \Theta) \qquad (3)$$
where $\Theta$ represents the unknown parameters: both the parameters of the K component multinomials, $\{\Theta_1, \ldots, \Theta_K\}$, and the $\alpha$ vector of profile weights, $\{\alpha_1, \ldots, \alpha_K\}$.

The unknown parameters $\{\Theta_1, \ldots, \Theta_K\}$ and $\{\alpha_1, \ldots, \alpha_K\}$ are estimated by an expectation maximization (EM) algorithm.
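As an illustration of equation (2), the sketch below evaluates the multinomial mixture for one session given already estimated parameters; the parameter values are arbitrary and the EM fitting itself is not shown.

// Minimal sketch: evaluating the multinomial mixture (Notion 5.4, equation (2))
// for one session given already estimated parameters; EM fitting is not shown.
public class MixtureModel {
    private final double[] alpha;     // component weights, alpha[k] > 0, summing to 1
    private final double[][] theta;   // theta[k][c]: probability of content type c in component k

    public MixtureModel(double[] alpha, double[][] theta) {
        this.alpha = alpha;
        this.theta = theta;
    }

    /** counts[c] = n_ijc, how many pages of content type c occur in the session y_ij. */
    public double probability(int[] counts) {
        double p = 0.0;
        for (int k = 0; k < alpha.length; k++) {
            double component = alpha[k];
            for (int c = 0; c < counts.length; c++) {
                component *= Math.pow(theta[k][c], counts[c]);   // Theta_kc ^ n_ijc
            }
            p += component;
        }
        return p;
    }

    public static void main(String[] args) {
        MixtureModel model = new MixtureModel(
                new double[]{0.6, 0.4},
                new double[][]{{0.7, 0.2, 0.1}, {0.1, 0.1, 0.8}});
        System.out.println(model.probability(new int[]{2, 1, 0}));   // p(y_ij) for a short session
    }
}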

5.3 The global tree model

Pei et al. (2000) [23] propose the WAP-tree architecture for efficiently mining frequent access patterns. Besides the tree structure, the model contains a link queue for each type of label; the queues connect all nodes with the same label, forming chains. Xing and Shen (2003) [30] present the so-called preferred navigation tree (PNT) for mining preferred navigation paths. A PNT stores the URL, the frequency of visits and the viewing time in its nodes.

In our approach we use a global tree model (GTM). The GTM provides a special representation of session data for groups of users. The structure of the model is similar to that of the PNT presented in [30]. The model preserves the information obtained from the structure of sessions and stores individual pages in visiting order. In this model, sessions with the same prefix share the same branch of the tree, which reduces the storage required for the model. The model was also built to be able to visualize frequent navigational paths in a tree structure; visualization helps to understand the patterns by highlighting the relevant information.

Each node in a tree model registers four pieces of information: a content-type label, a frequency number, a reference to its parent node and references to its children nodes. The root of the tree model is a special virtual node with an optional title label and frequency 0. Every other node is labelled with one of the content-type labels and is associated with a frequency which stores the number of occurrences, in the original session database, of the corresponding prefix ending with that content type. A model consists of K branches (session trees) connected to the virtual root node, where K is the number of content types (refer to Notion 5.1). Each branch has a root node labelled with a unique content-type identifier, and stores only those user sessions which start with a page of the same content type as its root. Figure 4 presents the visualization of a sample tree model.

An A→B path of the tree, from any node A to any node B (where the level number of A in the tree is not greater than that of B), represents one or more subsessions, where the frequency number of the B node gives the total number of sessions containing this ordered subsequence pattern.

A special case of the A→B path is when A is the root node (of a session tree). In this case the path represents one or more sessions or subsessions, depending on the frequency of the B node and the summed frequency of its children nodes:

Let $f_B$ be the frequency number of the B node and let $sum = \sum_{c \in children(B)} f_c$ be the summed frequency of its children nodes. Let Root→B be the path from the root node to the B node; then Root→B represents at least one real session if $f_B > sum$, in which case the difference $f_B - sum$ gives the number of real Root→B sessions.

Building the tree model

Model building starts with the initialization of the K session trees. Each tree is initialized for a unique content type k. Then all sessions of the data set are added to their corresponding session trees: each session is examined for the type of its first page and a tree is selected according to the result.

Adding a session to its tree can be implemented recursively. The recursive function takes a parent node and a subsession as parameters and updates or creates the child of this parent with the content type given by the first element of the subsession. The recursive step passes the child node as the parent parameter, and the new subsession parameter arises from the removal of the first entry of the original subsession. The recursion stops when the length of the subsession is equal to or less than one.

Algorithm to build the global tree model

Initial conditions
  s ∈ D : session
  s_i : the i-th element of session s
  sessionTrees : array [1..K] of SessionTree
  SessionTree : tree object for content type k, consisting of a root node and children nodes
  node : a SessionTree node containing ct, the content type of the node

Algorithm

scheme of the algorithm:
  init sessionTrees;
  for all s ∈ D { sessionTrees[s_1].add(s); }

initialization of sessionTrees:
  init sessionTrees {
    for i = 1..K {
      sessionTrees[i] = SessionTree(i, 0);
      root.ct ⇐ i; root.freq ⇐ 0;
      root.parent ⇐ null; root.children ⇐ null;
    }
  }

adding a session to the corresponding SessionTree:
  sessionTrees[content_type].add(s: session) {
    if (s_1 <> content_type) return;
    addSession(root, s);
  }

  addSession(node: parentNode, s: session) {
    node.freq++;
    if (s.length > 1) {
      s.removeFirstElement();
      if (exists child of node for s.firstElement()) {
        addSession(child, s);
      } else {
        create child of node for s.firstElement();
        addSession(child, s);
      }
    }
  }
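A compact Java counterpart to the pseudocode above could look as follows; the class and field names are illustrative, and each node stores the four pieces of information listed earlier (content type, frequency, parent and children).

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the global tree model: one branch root per content type,
// sessions sharing a prefix share a branch, node frequencies count prefix occurrences.
public class GlobalTreeModel {
    static class Node {
        final int contentType;
        int freq = 0;
        final Node parent;
        final Map<Integer, Node> children = new HashMap<>();
        Node(int contentType, Node parent) { this.contentType = contentType; this.parent = parent; }
    }

    private final Map<Integer, Node> sessionTrees = new HashMap<>(); // one branch root per content type

    /** Adds a session (a list of content-type identifiers in visiting order) to its tree. */
    public void add(List<Integer> session) {
        if (session.isEmpty()) return;
        Node node = sessionTrees.computeIfAbsent(session.get(0), ct -> new Node(ct, null));
        node.freq++;                                    // the branch root counts the first page
        for (int contentType : session.subList(1, session.size())) {
            final Node parent = node;
            node = parent.children.computeIfAbsent(contentType, ct -> new Node(ct, parent));
            node.freq++;                                // frequency of the prefix ending at this node
        }
    }
}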


Mining preferred paths from GTM

Preferred navigation paths can be mined directly from the tree model: (A, B) paths or sessions whose support is higher than a preset threshold value are the preferred navigation paths. The algorithm given below scans each level of all session trees for possible candidates, ignoring branches with low support.

Mining preferred paths

Initial conditions

s: support threshold
candidates: the list of candidate nodes
candidateChildren: list of candidate children nodes
supported: list of supported sessions and their support values

Algorithm

  candidates ⇐ root node;
  while candidates.size() <> 0 do {
    candidateChildren ⇐ empty;
    for i = 1..candidates.size() do {
      if frequencyOf(root, candidates[i]) ≥ s {
        supported.add((root, candidates[i]), support);
      }
      // gather possible child candidates
      for j = 1..candidates[i].NumberOfChildren do {
        if child_j.freq ≥ s {
          candidateChildren.add(child_j);
        }
      }
    }
    candidates ⇐ candidateChildren;
  }
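A compact Java sketch of this level-wise scan, reusing the hypothetical TreeNode class from the earlier sketch, could look like this:

  import java.util.ArrayList;
  import java.util.List;

  class PreferredPathMiner {
      // Returns, for every node whose frequency reaches the support threshold,
      // the path of content types from the tree root down to that node.
      static List<List<Integer>> minePreferredPaths(TreeNode root, int supportThreshold) {
          List<List<Integer>> supported = new ArrayList<>();
          List<TreeNode> candidates = new ArrayList<>();
          List<List<Integer>> paths = new ArrayList<>();
          candidates.add(root);
          paths.add(List.of(root.contentType));
          while (!candidates.isEmpty()) {
              List<TreeNode> nextCandidates = new ArrayList<>();
              List<List<Integer>> nextPaths = new ArrayList<>();
              for (int i = 0; i < candidates.size(); i++) {
                  TreeNode node = candidates.get(i);
                  List<Integer> path = paths.get(i);
                  if (node.freq >= supportThreshold) {
                      supported.add(path);
                  }
                  // only children that themselves reach the threshold stay candidates
                  for (TreeNode child : node.children.values()) {
                      if (child.freq >= supportThreshold) {
                          List<Integer> childPath = new ArrayList<>(path);
                          childPath.add(child.contentType);
                          nextCandidates.add(child);
                          nextPaths.add(childPath);
                      }
                  }
              }
              candidates = nextCandidates;
              paths = nextPaths;
          }
          return supported;
      }
  }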

Trees’ similarity

The distance measure proposed is a simple approach based on forming the intersection of the two trees’ session data sets.

The similarity proportion can then be calculated by dividing the sum value (the number of common sessions) by the number of distinct sessions in the two trees together, i.e. the total number of sessions in the two trees minus the sum value: similarity = sum / (|T1| + |T2| - sum), where |Ti| denotes the number of sessions in tree Ti. Multiplying the result by 100 gives the similarity percentage.

Initial conditions

T1, T2: the two tree models
candidates: the list of candidate nodes
candidateChildren: list of candidate children nodes
sum: registers the number of all common sessions

Algorithm

  sum ⇐ 0;
  candidates ⇐ all root nodes of session trees that occur in both models;
  while candidates.size() <> 0 do {
    for i = 1..candidates.size() do {
      if both T1 and T2 contain a (root, candidates[i]) session s {
        sum ⇐ sum + the number of such common sessions;
      }
    }
    candidateChildren ⇐ all children nodes of the candidates that are present in both trees;
    candidates ⇐ candidateChildren;
  }
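Under the simplifying assumption that the sessions of each tree have already been enumerated as sets of content-type sequences, the proportion can be computed as in this Java sketch (a Jaccard-style measure over distinct sessions; the names are illustrative):

  import java.util.HashSet;
  import java.util.List;
  import java.util.Set;

  class TreeSimilarity {
      // Similarity percentage over the two trees' session sets, where each
      // session is represented as a list of content-type ids.
      static double similarityPercentage(Set<List<Integer>> sessionsT1,
                                         Set<List<Integer>> sessionsT2) {
          Set<List<Integer>> common = new HashSet<>(sessionsT1);
          common.retainAll(sessionsT2);            // the "sum" of common sessions
          int distinct = sessionsT1.size() + sessionsT2.size() - common.size();
          return distinct == 0 ? 100.0 : 100.0 * common.size() / distinct;
      }
  }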

Assumptions

1. A similarity distance measures not only the structure of the trees but also (or rather) the frequencies of their nodes. Higher frequencies should be taken into account with higher weights.

2. The extra information that originates from the sessions should be exploited.

3. Considering trees T1 and T2, the distance of T1 from T2 should be equal to the distance of T2 from T1; formally, T1.dist(T2) = T2.dist(T1).


Visualization of tree models

Frequent navigational paths are conventionally represented as text or tables, which are not easy to understand. Visualization of a tree model, however, makes the patterns easy to interpret. A picture of a tree model consists of nodes carrying content-type labels and their colour codes. Nodes are connected by edges of varying thickness, marking the frequencies of the given paths. In addition to the thickness, each edge carries a number giving the proportion of the parent's frequency that falls on that child, and the number of “real” sessions for that path of the tree is given in parentheses. The tree visualization contains only the supported sessions, based on a support threshold set for the model. Figure 4 presents a sample tree.
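One straightforward way to produce such a picture (not necessarily the approach used for Figure 4) is to export the tree to Graphviz DOT format, as in this sketch building on the hypothetical TreeNode class above:

  class TreeDotExporter {
      // Emits a Graphviz DOT description of a session tree, labelling nodes with
      // their content type and frequency, and edges with the child's frequency;
      // only branches reaching the support threshold are drawn.
      static String toDot(TreeNode root, int supportThreshold) {
          StringBuilder sb = new StringBuilder("digraph SessionTree {\n");
          appendNode(sb, root, "n", supportThreshold);
          return sb.append("}\n").toString();
      }

      private static void appendNode(StringBuilder sb, TreeNode node, String id, int threshold) {
          sb.append(String.format("  %s [label=\"type %d (%d)\"];%n", id, node.contentType, node.freq));
          int i = 0;
          for (TreeNode child : node.children.values()) {
              if (child.freq < threshold) continue;   // hide unsupported branches
              String childId = id + "_" + (i++);
              sb.append(String.format("  %s -> %s [label=\"%d\"];%n", id, childId, child.freq));
              appendNode(sb, child, childId, threshold);
          }
      }
  }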


6 Analysing log files of the www.cs.vu.nl web server

For the purpose of this thesis the discussion is restricted to the analysis of user behaviour within a single web domain, www.cs.vu.nl. All the data used in the following experiments therefore originate from the web server of the Computer Science Department of the Vrije Universiteit.

This chapter presents experimental results using all the techniques described earlier. The first section describes the details of the input access log files and mapping table. This is followed by experimental results of data preparation and data structuring techniques. Finally, the last sections present results of the three profile mining models AR, MM and GTM.

Results of association rule and frequent itemset mining can show which page sets users tend to visit within a session and what rules can be defined on the frequent itemsets. A mixture model can tell what distribution the data come from and how many components (corresponding to different user behaviours) are likely to have generated the data. Both AR mining and the mixture model ignore the information carried by the order of pages within sessions. The global tree model, in contrast, is based on the structure of sessions. It can answer the question which session sequences (or subsequences) are highly preferred by users, and it also provides a visualization of frequent navigational paths in the tree structure.

Most of the algorithms were implemented in the Java programming language. For further details on their implementation each section refers to the proper APPENDIX table.

Only the most frequent and most important patterns are presented in this chapter; an additional CD-ROM accompanying this master thesis contains all the results and outputs of the experiments (refer to APPENDIX E).

6.1 Input data

The input data in this case are the access log files of the www.cs.vu.nl web server for a certain period of time, the content types mapping table of the HTML pages of the www.cs.vu.nl domain and the organizational and geographical information for user group identification.

6.1.1 Access log files

Four consecutive access log files were collected and merged together from the www.cs.vu.nl server. In total they sum up to one month of access log entries. The details are summarized in the table below:

Details on the merged access log entries

File name:          cs_access_log_20040530-20040704
Size (MB):          1 533,344
Period:             30 May 2004 – 4 July 2004
Number of entries:  7 126 732


The Apache web server of the www.cs.vu.nl domain writes the following fields, in the given order, into the log files: remotehost, rfc931, authuser, date, request, status, bytes, referrer, user_agent. For the accepted access log file structure refer to APPENDIX B2.
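For illustration only, a minimal Java sketch for splitting such a combined-format line into the nine fields listed above might use a regular expression (the structure actually accepted by the webmining package is the one defined in APPENDIX B2):

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  class LogLineParser {
      // Combined-log-format fields: remotehost rfc931 authuser [date] "request"
      // status bytes "referrer" "user_agent"
      private static final Pattern LINE = Pattern.compile(
          "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

      // Returns the nine fields in order, or null for a malformed entry
      // (which would be skipped during cleaning).
      static String[] parse(String line) {
          Matcher m = LINE.matcher(line);
          if (!m.matches()) return null;
          String[] fields = new String[9];
          for (int i = 0; i < 9; i++) fields[i] = m.group(i + 1);
          return fields;
      }
  }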

6.1.2 The mapping table

Data enrichment is partly based on the content information of visited web pages. This information is given by a table with URL/content type entries. The table was generated by a text mining algorithm that was developed in a different project [3]. The text mining algorithm attaches labels to all HTML pages of a document set based on their contents.

The HTML pages (VU-pages) were downloaded by wget [36]. The parameters passed to wget force it to download all *.htm and *.html files from the www.cs.vu.nl domain recursively, to a depth of five levels; in case of a page access failure it retries the download up to four more times.

This resulted in a collection of 13 001 HTML pages (with a total size of about 90 MB) that were subsequently assigned to 19 categories:

Description of the content-types (content-categories)

type id.  type name  description

1  photo
This type refers to pages containing a negligible quantity of textual information with one or more images. It most likely refers to personal photo albums, lecture slides or informational pages with messages like “under construction” or “this page has been moved to …”.

2  miscellaneous
“Miscellaneous” type refers to pages with absent or insufficient content. It most likely refers to framesets, empty, file list, form, moved or redirected pages. It can also contain photo pages in case the page does not contain relevant textual information.

3  dutch/department
This type-group contains department pages in Dutch.

4  english/reference
“English/reference” group most likely refers to pages containing e-books or manual pages for different systems or programs. It can be a manual for an operating system or an API reference for a programming language. It contains pages written in English.

This group most likely refers to pages containing invitations for official or free time activities. Among these events can be


/faculty members. They are usually very formal and they mostly consist of fields of research, professional background, research projects and other information related to the member’s research area or department. It contains pages written in English.

9  english/person/student
This group most likely refers to student pages. Student pages mostly contain personal information (e.g., hobby, lyrics, etc.) and links to pages of friends and courses. The group contains pages written in English.

10  english/person/faculty/publication
“English/person/faculty/publication” category most likely refers to pages containing publications of faculty members comprising at least the abstracts. It contains pages written in English.

11  english/course
This group most likely refers to course pages. They mostly contain the description of the course, lecture slides, recommended literature and set assignments in English.

12  dutch/course
The same as the english/course group, but containing pages written in Dutch.

13  dutch/person/student
The same as the english/person/student group, but containing pages written in Dutch.

14  dutch/person/faculty
The same as the english/person/faculty group, but containing pages written in Dutch.

15  other_language
This type-group contains pages written in languages other than English or Dutch.

16  dutch/project
The same as the english/project group, but containing pages written in Dutch.

17  dutch/activity
The same as the english/activity group, but containing pages written in Dutch.

18  documents
This group contains documents in Adobe Acrobat (pdf) or Postscript (ps) format. They are most likely to be scientific papers, publications, e-books, etc.

19  other documents
“Other documents” contains documents in Microsoft Word (doc), Microsoft PowerPoint (ppt), Microsoft Excel (xls), Rich text (rtf) or plain text (txt) format. They are most likely to be administrative papers, forms, course materials, etc.

Table 6: Description of the content-types

The labelling algorithm (assisted by a human) provided only an approximate categorization of pages: roughly 74% of the pages received the right labels; see [3] for details.

To reduce the length of type names, in some places we will use the letters “E” and “D” to refer to the English and Dutch groups (e.g., “E/department” refers to “english/department”). For the file structure of the mapping table accepted by the webmining package refer to APPENDIX B3.
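The exact file layout of the mapping table is defined there; purely as an illustration, assuming a simple tab-separated URL/type-id layout, loading it in Java could look like the following sketch:

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.util.HashMap;
  import java.util.Map;

  class MappingTable {
      // Loads URL -> content-type-id pairs; the "URL<TAB>typeId" layout is an
      // assumption made here for illustration only (see APPENDIX B3 for the
      // format actually accepted by the webmining package).
      static Map<String, Integer> load(Path file) throws IOException {
          Map<String, Integer> mapping = new HashMap<>();
          try (BufferedReader reader = Files.newBufferedReader(file)) {
              String line;
              while ((line = reader.readLine()) != null) {
                  String[] parts = line.split("\t");
                  if (parts.length == 2) {
                      mapping.put(parts[0], Integer.parseInt(parts[1].trim()));
                  }
              }
          }
          return mapping;
      }
  }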
