In general, the web log records the activities of all users who visit the web site.
The web log must be cleaned before it is used; this step is referred to as preprocessing. The goal of preprocessing is to clean the web log so as to minimise interference from robots and to prepare the log for generating the actual output of this stage. This is in fact the preprocessing step for the whole system: the clean web log data it produces is the input to both the association rules mining and the social network construction and analysis methodology.
The preprocessing step converts the raw log into a form that a specific implementation of an algorithm can accept as input. It is therefore important to understand both the raw data and the target representation to which the data is to be mapped, so that preprocessing produces the intended outcome. One issue in analysing the data is how to distinguish unique users [36]. Some users connect through a proxy server or share a data connection, so the Internet Protocol (IP) address alone cannot serve as a unique identifier; this is one of the main issues in web usage mining. Cooley et al. [36] single out several of the many suggestions found in the literature for getting around this problem, including combining the machine name, browser information, or temporal information with the IP address to identify users. Other methods described in the literature consult the site topology [82] or session timeouts [28] to determine legitimate page traversals by a single user. Once the users have been semi-uniquely identified, we can parse the data into the format required for the next steps.
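As an illustration of the first of these suggestions, the following minimal sketch groups requests by the pair of IP address and browser string. The `Hit` record, its field names, and the function `assign_user_ids` are hypothetical names introduced only for this example; the sketch also assumes that the log records a browser (User-Agent) string, which not every log format does.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One parsed log entry (hypothetical record layout)."""
    ip: str           # client IP address as logged
    agent: str        # browser string; assumed to be present in the log
    timestamp: float  # request time in seconds
    url: str


def assign_user_ids(hits):
    """Group hits by (ip, agent), a semi-unique user key.

    Two users behind the same proxy with the same browser are still
    merged, so this only reduces the ambiguity; it does not remove it.
    """
    users = defaultdict(list)
    for hit in sorted(hits, key=lambda h: h.timestamp):
        users[(hit.ip, hit.agent)].append(hit)
    return dict(users)
```

Each value of the returned dictionary is the time-ordered request stream of one presumed user, which is the form needed for splitting into transactions.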
A sequence of page views separated into transactions can be very useful for web usage mining activities. At this preprocessing stage, each user session is separated into transactions [36]. In other words, each transaction contains the sequence of pages a particular user viewed on a given web site. A new transaction is created whenever the time between two consecutive page views exceeds some predefined limit. For instance, Catledge et al. [28] found that 25.5 minutes is a good timeout period based on their user experiments. Computing these sequences also allows us to easily calculate the time a user spends on each page, which can be correlated with the interest or relevance of the page to the user. It should be noted that we extract only the relevant lines from the log file: requests for images or other media that are not directly related to the information the user is accessing can be discarded at the beginning of the preprocessing stage.
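The timeout rule can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the function names and the representation of a page view as a `(timestamp_seconds, url)` pair are assumptions, and the default limit is simply the 25.5 minutes reported in [28].

```python
def split_into_transactions(views, timeout_seconds=25.5 * 60):
    """Split one user's page views into transactions.

    `views` is a list of (timestamp_seconds, url) pairs sorted by time.
    A new transaction starts when the gap to the previous view exceeds
    `timeout_seconds`.
    """
    transactions = []
    current = []
    last_time = None
    for timestamp, url in views:
        if last_time is not None and timestamp - last_time > timeout_seconds:
            transactions.append(current)
            current = []
        current.append((timestamp, url))
        last_time = timestamp
    if current:
        transactions.append(current)
    return transactions


def time_on_pages(transaction):
    """Time on a page is the gap to the next view in the same transaction.

    The last page has no following view, so its viewing time is unknown
    and it is left out of the result.
    """
    return [
        (url, next_ts - ts)
        for (ts, url), (next_ts, _) in zip(transaction, transaction[1:])
    ]
```

Note that the time spent on the last page of a transaction cannot be recovered from the log alone; how to treat it (ignore it, or substitute an average) is a design choice left open here.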
We also perform what we call path compression on the sequences before they are handed off for analysis. This means we ensure that no page appears twice in a row in a sequence. For example, the sequence A,A,B,C,C,D,E,F is compressed to A,B,C,D,E,F.
This allows our algorithms to be written much more cleanly and should have little impact on the final results. While it is true that a user who views the same page several times consecutively may be much more interested in its content, we feel this is not relevant to the basis of our analysis. In fact, a user may visit the same page several times in a session or sequence of sessions because it contains some information that caught his/her attention, and hence may come back to refresh his/her mind or check more details. In the end, we are interested in the fact that the page was visited as a destination, not as a junction along the path leading to a destination.
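Path compression as described above amounts to collapsing runs of identical consecutive pages. A minimal sketch, in which the function name `compress_path` is introduced only for this example:

```python
from itertools import groupby


def compress_path(pages):
    """Collapse consecutive repeats of the same page into one view.

    Non-consecutive revisits are kept, so A,B,A stays A,B,A.
    """
    return [page for page, _ in groupby(pages)]


print(compress_path(["A", "A", "B", "C", "C", "D", "E", "F"]))
# prints ['A', 'B', 'C', 'D', 'E', 'F']
```

Only adjacent duplicates are removed; a page that is revisited later in the same sequence is preserved, which keeps its role as a destination visible.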
We applied this data preparation methodology to the data used in this chapter and also in Chapter 4. Further, in order to present our approach clearly, we adopted an example-driven approach in which we study a specific web log of catalogue browsing. To be precise, we have used the Music Machines¹ web log from the University of Washington’s web log data repository². This web log exhibits users’ browsing behaviour on a musical instrument catalogue and consists of around 2500 distinct pages, including HTML pages, images, media, and text files. For our analysis, we have considered only the portion of the data related to the year 1997. To keep things simple, we also did not consider dynamic pages, images, and media files; thus we considered only HTML files. The web log was sessionised using Chen et al.’s approach [30] with a session interval of 30 minutes.
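The selection of log lines described above can be sketched as a simple filter. The sketch assumes that the log is in the Common Log Format, which may not match the actual layout of the Music Machines log; the regular expression, the function name `keep_line`, and the treatment of a `?` in the URL as the mark of a dynamic page are likewise assumptions made for illustration.

```python
import re
from datetime import datetime

# Assumed layout (Common Log Format):
# host ident user [10/Oct/1997:13:55:36 -0700] "GET /x.html HTTP/1.0" 200 2326
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def keep_line(line, year=1997):
    """Return True for requests to static HTML pages made in `year`."""
    match = CLF.match(line)
    if match is None:
        return False  # malformed line
    url = match.group("url")
    if "?" in url:
        return False  # treated as a dynamic page (assumption)
    when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return when.year == year and url.lower().endswith((".html", ".htm"))
```

A cleaned log is then obtained by keeping only the lines for which `keep_line` returns true, for example `[line for line in log_file if keep_line(line)]`.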
One approach that may be effective for this preprocessing step is described in [110]. It concentrates on identifying and discarding sessions that exhibit the following access patterns, which are likely to be characteristic of robots:
• Trying to avoid time latency by visiting after midnight, i.e., during light traffic load periods.
• Using “HEAD” instead of “GET” as the access method to verify the validity of a hyperlink; the “HEAD” method performs faster in this case as it does not retrieve the web document itself.
• Performing breadth-first rather than depth-first search; robots do not navigate down to low-level pages because they do not need to access detailed and specific topics.
• Ignoring graphical content; robots are not interested in images and graphic files because their goal is to retrieve information and possibly to create and update their databases.
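The four patterns above can be turned into simple per-session checks. The following is only a sketch under stated assumptions and is not the procedure of [110]: the `Request` record, the night-time window (00:00 to 05:59), the use of URL depth as a proxy for breadth-first navigation, the image suffixes, and the rule that three signals mark a robot are all hypothetical choices. It also assumes that image requests are still present in the session, i.e., that robot detection runs before media files are discarded.

```python
from typing import NamedTuple

IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")  # assumed list


class Request(NamedTuple):
    """One request of a session (hypothetical record layout)."""
    method: str  # "GET", "HEAD", ...
    url: str
    hour: int    # local hour of day, 0-23


def url_depth(url):
    """Number of non-empty path segments, e.g. /a/b.html has depth 2."""
    return len([segment for segment in url.split("/") if segment])


def robot_signals(session, night_hours=range(0, 6), max_depth=2):
    """Return the names of the heuristic signals a session triggers."""
    if not session:
        return set()
    signals = set()
    if all(r.hour in night_hours for r in session):
        signals.add("night_only")
    if any(r.method == "HEAD" for r in session):
        signals.add("uses_head")
    if max(url_depth(r.url) for r in session) <= max_depth:
        signals.add("shallow")
    if not any(r.url.lower().endswith(IMAGE_SUFFIXES) for r in session):
        signals.add("no_images")
    return signals


def looks_like_robot(session, min_signals=3):
    """Flag a session that triggers at least `min_signals` signals."""
    return len(robot_signals(session)) >= min_signals
```

No single signal is conclusive (a human may browse at night or with images disabled), which is why the sketch requires several signals to coincide before a session is discarded.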
Sessions can then be identified from the cleaned log file. They are used to compute, for each page p_i, the total number of visits v_i and the total time t_i spent by visitors. It is necessary to confirm that the number of user sessions extracted from the web log is large enough to provide a realistic ranking of popular pages. Finally, the outcome of this preprocessing step is the main input to our web log analysis approach described in this chapter.
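The quantities v_i and t_i can be accumulated in a single pass over the sessions. The sketch below is illustrative only: the function name, the representation of a session as a time-ordered list of `(timestamp_seconds, page)` pairs, and the decision to count every page view as one visit are assumptions.

```python
from collections import defaultdict


def page_statistics(sessions):
    """Return (visits, time_spent) dictionaries keyed by page.

    visits[p] counts the views of page p over all sessions (v_i);
    time_spent[p] sums the gaps to the following view (t_i). The last
    view of a session is counted as a visit, but its duration is
    unknown and adds nothing to time_spent.
    """
    visits = defaultdict(int)
    time_spent = defaultdict(float)
    for session in sessions:
        for (ts, page), (next_ts, _) in zip(session, session[1:]):
            visits[page] += 1
            time_spent[page] += next_ts - ts
        if session:
            visits[session[-1][1]] += 1
    return dict(visits), dict(time_spent)
```

Pages can then be ranked by either dictionary, for instance with `sorted(visits, key=visits.get, reverse=True)`.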
1 http://machines.hyperreal.org/
2 http://www.cs.washington.edu/ai/adaptive-data/