Employing Probability Measures in Sequence Analysis

3.2 Web Log Analysis

3.2.2 Employing Probability Measures in Sequence Analysis

Several types of sequences must be investigated in order to estimate the probability of the next page to be viewed. There are three possible cases:

(1) Consider only one step back, i.e., pairs of consecutively accessed pages determine the probability of accessing the next page from the current one. For instance, the sequences of views A,B and A,C lead to the case where from page A there is a 50% probability of viewing page B and a 50% probability of viewing page C.

(2) Consider the complete sequence of pages already viewed. This requires matching the whole prefix sequence up to the point where the next page is predicted. For instance, observing A,B,C,D in one session and A,B,C,E in another leads to the prediction that a user who views the prefix A,B,C is 50% likely to view page D and 50% likely to view page E.

(3) Consider the set of pages previously viewed. This case is similar to the previous one, but the order of the previously visited pages is not important.
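The three cases differ only in what is treated as the context preceding the next page. The following is a minimal Python sketch, assuming each session is available as a list of page identifiers. The function names and the toy sessions are illustrative assumptions and are not taken from the original study.

```python
from collections import Counter, defaultdict


def next_page_probabilities(sessions, context_of):
    """Estimate P(next page | context) by relative frequency.

    `context_of` maps the pages viewed so far (a non-empty list) to a
    hashable context key. It is a hypothetical hook for this sketch.
    """
    counts = defaultdict(Counter)
    for pages in sessions:
        for i in range(1, len(pages)):
            counts[context_of(pages[:i])][pages[i]] += 1
    return {
        context: {page: n / sum(nexts.values()) for page, n in nexts.items()}
        for context, nexts in counts.items()
    }


def last_page(prefix):
    """Case (1): only the most recently viewed page matters."""
    return prefix[-1]


def ordered_prefix(prefix):
    """Case (2): the complete ordered sequence of pages viewed so far."""
    return tuple(prefix)


def unordered_set(prefix):
    """Case (3): the set of pages viewed so far, order ignored."""
    return frozenset(prefix)


# Toy sessions mirroring the A,B / A,C example in the text.
one_step = next_page_probabilities([["A", "B"], ["A", "C"]], last_page)
print(one_step["A"])  # {'B': 0.5, 'C': 0.5}
```

Passing `ordered_prefix` or `unordered_set` instead of `last_page` gives the estimates for cases (2) and (3) from the same counting loop.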

These probability measures can be combined to decide whether links to the pages most likely to be viewed next should be moved into a more prominent section of the web page.

Table 3.2: Summary of Next Page Probability values from Previous Page Views

Previous Page    Next Page    Probability

Computing Probabilities Based on Previous Page

The probability of the next page to be viewed, given the previous page viewed, is one of the most important measures within this group of probability measures. It allows us to recommend, on the current page, the products that the vast majority of users are likely to browse to on their own. This lets users find information faster than by navigating manually.

Another way a web site might use this information is to offer the current product as a bundle with the product that most visitors are likely to view next. This not only increases revenue, but also improves user satisfaction by helping users reach their goals without wasting their time. A summary of the top probability values is reported in Table 3.2.
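A site could use the one-step probabilities to choose which link or bundle to surface. The sketch below assumes the probabilities are held in a nested dictionary keyed by current page. The page names, the probability values, and the `min_probability` threshold are all hypothetical and are not values from Table 3.2.

```python
def pages_to_promote(probabilities, current_page, k=1, min_probability=0.3):
    """Return up to k likely next pages for current_page, most likely first.

    Pages whose estimated probability is below min_probability are skipped.
    """
    candidates = probabilities.get(current_page, {})
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return [page for page, p in ranked[:k] if p >= min_probability]


# Made-up probabilities, for illustration only.
example = {
    "guitar.html": {"amp.html": 0.6, "strings.html": 0.3, "home.html": 0.1},
}
print(pages_to_promote(example, "guitar.html", k=2))
# ['amp.html', 'strings.html']
```

A threshold of this kind keeps the site from promoting a link when no single next page clearly dominates.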

Table 3.3: Summary of Next Page Probability values from Previous Sequence

Sequence    Next Page    Probability

When predicting the next page(s) to be viewed, the previous ordered sequence consists of the complete sequence of pages which have been viewed in the same session. The main motivation here is that most sessions of web site visitors include only a small number of pages. This makes it easier to investigate the access patterns further and derive interesting predictions regarding the next page(s) to be viewed. Here paths are compressed by eliminating all repetitions of the same page in the sequence. In other words, the focus is on which pages are visited in a specific session rather than on the number of times a page is visited within that session. It is anticipated that the first few pages viewed will provide an idea about the pages expected to be visited next. Judging by prediction accuracy, this pattern continues to hold as the length of the sequence increases. To illustrate this, consider the following sequence of five pages from our data: 906, 185, 905, 185, 905. There is a 29% chance (see row 8 in Table 3.3) that the user would visit page 905, or a 30% chance (see row 7) that the user would visit page 185.
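The compression step is not spelled out in detail. Because the five-page example still contains non-adjacent repeats of pages 185 and 905, the sketch below assumes that only consecutive repeats (for example, page reloads) are collapsed. This is an interpretation for illustration, not a confirmed description of the original preprocessing, and `compress_path` is a hypothetical name.

```python
from itertools import groupby


def compress_path(pages):
    """Collapse each run of the same page into a single visit.

    Assumption: only adjacent duplicates are removed. Later returns to a
    page that was visited earlier are kept.
    """
    return [page for page, _run in groupby(pages)]


print(compress_path([906, 906, 185, 905, 905, 185, 905]))
# [906, 185, 905, 185, 905]
```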

It is worth noting that, for the purpose of clarity, only page numbers are reported in Table 3.3 instead of full page names. The sequence in question (see row 7) should be read as follows: 30% of the time, users visited page “images/index.html” after visiting the sequence “index.html”, “images.html”, “images/index.html”, “images.html”, “images/index.html”, “images.html”. Analysed closely, this particular sequence may signify that the user is browsing the catalogue of instrument images and comparing many different instruments.

It should be noted that insignificant data has been removed from the results. In other words, there are not many very long sequences that give a good prediction for the next page. This is simply because users drift widely once they get deeper into the site. It is, therefore, only practical to use this web mining method for the first couple of page views; beyond that, the predictions become insignificant. Web sites which provide better organisation (allowing users to find what they are looking for within their first four pages or so) would be better suited to this type of web usage mining, as it could provide useful and informative outcomes.

Table 3.4: Summary of Next Page Probabilities from Previous Unordered Set

Set    Next Page    Probability

This is similar to the above method; however, the order of the pages is no longer treated as relevant. Instead, the pages the user has viewed are considered as a set, and the probability of the next page to be viewed is calculated from this set. This accounts for the fact that users may be reaching the same goals but approaching them in vastly different ways. In other words, if the user has viewed the given set of pages, in any order, then they are likely to view this other page next. This method of web usage mining is extremely resource intensive to compute. As such, it may be discarded at face value because of the longer time it takes to process the data. The Music Machines web log file contains only 643,000 useful entries, which is fairly small compared to some data collected by online shopping web sites.

Some of the important information found is summarised in Table 3.4.

It should be noted that the probability of 100% listed in Table 3.4 is significant. It does not mean that only one particular user went from the set to the next page; such individual results were filtered out to provide only useful data. For example, row 2 in the table suggests that the user would certainly visit the page “/software.html” after visiting the page set {“/categories/software/include.html”, “/search.html”}.
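One way to keep a 100% estimate from resting on a single visit is to require a minimum number of supporting observations for each page set. The sketch below assumes sessions are lists of page names and uses a hypothetical `min_support` threshold. The actual filtering criterion used for Table 3.4 is not stated, and the toy sessions are invented for illustration.

```python
from collections import Counter, defaultdict


def supported_set_rules(sessions, min_support=2):
    """Return {(page_set, next_page): probability} for well-supported sets.

    Support here counts observations of a page set being followed by any
    page. It does not count distinct users.
    """
    counts = defaultdict(Counter)
    for pages in sessions:
        for i in range(1, len(pages)):
            counts[frozenset(pages[:i])][pages[i]] += 1

    rules = {}
    for page_set, nexts in counts.items():
        total = sum(nexts.values())
        if total < min_support:
            continue  # too few observations to trust the estimate
        for page, n in nexts.items():
            rules[(page_set, page)] = n / total
    return rules


# Invented sessions: two reach the same page set in different orders.
toy_sessions = [
    ["/search.html", "/categories/software/include.html", "/software.html"],
    ["/categories/software/include.html", "/search.html", "/software.html"],
    ["/search.html", "/faq.html"],
]
rules = supported_set_rules(toy_sessions)
both = frozenset({"/search.html", "/categories/software/include.html"})
print(rules[(both, "/software.html")])  # 1.0
```

In this toy data the 100% rule is backed by two observations, while the set containing only “/categories/software/include.html” was seen once and is dropped.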
