5. TEXT IN WEB LOG MINING

5.3. Text for Results Presentation

5.3.1. Description of the Process

“The essence of Information Visualization is referred to the creation of an internal model or image in the mind of a user. Hence, information visualization is an activity that humankind is engaged in all the time” (Banissi 2000). The knowledge discovery steps described by Fayyad et al. (1996) do not involve information visualization as a part of the pattern discovery cycle. However, it is important to display the outcome in an understandable format (Hunphreys 1992; Chi 2002). This is the pattern analysis stage, in which the results of the mining process are presented in the form most readily understood by the end user or business analyst (Pabarskaite 2002).

Technical experts are comfortable with knowledge discovered from web pages being expressed in technical terms. For example, a discovered list of the five most common paths through a web site might be presented in the form of URLs. The end user might find this technical form baffling. A non-technical end user would likely find it more useful to see the most common path represented by page names, such as Homepage instead of index.html and TOPIC search page instead of /xxx.html.

Text tags in HTML links carry valuable information and can be used to replace unfriendly URLs. This applies to any application in which URLs are presented to the end user, which is often the case in web log mining. In the novel approach, all unfriendly links can be replaced by much more meaningful text tags.

Fig 5.2. Replacing engine replaces link tags with text tags

For example, information presented for a technical analyst might not be understandable to a business person. The technical name of the page index.html is commonly understood by web site developers, but a business person might more easily understand the same page as Homepage.

The implemented engine (see Fig 5.2) takes web pages from the cleaned records table called “Links” (see Chapter 3) and replaces them with text tags.

Table 5.4. Replacing the technical version of web pages with a user-friendly interpretation

Technical version of the web page               Textual interpretation

/content.cfm                                    Home
/ipacareers_html/factfile/listcategories.cfm    IPA recruiting agencies
/ipacareers_html/home_maillayer.cfm             Graduate recruitment (Flash)
/ipacareers/welcome.cfm                         Graduate recruitment (HTML)
/services/main.cfm                              MEMBERS ONLY
/members/map/searchoptions.cfm                  IPA agencies
/services/training/cpd/loghome.cfm              CPD ZONE

Table 5.4 shows an example of the applied methodology. The technical versions of the web pages were replaced by text tags extracted from the HTML code. This modification is much more attractive and makes the results easy to trace, because the text used for results presentation is the same text that users see on the links.
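The replacement step can be illustrated with a minimal Python sketch. It is not the implemented engine: the dictionary TEXT_TAGS and the functions friendly_name and friendly_path are hypothetical names introduced here, and the mapping is filled with three rows taken from Table 5.4. A URL without a known text tag is assumed to be left unchanged.

```python
# Hypothetical lookup table: cleaned URL -> text tag (rows from Table 5.4).
TEXT_TAGS = {
    "/content.cfm": "Home",
    "/services/main.cfm": "MEMBERS ONLY",
    "/members/map/searchoptions.cfm": "IPA agencies",
}


def friendly_name(url, text_tags=TEXT_TAGS):
    """Return the text tag for a URL, or the URL itself if none is known."""
    return text_tags.get(url, url)


def friendly_path(urls, text_tags=TEXT_TAGS):
    """Replace every URL of a discovered path with its text tag."""
    return [friendly_name(url, text_tags) for url in urls]


# A discovered path; the last URL has no text tag and is kept as it is.
path = ["/content.cfm", "/services/main.cfm", "/unknown.cfm"]
print(" -> ".join(friendly_path(path)))
```

Keeping the original URL as the fallback means that a path is never shortened or reordered by the replacement.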

5.3.2. Limitations

Replacing URLs with text tags is not a straightforward process, and there are numerous problematic situations. For example, there can be multiple text tags for a link, or no text tag at all. Some of these problematic areas are addressed in this section.

5.3.2.1. Multiple Text Tags

In certain cases, the same link can have different text tags on different pages. For example, the root page can be labelled “Home” in one place and “Home Page” in another. This raises the problem of which one to use. Several solutions exist. First, an arbitrary text tag can be used (any, a random one, or the first one found). Alternatively, the text tags can be combined.
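The selection strategies can be sketched in Python as follows. The function names are hypothetical. choose_first and combine correspond to the two solutions named in the text, while choose_most_common (taking the most frequent tag) is an additional assumption that is not part of the described methodology.

```python
from collections import Counter


def choose_first(tags):
    """Use the first text tag that was found for the link."""
    return tags[0]


def choose_most_common(tags):
    """Assumed extra strategy: use the most frequent text tag."""
    return Counter(tags).most_common(1)[0][0]


def combine(tags, sep=" / "):
    """Join all distinct text tags, keeping the order of first appearance."""
    return sep.join(dict.fromkeys(tags))


# Text tags collected for the root page from three different pages.
tags = ["Home", "Home Page", "Home"]
print(choose_first(tags))
print(choose_most_common(tags))
print(combine(tags))
```

Combining keeps all the information but produces longer labels, so the choice depends on how much room the results presentation has.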

5.3.2.2. Missing Text Tags

The reverse of the previous problem is that the text can be missing. This can be the case when the link text is an image, a special script, or just a meaningless phrase such as “click here”. In the image case a simple solution can exist. Quite often, images carry an alt attribute. This is alternative text for the image, which some browsers also display as a tool tip when the user holds the mouse over the image. This text can be used as a replacement for the link text tag. However, developers often do not use alt attributes or forget to set them, so they either do not exist or contain only the file name of the image.
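The alt fallback can be sketched with the HTMLParser class from the Python standard library. The class LinkTextParser, the set GENERIC of meaningless link texts, and the decision to fall back to the URL when nothing usable is found are assumptions made for this illustration. The sketch does not detect alt values that merely repeat the image file name.

```python
from html.parser import HTMLParser

# Assumed list of link texts that carry no meaning on their own.
GENERIC = {"click here", "here", "more"}


class LinkTextParser(HTMLParser):
    """Collect a text tag for every <a href=...> link in a page."""

    def __init__(self):
        super().__init__()
        self.links = {}    # href -> chosen text tag
        self._href = None  # href of the link currently being read
        self._text = []    # text fragments seen inside that link
        self._alt = None   # alt text of an image inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._href, self._text, self._alt = attrs["href"], [], None
        elif tag == "img" and self._href is not None:
            self._alt = attrs.get("alt") or self._alt

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            if not text or text.lower() in GENERIC:
                text = (self._alt or "").strip()  # fall back to alt text
            self.links[self._href] = text or self._href  # last resort: URL
            self._href = None


parser = LinkTextParser()
parser.feed('<a href="/content.cfm">Home</a>'
            '<a href="/services/main.cfm"><img src="m.gif" alt="MEMBERS ONLY"></a>'
            '<a href="/xxx.html">click here</a>')
parser.close()
print(parser.links)
```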

There is no good solution if the text tag does not exist and there is nothing to replace it with. In these cases the link is usually represented by the same unfriendly URL. There are some ideas for overcoming this problem in future work. One can use not only text tags but also neighbouring text. That is, where links are placed near meaningful text, such as “If you are interested in cars <click here>”, the text that comes before <click here> is selected. On the other hand, there is a risk of capturing unrelated or even harmful text and displaying it instead of the link.
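The neighbouring-text idea could look roughly like the following Python sketch. It is only an illustration of the proposal, not an implemented part of the engine: the class NeighbourTextParser and the set GENERIC are hypothetical, and the sketch simply takes the last piece of text that precedes a link, without any protection against the unrelated-text risk mentioned above.

```python
from html.parser import HTMLParser

# Assumed list of link texts that carry no meaning on their own.
GENERIC = {"click here", "here", "more"}


class NeighbourTextParser(HTMLParser):
    """Label links with the text that precedes them when needed."""

    def __init__(self):
        super().__init__()
        self.labels = {}   # href -> chosen label
        self._before = ""  # last text seen outside a link
        self._href = None  # href of the link currently being read
        self._text = []    # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)
        elif data.strip():
            self._before = " ".join(data.split())

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            if (not text or text.lower() in GENERIC) and self._before:
                text = self._before  # use the neighbouring text
            self.labels[self._href] = text or self._href
            self._href = None
            self._before = ""  # do not reuse the text for a later link


parser = NeighbourTextParser()
parser.feed('<p>If you are interested in cars '
            '<a href="/cars.html">click here</a></p>')
parser.close()
print(parser.labels)
```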

5.3.2.3. Non-HTML-Based Links

If web pages are highly scripted or use many ActiveX or Java applet components, it may be difficult to parse them and extract text tags. In the HTML case this is fairly straightforward; however, if a link is implemented in some other way, the parsing engine must be enhanced with support for the technologies that the targeted web site uses.

5.3.2.4. Speed

Computationally, replacing a page URL with a text tag is a fairly straightforward process. If it is performed in a database, the URL field must be indexed; matching is then performed very quickly, without a noticeable effect on performance. In a programmed implementation (e.g. in C++), the array of URL strings must be sorted, which allows a fast binary search routine to be used. As in the database case, the decrease in performance caused by changing URLs into text should be small and unnoticeable.
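The sorted-array variant can be sketched as follows. Python is used here instead of C++ purely for brevity; the lists urls and texts and the function lookup are hypothetical names, and the data are rows from Table 5.4. The bisect_left function from the standard library performs the binary search, so each lookup needs O(log n) string comparisons.

```python
from bisect import bisect_left

# (URL, text tag) pairs, sorted by URL so that binary search can be used.
pairs = sorted([
    ("/services/main.cfm", "MEMBERS ONLY"),
    ("/content.cfm", "Home"),
    ("/members/map/searchoptions.cfm", "IPA agencies"),
])
urls = [url for url, _ in pairs]
texts = [text for _, text in pairs]


def lookup(url):
    """Binary-search the sorted URL array; return the URL itself on a miss."""
    i = bisect_left(urls, url)
    if i < len(urls) and urls[i] == url:
        return texts[i]
    return url


print(lookup("/content.cfm"))
print(lookup("/missing.cfm"))
```

The array has to be sorted only once, after the text tags have been extracted, and not before every lookup.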

5.3.2.5. Multilanguage Pages

Whereas data mining tasks are language independent, language plays an important part when text is used in mining. Currently, no methodology has been created to deal with this problem. All text mining techniques are applicable to the English language (Tan 1999).