• No results found

2. A unique session id representing a search session. A session is the entire series of queries, one or more, submitted to the search engine by a user over some given time.

3. A unique query id is given to every distinct query submitted during a search session.

4. The query terms used by the search engine to display results.

5. The URL of the clicked result.

Next, a rule-based method for identifying the source orientation is presented.

5.5 Rule Based Method

In the rule-based method, a simple rule-based technique was applied to the clicked URLs of the log entries. This work was the first step towards identifying the “source relevance” of a given query Sushmita et al. [2009b]. In other words, given a query, the challenge was to identify the intended source, that is, the source from which a result would be required in order to satisfy the information need. In the following sub-sections, the methodology is described in Section 5.5.1, and the outcomes are presented in Section 5.5.3. Finally, the overall outcomes are summarized in Section 5.5.4.

5.5.1 Methodology

In web search, search queries can be broken down into at least two broad categories (footnote 5-2): navigational and informational, because informational and transactional queries have similar characteristics Lee et al. [2005]. It was hypothesized that search result aggregation is most useful for supporting informational queries (as also discussed in Chapter 3).

To focus on the informational queries in the dataset, the data was separated into two sets based on the number of clicks made within a single session. More specifically, two click sets were created: a single-click set, containing sessions with only one click, and a multiple-click set, containing sessions with more than one click. Although this is a very simple method of separating navigational queries from the others, a single click is one of the main properties of navigational queries Yuan et al. [2008]. As a result, 3,218,588 single-click sessions (27%) and 8,932,479 multiple-click sessions (73%) were obtained. The following analysis focuses mainly on the multiple-click dataset. Zero-click queries, that is, sessions in which no clicks were made in response to the submitted queries, were not included in the analysis.
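The session split described above can be sketched as follows. This is only a minimal illustration, not the original implementation: it assumes the log is available as (session id, clicked URL) pairs, that an empty URL marks a query with no click, and the function name and toy records are hypothetical.

```python
from collections import Counter


def split_sessions_by_clicks(click_log):
    """Partition session ids into single-click and multiple-click sets.

    click_log: iterable of (session_id, clicked_url) pairs (assumed layout).
    Entries with an empty or missing URL are treated as "no click" and
    skipped, so zero-click sessions end up in neither set.
    """
    clicks_per_session = Counter(
        session_id for session_id, url in click_log if url
    )
    single = {s for s, n in clicks_per_session.items() if n == 1}
    multiple = {s for s, n in clicks_per_session.items() if n > 1}
    return single, multiple


# Hypothetical toy log, for illustration only.
toy_log = [
    ("s1", "http://example.com/"),
    ("s2", "http://example.org/news/a"),
    ("s2", "http://example.org/images/b"),
    ("s3", "http://example.net/wiki/C"),
    ("s3", "http://example.net/video/d"),
    ("s3", "http://example.net/"),
    ("s4", ""),  # query submitted, nothing clicked
]

single_click, multiple_click = split_sessions_by_clicks(toy_log)
```

In this toy log, session s4 contributes to neither set, which mirrors the exclusion of zero-click queries from the analysis.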

5.5.2 URL analysis

The following set of pattern rules was used to identify the sources of click-through documents. For example, if a click-through URL contained the string movies, it was assumed that the main content of the clicked document was a movie.

Image: /img/ /images/ /image/ /pictures/ /picture/ /photo/ /photos/
Video: /vid/ /video/ /videos/ /movie/ /movies/
Wikipedia: /wiki/
News: /news/
Blogs: /blogs/ /blog/
Audio: /audio/ /audios/
Map: /map/ /maps/

Web + Others: URLs that did not match any of the above
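The pattern rules listed above can be expressed as a small lookup. The sketch below is illustrative only and rests on assumptions the text does not state: matching is done by case-insensitive substring search, and when a URL matches patterns from more than one source, the first source in the listed order wins. The function and constant names are hypothetical.

```python
# Pattern rules as listed in the text; the order is the listing order.
SOURCE_PATTERNS = [
    ("Image", ("/img/", "/images/", "/image/", "/pictures/",
               "/picture/", "/photo/", "/photos/")),
    ("Video", ("/vid/", "/video/", "/videos/", "/movie/", "/movies/")),
    ("Wikipedia", ("/wiki/",)),
    ("News", ("/news/",)),
    ("Blogs", ("/blogs/", "/blog/")),
    ("Audio", ("/audio/", "/audios/")),
    ("Map", ("/map/", "/maps/")),
]

FALLBACK_SOURCE = "Web + Others"


def classify_url(url):
    """Return the single source orientation assigned to a clicked URL."""
    lowered = url.lower()  # assumption: matching ignores case
    for source, patterns in SOURCE_PATTERNS:
        if any(pattern in lowered for pattern in patterns):
            return source  # assumption: first matching source wins
    return FALLBACK_SOURCE


video_label = classify_url("http://example.com/movies/some-title")
wiki_label = classify_url("http://example.org/wiki/Some_Page")
other_label = classify_url("http://example.com/about")
```

Returning exactly one label per URL reflects the single-orientation focus of the study; a variant that collects every matching source would be needed to examine overlapping orientations.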

While the patterns may not be exhaustive enough to identify all pages belonging to a source, they can be considered a reasonable approximation of the distribution of the different orientations in the dataset. It should also be noted that the overlap of orientations was not considered in either of these methods (rule-based and combination). Although a query may be associated with multiple orientations, this study focused on identifying a single orientation for a given query. The aim was first to build a suitable single-orientation classifier and to explore its possibilities and outcomes. In future research, these methods could be elaborated further to identify overlapping orientations.


FIGURE 5.2: Distribution of Source-Orientations.

5.5.3 Results

Once the URLs were classified using the rule-based technique, the number of clicked URLs in the multiple-click set that matched any of the sources (as shown in Section 5.5.2) was counted. There were 274,755 click-through URLs (3%) that matched one of the seven source orientations: image, video, blogs, audio, Wikipedia, news and map. The rest were classified as Web + Others. Figure 5.2 shows the distribution of the seven orientations based on the matched URLs. The results show that image was the most frequent orientation, followed by Wikipedia, news and video. Furthermore, the percentage of queries oriented to the audio source was found to be very low.
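The counting step can be illustrated with a short sketch. It assumes that each clicked URL has already been assigned one orientation label, and it reports two quantities: the share of clicks that matched any source, and the distribution among the matched clicks only. The function name and the toy labels are hypothetical and do not reproduce the reported figures.

```python
from collections import Counter


def orientation_distribution(labels, fallback="Web + Others"):
    """Summarize orientation labels assigned to clicked URLs.

    Returns (matched_share, within_matched), where matched_share is the
    fraction of labels other than the fallback, and within_matched maps
    each matched source to its fraction of the matched labels.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    matched = total - counts.get(fallback, 0)
    matched_share = matched / total if total else 0.0
    within_matched = {
        source: n / matched
        for source, n in counts.items()
        if source != fallback and matched
    }
    return matched_share, within_matched


# Hypothetical toy labels, for illustration only.
toy_labels = ["Image", "Image", "News"] + ["Web + Others"] * 5
share, distribution = orientation_distribution(toy_labels)
```

For the toy labels, three of the eight clicks match a source, and image accounts for two thirds of those matched clicks.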

Similar findings were also reported in a survey (footnote 5-3). The findings were based on users’ responses to a set of questions asked during the survey. The survey selected online consumers randomly from the NPD U.S. online consumer panel. A total of 2,404 individuals responded to the survey. Respondents received an email invitation to participate in the survey with an attached URL linked to a Web-based survey form. In the survey, a result presented after an image search was clicked by 26% of users, making image search the most frequently clicked "vertical search" category. The second most commonly clicked vertical search category was news search at 17%, followed by video search at just 10%. Since Wikipedia was not considered in this survey, the percentage of clicks for Wikipedia was not reported. The findings from the log analysis reported in this chapter are therefore consistent with the findings of the online survey, and hence the estimates can be assumed to be reasonable.