
2.2 INFORMATION SEEKING ON THE WEB

2.2.2 Information Searching: Explicit Behavior

Search log analysis refers to the use of data collected in a search log to address particular research questions concerning the interactions among the searchers, the search engine, or the Web content during searching episodes (Jansen, 2008). A search log is an electronic file kept on the server of a search engine that records the interactions between the search engine and its users during a searching episode (Jansen, 2008). Basic search log data include the user's IP address, the query terms submitted by the user, and the date and time of submission. Other types of data, such as the URL of the result page and the URL of the page viewed, may also be captured, depending on the file format supported by the server (Jansen, 2006).

Search log analysis is a quantitative method applicable to various Web-based search environments. Jansen et al. (1998) and Silverstein et al. (1998) were among the earliest to conduct search log studies, with data from Excite and AltaVista respectively, two general Web search engines. Search systems for special purposes, including bibliographic tools (Blecic et al., 1998; Wolfram & Xie, 2000; Bernstam et al., 2008), academic websites (Rozic-Hristovski et al., 2002; Wang et al., 2003; Wolfram et al., 2009), digital libraries (Jones et al., 2000), and so on, have also been researched with this method. Language-specific search systems, such as Chinese (Chau et al., 2007), Korean (Park et al., 2005), and Chilean (Baeza-Yates & Castillo, 2001) search engines, have attracted researchers’ attention too. While many search log studies are based on relatively small sample sizes or short time lengths, or both, Jansen & Spink (2006) compared 9 search engines from the U.S. and Europe over a period of six years, presenting the most comprehensive breadth and depth of analysis in the literature.

Guiding these studies is an established systematic search log analysis framework which consists of three levels: term, query, and session (Jansen, 2008). Most search log studies have made analysis at one or more of these levels.

2.2.2.1 Term

The term is the basic unit of analysis. Measures that can be examined at this level include term occurrence, total number of terms and unique terms, high usage terms, and term co-occurrence (Jansen, 2008). Of these measures, term co-occurrence is the most useful. Ross and Wolfram (2000) analyzed the search subject content of Excite and categorized more than 1000 of the most frequently co-occurring term pairs into one or more of 30 developed subject areas. Their cluster analyses resulted in several well-defined high-level clusters of broad subject areas. Huang et al. (2003) proposed a log-based term extraction and suggestion approach to interactive Web search. The approach could provide organized and highly relevant terms that co-occur in similar searches, and could exploit contextual information to make more effective suggestions.
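The term-level measures listed above can be illustrated with a minimal Python sketch. It is not the procedure used in any of the cited studies: the function name `term_statistics`, the whitespace tokenization, the case folding, and the sample queries are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def term_statistics(queries):
    """Count term occurrences and unordered term-pair co-occurrences.

    Assumptions for illustration: terms are whitespace-delimited and
    compared case-insensitively.
    """
    term_counts = Counter()
    pair_counts = Counter()
    for query in queries:
        terms = query.lower().split()
        term_counts.update(terms)
        # Count each distinct pair once per query, in sorted order so that
        # ("a", "b") and ("b", "a") are treated as the same pair.
        unique_terms = sorted(set(terms))
        pair_counts.update(combinations(unique_terms, 2))
    return term_counts, pair_counts


# Hypothetical sample queries, not taken from any real search log.
sample = ["cheap flights", "cheap hotel", "flights cheap paris"]
term_counts, pair_counts = term_statistics(sample)

total_terms = sum(term_counts.values())   # total number of terms
unique_terms = len(term_counts)           # number of unique terms
print(total_terms, unique_terms)          # 7 4
print(pair_counts.most_common(1))         # [(('cheap', 'flights'), 2)]
```

Here `term_counts.most_common(k)` would give the high usage terms, and `pair_counts` would be the raw input for a co-occurrence analysis of the kind described above.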

2.2.2.2 Query

A query consists of one or more terms. Web users prefer short queries. The reported average query length never exceeded 2.5 terms, for both English and non-English search engines (Jansen et al., 1998; Beitzel et al., 2004; Baeza-Yates & Castillo, 2001; Park et al., 2005).


One-term queries are very common. Their percentages in the 9 search engines studied by Jansen and Spink (2006) ranged from 20% to 35%, and the percentage in Wang et al. (2003) reached 38%. If the first query fails to return satisfactory results, a user will submit subsequent queries, which are usually different from the initial one (Jansen, 2006). Rieh and Xu (2001) found that while most query reformulations involve content changes, about 15% of the reformulations relate to format modifications. Web users also tend to avoid complex queries that contain Boolean operators (Jansen & Spink, 2006; Beitzel et al., 2004).
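The query-level measures discussed above (average query length, the share of one-term queries, and the share of queries with Boolean operators) can be computed along the following lines. This is a rough sketch under stated assumptions, not the method of the cited studies: the function name `query_summary` is hypothetical, terms are taken to be whitespace-delimited, and only the uppercase tokens AND, OR, and NOT are treated as Boolean operators.

```python
def query_summary(queries):
    """Return mean query length, share of one-term queries, and share of
    queries containing a Boolean operator.

    Assumptions for illustration: terms are whitespace-delimited, and
    Boolean operators are the uppercase tokens AND, OR, and NOT.
    """
    boolean_operators = {"AND", "OR", "NOT"}
    tokenized = [q.split() for q in queries if q.strip()]
    if not tokenized:
        raise ValueError("no non-empty queries to summarize")
    n = len(tokenized)
    mean_length = sum(len(terms) for terms in tokenized) / n
    one_term_share = sum(1 for terms in tokenized if len(terms) == 1) / n
    boolean_share = sum(
        1 for terms in tokenized if boolean_operators.intersection(terms)
    ) / n
    return mean_length, one_term_share, boolean_share


# Hypothetical sample queries, not taken from any real search log.
sample = ["weather", "cheap flights", "new york times", "maps"]
mean_length, one_term_share, boolean_share = query_summary(sample)
print(mean_length, one_term_share, boolean_share)  # 1.75 0.5 0.0
```

Real search engines differ in operator syntax, so the operator set would need to be adapted to the system whose log is being analyzed.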

2.2.2.3 Session

It is not easy to define a session because the boundaries of a single search session are not marked in search logs. One approach to session detection automatically groups a user’s consecutive queries on the same search topic into one session (He et al., 2002). But its performance would be limited if users submit few queries and search on multiple topics (Özmutlu & Çavdur, 2005). The other approach exploits the temporal characteristics of the queries. Two temporally adjacent queries submitted by a user belong to the same session only if the interval between their submissions is less than a cutoff value. The cutoff values vary from study to study, typically between 5 and 30 minutes (Huang et al., 2001; Göker & He, 2002; Spink & Jansen, 2004; Baeza-Yates et al., 2005).
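The temporal approach can be sketched in a few lines of Python. This is an illustration only: the function name `split_sessions`, the record layout `(user_id, timestamp, query)`, the use of an IP address as the user identifier, and the 30-minute default cutoff are assumptions chosen for the example, not details of any cited study.

```python
from datetime import datetime, timedelta


def split_sessions(records, cutoff=timedelta(minutes=30)):
    """Group (user_id, timestamp, query) records into sessions.

    Two temporally adjacent queries from the same user fall into the same
    session only if the interval between them is less than `cutoff`.
    """
    sessions = []
    last_seen = {}  # user_id -> (timestamp of last query, current session)
    for user, ts, query in sorted(records, key=lambda r: (r[0], r[1])):
        previous = last_seen.get(user)
        if previous is not None and ts - previous[0] < cutoff:
            session = previous[1]
        else:
            # First query of this user, or the gap reached the cutoff.
            session = []
            sessions.append(session)
        session.append((user, ts, query))
        last_seen[user] = (ts, session)
    return sessions


# Hypothetical log records, not taken from any real search log.
records = [
    ("10.0.0.1", datetime(2008, 1, 1, 9, 0), "jaguar"),
    ("10.0.0.1", datetime(2008, 1, 1, 9, 4), "jaguar car"),
    ("10.0.0.1", datetime(2008, 1, 1, 11, 0), "weather"),
    ("10.0.0.2", datetime(2008, 1, 1, 9, 1), "maps"),
]
sessions = split_sessions(records)
print([len(s) for s in sessions])  # [2, 1, 1]
```

The number of records in each session gives the number of queries it contains. The choice of cutoff directly changes how many sessions are detected, which is one reason results from studies using different cutoff values are hard to compare.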

Session length, i.e. the number of queries contained, and session duration, i.e. the total time the user spent interacting with the search system, are the two basic attributes of a session. Jansen et al. (2007) noted that on Dogpile.com the mean session length was fewer than 3 queries and the


click-through analysis, which examines how users view the documents on result pages returned by the search engines. Based on the click-through data from AlltheWeb.com, Jansen and Spink (2003) found that more than 55% of all users viewed only one document per query, and more than 66% of them viewed fewer than 5 documents in a given session. This echoes the previous finding that 85% of the time only the top 10 results were viewed (Silverstein et al., 1999). Web users tend to evaluate their search results with minimal effort, just as they do in constructing queries.