assignments have to be averaged. In practice, the formulas needed to compute the statistics from the source-orientation and click distributions were derived formally, so the numbers reported are the probabilistic expectations of the different statistics.
The process for generating a random log is shown in Figure 6.3. The random log has the same size as the original log, that is, the same number of sessions. For each session, a random number of clicks is drawn, and then each click is assigned a random source-orientation (one of the seven source-orientations). This is repeated for all the sessions in the random log. Ten random logs are generated using this process, and finally their assignments are averaged to obtain a single random log. In the rest of this chapter, statistics computed for both the real and the random logs are reported. Going back to the example of the 96% of single source-orientation sessions in Table 6.1, it can be seen (Table 6.1, column “Random”) that 93% of the sessions would be expected to be single source-orientation in the random log. This sets a baseline: if the real number were below this value (even with a percentage as high as 90%), it would be possible to say that users tend to combine more than one source-orientation within a session; instead, the observed statistic is higher (96%), which means that sessions are indeed generally single source-oriented.
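The two-step generation process described above can be sketched as a small simulation. This is only an illustrative sketch under stated assumptions: the click-count and source-orientation distributions below are hypothetical placeholders (not values from the log), the function names are invented for this example, and the statistic is averaged over the ten simulated logs, whereas the chapter reports formally derived expectations.

```python
import random

# The seven source-orientations studied in the chapter.
ORIENTATIONS = ["news", "map", "image", "video", "wikipedia", "blog", "web"]


def generate_random_log(n_sessions, click_count_dist, orientation_dist, rng):
    """Build one random log: a list of sessions, each a list of orientations."""
    click_counts = list(click_count_dist)
    click_weights = [click_count_dist[c] for c in click_counts]
    orientation_weights = [orientation_dist[o] for o in ORIENTATIONS]
    log = []
    for _ in range(n_sessions):
        # Step 1: draw a random number of clicks for the session.
        n_clicks = rng.choices(click_counts, weights=click_weights, k=1)[0]
        # Step 2: assign a random source-orientation to each click.
        session = rng.choices(ORIENTATIONS, weights=orientation_weights, k=n_clicks)
        log.append(session)
    return log


def single_orientation_share(log):
    """Percentage of sessions whose clicks all share one source-orientation."""
    single = sum(1 for session in log if len(set(session)) == 1)
    return 100.0 * single / len(log)


def averaged_statistic(statistic, n_logs, n_sessions,
                       click_count_dist, orientation_dist, seed=0):
    """Average a statistic over several independently generated random logs."""
    rng = random.Random(seed)
    values = [
        statistic(generate_random_log(n_sessions, click_count_dist,
                                      orientation_dist, rng))
        for _ in range(n_logs)
    ]
    return sum(values) / n_logs


# Hypothetical distributions, for illustration only (not taken from the log).
CLICK_COUNT_DIST = {1: 0.70, 2: 0.20, 3: 0.07, 4: 0.03}
ORIENTATION_DIST = {"news": 0.03, "map": 0.02, "image": 0.02, "video": 0.02,
                    "wikipedia": 0.01, "blog": 0.005, "web": 0.895}

share = averaged_statistic(single_orientation_share, n_logs=10, n_sessions=1000,
                           click_count_dist=CLICK_COUNT_DIST,
                           orientation_dist=ORIENTATION_DIST)
print(f"Single source-orientation sessions in the random logs: {share:.1f}%")
```

Comparing the value printed here with the same statistic computed on a real log is the kind of baseline comparison made in the “Random” column of Table 6.1.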
6.6 Combination of Source-Orientations
Here the third research question [R3] is investigated, that is, the existence of frequent combinations of source-orientations in search sessions. To this end, how often source-orientations co-occur within the same search session is computed. This was done on 3,957,888 sessions (the number of sessions with at most twenty clicks). Only sessions of up to twenty clicks were considered because the random log was generated with sessions of up to twenty clicks. This limit bounds the size of sessions and keeps session sizes comparable between the random log and the real log. Since generating a random log required two steps (first generating a session with a random number of clicks, then assigning a random source-orientation to each click), it was necessary to restrict the session size in order to have a fair comparison between the two logs. It should be noted that limiting the session size to a maximum of twenty clicks did not discard a large part of the log data: when analyzed, very few sessions contained over twenty clicks (less than 200).
First, Table 6.1 reports the percentage of sessions that contain one, two, three, ..., up to seven different source-orientations. This was done for the original log data and for the randomly generated log data as well.

TABLE 6.1: Percentage of sessions with one, two, three, four, five, six and seven different source-orientations. Note that the number of distinct orientations does not correspond to the number of clicks.

Distinct orientations in the session    Log (%)    Random (%)
Seven                                     0.000         0.000
Six                                       0.000         0.000
Five                                      0.000         0.002
Four                                      0.005         0.034
Three                                     0.121         0.470
Two                                       3.385         6.354
One                                      96.489        93.140

From the table, it can be observed that there are very few sessions with more than two different source-orientations. This is in accordance with the study reported in Arguello et al. [2009b] where, through manual assessments, only 30% of the queries had more than one source-orientation (it should be noted that, in that study, verticals were assigned to the queries, some of which correspond to the domains and genres studied here).
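The gap between the “Log” and “Random” columns of Table 6.1 can be made concrete by dividing one by the other. The short sketch below does this for the rows with two, three and four distinct source-orientations; the percentages are copied from Table 6.1, while the variable and function names are illustrative only.

```python
# (log %, random %) pairs copied from Table 6.1.
TABLE_6_1 = {
    "Two": (3.385, 6.354),
    "Three": (0.121, 0.470),
    "Four": (0.005, 0.034),
}


def random_to_log_ratio(log_pct, random_pct):
    """How many times more frequent a row is in the random log than in the real log."""
    return random_pct / log_pct


for label, (log_pct, random_pct) in TABLE_6_1.items():
    ratio = random_to_log_ratio(log_pct, random_pct)
    print(f"{label}: random log is {ratio:.1f} times higher than the real log")
```

The ratio grows with the number of distinct source-orientations, from roughly 2 for two orientations to roughly 7 for four.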
The difference between the original and the random log statistics shows that the association of a session with a low number of source-orientations is not due to chance. Moreover, the difference increases with the number of source-orientations (the random log is about 2 times higher with two intents and about 7 times higher with four intents), which shows that when users have diverse orientations, these are generally restricted to at most two. Therefore, in the rest of the study, the analyses are restricted to sessions having two source-orientations.

Returning to the research question [R3], which source-orientations co-occur frequently is now investigated. The percentage of sessions in which at least two source-orientations appeared was computed, and the values are reported in Table 6.2. For instance, in this table, the value 0.01% for “nb” means that there are very few sessions with both a news and a blog orientation. Given that the different source-orientations have distinct probabilities of occurrence, it is also interesting to look at the conditional probability that a source-orientation is present (third and fourth series of columns). For instance, in the row labeled “bb”, it can be seen that if the first observed orientation in the session is “blog”, then there is a 0.189 probability that a second “blog” orientation is observed in the same session. The results are further analyzed in the rest of this section.

TABLE 6.2: Pairs of intents. Column % reports the percentage of sessions that had the corresponding pair of intents. Column % among 1st (respectively 2nd) reports the percentage of sessions with the first intent of the pair (respectively second) that also had the second (respectively first). L stands for log, and R for random. For the intents, we use n = News, m = Map, i = Image, v = Video, w (lowercase) = Wikipedia, b = Blog and W = Web.

Combination       %           % among 1st     % among 2nd
                  L      R        L      R        L      R
bm             0.00   0.02      0.5    4.6      0.1    0.5
nb             0.01   0.03      0.3    0.5      2.6    6.2
vm             0.01    0.2      0.3    4.5      0.4    4.4
bw             0.02   0.01        4    2.8      0.7    0.5
ib             0.02   0.02      0.8    0.5      4.1    4.3
im             0.02   0.19      0.9    4.5      1.1    4.1
in             0.04   0.26      1.5    6.2      1.1    4.2
nm             0.04   0.28        1    4.5      1.8    6.2
nw             0.04   0.18      1.1    2.8      1.6    6.2
vb             0.04   0.02      1.4    0.5      7.8    4.5
wm             0.04   0.13      1.5    4.5      1.9    2.8
vn             0.05   0.27      1.7    6.2      1.3    4.4
iw             0.09   0.12      3.3    2.8      3.4    4.1
iv             0.10   0.18      3.6    4.4      3.5    4.2
vw             0.06   0.12      2.1    2.8      2.2    4.4

bb             0.09   0.00     18.9    0.2     18.9    0.2
ww             0.50   0.04     18.6    1.5     18.6    1.5
mm             1.11    0.1     51.8    2.3     51.8    2.3
vv             1.17    0.1     43.1    2.2     43.1    2.2
nn             1.33    0.2     35.5    3.2     35.5    3.2
ii             1.40   0.09     52.5    2.1     52.5    2.1

bW             0.43   0.51     88.6   97.1      0.4    0.5
mW             1.35   4.42     62.7   97.3      1.4    4.4
iW             1.85   4.09     69.2   97.3      1.9    4.1
vW             1.99   4.31     73.1   97.3      2.1    4.3
wW             2.48   2.77     92.9   97.2      2.6    2.8
nW             2.87    6.1     76.5   97.4        3    6.1
WW            91.77  91.41     94.9   91.7     94.9   91.7

The most important observation from the results is that most users do not mix sources. Indeed, the first and last series of rows of the table show that users are less likely to combine two different sources in the same session than would be expected at random (around 3 times less likely on average).
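The three quantities reported for each pair in Table 6.2 can be sketched as follows. This is one possible reading of the table caption, not the procedure actually used in the study: it assumes that a pair of identical orientations (such as “bb”) requires at least two clicks with that orientation, that pairs are unordered, and that each session is a list of per-click orientation labels. All names and the toy sessions are hypothetical.

```python
def contains_pair(session, first, second):
    """True if the session contains both orientations of the pair.

    For a repeated pair such as ("blog", "blog"), at least two clicks with
    that orientation are required (an assumption about the table's definition).
    """
    if first == second:
        return session.count(first) >= 2
    return first in session and second in session


def pair_statistics(sessions, first, second):
    """Return the three percentages reported per pair in Table 6.2."""
    total = len(sessions)
    with_pair = sum(1 for s in sessions if contains_pair(s, first, second))
    with_first = sum(1 for s in sessions if first in s)
    with_second = sum(1 for s in sessions if second in s)

    def pct(part, whole):
        # Guard against an orientation that never occurs in the log.
        return 100.0 * part / whole if whole else 0.0

    return {
        "pct": pct(with_pair, total),                   # column "%"
        "pct_among_first": pct(with_pair, with_first),  # column "% among 1st"
        "pct_among_second": pct(with_pair, with_second),  # column "% among 2nd"
    }


# Toy sessions, for illustration only.
toy_sessions = [
    ["web", "web"],
    ["image", "web"],
    ["image", "image"],
    ["web"],
    ["blog"],
]

print(pair_statistics(toy_sessions, "image", "web"))
print(pair_statistics(toy_sessions, "image", "image"))
```

Applying the same function to the real log and to a random log would yield the L and R values of a row, whose ratio indicates how far the observed co-occurrence departs from chance.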
Moreover, looking at the second series of rows, it is on average around ten times more likely that users repeat a click on the same source (rows “bb” to “ii”) than would be expected at random. In sessions made of two or more clicks, when one orientation is map, video, image or news, there is a chance above 35% of observing a second click with the same orientation (third and fourth group of columns). For blog and