7.4 TellMyRelevance! From User Interactions to Relevance Models
7.4.5 Initial Evaluation
Between December 2012 and April 2013, we collected ∼23 GB of anonymous user interaction data on two large German hotel booking web portals. We have used a total of 29,483 randomly chosen search sessions and 53,069 query–result pairs from these data for an initial evaluation of the correlations between mouse features and result relevance. The underlying assumption for this was that a completed conversion (a completed hotel booking process) is a very strong indicator for the relevance of a search result—in this case a hotel presented to the user—which stands in contrast to earlier case studies. Therefore, we considered a
¹⁰ http://weka.sourceforge.net/doc.dev/weka/classifiers/bayes/BayesNet.html (Feb. 15, 2015).
¹¹ http://weka.sourceforge.net/doc.dev/weka/classifiers/bayes/NaiveBayes.html (Feb. 16, 2015).
108 Chapter 7 From TMR to Turtle: Predicting Result Relevance from Mouse Cursor
Interactions in Web Search
Tab. 7.1.: Correlations between mouse features and conversions for the individual and the combined datasets.
Pearson’s r              DS1     DS2     DS3     comb.
avg. hover time (a)      0.24    0.22    0.24    0.23
arrival time (r)         0.17    0.14    0.15    0.15
clicks (c)               0.07    0.08    0.07    0.08
clickthroughs (l)        0.44    0.35    0.45    0.41
hovers (h)               0.17    0.16    0.17    0.17
unclicked hovers (u)    -0.44   -0.35   -0.45   -0.41
max. hover time (m)      0.25    0.23    0.25    0.24
cursor trail (t)         0.14    0.13    0.16    0.14
cursor move time (o)     0.23    0.22    0.23    0.22
cursor speed (s)         0.16    0.16    0.17    0.16
position (p)            -0.06   -0.07   -0.08   -0.07
combined                 0.47ᵃ   0.39ᵇ   0.47ᶜ   0.44ᵈ

ᵃ y = 0.01a + 0.03r + 0.07c + 0.34l + 0.03h + 0.03m + 0.01o + 0.02s − 0.04
ᵇ y = 0.02r + 0.08c + 0.24l + 0.03h + 0.04m + 0.02o + 0.03s − 0.04
ᶜ y = 0.02r + 0.06c + 0.34l + 0.04h + 0.03m + 0.02o + 0.03s − 0.06
ᵈ y = 0.02r + 0.07c + 0.30l + 0.03h + 0.03m + 0.01o + 0.03s − 0.05
conversion to be a positive implicit relevance judgment for the booked hotel regarding the search query that led the user to the landing page.¹² For the analysis, we have divided the query–result pairs together with their relevances (i.e., normalized conversions) into three disjoint data sets, each corresponding to 20 days of collected data. The data sets are denoted DS1, DS2 and DS3.¹³
Correlating Mouse Features with Relevance. Correlating the 11 mouse features with conversions (see Table 7.1) yields very consistent results across all three data sets as well as their combination. That is, although the data sets are disjoint, the average user behavior with respect to whether a conversion is triggered changes only very slightly between them. The most expressive feature is the number of clickthroughs (r=0.41 on the combined data), which is contrary to Huang, White, and Dumais (2011), where hover rate is most expressive (r=0.46). Moreover, we found a positive correlation for maximum hover time (r=0.24 on the combined data), while Huang, White, and Dumais (2011) describe a low negative correlation (r=-0.15) for the majority of search sessions.
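To make the reported statistic concrete, the following Python sketch computes Pearson's r between one mouse feature and the normalized conversions of a set of query–result pairs. It is an illustration only and not part of TMR: the function name `pearson_r` and the toy values are hypothetical, since the raw tracking data cannot be published.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: one entry per query-result pair.
clickthroughs = [0, 1, 0, 2, 1, 0]               # feature "l"
conversions = [0.0, 0.5, 0.0, 1.0, 0.0, 0.0]     # normalized conversions
r = pearson_r(clickthroughs, conversions)
print(f"r = {r:.2f}")
```

Applying the same function to each of the 11 features in turn would yield one column of a table such as Table 7.1.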
Position shows effectively no correlation with conversions, which we did not expect since position bias suggests that higher-ranked results are considered more relevant by users (Lagun and Agichtein, 2011). However, our finding is in line with Q. Guo and Agichtein (2012), who also describe a correlation of only r=-0.07 between a result’s position and its relevance. To summarize, the correlations indicate that a) clickthroughs are the strongest single indicator for relevance, b) there are slight differences between “traditional” SERPs as investigated by, e.g., Huang, White, and Dumais (2011) and the more specific setting
¹² In the remainder of this thesis, we are going to use “conversion(s)” and “relevance” synonymously. That is, more conversions mean higher relevance and vice versa.
¹³ Much to our regret, we are not allowed to provide either the complete raw tracking data or specific information about it. Particularly, we cannot provide information about the concrete ratio of search sessions to conversions due to the fact that our data contains critical information concerning the co-operating company’s business model.
[Figure: bar chart comparing TMR, DBN and TMRclick on DS1, DS2 and DS3; value axis from 0 to 0.6]
Fig. 7.5.: Comparison of different approaches for predicting relevance.
of travel search, and c) particularly position is useless as a single indicator for predicting a conversion, although position bias suggests otherwise.
TMR vs. Dynamic Bayesian Network Click Model. Additionally, to compare TMR’s predictions to predictions by an existing state-of-the-art approach for estimating search result relevance, we have reimplemented the generative Dynamic Bayesian Network Click Model (DBN) by Chapelle and Zhang, which considers clicks only, with γ=1 (Chapelle and Zhang, 2009, Algorithm 1). According to Huang, White, Buscher, et al. (2012), DBN has proven its good performance and “is the most cited searcher model since the Cascade Model (which compared favorably to all models before it)”. As an additional point of reference, we implemented a variation of TMR reduced to considering clickthroughs only, which we denote TMRclick. All of the compared approaches were given the same information and amount of data during analysis.
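For readers unfamiliar with the click model, the following Python sketch shows the counting estimate usually associated with the simplified DBN (γ=1). It rests on several assumptions and is not the reimplementation used in this evaluation: all results up to the last clicked position are treated as examined, sessions without clicks are assumed to have examined every result, and no smoothing priors are applied. The function name and the session format are hypothetical.

```python
from collections import defaultdict


def simplified_dbn_relevance(sessions):
    """Estimate relevance per (query, result) as attractiveness * satisfaction.

    sessions: iterable of (query, results, clicked_positions), where results
    is the ranked list of result ids and clicked_positions is a set of
    0-based ranks that were clicked in that session.
    """
    clicks = defaultdict(int)       # clicks on a result
    examined = defaultdict(int)     # impressions assumed to be examined
    last_clicks = defaultdict(int)  # times a result was the last click

    for query, results, clicked in sessions:
        # Assumption: without any click, all results count as examined.
        last = max(clicked) if clicked else len(results) - 1
        for pos in range(last + 1):
            key = (query, results[pos])
            examined[key] += 1
            if pos in clicked:
                clicks[key] += 1
                if pos == last:
                    last_clicks[key] += 1

    relevance = {}
    for key, shown in examined.items():
        attractiveness = clicks[key] / shown
        satisfaction = last_clicks[key] / clicks[key] if clicks[key] else 0.0
        relevance[key] = attractiveness * satisfaction
    return relevance


# Hypothetical toy sessions for a single query.
toy_sessions = [
    ("hotel berlin", ["a", "b", "c"], {1}),
    ("hotel berlin", ["a", "b", "c"], {0, 1}),
    ("hotel berlin", ["a", "b", "c"], set()),
]
toy_relevance = simplified_dbn_relevance(toy_sessions)
```

In this toy example, result "b" is clicked in two of its three examined impressions and is the last click both times, so it receives the highest estimate.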
For comparison, we build on the Matthews Correlation Coefficient, abbreviated “MCC” (Baldi et al., 2000, also cf. Section 8.3.1). Analysis of the confusion matrices of the three approaches applied to each of our three data sets (Figure 7.5) shows that, in terms of prediction quality, TMR tends to outperform DBN, which reaches MCC values of only 0.30, 0.32 and 0.38 for DS1, DS2 and DS3, respectively. DBN performs similarly to TMRclick, which reaches values of 0.38, 0.29 and 0.36, albeit with slight advantages for DBN on DS2 and DS3. These results indicate that our data-driven model enriched with additional information about user interactions can yield better predictions than a model relying on clickthroughs only.
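As a reminder of how the MCC is derived from a binary confusion matrix, a minimal Python sketch follows. The function name and the example counts are hypothetical and do not correspond to the confusion matrices of this evaluation. Returning 0 when a row or column of the matrix is empty is a common convention that is assumed here.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from the four confusion matrix cells."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: undefined cases are reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts: 6 true positives, 2 false positives,
# 1 false negative, 3 true negatives.
example = mcc(tp=6, fp=2, fn=1, tn=3)
print(f"MCC = {example:.2f}")
```

The coefficient ranges from −1 (total disagreement) over 0 (no better than chance) to 1 (perfect prediction), which makes it suitable for data sets in which conversions are much rarer than non-conversions.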
This is in line with previous work by, e.g., Huang (2011), who states that “adding additional independent data provides greater improvements than smarter algorithms”. Our initial findings show that the information gained from user interactions other than clickthroughs (although the latter still show the highest correlation with relevance) and the use of data-driven approaches (if suitable data is available in sufficient amounts) hold great potential for improving web search.
Additional results of this initial evaluation can be found in Appendix D.1. The training data and serialized real-world models for reproducing our results based on WEKA are available via our online appendix.¹⁴
7.4.6 Discussion
This section presented TMR, which is a new automatic end-to-end pipeline for collecting user interaction data and relevance judgments on SERPs and learning ready-to-use relevance models from these.
Yet, a major shortcoming of our pipeline is that it is batch-oriented. That is, raw tracking data have to be fetched from the key-value store at predefined intervals, and it is currently not possible to learn incremental classifiers. Instead, we need to completely reprocess all mouse features and relevance judgments whenever we want to update an existing model. Assume we want to obtain an up-to-date model once a day. Since the amount of data to be (re-)processed grows with every update, at some point it would take longer than 24 hours to process all data required for the update, unless the system is scaled accordingly, e.g., by adding more or faster hardware. Thus, for the sake of feasibility, it is necessary to have a solution that processes data once and only once: a streaming-based pipeline that works on a per-search-session basis and incrementally learns a model that is automatically fed back into the ranking process of the corresponding search engine. Based on the limitations just described, it becomes evident that TMR does not meet all three requirements derived from our assessment of the industry context of this thesis (Section 7.3).
✔ (R7.1) TMR does leverage interactions beyond clicks to infer search result relevance from user interaction data.
✘ (R7.2) TMR has not been designed for stream processing. That is, the data must be pulled by the system in predefined intervals, e.g., once a day.
✘ (R7.3) Currently, TMR does not leverage the advantages of incremental models. That is, all previously processed mouse feature values and relevance judgments need to be processed once again for an updated model.
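To illustrate what an incremental model in the sense of R7.3 could look like, the following Python sketch updates a Gaussian Naive Bayes classifier one observation at a time using running means and variances (Welford's method), so that no earlier data has to be reprocessed. This is an assumption-laden illustration, not TMR's or WEKA's implementation: the class name, the two-feature toy data and the variance floor are all hypothetical.

```python
import math
from collections import defaultdict


class IncrementalGaussianNB:
    """Gaussian Naive Bayes that learns from one observation at a time."""

    def __init__(self, n_features, var_floor=1e-6):
        self.var_floor = var_floor  # avoids division by zero variance
        self.count = defaultdict(int)
        self.mean = defaultdict(lambda: [0.0] * n_features)
        self.m2 = defaultdict(lambda: [0.0] * n_features)

    def update(self, features, label):
        """Fold a single (features, label) observation into the model."""
        self.count[label] += 1
        n = self.count[label]
        mean, m2 = self.mean[label], self.m2[label]
        for i, x in enumerate(features):
            delta = x - mean[i]
            mean[i] += delta / n
            m2[i] += delta * (x - mean[i])  # running sum of squared deviations

    def predict(self, features):
        """Return the label with the highest log posterior."""
        total = sum(self.count.values())
        best_label, best_score = None, -math.inf
        for label, n in self.count.items():
            score = math.log(n / total)  # log prior
            for i, x in enumerate(features):
                var = self.m2[label][i] / n + self.var_floor
                diff = x - self.mean[label][i]
                score += -0.5 * math.log(2 * math.pi * var) - diff * diff / (2 * var)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical stream of (max. hover time, clickthroughs) with a binary label.
stream = [
    ([3.0, 1.0], 1), ([0.5, 0.0], 0), ([2.5, 2.0], 1),
    ([1.0, 0.0], 0), ([3.5, 1.0], 1), ([0.2, 0.0], 0),
]
model = IncrementalGaussianNB(n_features=2)
for features, label in stream:
    model.update(features, label)  # each observation is processed exactly once
```

Because every update touches only per-class counters, the cost of bringing the model up to date depends on the number of new observations rather than on the size of the full history.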
In summary, TMR is not yet a system that is feasible for production use by companies in a real-world setting. Therefore, at this point we introduce an additional primary hypothesis to be investigated in the remainder of this thesis:
Hypothesis 7.1 There exists a solution for optimizing the relevance of search results based on user interactions that yields higher efficiency, robustness and scalability than competing state-of-the-art approaches.
Still, TMR is a working prototype highlighting the potential and advantages of our new interaction-based approach to relevance prediction. Moreover, Röder et al. (2013) have
¹⁴ https://github.com/maxspeicher/tellmyrelevance-resources/tree/phdthesis (Mar. 02, 2016).
Fig. 7.6.: The main components and process flow of SMR (Streams are visualized by sequences of chevrons; Storm topologies are annotated using a “T”).
made use of an adjusted version of TMR’s interaction tracking facilities in the context of their research. Based on the collected user behavior, they analyzed how quality raters assess the coherence of word sets. Their results suggest that ratings vary with respect to the number of displayed word sets while rating efficiency remains constant. This shows that our tracking facilities are also feasible and effective in settings different from relevance prediction.
In the following, we are going to build on TMR to develop a more elaborate system that also meets the remaining requirements R7.2 and R7.3.