Supporting interactive summarization for explainable exploratory search

Jing Li

Helsinki June 5, 2020

UNIVERSITY OF HELSINKI Department of Computer Science

Faculty of Science, Master’s Programme in Computer Science

Author: Jing Li

Title: Supporting interactive summarization for explainable exploratory search

Supervisors: Dorota Glowacka and Alan Medlar

Level: Master’s thesis. Month and year: June 5, 2020. Number of pages: 43 pages + 4 appendices

Keywords: layout, summary, list of references

Thesis for the Networking and Services study track

Exploratory search is characterised by user uncertainty with respect to search domain and information seeking goals. This uncertainty can negatively impact users’ abilities to assess the quality of search results, causing them to scroll through more documents than necessary and struggle to give consistent relevance feedback. As users’ information needs are assumed to be highly dynamic and expected to evolve over time, successful searches can be indistinguishable from those that have drifted erroneously away from their original search intent. Indeed, given their lack of domain knowledge, searchers may be slow, or even unable, to recognise when search results have become skewed towards another topic.

With these issues in mind, we designed and implemented an interactive search system that integrates a keyword summarization algorithm, Exploratory Search Captions (ESC), to support users in exploratory search. This thesis investigates the usefulness of ESC in terms of user experience and user behaviour, and explores the impact of design decisions on user satisfaction.

We evaluated the ESC system with a user study in the context of exploratory search of scientific literature in Computer Science. According to the user study results, participants almost unanimously preferred the retrieval system that incorporated ESC, and the presence of captions dramatically impacted user behaviour: users issued more queries and investigated fewer documents per query, but saw more documents overall. We demonstrated the usefulness of ESC, the improved usability of the ESC system, and the positive impact of our design decisions.

ACM Computing Classification System (CCS):

Information systems → Information retrieval → Information retrieval query processing → Query suggestion

Information systems → Information retrieval → Users and interactive retrieval → Search interfaces

Human-centered computing → Human computer interaction (HCI) → HCI design and evaluation methods → User studies

5.2 Design
5.3 Tasks
5.3.1 Task Design
5.3.2 Familiarity Questionnaire
5.4 Measurements
5.5 Procedure
6 Results
6.1 User Perception
6.1.1 User rating
6.1.2 Qualitative feedback
6.2 Usability analysis
6.2.1 SUS Analysis
6.2.2 ResQue Analysis
6.3 User behaviour
6.3.1 Task performance
6.3.2 Interactions
7 Conclusion and future work
References

Appendices

1 Introduction

Search can be broadly divided into two categories: lookup and exploratory [1]. Lookup search is where users have precise search goals, such as question-answering and fact-finding. These search tasks are well-understood in terms of user behaviour and information retrieval (IR) system design. Exploratory search, on the other hand, is characterised by user uncertainty with respect to search domain and information seeking goals. Users performing exploratory search are trying to learn about a specific topic or otherwise acquire knowledge [1]. As such, exploratory search tasks tend to have open-ended search goals, where it can be difficult for users to initiate searches and assess the relevance of search results. Exploratory search is, therefore, considered to be challenging [2] and requires IR systems to provide additional support to help users articulate their information needs and understand the search results that are presented to them.

In exploratory search, document relevance is highly subjective. Users engaged in knowledge acquisition tasks will, by definition, lack the domain knowledge necessary to assess the importance of search results. In this context, users’ search goals are ill-defined because there are many potential paths to obtain information and no objective criteria for when the task is complete. User behaviour studies have shown that users performing exploratory search behave differently from those performing lookup search: they scroll through more documents [3] and provide less consistent relevance feedback [4]. This can be frustrating for users, either because of time wasted inspecting lower ranked search results or because the accumulation of low quality feedback leads to query drift [5], degrading the quality of all search results. Unfortunately, it is unclear how to assess whether these behaviours are counterproductive: even during successful exploratory search, users’ information needs are assumed to be highly dynamic and expected to evolve over time. IR systems should provide users with additional information summarising the current search session, helping users to assess whether search results are on-topic and to avoid unintentional digressions.

In typical exploratory search scenarios, users enter a search query and receive a list of results produced by the relevance ranking algorithms of the search engine. To gain knowledge in a search domain they are not familiar with, users browse through the result list, reading and learning over iterative search sessions. However, the whole search-and-learn process can be slow, leading to reduced user engagement. Traditional studies of IR systems primarily concentrate on developing new algorithms to improve predictive accuracy and recommendations, while recent research has emphasised the noticeable effect of visualisation and user interaction on boosting the performance of IR systems by improving user engagement in the search process [6].

To address the above-mentioned problems and to support users in investigating, exploring and acquiring new knowledge in exploratory search, in this thesis we propose and design an interactive visualisation system with keyword summarising support. We focus on the integration of an existing search engine with Exploratory Search Captions (ESC), a keyword summarization algorithm for the search engine results page that provides users with succinct, keyword-based descriptions of search results. In the context of exploratory search, ESC helps users to identify whether they have deviated from their intended search and can serve as inspiration for follow-on search queries. We want to investigate whether and to what extent ESC supports users’ search activity, and the impact of design decisions from the user’s perspective. This thesis aims to answer the following research questions (RQ):

• RQ1: Does ESC improve user experience during exploratory search?

• RQ2: Are there any behavioural differences between users of ESC and the baseline system?

• RQ3: What is the impact of design decisions on user satisfaction?

To investigate these research questions, our method is to review previous related studies, find the best way to design the integration, and evaluate the ESC system in practical scenarios through a user study. User data is collected and combined with survey results, such as usability surveys, for further investigation. The user study was conducted in a laboratory setting that situated participants in an essay writing scenario based on topics they were unfamiliar with. In this thesis, we focus on academic literature from Computer Science, but the approach can be extended to other scenarios. The rest of this thesis is structured as follows:

• Section 2 introduces the basics of information retrieval and exploratory search, the necessary background knowledge, and related work in the field of exploratory search interfaces and behaviours.

• Section 3 details the mechanisms within ESC as a summarization algorithm, including the autoencoder model used by ESC, model training, data augmentation, caption generation and ensembling.

• Section 4 describes the system architecture and interface design. It introduces the original baseline system, an existing search engine developed by the research group, and presents the design decisions we made for the caption system integrating ESC, the experimental approach we adopted, and the indicators of information search behaviour.

• Section 5 describes the evaluation of our caption system and the performance of the ESC algorithm. We compare the baseline system with the caption system and document every step of the evaluation, from design and measurements to execution.

• Section 6 analyses and presents the data collected in the user study, covering the usability of the system as well as user perception and behaviour.

• Section 7 discusses the user study results and answers the three research questions of this thesis. It also briefly summarises the contributions of the thesis and outlines future work.

2 Background and related work

In this section, we introduce background literature, ranging from information retrieval to the usability questionnaires adopted for evaluation. We also identify related work on summarising and interactive interfaces, and studies of exploratory search behaviour.

2.1 Information retrieval

Information retrieval (IR) has been characterised in a variety of ways, from descriptions of its goals to relatively abstract models of its elements and processes. Though some of these characterisations disagree with one another, they have some commonalities. Generally, an IR system is regarded as performing the role of “leading the user to those documents that will best enable him/her to satisfy his/her need for information” [7]. More commonly, information retrieval refers to how information is found, and the goal is identified as “for the user to obtain information from the knowledge resource which helps her/him in problem management” [8].

Figure 1: A general model of information retrieval based on Belkin 1992 [9]

Figure 1 presents a model that illustrates these goals of IR. It is a general model, developed by Belkin [9], showing the elemental objects and processes in IR. When a user with some goals or tasks finds that the goals cannot be achieved because of insufficient knowledge or resources, the user is prompted to engage actively in information seeking behaviour, for example by submitting queries to an IR system. The issued query, which should be formulated in a language that the IR system can comprehend, is a representation of the user’s information need. This is presented on the right side of the figure. The other side of the figure concerns the information resources that the user ultimately reaches: from the producers of texts to the organisation of text representations into collections of text surrogates. The two sides of the model then merge in the comparison of the query and the text surrogates, which results in potentially relevant retrieved texts. The retrieved texts are then used or evaluated and passed to the modification phase, such as modification of the query. The process of query modification by means of user evaluation is referred to as relevance feedback in information retrieval.

In IR, both phrase and Boolean queries have a long history. Early Boolean IR systems permitted users to define their information needs using a set of Boolean operators, such as AND, OR and NOT. But Boolean systems have various drawbacks, e.g., it is hard for searchers to formulate good search queries. Additionally, the use of phrases in text representation has been studied since the early research into information retrieval. For example, Salton (1968) [10] described a variety of experiments with the SMART system, which uses phrases. The experimental results obtained with phrases, however, were very mixed, varying from small improvements to reductions in effectiveness. They do not support the hypothesis that proper use of phrases would improve the performance of text representation.

However, a large number of regular IR system users rely on ranked retrieval. IR systems rank retrieved documents by giving each document a numeric score that estimates its usefulness for the user’s query [11]. Various types of retrieval models have been proposed and studied for this procedure in IR. The three most frequently applied models are:

• Vector space model. Text is represented by a vector of terms, which are typically words and phrases.

• Probabilistic model. Because true probabilities are not accessible to an IR system, probabilistic IR models estimate the probability that a document is relevant to an issued query.

• Inference network model. Retrieval of documents is modelled as an inference process.
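As an illustration of the first of these models, the following is a minimal sketch of ranked retrieval in a vector space. It assumes whitespace tokenisation, a smoothed TF-IDF term weighting and cosine similarity as the scoring function; these choices, the function names (`tf_idf_vectors`, `cosine`, `rank`) and the example documents are hypothetical and are not taken from any system discussed in this thesis.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lower-cased whitespace tokens are a good enough stand-in for terms.
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (one common variant, chosen for illustration).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    # Query terms that never occur in the collection carry no weight.
    q_vec = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    scores = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


docs = [
    "exploratory search interfaces",
    "neural network training",
    "search engine ranking",
]
ranking = rank("exploratory search", docs)
```

In this sketch, the document sharing both query terms is ranked first, the document sharing one term second, and the document sharing none receives a score of zero.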

Nevertheless, there are occasions when users lack the knowledge or contextual understanding to formulate queries in a “correct” way that can be well interpreted by the system. For instance, how can a user with little knowledge of Beethoven or Berlioz find classical music that suits their personal interest? The answer is that we normally submit a provisional query and start from that point, investigating the returned information, selectively looking for and passively obtaining cues about what the next steps are. Research on supporting these sorts of situations is known as “exploratory search” [12]. Since exploratory search often requires additional support from IR systems [13], this thesis aims to support users in the exploratory search process in IR systems.

2.2 Exploratory search

Marchionini (2006) [1] divided search activities into two broad categories: lookup and exploratory search. He depicted exploratory search and lookup search activities as overlapping clouds, considering that users may engage in multiple search types at one time. Exploratory search tasks emerge from circumstances where searchers “lack the knowledge or contextual awareness to formulate queries or navigate complex information spaces,” and the search tasks are naturally ambiguous, vague and dynamic. For instance, we may need to discover something in a domain where we have a general interest but no specific knowledge [12]. In 2009, exploratory search was further defined as an information-seeking problem context that is open-ended, persistent, and multifaceted, with processes that are opportunistic, iterative, and multi-tactical [2]. An exploratory task is motivated by the searcher’s desire to broaden his or her knowledge of a topic, i.e. to foster learning or investigation [14]. Its information needs are vaguely structured and can be difficult to phrase, and it often includes multiple aspects or a number of concepts [14].

Figure 2 illustrates the search activities defined by Marchionini, where exploratory search is mainly divided into learn and investigate activities, each containing several sub-activities. Whereas lookup tasks return discrete, well-formatted objects that are considered answers to fact-finding tasks or questions, exploratory search tasks do not have specific or well-structured answers. Searching to learn is a typical exploratory search activity, as more and more learning material becomes available online. In our study, we chose the learning search scenario.

More frequently, exploratory search activities are discussed with reference to their essential attributes. In a review of past research, a series of task attributes associated with exploratory search are distinguished [15]:

• Exploratory search tasks are general. For example, task descriptions can be very vague and ill-defined.

Figure 2: Search Activities according to Marchionini 2006 [1]

at several objects or documents. For instance, an earlier study by Marchionini [16] noted open-ended information tasks and questions, depicted as questions “for which there was no specific answer”.

• Exploratory search tasks contain uncertainty as well as motivation derived from ambiguously defined problems.

• Exploratory search tasks have a dynamic information need, dynamic process and they often develop with time. The changes may come from the evolving need and understanding of searchers as their knowledge broadens during the search sessions.

• Exploratory search tasks are multifaceted and the procedure may be complicated. Depicting a search task as multifaceted implies that it “incorporates various viewpoints or various ideas”.

• Exploratory search tasks commonly come with context information or cognitive behaviours, such as decision making, organizing and inspecting search results and, most frequently, sensemaking.

2.3 Relevance feedback

Relevance feedback asks information seekers to make relevance judgments about search results and then executes a revised query based on those judgments [1]. In relevance feedback, users do not describe in detail what their information needs are; instead, they simply mark relevant objects. Typically, users mark the returned objects or documents that they consider relevant, and the system then gives improved recommendations, such as potential query terms [17]. Relevance feedback appears to be a natural approach to support searchers in reformulating their queries in a heuristic manner. It has been shown that relevance feedback can be especially effective for content-based image retrieval scenarios [18], and for improving retrieval accuracy and performance [19][17].

In spite of these benefits, however, studies have shown users’ reluctance to provide explicit feedback. For example, in an exploratory search experiment on relevance feedback [20], responses indicated that although the searchers understood how to use relevance feedback, they tended to find it not very useful during the search sessions. Moreover, marking relevance feedback all the time can be cognitively demanding, but not giving feedback may lead to imprecise judgements by the system. For example, users may stop giving feedback when they are happy with the search results and switch to browsing behaviour, but some systems then interpret this to mean that nothing is relevant. User perception and experience of relevance feedback can also be affected by the framing of the relevance feedback mechanism, e.g. users prefer giving positive relevance feedback as opposed to negative relevance feedback [21].
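The idea of executing a revised query based on relevance judgments can be sketched with the classic Rocchio update, which is used here purely as a well-known illustration and is not claimed to be the method of any system cited above. The sketch assumes that queries and documents are sparse term-weight dictionaries; the function name `rocchio` and the weights `alpha`, `beta` and `gamma` are illustrative values, not settings from this thesis.

```python
from collections import defaultdict


def rocchio(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Revise a query vector from explicit relevance judgments (Rocchio update).

    query_vec: dict term -> weight for the original query.
    relevant / non_relevant: lists of document vectors (dict term -> weight)
    that the user marked as relevant or not relevant.
    """
    revised = defaultdict(float)
    # Keep the original query, scaled by alpha.
    for term, weight in query_vec.items():
        revised[term] += alpha * weight
    # Move towards the centroid of the documents marked relevant.
    for doc in relevant:
        for term, weight in doc.items():
            revised[term] += beta * weight / len(relevant)
    # Move away from the centroid of the documents marked non-relevant.
    for doc in non_relevant:
        for term, weight in doc.items():
            revised[term] -= gamma * weight / len(non_relevant)
    # Terms with non-positive weight are dropped from the revised query.
    return {term: weight for term, weight in revised.items() if weight > 0.0}


revised_query = rocchio(
    {"search": 1.0},
    relevant=[{"search": 1.0, "exploratory": 1.0}],
    non_relevant=[{"lookup": 1.0}],
)
```

In this example the revised query gains the term "exploratory" from the relevant document, while "lookup" is pushed below zero and removed. If the user gives no feedback at all, the function returns the original query scaled by `alpha`, which mirrors the point above that silence should not be read as a negative judgment.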

2.4 Types of systems for exploratory search

Over the last two decades, there have been many specialised interfaces created to support different aspects of exploratory search. These include summary visualisation to enable faster relevance feedback [22, 23], automatic adaptation based on search task [24], interactive visualizations that allow users to direct their search [25, 26], interactive exploration of faceted data sets [27] and interfaces to support query formulation [13].

2.4.1 Faceted search

Faceted search provides an iterative way of refining the search results by facets [28]. Faceted search systems help searchers find what they are looking for, not only by keywords related to their information need, but also by metadata specified through the searchers’ query refinement. Faceted search uses organised metadata to give an overview of the results and to incorporate interactive categories into result lists. This helps users limit the result range and explore the results without reformulating their queries [29]. A facet is defined as “a set of meaningful labels organized in such a way as to reflect the concepts relevant to a domain” [30]. Faceted search structures information and uses filters to help users investigate a set of returned items [31]. For example, online accommodation platforms, such as Airbnb, allow users to narrow down their selection by adding constraints on facets such as the price, district and type of accommodation. On the basis of a faceted taxonomy, faceted search enables users to choose facets and facet terms in an iterative manner, which aims to narrow down the search results and to prevent contextual problems or pitfalls through the exploitation of global characteristics or features [31].

When the query is too broad or the number of results is excessively large, exploratory search may not perform well, since it is hard to decide the relevance ranking of the numerous results. The faceted method groups the results into a wide range of categories according to metadata, such as keywords. As a result, users can concentrate on the results relevant to their needs and save the time otherwise spent browsing irrelevant ones, which also eases the information overload problem [28].
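The iterative narrowing described above can be sketched as follows. The example is a hypothetical illustration only: the listings, facet names and the functions `refine` and `facet_counts` are invented for this sketch and do not correspond to any real platform's data or API.

```python
from collections import Counter

# Hypothetical accommodation listings with three facets each.
LISTINGS = [
    {"name": "A", "district": "Kallio", "type": "apartment", "price": "low"},
    {"name": "B", "district": "Kallio", "type": "room", "price": "low"},
    {"name": "C", "district": "Kamppi", "type": "apartment", "price": "high"},
    {"name": "D", "district": "Kamppi", "type": "apartment", "price": "low"},
]


def refine(items, selected):
    """Keep only the items that match every selected facet value."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(items, facets):
    """Count how many of the given items fall under each value of each facet.

    These counts are what a faceted interface typically shows next to each
    facet term, so the user can see the effect of a filter before applying it.
    """
    return {facet: Counter(item[facet] for item in items if facet in item) for facet in facets}


# First refinement step: the user picks a type, then adds a price constraint.
step_one = refine(LISTINGS, {"type": "apartment"})
step_two = refine(step_one, {"price": "low"})
overview = facet_counts(step_one, ["district", "price"])
```

Each call to `refine` narrows the previous result set rather than issuing a new query, which is the sense in which users explore the results without reformulating their queries.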

2.4.2 Summary exploratory search interfaces

During an exploratory search session, the list of search results is sometimes too long for users to browse easily, or the ranking of the search results seems less reliable. To support users in such scenarios, Mika (2005) [32] developed a search result categorisation algorithm and a filtering interface: the documents are processed into several categories, and a list of categories is displayed in the left sidebar of the result page. During a search session, users can filter the results by clicking on a category. The researchers conducted a longitudinal study with 16 participants who used the prototype system for two months, and analysed their interaction data. The results confirmed the usefulness of result categorisation in real settings, especially when the ranking of search results fails, that is, when users feel that the ranking does not contain useful search results and therefore fail to fulfil their information needs. However, the factors affecting the usefulness remain complicated, as the users’ behaviour was not controlled.

While Mika aimed to improve the result evaluation phase, Kules et al. (2008) [33] investigated the effectiveness of result visualization for users, mainly in the discovery phase and during the search process. They present a study of existing information systems as examples that have been shown to effectively improve the user search experience.

Matejka et al. (2012) [22] introduced the Citeology system, which provides summary visualisation in a combined view of a heatmap and a line chart to enable faster relevance feedback. It examines the relationships between publications from different years through their citations. However, the information in the system is generally presented textually and there is no way to look at multiple generations of ancestors or descendants or to get a view of the overall connection network of the corpus [22]. Since the heatmap visualization focuses on the big picture, the lack of zooming capability in the applet and the simple text-based formatted data reduce its practicality.

Although these summarising interfaces aid users to varying extents and in different respects, they fail to deliver the summarised information in an intuitive way, so inexperienced searchers can hardly make good use of these professional interfaces. Our goal is to design an intuitive and effective interface that is friendly for searchers at all levels, from novice to skilled.

2.4.3 Interactive recommendation interfaces

In most recommender systems, users have little knowledge of how the recommender engines work. The resulting distrust of search results, together with the uncertainty inherent in exploratory tasks, adversely affects the usability of search systems. To engage users in the search process more positively and place them in continuous control, researchers have devised highly interactive user interfaces, and many studies have sought to improve user engagement, controllability, transparency, etc. The main advantage of interactive visualization is that a multi-dimensional representation allows the user to more easily see multiple aspects of data while being in control when exploring information.

interactive network visualisations. Combining these with machine learning algorithms, Chau et al. (2011) [34] developed a system named Apolo that engages the user in bottom-up sensemaking to gradually build up an understanding over time by starting small. They argue that just finding and filtering or even reading items is not sufficient, so they endeavour to support multi-group sensemaking of network data. Moreover, users can incrementally and interactively build up a personalized visualization of the data in the system. Though both the quantitative and qualitative results of the usability evaluation are positive, the system does not take into account the evolving needs of users over time, such as changes to the groupings created by users through merging or subdivision. Zhao et al. (2013) [27] designed an interactive interface in the form of a node-link graph, focusing on faceted data sets. It allows users to discover the implicit and explicit relations within large-scale data and also manages their queries in a multi-focus way.

Interactive recommendation interfaces that allow users to direct their search are another important branch of research in the area of interactive visualizations [25, 35]. Users can directly manipulate document features (keywords) to indicate their interests, and reinforcement learning is used for user modelling by allowing the system to trade off between exploration and exploitation [25]. Verbert et al. (2013) [6] proposed an approach combining interactive visualization with a traditional ranked result list to improve users’ trust in the results offered by black-box recommender engines. The approach considers personalized relevance and social relevance and enables users to explore interrelated bookmarks from other users using a cluster map visualization.
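The trade-off between exploration and exploitation mentioned above can be illustrated with a minimal epsilon-greedy sketch. This is a generic textbook strategy chosen for brevity; it is an assumption of this example and not the algorithm used in [25]. The class name `KeywordBandit`, the reward values and the keywords are hypothetical.

```python
import random


class KeywordBandit:
    """Epsilon-greedy choice of which keyword to show to the user next."""

    def __init__(self, keywords, epsilon=0.1, seed=0):
        self.epsilon = epsilon            # probability of exploring
        self.rng = random.Random(seed)    # seeded for reproducible behaviour
        self.counts = {k: 0 for k in keywords}
        self.means = {k: 0.0 for k in keywords}

    def select(self):
        """Pick a keyword: explore at random, otherwise exploit the best estimate."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(sorted(self.counts))
        # Ties are broken alphabetically because max() keeps the first maximum.
        return max(sorted(self.means), key=lambda k: self.means[k])

    def update(self, keyword, reward):
        """Fold one piece of user feedback (e.g. 1.0 = relevant) into the running mean."""
        self.counts[keyword] += 1
        n = self.counts[keyword]
        self.means[keyword] += (reward - self.means[keyword]) / n


bandit = KeywordBandit(["clustering", "ranking", "visualisation"], epsilon=0.0)
bandit.update("ranking", 1.0)          # the user marked "ranking" as relevant
greedy_choice = bandit.select()
bandit.update("ranking", 0.0)          # later feedback lowers the estimate
```

With `epsilon` set to zero the system only exploits what it already believes about the user; raising it makes the system occasionally show keywords with little or no feedback, which is how such a model can discover interests the user has not yet expressed.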

In addition to the above approaches, Medlar et al. (2016) [13] presented a system that supports query formulation in exploratory search for scientific literature. It applies an interactive alluvial diagram showing what topics there are in different years and how they are related to one another in the phase of discovery and query formulation.

These types of interfaces can greatly aid users in conducting exploratory search. However, they fail to account for the evolving nature of user knowledge in exploratory search tasks and the associated drift from the initial search query. Our proposed method improves the transparency of the search session for the user by summarising the content of the documents linked on the screen.

2.5 Studies of exploratory search behaviour

Understanding user behaviour in exploratory search can also inform interface design and search results presentation. There are a number of models of user behaviour that specifically address these issues. For example, Information Foraging Theory [36] is a model of exploration that predicts information seekers’ decisions from the expectation of information gain. This model was recently applied to predict the subjective information needs of users conducting exploratory search [37].

Hassan et al. (2014) [38] proposed a model that aims to disambiguate search exploration from struggling with search. Their analysis reveals detectable behavioural differences between struggling and exploring, and they use these insights to develop machine-learned models capable of accurately distinguishing between the two situations [38]. As user factors affect user behaviour and search results in exploratory search, Jiaxin et al. (2018) [39] studied users’ domain expertise, investigating and comparing the effects of domain knowledge level across three different domains. Their work relates domain expertise level to users’ interactions and search outcomes.

In general, previous studies of exploratory search behaviour point to its dynamic nature with narrowing and broadening of search queries throughout a search session as well as changes in click behaviour and dwell time as search progresses [40]. Our proposed method will further facilitate the capture of gradual changes in users’ search intents throughout a search session, which could lead to further improvements in user modelling and search results presentation.

2.6 Usability questionnaire frameworks

Usability is not a quality that exists in any real or absolute sense. Perhaps it is best summarised as a general quality of the appropriateness of a specific artefact to its purpose [41]. It can only be measured in a certain context, hence there is no absolute metric for usability. To measure usability is to test the usability of the artefact in the context where the artefact is used. This, in turn, means that measures of usability depend on the context, or are even defined by it. In spite of this, there is a demand for broad generic measures which can be applied to reflect the usability of an artefact over a wide range of contexts and settings. A few generic aspects of usability are proposed by ISO 9241-11, which states that the measures should include:

• effectiveness

• efficiency

• satisfaction

Although it is feasible to discuss usability in a general sense using these classes, it is often hard to compare two or more systems that have different purposes and work in different ways [42]. The exact measures applied within each of these classes can differ significantly. For instance, metrics of effectiveness are, without doubt, strongly related to the sorts of tasks that are carried out with the system; a metric of effectiveness of a word processing system may be the number of letters written, and whether the written letters are free of misspellings.

This section introduces two highly cited usability assessment frameworks.

2.6.1 SUS

The System Usability Scale, commonly referred to as SUS, was developed in 1996 by John Brooke [41] as a “quick and dirty” questionnaire scale to meet the need for a general measure. SUS is a straightforward Likert scale with 10 items. It enables a fast and effective assessment of people’s perception of the usability of a computer system which they have worked with [42].

To select the items for a Likert scale, the technique used in SUS development is to identify and select extreme expressions as item statements. Unlike general expressions, on which respondents largely agree, extreme statements arouse strong agreement or disagreement in respondents towards the artefact. For example, if the subject is to investigate people’s attitudes to crime, respondents would have a clearer answer when they are asked to respond to statements such as “things like this” than to general statements such as “It is not good” [41].

The construction of SUS started with a pool of 50 collected items, from which the items prompting the most extreme responses were chosen. The 10 chosen statements cover multiple system facets, for example the need for support, integration, and complexity. The SUS score ranges from 0 to 100. It offers a simple score that can be readily understood by the broad range of people who are ordinarily engaged in the development of artefacts such as products and services, from project administrators to IT experts [43].

Studies have confirmed that SUS is an important assessment tool, being rigorous and dependable. It met a pressing need in the area of usability for a tool that could efficiently gather users’ subjective responses about an artefact’s usability [43]. We therefore chose SUS to evaluate the usability of our proposed system.
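To make the 0 to 100 range concrete, the following sketch computes a SUS score using the standard scoring rule described by Brooke [41]: odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5. The function name `sus_score` and the input format (a list of ten responses on a 1 to 5 scale, in questionnaire order) are assumptions of this example.

```python
def sus_score(responses):
    """Compute the SUS score (0-100) from ten Likert responses on a 1-5 scale.

    responses[0] is the answer to item 1, responses[9] the answer to item 10.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")

    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            # Odd-numbered items are positively worded: higher agreement is better.
            total += response - 1
        else:
            # Even-numbered items are negatively worded: lower agreement is better.
            total += 5 - response
    # Each item contributes 0-4, so the sum is 0-40; scale it to 0-100.
    return total * 2.5


best_case = sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1])
neutral_case = sus_score([3] * 10)
```

Because the positively and negatively worded items alternate, a respondent who agrees with every statement does not obtain a high score; the alternation is part of what makes the single number interpretable.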

2.6.2 ResQue

A recommender system actively suggests items that match the interests or needs of users, based on their behaviour or expressed preferences. As recommender technologies become widely adopted, it is increasingly important to be able to characterise user experience and users’ subjective attitudes towards the technology. In response to this need, Recommender systems’ Quality of user experience (ResQue) is a unifying evaluation framework for assessing the quality of the suggested items in a recommender system, as well as its usability, usefulness, interface and interaction qualities [44].

As an evaluation questionnaire framework, ResQue comprises a collection of constructs. Its development starts by defining the construct domain and then proceeds to generating and phrasing sample questions. The potential question items are divided into four construct levels, covering four aspects:

• Users’ perception of system qualities, for example the quality of suggested items, the quality of interaction in handling user preferences, the quality of the interface in presenting and explaining results, and the adequacy of the information provided.

• Users’ beliefs about these qualities, comprising perceived usability, control and transparency [45].

• Users’ subjective attitudes, including confidence, trust and satisfaction.

• Users’ behavioural intentions, including their agreement, willingness to purchase, intention to return and tendency to introduce the system to others.

The full model consists of 15 constructs and 43 questions. ResQue can be used as a complete or a shortened evaluation survey, with 43 and 14 questions respectively, to help practitioners and researchers investigate the adoption of recommender systems. The model can be used to evaluate various kinds of recommender frameworks [45]. In our experimental design, we selected 12 questions from the question database as a modified ResQue questionnaire for evaluating our proposed approach.

3 Exploratory Search Caption

In this section, we introduce the autoencoder model used by ESC for representation learning of ranked search results. We describe how the model is trained, including how data augmentation is performed, and how it is used for caption generation and ensembling.

ESC generates search captions using a combination of semantic and lexical information. Semantic information comes from a sequence-to-sequence autoencoder, which we used to learn a distributed representation for ranked documents. In this distributed representation, semantically similar sets of ranked documents will be proximate to one another in vector space. By encoding the current search engine results page of documents into vector space, ESC can find nearby examples of ranked documents that are associated with known queries. This approach is analogous to finding semantically-related terms in word embeddings [46]. In situations where it is difficult to derive semantic information, such as when there is no coherent theme present in the search results, ESC falls back to a simpler, lexical method. ESC automatically detects when the autoencoder makes weaker, less coherent predictions using a logistic regression model, which is used to determine the proportion of captions to be presented from each method.

3.1 Autoencoder model

Autoencoders are unsupervised learning methods composed of two neural networks: an encoder network and a decoder network [47]. The encoder network transforms unlabelled training examples into a vector representation and the decoder network reconstructs the original input from this learned representation. The autoencoder learns a function, $h_{W,b}(x) \approx x$, of the weights, $W$, and biases, $b$, of the neural network such that the mean squared error is minimized between the input data, $x$, and its reconstruction, $x'$, in the output layer: $L(x, x') = \sum (x - h_{W,b}(x))^2$.

Figure 3: Sequence-to-sequence autoencoder model. The output of the encoder network (left) is the input to the decoder network (right). The learned representation (blue) is used to find semantically similar search results pages.

The sequence-to-sequence autoencoder (Figure 3) consists of an encoder LSTM and a decoder LSTM. In our case, we want to create a learned representation of ranked search results, where each search result is modelled as a time-step in the LSTM network.

3.2 Data and model training

In order to generate the ranked search results to train the autoencoder, we used:

• A corpus of documents from a compiled collection of ∼170,000 Computer Science articles from the arXiv preprint server (www.arxiv.org, downloaded November 2018). Each document was represented by concatenating the article title and abstract.

• A method to generate queries and a retrieval algorithm. We preprocessed the text of each document and detected common phrases by extracting bigrams, which were allowed to skip over stop words (making “eye of the beholder” a valid bigram).
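The idea of bigrams that skip over stop words can be illustrated with a short sketch. It is not the preprocessing pipeline used for ESC: the stop word list, the tokenisation rule and the `min_count` threshold are all assumptions made for illustration, and the function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; the real list is not given).
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}


def skip_bigrams(text):
    """Yield (phrase, content_pair) for each pair of consecutive content words.

    Stop words lying between the two content words stay in the surface
    phrase but are ignored when deciding which words are adjacent.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    content_positions = [i for i, tok in enumerate(tokens) if tok not in STOP_WORDS]
    for left, right in zip(content_positions, content_positions[1:]):
        phrase = " ".join(tokens[left:right + 1])
        yield phrase, (tokens[left], tokens[right])


def common_phrases(documents, min_count=2):
    """Return phrases occurring at least `min_count` times across documents."""
    counts = Counter()
    for doc in documents:
        counts.update(phrase for phrase, _ in skip_bigrams(doc))
    return [phrase for phrase, n in counts.most_common() if n >= min_count]
```

With this rule, “eye of the beholder” is treated as the bigram (“eye”, “beholder”) while its surface form keeps the stop words, so it can later be used verbatim as a query phrase.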

As the number of query phrases was relatively limited for training the autoencoder, we performed data augmentation to increase the diversity of the training set. We used 95% of the data for training and held out the remaining 5% for validation to guard against overfitting. We trained our autoencoder model using Adam [48] with a learning rate of 0.001 and a vector length of 300 (i.e. the length of the learned representation in Figure 3).

3.3 Caption generation and ensembling

The autoencoder forms the basis of a nearest-neighbour approach to caption generation. The autoencoder projects ranked search results into a vector space containing the learned representations of search results associated with known queries; the queries of the nearest neighbours are used as candidate captions. We display the top 10 captions that remain unique after stemming, to avoid redundancies such as “neural network” and “neural networks”.
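The nearest-neighbour lookup with stem-based deduplication described above can be sketched as follows. This is an illustration under stated assumptions, not the ESC implementation: cosine similarity is assumed as the proximity measure, `naive_stem` is a crude stand-in for a real stemmer, and `caption_db` is a hypothetical mapping from known queries to their learned vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def naive_stem(phrase):
    """Crude stand-in for a stemmer: strip a trailing 's' from longer words."""
    return " ".join(w[:-1] if w.endswith("s") and len(w) > 3 else w
                    for w in phrase.lower().split())


def nearest_captions(page_vector, caption_db, k=10):
    """Return up to k known queries closest to page_vector, unique after stemming."""
    ranked = sorted(caption_db,
                    key=lambda q: cosine(page_vector, caption_db[q]),
                    reverse=True)
    captions, seen = [], set()
    for query in ranked:
        stem = naive_stem(query)
        if stem in seen:
            continue  # a closer neighbour with the same stem was already kept
        seen.add(stem)
        captions.append(query)
        if len(captions) == k:
            break
    return captions
```

Deduplicating after ranking, rather than before, keeps whichever surface form lies closest to the current results page.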

When presented with search results not found in the training set, the autoencoder outputs novel vectors whose positions in vector space reflect the contents of those search results. For example, if presented with documents related to “information retrieval” and “neural networks”, then these terms should be among the top ranking captions. Indeed, summing the vectors from our caption database associated with the queries “information retrieval” and “neural networks” achieves exactly this: the nearest neighbours include “neural network”, “information retrieval”, “neural”, “information retrieval systems” and “feedforward network”. Ideally, when the autoencoder is presented with a mixture of documents from two separate queries, the resulting vector representation should approximate vector arithmetic in the same manner. The autoencoder’s precision varies as a function of the average positive pointwise mutual information (PPMI) between all pairs of captions. Performance is worse when average caption PPMI is lower, making caption PPMI a predictor of caption quality. When ESC is predicted to have lower performance, we can use lexical information to improve results.
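As an illustration of the coherence signal, the sketch below computes PPMI, i.e. pointwise mutual information clipped at zero, and averages it over all caption pairs. How the probabilities were estimated for ESC is not stated here; estimating them from document-level co-occurrence, using a base-2 logarithm and representing each document as a set of terms are assumptions.

```python
import math
from itertools import combinations


def ppmi(term_a, term_b, documents):
    """PPMI of two terms, with probabilities estimated as document frequencies.

    `documents` is a list of sets of terms.
    """
    n = len(documents)
    count_a = sum(term_a in doc for doc in documents)
    count_b = sum(term_b in doc for doc in documents)
    count_ab = sum(term_a in doc and term_b in doc for doc in documents)
    if count_ab == 0:
        return 0.0  # terms never co-occur: PMI is minus infinity, clipped to 0
    pmi = math.log2((count_ab / n) / ((count_a / n) * (count_b / n)))
    return max(0.0, pmi)


def mean_pairwise_ppmi(captions, documents):
    """Average PPMI over all unordered pairs of captions."""
    pairs = list(combinations(captions, 2))
    if not pairs:
        return 0.0
    return sum(ppmi(a, b, documents) for a, b in pairs) / len(pairs)
```

A caption set drawn from one coherent topic tends to have pairs that co-occur more often than chance, which raises the mean; a set mixing unrelated topics is pulled towards zero.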

Finally, we use a logistic regression model for ensembling. The model uses mean keyword pair PPMI to predict whether the captions have a coherent theme; the response variable was defined as precision > 0.6, a threshold determined through simulation. We use the probability output by the model to define the proportion of keywords to take from the autoencoder, with the remainder made up of the keywords with the highest Okapi BM25 term weights. The intent is to take advantage of the autoencoder in scenarios where it performs well, but to fall back to a lexical method when we detect that there is no coherent theme to the search results.
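The ensembling step can be sketched in a few lines. The coefficients `intercept` and `slope` are hypothetical placeholders for a fitted logistic regression model, the rounding rule is an assumption, and `lexical_captions` is assumed to be already sorted by descending BM25 term weight.

```python
import math


def coherence_probability(mean_ppmi, intercept, slope):
    """Logistic model: estimated P(caption precision > 0.6 | mean pairwise PPMI)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * mean_ppmi)))


def ensemble_captions(autoencoder_captions, lexical_captions, probability, k=10):
    """Mix captions: a share `probability` from the autoencoder, the rest lexical."""
    # Round half up to decide how many of the k slots go to the autoencoder.
    n_auto = min(int(probability * k + 0.5), len(autoencoder_captions))
    chosen = list(autoencoder_captions[:n_auto])
    for caption in lexical_captions:
        if len(chosen) == k:
            break
        if caption not in chosen:  # avoid showing the same caption twice
            chosen.append(caption)
    return chosen
```

With a probability near 1 the list is dominated by autoencoder captions, and with a probability near 0 it degrades gracefully to the purely lexical list.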

4 System design

The goal of this chapter is to investigate how to integrate the keyword summaries algorithm into an existing search engine effectively and in a user-friendly way. In this thesis, we focus on an academic use case, scientific literature search, building on an existing search system which we refer to as the baseline system. We start by introducing the baseline system and then describe the interface design of the caption system incorporating ESC, exploring design solutions in terms of interface and practicality. Next, we explain the importance of taking an experimental approach and identify suitable indicators of information search behaviour.

4.1 System architecture and interface design

In order to give a context on what the system is like, we first give an overview of the baseline system before going deeper into the design of our improved caption system.

4.1.1 Baseline search engine

We used a literature search system with a layout and functions typical of many search engines: a navigation bar at the top with a search box, and a list of result articles with title and abstract. Figure 4 depicts the interface.

As we wanted to enable experimental features in the system, such as logging, we could not use Google Scholar, although it is a widely used interface for literature search. We therefore chose a highly flexible baseline system as a case study and integrated ESC into it. In many respects, the interface is very close to Google Scholar: every result block contains the document title, authors, date of publication, venue and abstract. In the baseline system, a search is initiated by typing a query into the search box at the top of the page, which results in a list of articles appearing on-screen. Users can click on any article on the screen to explore its details if needed.

Moreover, as Google Scholar does not allow free access to its data, we use a free digital library, arXiv, and take all Computer Science papers from the arXiv data set as our document corpus. arXiv is also one of the most widely accepted open access digital libraries in the domain of computer science.

Figure 4: Screenshot of the baseline system interface. The image shows only some of the articles displayed as results. Users can scroll down indefinitely to explore more articles.

In exploratory search tasks, users are expected to investigate more results [2]. However, users tend not to move beyond the first search result page even when they are interested in exploring more articles [49]. Thus, to investigate users’ behaviour without forcing them to click for a “next page”, our baseline interface implements infinite scroll, fetching additional search results on demand as the user scrolls down the page.

In addition, the baseline system has a bookmark function where searchers can mark the articles of their interest by clicking on the mark icon at the top-right of a result block (Figure 5). To review the bookmarked articles, searchers click on the teal button next to the search box.


Figure 5: Screenshot of the bookmark list in the baseline system interface. The image shows only some of the bookmarked articles. Users can go back to the result page by clicking on the yellow button.

4.1.2 Interface and system design with keyword summaries

For open-ended exploratory search tasks, it is vital to provide tools that help users stay focused by organizing search results in a meaningful way and that enhance their learning processes. Studies [6] have shown that visual elements and richer interaction can substantially increase users’ engagement in search iterations, boosting the performance of IR systems.

Considering the needs from the user perspective, we derived a few user interface principles for the caption system:

• Must be intuitive to understand, even for first-time users.

• Must not be ambiguous to use or understand.

• Must follow the style of the original interface, without breaking the harmony of the entire system.

• Must be clearly distinguishable visually, not blending with other interface elements.

• Must not distract users from the search session, e.g. from looking through the result list.

• Must deliver the complete support provided by ESC with as little disturbance as possible.

Figure 6: First version of ESC incorporated system: horizontal caption bar. Captions would change as the user scrolls through the result list.

With these principles in mind, the first version of the caption system placed captions in a horizontal list under the navigation bar (Figure 6). The motivation for this visualisation was to offer users a simple overview of the articles they are browsing in the ongoing search session. As the user scrolls down, the displayed articles change and the captions change accordingly. We initially thought that a caption bar below the search box would balance the performance and perception of dynamic keyword summaries. However, our pilot test quickly revealed weaknesses in this layout. First, users found it hard to understand the relations between captions, because a horizontal list can barely convey the ranking among them. Second, the fixed width of the caption bar limits the number of captions that can be displayed, especially as caption length varies. In addition, plain text captions are not prominent enough to be noticed by users, who may not even realize that such a supportive function is available throughout the search session. Because of the bar’s central position, caption animations may be more disturbing than beneficial. It is also hard to add further instructions or interactions to these captions, since space is very limited.

Figure 7: Second version of ESC incorporated system: vertical caption sidebar. The top 10 captions are displayed by default and change as the user scrolls through the result list. A demo video is available at https://youtu.be/vn8_Vli6tzo

These findings prompted us to transform the captions into a dynamic bar graph in the second version of the design. We replaced the horizontal caption bar with a vertical sidebar that presents the captions at the right side of the screen (Figure 7). The length of each coloured bar is proportional to our confidence, and the top-down ordering directly shows the ranking of the captions, giving users the big picture at a glance. The ranking is dynamic as users scroll up or down the result list: a caption moves up when its relevance probability, and hence its rank, increases. Because changes are animated, users can easily perceive the movement of each caption in the ranking. We also added a tooltip, shown when users hover over a caption bar, to provide faster navigation.


Furthermore, to dig into a certain caption, users can start a new search session using the caption as the search query by simply clicking on it, either on the text or the bar. The core advantage of this interactive visual design is that it provides users with a multi-dimensional view that allows them to comprehend different parts of the data more effectively while staying in control when discovering new information.

4.2 Experimental approach

We designed and conducted an experiment to evaluate the caption system incorporating ESC. A good experiment should include practical tasks, questionnaires, and follow-up interviews that provide researchers with sufficient data, while controlling for other factors that affect search behaviour [3].

We controlled for three factors that could influence user behaviour in the search process:

• domain knowledge

• search expertise

• subjective task difficulty

Past studies [50] found that students’ domain expertise led to better search performance provided that the participants had adequate information-seeking experience. In our setting, participants with good search skills perform tasks in topic domains they are unfamiliar with. The tasks are set in the scenario of academic information seeking, because a major purpose of exploratory search is the acquisition of new knowledge, which is especially significant in the academic environment [15]. We chose the machine learning (ML) domain for all the tasks because the university offers a good range of ML courses, which makes it simpler to recruit participants with similar knowledge of the subject. Furthermore, numerous ML articles are freely accessible in our selected data set, and ML experts are easier to find for task assessment in the later stage of the experiment.

Despite being specific and involving precise requirements in participant recruitment, the exploratory tasks are general, open-ended and ill-structured in the sense that we do not specify how participants should find a solution or how much information is needed to complete the tasks. This means that participants, as searchers, independently interpret the tasks and decide on their queries, interactions, and answers, while we retain experimental control to a certain extent.

4.3 Indicators of information search behaviours

After analysing previous work, we identified the most appropriate indicators of information search behaviour for our study. We focus on behaviours that can be collected easily without interrupting users’ performance and that help reveal insights throughout the experiment.

We selected two behaviours related to queries:

• Number of queries issued per session: In exploratory search tasks, users spend more time familiarising themselves with the topic in order to formulate queries [2]. To capture this behaviour, we record the number of queries the user issues per search session.

• Number of documents inspected per query: Users benefit from looking through multiple documents, as the inspected documents help them gain a better understanding of the search domain and topic and, in particular, give them extra hints for query formulation [51]. We included this measure because the relevance of documents reflects the quality of the formulated query.

User interactions are also an important measure, from which we selected four behaviours:

• Maximum scroll depth: The maximum number of documents from the result list that users were exposed to through scrolling. In exploratory search tasks, users tend to explore more documents [1], but scroll behaviour is rarely examined. We assign an index to each document in the result list and then calculate the number of documents the user is exposed to through infinite scrolling.

• Bookmark list: The total number of documents the user bookmarked in the search process. Bookmarks consist of pages or documents that the user finds helpful, thus clearly flagging that those pages are considered relevant [52].

• Duration of dwelling: The time users spend reading the clicked documents. Dwell time is utilised as an indicator of document relevance when participant information and task type are known beforehand [53].

• Task completion time: The total amount of time the user spends on a search task, from the moment the first query is submitted until the end button is clicked.
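The indicators listed above can all be derived from a time-stamped interaction log. The sketch below shows one way to do so; the event schema (`type`, `time`, `doc_index`, `doc_id`, `dwell`) is hypothetical and is not the logging format of our system, and counting clicked documents as “inspected” is an assumption.

```python
def session_indicators(events):
    """Summarise one search session from a chronologically ordered event list."""
    queries = [e for e in events if e["type"] == "query"]
    clicks = [e for e in events if e["type"] == "click"]
    # doc_index is the 1-based rank of the lowest result scrolled into view.
    scroll_indices = [e["doc_index"] for e in events if e["type"] == "scroll"]
    bookmarked = {e["doc_id"] for e in events if e["type"] == "bookmark"}
    end_times = [e["time"] for e in events if e["type"] == "end"]
    if not queries or not end_times:
        raise ValueError("a session needs at least one query and an end event")
    return {
        "queries": len(queries),
        "docs_per_query": len(clicks) / len(queries),
        "max_scroll_depth": max(scroll_indices, default=0),
        "bookmarks": len(bookmarked),
        "dwell_seconds": sum(e["dwell"] for e in clicks),
        # From the first submitted query until the end button is clicked.
        "task_seconds": end_times[-1] - queries[0]["time"],
    }
```

Because every indicator is computed from passive logs, none of them requires interrupting the participant during the task.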

5 User study design

We conducted a user study to evaluate how well the system incorporating ESC (caption system) works compared to the system where users do not have support from captions (baseline system). The purpose of this study is to gather information on user interactions and search behaviours in both systems, so as to investigate how effective and useful the ESC algorithm can be in supporting users in exploratory search. We also concentrated on users’ understanding of the generated captions, their engagement in search iterations, and their perceptions of captions in the context of the search interface.

5.1 Participants

To recruit participants, we posted advertisements on the Computer Science (CS) department mailing list of the University of Helsinki. We selected university students with at least 3 years of study in Computer Science or Data Science, including postgraduate students and PhD students, because they are active users of exploratory and scientific search as a major part of their work [54]. Nineteen participants took part in the study. Eight of them (42%) were female and 11 were male. The mean age of the participants was 30 years (min. = 23, max. = 48). All participants were computer science students: 8 MSc students and 11 PhD students. They had studied at university level for at least 4 years, with a median of 7 years (min. = 4, max. = 29).

To ensure that participants’ exploratory search skills were at least at a moderate level, they were asked to fill out a background questionnaire before the study. According to their answers, Google Scholar was the most frequently used tool, with 18 participants; ACM and arXiv were also popular among the participants, while 9 participants selected “Just Google” as one of their search methods. The mean frequency of performing scientific literature search was 4.3, and the lowest rating was 3 (ratings were given on a 5-point Likert scale from 1 [never] to 5 [a great deal]). All the participants were experienced users of exploratory literature search tools.

5.2 Design

We designed a within-subject study, in which every participant performed one search task in each system: the caption system and the baseline system. To avoid order effects, the sequence of the topics and the search systems was determined for each participant by coin tossing.

We compared two systems: one incorporating ESC and a baseline without captions. Thus, any differences in observed user behaviour are generally attributable to the presence of captions. Aside from the presence or absence of captions, both systems were identical. A search is initiated by typing a query into the search box at the top of the page. In the caption system, a list of 10 captions summarising the search results currently displayed on the screen is shown in the right-hand margin. The captions are ordered according to their relative importance, which is quantified by the length of the bar next to each caption (shorter bars indicate greater uncertainty). As users scroll down the page, the captions change to reflect the contents of the documents currently visible to the user, i.e. the order/importance of the currently displayed captions changes or some of the displayed captions are swapped for new ones.

5.3 Tasks

5.3.1 Task Design

In the main tasks, users were instructed to write a short essay draft on a given topic. The task descriptions followed the template:

“You are going to start a new research project on the following topic: X. You would like to learn as much information as possible about this topic, e.g. applications, problems, specific algorithms. Write your answer as an essay draft or in bullet points, and bookmark at least 10 relevant articles that you could use as a reference in writing the article.”


in-between the two tasks (average task duration was 29.5 and 28.4 minutes for caption and baseline systems, respectively).

The task end time was logged when the participants thought the task was finished and clicked the red “End Experiment” button on the top-right corner of the result page. The users were provided with pen and paper as well as a text editor for note taking and could select the method they felt most comfortable with. All the participants opted for the text editor.

5.3.2 Familiarity Questionnaire

To control for participants’ domain knowledge of the task topics, we provided a questionnaire in which users subjectively rated their familiarity with a list of topics carefully selected for the tasks. Before the experiment, users rated their familiarity, on a 4-point Likert scale, with the following topics: robotic surgery, political bias online, sports analytics, gender recognition, cancer diagnosis, autonomous driving evaluation, gender bias in natural language processing and music recommendation. The topics were selected beforehand to ensure appropriate coverage in the data set used in the experiment. To ensure that users would engage in exploratory search, only topics with low levels of familiarity for a given user were considered, i.e. topics marked as 1 or 2 on the Likert scale. If a user indicated low familiarity with exactly two topics, then those two topics were randomly assigned to the two systems. If users indicated low familiarity with more than two topics, then they were asked to select their two preferred topics, which were then randomly assigned to the two systems. Users who did not indicate low familiarity with at least two topics were excluded from the study.
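The topic selection rules above amount to a small decision procedure, sketched below. The function name `assign_topics` and its arguments are hypothetical; in the study the assignment was carried out manually, so this is only an illustration of the logic.

```python
import random


def assign_topics(familiarity, preferred=None, rng=random):
    """Return {"caption": topic, "baseline": topic}, or None if the user is excluded.

    `familiarity` maps topic -> rating on the 4-point scale (1 = least familiar).
    `preferred` is the user's own choice of two topics and is only needed
    when more than two topics qualify.
    """
    eligible = [t for t, rating in familiarity.items() if rating <= 2]
    if len(eligible) < 2:
        return None  # fewer than two low-familiarity topics: excluded
    if len(eligible) > 2:
        if (preferred is None or len(set(preferred)) != 2
                or not set(preferred) <= set(eligible)):
            raise ValueError("user must pick two of the eligible topics")
        eligible = list(preferred)
    rng.shuffle(eligible)  # random assignment of the two topics to the systems
    return {"caption": eligible[0], "baseline": eligible[1]}
```

Passing a seeded `random.Random` instance as `rng` makes the assignment reproducible, which is convenient when documenting an experiment.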

5.4 Measurements

For every task performed, we logged the following details: the search topic for the task, the system used (caption system or baseline), and the length of the notes taken on the given paper. Both systems logged the task start and end times as well as user interaction data: the number of clicks, maximum scroll depth, bookmarked documents, all queries used in the search, the duration of results page browsing, and the time spent reading the clicked documents. For the system with captions, we additionally logged interactions with captions, i.e. the captions that users clicked on to start a new search query.


We also collected qualitative feedback on the search systems through semi-structured interviews at the end of the experiment. As an indicator of the task performance, an expert assessor from the machine learning domain rated, on a 5-point scale, the quality of the answers returned using both systems for the search tasks. The assessor conducted the review in a blind manner, with no awareness of which system was used in the task.

5.5 Procedure

All sessions took place in a controlled laboratory on a 15-inch MacBook Pro (2016). First, we explained the purpose and procedure of the study to the participants. We informed the participants that the purpose was to “test our search engine for scientific literature.” Therefore, we instructed the participants to perform the tasks as they would normally do using the search systems we provided. Further, we explained that the search tools used all Computer Science papers from the arXiv data set as the document corpus and that the search systems work in essentially the same way as the most commonly used literature search tools. Figure 8 shows every step of the study in which the participants were involved.

The experiment comprised the following steps:

1. Consent form We asked the participants to sign a form of consent to take part in research.

2. Short tutorial We showed the participants a 1.5 minute video tutorial on how the systems work. This was followed by a 15 minute practice session to allow the users to familiarize themselves with the two systems at their own pace. In the practice session the participants were instructed to perform searches related to their own interests.

3. Pre-experiment questionnaire Prior to the study, we provided a pre-experiment questionnaire for the participant to fill in, comprising the background questionnaire and the topic familiarity rating questionnaire. This captured the participants’ basic information and exploratory search skills, as well as the topics with low levels of familiarity for later use in the tasks.

4. Sequence of systems to be used We flipped a coin to decide the order of topics and systems for the two tasks so that the noise brought by sequence was eliminated.


Figure 8: All steps of the study from the view of a participant

5. Performing task We told the participants that each task had a 30-minute time limit, and we put a clock on the table so that they could keep track of the time themselves. To keep the whole process as natural as possible, we did not ask the participants to think aloud, nor did we keep reminding them of the time spent.

• A written task description was presented to participants to read through until they understood it.

• Participants typed in the query, clicked the search button and proceeded to the search result page. Meanwhile, their start time was logged.

• Participants were instructed to write their answers in a text editor and were given a blank sheet of A4 paper for each task to take notes, if necessary.

• While they were performing the task, we quietly observed their behaviour and made notes of anything noteworthy to discuss with them during the interview.

6. Post-task questionnaire After each task, the participants completed the SUS[41] questionnaire with 10 questions (Table 6.2.1) and a modified version of the ResQue questionnaire[44] with 12 questions (Table 6.2.2).


7. Post-experiment questionnaire After both tasks were completed, the participants completed a post-experiment questionnaire, which consists of 19 questions (Table 1) about the utility of captions and the user perception of the two systems in exploratory search scenarios.

8. Interview We conducted a semi-structured interview about their search experience with the search systems. The aim of the interview was to better understand the user behaviour and explore the insights of participants regarding the two systems, especially the caption system.

Each study lasted approximately 90 minutes in total and we compensated each participant with a 15 euro bookshop gift card.

6 Results

We examined the results using both quantitative and qualitative feedback that we collected during the study, as well as the expert assessment scores of participants’ answers to the tasks. All the participants completed all the tasks successfully. The 19 participants performed 38 search tasks in total (19 exploratory tasks × 2 systems = 38). No data points were excluded.

6.1 User Perception

To explore how users perceived the integration of ESC into the original system, we recorded the explicit preference and the ratings users gave in the post-experiment questionnaire. We also analysed the qualitative feedback the participants gave in the semi-structured interviews.

6.1.1 User rating

In the post-experiment questionnaire, users stated almost unanimously that they preferred the system with captions to the baseline (18/19, p = 7.6 × 10⁻⁵, binomial test). Table 1 shows the mean scores users gave for the exploratory tasks (ratings are given on a 5-point Likert scale). A majority of users felt that the presence of captions reassured them that search results were relevant to their search goals (16/19, p = 0.004). While we were concerned that the dynamic nature of the caption component would annoy users engaged in exploratory search, none of the participants thought the captions were distracting (0/19, p = 3.8 × 10⁻⁶), although a minority found the caption animation distracting (3/19, p = 0.004).
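For readers who wish to check the reported p-values, the sketch below implements an exact binomial test in plain Python. It assumes a two-sided test against a null proportion of 0.5, which is not stated explicitly in the text but reproduces the figures above for 18/19, 16/19 and 0/19; the function name is ours.

```python
from math import comb


def binomial_two_sided_p(successes, n, p=0.5):
    """Exact two-sided binomial test p-value.

    Sums the probability of every outcome that is no more likely than the
    observed one under the null hypothesis.
    """
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    observed = pmf[successes]
    # A small relative tolerance guards against floating point ties.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-7)))
```

For example, `binomial_two_sided_p(18, 19)` sums the probabilities of 0, 1, 18 and 19 successes, giving 40/2¹⁹ ≈ 7.6 × 10⁻⁵.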

Average score    p          Question
I = 18, B = 1    7.6e-05    1. Which system did you prefer to use?
3.9              0.063      2. I found it easier to perform the search with captions
3.9              0.063      3. I found it easier to write the essay draft with captions
4.1              0.0007     4. The labels of the caption interface are clear
3.5              0.359      5. The bars of the caption interface are clear
3.2              1.0        6. The captions should be an optional function
4.1              0.0007     7. The captions enhanced my search session
4.1              0.0007     8. The captions were related to my search results
4.2              0.004      9. The captions reassured me that my search results were relevant to my search goals
3.8              0.063      10. The captions provided a good summary of my search results
3.6              1.0        11. The captions provided good followup queries
1.9              3.8e-06    12. The captions were distracting
2.4              0.004      13. The caption animations were distracting
3.4              0.647      14. The system’s confidence in each caption was clearly indicated
4.0              0.0007     15. The system was better with the captions than without
2.0              3.8e-06    16. There were too many captions shown

Table 1: Post-experiment questionnaire: average scores with p-values. Question 1 was binary. Questions 2–16 were scored on a 5-point Likert scale from 1 (disagree) to 5 (agree).

6.1.2 Qualitative feedback

After the main tasks, we conducted a semi-structured interview with each user. The interviews further supported the finding that the caption system outperforms the baseline system for exploratory tasks.

Almost all users reported that they preferred the caption interface to the baseline. Ten participants stated that the caption system was significantly better when they switched between the two tasks: “when no captions, it takes me more time to look at the actual papers” [P9], “it feels disappointing with the system without captions.” [P4]. Fifteen participants looked at the captions frequently during their search, while 4 participants did not really pay attention to them and 2 participants confirmed that they never clicked on the captions. As for the perceived utility of the captions, 12 participants gave affirmative answers: “it summarizes the key information well” [P2], “mainly for the relevance of the documents and get a sense of what’s in there for predicting new things to search for” [P19]. The other 7 participants stated that the captions were not really useful for their search, mainly because they considered the captions unreliable: “it’s not hardcore useful, but interesting” [P7], “the first caption is not what I want to focus on” [P13]. When asked about their preferences regarding the interface of the caption system, 15 participants expressed their satisfaction with the existing style: “the current one is peaceful” [P7], “nice and minimus” [P16]. The most often mentioned benefits of the caption interface were:

• a concise summary of the search results

• suggestion for followup queries

• help with search context

• help to find new and interesting documents faster

Users also raised issues and suggested future improvements: some of the captions provided by the system were too generic to form the basis of a good followup query, the system lacked “back button” functionality, and it would help to indicate which papers are most correlated with which captions.

6.2 Usability analysis

The caption interface obtained higher scores than the baseline in both the SUS and ResQue questionnaires.

6.2.1 SUS Analysis

In SUS, the overall score was 76.8 for the caption system and 71.2 for the baseline, as shown in Table 2. While the difference was not significant (p = 0.136, Wilcoxon signed rank test), this indicates that the usability of the system did not suffer from the added functionality provided by the caption component.

I     B     p        Question
3.4   2.7   0.0049   1. I think that I would like to use this system frequently
1.7   1.9   0.107    2. I found the system unnecessarily complex
4.3   3.8   0.107    3. I thought the system was easy to use
1.6   1.5   0.726    4. I think that I would need the support of a technical person to be able to use this system
3.8   3.4   0.159    5. I found the various functions in this system were well integrated
1.9   1.9   0.705    6. I thought there was too much inconsistency in this system
4.3   4.2   1.0      7. I would imagine that most people would learn to use this system very quickly
1.9   2.1   0.129    8. I found the system very cumbersome to use
3.6   3.3   0.144    9. I felt very confident using the system
1.6   1.5   0.705    10. I needed to learn a lot of things before I could get going with this system

Table 2: SUS score averages for the caption (I) and baseline (B) systems with p-values from Wilcoxon signed rank test. Questions used a 5-point Likert scale from 1 (disagree) to 5 (agree). Higher is better for odd numbered questions and lower is better for even numbered questions.
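The overall SUS scores can be related to the item averages in Table 2. The following is a minimal sketch, assuming the standard SUS scoring rule (odd items contribute score minus 1, even items contribute 5 minus score, and the sum is multiplied by 2.5); this section does not spell out the formula, so treat the rule as an assumption. Because the rule is linear, applying it to rounded item means only approximates the mean of the per-participant scores.

```python
def sus_score(item_scores):
    """Overall SUS score (0-100) from ten item scores on a 1-5 scale.

    Assumes standard SUS scoring: odd-numbered items are positively worded,
    even-numbered items are negatively worded.
    """
    if len(item_scores) != 10:
        raise ValueError("SUS has exactly ten items")
    total = 0.0
    for number, score in enumerate(item_scores, start=1):
        if number % 2 == 1:
            total += score - 1   # odd item: higher is better
        else:
            total += 5 - score   # even item: lower is better
    return total * 2.5


# Item averages copied from Table 2, questions 1 to 10.
caption_means = [3.4, 1.7, 4.3, 1.6, 3.8, 1.9, 4.3, 1.9, 3.6, 1.6]
baseline_means = [2.7, 1.9, 3.8, 1.5, 3.4, 1.9, 4.2, 2.1, 3.3, 1.5]

print(f"caption:  {sus_score(caption_means):.2f}")
print(f"baseline: {sus_score(baseline_means):.2f}")
```

Under this assumption the rounded item means give approximately 76.75 and 71.25, in line with the reported 76.8 and 71.2.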

6.2.2 ResQue Analysis

In ResQue, the caption system significantly outperformed the baseline, averaging 83.2 versus 67.8 (p = 0.001, Wilcoxon signed rank test), which shows that participants found the captions useful. As can be seen in Table 3, the caption system significantly exceeded the baseline system in helping users find the ideal documents. The post-experiment questionnaire showed that 18 out of 19 users preferred the caption system to the baseline and agreed that the system was better with the captions. Generally, users agreed that the captions provided a good summary of the search results (14 users), helped them to perform the search (14) and write the essay (14), and enhanced the search session (17).

I     B     p        Question
4.0   3.5   0.01     1. The documents recommended to me matched what I was searching for
3.7   3.1   0.0139   2. The system helped me discover new documents
3.6   2.8   0.1305   3. The documents recommended to me are diverse
3.6   2.8   0.0164   4. The system helped me find the ideal documents
4.2   4.1   0.7054   5. I became familiar with the system very quickly
3.8   2.8   0.0189   6. I found it easy to notice if the search results were not correct any more
3.9   3.2   0.011    7. I felt confident to modify my query
3.6   3.2   0.0522   8. Using the system to find what I like is easy
3.7   3.7   0.7192   9. I found it easy to re-find documents I had been recommended before
3.7   3.1   0.0079   10. The system gave me good suggestions
3.8   3.5   0.07     11. The system made me confident about the documents I bookmarked
3.6   3.2   0.0374   12. Overall, I am satisfied with the system

Table 3: ResQue score averages for caption (I) and baseline (B) systems with p-values from Wilcoxon signed rank test. Questions used a 5-point Likert scale from 1 (disagree) to 5 (agree). Higher is better.

6.3 User behaviour

To investigate whether there is a notable behavioural difference between the baseline and the caption system, and to answer RQ2, we analysed user behaviour in terms of: (1) task performance, measured by the expert assessment score; (2) the number of queries users issued, referred to as interactions; and (3) the number of documents users inspected and bookmarked.

6.3.1 Task performance

Task performance was assessed blind by an expert assessor based on the written answers submitted by participants. Ratings were given on a 5-point scale from 1 (bad) to 5 (good). The average task performance was 2.95 with the baseline and 3.37 with the caption interface (p = 0.035, Wilcoxon signed rank test).
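The paired comparisons in this chapter rely on the Wilcoxon signed rank test. The following is a minimal sketch of an exact two-sided version, assuming zero differences are dropped and tied differences share average ranks; the thesis does not state which variant or software was used. The ratings in the example are hypothetical and are not data from the study.

```python
def wilcoxon_signed_rank_p(pairs):
    """Exact two-sided Wilcoxon signed rank p-value for paired scores.

    Assumptions: pairs with zero difference are dropped, and tied absolute
    differences receive the average of the ranks they span.
    """
    diffs = [a - b for a, b in pairs if a != b]
    if not diffs:
        return 1.0

    # Rank absolute differences. Ranks are stored doubled so that
    # average ranks such as 2.5 remain integers.
    ordered = sorted(abs(d) for d in diffs)
    doubled_rank = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # Positions i..j-1 hold 1-based ranks i+1..j; their doubled mean is i+1+j.
        doubled_rank[ordered[i]] = i + 1 + j
        i = j

    ranks = [doubled_rank[abs(d)] for d in diffs]
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)

    # counts[s] = number of sign assignments whose positive rank sum is s.
    counts = [1] + [0] * total
    for r in ranks:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]

    lower_tail = sum(counts[: observed + 1]) / 2 ** len(ranks)
    return min(1.0, 2 * lower_tail)


# Hypothetical (caption, baseline) task ratings for eight participants.
example_pairs = [(4, 3), (5, 3), (3, 3), (4, 2), (2, 3), (5, 4), (4, 1), (3, 2)]
print(f"p = {wilcoxon_signed_rank_p(example_pairs):.4f}")
```

The exact enumeration is practical for a sample of 19 participants; larger samples would normally use a normal approximation instead.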

Contents

1 Introduction
2 Background and related work
2.1 Information retrieval
2.2 Exploratory search
2.3 Relevance feedback
2.4 Types of systems for exploratory search
2.5 Studies of exploratory search behaviour
2.6 Usability questionnaire frameworks
3 Exploratory Search Caption
3.1 Autoencoder model
3.2 Data and model training
3.3 Caption generation and ensembling
4 System design
4.1 System architecture and interface design
4.2 Experimental approach
4.3 Indicators of information search behaviours
5 User study design
5.1 Participants
5.2 Design
5.3 Tasks
5.4 Measurements
5.5 Procedure
6 Results
6.1 User Perception
6.2 Usability analysis
6.3 User behaviour