IR experiments often use test collections to evaluate techniques. These test collections typically contain a set of documents (or items), queries to perform on the set, and a list of relevant target result items for each query. Collections of this nature follow the now standard Cranfield model [Cleverdon and Keen, 1966]. Various evaluation workshops, such as TREC (footnote 8), offer standardized IR evaluation tasks by providing test collections for various types of datasets, such as news archives and the linked and meta-tagged World Wide Web. The focus of retrieval in these types of collections is predominantly on the development of techniques to facilitate the finding of relevant information, for example finding news items on a topic of interest. A common characteristic of the data sources that have been used for such standardized IR evaluation to date is that the test set data can be shared with workshop participants (subject to copyright agreement). Further, the search requirements and information needs of target user groups of the collections can generally be captured and studied, and used to develop experimental search topics for evaluation purposes. The success of participants' systems is then assessed through manual post hoc assessment of the data, by analysing documents retrieved or submitted for assessment by workshop participants.
Personal collections differ from the data sources used for existing standardized IR evaluation tasks in a number of ways. Firstly, these collections are personal to the individual, in that they have been created or obtained by the individual or represent experiences of the individual, including, for example, emails and SMS messages relating to concerts attended or news articles relating to sports matches attended. Since this is the case, individuals will generally be unwilling to share these collections.
Secondly, the collection owner may have personal experiences and memories associated with the items in their archive. Depending on their information needs at a given moment, these will inform both the items they wish to retrieve from the archive and the query terms they will use in this retrieval process. This means that only the personal collection owner can provide their real re-finding information needs and the terms they would use for queries to retrieve relevant items. Further, only this individual can determine the relevance of items retrieved for a given information need from their personal collection. These key differences between personal collections and collections for which TREC-like IR tracks exist are important: they make evaluation in this domain challenging (footnote 9) and to date have hindered the formation of shared Cranfield-style collections for personal search evaluation.
Footnote 8: http://trec.nist.gov/ (September 2011)
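As a rough illustration of the Cranfield-style structure discussed above (items, queries and per-query relevance judgments), the following minimal sketch shows how such a collection might be represented and scored. The class name `TestCollection`, the helper names and the choice of precision at k as the measure are assumptions made for illustration only; they are not taken from TREC or from any of the cited works.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TestCollection:
    """Hypothetical container for a Cranfield-style test collection."""
    documents: dict[str, str]  # item id -> item text
    queries: dict[str, str]    # query id -> query text
    # query id -> ids of the items judged relevant for that query
    qrels: dict[str, set[str]] = field(default_factory=dict)


def precision_at_k(ranking: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k ranked item ids that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item_id in ranking[:k] if item_id in relevant)
    return hits / k


def evaluate(collection: TestCollection,
             rankings: dict[str, list[str]],
             k: int = 5) -> dict[str, float]:
    """Score one system's rankings against the stored judgments, per query."""
    return {
        query_id: precision_at_k(
            rankings.get(query_id, []),
            collection.qrels.get(query_id, set()),
            k,
        )
        for query_id in collection.queries
    }
```

Once the three components are fixed, any number of retrieval systems can be compared by passing their rankings to `evaluate`. For a personal collection, the difficulty described above is that only the collection owner can supply the `queries` and `qrels` entries.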
To conduct experiments in personal data search, researchers largely need to create their own test collections consisting of individuals’ data, queries and result sets. There are a number of problems with this approach: 1) the effort required to create these collections; 2) the difficulty in recruiting large numbers of subjects for such experiments; and 3) the lack of comparability across research efforts. In an effort to overcome these problems, Kim and Croft [Kim and Croft, 2009] created pseudo-desktop test collections for the desktop search space. The authors created three pseudo-desktop collections by extracting the emails of three individuals prominent in the TREC Enterprise Track collection [Craswell and Vries, 2005] and locating web pages, Word documents, PDF files and PowerPoint presentations related to these people through a web search query consisting of the person’s name, organization and area of speciality (provided by the TREC expert search track). They randomly chose known items from these collections and used a modification of the approach proposed by [Azzopardi et al., 2007] for simulated query generation for webpage re-finding to generate simulated queries across multi-field personal items. This approach presents a promising direction towards larger-scale test collection creation to support research in desktop search, and provides a means to support research into the utility of desktop retrieval approaches without the need for real users and their collections. However, these collections do not represent the diversity of real users’ collections, and hence may not provide a reliable way to evaluate the performance of retrieval algorithms intended for personal desktop collections. The created collections contain a limited number of item types and the same volume of each provided item type across the three collections (with the exception of emails, the number of which showed large variation across the collections). Given the personal nature of desktop collections, we can expect individuals to have different types of collections, with varying volumes and types of content, covering varying numbers of topics. Further, it is not known to what extent the query formulation approach used reflects what collection owners will actually recall about required items and hence the query terms they will use. Indeed, the query generation approach of Azzopardi et al. [Azzopardi et al., 2007], which forms the core part of this multi-field query formation approach, was developed for webpage re-finding, and is acknowledged by its authors to require further analysis and refinement to exhibit more of the characteristics observed among individuals in the webpage re-finding space in which they were working.
Footnote 9: Indeed, the need to move towards standardization in this domain was highlighted at the recent SIGIR 2010 Desktop Search workshop [Elsweiler et al., 2010] and ECIR 2011 Evaluating Personal Search workshop [Elsweiler et al., 2011].
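To make the idea of simulated known-item query generation more concrete, the sketch below picks a target item at random and samples query terms from a mixture of that item's term distribution and the collection-wide term distribution. This is a simplified illustration only: the function names, the `noise` mixing weight, the query length range and the single-field treatment are assumptions, and the code does not reproduce the exact procedure of [Azzopardi et al., 2007] or the multi-field extension of [Kim and Croft, 2009].

```python
import random
from collections import Counter


def term_model(doc_terms, collection_counts, collection_size, noise=0.2):
    """Mix the item's term distribution with collection-wide 'noise'."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    model = {}
    for term, coll_count in collection_counts.items():
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = coll_count / collection_size
        model[term] = (1 - noise) * p_doc + noise * p_coll
    return model


def simulate_query(docs, rng, min_len=1, max_len=4, noise=0.2):
    """Choose a known item at random and sample query terms for it.

    `docs` maps item id -> list of terms; every item is assumed non-empty.
    Returns the target item id and the sampled query terms.
    """
    collection_counts = Counter(t for terms in docs.values() for t in terms)
    collection_size = sum(collection_counts.values())
    target_id = rng.choice(sorted(docs))  # sorted for reproducibility
    model = term_model(docs[target_id], collection_counts,
                       collection_size, noise)
    terms = list(model)
    weights = [model[t] for t in terms]
    length = rng.randint(min_len, max_len)
    # Sample with replacement, so a term may repeat within a query.
    return target_id, rng.choices(terms, weights=weights, k=length)
```

A call such as `simulate_query(docs, random.Random(0))` yields a (target item, query) pair that can serve as one known-item topic. The concern raised above applies directly to this kind of sketch: nothing in the sampling step models what a collection owner would actually remember about the item.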
Other researchers conducting retrieval experiments for personal search have used ‘real’ users and some form of their collection. In [Ringel et al., 2003], 30 queries were manually created, each with a target email item which had been sent to a large number of people in a company. These queries were then entered by test subjects into the Stuff I’ve Seen (SIS) interface, resulting in the retrieval of various items from their personal collections, including the target email. Subjects were then required to locate the target email, thus allowing various versions of the SIS interface to be tested. While this technique proved useful for testing features of personal search interfaces, it is not appropriate for evaluating PL retrieval algorithms, where a user is required to use recalled content and context to form a query. Even a modification of this approach, in which tasks are generated by giving a subject a description of emails that had been sent to a group, would not be appropriate, as the subject may have no recollection of these emails and therefore could not recall content or context data with which to form a query. Further, by providing individuals with the required target item, we are removing the real re-finding requirements of individuals from the evaluation process. Finally, and most obviously, the test sets used in this approach lack the rich context sources we wish to explore in retrieval experiments.
Elsweiler’s work [Elsweiler et al., 2007], similar to the SIS evaluations, used a static email collection for PL experimentation. That is, the interactions individuals have with PL items were not recorded. This information is required to assign rich, access-related context types to items. For example, to tag an item with the months in which an individual accessed it, each access to the item needs to be recorded. However, Elsweiler’s work did adopt a more personal approach to task generation. This work presented a framework for a task-based approach to PL user evaluation. Specifically, over a period of approximately three weeks subjects recorded web and email related tasks (a task being the email viewed and the purpose for which it was viewed, e.g. re-locate the email which contains Joe Smith’s phone number). They also allowed for the generation of additional tasks by the experiment investigator, through observation of the types of tasks recorded by subjects and by having a number of subjects in a given group (e.g. students in the same class) provide a tour of their collections. This allowed them to form task descriptions which simulated a ‘real world’ task the subject may engage in. Tasks were then categorized into three distinct types: tasks requiring a specific piece of information from within a computer item (lookup tasks); tasks requiring a specific computer item (known-item tasks); and tasks requiring information from multiple computer items (multi-item tasks). This approach allows for comparison of retrieval performance across different task types (i.e., across ‘lookup tasks’, ‘known-item tasks’ and ‘multi-item tasks’) and, importantly, the evaluation of a personal information retrieval application in a structured manner using the personal information owners themselves. Elsweiler et al. examined this task generation approach only on emails and web pages, and thus its portability to other item types is not guaranteed.
Moving beyond emails and web pages into the space of personal computer items and, of particular interest to us, into the space of backend retrieval algorithm evaluation, [Soules and Ganger, 2005, Soules, 2006] logged the computer activity (including all accesses to items) of 6 subjects over a period of 6 months. The subjects submitted 3-5 content-only queries, which they appeared to freely generate from their memory with respect to the collection period. These were multi-item, standard ad hoc type queries, e.g. “locate all items associated with writing of thesis”. To create oracle results for these queries, the results for each query across different search engines were pooled together, and the subject rated the relevance of the pooled result set. This evaluation approach is similar to the standard TREC-type pooling strategy and is of particular interest to us, as it provides a means to create Cranfield-type test collections containing accesses to personal items which, once created, can be used for the exploration and development of unlimited numbers of backend retrieval techniques without requiring further user interaction.
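The pooling step described above can be sketched as follows: the top-ranked results from each search engine are merged into a single de-duplicated pool, and the collection owner then judges each pooled item. The function names, the `depth` cut-off and the `judge` callback are hypothetical and chosen for illustration; the pool depth and judging procedure actually used by [Soules and Ganger, 2005, Soules, 2006] are not specified here.

```python
def build_pool(runs, depth=10):
    """Union of the top-`depth` item ids from each run, in first-seen order.

    `runs` maps an engine name to its ranked list of item ids for one query.
    """
    pool = []
    seen = set()
    for ranking in runs.values():
        for item_id in ranking[:depth]:
            if item_id not in seen:
                seen.add(item_id)
                pool.append(item_id)
    return pool


def judge_pool(pool, judge):
    """Ask the collection owner about each pooled item.

    `judge` is a callable returning True when the owner rates an item
    relevant; the result is the set of relevant item ids for the query.
    """
    return {item_id for item_id in pool if judge(item_id)}
```

The set returned by `judge_pool` plays the role of the oracle result list for a query. Items that no engine ranked within the pool depth are never shown to the owner and are therefore treated as non-relevant, which is the usual caveat of pooled judgments.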