Evaluation

Evaluating a CIS environment can be a huge challenge because of its complex design, which involves a set of users, integrated systems, and a variety of interactions. One can evaluate a CIS system using typical measures of IR. However, as discussed before, information seeking is not merely about retrieving information, and thus evaluating a CIS system solely by its retrieval effectiveness may not be sufficient. While traditional IR evaluations can still be used to measure the retrieval performance of a collaborative filtering system, just as Smyth, Balfe, Boydell, Bradley, Briggs, Coyle, and Freyne (2005) did, we need additional measures for CIS systems.

Baeza-Yates and Pino (1997) presented some initial work on a measure that extends the evaluation of a single-user IR system to a collaborative environment. While this measure was based on retrieval performance, Aneiros and Estivill-Castro (2005) proposed evaluating the “goodness” of a collaborative system through its usability. In addition, Baeza-Yates and Pino (1997) treated the performance of a group as the summation of the performances of the individuals in the group. While this may work for simple information seeking and retrieval, we can imagine situations in which it does not hold. For instance, two people working together will not necessarily find twice as much information as either of them working independently. Participants may not be able to find twice as many results, but what if they achieved a better understanding of the problem or the information due to working in collaboration? Then there are other factors, such as engagement, social interactions, and social capital, which may be important depending upon the application but are usually not looked at in non-interactive or single-user IR evaluations.
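To make the additive assumption concrete, the following minimal sketch contrasts a summed group score with a count of the distinct items the group found. The function names, the toy data, and the "distinct items" alternative are illustrative assumptions, not measures taken from Baeza-Yates and Pino (1997).

```python
def additive_group_score(individual_found):
    """Additive view: sum each member's count of relevant items found."""
    return sum(len(found) for found in individual_found.values())


def distinct_group_score(individual_found):
    """Illustrative alternative: count distinct relevant items found by the group."""
    return len(set().union(*individual_found.values()))


# Hypothetical data: two collaborators whose findings overlap on d2 and d3.
found = {
    "alice": {"d1", "d2", "d3"},
    "bob": {"d2", "d3", "d4"},
}

print(additive_group_score(found))   # 6
print(distinct_group_score(found))   # 4
```

The two scores diverge as soon as collaborators overlap, and neither captures gains such as better understanding or engagement.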

The majority of the work reported in the literature that has attempted to evaluate the effectiveness of a collaborative system has looked at the usability of the collaborative interface. For instance, Morris and Horvitz (2007) tested their SearchTogether system with a user study to evaluate how users utilize various tools offered in their interface and how those tools affect the act of collaboration. The authors used seven pairs of users and let each pair choose their topic of mutual interest to work with. The evaluation was based on the log, observations, and questionnaire data. While they showed the effectiveness of their interface in letting people search together, there was no evaluation of learning that took place in the group due to collaboration. Laurillau and Nigay (2002) demonstrated how multiple users can navigate the web in a collaborative environment with their CoVitesse system. They presented evaluations for the user interface as well as various network-related parameters. However, no clear understanding of the effects on the retrieval performance was reported. Aneiros and Estivill-Castro (2005) presented a questionnaire to the participants of their user study to evaluate the usability of their Group Unified History (GUH) tool. Typical questions on their questionnaire were “how difficult was it to interpret the user identity symbols used in the tool?” and “did you visit any websites found by your team/peers using the group history?”

Smyth et al. (2003) tested their I-Spy system with a leave-one-out evaluation methodology. From 20 users, they left one user as a testing user and used the other 19 users as the training users. The relevancy results of the training users were used to populate I-Spy’s hit matrix (detail given earlier), and the results of each query were re-ranked using I-Spy’s relevancy metric. Then they counted the number of those results listed as relevant by the test user for various result-list sizes. Finally, they made the equivalent relevancy measurements by analyzing the results produced by the untrained version of I-Spy to serve as a baseline.
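The leave-one-out procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the I-Spy implementation: the function names and toy data are hypothetical, and the relevancy metric is assumed to be the share of past selections for a query that went to a given page.

```python
def build_hit_matrix(training_judgments):
    """Count, per query, how many training users marked each page as relevant."""
    hits = {}
    for judged in training_judgments:
        for query, pages in judged.items():
            row = hits.setdefault(query, {})
            for page in pages:
                row[page] = row.get(page, 0) + 1
    return hits


def rerank(query, results, hits):
    """Re-rank results by an ASSUMED relevancy metric: share of hits per page.

    sorted() is stable, so pages with equal scores keep their baseline order.
    """
    row = hits.get(query, {})
    total = sum(row.values())

    def relevance(page):
        return row.get(page, 0) / total if total else 0.0

    return sorted(results, key=relevance, reverse=True)


def evaluate_held_out_user(test_user, judgments, baseline_results, k):
    """Return (trained, baseline) counts of the test user's relevant pages in the top k."""
    training = [judged for user, judged in judgments.items() if user != test_user]
    hits = build_hit_matrix(training)
    trained_hits = baseline_hits = 0
    for query, relevant in judgments[test_user].items():
        baseline = baseline_results[query]       # untrained ordering
        trained = rerank(query, baseline, hits)  # ordering after training
        trained_hits += len(relevant & set(trained[:k]))
        baseline_hits += len(relevant & set(baseline[:k]))
    return trained_hits, baseline_hits


# Hypothetical toy data: three users, one query, three candidate pages.
judgments = {
    "a": {"q": {"p3"}},
    "b": {"q": {"p3"}},
    "c": {"q": {"p3"}},
}
baseline_results = {"q": ["p1", "p2", "p3"]}

print(evaluate_held_out_user("c", judgments, baseline_results, k=1))  # (1, 0)
```

Varying `k` corresponds to the different result-list sizes mentioned in the text. Whether each user is held out in turn or only a single split is used is not stated here, so the sketch evaluates one held-out user at a time.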

Some application designers also let “real” users use their systems and evaluated the effectiveness of those systems from these users’ feedback and/or their success in solving their “real” problems with them. For instance, Twidale, Nichols and Paice (1995) invited volunteers to bring a problem that they already had to solve. Students from a wide range of academic backgrounds (including Psychology, Computing, Women’s Studies, Chemistry, Religious Studies, and Environmental Science) used their Ariadne system. The typical case was that they were about to write an extended essay, dissertation, or group project and needed to do a literature search. The testing informed the iterative development of the system.

Prekop (2002) presented a qualitative way of evaluating collaborative information seeking studies, based on identifying information seeking patterns. These patterns describe prototypical actions, interactions, and behaviors performed by participants in a collaborative endeavor. The three patterns that the author described were information seeking by recommendation, direct questioning, and advertising information paths. Along similar lines of studying participants by analyzing their behavior and patterns, Olson et al. (1992) studied 10 design meetings from four projects in two organizations. The meetings were videotaped, transcribed, and then analyzed using a coding scheme that looked at participants’ problem solving and the activities they used to coordinate and manage themselves. The authors also analyzed the structure of the participants’ design arguments. They claimed that the coding schemes developed may be useful for a wide range of problem-solving meetings other than design.

Wilson and schraefel (2008) analyzed an evaluation framework for information seeking interfaces in terms of its applicability to collaborative search software. Extending Bates’ tactics model (Bates, 1979) and Belkin’s model of users (Belkin et al., 1993), they showed that the framework can be applied to collaborative search interactions just as easily as to individual information seeking software, but pointed out that there are additional considerations about the individual’s involvement within a group that must be maintained as the assessment is carried out.

These efforts to evaluate various factors in CIS can be summarized as measuring (1) retrieval performance of the system, (2) effectiveness of the interface in facilitating collaboration, and (3) user satisfaction and involvement. Despite these efforts, there is still a lack of clarity about, and methods for, evaluating CIS environments in ways that measure factors such as learning, user engagement, and group performance. Given this, the research reported here can provide a valuable contribution with proposals and demonstrations of various evaluation metrics for collaborative systems.
