Theoretical techniques for performance evaluation

Chapter 2 Background literature

8 Comparative review of related work

8.3 Theoretical techniques for performance evaluation

This thesis proposes the best-case analysis, a new theoretical approach to evaluating the potential of a new application. It uses the GOMS and information foraging theory modelling techniques to compute the best performance that could be achieved by a human user of the tool under evaluation. This performance is then compared to the best-case performance, derived empirically or theoretically, of a competing alternative. This section reviews a number of the predecessors of the best-case analysis. These have emerged from Xerox PARC, which has pioneered the use of theoretical analysis to determine the performance of interactive software since the publication of the seminal book ‘The psychology of human-computer interaction’ (Card, Moran & Newell 1983).


These techniques are reviewed and their shortcomings in relation to the kind of evaluation performed for this thesis are discussed.

One theoretical approach to comparing the performance of systems is to compare their cost structures by determining their ‘cost of knowledge functions’. Such a function defines how the number of items accessible varies as a function of cost in time (Card, Pirolli & Mackinlay 1994). This approach was used to compare two calendar programs. Empirical measurement and regression analysis were used to estimate the time taken to access a day as a function of how distant it was from the current date (Card, Pirolli & Mackinlay 1994). Since the visualisation being evaluated was slower than the competition for all but very distant dates, design improvements were considered. Parameters in the cost of knowledge function were changed to simulate the effects of proposed enhancements; this revealed that the visualisation would still underperform even with the improvements. The essential limitation of this approach is that in many information retrieval applications the informativeness of retrieved items matters more than the sheer number that are readily available. As such, it does not offer a particularly informative approach to examining the cost structure for satisfying an information need.
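As an illustration of the idea, a cost of knowledge function can be tabulated from per-item access times: for each time budget, count the items that can be reached within it. The following is a minimal sketch only. The access times, the two designs and the function name are hypothetical and are not taken from Card, Pirolli & Mackinlay (1994).

```python
from bisect import bisect_right

def cost_of_knowledge(access_times, budgets):
    """For each time budget, count the items whose access time fits within it."""
    ordered = sorted(access_times)
    # bisect_right gives the number of access times <= the budget.
    return [bisect_right(ordered, budget) for budget in budgets]

# Hypothetical access times (seconds) for the same five days in two designs.
design_a = [1.0, 2.0, 2.5, 4.0, 9.0]
design_b = [3.0, 3.5, 4.0, 4.5, 5.0]
budgets = [2.0, 4.0, 6.0]

curve_a = cost_of_knowledge(design_a, budgets)  # items reachable per budget
curve_b = cost_of_knowledge(design_b, budgets)
```

With these invented figures, design A makes more items accessible under small budgets while design B overtakes it under the largest budget. The function only counts items and says nothing about how informative each item is, which is the limitation noted above.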

GOMS has been used to compare the performance of an application with existing competition and to explore the performance of alternative designs. The support of two packages (Splus and TableLens) for exploratory data analysis tasks was compared by predicting the time taken for a range of elementary tasks (Pirolli & Rao 1996), such as finding a variable’s median and judging the shape of its distribution. GOMS models were constructed to predict the time taken to do these tasks with each package as a function of the size of the variable set being examined in the task. These models of

‘finding the important features of the variables’). These were used to plot graphs of the change in task time for each of the packages as the task is repeated with an increasingly large set of variables. This comparison demonstrated that the information visualisation tool, which the researchers had developed, offered faster or comparable performance to the existing Splus package.
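The way in which a GOMS-style model yields task time as a function of the number of variables can be sketched as a sum of operator durations: a fixed setup method plus a method repeated once per variable. The sketch below is an assumption-laden illustration. The operator durations are placeholder values, and the two packages and their methods are hypothetical rather than the models built by Pirolli & Rao (1996).

```python
# Illustrative operator durations in seconds (placeholders, not measurements).
# M = mental preparation, P = point with mouse, K = keystroke, H = home hands.
OPERATOR_TIME = {"M": 1.35, "P": 1.1, "K": 0.2, "H": 0.4}

def method_time(operators):
    """Sum the durations of a sequence of operators, e.g. 'MPK'."""
    return sum(OPERATOR_TIME[op] for op in operators)

def task_time(setup, per_variable, n_variables):
    """Fixed setup cost plus a method that is repeated for every variable."""
    return method_time(setup) + n_variables * method_time(per_variable)

# Hypothetical methods: package A types a short command for each variable,
# package B needs a longer setup but only one pointing action per variable.
package_a = {"setup": "MPK", "per_variable": "MKKKK"}
package_b = {"setup": "MPKPK", "per_variable": "P"}

def predict(package, n_variables):
    """Predicted error-free expert time for a task over n_variables."""
    return task_time(package["setup"], package["per_variable"], n_variables)
```

Calling `predict` over a range of set sizes gives the kind of curve described above. With these invented methods, package A is slightly faster for a single variable and package B becomes faster as the variable set grows.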

In general, the GOMS models used in these comparisons represent best-case performances: the best methods are selected for accomplishing each task, performance is error free, and the time estimates are derived from measurement of well-practised routine tasks. Crucially, the tasks modelled with GOMS are entirely straightforward in that the best procedure can be specified. This is unlike many information retrieval tasks involving information visualisations. In these, a set of interactions leading to the goal forms one of many possible paths through a large interaction (or problem) space. A visualisation’s ability to inform the user about which paths are shortest and most fruitful is key to its success. This ability must be tested by any theoretical tools that purport to evaluate visualisations that support information seeking.

One approach to dealing with the large interaction spaces inherent in information seeking tools was reported by Pirolli (1998). This used information foraging theory and dynamic programming to determine the optimal path through the interaction space. This approach was used to explore two hypothetical improvements to the Scatter/Gather system, described in 6.4, under different task conditions. These improvements and task conditions were incorporated into models of the Scatter/Gather system that defined the costs (time) and benefits (number of relevant documents) of the various interaction choices. Dynamic programming was used to search through the interaction space for the most profitable interaction path. By this means, a direct comparison of the alternative designs and task conditions could be made. Any performance differences observed result directly from differences in the design or task conditions. For example, Pirolli found that improving the clustering quality by 25% produced superior performance when compared with doubling the clustering speed. However, when there was only a short time in which to complete the task, doubling the clustering speed produced better performance. This demonstrates how the coupling of dynamic programming and information foraging theory can be used to calculate an absolute optimum performance achievable with an interface. However, it takes no account of some important limitations of the interface and of the human operator. Any information seeking tool displays a limited amount of information scent for use in making interaction choices. Thus, unlike a dynamic programming algorithm, users are limited in the accuracy of their estimates of information value and the likely costs associated with alternative options. The actual interaction paths chosen will depend on the nature of the information scent displayed. Since an approach based on dynamic programming ignores this factor and calculates the absolute optimum performance, it is not suitable for computing the upper bound on the performance a human operator of the tool might achieve. For this reason a dynamic programming approach cannot be used in a best-case analysis, since in a best-case analysis the theoretically calculated performance is compared to empirically measured expert performance on a competing tool. For a best-case analysis a theoretical approach is needed which can calculate the upper bound on performance achievable by a human user.
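The search for the most profitable interaction path can be illustrated with a small dynamic programme over a toy interaction space. The sketch below is not Pirolli's model. It assumes only two hypothetical actions, a clustering step that costs time and raises the yield of later scanning, and a scanning step that costs time and returns relevant documents. All costs and yields are invented, and times are kept as whole numbers so that states can be memoised.

```python
from functools import lru_cache

def best_gain(time_budget, cluster_cost, scan_cost, precision):
    """Maximum expected number of relevant documents within the time budget.

    precision[k] is the (hypothetical) expected number of relevant documents
    returned by one scan after k clustering steps.
    """
    max_level = len(precision) - 1

    @lru_cache(maxsize=None)
    def value(time_left, level):
        best = 0.0  # stopping is always an option
        if time_left >= scan_cost:
            # Scan at the current level and continue from the same level.
            best = max(best, precision[level] + value(time_left - scan_cost, level))
        if level < max_level and time_left >= cluster_cost:
            # Spend time clustering to reach a higher-yield level.
            best = max(best, value(time_left - cluster_cost, level + 1))
        return best

    return value(time_budget, 0)

# Hypothetical task: clustering costs 4 time units, scanning costs 1.
long_task = best_gain(10, cluster_cost=4, scan_cost=1, precision=(1, 2, 4))
short_task = best_gain(3, cluster_cost=4, scan_cost=1, precision=(1, 2, 4))
```

Design alternatives and task conditions are simulated by changing `cluster_cost`, `precision` or `time_budget` and comparing the returned optima. The recursion is given the true yield of every option at every step, which is the perfect knowledge that a human user relying on displayed information scent does not have.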

There is, however, an existing technique which could be used to compute the upper bound on performance achievable by a human user of an information retrieval device. This is typified by the integration of information foraging theory into ACT-R to create ACT-IF, which has been used to model the use of Scatter/Gather (Pirolli & Card 1999) and produces a reasonable fit to user data (see section 5.2.3). It was therefore decided to use the same approach in the best-case analysis described in this thesis. Here, however, it was not necessary to use ACT-R, since its comprehensive scope is not needed. The model will only be applied to one tool and thus only some aspects of the user need to be modelled. In contrast, Pirolli and his colleagues at Xerox PARC need ACT-R’s comprehensive scope since they intend to model a wide range of information seeking applications and phenomena. For example, they have recently extended ACT-IF to model eye gaze in web page use (Pirolli et al. 2002).

8.4 Section summary

A number of visualisations are reviewed which attempt to aid navigation with miniaturised images of the document, such as those used in GridVis, the visualisation tool developed in this research. Although there is no direct evaluation of the utility of such images, their wide adoption suggests that they are effective. Techniques for visualising content distribution are reviewed. It is concluded that the keyword highlighting technique is not able to clearly represent a large number of different keywords at once, as is required for paragraph-level metadata. The TileBars visualisation can, however, be extended for use with an arbitrary number of items, as is done in GridVis. The successful use of outlining to support search and navigation in a number of within-document navigation tools is described. The disadvantages of outlining relative to paragraph-level metadata for work-place users are highlighted. The navigational affordances developed in research on hypertext are discussed. The similarity of the one-to-many links used in the indexes of hypertext books to the use of paragraph-level metadata for business documents is noted. The mixed results obtained with this form of link suggest that


metadata in GridVis, might be desirable. Thus, the techniques developed for within-document search and their relationship to the work done in this thesis have been reviewed.

The theoretical evaluation techniques used in the construction of expert performance are reviewed. Approaches used to compute a tool’s cost structure are problematic in their concentration on the number of items obtainable per unit cost. This excludes any notion of item quality and cannot be used to calculate performance metrics for comparison with empirical data, as required for a best-case analysis. The use of GOMS is reviewed, but it is noted that alone it is not able to simulate the selection of paths through the large interaction spaces involved in information retrieval devices. The use of information foraging theory would allow this to be accomplished. However, when used with dynamic programming the absolute optimum performance is calculated rather than the upper bound on the performance of a human operator. A cognitive modelling approach, as typified by the use of ACT-R, can be used to simulate human ‘best-case’ performance. It is argued, however, that the use of the full ACT-R architecture is not necessary for the simulation of performance on a single tool, since the model will not need to be adapted for use with other tools. Thus in the best-case analysis described in chapter 6, information foraging theory is used with GOMS models to model human best-case performance.



In document Supporting document use through interactive visualisation of paragraph-level metadata. (Page 127-134)