Chapter 2 Background literature
8 Comparative review of related work
8.3 Theoretical techniques for performance evaluation
This thesis proposes the best-case analysis, a new theoretical approach to evaluating the
potential of a new application. This involves using the GOMS and information foraging
theory modelling techniques to compute the best performance that could be achieved by a human user of the tool under evaluation. This performance is then compared to the
best-case performance, derived empirically or theoretically, of a competing alternative. This section reviews a number of the predecessors of the best-case analysis. These have
emerged from Xerox PARC, which has pioneered the use of theoretical analysis to
determine the performance of interactive software since the publication of the seminal
book ‘The psychology of human-computer interaction’ (Card, Moran & Newell 1983).
reviewed and their shortcomings in relation to the kind of evaluation performed for
this thesis are discussed.
One theoretical approach to comparing the performance of systems is to compare their
cost structures by determining their ‘cost of knowledge functions’. Such a function defines how the
number of items accessible varies as a function of cost in time (Card, Pirolli &
Mackinlay 1994). This approach was used to compare two calendar programs.
Empirical measurement and regression analysis were used to estimate the time taken to
access a day as a function of how distant it was from the current date (Card, Pirolli &
Mackinlay 1994). Since the visualisation being evaluated was slower than the
competition for all but very distant dates, design improvements were considered. Parameters in the cost of knowledge function were changed to simulate the effects of
proposed enhancements; this revealed that the visualisation would still underperform even with the improvements. The essential limitation of this approach is that in many
information retrieval applications the informativeness of retrieved items matters more than the sheer number that are readily available. As such, it does not offer
a particularly informative approach to examining the cost structure for satisfying an
information need.
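To make the idea of a cost of knowledge function concrete, the following Python sketch fits a straight line to access-time measurements and then counts how many items become accessible within a given time budget. It is an illustration only: the measurements, the linear form of the regression, and the names fit_line and cost_of_knowledge are assumptions introduced here, not data or methods taken from Card, Pirolli & Mackinlay (1994).

```python
from bisect import bisect_right


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def cost_of_knowledge(access_times):
    """Return a function mapping a time budget to the number of items
    whose access time does not exceed that budget."""
    ordered = sorted(access_times)

    def items_within(budget):
        return bisect_right(ordered, budget)

    return items_within


# Hypothetical measurements: seconds taken to reach a day that lies
# `distance` days away from the current date.
distances = [0, 1, 2, 3, 4]
times = [2.0, 3.0, 4.0, 5.0, 6.0]

intercept, slope = fit_line(distances, times)

# Predicted access time for each of the next ten days (assumed linear).
predicted = [intercept + slope * d for d in range(10)]
items_within = cost_of_knowledge(predicted)

for budget in (2.0, 5.0, 11.0):
    print(budget, items_within(budget))
```

Changing intercept or slope before rebuilding the function is one way to imitate the simulation of a proposed design enhancement. The sketch also shows the limitation noted above: the function counts items and says nothing about how informative each item is.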
GOMS has been used to compare the performance of an application to existing
competition and explore the performance of alternative designs. The support of two
packages (Splus and TableLens) for exploratory data analysis tasks was compared by
predicting the time taken for a range of elementary tasks (Pirolli & Rao 1996) such as
finding a variable’s median and judging the shape of its distribution. GOMS models were constructed to predict the time taken to do these tasks with each package as a
function of the size of the variable set being examined in the task. These models of
‘finding the important features of the variables’). These were used to plot graphs of the
change in task time for each of the packages as the task is repeated with an increasingly
large set of variables. This comparison demonstrated that the information visualisation
tool, which the researchers had developed, offered faster or comparable performance to
the existing Splus package.
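The style of comparison described above can be sketched in Python as follows. The sketch uses commonly quoted operator estimates from the Keystroke-Level Model, a simplified member of the GOMS family (Card, Moran & Newell 1983). The two tool descriptions, their operator sequences and the function names are hypothetical. They do not reproduce the Splus or TableLens models of Pirolli & Rao (1996).

```python
# Commonly quoted Keystroke-Level Model estimates, in seconds.
OPERATOR_SECONDS = {
    "K": 0.2,   # press a key (average skilled typist)
    "P": 1.1,   # point at a target with the mouse
    "H": 0.4,   # move hands between keyboard and mouse
    "M": 1.35,  # mental preparation
}


def method_time(operators):
    """Sum the operator times for a sequence such as 'MPK'."""
    return sum(OPERATOR_SECONDS[op] for op in operators)


def task_time(tool, n_variables):
    """Predicted error-free expert time for a task over n variables."""
    setup = method_time(tool["setup"])
    per_variable = method_time(tool["per_variable"])
    return setup + n_variables * per_variable


# Hypothetical command-line tool: think, then type a ten-character
# command for every variable examined.
COMMAND_TOOL = {"setup": "", "per_variable": "M" + "K" * 10}

# Hypothetical visual tool: one setup interaction, then a single
# pointing action per variable.
VISUAL_TOOL = {"setup": "MPK", "per_variable": "P"}

for n in (1, 2, 5, 10):
    print(n, task_time(COMMAND_TOOL, n), task_time(VISUAL_TOOL, n))
```

With these assumed sequences the command-line tool is slightly faster for a single variable, and the visual tool is faster for every larger set. Plotting the two columns against n would give a graph of the kind described above.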
In general, the GOMS models used in these comparisons represent best-case
performances: the best methods are selected for accomplishing each task, performance
is error free, and the time estimates are derived from measurement of well-practised
routine tasks. Crucially, the tasks modelled with GOMS are entirely straightforward in
that the best procedure can be specified. This is unlike many information retrieval tasks involving information visualisations. In these, a set of interactions leading to the goal
forms one of many possible paths through a large interaction (or problem) space. A visualisation’s ability to inform the user about which paths are shortest and most fruitful
is key to its success. This ability must be tested by any theoretical tools that purport to
evaluate visualisations that support information seeking.
One approach to dealing with the large interaction spaces inherent in information seeking tools was reported by Pirolli (1998). This used information foraging
theory and dynamic programming to determine the optimal path through the interaction
space. This approach was used to explore two hypothetical improvements to the
Scatter/Gather system, described in section 6.4, under different task conditions. These
improvements and task conditions were incorporated into models of the Scatter/Gather
system that defined the costs (time) and benefits (number of relevant documents) of the
various interaction choices. Dynamic programming was used to search through the
interaction space for the most profitable interaction path. By this means, a direct
made. Any performance differences observed result directly from differences in the
design or task conditions. For example, Pirolli found that improving the clustering
quality by 25% produced superior performance when compared with doubling the clustering speed. However, when there was only a short time in which to complete the
task, doubling the clustering speed produced better performance. This demonstrates how
the coupling of dynamic programming and information foraging theory can be used to
calculate an absolute optimum performance achievable with an interface. However, it
takes no account of some important limitations of the interface and of the human operator. Any information seeking tool displays a limited amount of information scent for
use in making interaction choices. Thus, unlike a dynamic programming algorithm,
users are limited in the accuracy of their estimates of information value and the likely
costs associated with alternative options. The actual interaction paths chosen will depend on the nature of the information scent displayed. Since an approach based on
dynamic programming ignores this factor and calculates the absolute optimum performance, it is not suitable for computing the upper bound on the performance a
human operator of the tool might achieve. For this reason, a dynamic programming
approach cannot be used in a best-case analysis, since in a best-case analysis the theoretically calculated performance is compared to empirically measured expert
performance on a competing tool. For a best-case analysis, a theoretical approach is
needed which can calculate the upper bound on performance achievable by a human
user.
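The distinction between the absolute optimum and the performance of a user who must rely on displayed information scent can be illustrated with a small Python sketch. The interaction space, its costs and gains, the scent values and the function names are all hypothetical. They are not taken from Pirolli's Scatter/Gather models, and the scent-following rule is only one simple assumption about how a user might choose.

```python
from functools import lru_cache

# Hypothetical interaction space:
# state -> [(next_state, time_cost, relevant_documents_gained)]
ACTIONS = {
    "start": [("cluster_a", 2, 1), ("cluster_b", 2, 4)],
    "cluster_a": [("read_a", 3, 6)],
    "cluster_b": [("read_b", 3, 1)],
    "read_a": [],
    "read_b": [],
}


@lru_cache(maxsize=None)
def optimal_gain(state, time_left):
    """Dynamic programming: best total gain reachable from `state`
    within `time_left`, assuming perfect knowledge of costs and gains."""
    best = 0
    for next_state, cost, gain in ACTIONS[state]:
        if cost <= time_left:
            best = max(best, gain + optimal_gain(next_state, time_left - cost))
    return best


def scent_guided_gain(state, time_left, scent):
    """Total gain for a user who always takes the affordable option
    with the strongest displayed scent, without looking ahead."""
    total = 0
    while True:
        affordable = [a for a in ACTIONS[state] if a[1] <= time_left]
        if not affordable:
            return total
        next_state, cost, gain = max(affordable, key=lambda a: scent[a[0]])
        total += gain
        time_left -= cost
        state = next_state


# Hypothetical scent: the display makes cluster_b look more promising,
# although the richer documents lie behind cluster_a.
SCENT = {"cluster_a": 0.3, "cluster_b": 0.8, "read_a": 0.9, "read_b": 0.2}

print(optimal_gain("start", 5))              # absolute optimum
print(scent_guided_gain("start", 5, SCENT))  # scent-limited user
print(optimal_gain("start", 2))              # optimum under a short deadline
```

In this toy space the dynamic programming result exceeds what the scent-following user obtains, which is why the former overstates the upper bound on human performance. The third call shows that the best first choice can change when little time is available.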
There is, however, an existing technique that could be used to compute the upper
bound on performance achievable by a human user of an information retrieval device.
This is typified by the integration of information foraging theory into ACT-R to create
of Scatter/Gather (Pirolli & Card 1999) and produces a reasonable fit to user data (see
section 5.2.3). It was therefore decided to use the same approach in the best-case
analysis described in this thesis. Here, however, it was not necessary to use ACT-R,
since its comprehensive scope is not needed. The model will only be applied to one
tool and thus only some aspects of the user need to be modelled. In contrast, Pirolli and
his colleagues at Xerox PARC need ACT-R’s comprehensive scope since they intend
to model a wide range of information seeking applications and phenomena. For example,
they have recently extended ACT-IF to model eye gaze in web page use (Pirolli et al. 2002).
8.4 Section summary
A number of visualisations are reviewed which attempt to aid navigation with miniaturised images of the document, such as those used in GridVis, the visualisation
tool developed in this research. Although there is no direct evaluation of the utility of such images, their wide adoption suggests that they are effective. Techniques for visualising
content distribution are reviewed. It is concluded that the keyword highlighting
technique is not able to clearly represent a large number of different keywords at once, as is required for paragraph-level metadata. The TileBars visualisation can, however, be
extended for use with an arbitrary number of items, as is done in GridVis. The successful
use of outlining to support search and navigation in a number of within-document
navigation tools is described. The disadvantages of outlining relative to paragraph-level
metadata for work-place users are highlighted. The navigational affordances developed
in research on hypertext are discussed. The similarity of the one-to-many links used in
the indexes of hypertext books to the use of paragraph-level metadata for business
documents is noted. The mixed results obtained with this form of link suggest that
metadata in GridVis, might be desirable. Thus, the techniques developed for within-document search and their relationship to the work done in this thesis have been
reviewed.
The theoretical evaluation techniques used in the construction of expert performance are
reviewed. Approaches used to compute a tool’s cost structure are problematic in their
concentration on the number of items obtainable per unit cost. This excludes any notion of
item quality and cannot be used to calculate performance metrics for comparison with
empirical data, as required for a best-case analysis. The use of GOMS is reviewed, but it
is noted that on its own it is not able to simulate the selection of paths through the large interaction spaces involved in information retrieval devices. The use of information
foraging theory would allow this to be accomplished. However, when it is used with
dynamic programming, the absolute optimum performance is calculated rather than the upper bound on the performance of a human operator. A cognitive modelling approach,
as typified by the use of ACT-R, can be used to simulate human ‘best-case’ performance. It is argued, however, that the use of the full ACT-R architecture is not
necessary for the simulation of performance on a single tool, since the model will not need to be adapted for use with other tools. Thus, in the best-case analysis described in
chapter 6, information foraging theory is used with GOMS models to model human
best-case performance.