COMS W4172
Evaluation
Steven Feiner
Department of Computer Science
Columbia University
New York, NY 10027
www.cs.columbia.edu/graphics/courses/csw4172
April 8, 2021
Why Evaluate?
Is this the best way to accomplish a given
task?
What are the task attributes (in particular,
known problem areas) that make the most
difference?
Evaluation Methods
Cognitive Walkthrough
Polson et al. 92
Analogy to code walkthrough
Inputs
Description of the UI (design sketch; running system not needed)
Task scenario
Assumptions about knowledge a user brings to the task
Specific actions a user must perform to accomplish the task with UI
Evaluator(s) examine each step in the correct action
sequence, asking
Will user try to achieve the right effect?
Will user notice that the correct action is available?
Will user associate correct action with effect user is trying to achieve?
Will user see that progress is being made toward task solution if correct action is performed?
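One way to record a walkthrough is to log a yes/no answer to each of the four questions above for every step in the action sequence, then list the steps where any answer is "no". A minimal sketch; the `StepReview` record and the example actions are hypothetical, not part of the published method:

```python
from dataclasses import dataclass

# The four questions asked at each step of the correct action sequence.
QUESTIONS = (
    "Will user try to achieve the right effect?",
    "Will user notice that the correct action is available?",
    "Will user associate correct action with effect user is trying to achieve?",
    "Will user see that progress is being made toward task solution?",
)

@dataclass
class StepReview:
    action: str      # one step in the correct action sequence
    answers: tuple   # four booleans, in the order of QUESTIONS

def problem_steps(reviews):
    """Return (action, failed_questions) for every step with at least one 'no'."""
    problems = []
    for review in reviews:
        if len(review.answers) != len(QUESTIONS):
            raise ValueError("each step needs exactly four answers")
        failed = [q for q, ok in zip(QUESTIONS, review.answers) if not ok]
        if failed:
            problems.append((review.action, failed))
    return problems

# Hypothetical walkthrough of a two-step selection task.
reviews = [
    StepReview("point controller at object", (True, True, True, True)),
    StepReview("squeeze trigger to select", (True, False, True, True)),
]
for action, failed in problem_steps(reviews):
    print(action, "->", failed)
```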
Evaluation Methods
Heuristic Evaluation
Nielsen & Molich 92
Expert evaluators evaluate prototype
individually by comparing it with heuristic
guidelines
Evaluation Methods
Formative Evaluation
Performed during system evolution
Repeat until satisfied {
Representative users try system in
task-based scenarios
Informal ↔ formal
Identify problems
Modify system to address problems
}
Evaluation Methods
Summative Evaluation
Performed after system complete
Representative users try multiple designs
in task-based scenarios
Informal ↔ formal
Evaluation Methods
Questionnaires, Tests
Demographic information
Age, gender, profession, experience,…
Physical/mental abilities (e.g., dominant hand, dominant eye, color vision, stereo vision)
Subjective data
Preferences
Ratings using Likert scale
Free-form comments
Likert-scale question from Presence Questionnaire, B. Witmer & M. Singer, 1998
Stereo Fly Test, http://www.stereooptical.com
Ishihara Color Test, PseudoIsochromatic Plate (PIP) Test
Evaluation Methods
Questionnaires: NASA TLX
(Task Load Index)
Subjective workload assessment tool
https://hsi.arc.nasa.gov/groups/tlx/
Six scales, normalized to 0–100
Scales are first weighted per subject and task through 15 (= 6×5/2) binary comparisons of relative importance
Each scale weighted 0–5 of 15 total
Sum of weighted scales is divided by 15
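The weighting and scoring steps can be sketched in a few lines. This is a minimal illustration, not an official NASA implementation: the function names and the input format (one "winner" per pair, in `itertools.combinations` order) are assumptions; the six scale names are those of the TLX instrument.

```python
from itertools import combinations

# The six TLX scales; each is rated 0-100 by the subject.
SCALES = ("mental demand", "physical demand", "temporal demand",
          "performance", "effort", "frustration")

def tlx_weights(winners):
    """Tally the 15 pairwise comparisons into per-scale weights (0-5 each).

    `winners` lists, for each pair in combinations(SCALES, 2) order,
    the scale the subject judged more important for the task.
    """
    pairs = list(combinations(SCALES, 2))  # 6*5/2 = 15 pairs
    if len(winners) != len(pairs):
        raise ValueError("expected one choice per pair (15 in total)")
    weights = dict.fromkeys(SCALES, 0)
    for pair, winner in zip(pairs, winners):
        if winner not in pair:
            raise ValueError(f"{winner!r} is not a member of pair {pair}")
        weights[winner] += 1
    return weights

def tlx_score(ratings, weights):
    """Weighted workload: sum of rating * weight, divided by 15."""
    return sum(ratings[s] * weights[s] for s in SCALES) / 15

# Hypothetical subject who always prefers the scale listed first in a pair.
winners = [first for first, _ in combinations(SCALES, 2)]
weights = tlx_weights(winners)
ratings = dict(zip(SCALES, (80, 20, 60, 40, 50, 30)))
print(weights)
print(tlx_score(ratings, weights))
```

Because the weights always sum to 15, the weighted score stays on the same 0–100 range as the individual ratings.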
Evaluation Methods
Questionnaires:
igroup Presence Questionnaire
Subjective presence assessment tool
http://www.igroup.org/pq/ipq/index.php
One example of a questionnaire intended to measure presence: the subjective sense of being in an environment
Note: This is the English translation of a German questionnaire
Evaluation Methods
Interviews
Direct interaction by interviewer with subject
Structured Semi-Structured Unstructured
Structured: Fully standardized set of questions
Semi-structured: Based on a guide/framework, but
with the ability to explore and improvise
Unstructured: Open, with complete freedom to
Metrics for Evaluation
Time to learn
Time to use
Implies benchmark task(s)
Errors
How many?
What kind?
How important?
Skill retention
For how long?
Frequent vs. casual user
User impressions
Does user like the system?
Subjective impressions of the other factors
Presence
Comfort (e.g., cybersickness)
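Time to use and error metrics can be summarized directly from logged benchmark trials. A minimal sketch, assuming a hypothetical log format in which each trial is a dict with a completion time in seconds and a list of error kinds; the error labels below are illustrative only:

```python
from collections import Counter
from statistics import mean

def summarize_trials(trials):
    """Summarize benchmark trials: mean time, error count, errors by kind."""
    if not trials:
        raise ValueError("need at least one trial")
    times = [t["seconds"] for t in trials]
    # Count every logged error across all trials, grouped by kind.
    errors = Counter(kind for t in trials for kind in t["errors"])
    return {
        "mean_time": mean(times),             # time to use
        "error_count": sum(errors.values()),  # how many?
        "errors_by_kind": dict(errors),       # what kind?
    }

# Hypothetical log from three trials of one benchmark task.
trials = [
    {"seconds": 12.0, "errors": []},
    {"seconds": 18.0, "errors": ["wrong target"]},
    {"seconds": 15.0, "errors": ["wrong target", "dropped object"]},
]
print(summarize_trials(trials))
```

How important each error is remains a judgment call by the evaluator; a summary like this only covers the counts and kinds.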
Metrics for Evaluation
Objective measures of presence and comfort
Physiologic response
Meehan et al. compared users’ responses to stressful and non-stressful virtual rooms
Heart rate change correlated well
Skin conductance change correlated less well
Skin temperature change not as effective
Metrics for Evaluation
Objective measures of 3D motion
Head motion
Comparing user head position/orientation when doing 18 sequential maintenance tasks whose documentation is presented on
tracked HWD (AR)
stationary LCD (LCD)
S. Henderson and S. Feiner. Exploring the benefits of augmented reality documentation for maintenance and repair. IEEE Transactions on Visualization and Computer Graphics, 17(10), 2011.
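One simple way to turn tracked head poses into comparable numbers is to accumulate translation and rotation over a task. The sketch below is an illustration only and is not claimed to be the analysis used in the cited paper; it assumes positions are (x, y, z) tuples and orientations are unit quaternions (w, x, y, z), sampled in time order.

```python
import math

def path_length(positions):
    """Total distance travelled by the head across consecutive samples."""
    return sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))

def total_rotation_deg(quaternions):
    """Sum of the rotation angles between consecutive unit quaternions."""
    total = 0.0
    for q0, q1 in zip(quaternions, quaternions[1:]):
        # |dot| treats q and -q as the same orientation; clamp for safety.
        dot = abs(sum(a * b for a, b in zip(q0, q1)))
        total += 2.0 * math.acos(min(1.0, dot))
    return math.degrees(total)

# Hypothetical samples: move 5 units, then 12 units; turn 90 degrees about z.
positions = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
half = math.radians(90.0) / 2.0
quaternions = [(1.0, 0.0, 0.0, 0.0), (math.cos(half), 0.0, 0.0, math.sin(half))]
print(path_length(positions), total_rotation_deg(quaternions))
```

Computing these totals per condition (for example AR vs. LCD) gives one objective basis for comparing how much head motion each display demands.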
Evaluation Issues for 3DUI
Need to avoid evaluator intruding on
subject, affecting subject’s sense of
presence
Need to assist subject with unfamiliar
equipment
Evaluation Issues for 3DUI
Device limitations: trackers, displays
Device variations within class
E.g., different kinds of trackers, displays
May have a much greater effect than variations in
2D devices because of range of technologies, lack
of standardization
Case Study: Balloon Selection
Benko & Feiner, 3DUI 2007