We compare how well participants completed the four subtasks of Task 1 with each of the
two systems. An answer is considered invalid if any term in the selected combination does
not match the original meaning and purpose of the question. For example, in Task 1.1,
where we wanted the participants to find a term for “company”, some participants chose
properties like dbprop:companyType or dbprop:companyLogo-. These terms, although
they partially match the keywords, do not have the same semantics as the query purpose
(they also would not retrieve as many instances). Thus we count these answers as invalid
combinations, no matter how many instances they retrieve.
For each question, there is a best combination of terms, which retrieves the largest number
of instances that match the purpose of the query. For each valid participant answer, we also
record how many instances it retrieves. We define the coverage as the ratio between the
number of instances retrieved by a valid answer and the number retrieved by the best
combination (see footnote 1). We compute the average coverage ratio of the valid answers by
question and system, and report the results in Table 5.1. For each subtask and each system,
the first row gives the number of attempts, i.e. the number of participants who answered
that question with that system; the second row gives the number of valid participant
answers; and the third row gives the ratio of the second row to the first. In the last row
we report the average of the coverage ratios, computed only over the valid answers. Note
that the final column provides adjusted results for Task 1.4, based on an ideal
interpretation of the participants’ answers, as will be explained later.
Table 5.1: How well do participants finish Task 1 with both systems?
                         Task 1.1    Task 1.2    Task 1.3    Task 1.4    Task 1.4*
                         HL    TC    HL    TC    HL    TC    HL    TC    HL    TC
# of Attempts            8     6     7     7     6     8     7     7     7     7
# of Valid Answers       5     4     7     7     3     4     6     6     6     6
Ratio of Valid Answers   0.63  0.67  1     1     0.5   0.5   0.86  0.86  0.86  0.86
Avg. Valid Coverage      0.63  0.58  0.89  1     0.66  1     0.49  0.47  0.99  0.93
Footnote 1: We assume that all such instances are valid, even though it is likely that some
terms have erroneous instances. If we assume that such errors are distributed uniformly,
this metric still records the best combination.
Generally speaking, we find no obvious difference between the two systems in whether a
participant finds a valid answer. We think that is because whether a participant can
provide a valid answer depends mostly on how well they understand the general purpose of
the task (or the way of querying structured datasets). The results also show that the TC
system sometimes has considerably better average coverage for the valid answers. In Task
1.2 and Task 1.3, TC gets full coverage, which means that participants always found the
best combinations with TC in these two questions, as long as they correctly interpreted
the purpose of the questions. In Tasks 1.1 and 1.4, TC users have slightly lower coverage
than HL users. We investigated the two Task 1.1 answers given via the TC system that do
not get full coverage, and found that both are the same: the user chose
yago:Company108058098 instead of dbpediaowl:Company as the term for “company”. The former
class has 5656 instances in the dataset while the latter has 33747. The difference in
instance counts is large; however, in the TC interface the font sizes for these two tags
are 28.3 px and 31.6 px respectively, because a log function maps instance counts to font
sizes. When these two tags are not placed side by side, the difference in font sizes is
hard to discern by eye, especially since the former has more characters, which makes it
appear to take up more space. It might be that participants using the TC system for the
first time assume that any difference in instance counts will be clearly reflected in the
font sizes, or that they do not know they can get the exact counts by hovering the mouse
over the tags. However, in comparison, participants using HL sometimes select
tags with significantly fewer matches, e.g. one participant chose ns6998:Company ( which
Task 1.4 is a relatively tricky question. The best combination
{yago:Actor109765278, dbpediaowl:starring-} has 15225 instances. Since we did not cover
the concept of inverse properties in the tutorial, we believe most of the participants did
not know the difference in semantics between dbpediaowl:starring- and dbpediaowl:starring.
We saw quite a few answers without the inverse notation, and we are not sure whether those
participants found the right term but omitted the “-” when filling in the form. In the
dataset, both dbpediaowl:starring- and dbpediaowl:starring have some instances of actors
as their subjects, although the former has far more than the latter (15225 vs. 6
instances). That means there are data supporting either direction of usage, and we
consider both to be valid answers for the keyword “starring”. Thus we report two columns
for Task 1.4, where the original one has a much lower coverage ratio than the adjusted
one. That is because in the adjusted results we assume that participants who answered
dbpediaowl:starring actually meant dbpediaowl:starring-. Task 1.4 is also tricky because a
“greedy algorithm” fails. When a user searches for “actor”, the largest class,
freebase:film.actor, has 26067 instances, more than yago:Actor109765278 (24760 instances).
However, if the user chooses this “locally optimal” class, the combination
{freebase:film.actor, dbpediaowl:starring-} has only 12175 instances. In fact, the only
two answers from the TC system that do not get full coverage (after the inverse property
adjustment) are exactly {freebase:film.actor, dbpediaowl:starring-}, resulting from the
locally optimal actor class.
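As a rough check on the coverage metric, the following sketch recomputes the adjusted
average coverage for TC in Task 1.4 from the instance counts quoted above. The function
and variable names are ours, and the per-answer list assumes that four of the six valid TC
answers retrieved the best combination, which the text implies but does not list
explicitly.

```python
from statistics import mean


def coverage(retrieved: int, best: int) -> float:
    """Ratio of instances retrieved by a valid answer to those of the best combination."""
    if best <= 0:
        raise ValueError("the best combination must retrieve at least one instance")
    return retrieved / best


def average_coverage(valid_counts, best):
    """Average coverage over valid answers only (invalid answers are excluded beforehand)."""
    return mean(coverage(count, best) for count in valid_counts)


BEST = 15225    # {yago:Actor109765278, dbpediaowl:starring-}
GREEDY = 12175  # {freebase:film.actor, dbpediaowl:starring-}

# Assumed per-answer counts: four best combinations and two greedy ones.
tc_task_1_4_adjusted = [BEST] * 4 + [GREEDY] * 2
avg = average_coverage(tc_task_1_4_adjusted, BEST)
print(f"{avg:.2f}")  # prints 0.93
```

Under that assumption the result rounds to 0.93, which agrees with the TC entry for
Task 1.4* in Table 5.1.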
Figure 5.2: Estimating time spent on each question of Task 1.
[Timeline with time points t1, t2, t3, t4, marking the first request that is definitely relevant to Task 1.x, the first request that is definitely irrelevant to Task 1.x, and the last request that is definitely relevant to Task 1.x.]
systems. We used web request logs to estimate the time spent on each task question.
When a request contained keywords, or tags related to the keywords, from one of the
tasks, we could be confident that it was related to that specific task and mark it
accordingly. For other requests, however, we were not sure whether they were relevant to a
specific task, especially when a request fell between two clusters of requests for two
adjacent tasks. Thus we simply apportion half of the ambiguous time interval to each of
the two tasks. As illustrated in Figure 5.2, we mark the first and last clearly relevant
request of each task question and estimate that each task ends exactly halfway between its
last clearly relevant request and the first clearly relevant request of the next task.
Similarly, we assume that the next task begins at this point.
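The midpoint rule can be sketched as follows. This is only an illustration: the function
name and the sample timestamps are hypothetical, and we assume that the first task starts
at its first clearly relevant request and the last task ends at its last clearly relevant
request, which the description above does not specify.

```python
from typing import List, Tuple


def estimate_task_times(spans: List[Tuple[float, float]]) -> List[float]:
    """Estimate the time spent on each task.

    spans holds, in task order, the timestamps (in seconds) of the first and
    last clearly relevant request of each task. The ambiguous interval between
    two adjacent tasks is split evenly between them.
    """
    if not spans:
        return []
    # Assumption: the first task starts at its first clearly relevant request.
    boundaries = [spans[0][0]]
    for (_, last), (next_first, _) in zip(spans, spans[1:]):
        # A task ends, and the next begins, halfway through the ambiguous interval.
        boundaries.append((last + next_first) / 2)
    # Assumption: the last task ends at its last clearly relevant request.
    boundaries.append(spans[-1][1])
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


# Hypothetical timestamps for three consecutive tasks.
example = [(0.0, 100.0), (120.0, 200.0), (230.0, 300.0)]
print(estimate_task_times(example))  # prints [110.0, 105.0, 85.0]
```

In the example, the 20 s gap between the first two tasks adds 10 s to each of them, and the
30 s gap between the last two adds 15 s to each.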
We plot a box-and-whisker diagram of the time spent on each question with each system
in Figure 5.3. The lower red boxes indicate the range between the first quartile and
the median, the upper blue boxes indicate the range between the median and the third
quartile, and the lower and upper bars indicate the minimum and maximum times. We can see
that participants usually spent less time to complete a task using TC in Task 1 except
systems are very close in this subtask. We also found that one of the time records, by a
participant who used TC, may be estimated with a higher error, because the ambiguous
time interval (transition time) between Task 1.2 and 1.3 for that participant is the highest
among all participants and tasks. The average length of the ambiguous time intervals
is 32.7 s, while the highest is 109 s.
We also checked the points that seem to be outliers, focusing in particular on the
following three: 580.5 s in Task 1.1 with HL, 319.5 s in Task 1.3 with TC, and 252.5 s in
Task 1.4 with TC. We found nothing unexpected: all the request logs show that the
participants were doing something related to their tasks. We also note that these outliers
are not due to our approach to estimating task time: if we exclude the ambiguous time
intervals from the estimation, i.e. we only look at the time between the first and the
last relevant requests, the times are 574 s, 309 s, and 241 s respectively. We believe
these times simply reflect the browsing habits of the participants; in fact, the longest
times in Task 1.3 and Task 1.4 with TC come from the same participant. Additionally, the
participant who spent the most time in Task 1.1 with HL also had the second-longest time
in Task 1.4 with HL. In order to investigate the times while taking the differences
between participants into account, we plot the total time spent on each system by each
participant in Figure 5.4. It shows that 9 out of 14 (64.3%) participants spent less time
in total when they used TC, but it also indicates that some users might find HL a more
effective way to complete Task 1. Note that we have mapped the
Figure 5.3: Statistics of Time on each question of Task 1.
[Box-and-whisker diagram titled “Time Spent on Task 1”. Vertical axis: Time (s), 0 to 600. Categories: Task 1.1 to Task 1.4, each for HL and TC. Legend: 2nd Quarter, 3rd Quarter.]
Figure 5.4: Total time spent on each system by all participants.
[Chart titled “Total Time Spent on Both Systems by User”. Horizontal axis: User Id, 1 to 14. Vertical axis: Total Time Spent (s), 0 to 800. Series: HL, TC.]
study.
Figure 5.5: Statistics of time reduced when a participant uses a system for the second time.
[Box-and-whisker diagram titled “Time Difference between the 1st and 2nd Time Usage”. Axis: Time Difference (s), -100 to 500. Categories: TC, HL. Legend: 2nd Quarter, 3rd Quarter.]
to Task 1.4. We hypothesize that this improvement could be attributed to participants
becoming familiar with both the tasks and the tools. Based on this observation, we compute
the difference between the time a participant spent the first time they used a system and
the time they spent the second time they used the same system. We find that 10 out of 14
participants improved their completion time the second time they used HL, and 10 out of 14
did so for TC as well. So there is usually an improvement. We plot another box-and-whisker
diagram in Figure 5.5, which summarizes these differences. Although HL has a larger
maximum, which is due to the “suspicious outlier” we mentioned before, TC is better at the
minimum, the first quartile, the median, and the third quartile. That suggests people are
likely to improve more when they use TC for the second time.
We used one-tailed t-tests to check whether the following hypotheses are statistically
supported: (1) the total time spent on TC is less than that on HL; and (2) the improvement
on TC is larger than that on HL. However, the p-values are 0.175 and 0.306 respectively.
This means that, with the current sample size, we are not able to confirm our hypotheses.
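For readers who want to run a comparable test on their own measurements, the sketch below
implements a one-tailed paired t-test using only the Python standard library. It is an
assumption on our part that a paired test is appropriate here (each participant used both
systems); the text above does not say which variant was used. The function names and the
sample times are hypothetical, and the p-value is obtained by numerically integrating the
Student t density rather than by calling a statistics package.

```python
import math
from statistics import mean, stdev


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(t: float, df: int, steps: int = 2000) -> float:
    """P(T <= t), using Simpson's rule on [0, |t|] and the symmetry of the density."""
    a = abs(t)
    if a == 0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area


def paired_one_tailed(x, y):
    """Test the alternative mean(x - y) < 0; returns (t statistic, df, p-value).

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1, t_cdf(t, n - 1)


# Hypothetical total times in seconds for four participants (not the study data).
tc_times = [310.0, 295.0, 420.0, 380.0]
hl_times = [350.0, 280.0, 460.0, 415.0]
t_stat, dof, p_value = paired_one_tailed(tc_times, hl_times)
print(f"t = {t_stat:.3f}, df = {dof}, p = {p_value:.3f}")
```

For hypothesis (1) the first argument would be the TC times and the second the HL times;
for hypothesis (2) the HL improvements would be passed first and the TC improvements
second, so that the alternative again reads "first minus second is negative".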