Analysis on Results of Task 1 - Semantic Web for Everyone: Exploring Semantic Web Knowledge Bas

We compare how well participants completed the four subtasks of Task 1 with each of the two systems. An answer is considered invalid if any term in the selected combination does not match the original meaning and purpose of the question. For example, in Task 1.1, where we wanted the participants to find a term for “company”, some participants chose properties such as dbprop:companyType or dbprop:companyLogo-. These terms, although they partially match the keywords, do not have the same semantics as the query purpose (they also would not retrieve as many instances). Thus we count these answers as invalid combinations, no matter how many instances they retrieve.

For each question, there is a best combination of terms that retrieves the largest number of instances matching the purpose of the query. For each valid participant answer, we also record how many instances it retrieves. We define the coverage as the ratio between the number of instances retrieved by a valid answer and the number retrieved by the best combination.¹ We compute the average coverage ratio of the valid answers by question and system, and report the results in Table 5.1. For each subtask and each system, the first row gives the number of attempts, i.e. the number of participants who answered that question with that system; the second row gives the number of valid participant answers; and the third row gives the ratio between the second row and the first. The last row reports the average of the coverage ratios, computed only over the valid answers. Note that the final column provides adjusted results for Task 1.4, an ideal interpretation of the participants' answers, as will be explained later.

Table 5.1: How well do participants finish Task 1 with both systems?

                        Task 1.1      Task 1.2      Task 1.3      Task 1.4      Task 1.4*
                        HL     TC     HL     TC     HL     TC     HL     TC     HL     TC
# of Attempts           8      6      7      7      6      8      7      7      7      7
# of Valid Answers      5      4      7      7      3      4      6      6      6      6
Ratio of Valid Answers  0.63   0.67   1      1      0.5    0.5    0.86   0.86   0.86   0.86
Avg. Valid Coverage     0.63   0.58   0.89   1      0.66   1      0.49   0.47   0.99   0.93
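The valid-answer ratio and the coverage metric described above can be sketched in a few lines of Python. This is only an illustration: the function names are ours, and the instance counts passed to `average_coverage` are hypothetical. The attempt and valid-answer counts are copied from Table 5.1 (unadjusted columns only). We assume the table rounds half up, which is why `Decimal` is used instead of the built-in `round`.

```python
from decimal import Decimal, ROUND_HALF_UP


def valid_ratio(valid, attempts):
    """Valid answers divided by attempts, rounded half up to two decimals."""
    ratio = Decimal(valid) / Decimal(attempts)
    return ratio.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def average_coverage(valid_counts, best_count):
    """Mean of (instances retrieved / instances of the best combination).

    Only valid answers are passed in; invalid answers are excluded beforehand.
    """
    if not valid_counts:
        raise ValueError("at least one valid answer is required")
    return sum(count / best_count for count in valid_counts) / len(valid_counts)


# Columns of Table 5.1 in order: Task 1.1 HL, TC, Task 1.2 HL, TC, ...
attempts = [8, 6, 7, 7, 6, 8, 7, 7]
valid = [5, 4, 7, 7, 3, 4, 6, 6]
ratios = [valid_ratio(v, a) for v, a in zip(valid, attempts)]
print([str(r) for r in ratios])

# Hypothetical counts: the best combination retrieves 100 instances and
# two valid answers retrieve 100 and 50 instances respectively.
print(average_coverage([100, 50], best_count=100))
```

The first print reproduces the ratio row of the table (with trailing zeros), and the second shows that an answer with half the instances of the best combination pulls the average down to 0.75.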

¹ We assume that all such instances are valid, even though it is likely that some terms have erroneous instances. If we assume that such errors are distributed uniformly, this metric still records the best combination.

Generally speaking, we find no obvious difference between the two systems in whether a participant finds a valid answer. We think this is because whether a participant can provide a valid answer depends mostly on how well they understand the general purpose of the task (or the way of querying structured datasets). The results also show that the TC system sometimes has considerably better average coverage of the valid answers. In Task 1.2 and Task 1.3, TC gets full coverage, which means participants can always find the best combinations by using TC in these two questions, as long as they correctly interpret the purpose of the questions. In Tasks 1.1 and 1.4, TC users have slightly lower coverage than HL users. We investigated the two answers to Task 1.1 via the TC system that do not get full coverage, and found that both answers are the same: the user chose yago:Company108058098 instead of dbpediaowl:Company as the term for “company”. The former class has 5656 instances in the dataset while the latter has 33747. The difference in the instance counts is obvious; however, in the TC interface the font sizes for these two tags are 28.3 px and 31.6 px respectively, because a log function maps instance counts to font sizes. When these two tags are not placed side by side, the difference in font sizes is very hard to discriminate by eye, especially since the former has more characters, which makes it seem to take more space. It might be that participants using the TC system for the first time assume that any difference in the instance counts will be clearly reflected by font sizes, or that they do not know that they can get the exact counts by hovering the mouse over the tags. In comparison, however, participants using HL sometimes select tags with far fewer matches, e.g. one participant chose ns6998:Company (which

Task 1.4 is a relatively tricky question. The best combination {yago:Actor109765278, dbpediaowl:starring-} has 15225 instances. Since we did not cover the concept of inverse properties in the tutorial, we believe most of the participants do not know the difference in semantics between dbpediaowl:starring- and dbpediaowl:starring. We saw quite a few answers without the inverse notation, and we are not sure whether these participants found the right property but omitted the “-” when filling in the form. In the dataset, both dbpediaowl:starring- and dbpediaowl:starring have some instances of actors as their subjects, although the number for the former is much greater than that for the latter (15225 vs. 6 instances). That means there are data supporting either direction of the usage, and we consider both to be valid answers for the keyword “starring”. Thus we report two columns for Task 1.4, where the original one has a much lower coverage ratio than the adjusted one. That is because in the adjusted results, we assume that participants who answered dbpediaowl:starring actually meant dbpediaowl:starring-. Task 1.4 is tricky in another way as well: a “greedy algorithm” will fail. When a user searches for “actor”, the largest class, freebase:film.actor, has 26067 instances, which is more than yago:Actor109765278 (24760 instances). However, if the user chooses this “locally optimal” class, the combination {freebase:film.actor, dbpediaowl:starring-} has only 12175 instances. In fact, the only two answers with the TC system that do not get full coverage (after the inverse property adjustment) are exactly {freebase:film.actor, dbpediaowl:starring-}, resulting from the locally optimal actor class.
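The failure of the greedy choice can be checked directly from the instance counts quoted above. The following is a minimal sketch with variable names of our own choosing. It assumes that the four remaining valid TC answers in the adjusted column reach full coverage, which follows from the statement that the two freebase:film.actor answers are the only ones that do not.

```python
# Instance counts quoted in the text for Task 1.4.
class_counts = {
    "yago:Actor109765278": 24760,
    "freebase:film.actor": 26067,
}
# Instances retrieved when each class is combined with dbpediaowl:starring-.
combination_counts = {
    "yago:Actor109765278": 15225,
    "freebase:film.actor": 12175,
}

# Greedy choice: take the class with the most instances on its own.
greedy_class = max(class_counts, key=class_counts.get)
# Best choice: take the class whose combination retrieves the most instances.
best_class = max(combination_counts, key=combination_counts.get)

best_count = combination_counts[best_class]
greedy_coverage = combination_counts[greedy_class] / best_count

# Adjusted TC column of Table 5.1: 6 valid answers, of which 2 used the
# greedy class and (by assumption) 4 reached full coverage.
tc_coverages = [1.0] * 4 + [greedy_coverage] * 2
average_tc_coverage = sum(tc_coverages) / len(tc_coverages)

print(greedy_class, best_class)
print(round(greedy_coverage, 3), round(average_tc_coverage, 2))
```

Under these assumptions the greedy combination covers about 80% of the best one, and the average comes out at 0.93, in agreement with the adjusted TC value for Task 1.4 in Table 5.1.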

[Figure: request timeline with time points t₁, t₂, t₃, t₄ and the labels “The first request that is definitely relevant to Task 1.x”, “The first request that is definitely irrelevant to Task 1.x”, and “The last request that is definitely relevant to Task 1.x”]

Figure 5.2: Estimating time spent on each question of Task 1.

systems. We used web request logs to estimate the time spent on each task question. When a request contained keywords, or tags related to the keywords, from one of the tasks, we could be confident that it was related to that specific task and mark it accordingly. For other requests, however, we could not be sure whether they were relevant to a specific task, especially when a request fell between two clusters of requests for two adjacent tasks. Thus we simply apportion half of the ambiguous time interval to each of the two tasks. As illustrated in Figure 5.2, we mark the first and last clearly relevant requests of each task question and estimate that each task ends exactly halfway between its last clearly relevant request and the first clearly relevant request of the next task. Similarly, we assume that the next task begins at this point.
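The halving rule can be sketched as follows. This is a sketch under stated assumptions, not the script used in the study: the function name, the input layout, and the timestamps are hypothetical. We further assume that the first task starts at its first clearly relevant request and the last task ends at its last clearly relevant request, since the text does not say how the two ends of a session are handled.

```python
def estimate_task_times(spans):
    """Estimate the time spent on each task from request timestamps.

    spans: list of (task, first_relevant, last_relevant) tuples in task
    order, with timestamps in seconds. Each ambiguous interval between
    two adjacent tasks is split evenly between them.
    """
    times = {}
    # Assumption: the first task starts at its first relevant request.
    start = spans[0][1]
    for i, (task, _first, last) in enumerate(spans):
        if i + 1 < len(spans):
            next_first = spans[i + 1][1]
            # The task ends halfway through the ambiguous interval.
            end = last + (next_first - last) / 2
        else:
            # Assumption: the last task ends at its last relevant request.
            end = last
        times[task] = end - start
        start = end  # the next task begins where this one ends
    return times


# Hypothetical timestamps (seconds since the start of the session).
spans = [("1.1", 0, 100), ("1.2", 120, 200), ("1.3", 230, 300)]
print(estimate_task_times(spans))
```

With these made-up timestamps the ambiguous intervals are 20 s and 30 s long, so Task 1.2 receives 10 s from the first and 15 s from the second in addition to the 80 s between its own first and last relevant requests.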

We plot a box-and-whisker diagram of the time spent on each question with each system in Figure 5.3. The lower red boxes indicate the range between the first quartile and the median, the upper blue boxes indicate the range between the median and the third quartile, and the lower and upper bars indicate the minimum and maximum times. We can see that participants usually spent less time completing a task using TC in Task 1 except [...] systems are very close in this subtask. We also found that one of the time records, from a participant who used TC, may be estimated with a higher error, because the ambiguous time interval (transition time) between Task 1.2 and Task 1.3 for that participant is the longest among all participants and tasks. The average length of the ambiguous time intervals is 32.7 s, while the longest is 109 s.

We also checked the points that appear to be outliers, focusing on the following three: 580.5 s in Task 1.1 with HL, 319.5 s in Task 1.3 with TC, and 252.5 s in Task 1.4 with TC. It turns out that there is nothing unexpected: all the request logs show that the participants were doing something related to their tasks. We also note that these outliers are not due to our approach to estimating task time: if we exclude the ambiguous time intervals from the estimation, i.e. we only look at the time between the first and the last relevant requests, the times are 574 s, 309 s, and 241 s respectively. We believe the long times simply reflect the browsing habits of the participants; in fact, the longest times in Task 1.3 and Task 1.4 with TC come from the same participant. Additionally, the participant who spent the most time in Task 1.1 with HL also had the second longest time in Task 1.4 with HL. In order to investigate the times while taking the differences between participants into account, we plot the total time spent on each system by each participant in Figure 5.4. It shows that 9 out of 14 (64.3%) participants spent less time in total when they used TC. It also indicates, however, that some users might find HL a more effective way to complete Task 1. Note that we have mapped the

[Figure: box-and-whisker diagram, “Time Spent on Task 1”; categories: Task 1.1 to Task 1.4, each with HL and TC; axis: Time (s), 0 to 600; legend: 2nd Quarter, 3rd Quarter]

Figure 5.3: Statistics of Time on each question of Task 1.

[Figure: “Total Time Spent on Both Systems by User”; axes: User Id (1 to 14) and Total Time Spent (s), 0 to 800; series: HL, TC]

Figure 5.4: Total time spent on each system by all participants.

study.

[Figure: box-and-whisker diagram, “Time Difference between the 1st and 2nd Time Usage”; axis: Time Difference (s), -100 to 500; legend: 2nd Quarter, 3rd Quarter; label shown: TC]

Figure 5.5: Statistics of time reduced when a participant uses a system for the second time.

to Task 1.4. We hypothesize that this improvement could be attributed to participants becoming familiar with both the tasks and the tools. Based on this observation, we compute the difference between the time a participant spent the first time they used a system and the time they spent the second time they used the same system. We find that 10 out of 14 participants improved their completion time the second time they used HL, and likewise 10 out of 14 for TC. So there is usually an improvement, i.e. a reduction in time. We plot another box-and-whisker diagram in Figure 5.5, which summarizes these differences. Although HL has a larger maximum, which is due to the “suspicious outlier” we mentioned before, TC is better at the minimum, the first quartile, the median, and the third quartile. This suggests that people are likely to improve more when they use TC for the second time.

We used a one-tailed t-test to check whether our hypotheses are statistically supported: (1) the total time spent on TC is less than that on HL; and (2) the improvement on TC is larger than that on HL. However, the p-values are 0.175 and 0.306 respectively. This means that, with the current sample size, we cannot confirm either hypothesis.
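For readers who want to run a comparable test on their own measurements, the statistic behind a one-tailed paired t-test can be sketched as follows. The text does not state whether the test was paired. We assume a paired design here because each participant used both systems. The function name and the sample values are hypothetical, and the sketch does not reproduce the p-values reported above.

```python
import math
import statistics


def paired_t_statistic(first, second):
    """Return (t, df) for the paired differences first[i] - second[i].

    For the one-tailed hypothesis "first is smaller than second",
    a strongly negative t supports the hypothesis.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two samples of equal length, at least 2 pairs")
    diffs = [a - b for a, b in zip(first, second)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation).
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err, len(diffs) - 1


# Hypothetical total times in seconds for four participants.
tc_times = [10, 12, 9, 11]
hl_times = [12, 15, 10, 15]
t, df = paired_t_statistic(tc_times, hl_times)
print(t, df)
# The one-tailed p-value is the lower tail of the Student t distribution
# with df degrees of freedom; with SciPy installed it can be obtained from
# scipy.stats.ttest_rel(tc_times, hl_times, alternative="less").
```

With 14 participants the test would have 13 degrees of freedom, so moderate average differences can easily fail to reach significance when the per-participant differences vary widely.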

In document Semantic Web for Everyone: Exploring Semantic Web Knowledge Bases via Contextual Tag Clouds and Linguistic Interpretations (Page 103-111)