6.1.2 Evaluating the Semantic Matcher

As discussed in the previous sections, there are no agreed best-practice methods for evaluating semantic matching solutions. The precision and recall metrics from the IR domain cannot be applied directly in this case because of the aforementioned problems.

The ultimate result of the semantic matching framework proposed in this research will be a ranking of the available advertisements, indicating which is the best match, which is the next best, and so on. It is much easier to obtain such a ranking from a domain user than to obtain a percentage score for each advertisement. As stated in the previous discussion, evaluating semantic matching means determining how closely the rankings or classifications delivered by the engine approximate those specified by domain experts or users. Hence, to judge how effectively or correctly the matches are ranked or classified, the resulting ranking or classification has to be compared against what a human subject or expert would view as the correct ranking or classification.

To evaluate the retrieval effectiveness of this semantic matching solution, we will compare the rankings delivered by the matching engine with those provided by domain users in the same context. The domain users' rankings will be obtained through user studies, and an average user ranking will be computed from them.⁴ The closeness between the average user ranking and the ranking from the matching engine for the same situation will be judged quantitatively and graphically, which in turn helps to evaluate the effectiveness of the matcher. We outline the methods and metrics used for this purpose in the following sections.
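The averaging step can be illustrated with a minimal sketch. It assumes that each user ranks the same set of advertisements and that the average is a plain arithmetic mean of the ranks; the function name `average_user_ranking` and the sample data are hypothetical and are not taken from the study itself.

```python
def average_user_ranking(user_rankings):
    """Return the mean rank of each advertisement across all users.

    user_rankings: list of dicts, each mapping an advertisement id to the
    rank one user gave it (1 = best). Assumption: every user ranked the
    same set of advertisements.
    """
    ads = user_rankings[0].keys()
    count = len(user_rankings)
    return {ad: sum(r[ad] for r in user_rankings) / count for ad in ads}


# Hypothetical rankings from three users over three advertisements.
rankings = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 1, "B": 3, "C": 2},
    {"A": 2, "B": 1, "C": 3},
]
avg = average_user_ranking(rankings)
for ad, rank in sorted(avg.items(), key=lambda item: item[1]):
    print(f"{ad}: mean rank {rank:.2f}")
```

Mean ranks are generally not integers; whether they are used directly or converted back into an ordinal ranking is a design choice that this sketch leaves open.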

6.1.2.1 Adapting the Generalised Precision and Recall to Evaluate the Proposed Semantic Matching Approach

As discussed in Section 6.1.1.2, the well-known measures of precision and recall have been extended to measure the effectiveness of systems that return a fuzzy value for relevance. The equations for generalised precision and recall are given in Equation 6.5 and Equation 6.6. To use these equations to evaluate a matcher, the matching system should return a value ∈ [0, 1] as the degree of relevance for an advertisement. However, the proposed semantic matcher returns a ranking (∈ [1, n], where n is the number of advertisements considered during the matching process), where the best resource advertisement gets rank 1, the second best gets rank 2, and so on.⁵ To exploit generalised precision and recall as an evaluation metric, the rank should be adjusted to a fuzzy relevance value ∈ [0, 1].

⁴ Human subjects can have subjective differences; for example, what one views as the third best match could be viewed as the fourth best by another. By averaging the rankings obtained from a number of subjects, the effects of subjective judgements can be minimised.

⁵ Also, it is much easier for a domain user to rank the advertisements rather than assigning a relevance [...]

The rank can be adjusted to obtain a value ∈ [0, 1] that indicates the fuzzy relevance of an advertisement. The fuzzy relevance f_i of the i-th advertisement, which has rank rank_i, is obtained by:

\[ f_i = \frac{n - \mathit{rank}_i}{n} \qquad (6.7) \]

where n denotes the number of advertisements considered during the matching process (and therefore the maximum value that can be taken by the rank). The measure f can then be used in Equation 6.6 and Equation 6.5 for calculating generalised precision and recall.
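As a worked illustration of Equation 6.7, the following minimal sketch converts ranks into fuzzy relevance values. The function name `fuzzy_relevance` and the choice of n = 5 are assumptions made for the example only.

```python
def fuzzy_relevance(rank, n):
    """Map a rank in [1, n] to a fuzzy relevance value via Equation 6.7."""
    if not 1 <= rank <= n:
        raise ValueError("rank must lie between 1 and n")
    return (n - rank) / n


n = 5  # hypothetical number of advertisements
scores = [fuzzy_relevance(rank, n) for rank in range(1, n + 1)]
print(scores)  # [0.8, 0.6, 0.4, 0.2, 0.0]
```

Under this mapping the best advertisement receives (n − 1)/n rather than 1, and the advertisement ranked last receives 0.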

6.1.2.2 Chosen Evaluation Criteria

As pointed out by Tsetsos et al. [130], the service matching domain lacks established metrics and methods for evaluating retrieval effectiveness, and only a few semantic matching efforts have carried out a quantitative analysis of the effectiveness of their proposed approaches. Nevertheless, precision and recall metrics (or their generalised versions) have been adopted for evaluating certain matching solutions [41, 141, 130]. Using the standard precision and recall metrics (as given in Equation 6.2 and Equation 6.3) means that the output of a semantic matcher that returns a fuzzy relevance has to be converted into a boolean relevance; this approach has limitations, as identified in Section 6.1.1. These limitations can be overcome by using the generalised precision and recall metrics discussed by Tsetsos et al. [130], who have also adopted these metrics for evaluating semantic matchers that classify available services into an agreed set of classes, as in [101, 82]. The generalised metrics can also be extended to evaluate matchers that rank the available services, as described in Section 6.1.2.1. In view of the above discussion, we use the following metrics and methods to judge the effectiveness of the Semantic Matching Approach.

• Generalised Recall and Precision and associated F-measure: We use the fuzzy relevance scores obtained from the rankings (through Equation 6.7) to compute the generalised precision and recall (Equation 6.6 and Equation 6.5). These values of precision and recall are then used to compute the F-measure (Equation 6.4), which gives a combined measure of effectiveness.

• Standard Deviation: The Standard Deviation between the matcher ranking [...] deviation between the two rankings.⁶

• Graphical Illustration of the Rankings: Although this does not give a quantitative value, it is useful to gain an understanding of the variance between the semantic matcher ranking and the human ranking through visual inspection.
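The first criterion can be sketched as follows. Equations 6.4 to 6.6 are not reproduced in this section, so the sketch assumes the commonly used min-based fuzzy form of generalised precision and recall (the sum of element-wise minima divided by the total system relevance and the total user relevance, respectively) and the balanced F-measure 2PR/(P + R). If the thesis equations differ, those definitions should be substituted. All function names and the sample rankings are hypothetical.

```python
def fuzzy_relevance(rank, n):
    """Equation 6.7: map a rank in [1, n] to a value in [0, 1]."""
    return (n - rank) / n


def generalised_precision_recall(system, expert):
    """Assumed min-based generalised precision and recall.

    system, expert: equal-length lists of fuzzy relevance values for the
    same advertisements, from the matcher and from the users respectively.
    """
    overlap = sum(min(s, e) for s, e in zip(system, expert))
    system_total = sum(system)
    expert_total = sum(expert)
    precision = overlap / system_total if system_total else 0.0
    recall = overlap / expert_total if expert_total else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Balanced F-measure (assumed form of Equation 6.4)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical ranks for four advertisements, listed in the same order.
matcher_ranks = [1, 2, 3, 4]
user_ranks = [2, 1, 3, 4]
n = len(matcher_ranks)

system = [fuzzy_relevance(r, n) for r in matcher_ranks]
expert = [fuzzy_relevance(r, n) for r in user_ranks]
precision, recall = generalised_precision_recall(system, expert)
print(precision, recall, f_measure(precision, recall))
```

Because both relevance lists are derived from permutations of the same ranks, their totals are equal, so precision and recall coincide in this example; they can differ when averaged (non-integer) user ranks are used.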

In document Semantic description and matching of services for pervasive environments (Page 123-125)