6.1.2 Evaluating the Semantic Matcher
As discussed in the previous sections, there are no agreed best-practice evaluation methods that can be used to evaluate semantic matching solutions. The precision and recall metrics from the IR domain cannot be applied directly in this case because of the aforementioned problems.
The ultimate result of the semantic matching framework proposed in this research is a ranking of the available advertisements, indicating which is the best match, which is the next best and so on. It is much easier to obtain such a ranking from a domain user than to obtain a percentage score for each advertisement. As stated in the previous discussion, the evaluation of semantic matching is the determination of how closely the rankings/classifications delivered by the engine approximate the rankings/classifications specified by domain experts/users. Hence, to judge the effectiveness or correctness of how the matches are ranked or classified, the resultant ranking or classification has to be compared against what a human subject or expert would view as the correct ranking or classification.
To evaluate the retrieval effectiveness of this semantic matching solution, we compare the rankings delivered by the matching engine with those provided by domain users in the same context. The domain users' rankings are collected through user studies and averaged to give the average user ranking⁴. The closeness between the average user ranking and the ranking from the matching engine for the same situation is then judged quantitatively and graphically, which in turn helps to evaluate the effectiveness of the matcher. We outline the methods and metrics used for this purpose in the following sections.
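The averaging step can be illustrated with a minimal sketch. The advertisement identifiers and the rankings below are hypothetical, and the sketch assumes that every subject ranks the same set of advertisements; how the averaged values are handled afterwards (for example, whether they are re-ranked) is not specified here.

```python
def average_user_ranking(user_rankings):
    """Average the rank given to each advertisement across all subjects.

    user_rankings: list of dicts mapping advertisement id -> rank (1 = best).
    Assumes (hypothetically) that every subject ranked the same advertisements.
    """
    if not user_rankings:
        raise ValueError("at least one user ranking is required")
    ads = user_rankings[0].keys()
    count = len(user_rankings)
    return {ad: sum(r[ad] for r in user_rankings) / count for ad in ads}


# Hypothetical rankings from three subjects for three advertisements.
rankings = [
    {"ad_a": 1, "ad_b": 2, "ad_c": 3},
    {"ad_a": 1, "ad_b": 3, "ad_c": 2},
    {"ad_a": 2, "ad_b": 1, "ad_c": 3},
]
avg = average_user_ranking(rankings)
```

The averaged ranks are generally not integers, which is one reason a later mapping to a continuous relevance value is convenient.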
6.1.2.1 Adapting the Generalised Precision and Recall to Evaluate the Proposed Semantic Matching Approach
As discussed in Section 6.1.1.2, the well-known measures of precision and recall have been extended to measure the effectiveness of systems that return a fuzzy value for the relevance. The equations for generalised precision and recall are given in Equation 6.5 and Equation 6.6. To use these equations to evaluate a matcher, the matching system should return a value ∈ [0, 1] as the degree of relevance for an advertisement. However, the proposed semantic matcher returns a ranking (∈ [1, n], where n is the number of advertisements considered during the matching process), where the best resource advertisement gets rank 1, the second best gets rank 2 and so on⁵. To exploit the generalised precision and recall as a metric for evaluation, the rank should be adjusted to a fuzzy relevance value ∈ [0, 1].
⁴ Human subjects can have subjective differences; for example, what one views as the third best match could be viewed as the fourth best by another. By averaging the rankings obtained from a number of subjects, the effects of subjective judgements can be minimised.
⁵ Also, it is much easier for a domain user to rank the advertisements rather than assigning a relevance
The rank can be adjusted to obtain a value ∈ [0, 1] that indicates the fuzzy relevance of an advertisement. We use the following equation to obtain a fuzzy relevance (f) from the rank. The fuzzy relevance f_i for the i-th advertisement, which has rank rank_i, can be obtained by:
$$ f_i = \frac{n - \mathit{rank}_i}{n} \qquad (6.7) $$
where n denotes the number of advertisements considered during the matching process (and therefore the maximum value that can be taken by the rank). The measure f can then be used in Equation 6.6 and Equation 6.5 for calculating generalised precision and recall.
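The conversion in Equation 6.7 can be sketched as follows. This is only an illustration: the function name and the advertisement identifiers are hypothetical and not part of the proposed framework.

```python
def fuzzy_relevance(rank, n):
    """Equation 6.7: map a rank in [1, n] to a fuzzy relevance (n - rank) / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 1 <= rank <= n:
        raise ValueError("rank must lie in [1, n]")
    return (n - rank) / n


# Hypothetical ranking of four advertisements (1 = best match).
ranks = {"ad_a": 1, "ad_b": 2, "ad_c": 3, "ad_d": 4}
n = len(ranks)
relevance = {ad: fuzzy_relevance(r, n) for ad, r in ranks.items()}
print(relevance)
# {'ad_a': 0.75, 'ad_b': 0.5, 'ad_c': 0.25, 'ad_d': 0.0}
```

Note that under Equation 6.7 the best advertisement receives (n − 1)/n rather than 1, while the advertisement ranked last receives 0.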
6.1.2.2 Chosen Evaluation Criteria
As pointed out by Tsetsos et al. [130], the service matching domain lacks established metrics and methods for evaluating retrieval effectiveness, and only a few semantic matching efforts have carried out a quantitative analysis of the effectiveness of their proposed approaches. Nevertheless, precision and recall metrics (or their generalised versions) have been adopted for evaluating certain matching solutions [41, 141, 130]. Using the standard precision and recall metrics (Equation 6.2 and Equation 6.3) means that the output of a semantic matcher that returns a fuzzy relevance has to be converted into a boolean relevance; this approach has the limitations identified in Section 6.1.1. These limitations can be overcome by using the generalised precision and recall metrics discussed by Tsetsos et al. [130], who have also adopted these metrics for evaluating semantic matchers that classify available services into an agreed set of classes, as in [101, 82]. The generalised precision and recall metrics can also be extended to the evaluation of matchers that rank the available services, as described in Section 6.1.2.1. In view of the above discussion, we use the following metrics and methods to judge the effectiveness of the Semantic Matching Approach.
• Generalised Recall and Precision and associated F-measure: We use the fuzzy relevance scores obtained from the rankings (through Equation 6.7) to compute the generalised precision and recall (Equation 6.6 and Equation 6.5). These values of precision and recall are then used to compute the F-measure (Equation 6.4), which gives a combined measure of effectiveness.
• Standard Deviation: The Standard Deviation between the matcher ranking [...] deviation between the two rankings⁶.
• Graphical Illustration of the Rankings: Although this does not give a quantitative value, it is useful to gain an understanding of the variance between the semantic matcher ranking and the human ranking through visual inspection.
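The first of these metrics can be sketched in code. Equations 6.4 to 6.6 are not reproduced in this section, so the sketch rests on two assumptions that should be checked against them: that generalised precision and recall take the common fuzzy-set form (the overlap between the two relevance assignments is the sum of their element-wise minima), and that the F-measure is the balanced harmonic mean of precision and recall. All names and values are hypothetical.

```python
def generalised_precision_recall(system, expert):
    """Generalised precision and recall over fuzzy relevance values.

    Assumed fuzzy-set form (not taken from Equations 6.5 and 6.6):
      overlap   = sum_i min(system_i, expert_i)
      precision = overlap / sum_i system_i
      recall    = overlap / sum_i expert_i
    Both dicts must map the same advertisement ids to values in [0, 1].
    """
    overlap = sum(min(system[ad], expert[ad]) for ad in system)
    system_total = sum(system.values())
    expert_total = sum(expert.values())
    precision = overlap / system_total if system_total else 0.0
    recall = overlap / expert_total if expert_total else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean), assumed form of Equation 6.4."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical fuzzy relevance values derived via Equation 6.7 (n = 4).
matcher = {"ad_a": 0.75, "ad_b": 0.5, "ad_c": 0.25, "ad_d": 0.0}
users = {"ad_a": 0.75, "ad_b": 0.25, "ad_c": 0.5, "ad_d": 0.0}

p, r = generalised_precision_recall(matcher, users)
f = f_measure(p, r)
```

In this hypothetical case the matcher and the users disagree only on the second and third positions, so the overlap (1.25) falls short of the total relevance on either side (1.5), and precision, recall and the F-measure all come to roughly 0.83.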