5.3 Experiments
6.1.1 Current Practices used for Evaluating Retrieval Effectiveness
Currently there are no agreed “best practice” methods or heuristics for evaluating the effectiveness of a matchmaking solution. As pointed out by Tsetsos et al. in [130], although many semantic matching research efforts have conducted experiments to evaluate performance with respect to scalability and response times, they lack a quantitative analysis of the retrieval effectiveness of their solutions. The main reason is the lack of established evaluation metrics and methodologies for semantic matching schemes.
However, the Information Retrieval (IR) domain (where relevant documents must be extracted from a collection of documents) addresses a somewhat similar problem to matchmaking (identifying suitable resources out of several resources). In that domain, the precision and recall metrics and the associated F-measure [9] (see Equation 6.2, Equation 6.3 and Equation 6.1) are used to judge effectiveness.¹
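\[ \text{precision} = \frac{\text{Number of Retrieved Resources that are Relevant}}{\text{Number of Retrieved Resources}} \tag{6.2} \]
\[ \text{recall} = \frac{\text{Number of Retrieved Resources that are Relevant}}{\text{Number of Relevant Resources}} \tag{6.3} \]
\[ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{6.4} \]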
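As a minimal sketch of how Equations 6.2 to 6.4 (and the weighted form in Equation 6.1) could be computed, the following Python fragment treats the retrieved and relevant resources as sets. The function names and the resource identifiers `s1` to `s5` are hypothetical and are used only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for boolean relevance judgements."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved resources that are relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_measure(precision, recall, alpha=1.0):
    """Weighted F-measure in the form of Equation 6.1; alpha = 1 gives F1."""
    denominator = alpha * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + alpha) * precision * recall / denominator


# Illustrative data (assumed): four resources retrieved, three relevant.
retrieved = ["s1", "s2", "s3", "s4"]
relevant = ["s1", "s3", "s5"]

p, r = precision_recall(retrieved, relevant)  # p = 2/4, r = 2/3
f1 = f_measure(p, r)                          # 4/7, roughly 0.571
f2 = f_measure(p, r, alpha=2.0)               # weights recall more heavily
```

With these assumed inputs, two of the four retrieved resources are relevant and two of the three relevant resources are retrieved, so the balanced F-measure falls between the two values.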
Most information retrieval tasks, however, assume a boolean relevance, i.e. a document is either relevant or completely irrelevant. The measures of precision and recall are based on this assumption: the positive and negative matches returned by an IR system are compared against those deemed to be relevant or irrelevant (as determined by a human subject) in order to arrive at the precision and recall metrics.
¹ Recall is the proportion of relevant resources actually retrieved in answer to a search request. Precision is the proportion of retrieved resources that are actually relevant. A single measure combining recall and precision is the F-measure or weighted harmonic mean (Equation 6.4). The general formula for non-negative real α is given by:
\[ F_\alpha = \frac{(1 + \alpha)(\text{precision} \cdot \text{recall})}{\alpha \cdot \text{precision} + \text{recall}} \tag{6.1} \]
Choosing α > 1 weights recall more than precision. When α = 1, recall and precision are equally weighted, which gives F or F1 as in Equation 6.4.
Semantic matching solutions typically aim to provide a more flexible approach to matching, rather than just classifying the available services into crisp sets of relevant and irrelevant cases. Therefore, the matches can have different levels of relevancy or suitability as compared to the given request. Depending on this degree of relevancy or suitability, the matches are either classified into a number of sets, as in [101, 82], or ranked, usually based on a real-number score assigned to each resource. To evaluate such a matching approach based on precision and recall, the range of service rankings must be divided into two complementary sets through the use of a threshold value (a minimum acceptable degree of match). All the advertisements having a ranking greater than the threshold are taken as relevant, and the others are taken as irrelevant. Similarly, the experts or users of the domain have to classify the advertisements as relevant or irrelevant, as opposed to any other classification or ranking. However, there are several issues associated with using such an evaluation scheme to evaluate a semantic matching approach that employs multiple degrees of match (i.e. the potential matches can have different levels of relevancy or suitability):
• One of the main objectives of a semantic matching scheme is to facilitate flexible matching rather than just classifying the advertisements into two sets corresponding to either exact or failed matches. Re-classifying the resulting degrees of match into relevant and non-relevant matches for the purpose of employing an evaluation scheme using precision/recall metrics will mean that the extra information and semantics provided by the matcher are disregarded.
• In order to re-classify the match results into relevant and non-relevant matches, a suitable threshold has to be agreed. Agreeing on such a threshold value for this purpose is a problematic task since such a threshold value may depend on the context and on how stringent one wants the retrieval process to be.
• To apply precision/recall metrics, the domain expert or user has to assign a boolean relevance to each advertisement. Determining whether an advertisement is relevant or not may not be straightforward in certain cases, and it runs contrary to the objective of semantic matching, which aims for more flexible and accurate resource retrieval.
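The sensitivity to the threshold noted in the list above can be illustrated with a small sketch. The advertisement names, matcher scores and expert judgements below are invented purely for illustration, and `precision_recall_at` is a hypothetical helper rather than part of any cited system.

```python
def precision_recall_at(scores, relevant, threshold):
    """Binarise matcher scores at `threshold`, then compute precision/recall.

    Advertisements scoring strictly above the threshold count as retrieved.
    """
    retrieved = {ad for ad, score in scores.items() if score > threshold}
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Assumed matcher scores in [0, 1] and assumed boolean expert judgements.
scores = {"A1": 0.92, "A2": 0.75, "A3": 0.55, "A4": 0.40, "A5": 0.10}
relevant = {"A1", "A2", "A4"}

results = {t: precision_recall_at(scores, relevant, t) for t in (0.3, 0.5, 0.8)}
# threshold 0.3 -> precision 0.75, recall 1.0
# threshold 0.5 -> precision 2/3,  recall 2/3
# threshold 0.8 -> precision 1.0,  recall 1/3
```

The same matcher output yields three quite different precision/recall pairs, and the graded scores themselves play no part in the result once they have been binarised.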
Hence, an evaluation approach has to be adopted that takes into consideration the objectives and semantics of the matching approach and the type of classification or ranking provided. In the following section we discuss some of the approaches employed by related efforts to evaluate their semantic matching frameworks, and then explain the evaluation criteria adopted for the evaluation of the proposed semantic matching framework.
6.1.1.1 Evaluation Methods used in other Service Matching Research
The web service matching approaches presented by Dong et al. in [41] and Wang et al. in [141] have adopted precision and recall metrics to evaluate the effectiveness of the proposed matching solutions. The matcher results are compared against the boolean relevances assigned to the available services (services are deemed to be either relevant or irrelevant by the authors). In [141], these assigned relevances are compared against the resources that receive a score greater than 50% from their matching system.² In [41], precision, recall and variants of precision are used to evaluate the matching solution.
The matching system proposed in [139] also adopts precision and recall metrics for its evaluation. The domain involved in this research is the recruitment of human resources. The resource request is a job specification, and the advertisements are the skills of the available job seekers. The evaluation study involves five subjects who are human resource experts. The matchmaker assigns a score to each advertisement in relation to the job request, and the advertisements that get a score greater than 60% are deemed to be the “positive” matches returned by the system. The human experts also assign a score to each advertisement, and those that get a score greater than 60% are taken as the “relevant” matches for the calculation of the precision and recall metrics. In this situation, however, there is a previously known “marking scheme” for scoring the skills specifications in relation to a given job request, which is what made it possible to adopt precision and recall metrics in this research. The reason for choosing a threshold value of 60% is, however, not justified in the literature.
Tsetsos et al. in [130] presented a generalised evaluation scheme to judge the retrieval effectiveness of semantic web service matchmaking. However, the evaluation scheme presented is specific to matchmaking systems where the matches are classified into an agreed set of classes, as in [101, 82]. They have used generalised recall and precision metrics (which will be discussed later in Section 6.1.1.2) to measure the correspondence between the relevance assessments delivered by the matching engine and those delivered by the experts or users.
In the semantic matching approach presented in [40] by di Noia et al., the available advertisements are ranked or ordered by their suitability to satisfy the given request. They have evaluated their matching approach by comparing the matchmaker results with human perception: the closeness between the ranking produced by the matching engine and the rankings obtained from domain users in the same scenario is estimated through the use of standard deviation. This gives an understanding of how closely the matching engine can approximate human judgement. Similarly, in the matchmaking work presented in [139], the ranking provided by the matchmaker is compared with the averaged human ranking provided by domain experts, in order to find the closeness between the matcher ranking and the human ranking and thereby estimate the ability of the matching engine to approximate human judgement.
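The exact deviation formula used in [40] is not reproduced here, so the following is only one plausible reading of such a ranking comparison: the matcher's rank position for each advertisement is compared with the average human rank position, and the root-mean-square deviation is reported. The function names and the example rankings are hypothetical.

```python
from math import sqrt
from statistics import mean


def mean_ranks(human_rankings):
    """Average 1-based rank position of each advertisement across subjects."""
    adverts = human_rankings[0]
    return {
        ad: mean(float(ranking.index(ad) + 1) for ranking in human_rankings)
        for ad in adverts
    }


def rank_deviation(system_ranking, human_rankings):
    """Root-mean-square gap between system ranks and averaged human ranks.

    This is an assumed closeness measure, not the formula from [40];
    0.0 means the system ranking coincides with the averaged human ranking.
    """
    average = mean_ranks(human_rankings)
    squared_gaps = [
        (float(position) - average[ad]) ** 2
        for position, ad in enumerate(system_ranking, start=1)
    ]
    return sqrt(mean(squared_gaps))


# Assumed rankings, best match first.
system = ["A1", "A2", "A3"]
humans = [["A1", "A2", "A3"], ["A2", "A1", "A3"]]

deviation = rank_deviation(system, humans)  # sqrt(1/6), roughly 0.408
perfect = rank_deviation(system, [system])  # 0.0
```

A smaller value indicates that the matcher's ordering is closer to the averaged human ordering. Rank-correlation coefficients would be an alternative choice with the same intent.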
As explained earlier, effectiveness evaluation involves determining how well the matcher can approximate expert or user judgement. Although there is no accepted “best practice” measure for evaluating semantic matchers, it can be observed that various efforts have adopted precision and recall (or extended and modified versions of precision and recall) from the information retrieval domain for this purpose.
² The authors have assumed that a service receiving a score of less than 50% is likely to be irrelevant to the request.
6.1.1.2 Precision and Recall for Generalised Systems
As pointed out earlier in Section 6.1, although precision and recall metrics have been widely accepted as measures of the effectiveness of information retrieval systems, they are only applicable to systems that return a boolean relevance. These metrics cannot be directly adopted to measure the effectiveness of systems that return a fuzzy value for the relevance (i.e. systems that return a value ∈ [0, 1] as the degree of relevance of an advertisement A w.r.t. a request R). Therefore, to measure effectiveness in such systems, Buell et al. in [22] have presented generalised measures of precision and recall.
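Let r(A_i) denote the degree of relevance assigned by the expert or user to the advertisement A_i, let e(A_i) denote the degree of relevance assigned by the system to the advertisement A_i,³ and let n denote the number of advertisements. Then the generalised recall and precision measures are defined as:
\[ \text{Recall} = \frac{\sum_{i=1}^{n} \min\{e(A_i), r(A_i)\}}{\sum_{i=1}^{n} r(A_i)} \]
\[ \text{Precision} = \frac{\sum_{i=1}^{n} \min\{e(A_i), r(A_i)\}}{\sum_{i=1}^{n} e(A_i)} \]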
The purpose of the evaluation is to measure how closely e approximates r. It can be observed that when the system's estimates of the relevance values are more stringent than the expert's estimates (e(A_i) ≤ r(A_i) for every i), precision is maximised. When the expert's estimates are more stringent, recall is maximised.
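The behaviour described above can be checked numerically with a minimal sketch. It assumes the min-based form of the generalised measures (the sum of min(e, r) divided by the sum of r for recall, and by the sum of e for precision). The relevance values and the function name are invented for illustration.

```python
def generalised_precision_recall(e, r):
    """Generalised (fuzzy) precision and recall, assuming the min-based form.

    e: system-assigned degrees of relevance in [0, 1]
    r: expert-assigned degrees of relevance in [0, 1]
    """
    if len(e) != len(r):
        raise ValueError("e and r must cover the same advertisements")
    overlap = sum(min(ei, ri) for ei, ri in zip(e, r))
    precision = overlap / sum(e) if sum(e) > 0 else 0.0
    recall = overlap / sum(r) if sum(r) > 0 else 0.0
    return precision, recall


# Assumed expert judgements for four advertisements.
expert = [1.0, 0.5, 0.5, 0.0]

# A system that never scores above the expert: precision reaches 1.
strict_system = [0.5, 0.5, 0.25, 0.0]
p_strict, r_strict = generalised_precision_recall(strict_system, expert)
# p_strict = 1.0, r_strict = 0.625

# A system that never scores below the expert: recall reaches 1.
lenient_system = [1.0, 1.0, 0.5, 0.5]
p_lenient, r_lenient = generalised_precision_recall(lenient_system, expert)
# p_lenient = 2/3, r_lenient = 1.0
```

When e and r take only the values 0 and 1, the same function reduces to the boolean precision and recall of Equations 6.2 and 6.3, since min(e, r) then counts exactly the retrieved resources that are relevant.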
³ That is, e(A_i) can be seen as the fuzzy membership function (defined by the system) that defines the membership value of A_i in the set of advertisements that satisfy the given request. Similarly, r(A_i) can be seen as the fuzzy membership function (defined by the expert) that defines the membership value of A_i in the set of advertisements that satisfy the request.