Knowledge Summarization: Building the Case Base
7.3 Testing Scenarios
TCBR is not as well studied a discipline as IR, so there are no established procedures for evaluating TCBR approaches. In this section, we first look at the TCBR approaches discussed in Chapter 2, to make clear how much their evaluation procedures differ. Then, we describe the testing scenario that will be used in our evaluation.
7.3.1 Existing Evaluation Procedures
In the following, the evaluation procedures used by four of the TCBR approaches discussed in Chapter 2 are briefly summarized.
Brüninghaus & Ashley (Section 2.4.1): The primary goal in this approach is to represent documents of legal cases with a set of factors that capture their meaning. To assign the factors, the text is automatically transformed into a representation with ProPs (propositional patterns). Although the tests showed that the accuracy of this task on its own is not adequate, Brüninghaus & Ashley proceeded with the evaluation by using the cases automatically represented with factors as input for an algorithm that predicts the outcome of the legal cases (whether the legal case was won or lost). The predicted outcomes are compared to the real outcomes in a leave-one-out evaluation scheme that calculates values for the measures of accuracy and coverage. These values are combined in one F-measure for predictions:
$$F_{\text{pred}} = \frac{2 \cdot \text{accuracy} \cdot \text{coverage}}{\text{accuracy} + \text{coverage}},$$
which is similar to the F-measure used in information retrieval, where it is calculated from precision and recall. Although the automatic assignment of factors to cases was not successful on its own, the predictions based on the set of assigned factors are still successful ($F_{\text{pred}} = 0.703$).
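As a minimal sketch of the formula above, the following function combines accuracy and coverage into a single value. The function name `f_pred` and the sample inputs are illustrative assumptions; they are not the figures reported by the cited study.

```python
def f_pred(accuracy: float, coverage: float) -> float:
    """Harmonic mean of accuracy and coverage (the F-measure for predictions)."""
    if accuracy + coverage == 0:
        return 0.0  # avoid division by zero when both measures are zero
    return 2 * accuracy * coverage / (accuracy + coverage)


# Illustrative values only, not taken from the evaluation described above.
print(round(f_pred(0.8, 0.6), 3))  # prints 0.686
```

Because the harmonic mean is dominated by the smaller of its two arguments, a system cannot reach a high value by being accurate on only a small fraction of the cases it covers.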
Knowledge Layers (Section 2.4.2): In this approach, the focus is on a rich representation of cases that would improve case retrieval beyond that of an IR system that uses the simple bag-of-words representation. To demonstrate the contribution of the knowledge layers, Lenz devised the following scenario. He took 25 documents in the form of question/answer pairs, reformulated each question, and used the reformulated questions as queries to the TCBR system. If the system retrieved the document containing the original question, the retrieval was counted as a success. Based on the retrieval results, a precision-recall curve was built. Then, Lenz performed an ablation study, removing one by one the knowledge layers used for the case representation. He plotted all precision-recall curves together to show that the best performance was achieved when the case representation used the most knowledge layers. The best result on these curves was at the point precision = 0.7 and recall = 0.9.
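The success criterion of this scenario can be sketched as follows. The summary above does not say how the precision-recall curve was derived from the individual successes, so this sketch only computes the fraction of reformulated queries whose original document appears among the top k results. The function name `success_at_k`, the data layout, and the document identifiers are hypothetical.

```python
def success_at_k(rankings, originals, k):
    """Fraction of queries whose original document is among the top k results.

    rankings:  query id -> list of retrieved document ids, best first
    originals: query id -> id of the document holding the original question
    """
    hits = sum(
        1 for query_id, ranked in rankings.items()
        if originals[query_id] in ranked[:k]
    )
    return hits / len(rankings)


# Toy data (hypothetical): three reformulated questions and their result lists.
rankings = {
    "q1": ["d1", "d7", "d3"],
    "q2": ["d5", "d2", "d9"],
    "q3": ["d4", "d8", "d3"],
}
originals = {"q1": "d1", "q2": "d2", "q3": "d3"}

for k in (1, 2, 3):
    print(k, success_at_k(rankings, originals, k))
```

Repeating such a measurement with individual knowledge layers switched off would correspond to the ablation study described above.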
Sophia (Section 2.4.3.3): To test the case retrieval efficiency of the Sophia approach, its authors use a scenario known as "query-by-example". Concretely, instead of the usual query consisting of a few key words, an entire document is presented to Sophia, which finds for this test document a cluster of training documents that share a context with it. From the items of the cluster, a minimum spanning tree is created. The node (the document) that is most similar to the query (called the nearest neighbor, NN) serves as a starting point to explore the tree and retrieve other similar documents. Since the evaluation is performed on the documents of the Reuters-21578 corpus, in which every document is annotated with the set of topics that apply to it, the relevancy of a retrieved document is based on the number of topics it shares with the query document. If the two documents share all topics, the relevancy is high; as the number of shared topics decreases, so does the relevancy. Since retrieving only one document (the NN) is often insufficient, documents at a distance of up to k edges from the NN are retrieved. The experiments showed that the best retrieval results were achieved when the definition of relevancy was not stringent (the two documents share at least one topic) and the distance k was set to 3 or higher.
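The retrieval and relevance steps described above can be sketched as follows. The sketch assumes the spanning tree is already given as an adjacency list; it does not reproduce Sophia's clustering or tree construction. All names (`within_k_edges`, `is_relevant`) and the toy documents and topics are hypothetical.

```python
from collections import deque


def within_k_edges(tree, start, k):
    """Breadth-first walk: map each node within k edges of `start` to its distance."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand beyond the distance limit
        for neighbor in tree.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def is_relevant(query_topics, doc_topics, min_shared=1):
    """Lenient relevance test: the documents share at least `min_shared` topics."""
    return len(set(query_topics) & set(doc_topics)) >= min_shared


# Toy undirected tree and topic annotations (hypothetical data).
tree = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"], "d": ["b", "e"], "e": ["d"]}
topics = {"a": {"grain"}, "b": {"grain", "wheat"}, "c": {"oil"},
          "d": {"wheat"}, "e": {"gold"}}
query_topics = {"wheat", "corn"}

retrieved = within_k_edges(tree, "b", 1)  # "b" plays the role of the NN
relevant = [doc for doc in retrieved if is_relevant(query_topics, topics[doc])]
print(relevant)  # prints ['b', 'd']
```

Raising `min_shared` makes the relevance definition more stringent, and raising `k` widens the neighborhood explored around the NN.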
PSI (Section 2.4.3.5): The testing procedure of PSI also depends on the concept of the nearest neighbor. PSI uses a set of documents where each one has a label: its class (for instance, the class was "PC" or "Mac" for the corpus of documents described in Section 2.4.3.5). The retrieval process works in the following way: for each document in the testing set, the k most similar documents are retrieved (by comparing their feature vectors). The class of the test document is predicted by a majority vote among the labels of the k retrieved documents. The accuracy of this prediction is calculated as the ratio of correctly predicted labels to the total number of test documents. Since the principal goal of PSI is to find an appropriate representation of cases, the accuracy of the approach is plotted against the number of features used in representing each case (from 10 to 120 features). The results showed that a representation with 20 features per document could compete with the 10-feature representation of LSI; however, the accuracy values varied widely among different data sets (from 59.9% to 95.8%).
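The testing procedure above amounts to k-nearest-neighbor classification with a majority vote. The following is a minimal sketch under stated assumptions: the text only says that feature vectors are compared, so the use of cosine similarity here is an assumption, and the two-dimensional vectors and labels are toy data rather than PSI features.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict(train, vector, k):
    """Majority vote among the labels of the k training documents most similar to `vector`."""
    nearest = sorted(train, key=lambda item: cosine(item[0], vector), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k):
    """Ratio of correctly predicted labels to the total number of test documents."""
    correct = sum(1 for vector, label in test if predict(train, vector, k) == label)
    return correct / len(test)


# Toy labeled feature vectors (hypothetical data).
train = [([1.0, 0.0], "PC"), ([0.9, 0.1], "PC"),
         ([0.0, 1.0], "Mac"), ([0.1, 0.9], "Mac"), ([0.2, 0.8], "Mac")]
test = [([1.0, 0.2], "PC"), ([0.1, 1.0], "Mac"), ([0.6, 0.5], "Mac")]

print(accuracy(train, test, k=3))
```

Plotting `accuracy` for representations with different numbers of features would give a curve of the kind described above.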
7.3.2 Describing Testing Scenarios
The testing scenarios described in the previous section show clearly how much the evaluation procedures depend on the information available with the documents.
For instance, Sophia and PSI, although they are domain-independent, knowledge-lean approaches that use no external resources for processing text, rely for their evaluation on information external to the documents, such as topic or class labels assigned by domain experts. Because these approaches use corpora that were prepared for text classification tasks by community efforts (e.g., the Reuters corpus or 20 Newsgroups), this kind of information is readily available. By contrast, the two knowledge-rich approaches use corpora for which such information neither exists nor makes sense, because the corpora are very homogeneous (e.g., all the documents used by Brüninghaus & Ashley are legal cases in the domain of trade secrets, and thus share the same topic). Instead, these two knowledge-rich approaches depend on information inside the documents: the fact that a document is composed as a question/answer pair, or that a legal case has a defined outcome.
Since we are interested in case-based reasoning and not case-based classification (which is what the knowledge-lean approaches do), we will follow a strategy closer to the one used for the evaluation of the knowledge-rich approaches. Such an evaluation can be regarded as goal-based. As an example, consider the approach of Brüninghaus & Ashley, where the final goal is to predict in advance, based on the available facts, whether a new case will be won or lost. One can then try to build an argumentation strategy that brings to the fore those factors that contribute to a win.
The goal of our TCBR approach is to assist inexperienced users in successfully performing the task of MONITOR-and-DIAGNOSE. Based on the discussion in the previous chapters, the most relevant questions in the context of this task are:
1. Given an observed object, what are all possible types of findings for it?
2. Given a finding, what are all possible hypotheses for its presence?
The first question helps to avoid the problem of "missing the symptoms", that is, failing to notice types of findings which are not very obvious or very common. The second question is important in the context of the DIAGNOSE task.
As is known especially from medical diagnosis, the presence of a symptom can be explained in different ways². Diagnosis proceeds by checking the hypotheses one by one, from the most probable to the least probable.
² For instance, high fever is related to different underlying causes.
At this point it is necessary to stress again that the TCBR approach for the task of MONITOR-and-DIAGNOSE is different from the others discussed so far.
These other approaches limit themselves to retrieving only a few cases that are similar to the query, because they implicitly assume that there is only one solution to a given problem. However, this is not true for the DIAGNOSE task, where several hypotheses might be possible, and it is the task of the human user to choose the one that applies to the situation by ruling out the others³. But in order to do that, these hypotheses must be known to the user. This is what TCBR does: it supplies the user with the different findings or hypotheses from the case base of episodes it has stored. The important role that our knowledge extraction and summarization approach plays in this scenario is that it supplies only unique answers and presents them ordered by their frequency of occurrence in the corpus. In this way, redundancy turns out to play a positive role by supplying information to the ranking procedure.
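The idea of returning unique answers ordered by frequency can be sketched as follows. The representation of episodes as (finding, hypothesis) pairs, the function name `ranked_hypotheses`, and the medical toy data are assumptions made for illustration only; they do not reproduce the case base of this work.

```python
from collections import Counter


def ranked_hypotheses(episodes, finding):
    """Unique hypotheses recorded for `finding`, most frequent first, with their counts."""
    counts = Counter(hypothesis for f, hypothesis in episodes if f == finding)
    return counts.most_common()


# Toy episodes (hypothetical): redundant entries are what drives the ranking.
episodes = [
    ("high fever", "influenza"),
    ("high fever", "malaria"),
    ("high fever", "influenza"),
    ("cough", "influenza"),
    ("high fever", "pneumonia"),
    ("high fever", "influenza"),
    ("high fever", "malaria"),
]

print(ranked_hypotheses(episodes, "high fever"))
# prints [('influenza', 3), ('malaria', 2), ('pneumonia', 1)]
```

Each hypothesis appears only once in the result, while the number of redundant episodes behind it determines its position in the list.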
Recapitulating the discussion of this section, there will be two testing scenarios for the TCBR approach:
1. Retrieve all finding information related to some type of observed object.
2. Retrieve all explanation information related to some type of finding.
To measure success in these experiments, evaluation measures are needed; they are discussed in the following section.