email management) can be transformed into the semantic structured representation that is utilized by traditional CBR techniques. Also, such structured representations must meet NLG requirements for text generation. For example, a corpus containing sample structured data and human-authored textual equivalents must be available for NLG domain analysis (Reiter & Dale 1995).
Our research makes contributions that fall under different categories of text reuse. We propose a structural text reuse technique similar to CG (Lamontagne & Lapalme 2004) which identifies portions of a retrieved textual solution that need adaptation (details in Chapter 3). The technique is formalised with a generic architecture that can cater for different text granularity levels. Novel transformational and compositional text reuse methods whose aims are similar to those used in the GhostWriter systems (Bridge & Waugh 2009, Healy & Bridge 2010) are also proposed for text authoring (see Chapter 4). These techniques are meant to fill some of the gaps we observed in TCBR literature.
2.3 Evaluation techniques
Evaluation is a critical aspect in the design and analysis of any new technique since it helps to demonstrate the technique’s effectiveness and limitations. New techniques can also be compared with existing ones used for identical tasks based on their evaluation results. This kind of evaluation, which we call ‘performance evaluation’, is applicable to most research disciplines (TCBR included). For TCBR, performance evaluation involves assessing proposed solutions for their suitability to solve a given set of problems. Traditional CBR evaluation metrics such as accuracy are used when the solution is structured, while information retrieval evaluation metrics are adapted for textual solutions.
Another form of evaluation peculiar to CBR, especially TCBR, is the measurement of casebase complexity. Here, an experiential corpus is assessed for its conformity to the similarity assumption which is core to the CBR paradigm. In other words, casebase complexity metrics show how well similar problems have similar solutions in a given corpus. However, the complexity of a casebase is dependent on the case representation and similarity measure. Therefore, these metrics can be used to assess different configurations to
determine the most suitable for a domain without necessarily designing a complete CBR system. In this section, we examine these two forms of evaluation (casebase complexity and performance evaluation) and critically analyse previous work related to TCBR.
2.3.1 Casebase complexity evaluation
The basic assumption in TCBR, as with CBR, is that similar problems have similar solutions. Therefore TCBR is not suited to domains in which this assumption does not hold. However, the definition of similarity is highly dependent on the representation, which implicitly captures the focus and context of the problem-solving scenario. For example, two pieces of text that mean the same thing but are expressed differently might be incorrectly judged to be dissimilar if the system does not capture variability in vocabulary in its representation (e.g. using text normalisation). While designing a TCBR system, different configurations (representations and similarity metrics) might need to be compared to determine which best captures the information needed for reasoning. A qualitative evaluation, although ideal, is clearly impractical due to cost and time implications. An alternative is to develop casebase complexity metrics that measure the similarity of solutions in similar problem neighbourhoods. In other words, how well does a cluster of similar problems align to the cluster of their solutions? High values of casebase complexity indicate that the dataset (or domain) is well suited for the CBR paradigm. The hypothesis is that casebases with high complexity values will give better performance evaluation results than those with low casebase complexity.
Lamontagne (2006) proposed case cohesion as a casebase complexity measure to aid selection of an appropriate configuration for a TCBR system. Case cohesion estimates how well a case conforms to the similarity assumption by comparing its neighbourhoods in the problem and solution spaces. A similar measure called case alignment evaluates the competence of different system configurations (Massie et al. 2007). The major difference between case cohesion and case alignment is in the level of granularity; cohesion uses the number of neighbours common to the problem and solution spaces but alignment uses similarity values of such neighbours. In contrast, global alignment quantifies the overall alignment of all cases
by analysing similarity between problem and solution clusters to determine casebase complexity for alternative configurations (Mudambi-Ananthasayanam, Wiratunga, Chakraborti, Massie & Khemani 2008, Mudambi-Ananthasayanam, Chakraborti & Khemani 2009).
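The difference in granularity between the two neighbourhood-based measures can be illustrated with a minimal sketch. It is not the formulation of the cited authors: it assumes cohesion is the overlap (intersection over union) of a case's k nearest neighbours in the problem and solution spaces, and that alignment is the solution similarity of the k nearest problem neighbours weighted by their problem similarity. The function names, the similarity matrices and the choice of k are all hypothetical.

```python
# Hypothetical pairwise similarities for four cases (assumed values in [0, 1]).
PROBLEM_SIM = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
SOLUTION_SIM = [
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.3, 0.1, 0.7, 1.0],
]


def nearest_neighbours(sim_row, index, k):
    """Indices of the k cases most similar to case `index`, excluding itself."""
    candidates = [j for j in range(len(sim_row)) if j != index]
    # Sort by descending similarity; ties are broken by the lower index.
    candidates.sort(key=lambda j: (-sim_row[j], j))
    return candidates[:k]


def case_cohesion(problem_sim, solution_sim, index, k):
    """Assumed form: overlap of problem-side and solution-side neighbour sets."""
    p = set(nearest_neighbours(problem_sim[index], index, k))
    s = set(nearest_neighbours(solution_sim[index], index, k))
    union = p | s
    return len(p & s) / len(union) if union else 0.0


def case_alignment(problem_sim, solution_sim, index, k):
    """Assumed form: solution similarity weighted by problem similarity."""
    neighbours = nearest_neighbours(problem_sim[index], index, k)
    total_weight = sum(problem_sim[index][j] for j in neighbours)
    if total_weight == 0:
        return 0.0
    weighted = sum(
        problem_sim[index][j] * solution_sim[index][j] for j in neighbours
    )
    return weighted / total_weight


if __name__ == "__main__":
    print(case_cohesion(PROBLEM_SIM, SOLUTION_SIM, 0, 2))
    print(case_alignment(PROBLEM_SIM, SOLUTION_SIM, 0, 2))
```

For case 0 with k = 2, the problem neighbours are cases 1 and 2 while the solution neighbours are cases 1 and 3, so the set-based cohesion only registers one shared neighbour out of three. The alignment score additionally reflects how similar the solutions of the problem neighbours actually are.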
These three casebase complexity metrics (i.e. case cohesion, case alignment and global alignment) were applied to our experimental datasets to determine the suitability of TCBR. The metrics determine their values by approximating the implicit alignment between problems and solutions using case similarity. This is similar to the notion of alignment used in this thesis for determining what to reuse in a textual solution (details in Chapter 3). However, we also propose another method for explicitly aligning problem attributes to some portions of the solution text as discussed in Chapter 4.
2.3.2 Performance evaluation
The need to evaluate natural language texts is common to several research areas in computer science. These areas include (but are not limited to) Information Retrieval (IR) (Lenz 1998b, Baeza-Yates & Ribeiro-Neto 1999), TCBR (Brüninghaus & Ashley 2005, Weber et al. 2006), Natural Language Generation (Sripada, Reiter & Hawizy 2005, Belz 2005) and Machine Translation (White & Connell 1994, Hovy 1999). Generally, we can divide text evaluation techniques into two broad categories: qualitative and quantitative.
Qualitative techniques involve user trials (experts and non-experts) to determine the quality of some text produced by a machine. The resulting user feedback is then aggregated using statistical methods to judge the average quality of such texts. The major disadvantages are that these techniques are very expensive, especially when expert knowledge is required, and that results are not exactly reproducible because human judgement is subjective. Nevertheless, qualitative techniques have been used for evaluation across many application domains involving natural language processing and generation. For example, Sripada et al. (2005), Belz & Reiter (2006), Zhang et al. (2008), DeMiguel et al. (2008) and Hanft et al. (2008) all report experimental results using qualitative evaluation techniques.
On the other hand, quantitative techniques involve the comparison of machine texts to one or more gold standards written by humans (usually experts). Here quality of the
method is gauged according to similarity at the syntactic or semantic level. Quantitative techniques are typically less reliable than qualitative ones, as most of them depend on finding matching string patterns between the machine-produced texts and gold standards. However, such techniques can be automated, are less expensive and are easily reproducible. This also allows for easy comparison across several algorithms that are designed for the same purpose.
Precision and Recall are two basic quantitative metrics (Brüninghaus & Ashley 1998a, Baeza-Yates & Ribeiro-Neto 1999, Lamontagne & Lapalme 2004) widely used for text evaluation across several disciplines, especially IR and TCBR. The basic idea is to regard a piece of text as a bag of (key)words and to count common words between the machine and human texts. The proportions of these common words in the machine and human texts give precision and recall respectively. A major drawback is that the sequence of words in a piece of text is ignored, even though word order affects both grammaticality and meaning. In other words, a machine text with high precision and recall might not necessarily be grammatically and/or semantically correct.
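The bag-of-words computation described above can be sketched as follows. This is a minimal illustration, not the exact procedure of the cited works: it assumes lower-casing, whitespace tokenisation and counting each distinct word once, and the function name and example sentences are hypothetical.

```python
def precision_recall(machine_text, human_text):
    """Bag-of-words precision and recall of a machine text against a human text."""
    # Assumption: lower-case, split on whitespace, treat each text as a set of words.
    machine_words = set(machine_text.lower().split())
    human_words = set(human_text.lower().split())
    common = machine_words & human_words
    # Precision: share of the machine text's words that also occur in the human text.
    precision = len(common) / len(machine_words) if machine_words else 0.0
    # Recall: share of the human text's words that the machine text recovered.
    recall = len(common) / len(human_words) if human_words else 0.0
    return precision, recall


if __name__ == "__main__":
    human = "patient fell out of the bed onto floor"
    print(precision_recall("the patient fell from bed", human))   # (0.8, 0.5)
    # Scrambling the word order leaves both scores unchanged.
    print(precision_recall("bed the from fell patient", human))   # (0.8, 0.5)
```

The second call makes the drawback concrete: an ungrammatical reordering of the machine text receives exactly the same precision and recall as the original.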
The edit distance, also called Levenshtein distance (Levenshtein 1966), has also been used for text evaluation; for example in (Belz 2005). This technique takes the sequence of words into account and is calculated in the simplest form as the number of delete, insert and substitute operations required to change the machine text into its human solution equivalent. Typically, different costs are associated with each of these edit operations. Nevertheless, the edit distance can be misleading because the same piece of text can be written in several ways without loss of meaning. In particular, longer machine texts will be unfairly penalised by this technique.
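A word-level edit distance with configurable operation costs can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration under assumed unit costs; the function name, parameter names and example sentences are hypothetical and are not taken from the cited work.

```python
def edit_distance(machine_words, human_words,
                  delete_cost=1, insert_cost=1, substitute_cost=1):
    """Minimum total cost of turning the machine word sequence into the human one."""
    rows, cols = len(machine_words) + 1, len(human_words) + 1
    # dist[i][j]: cost of turning the first i machine words into the first j human words.
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * delete_cost      # delete every remaining machine word
    for j in range(1, cols):
        dist[0][j] = j * insert_cost      # insert every remaining human word
    for i in range(1, rows):
        for j in range(1, cols):
            same = machine_words[i - 1] == human_words[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + delete_cost,
                dist[i][j - 1] + insert_cost,
                dist[i - 1][j - 1] + (0 if same else substitute_cost),
            )
    return dist[-1][-1]


if __name__ == "__main__":
    machine = "the patient fell from bed".split()
    human = "the patient fell out of bed".split()
    print(edit_distance(machine, human))  # 2: substitute "from" -> "out", insert "of"
```

Because the score is an unnormalised count of operations, a long machine text that paraphrases the reference accumulates more edits than a short one, which is one way the length penalty noted above arises.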
We used the standard TCBR evaluation metrics of precision and recall in most of our empirical evaluations. However, we found that these metrics were at times inadequate and misleading, as they do not correlate well with human qualitative judgements. Therefore, one of the contributions of this thesis is the application of some machine translation (MT) evaluation metrics to TCBR. These evaluation metrics have been shown to correlate highly with human judgements and are used widely for empirical evaluation not just in MT but in other fields such as natural language generation and text