4.2 Reuse Algorithms for Text Authoring
4.2.3 Compositional/Constructive Text Reuse
Here, a textual solution is assembled in response to a query by combining chunks of text from several similar cases. It is called compositional (COMP) or constructive text reuse because of its similarity to CBR’s compositional (Chang et al. 2004, Bentebibel & Despres 2006) or constructive (Plaza & Arcos 2002) adaptation, where a solution is obtained by combining solution elements of several partially similar cases. An important assumption for this approach is that we consider chunks of text to be contextually similar if they are aligned to the same problem attribute and have identical attribute values. For example, in our hotel reviews domain, all chunks of text (sentences) aligned to a cleanliness rating of 3 are regarded as similar. These similar chunks of text can be extracted from the
query’s k nearest neighbours (COMP_k) with k ≤ n, where n is the size of the casebase. The maximum value of k therefore corresponds to using all cases in the casebase, which we denote COMP_N.
Algorithm 4.3 Algorithm to determine a Prototypical text chunk
Require: S={s1, . . . , sn}, similar text chunks for which a prototype is to be extracted
1: KW={kw1, . . . , kwm}, to contain set of keywords in all text chunks
2: for each si ∈ S do
3:   keywords = getKeywords(si)
4:   add(keywords, KW)
5: end for
6: V={v1, . . . , vn} where length(vi) = size(KW) = m, to contain term frequency vectors for all text chunks
7: for each si ∈ S do
8:   keywords = getKeywords(si)
9:   vi = createVector(keywords, KW)
10: end for
11: cv = cv[1] . . . cv[m] where length(cv) = size(KW), to contain centroid vector
12: for j = 1 to m do
13:   sum ← 0
14:   for each vi ∈ V do
15:     sum ← sum + vi[j]
16:   end for
17:   cv[j] ← sum ÷ n
18: end for
19: maxSim ← -1, similarity of best match to centroid
20: maxIndex ← -1, index of best match to centroid
21: for each vi ∈ V do
22:   sim ← getSimilarity(vi, cv)
23:   if sim > maxSim then
24:     maxSim ← sim
25:     maxIndex ← i
26:   end if
27: end for
28: return s_maxIndex
We introduce a method that combines several similar chunks of text into a single meaningful and equivalent chunk of text named the prototype or prototypical chunk of text. Aggregating several pieces of similar text chunks into a single meaningful prototype is not trivial. In our methodology, concatenation will be inappropriate since it leads to tautology. This is because the text chunks will be expressing very similar or identical opinions about the same thing. Such concatenated text will be repetitive, unintuitive
and boring to a new author. Summarization methods would have been ideal for this task, but they are generally applied to identify the central theme of textual content from a single author. The central theme of our text chunks, which is what makes them similar, is already known but is expressed using varying lexical forms by different authors. Summarization techniques would also be inefficient, because deep natural language processing capabilities are typically required (Mitra, Singhal & Buckley 1997). The method we propose here uses the same idea as extractive summary generation (Sparck-Jones 1999, Neto et al. 2002), where a subset of the sentences of the original text is identified as the central theme. Our text chunks already have an identical central theme, so we instead select the representative whose syntactic construct is most generic and therefore most easily reusable by other authors. Another similarity to extractive summaries is that our prototypes do not guarantee good narrative coherence when assembled for different problem attributes, but they are sufficient as a starting point for feedback text generation, since the author can edit them when required.
Algorithm 4.3 lists the pseudo-code for generating a prototypical text chunk. The algorithm takes a group of similar text chunks, S, and returns one of them as the representative or prototype. Prototypes are generated by first creating a term frequency vector (vi) for each similar chunk of text; see lines 7-9 of the algorithm listing. The length of each vector is the number (m) of unique keywords in all similar text chunks for which a prototype is being determined. We then compute a centroid vector, cv, which consists of the average value across the cells representing the same keyword in each vector. Lines 11-18 give details of how the centroid is computed. The prototype is the chunk of text whose term vector is most similar to the centroid vector, as indicated on lines 19-28 of the algorithm. Intuitively, a prototype will contain the keywords commonly used across all similar chunks of text, because the values of such keywords in the prototype’s vector will be closer to the average. It should be noted that other term weighting functions, such as binary or normalised term frequency, can also be used to create the vectors.
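The steps of Algorithm 4.3 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the helper get_keywords (whitespace tokenisation with a small stop list) and the use of cosine similarity for getSimilarity are assumptions made for the example, since the listing does not fix either choice.

```python
import math
from collections import Counter

# Assumed stop list; the original getKeywords is not specified in the listing.
STOP_WORDS = {"the", "was", "a", "and", "is", "very"}


def get_keywords(chunk):
    """Hypothetical stand-in for getKeywords: lowercased tokens minus stop words."""
    words = [w.strip(".,!?").lower() for w in chunk.split()]
    return [w for w in words if w and w not in STOP_WORDS]


def cosine(u, v):
    """Assumed similarity measure between a term vector and the centroid."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prototype(chunks):
    """Return the chunk whose term frequency vector is closest to the centroid."""
    if not chunks:
        raise ValueError("at least one text chunk is required")
    # Lines 1-5: collect the keyword set KW over all chunks.
    vocab = sorted({kw for s in chunks for kw in get_keywords(s)})
    # Lines 6-10: one term frequency vector per chunk.
    vectors = []
    for s in chunks:
        counts = Counter(get_keywords(s))
        vectors.append([counts[kw] for kw in vocab])
    # Lines 11-18: centroid is the per-keyword average over all vectors.
    n = len(vectors)
    centroid = [sum(v[j] for v in vectors) / n for j in range(len(vocab))]
    # Lines 19-28: keep the first vector with the highest similarity.
    best_index, best_sim = -1, -1.0
    for i, v in enumerate(vectors):
        sim = cosine(v, centroid)
        if sim > best_sim:
            best_index, best_sim = i, sim
    return chunks[best_index]
```

Because the comparison is strict, ties are resolved in favour of the earliest chunk, which mirrors the `sim > maxSim` test in the listing.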
The generation of prototypes or prototypical sentences for hotel review authoring is illustrated in Figure 4.3. Aligned sentences across the specified reviews (local or global)
[Figure: aligned sentences from all casebase reviews (COMP_N) or from the k-NN reviews (COMP_k) are clustered by rating attribute (C: cleanliness, L: location, R: room, S: services, V: value) and then by rating value (1..5); the sentence closest to each cluster centroid is selected as the prototypical sentence.]
Figure 4.3: Generating prototypical sentences in hotel reviews
are grouped into five natural clusters which map directly to the five rating attributes. Each cluster is then further divided into five groups by rating value (i.e. 1 to 5). The smaller group of clusters shown for the value rating attribute also applies to the other four attributes. The outcome of this clustering process is twenty-five smaller clusters and a prototypical sentence per cluster.
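The two-level grouping described above amounts to keying each aligned sentence by its (attribute, rating value) pair. The following is a minimal sketch, assuming a hypothetical review representation in which each review is a dictionary with a "ratings" map and an "aligned" map from attribute to sentence; neither structure is prescribed by the text.

```python
from collections import defaultdict


def cluster_aligned_sentences(reviews):
    """Group aligned sentences into clusters keyed by (attribute, rating value).

    Each review is assumed (for illustration only) to look like:
      {"ratings": {"cleanliness": 3, ...}, "aligned": {"cleanliness": "...", ...}}
    With five attributes and ratings 1 to 5 this yields at most 25 clusters.
    """
    clusters = defaultdict(list)
    for review in reviews:
        for attribute, rating in review["ratings"].items():
            sentence = review["aligned"].get(attribute)
            if sentence:  # skip attributes with no aligned sentence
                clusters[(attribute, rating)].append(sentence)
    return dict(clusters)
```

Clusters for which no aligned sentence exists are simply absent from the result, so the number of clusters actually obtained can be smaller than twenty-five for a small set of reviews.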
Algorithm 4.4 shows the compositional text reuse algorithm. This uses the same convention as the baseline and transformational techniques listed in Algorithms 4.1 and 4.2 respectively; that is, CB represents the casebase, Q is a query and Ci is a case in the casebase. This approach is then illustrated in Figure 4.4 with our hotel review authoring domain where the query has five attribute values (p in algorithm): 2, 1, 3, 5, 2 for cleanliness, location, room, service and value ratings respectively. Five sentences are then obtained from the prototypical sentences with identical rating values to the query and aggregated as proposed text (SOLN). In the algorithm, each prototypical sentence is generated from an element in the matrix (G) having p × q elements where each element is a cluster of similar text chunks. The pseudo-code for determining a prototype in our
Algorithm 4.4 Compositional text reuse algorithm (COMP)
1: G = { g11, . . . , g1q
         g21, . . . , g2q
         . . .
         gp1, . . . , gpq },
   set of clustered similar text chunks; each cluster belongs to a pair from p problem attributes and q attribute values
2: CBlocal ← RET(CB, Q, k), retrieve k similar cases; for COMP_N, k ← n the size of CB
3: SOLN = {}, to contain text chunks in the proposed solution text
4: for each Ci ∈ CBlocal (in order of decreasing similarity) do
5:   SolutionText ← getSolutionText(Ci)
6:   for each IEj ∈ Ci do
7:     rj ← attribute(IEj)
8:     vj ← attributeValue(IEj)
9:     gj ← getClusteredSimilarTextChunks(G, rj, vj), i.e. gj ≡ g_{rj vj}
10:    Sj ← selectAlignedTextChunks(rj, SolutionText)
11:    addTextChunks(Sj, gj)
12:  end for
13: end for
14: for each IEk ∈ Q do
15:   rk ← attribute(IEk)
16:   vk ← attributeValue(IEk)
17:   gk ← getClusteredSimilarTextChunks(G, rk, vk)
18:   psk ← getPrototypicalTextChunk(gk)
19:   addTextChunks(psk, SOLN)
20: end for
21: Aggregate all chunks of text in SOLN for reuse
context is listed in Algorithm 4.3 and has been explained earlier. Lines 4-13 of Algorithm 4.4 show how clusters of similar text chunks are populated from cases in the query’s neighbourhood, while lines 14-20 use the clusters to assemble a solution composed of prototypical text chunks. A major difference between COMP_k, which uses solution texts from the neighbours only, and COMP_N, which uses all solution texts in the casebase, is that COMP_k might propose fewer than p text chunks in its solution if no prototype can be generated for one or more attribute values in the query. This can occur when k is small and no case in the neighbourhood has that particular attribute value.
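The behaviour of COMP_k and COMP_N described above can be sketched as follows. This is a sketch under assumptions, not the implementation used in this work: the case representation, the Manhattan-distance retrieval in retrieve (standing in for RET), and the default select_prototype (which simply takes the first chunk as a placeholder for Algorithm 4.3) are all hypothetical choices made to keep the example self-contained.

```python
def retrieve(casebase, query, k):
    """Assumed stand-in for RET: rank cases by Manhattan distance on ratings."""
    def distance(case):
        return sum(abs(case["ratings"][a] - query[a]) for a in query)
    return sorted(casebase, key=distance)[:k]


def compose(casebase, query, k, select_prototype=lambda chunks: chunks[0]):
    """Return the list of prototypical chunks proposed for the query.

    casebase: list of {"ratings": {attr: value}, "aligned": {attr: sentence}}
    query:    {attr: value}
    select_prototype: placeholder for Algorithm 4.3 (assumption: first chunk).
    """
    # Lines 4-13: fill the clusters from the retrieved neighbours only.
    clusters = {}
    for case in retrieve(casebase, query, k):
        for attribute, rating in case["ratings"].items():
            chunk = case["aligned"].get(attribute)
            if chunk:
                clusters.setdefault((attribute, rating), []).append(chunk)
    # Lines 14-20: one prototype per query attribute, when a cluster exists.
    solution = []
    for attribute, rating in query.items():
        chunks = clusters.get((attribute, rating))
        if chunks:
            solution.append(select_prototype(chunks))
    return solution
```

Calling compose with k equal to the casebase size corresponds to COMP_N; with a small k the returned list can be shorter than the number of query attributes, which is the situation discussed above. The proposed text is then obtained by joining the returned chunks, for example with " ".join(solution).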
[Figure: the query ratings (C: 2, L: 1, R: 3, S: 5, V: 2) each select one sentence from the grid of prototypical sentences (rating values 1 to 5 for attributes C, L, R, S, V); the solution is the sequence of cleanliness, location, room, service and value descriptions.]
Figure 4.4: Compositional text reuse with hotel reviews
4.3 Chapter Summary
This chapter introduced two novel concepts in relation to text reuse: text alignment and prototypical text generation. Text alignment links structured problem attributes to specific chunks of a solution text, while prototypical text generation abstracts similar chunks of text into a single meaningful prototype. These concepts are generally applicable in domains where cases consist of pre-defined structured attributes along with written text. We then proposed two novel text reuse techniques and a third retrieve-only baseline that generate proposed solution texts related to the problem attributes. Transformational text reuse progressively changes the solution text from the best match case into a more accurate solution, using texts from other nearest neighbours when the query and its best match are not identical. Compositional text reuse, on the other hand, proposes a solution text constructed by aggregating chunks of text from several similar cases. Although hotel review authoring was used to illustrate our text reuse techniques throughout this chapter, the formalised algorithms have the advantage of being domain-independent and are therefore applicable in any domain containing cases with both pre-defined structured attributes and complementary textual content.
Chapter 5
Evaluating TCBR with Machine Translation Techniques
Evaluation is the key to measuring the capabilities, effectiveness and efficiency of any proposed technique or algorithm. Although qualitative evaluation is generally regarded as the best method of evaluation, automated evaluation methods, where user intervention is minimal, are preferable for initial and intermediate testing during development of any new technique. This is because user evaluations are more expensive, and the results of repeating such evaluations with different users of similar expertise may vary greatly due to subjective user judgements. Automated evaluation methods are therefore more commonly used in research experiments, since they allow inexpensive, faster testing and comparison to current state-of-the-art techniques. Automated evaluation methods are typically encoded as quantitative metrics which give a single numeric value for the evaluation of each test situation.
Our research work proposes novel techniques to aid reuse of textual solutions in Textual Case Based Reasoning (TCBR). This necessitates that we automatically evaluate written natural language (text) proposed or generated by our techniques for its syntactic and semantic correctness. The need for automated text evaluation is not peculiar to TCBR but common to several other disciplines such as Information Retrieval (IR) (Lenz 1998b, Baeza-Yates & Ribeiro-Neto 1999), Text Summarization (Neto et al. 2002, Lin & Hovy 2003),