3.5 From What to Reuse to How to Reuse
4.1.2 Alignment Link Generation
Textual solutions must be parsed into suitable chunks before text alignment can take place. A chunk can be a clause, sentence, paragraph or even a section of several paragraphs, depending on the average size of the textual solutions and the typical style of writing in the domain. For instance, if each solution text contains several paragraphs, it would be more reasonable to align each structured attribute to specific paragraphs rather than sentences. We approximate the alignment between each structured problem attribute and the textual solution chunks using the generated seed lists. A chunk of text is aligned to a specific structured attribute if any of the attribute’s seeds occur in the text. In other words, the presence of any term from a structured attribute’s seed list in a chunk of text indicates some relationship (alignment) between the text and the problem attribute.
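The alignment rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function names, the exact-token matching (no stemming) and the seed lists are all assumptions made for the example.

```python
import re

def tokenize(text):
    """Lower-case the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def align_chunks(chunks, seed_lists):
    """Map each attribute to the indices of the chunks containing any of its seeds."""
    links = {attr: [] for attr in seed_lists}
    for idx, chunk in enumerate(chunks):
        tokens = tokenize(chunk)
        for attr, seeds in seed_lists.items():
            if tokens & seeds:  # at least one seed occurs in the chunk
                links[attr].append(idx)
    return links

# Hypothetical seed lists; real ones would come from the seed generation step.
seed_lists = {
    "location": {"location", "walking"},
    "service": {"staff", "desk"},
    "room": {"room", "bed"},
}
chunks = [
    "Location of hotel is perfect.",
    "Hotel staff are efficient and helpful.",
    "Airport transfer right to doorsteps.",
]
links = align_chunks(chunks, seed_lists)
```

Here the first chunk is linked to location, the second to service, and the third remains unaligned because it contains none of the assumed seeds.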
For our hotel review dataset, we worked at the sentence level of text granularity. Each review text is parsed into sentences before text alignment takes place. A review sentence is aligned to a particular rating attribute if it contains any of the terms in that attribute’s seed list. This is illustrated in Figure 4.1, which shows a review and the alignment of its sentences to the five rating attributes. Here, seeds within each sentence are shown in bold and six (6) of the nine (9) sentences are aligned to the pre-defined ratings. Notice how most of the aligned sentences are semantically related to their ratings. For example, sentence 1 is about the proximity of the hotel to the rail station and is correctly aligned to the location rating. However, sentence 5 might be better aligned to the location rating than the service rating, since it highlights the hotel’s proximity to restaurants and local shops. The unaligned sentences (i.e. 2, 3 & 4) seem related to
4.1. Text Alignment 73
location but were not linked because they contain none of the seeds generated by our method, as explained in Section 4.1.1. Relevant terms in these sentences such as ‘shop’, ‘airport’ and ‘train’ can be manually added to the seed list. More generally, additional seeds can be detected manually by inspecting the alignment links across a small random sample of cases, which improves the alignment process. The alignment generation in this example seems reasonable with about 66.7% accuracy; that is, 6 out of 9 sentences were correctly aligned. Unaligned review sentences might be viewed as verbose details (sometimes unrelated to the hotel) that cannot be easily reused without alignment evidence. For example, not every hotel will be in a town with an airport, which means that sentence 3 might not be very useful to authors giving feedback on such a hotel. Nevertheless, seed generation might be improved further with access to domain ontologies, which are currently not available for this domain.

[Figure 4.1 links each sentence of the review below to the rating attributes it is aligned to. Labels: Review sentences (1 to 9); Rating attributes (Cleanliness, Location, Room, Service, Value).]

1. Location of hotel is perfect, within walking distant to the main JR station, subway metro, there is a station just next to the hotel.
2. For shoppers Takashimaya is just across the bridge!
3. Airport transfer right to doorsteps.
4. Food, shoppings and train/subway stations are within 5 to 10 mins walk.
5. 5 mins walk to this electric street that not only sell all electrical appliance but with resturants that the locals frequent, that serve very nice and reasonable cheap Japanes dishes.
6. Hotel staff are efficient and helpful and especially the front desk staff speaks very good english.
7. Intenet access in the room is superb, shampoo, conditioner and body wash come in family size bottles, fantastic!
8. The only minus point is the standard queen bed room has got no cupboard, its better of choosing the standard double bed room.
9. But on the whole its a very clean, comfortable and safe hotel; we would rate it 9 out of 10.

Figure 4.1: Text alignment in hotel reviews authoring domain
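The effect of manually extending a seed list can be illustrated with the three unaligned sentences of the example review. The sketch below is hedged: the initial location seed list is hypothetical, and prefix matching is assumed as a crude stand-in for stemming so that a seed such as ‘shop’ also matches ‘shoppers’ and ‘shoppings’.

```python
import re

def matches(sentence, seeds):
    """True if any token of the sentence starts with any seed.

    Prefix matching is an assumption made for this sketch; it approximates
    stemming only roughly and can over-match on short seeds.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return any(tok.startswith(seed) for tok in tokens for seed in seeds)

# Sentences 2, 3 and 4 of the example review.
unaligned = [
    "For shoppers Takashimaya is just across the bridge!",
    "Airport transfer right to doorsteps.",
    "Food, shoppings and train/subway stations are within 5 to 10 mins walk.",
]

# Hypothetical automatically generated seed list for the location rating.
location_seeds = {"location"}
before = [matches(s, location_seeds) for s in unaligned]

# Seeds added manually after inspecting the unaligned sentences.
extended_seeds = location_seeds | {"shop", "airport", "train"}
after = [matches(s, extended_seeds) for s in unaligned]
```

With the assumed initial list none of the three sentences is linked to location, whereas all three are linked once the manual seeds are added.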
The importance of seed quality cannot be over-emphasized, as it has a direct influence on text alignment. Alignment accuracy is expected to vary when different subsets of the seed list are used. Such subsets can be created by separating the seeds generated for the same structured attribute according to the knowledge source they came from. This makes it possible to determine which knowledge source gives seeds of the best quality, and this is likely to vary across domains.
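A comparison of seed subsets by knowledge source might be sketched as follows. Everything here is illustrative: the source names, the seed lists and the gold-standard alignments are hypothetical, and accuracy is assumed to be the fraction of sentences whose predicted set of attributes equals the manually judged set.

```python
import re

def align(sentences, seed_lists):
    """Return, for each sentence, the set of attributes whose seeds it contains."""
    result = []
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        result.append({attr for attr, seeds in seed_lists.items() if tokens & seeds})
    return result

def accuracy(predicted, gold):
    """Fraction of sentences whose predicted attribute set equals the gold set."""
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)

sentences = [
    "Hotel staff are efficient and helpful.",
    "Airport transfer right to doorsteps.",
    "Internet access in the room is superb.",
]
# Hypothetical manual judgements, one attribute set per sentence.
gold = [{"service"}, {"location"}, {"room"}]

# Hypothetical seed lists, separated by the knowledge source that produced them.
seeds_by_source = {
    "source_a": {"service": {"staff"}, "location": {"location"}, "room": {"room"}},
    "source_b": {"service": {"helpful"}, "location": {"airport"}, "room": {"room", "bed"}},
}

scores = {
    source: accuracy(align(sentences, seed_lists), gold)
    for source, seed_lists in seeds_by_source.items()
}
best_source = max(scores, key=scores.get)
```

In this toy setting the seeds from source_b align all three sentences as judged, while those from source_a miss the airport sentence.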
Alignment in TCBR can be viewed as a many-to-many relationship, since a sentence can be aligned to more than one rating and a rating to more than one sentence. This is illustrated in Figure 4.1, where sentences 7 & 8 are both linked to the room rating, whereas sentence 9 is linked to both the cleanliness and value ratings. A many-to-many relationship implies that the same sentence might be selected more than once when a proposed solution is assembled from the sentences of the best-matching attribute values for a new problem. The same applies if the relationship from sentences to rating attributes is one-to-many, but not if each sentence is aligned to only one rating attribute, that is, in one-to-one or many-to-one relationships.

A simple remedy is to use a data structure (e.g. a set) that does not store duplicate sentences of the proposed solution, or to remove such duplicates after solution assembly. Another remedy, albeit less desirable, is to enforce that each sentence is aligned to only one rating attribute during alignment generation using a sensible heuristic. One heuristic is to parse any sentence aligned to multiple attributes into clauses where possible. For instance, sentence 9 in Figure 4.1 could be broken into two meaningful clauses using the semi-colon as a delimiter so that each clause is aligned to only one attribute (cleanliness or value). A second heuristic is to align the sentence to whichever of its candidate attributes has no other aligned sentence, or the fewest aligned sentences. A third is to assign the sole alignment to the attribute with the highest number of seeds in the sentence. Ties from any of these heuristics can be resolved by random selection of one of the tied attributes.
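The remedies discussed above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this work: the function names, the strategy labels and the seed lists are hypothetical, and seeds are matched as exact tokens.

```python
import random
import re

def count_seeds(sentence, seeds):
    """Number of tokens in the sentence that are seeds (exact match)."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return sum(tok in seeds for tok in tokens)

def assemble(attr_to_sentences, attributes):
    """Collect the sentences of the given attributes, dropping repeats.

    dict.fromkeys keeps first-seen order, which a plain set would not.
    """
    ordered = [s for attr in attributes for s in attr_to_sentences.get(attr, [])]
    return list(dict.fromkeys(ordered))

def pick_single_attribute(sentence, candidates, attr_to_sentences, seed_lists,
                          strategy="fewest_links", rng=random):
    """Choose one attribute for a sentence that is aligned to several."""
    if strategy == "fewest_links":
        # Prefer the attribute with the fewest other aligned sentences.
        score = {a: -sum(s != sentence for s in attr_to_sentences[a])
                 for a in candidates}
    elif strategy == "most_seeds":
        # Prefer the attribute with the most seeds in the sentence.
        score = {a: count_seeds(sentence, seed_lists[a]) for a in candidates}
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    best = max(score.values())
    tied = sorted(a for a in candidates if score[a] == best)
    return rng.choice(tied)  # ties are resolved by random selection

s9 = ("But on the whole its a very clean, comfortable and safe hotel; "
      "we would rate it 9 out of 10.")
other = "The bathroom was clean."  # hypothetical extra sentence

# Hypothetical seed lists and alignment links.
seed_lists = {"cleanliness": {"clean", "safe"}, "value": {"rate"}}
attr_to_sentences = {"cleanliness": [s9, other], "value": [s9]}

solution = assemble(attr_to_sentences, ["cleanliness", "value"])  # s9 appears once
by_links = pick_single_attribute(s9, ["cleanliness", "value"],
                                 attr_to_sentences, seed_lists)
by_seeds = pick_single_attribute(s9, ["cleanliness", "value"],
                                 attr_to_sentences, seed_lists,
                                 strategy="most_seeds")
clauses = [c.strip() for c in s9.split(";")]  # clause-level split on the semi-colon
```

Under these assumptions the fewest-links heuristic assigns sentence 9 to value, which has no other aligned sentence, whereas the seed-count heuristic assigns it to cleanliness, which has two seeds in the sentence. The heuristics can therefore disagree, so the choice between them should be made deliberately.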