7. Conclusion and future work
7.2 Contributions
This section revisits the contributions of this thesis, which were presented in Section 1.3. The research has made three notable contributions: one major and two minor.
The major contribution of this thesis is the use of NLP techniques, particularly text segmentation, to produce a hierarchical structure from text documents, and the use of this structure by a content-supply service to enhance content discoverability and reusability for adaptive systems. To build a structure from text documents, this thesis proposed two novel hierarchical text segmentation algorithms based on the semantic representation of content: OntoSeg and C-HTS. OntoSeg measures the semantic similarity between text segments using an ontology and applies a Hierarchical Agglomerative Clustering algorithm to build a hierarchical structure of the text. Evaluation results demonstrated that, although OntoSeg is able to produce a hierarchical structure of text based on its semantic representation, it did not perform well against state-of-the-art approaches. These findings indicated that the performance of OntoSeg could be improved through a better understanding of the text, namely by measuring the semantic relatedness between text blocks rather than their semantic similarity.
As a result, the C-HTS algorithm was proposed. C-HTS uses the explicit semantic representation of text to measure the semantic relatedness between text blocks. It represents the meaning of a piece of text as a weighted vector of knowledge concepts automatically extracted from Wikipedia, a massive repository of human knowledge. Like OntoSeg, C-HTS represents the content of a single document as a tree-like hierarchy. Evaluation results showed that C-HTS outperformed state-of-the-art approaches on two datasets designed specifically for the evaluation of hierarchical text segmentation. The results also demonstrated that using semantic relatedness in C-HTS yielded a better hierarchical structure of text than using the semantic similarity employed by OntoSeg.
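The general idea summarised above, representing each text block as a weighted concept vector and merging the most related adjacent blocks bottom-up into a tree, can be illustrated with a minimal sketch. This is not the OntoSeg or C-HTS implementation: the concept vectors are hand-made rather than derived from Wikipedia, cosine is used as a stand-in relatedness measure, and merging vectors by summing weights is an assumption made for illustration. The names `Node`, `cosine`, `merge_vectors` and `build_hierarchy` are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node in the segment tree; leaves are the original text blocks."""
    start: int                      # index of the first block covered
    end: int                        # index of the last block covered (inclusive)
    vector: dict                    # concept -> weight (illustrative representation)
    children: list = field(default_factory=list)


def cosine(u, v):
    """Cosine between two sparse concept vectors stored as dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def merge_vectors(u, v):
    """Combine two concept vectors by summing weights (an assumption)."""
    merged = dict(u)
    for concept, weight in v.items():
        merged[concept] = merged.get(concept, 0.0) + weight
    return merged


def build_hierarchy(block_vectors):
    """Agglomeratively merge the most related ADJACENT blocks into a tree."""
    if not block_vectors:
        raise ValueError("at least one text block is required")
    nodes = [Node(i, i, vec) for i, vec in enumerate(block_vectors)]
    while len(nodes) > 1:
        # Pick the adjacent pair with the highest relatedness score.
        best = max(
            range(len(nodes) - 1),
            key=lambda i: cosine(nodes[i].vector, nodes[i + 1].vector),
        )
        left, right = nodes[best], nodes[best + 1]
        parent = Node(
            left.start,
            right.end,
            merge_vectors(left.vector, right.vector),
            [left, right],
        )
        nodes[best:best + 2] = [parent]
    return nodes[0]


# Four toy blocks: two about pets, two about finance.
blocks = [
    {"cat": 1.0, "pet": 1.0},
    {"cat": 1.0, "dog": 1.0},
    {"stock": 1.0, "market": 1.0},
    {"market": 1.0, "bank": 1.0},
]
root = build_hierarchy(blocks)
print([(c.start, c.end) for c in root.children])  # [(0, 1), (2, 3)]
```

Restricting merges to adjacent blocks keeps every node a contiguous span of the document, which is what makes the resulting tree usable as a segmentation rather than a general clustering.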
This thesis also presented a novel content-supply service named CROCC. CROCC harvests content resources from open and closed corpora in their native form and builds a structure out of each content resource without relying on its original structure. CROCC utilises the C-HTS algorithm to build a structure of the harvested content resources based on their semantic representation. Using this structure, the service delivers content slices that best match the needs and requirements of individual adaptive systems. The aim of CROCC is to enhance content discoverability and reusability for adaptive systems. This thesis also presented a task-based experiment to evaluate the extent to which the CROCC service can enhance the discovery and reuse of content for adaptive systems. The main focus of this experiment was to evaluate the quality of the content slices produced by the CROCC service according to the specific requirements of a content request that might be sent by an adaptive system. The experiment focused on a specific application area and a specific subject area. Content resources were collected from closed and open corpora in the specified subject area. A baseline system was developed so that its performance could be compared against CROCC, and an evaluation system was built to present the content slices produced by each system to the participants for evaluation. Experimental results demonstrated that users strongly preferred the slices produced by CROCC to those produced by the baseline system.
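To make the notion of delivering a content slice from a document hierarchy more concrete, the following sketch shows one hypothetical way a service could choose a slice: it walks the hierarchy and returns the node whose concept vector best matches a request while fitting a word budget. This is an illustration only and does not describe how CROCC is implemented; the `Segment` type, the `select_slice` function, the cosine scoring and the `max_words` constraint are all assumptions.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Segment:
    """A node of a document hierarchy (all fields are illustrative)."""
    text: str
    concepts: dict                  # concept -> weight
    children: list = field(default_factory=list)


def cosine(u, v):
    """Cosine between two sparse concept vectors stored as dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def iter_segments(root):
    """Yield every node of the hierarchy, parents before children."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(reversed(node.children))


def select_slice(root, request_concepts, max_words):
    """Return the best-matching segment within the word budget, or None."""
    candidates = [
        seg for seg in iter_segments(root)
        if len(seg.text.split()) <= max_words
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda seg: cosine(request_concepts, seg.concepts))


# A toy two-level hierarchy: one document with two child segments.
cats = Segment("Cats are popular pets.", {"cat": 1.0, "pet": 1.0})
dogs = Segment("Dogs need daily walks.", {"dog": 1.0, "pet": 1.0})
doc = Segment(
    cats.text + " " + dogs.text,
    {"cat": 1.0, "dog": 1.0, "pet": 2.0},
    [cats, dogs],
)

best = select_slice(doc, {"dog": 1.0}, max_words=5)
print(best.text)  # Dogs need daily walks.
```

Because every level of the hierarchy is a candidate, the same request can be answered with a short leaf segment or a larger parent segment depending on the budget, which is the kind of flexibility a hierarchical structure offers over a flat segmentation.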
A minor contribution of this thesis is the concept space that was built from Wikipedia for the purpose of this research. The concept space was built from a Wikipedia snapshot (April 2017) for use in the explicit semantic analysis of text within C-HTS. This concept space is publicly available (https://goo.gl/JZhEvm) and can be used by researchers working on tasks related to explicit semantic analysis. Another minor contribution is the implementation of the two hierarchical text segmentation algorithms proposed in this thesis, OntoSeg and C-HTS. Both implementations have been open-sourced and made publicly available (OntoSeg: https://github.com/bayomim/OntoSeg; C-HTS: https://github.com/bayomim/C-HTS).
The contributions of this research have also resulted in the following academic publications:
Bayomi, M., Levacher, K., Ghorab, M.R., and Lawless, S. "OntoSeg: a Novel Approach to Text Segmentation using Ontological Similarity". In the Proceedings of the 5th ICDM Workshop on Sentiment Elicitation from Natural Text for Information Retrieval and Extraction (ICDM SENTIRE), held in conjunction with the IEEE International Conference on Data Mining (ICDM 2015). Nov 14th, 2015. Atlantic City, NJ, USA.
Bayomi, M. and Lawless, S. "C-HTS: A Concept-based Hierarchical Text Segmentation approach". In the Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA).
Bayomi, M. "A Framework to Provide Customized Reuse of Open Corpus Content for Adaptive Systems". In the Proceedings of the 26th ACM Conference on Hypertext & Social Media (HT ’15). Northern Cyprus, pp. 315–318. ACM, 2015.
Additionally, a publication describing the CROCC service and its evaluation (detailed in Chapter 5 and Chapter 6) is underway and will target the ACM Hypertext conference.