Chapter 9 Semantic Analysis Methods for Text Classification
10.1 Conclusion
The last three chapters (i.e. Chapters 7, 8 and 9) have presented the methodologies developed and adopted to meet the research objectives introduced in Chapter 1. Based on the literature on textual data mining applications in the manufacturing or construction industry, the conceptual development of the theory focused mainly on clustering followed by Apriori association rule mining, which has been widely used in many application areas. Substantial time and effort were therefore devoted in this research to developing and applying these techniques, first for better selection of the number of clusters and then for better classification of textual data using the MKTPKS-based matrix model.
In the first stage, clustering was used to discover the first level of knowledge by finding natural term-based relationships in the textual data. The knowledge discovered in the form of single key term phrases was difficult to interpret and to use in identifying good or bad information documents available in the form of PPRs. A second level of knowledge refinement was therefore carried out by applying Apriori association rule mining to find more useful multiple key term phrasal knowledge sequences (MKTPKS) at varying levels of support. One reason for generating the MKTPKS was to find sequences of terms that co-occur in the documents and so help to identify good or bad information documents. The resulting term sequences are useful for mapping information to a particular document space, and these sequences (i.e. MKTPKS) were then compared with those identified by the domain experts. The F-measure was used to measure accuracy: the result was 37%, with recall better than precision.
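The second-level refinement and its evaluation can be illustrated with a minimal sketch. It assumes that the documents of one cluster are already reduced to lists of key terms, treats each MKTPKS as an unordered set of co-occurring terms, and uses made-up terms, a made-up support threshold and a made-up expert list; the function names are hypothetical and this is not the implementation used in this research.

```python
from itertools import combinations


def frequent_term_sets(docs, min_support, max_size=3):
    """Apriori-style level-wise search for term sets that co-occur in documents."""
    doc_sets = [set(d) for d in docs]
    n = len(doc_sets)

    def support(term_set):
        # Fraction of documents that contain every term of the set.
        return sum(term_set <= d for d in doc_sets) / n

    # Level 1: single key terms that meet the support threshold.
    terms = sorted({t for d in doc_sets for t in d})
    current = [frozenset([t]) for t in terms
               if support(frozenset([t])) >= min_support]
    frequent = {}
    size = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        if size == max_size:
            break
        size += 1
        # Join frequent sets of the previous level, then prune every
        # candidate that has an infrequent subset (the Apriori property).
        prev = set(current)
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        current = [c for c in candidates
                   if all(frozenset(sub) in prev
                          for sub in combinations(c, size - 1))
                   and support(c) >= min_support]
    return frequent


def f_measure(found, expert):
    """Precision, recall and F-measure of found term sets against expert ones."""
    found, expert = set(found), set(expert)
    hits = len(found & expert)
    if hits == 0:
        return 0.0, 0.0, 0.0
    precision = hits / len(found)
    recall = hits / len(expert)
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical key terms of four documents belonging to one cluster.
cluster_docs = [
    ["cost", "delay", "design"],
    ["cost", "delay", "client"],
    ["cost", "design"],
    ["delay", "client"],
]
frequent = frequent_term_sets(cluster_docs, min_support=0.5)
# Keep only the multi-term sets, which play the role of MKTPKS here.
multi_term = [s for s in frequent if len(s) > 1]

# Hypothetical term sets named by a domain expert.
expert_sets = [frozenset({"cost", "delay"}), frozenset({"design", "client"})]
precision, recall, f_score = f_measure(multi_term, expert_sets)
```

Lowering `min_support` admits more candidate term sets, which is how varying the support level changes the number of sequences available for comparison with the expert list.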
The second part of the methodology applied different classification techniques to the discovered MKTPKS-based matrix model. The purpose was to study the effect on the classification accuracies of different classifiers, in order to help knowledge workers or decision makers to better classify the data into their predefined classes. Since the natural relationships had been captured in the form of MKTPKS by applying clustering and Apriori association rule mining, this discovered knowledge was then used for the classification of textual data on the basis of its MKTPKS representation. A significant improvement in classification accuracy was obtained, and these results are shown in Chapter 8.
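As an illustration of the idea, the sketch below represents each document as a binary row that records which MKTPKS it contains, and classifies a new document with a simple distance-based rule. The binary encoding, the 1-nearest-neighbour classifier, the terms and the labels are all assumptions made for this example; the classifiers and matrix construction actually used in this research may differ.

```python
def to_matrix(docs, sequences):
    """One row per document, one 0/1 column per MKTPKS.

    A cell is 1 when every term of the sequence occurs in the document.
    """
    return [[int(seq <= set(doc)) for seq in sequences] for doc in docs]


def hamming(a, b):
    """Number of positions in which two binary rows differ."""
    return sum(x != y for x, y in zip(a, b))


def classify_1nn(row, train_rows, train_labels):
    """Return the label of the closest training row (first one wins on ties)."""
    best = min(range(len(train_rows)),
               key=lambda i: hamming(row, train_rows[i]))
    return train_labels[best]


# Hypothetical MKTPKS, treated as unordered sets of co-occurring terms.
sequences = [
    frozenset({"cost", "delay"}),
    frozenset({"delay", "client"}),
    frozenset({"cost", "design"}),
]

# Hypothetical labelled documents (key terms only) and their classes.
train_docs = [
    ["cost", "delay", "design"],
    ["cost", "delay", "client"],
    ["cost", "design"],
    ["delay", "client"],
]
train_labels = ["bad", "bad", "good", "bad"]

train_matrix = to_matrix(train_docs, sequences)

# A new, unclassified document is mapped into the same MKTPKS space.
new_row = to_matrix([["design", "cost", "budget"]], sequences)[0]
predicted = classify_1nn(new_row, train_matrix, train_labels)
```

Any classifier that accepts a fixed-length feature vector could replace `classify_1nn` here, since the matrix representation is independent of the classification rule.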
A novel aspect of this research is the discovery of knowledge in the form of single or multiple key term phrases, which were used to uncover the relationships among terms in the textual data of PPRs. These natural relationships could be used to improve business intelligence solutions, since this research provides a means of reducing misclassification errors. The lower the error, the higher the classification accuracy, and the more effectively the knowledge stored in textual databases can be used to find solutions to new, unclassified problems. The work presented in Chapters 7 and 8 is a novel integration of several techniques that gave good results, so the approach might also be applied to other textual databases available as free-format text.
Another major advantage of the proposed methodology and its implementation is that it helps to find term-based relationships. This would benefit industry by allowing information to be stored as clusters in which the natural relationships among terms are preserved. New information could then easily be compared with the previously stored clusters, and its analysis would be easier because it could be placed into different information subspaces. The analysis of a large corpus of textual data was made easier by first dividing the information into multiple subspaces, which reduces the effort required for analysis.
This research also concludes that the selection of the most appropriate data mining technique for classifying textual data depends on the information selection criteria, which vary from simple distance measures to probabilistic methods. The choice of classifier therefore also depends on the form and quality of the available data, which help the classifier to derive its rules for classification.
The main contributions of this research are as follows:
• Development of a generic method for discovering useful knowledge in the form of single key term phrases specific to key issues discussed in the textual databases.
• Generation of multiple key term phrases from the single key term phrase sequences within each cluster, to produce more valuable knowledge sequences, and mapping of this information to specific sets of documents as good or bad information documents.
• A novel integration of methods for generating multiple key term phrasal knowledge sequences, used to reduce the classification error when compared to simple single term representation methods.
• A novel integrated approach to the text classification task that moves from unsupervised learning to supervised learning, identifying key knowledge areas in textual databases and then classifying documents. This technique is a hybridization of different methods and techniques which ultimately supports the generation of useful classification results for textual data.
• A proposed novel integration of textual data mining techniques to capture key information or knowledge and disseminate it through the classification of textual data within any industrial setup.