selection have been used with clustered data, which can be categorised under one of the three algorithms described in Section 2.2.2, such as Spectral Feature Selection (SPEC) in filtering [134], and k-means in clustering [60].
Social networking is one of the newer fields to which feature selection methods have recently been applied. In this context, social networks are online platforms that allow users to contact each other using different types of data such as text, pictures, videos, and hyperlinks; text is the most used of these [58]. Twitter∗, Facebook†, and Instagram‡ are examples of social network platforms in use today. The text in social networks has some special attributes compared to traditional texts, such as time sensitivity, short length, and unstructured phrases [2]. These attributes make the selection of features more challenging. For example, in 2008, Agichtein and his group [3] tried to find high-quality content in Yahoo! Answers§ by introducing a classification method that combines evidence from different sources of information. Another study [59] proposed an algorithm to classify Twitter feeds based on a hybrid approach.
2.4 Summary
In this chapter, the background of text mining and feature selection has been discussed, along with the related research on selecting relevant features and improving the quality of the extracted features. First, knowledge discovery is defined and its typical process explained; then the theory and methodology of text mining, its preprocessing techniques, and its text representation processes are outlined. The next main section is about general feature selection: the process of feature selection is defined, showing its benefits as well as its problems and challenges. Then, a comprehensive review of current feature selection algorithms is presented. Finally, text feature selection, the main focus of this thesis, is described in detail. The relevant text features are defined, and different methods and models are explained and illustrated by different applications of text feature selection.

∗ http://www.twitter.com/
† https://www.facebook.com/
‡ http://www.instagram.com/
§ https://answers.yahoo.com/
All the reviewed methods in the literature seek to select the best features and reduce noise and redundancy in the text data. However, the studies show that noisy and redundant features still exist, and that some important features are missing. To deal with these issues, innovative feature selection models are proposed that examine the relations between extracted features in order to remove the irrelevant features and improve the quality of the extracted features.
The next chapter introduces the first model proposed in this research, the Pattern Co-occurrence Matrix (PCM). The PCM model studies the semantic relations between closed sequential patterns to improve the quality of the patterns and reduce the number of noisy ones.
Chapter 3
Pattern Co-occurrence Matrix (PCM)
Text co-occurrence matrices, such as co-citation, co-word and co-link matrices, can define concepts that occur within the same term in a text [15], and they provide useful information for understanding the structure of documents.
Not all extracted patterns are useful, because they usually contain noisy patterns and inconsistencies introduced by the different data mining processes used to extract them. There are, however, relationships between the patterns in a document based on their appearance in paragraphs.
The co-occurrence matrix method attempts to identify the semantic relationships between these patterns and to determine which of those relationships are important. Figure 3.1 illustrates the directions of the weighted relations between the patterns, based on the pattern co-occurrence matrix. A pattern that has more relations with other patterns should be assigned a higher weight, since it is more important than the others; for example, P1 and P4 in Figure 3.1.
Figure 3.1: Example of pattern relations based on co-occurrence matrix
In this study, the Pattern Co-occurrence Matrix (PCM) is used to find the relationships between patterns in a document and to identify the important ones. We define the co-occurrence matrix as a matrix over a document that describes the co-occurrence relation between its patterns. For example, let $A$ be the $n \times n$ pattern co-occurrence matrix; the element $A_{i,j}$ is then the number of times that pattern $p_j$ occurs after pattern $p_i$ in the paragraphs of the document.
As mentioned in Chapter 2, closed sequential patterns are extracted from documents based on their support and confidence. This study re-evaluates the extracted patterns using the pattern co-occurrence matrix in order to reduce the noisy patterns. Further, the extracted patterns are deployed and the weights of their terms are calculated based on the pattern co-occurrence matrix.
3.1 Calculating Pattern Co-occurrence Matrix (PCM)
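This research applies the PCM on top of the closed sequential patterns, with the aim of removing the noisy patterns that have no relation with other patterns. Let $P = \{p_1, p_2, \ldots, p_n\}$ be a set of closed sequential patterns extracted with a minimum support $min\_sup$ (e.g. $min\_sup = 0.2$ in PTM) from all paragraphs $dp \in PS(d)$ in document $d \in D^+$, where $PS(d) = \{dp_1, dp_2, \ldots, dp_m\}$.

$$
A_{n \times n} =
\begin{array}{c|cccccc}
       & p_1     & p_2     & \cdots & p_j     & \cdots & p_n     \\ \hline
p_1    & A_{1,1} & A_{1,2} & \cdots & A_{1,j} & \cdots & A_{1,n} \\
p_2    & A_{2,1} & A_{2,2} & \cdots & A_{2,j} & \cdots & A_{2,n} \\
\vdots & \vdots  & \vdots  & \ddots & \vdots  & \ddots & \vdots  \\
p_i    & A_{i,1} & A_{i,2} & \cdots & A_{i,j} & \cdots & A_{i,n} \\
\vdots & \vdots  & \vdots  & \ddots & \vdots  & \ddots & \vdots  \\
p_n    & A_{n,1} & A_{n,2} & \cdots & A_{n,j} & \cdots & A_{n,n}
\end{array}
$$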
As shown in matrix $A_{n \times n}$, the pattern co-occurrence matrix $A$ has size $n \times n$, where $n = |P|$ is the number of extracted patterns, and $A_{i,j}$ (read $p_i \rightarrow p_j$) is the number of co-occurrences in which pattern $p_j$ occurs after $p_i$ in the same paragraph.
To calculate the co-occurrence of any two patterns in the matrix, such as $A_{i,j}$, we run over all the document paragraphs $PS(d)$, looking for the two patterns in the same paragraph and in the same order ($p_j$ occurs after $p_i$). The co-occurrence of a given pair of patterns is counted only once in each paragraph. Finally, to calculate the total co-occurrence of pattern $p_i$ in document $d$, we first calculate the total co-occurrence $W_R(p_i)$ for the row and $W_C(p_i)$ for the column as follows:

$$
W_R(p_i) = \sum_{j=1}^{n} A_{i,j} \tag{3.1}
$$

where $W_R(p_i)$ is the total row co-occurrence of $p_i$.
$$
W_C(p_i) = \sum_{j=1}^{n} A_{j,i} \tag{3.2}
$$

where $W_C(p_i)$ is the total column co-occurrence of $p_i$.
The total co-occurrence for pattern $p_i$ is then:

$$
W_R(p_i) + W_C(p_i)
$$
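As a small worked illustration of the row and column totals, consider a hypothetical matrix for $n = 3$ patterns. The values below are invented for illustration only and are not taken from the experiments in this thesis.

$$
A =
\begin{pmatrix}
0 & 2 & 1 \\
1 & 0 & 1 \\
0 & 1 & 0
\end{pmatrix}
$$

$$
\begin{aligned}
W_R(p_1) &= 0 + 2 + 1 = 3, & W_C(p_1) &= 0 + 1 + 0 = 1, & W_R(p_1) + W_C(p_1) &= 4 \\
W_R(p_2) &= 1 + 0 + 1 = 2, & W_C(p_2) &= 2 + 0 + 1 = 3, & W_R(p_2) + W_C(p_2) &= 5 \\
W_R(p_3) &= 0 + 1 + 0 = 1, & W_C(p_3) &= 1 + 1 + 0 = 2, & W_R(p_3) + W_C(p_3) &= 3
\end{aligned}
$$

In this example $p_2$ has the largest total co-occurrence, so it would be treated as the pattern with the most relations to the others.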
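Finally, considering the length of the documents, we normalise the total co-occurrence of a pattern (PCM) as follows:

$$
PCM(p_i) = \frac{W_R(p_i) + W_C(p_i)}{n \times m} \tag{3.3}
$$

where $n = |P|$ is the number of extracted patterns and $m = |PS(d)|$ is the number of paragraphs in document $d$.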
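The calculation in Equations 3.1 to 3.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this research. It assumes that each paragraph has already been reduced to the ordered list of indices of the patterns found in it (a hypothetical preprocessing step), that patterns are indexed from 0 rather than 1, and that the diagonal of the matrix is left at zero. The function names `cooccurrence_matrix` and `pcm_weights` are illustrative.

```python
def cooccurrence_matrix(paragraphs, n):
    """Build the n x n matrix A, where A[i][j] counts the paragraphs in
    which pattern j occurs somewhere after pattern i.

    paragraphs: list of paragraphs, each an ordered list of pattern
                indices (assumed output of a pattern-matching step).
    """
    A = [[0] * n for _ in range(n)]
    for para in paragraphs:
        first, last = {}, {}
        for pos, p in enumerate(para):
            first.setdefault(p, pos)  # earliest position of pattern p
            last[p] = pos             # latest position of pattern p
        for i in first:
            for j in last:
                # Count each ordered pair at most once per paragraph.
                # Assumption: a pattern is not paired with itself.
                if i != j and first[i] < last[j]:
                    A[i][j] += 1
    return A


def pcm_weights(A, m):
    """Return PCM(p_i) = (W_R(p_i) + W_C(p_i)) / (n * m) for every pattern."""
    n = len(A)
    weights = []
    for i in range(n):
        w_r = sum(A[i])                        # row total, Equation 3.1
        w_c = sum(A[j][i] for j in range(n))   # column total, Equation 3.2
        weights.append((w_r + w_c) / (n * m))  # normalisation, Equation 3.3
    return weights


# Toy document with m = 2 paragraphs and n = 3 patterns (invented data).
paragraphs = [[0, 1, 2], [0, 2]]
A = cooccurrence_matrix(paragraphs, n=3)
weights = pcm_weights(A, m=len(paragraphs))
```

For the toy document, the pair (pattern 0, pattern 2) appears in order in both paragraphs, so `A[0][2]` is 2, while every other ordered pair appears at most once. Patterns whose weight is zero have no relation to any other pattern and are the candidates for removal as noise.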