3.5 Towards Topic-Independent Text Classification: a Novel Semantic Learning Method for Hierarchical Corpora
3.5.1 Learning Document-Category Relationships from Semantic Terms Relationships
The component discussed here is the core of the Chonxi method and the major novelty with respect to other works.
While many methods extract features (usually words or concepts) from each training document and directly use topics as class labels, Chonxi generates vectors from document-category couples, using counts and weights of semantic relationships as features and labeling them with the mutual relationship between the document and the category. From a set of training document-category couples, the method learns a knowledge model, which then infers relations between these objects from the semantic relationships between the words that compose them.
Figure 3.9 – Illustrative diagram of the operations performed by the Semantic Matching Algorithm on a single document-category input couple: for each type of semantic relationship, we only show here the count of its occurrences.
The resulting model is structurally both dictionary-independent and topic-independent, as it references neither specific words nor specific categories of documents. We hypothesize that, if an appropriate training set is used, the resulting model actually exhibits these traits in practice, making it reusable across different contexts. In the base case, we consider a document to be either related to a category, if it treats the corresponding topic, or unrelated otherwise; we will introduce a further distinction later. This scheme, compared with classic approaches, is sketched in Figure 3.8.
In the rest of this section, we see at a high level how the model is trained, how it is used to classify documents and how its independence from the training set can be exploited.
Pre-processing and Learning
We describe here, at an abstract level, the steps followed to learn the model which infers document-category relationships. These constitute the initial training phase of Chonxi.
We start from a set of training documents D, labeled in a set of possible categories C. In this section, we do not consider the hierarchical arrangement of categories, which will be recalled in Section 3.5.2, as the considerations below hold regardless of the possible hierarchical structure of the taxonomy.
To start the pre-processing phase, structured representations of both training documents and their topics are needed: each document is reduced to the set of words contained in it along with their frequencies, while each category is represented by a similar set, with words taken from all documents belonging to it. Words in each set have associated weights, determined by their occurrence both in the referred document or category and in the rest of the collection. This representation is based upon the classic bag-of-words approach, with the difference that each set of words is an independent entity: no global set of words (dictionary) is considered and no feature selection is performed.
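As a minimal sketch of these representations, the following Python fragment builds one independent weighted bag of words per document and per category. The toy corpus, the whitespace tokenizer and the TF-IDF-like weighting formula are all assumptions made for illustration: the text above only states that weights depend on the occurrence of a word in the referred object and in the rest of the collection.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption: no stemming, no stopword removal).
    return text.lower().split()

def weighted_bag(counts, all_bags):
    # Assumed weighting: relative frequency inside this bag times a smoothed
    # inverse frequency over all bags of the same kind (TF-IDF-like).
    total = sum(counts.values())
    n = len(all_bags)
    bag = {}
    for term, count in counts.items():
        df = sum(1 for other in all_bags if term in other)
        bag[term] = (count / total) * (math.log((1 + n) / (1 + df)) + 1.0)
    return bag

# Toy corpus, invented for illustration.
docs = {"d1": "the team won the match", "d2": "the film won an award"}
labels = {"d1": "Sports", "d2": "Movies"}

# Raw term frequencies per document; a category pools the words of its documents.
doc_counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
cat_counts = {}
for name, category in labels.items():
    cat_counts.setdefault(category, Counter()).update(doc_counts[name])

# Each representation is its own dictionary: there is no shared global vocabulary.
doc_repr = {name: weighted_bag(cnt, list(doc_counts.values()))
            for name, cnt in doc_counts.items()}
cat_repr = {name: weighted_bag(cnt, list(cat_counts.values()))
            for name, cnt in cat_counts.items()}
```

Under this assumed weighting, a word occurring in a single document (such as "team") receives a higher weight than an equally frequent word shared by several documents (such as "won").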
Then, a set E ⊆ D × C of document-category training couples must be prepared, which constitutes the training set. In order to build an accurate knowledge model, these couples should be representative of all the possible relationships of interest between documents and categories: in the case of the related/unrelated distinction cited above, a reasonable number of couples of both types would be needed.
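One possible way to assemble such a set of couples is sketched below. The function name `build_couples`, the sampling strategy (each document coupled with its own category plus a fixed number of randomly drawn other categories) and the toy labels are hypothetical; the text does not prescribe how couples are selected.

```python
import random

def build_couples(doc_labels, categories, unrelated_per_doc=1, seed=0):
    # doc_labels maps each document id to its category.
    # Returns (document, category, relationship class) triples.
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    couples = []
    for doc, category in sorted(doc_labels.items()):
        couples.append((doc, category, "related"))
        others = [c for c in categories if c != category]
        for other in rng.sample(others, min(unrelated_per_doc, len(others))):
            couples.append((doc, other, "unrelated"))
    return couples

# Toy labels, invented for illustration.
doc_labels = {"d1": "Sports", "d2": "Movies"}
couples = build_couples(doc_labels, ["Sports", "Movies", "Arts"])
```

With one unrelated couple per document, the two relationship classes are equally represented; other ratios may be preferable depending on the corpus.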
At this point, for each couple (d, c) ∈ E, a so-called Semantic Matching Algorithm (henceforth abbreviated as SMA) identifies the semantic relationships holding between relevant words of d and c. The SMA accepts as input the two sets of weighted words representing d and c: for each couple of words in their cross product, it queries a source of semantic knowledge such as WordNet to obtain the set of semantic relationships holding between them, such as synonymy, hyponymy, antonymy and so on. Using the data obtained from all couples of words, the SMA finally outputs a fixed-size vector of numeric values, indicating how many times each type of semantic relationship has been found and the summed weights of the terms involved. Figure 3.9 sketches the operation of the SMA through a simplified example using only absolute counts.
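The behaviour of the SMA can be illustrated with the following sketch. The small `TOY_RELATIONS` table is a hypothetical stand-in for a real semantic knowledge source such as WordNet, and combining the two term weights by their product is an assumption: the text only states that the weights of the involved terms are summed up per relationship type.

```python
from itertools import product

RELATION_TYPES = ("synonym", "hypernym", "hyponym", "antonym")
INVERSE = {"synonym": "synonym", "antonym": "antonym",
           "hypernym": "hyponym", "hyponym": "hypernym"}

# Hypothetical toy knowledge base: (a, b) -> relations of b with respect to a.
TOY_RELATIONS = {
    ("match", "game"): {"synonym"},
    ("football", "sport"): {"hypernym"},  # "sport" is more general than "football"
    ("win", "lose"): {"antonym"},
}

def relations_between(a, b):
    # Look the pair up in both directions, inverting asymmetric relations.
    found = set(TOY_RELATIONS.get((a, b), ()))
    found.update(INVERSE[rel] for rel in TOY_RELATIONS.get((b, a), ()))
    return found

def semantic_matching(doc_words, cat_words):
    # doc_words and cat_words map each word to its weight.
    counts = dict.fromkeys(RELATION_TYPES, 0)
    weights = dict.fromkeys(RELATION_TYPES, 0.0)
    for (wd, xd), (wc, xc) in product(doc_words.items(), cat_words.items()):
        for rel in relations_between(wd, wc):
            counts[rel] += 1
            weights[rel] += xd * xc  # assumed way of combining the two weights
    # Fixed-size output: one count and one summed weight per relationship type.
    return [counts[r] for r in RELATION_TYPES] + [weights[r] for r in RELATION_TYPES]

document = {"football": 0.5, "match": 0.3, "win": 0.2}
category = {"sport": 0.6, "game": 0.4}
vector = semantic_matching(document, category)
```

The length of the output depends only on the number of relationship types, never on the words involved, which is what makes the resulting features independent of any dictionary.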
Each vector obtained by the SMA is labeled with the actual relationship holding between d and c, such as related or unrelated: we refer to these labels as relationship classes. This set of labeled vectors is finally used as input to a supervised learning algorithm to learn the knowledge model, which can be used to infer relationship classes for subsequent document-category couples, after preprocessing their representations with the SMA as done for training couples.
Figure 3.10 – Diagram of the Semantic Relationship Scoring Model training process.
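The learning step can be sketched as follows, using a small logistic regression trained by stochastic gradient descent in plain Python. The choice of learner, the hyperparameters and the training vectors (three relationship counts per couple) are all assumptions made for illustration; the text only requires a supervised algorithm that yields a probabilistic model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(vectors, labels, epochs=200, lr=0.5):
    # labels: 1 for "related", 0 for "unrelated".
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            err = p - y  # gradient of the log-loss with respect to the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict_proba(model, x):
    w, b = model
    p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
    return {"related": p, "unrelated": 1.0 - p}

# Invented SMA outputs: [synonym count, hypernym count, antonym count].
vectors = [[3, 2, 0], [0, 0, 1], [2, 1, 0], [0, 0, 2], [4, 1, 1], [1, 0, 2]]
labels = [1, 0, 1, 0, 1, 0]

model = train_logistic(vectors, labels)
scores = predict_proba(model, [3, 1, 0])  # vector of a new, unseen couple
```

Note that the model parameters refer only to relationship types: no word or category of the training corpus appears in them.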
In the most general case, given a document and a category, a model can predict whether they are related, unrelated or in any other possible relationship class. However, the method uses a probabilistic classification model, which outputs a probability distribution P over all possible classes, such as “64% related, 36% unrelated”. We consider the probability P(ϕ) of each possible class ϕ as its score, which denotes the estimated degree to which the corresponding relationship holds between the elements of the analyzed couple. This allows us, for example, to consider one category as “more related” than another to a given document.
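As a small illustration of how such scores can be compared, the fragment below ranks candidate categories for one document by their probability of being related. The category names and probability values are invented for the example.

```python
# Invented probability distributions for one document against three categories.
scores = {
    "Science": {"related": 0.64, "unrelated": 0.36},
    "Sports": {"related": 0.12, "unrelated": 0.88},
    "Arts": {"related": 0.31, "unrelated": 0.69},
}

# Sort categories from most to least related according to the score P(related).
ranking = sorted(scores, key=lambda c: scores[c]["related"], reverse=True)
most_related = ranking[0]
```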
With just related and unrelated as possible labels, the probability given by the model for the former on a document-category couple can be seen as a measure of similarity between the two elements: this roughly corresponds to measuring the relatedness between a document and a category prototype in the Rocchio method, for which cosine similarity is usually employed. A key advantage of the proposed model, derived from considering distinct semantic relationships between terms, is that the relatedness between documents and categories is represented through multiple values rather than a single one: this enables it to characterize the relationships between these objects more precisely, rather than simply evaluating relatedness on a linear scale. This is useful to distinguish documents treating a general topic from those addressing a specific branch thereof, as discussed in the next section.
Figure 3.11 – Diagram summing up how scores for possible relationship classes between a document and a category are estimated by means of the Semantic Relationship Scoring engine.
Summing up, a probabilistic supervised learning algorithm extracts the model used to output a probability distribution over the possible relationship classes for document-category couples: we denote this as the Semantic Relationship Scoring Model (henceforth SRS model). The process described above to produce this model is sketched in Figure 3.10.
Classification
The SMA compares two lists of words to extract a vector of values based on mutual semantic relationships, while the SRS model infers a probability distribution over relationship classes from these values. Their combination yields a component able to weight the possible relationships between generic documents and categories given their representations: we denote this component as the Semantic Relationship Scoring Engine (henceforth SRS engine), as illustrated in Figure 3.11.
Chonxi uses the SRS engine as the core component of a higher-level algorithm to classify documents in a hierarchy of categories. As explained later, each input document is compared with multiple categories, to progressively find the most likely one for it through a top-down search in the hierarchy.
As stated above, the SRS model uses a fixed set of features based on known types of semantic relationships, rather than on specific terms or concepts of the training corpus. For this reason, the engine built upon it may handle representations of any document and any category, even with different words or dealing with different topics with respect to those used for training. In our view, this topic-independence aspect enables the user to introduce new categories without retraining the SRS model.
Figure 3.12 – Diagram showing how an SRS model may be trained from one corpus with a set of topics and used in an SRS engine which analyzes and classifies documents on a different set of target categories, given their representations.
Formally, while the SRS model is trained using documents labeled with topics from a set C, we may use the resulting engine to compare new documents with categories from another generic set C∗, which we denote as the set of target categories: this concept is sketched in Figure 3.12. This allows us to perform training considering only a part of the taxonomy of topics of interest (C ⊂ C∗) to reduce training times, and even to reuse the generated model on a different taxonomy (C ∩ C∗ = ∅). As the training phase is independent of the target categories, these may be initially unknown and introduced later. In any case, once the model is trained with a sufficient amount of training data, it just needs the representations of all categories in C∗ to classify documents in them.
We will return to these possibilities in the next section, after presenting the concrete process for hierarchical classification.
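The following sketch illustrates this reuse on target categories. The knowledge table `TOY_RELATIONS`, the fixed parameters `WEIGHTS` and `BIAS` (standing in for a model already trained on some other set C) and the unweighted word sets are all hypothetical simplifications; in particular, term weights are omitted here for brevity.

```python
import math

RELATION_TYPES = ("synonym", "hypernym")

# Hypothetical toy knowledge source standing in for WordNet.
TOY_RELATIONS = {
    ("guitar", "instrument"): "hypernym",
    ("goal", "score"): "synonym",
    ("film", "movie"): "synonym",
}

def sma(doc_words, cat_words):
    # Count relationships of each type over the cross product of the two word sets.
    counts = dict.fromkeys(RELATION_TYPES, 0)
    for wd in doc_words:
        for wc in cat_words:
            rel = TOY_RELATIONS.get((wd, wc))
            if rel is not None:
                counts[rel] += 1
    return [counts[r] for r in RELATION_TYPES]

# Invented parameters standing in for an SRS model trained elsewhere,
# on couples whose categories need not appear among the targets below.
WEIGHTS, BIAS = [1.5, 2.0], -1.0

def srs_engine(doc_words, cat_words):
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, sma(doc_words, cat_words)))
    p = 1.0 / (1.0 + math.exp(-z))
    return {"related": p, "unrelated": 1.0 - p}

# Target categories C*: only their representations are needed.
target_categories = {"Music": {"instrument", "song"}, "Sports": {"score", "team"}}
document = {"guitar", "concert"}

scores = {name: srs_engine(document, words) for name, words in target_categories.items()}

# A category introduced later is handled without retraining the model.
target_categories["Movies"] = {"movie", "actor"}
scores["Movies"] = srs_engine(document, target_categories["Movies"])
best = max(scores, key=lambda name: scores[name]["related"])
```

Adding "Movies" only required supplying its representation: the model parameters were left untouched.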