
3.2 Text Representation

3.2.2 Extraction of Features

To reduce text documents to vectors, it is necessary to define a set of predictive features that effectively represent the original contents of each document. A trivial choice is to consider words as they appear in the document, without considering their form or meaning. Variants of this approach aim to overcome this limitation by exploiting the semantics of words, possibly using external knowledge. Other approaches consider features other than single words, such as n-grams.

Lemmas

Given a dictionary of a certain language (e.g. English), each entry corresponds to one headword or lemma, which is the canonical form representing a possibly wider set of words corresponding to the same entry. For many such lemmas, inflected forms also exist, which express the same concept with differences in number, gender, tense or other categories. For example, waiter and waiters refer to the same dictionary entry, having waiter as its lemma; similarly, served and serving are both inflected forms of (to) serve.

In the basic case presented above, where words are considered as meaningless tokens, different inflected forms of the same lemma are treated as distinct words. A different approach is to represent each word by its lemma, without distinguishing its different forms. While this theoretically implies a loss of information, the advantage is a noteworthy reduction in the number of features which, in many cases, does not lead to a practical loss of accuracy. Taking text classification again as an example, what generally matters is which lemmas appear and how frequently, while the specific form in which they appear (e.g. how many singular and plural forms of each noun there are) has negligible importance.

The application of this approach, however, is not straightforward, as a computer must be able to infer the corresponding lemma for each distinct word encountered in a set of documents. This process, known as lemmatisation, requires prior knowledge of the specific language under analysis. Such knowledge can be very complex, including a complete set of general rules addressing common cases (e.g. the -s ending for the plural of many nouns and for the third person singular of verbs) and enumerations of specific cases and exceptions to such rules (e.g. mice being the plural of mouse).
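The combination of general rules and enumerated exceptions can be illustrated with a minimal sketch. The exception table, the suffix rules and the function name lemmatise are assumptions made here for illustration; they are far from a complete description of English and do not correspond to any specific lemmatiser.

```python
# Toy lemmatiser: an exception table is consulted first, then general suffix rules.
# Both the table and the rules are illustrative assumptions, not a real lexicon.

EXCEPTIONS = {"mice": "mouse", "men": "man"}

# (suffix, replacement) pairs tried in order; hypothetical and incomplete.
SUFFIX_RULES = [("ies", "y"), ("s", "")]


def lemmatise(word):
    """Return a guessed lemma for word using exceptions, then suffix rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two characters before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: len(word) - len(suffix)] + replacement
    return word


print(lemmatise("waiters"), lemmatise("mice"), lemmatise("serves"))
```

Rules this simple over-apply: a singular noun ending in -s, such as bus, would be wrongly truncated, which is one reason why practical lemmatisation needs extensive language-specific knowledge.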

Stems

In the most general sense, a stem of a word is a part of it. Here, stem is used to indicate the part of a word which is common to all its inflected variants. Some words have a stem equal to their lemma, but this does not hold for every word. Recalling the examples above, waiter and waiters have waiter as their stem, which is also their lemma; however, served and serving have serv as their stem, which is a truncation of the lemma (to) serve. In many cases, such as this latter example, the stem of a word, unlike the lemma, is not itself a word of the language (although a human reading a stem should generally be able to infer the lemma of the original word).

Like lemmatisation, stemming is the process of extracting the stem of an arbitrary word. Here too, the process depends on the language and requires proper knowledge of it. However, stemming a word is usually simpler than extracting its lemma: the stem is always contained in the word itself and can usually be extracted by relying on a not too complex set of rules.

Given the efficiency of stemming algorithms, also known as stemmers, stems are used in the many situations where complete words are not needed as features, as an efficient way to identify groups of similar words.

Several stemming algorithms have been proposed over the past decades. The most commonly used among them is the Porter stemmer (Porter, 1980). This algorithm, like others, is based on suffix stripping: a number of rules, divided into steps, is defined so that each input word has its suffix removed or substituted according to the rules whose preconditions match the word. For example, the -ing suffix is removed only if the resulting stem contains at least one vowel, so that serving is stemmed into serv, but sing is left as is.
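The idea of suffix-stripping rules guarded by preconditions can be sketched as follows. This is not the Porter algorithm, only a toy with three rules: the names toy_stem and strip_if_stem_has_vowel are hypothetical, and the plural -s rule and the extension of the vowel condition to -ed are assumptions made for illustration.

```python
VOWELS = set("aeiou")


def strip_if_stem_has_vowel(word, suffix):
    """Remove suffix only if what remains contains at least one vowel."""
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if any(ch in VOWELS for ch in stem):
            return stem
    return word


def toy_stem(word):
    """Apply a plural rule, then at most one of the -ing / -ed rules."""
    word = word.lower()
    # Step 1 (assumed rule): drop a plural -s, but leave words ending in -ss.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 2:
        word = word[:-1]
    # Step 2: strip -ing or -ed when the precondition on the stem holds.
    for suffix in ("ing", "ed"):
        stripped = strip_if_stem_has_vowel(word, suffix)
        if stripped != word:
            return stripped
    return word


for w in ("serving", "served", "sing", "waiters"):
    print(w, "->", toy_stem(w))
```

The precondition is what keeps sing intact: removing -ing would leave s, which contains no vowel, so the rule does not fire.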

N-grams and Phrases

An n-gram is in general a sequence of n consecutive elements extracted from a wider sequence: common specific cases are n = 2 (bigrams) and n = 3 (trigrams). Elements grouped in n-grams are generally either letters or words. For example, considering letters, the bigrams ex, xa, am, mp, pl, le can be extracted from the word example. Some works use n-grams as features in place of or in addition to single words, sometimes mixing different lengths.
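The extraction of n-grams from a sequence can be sketched with a short function that works on both letters and words. The function name ngrams and the sample sentence are chosen here for illustration only.

```python
def ngrams(sequence, n):
    """Return all runs of n consecutive elements of sequence, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


# Letter bigrams of a single word.
letter_bigrams = ["".join(g) for g in ngrams("example", 2)]
print(letter_bigrams)

# Word bigrams of a tokenised sentence (whitespace tokenisation is an assumption).
tokens = "the service was not good".split()
word_bigrams = [" ".join(g) for g in ngrams(tokens, 2)]
print(word_bigrams)
```

A sequence shorter than n yields an empty list, so mixing different lengths amounts to concatenating the results for several values of n.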

N-grams of letters are not practical as a replacement for words: words usually serve to identify the topic discussed in a document, whereas groups of letters (especially short ones) are generally poor indicators of what the document discusses. Instead, sequences of letters are usually employed in particular tasks where they can actually be effective as predictive features, such as classifying documents by language (Cavnar and Trenkle, 1994): for example, English texts can be recognized by the high frequency of some bigrams such as th and er.
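As a rough illustration of letter bigrams used as features, the sketch below computes the relative frequency of each letter bigram in a text. It is not the method of Cavnar and Trenkle; the function name letter_bigram_frequencies, the whitespace tokenisation and the sample sentence are assumptions.

```python
from collections import Counter


def letter_bigram_frequencies(text):
    """Relative frequency of each letter bigram found inside the words of text."""
    counts = Counter()
    for word in text.lower().split():
        # Keep alphabetic characters only, so punctuation does not form bigrams.
        letters = "".join(ch for ch in word if ch.isalpha())
        counts.update(letters[i:i + 2] for i in range(len(letters) - 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: count / total for bigram, count in counts.items()}


freqs = letter_bigram_frequencies("The other waiter gathered the orders there")
for bigram in ("th", "er", "wa"):
    print(bigram, round(freqs.get(bigram, 0.0), 3))
```

A language classifier could then compare such frequency profiles against profiles computed from reference texts in each candidate language.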

N-grams of words are instead more similar to single words as features, as they can be informative of the topic discussed in the text. The extraction of bigrams and trigrams can be useful to represent recurring compound expressions, such as text categorization (a bigram) or word sense disambiguation (a trigram). A field where n-grams of words are currently widely used is sentiment analysis, where some sequences of words express a specific polarity that would not be accurately represented by single words (Pak and Paroubek, 2010).

Some works also deal with the use of phrases as features. The term may refer either to generic sequences of words which often recur in text, which reflects the above definition of word n-grams, or to syntactic phrases, which are those defined by the grammar of the language and have a meaning of their own. In (Lewis, 1992) the use of syntactic phrases for text categorization is evaluated; it is observed, however, that phrases “have inferior statistical qualities” with respect to words, because distinct phrases are more numerous and many phrases with similar meanings end up being split across many features.

Concepts

The types of features described above are extracted from words in a relatively trivial way, with minimal processing. While words can generally give a good indication of the contents of a document, they are not a perfect representation. One cause is the synonymy and polysemy of natural language: a meaning may be expressed by more than one word, and a word taken out of its context may have several possible meanings. An ideal solution would be to represent a document by the concepts expressed in it, rather than (or in addition to) the possibly ambiguous words.

Different methods exist to extract features of this type. Two general types of approaches can be distinguished: concepts can be extracted statistically from the documents or obtained from an external knowledge base. In the first case, the documents to be processed (or in some cases an external set of different documents) are analyzed in order to identify potential concepts according to co-occurrences between words in documents. For example, a set of words which (almost) always appear together in documents is likely to represent a single meaning or closely related ones: this set could be reduced to a single feature with no significant loss of information about the contents of documents. Several methods exist to extract latent semantic knowledge from collections of documents; the most common is latent semantic analysis (LSA) (Deerwester et al., 1990).
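The co-occurrence idea can be sketched in its strictest form: words that appear in exactly the same set of documents are merged into one candidate feature. This is a deliberately simplified illustration and not LSA, which instead relies on a truncated singular value decomposition of the term-document matrix; the function name group_cooccurring_words and the sample documents are assumptions.

```python
from collections import defaultdict


def group_cooccurring_words(documents):
    """Group words that occur in exactly the same set of documents."""
    occurrences = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in set(text.lower().split()):
            occurrences[word].add(doc_id)
    # Words sharing the same set of document ids fall into the same group.
    groups = defaultdict(list)
    for word, doc_ids in occurrences.items():
        groups[frozenset(doc_ids)].append(word)
    return sorted(sorted(words) for words in groups.values())


docs = [
    "new york hotel",
    "new york restaurant",
    "paris hotel",
]
groups = group_cooccurring_words(docs)
print(groups)
```

Here new and york always appear together and form one group, while every other word remains a feature of its own. Relaxing "exactly the same documents" to "almost always together" requires a similarity threshold or a statistical model.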

The other possible approach is to use some sort of external knowledge base to obtain information about all the existing concepts and to correctly map the words found within documents to them. Different sorts of knowledge bases can be used for this purpose. Section 3.5 proposes a novel method for text categorization that makes use of external knowledge bases to extract semantic features from documents.