Theory and Bayesian Modeling
2.4 Text Preprocessing
Before any further operations are performed on text data, the data typically has to be preprocessed using several standard text analysis techniques. Here, we cover only the subset of these techniques that is used later in the thesis.
Tokenization. Tokenization is typically the first step required for any further text analysis. The task refers to the automatic detection of word boundaries in a text document, that is, it divides the text document into individual word tokens. A basic way to achieve tokenization is to rely on clues such as spaces or punctuation. However, this basic approach runs into problems with expressions that contain more than one word, such as named entities or collocations. A standard solution is to use special lists of named entities, or to rely on word co-occurrence, to decide which expressions should not be split by a tokenizer. Tokenization is considered a solved problem for many languages (e.g., English, Spanish, Dutch), but it remains only partially solved for others (e.g., Chinese). Another task closely related to tokenization is sentence boundary detection, where the aim is to extract individual sentences, again using clues such as punctuation or capitalization.
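The basic approach described above can be illustrated with a minimal Python sketch. It is not the tokenizer used in this thesis; the list of multi-word expressions, the function names, and the regular expressions are assumptions chosen for illustration only, and the sketch assumes that every listed expression begins and ends with a word character.

```python
import re

# Hypothetical list of multi-word expressions that must not be split.
MULTIWORD_EXPRESSIONS = ["New York", "information retrieval"]


def tokenize(text, multiword=MULTIWORD_EXPRESSIONS):
    """Split text into word and punctuation tokens, keeping listed expressions whole."""
    # Try longer expressions first so that they win over their own prefixes.
    ordered = sorted(multiword, key=len, reverse=True)
    alternatives = [r"\b" + re.escape(m) + r"\b" for m in ordered]
    # Fall back to runs of word characters, then to single punctuation marks.
    alternatives += [r"\w+", r"[^\w\s]"]
    return re.findall("|".join(alternatives), text)


def split_sentences(text):
    """Naive sentence boundary detection based on punctuation and capitalization."""
    # Break after ., ! or ? when followed by whitespace and an uppercase letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


tokens = tokenize("She moved to New York. It rains.")
sentences = split_sentences("She moved to New York. It rains.")
```

Here tokens keeps "New York" as a single token, while sentences contains the two sentences of the input. Abbreviations such as "e.g." would defeat the sentence splitter, which is one reason why practical systems rely on more elaborate rules or trained models.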
Stop Words Removal. Words that occur very frequently and do not bear any semantic information may negatively affect the final results of NLP and IR systems [245]. These words are known as stop words and are typically filtered out prior to any additional text processing, in a process called stop words removal. Typically, a list of such words, called a stop words list, is used for filtering: stop words removal simply filters out of a text document all words that are contained in the stop words list. Examples of stop words in English include very frequent function words such as a, the, is, which, and that.
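List-based filtering of this kind can be sketched in a few lines of Python. The stop words list below is a tiny illustrative assumption, not the list used in this thesis, and the case-insensitive comparison is likewise a design choice made only for this example.

```python
# Tiny illustrative stop words list; real lists are considerably longer.
STOP_WORDS = {"a", "the", "is", "which", "that"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop words list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


filtered = remove_stop_words(["The", "cat", "is", "a", "pet"])
```

After this call, filtered holds only the tokens "cat" and "pet". Storing the list as a set keeps each membership test at constant expected cost, which matters for large corpora.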
Part-of-Speech Tagging. Part-of-speech (POS) tagging assigns POS labels, or tags, to words according to their grammatical functions or categories (e.g., nouns, verbs, adjectives, adverbs, pronouns). A simple example follows:
He/PRP saw/VBD a/DT girl/NN with/IN a/DT cat/NN .
In the example above, each word is tagged with a label denoting its grammatical category. PRP denotes a personal pronoun, VBD a verb in the past tense, DT a determiner, IN a preposition, and NN a singular noun. The sets of POS tags are typically hand-built (e.g., the Penn Treebank Tagset for English [187], also used in the example above) and differ across languages. However, recent trends in POS tagging strive towards a universal language-independent set of POS tags [239] and completely unsupervised language-independent POS systems [63]. POS tagging is a subfield of NLP research in its own right, and a detailed treatment is well beyond the scope of this thesis.
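To show how output in the word/TAG format of the example can be consumed downstream, the following Python sketch parses the tagged sentence into (word, tag) pairs. It does not perform tagging itself; the function name and the handling of untagged tokens are assumptions made for illustration.

```python
# Descriptions of the tags that appear in the example sentence.
TAG_DESCRIPTIONS = {
    "PRP": "personal pronoun",
    "VBD": "verb, past tense",
    "DT": "determiner",
    "IN": "preposition",
    "NN": "singular noun",
}


def parse_tagged(sentence):
    """Turn a 'word/TAG word/TAG ...' string into a list of (word, tag) pairs."""
    pairs = []
    for item in sentence.split():
        # Split on the last slash so that words containing '/' stay intact.
        word, separator, tag = item.rpartition("/")
        if separator:
            pairs.append((word, tag))
        else:
            # Tokens without a tag (the final period here) get the tag None.
            pairs.append((item, None))
    return pairs


pairs = parse_tagged("He/PRP saw/VBD a/DT girl/NN with/IN a/DT cat/NN .")
nouns = [word for word, tag in pairs if tag == "NN"]
```

With the example sentence, nouns contains "girl" and "cat". Filtering by tag in this way is a common use of POS information, for instance when only nouns are to be retained.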
Lemmatization. Another important preprocessing step involves a shallow morphological analysis of the given text data. If one operates with individual words without any additional morphological analysis, a problem of data sparsity may arise, because some words are actually only different variants (e.g., they differ in tense, person, gender or number) of the same root word (e.g., the words build, builds, building and built are all variants of the same root word build). In order to address this issue, a common preprocessing step is to perform a morphological analysis of the text data. This analysis refers to the process of finding stems and affixes for words, and then mapping the words to common roots by stemming or lemmatization. A stem is defined as the main morpheme of a word, that is, the smallest meaningful unit that carries the word’s core meaning. Stemming techniques are heuristic-based algorithms that remove typical prefixes and suffixes (for instance, the English suffix -s for the third-person singular) and leave only the stem of the word. However, due to their heuristic nature, they often remove too much information, and the process may result in stems without any meaning at all. To tackle this issue, lemmatization, a dictionary-based approach to morphological analysis, always produces the dictionary forms of words, called lemmas.
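The contrast between heuristic stemming and dictionary-based lemmatization can be made concrete with a toy Python sketch. Neither function reproduces an existing stemmer or the lemmatizer used in this thesis; the suffix list, the minimum stem length, and the small lemma dictionary are all assumptions made for illustration.

```python
# Hypothetical suffix list for a toy stemmer, checked in this order.
SUFFIXES = ("ing", "es", "ed", "s")

# Hypothetical lemma dictionary mapping inflected forms to dictionary forms.
LEMMA_DICT = {"builds": "build", "building": "build", "built": "build"}


def naive_stem(word):
    """Strip the first matching suffix, provided at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look up the dictionary form; unknown words are returned unchanged."""
    return LEMMA_DICT.get(word, word)


stems = [naive_stem(w) for w in ["build", "builds", "building", "built"]]
lemmas = [lemmatize(w) for w in ["build", "builds", "building", "built"]]
```

The toy stemmer maps builds and building to build but leaves the irregular form built untouched, and it reduces string to the meaningless stem str. The dictionary lookup maps all four variants to build, at the cost of requiring a dictionary that covers the vocabulary.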
We have applied a tokenizer, a POS tagger and a lemmatizer from the TreeTagger project [267], which may be found online.2 We have used stop words lists provided for the Snowball project, which may also be acquired online.3
2.5 Conclusion
This chapter has provided a short overview of the fundamental tools that form the basis for the modeling and development in this thesis. We have presented a short introduction to probability theory, which serves as the main cornerstone for all further modeling and for the probabilistic representations of text (e.g., words or documents) discussed later in the thesis. We have introduced key concepts of probability theory such as random variables, conditional probabilities, joint probabilities, probability distributions, prior and posterior distributions, and sampling from a distribution. Following that, we have provided a brief introduction to graphical models and Bayesian networks, which is necessary to understand the basics of (multilingual) probabilistic topic modeling, the probabilistic modeling principle that serves as the backbone of this thesis. Bayesian networks are simply descriptions of stochastic processes: networks of conditionally dependent variables through which information is propagated to produce a random outcome. Observed outcomes allow for the estimation of probabilities even for variables that cannot be observed (i.e., they are latent or hidden). The Bayesian framework thus allows for discovering a latent structure underlying textual data collections. For instance, latent topics may be regarded as hidden knowledge behind the observed text data that is involved in the generation of that data.
We have also covered the very basics of statistical language modeling, which are necessary to fully understand the information retrieval models discussed in Part IV. Following that, we have provided a short overview of core text preprocessing techniques: tokenization, stop words removal, part-of-speech tagging and lemmatization. These are the techniques that we have used in this thesis to prepare our text corpora for further processing.
2 http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
3 http://snowball.tartarus.org/