Section 2.2.1 described several methods for obtaining tag classifiers, such as NER and POS taggers, for sequential data. For the offender extractor application the Conditional Random Fields (CRF) method was chosen, because it achieves high performance with a simple algorithm and a short computing time compared with the deep neural network methods described in [Peirsman, 2017] and [Agerri, 2017], which combine several algorithms. Furthermore, most deep neural network methods use word embeddings, which have the disadvantage of requiring a lot of training data. Word embeddings trained only on the CONLL02 and SONAR1 datasets would not be reliable, and a model built on them might score lower than CRF even when combined with a Bi-LSTM. A pre-trained word embedding model would therefore be needed. This puts confidentiality at risk, because pre-trained data might rely on server communication or might contain malware, both of which should be avoided in this application.
4.2.2
Conditional Random Fields (CRF)
The Conditional Random Field (CRF) is a type of graphical model. As described in [Sutton and McCallum, 2011], a graphical model uses a graph to reduce the complexity of probability distributions over many variables. Storing a joint probability distribution over n binary variables requires $O(2^n)$ values. A graphical model instead represents the distribution as a product of local functions, each of which depends on a much smaller subset of the variables. This product of local functions is called a factorization, and it encodes conditional independence relations among the variables. Each graph and its factorization can take incoming and outgoing edges into account to form conditional relations, which reduces the number of functions needed to compute the dependencies. Figure 4.2 shows several graphical models and their factors, which are drawn as rectangular nodes connected to the circular variable nodes.
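The storage argument can be made concrete with a small count. The sketch below assumes binary variables and a chain-shaped graph in which only neighbouring variables share a factor. The function names are hypothetical and are not taken from [Sutton and McCallum, 2011].

```python
def joint_table_size(n: int) -> int:
    """Entries needed to store a full joint table over n binary variables."""
    return 2 ** n


def chain_factor_size(n: int) -> int:
    """Entries needed when the distribution factorizes over a chain:
    n - 1 pairwise factors, each stored as a 2 x 2 table."""
    return (n - 1) * 4


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, joint_table_size(n), chain_factor_size(n))
```

The full table grows exponentially with the number of variables, while the factorized representation grows only linearly.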
Most existing CRF implementations use the linear-chain CRF as their graph model, because of its simple design and fast computation, and because it can take both current and future observations into account. Figure 4.3 shows a linear-chain model whose factors depend on the current observation. Such a model can be extended to further observations and variables.
Figure 4.3: Linear-chain CRF with factor dependencies on current observations.
The formula needed to use extended features of the observations is given in [Sutton and McCallum, 2011]:
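$$p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \psi_t(y_t, y_{t-1}, x_t)$$
where $Z(x)$ is an input-dependent normalization function:
$$Z(x) = \sum_{y} \prod_{t=1}^{T} \psi_t(y_t, y_{t-1}, x_t)$$
and where each local function $\psi_t$ has the log-linear form:
$$\psi_t(y_t, y_{t-1}, x_t) = \exp\left\{ \sum_{k=1}^{K} \delta_k f_k(y_t, y_{t-1}, x_t) \right\}$$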
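To illustrate how these three formulas fit together, the following minimal sketch evaluates p(y|x) by brute force for a short sentence. The label set, the three feature functions, the weights and the "START" placeholder for the label before the first word are all assumptions made for illustration. A real toolkit learns the weights from data and computes Z(x) with dynamic programming instead of enumerating every label sequence.

```python
import itertools
import math

LABELS = ["O", "PER"]  # hypothetical label set


def features(y_t, y_prev, x_t):
    """Hypothetical binary feature functions f_k(y_t, y_{t-1}, x_t)."""
    return [
        1.0 if x_t[0].isupper() and y_t == "PER" else 0.0,  # capitalised word tagged PER
        1.0 if x_t[0].islower() and y_t == "O" else 0.0,    # lower-case word tagged O
        1.0 if y_prev == "PER" and y_t == "PER" else 0.0,   # PER followed by PER
    ]


WEIGHTS = [1.5, 1.0, 0.5]  # illustrative values for delta_k, not learned


def psi(y_t, y_prev, x_t):
    """Local function: exp of the weighted sum of the feature functions."""
    return math.exp(sum(w * f for w, f in zip(WEIGHTS, features(y_t, y_prev, x_t))))


def unnormalized(y, x):
    """Product of the local functions over all positions t = 1..T."""
    score, y_prev = 1.0, "START"
    for y_t, x_t in zip(y, x):
        score *= psi(y_t, y_prev, x_t)
        y_prev = y_t
    return score


def partition(x):
    """Z(x): sum of the unnormalized scores over every possible label sequence."""
    return sum(unnormalized(y, x) for y in itertools.product(LABELS, repeat=len(x)))


def prob(y, x):
    """p(y | x) for one label sequence y."""
    return unnormalized(y, x) / partition(x)


if __name__ == "__main__":
    sentence = ["Jan", "Jansen", "loopt"]
    for y in itertools.product(LABELS, repeat=len(sentence)):
        print(y, round(prob(y, sentence), 4))
```

Because Z(x) sums over all label sequences for the given input, the probabilities printed for one sentence add up to one.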
4.2.3
CRF Toolkit
Two well-known CRF toolkits are CRF++ and PyCRFSuite:
• CRF++ is a customizable toolkit written in C++ for segmenting and labeling sequential data. It can generate a generic set of features for tasks such as chunking, POS tagging, or named entity recognition. This means that each word needs to have the same number of attributes, and the model must be regenerated every time its features change.
• PyCRFSuite, by contrast, provides a Python wrapper and accepts an arbitrary number of features, so each word can have a different number of attributes. This makes it more customizable than CRF++.
Since all source code is written in Python and a customizable toolkit is preferred, PyCRFSuite was chosen as the CRF toolkit.
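The difference in input shape between the two toolkits can be sketched without calling either of them. The example below assumes a column-based training format with one token per row and a fixed number of whitespace-separated columns, and contrasts it with per-token feature lists of varying length. The tokens, tags, feature names and the helper `has_fixed_columns` are hypothetical.

```python
# Column-based rows: every token carries the same number of columns,
# with the label in the last column (illustrative data).
column_rows = [
    "Jan N B-PER",
    "loopt WW O",
]


def has_fixed_columns(rows):
    """Return True if every row has the same number of columns."""
    return len({len(row.split()) for row in rows}) == 1


# Per-token feature lists: each token may carry a different number of
# attributes (illustrative feature names).
token_features = [
    ["word.lower=jan", "word.istitle", "BOS"],
    ["word.lower=loopt", "EOS"],
]

if __name__ == "__main__":
    print(has_fixed_columns(column_rows))
    print([len(feats) for feats in token_features])
```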
4.2.4
CRF Feature extraction
Since the POS tagger and the NER classifier use the same CRF algorithm, they also use the same features, with the exception that the NER classifier additionally includes the POS tag of each word as a feature. The features of each word form a unigram, bigram, or trigram, depending on the words available in the input sentence. Furthermore, the SONAR1 dataset provides a CRF++ model and the source code for its feature extraction, created by Bart Desmet. Some of the features extracted for that CRF++ model were adopted in the application, such as wordshape, ishyphen, function word, and isURL. The idea of also using the features of the previous and following words in a sentence came from [bogdani, 2016b], [bogdani, 2016a] and [Peirsman, 2017] and was applied as well. The features extracted for each word are shown in Figure 4.4.
Figure 4.4: Features of the CRF Model
For each sentence, a list of feature sequences is created. It starts with "BOS", which marks the beginning of the sentence, and ends with "EOS", which marks the end of the sentence. Between these two markers, the features of each word in the sentence are extracted to form trigrams. The extractor first tries to get the previous word in the sentence and extracts its CRF features. It then extracts the CRF features of the current word, followed by those of the following word, so that features for all three words of the trigram are available. At the start of a sentence there is no previous word, so only two words and their features, a so-called bigram, are extracted to form the first item in the feature list. The bigram consists of the current word and the following word. Each subsequent iteration adds another item with trigram features to the feature list until the end of the sentence is reached and the EOS feature is appended.
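The windowing procedure described above can be sketched as follows. This is a minimal illustration and not the application's actual extractor: the per-word features are a reduced, hypothetical subset, the key prefixes "-1:", "0:" and "+1:" are an assumed naming scheme, and BOS and EOS are represented as flags on the first and last item, which is one possible reading of the description.

```python
def word_features(word, prefix):
    """Hypothetical per-word features, keyed with a position prefix."""
    return {
        f"{prefix}word.lower": word.lower(),
        f"{prefix}ishyphen": "-" in word,
        f"{prefix}isurl": word.lower().startswith(("http://", "https://", "www.")),
    }


def sentence_features(words):
    """Build one feature dict per word from the previous, current and next word."""
    items = []
    for i, word in enumerate(words):
        feats = word_features(word, "0:")                     # current word
        if i > 0:
            feats.update(word_features(words[i - 1], "-1:"))  # previous word
        else:
            feats["BOS"] = True                               # no previous word
        if i < len(words) - 1:
            feats.update(word_features(words[i + 1], "+1:"))  # following word
        else:
            feats["EOS"] = True                               # no following word
        items.append(feats)
    return items


if __name__ == "__main__":
    for item in sentence_features(["Jan", "loopt", "thuis"]):
        print(sorted(item))
```

In this sketch the first item contains only the current and the following word, which corresponds to the bigram case, and every item in the middle of the sentence contains all three positions.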
One specific feature, the word shape, has several possible values, which are shown in Figure 4.5.
Figure 4.5: Options of the word shape feature
The word shape feature describes the shape of a word, indicating which kinds of characters the word contains.
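As an illustration of the general idea, the sketch below maps each character of a word to a character class. This is a common generic encoding and an assumption made here. It does not reproduce the specific options listed in Figure 4.5, and the function name `word_shape` is hypothetical.

```python
def word_shape(word):
    """Map upper-case letters to 'X', lower-case letters to 'x', digits to 'd';
    all other characters are kept as they are."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


if __name__ == "__main__":
    for token in ("Jan", "CoNLL-2002", "www.example.com"):
        print(token, word_shape(token))
```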