Language Detection - Extract offender information from text

The manual annotation of the FHD data showed that the text of some fraud incidents was written in different languages, and for those texts the predicted POS tags and Named Entities were completely wrong. Since most text was written in either Dutch or English, with only some text in other languages, the idea arose to detect the language of each text first. Furthermore, a separate English POS and NER model seemed worthwhile, since the precision on the English text in the annotated data was very poor.

4.6.1 Language detection methods

There are several methods to detect the different languages from text, like using the Google API, using n-gram character classification or using the most used stopwords to identify different languages.

• The Google API is a server-based service, so using it would expose the fraud text to a third party. It therefore cannot be used in this case.

• The TextCat method described in [Cavnar and Trenkle, 1994] and [Nolla, 2013b] is a possible option with a high detection rate. It detects the language from frequency statistics of the character n-grams (n = 1 to 5) of all words that appear in text of each language. The only problem is that not much data is available for the languages.

• The last method uses the stopwords of each language to detect the language of a text. It is very simple and still achieves a high detection rate, since each language has its own characteristic set of stopwords. The stopword method is also easy to apply: the NLTK corpora contain stopword lists for most languages, and the blog article of [Nolla, 2013a] explains how to use them to self-train a model that detects the language of a text.

The stopword method was chosen to detect the language of the text. The language is determined based on how often stopwords of each language appear in the given text: each time a stopword is detected, the score for the corresponding language increases, and the language with the highest score is predicted as the language the text is written in.
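The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the tiny stopword sets are hand-picked for the example, and the names `STOPWORDS`, `language_scores` and `detect_language` are hypothetical. In practice the sets could be filled from NLTK, for example with `nltk.corpus.stopwords.words("dutch")`, which requires the stopwords corpus to be downloaded first.

```python
import re

# Tiny hand-picked stopword sets, for illustration only (an assumption).
# Real lists could come from nltk.corpus.stopwords.words("dutch") etc.
STOPWORDS = {
    "dutch": {"de", "het", "een", "en", "van", "ik", "is", "dat",
              "niet", "op", "te", "zijn", "met", "voor"},
    "english": {"the", "a", "an", "and", "of", "i", "is", "that",
                "not", "on", "to", "are", "with", "for"},
}


def language_scores(text):
    """Count, per language, how many tokens of the text are stopwords."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    return {
        language: sum(1 for token in tokens if token in words)
        for language, words in STOPWORDS.items()
    }


def detect_language(text, default="unknown"):
    """Return the language with the highest stopword score.

    Falls back to `default` when no stopword of any language occurs.
    Ties are resolved in favour of the language listed first.
    """
    scores = language_scores(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


if __name__ == "__main__":
    print(detect_language("Het geld is niet op de rekening van de bank"))
    print(detect_language("The money was sent to an account that is not mine"))
```

Words such as "is" are stopwords in both languages, so they raise both scores; the decision then rests on the stopwords that occur in only one of the lists.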

Afterwards, the language detector was applied to the FHD data to count the number of texts in each language. Of the 28400 fraud-incident texts, 84% are in Dutch, 12% in English and 4% in other languages. Since 12% of the fraud-incident data was written in English, an English POS tagger was needed. Therefore, the CONLL 2003 dataset was used to train an English POS & NER tagger to correct the misclassified POS tags that were caused by the language barrier of the trained model.

4.6.2 English Tagger Result

CONLL2003 is the CONLL challenge of the year 2003, which provided an English dataset annotated with POS and NER tags. Since the previous NER and POS models all use the same CONLL02 notation for the POS tags and Named Entities, the labels in the CONLL03 dataset were transformed to the CONLL02 notation as well. The CONLL03 dataset also has the same split as before: a training set, a test set (testset A) to optimize the algorithm and confirm its performance, and another test set (testset B) that was used to determine the winner of the CONLL03 challenge. Figure 4.21 shows the NER results on the CONLL03 dataset with the same CRF algorithm and the same extracted features.
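The text does not spell out what the label transformation involved. One plausible part of it is sketched below, under the assumption that the CONLL03 files mark entities in the IOB1 scheme (B- only between two adjacent entities of the same type), while the CONLL02 notation starts every entity with a B- tag (IOB2). The function name `iob1_to_iob2` is hypothetical, and the mapping of the POS tag set is not covered here.

```python
def iob1_to_iob2(tags):
    """Convert one sentence of IOB1 entity tags to IOB2.

    In IOB2 the first token of every entity carries a B- prefix,
    so an I- tag that opens a new entity is rewritten to B-.
    """
    converted = []
    previous = "O"
    for tag in tags:
        if tag == "O":
            converted.append(tag)
        else:
            prefix, entity_type = tag.split("-", 1)
            previous_type = previous.split("-", 1)[1] if previous != "O" else None
            if prefix == "I" and previous_type != entity_type:
                # An I- tag after "O" or after another entity type
                # starts a new entity in IOB2.
                converted.append("B-" + entity_type)
            else:
                converted.append(tag)
        previous = tag
    return converted


if __name__ == "__main__":
    print(iob1_to_iob2(["I-PER", "I-PER", "O", "I-LOC", "B-LOC", "I-LOC"]))
```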

Figure 4.21: NER Results of the CONLL03 testset A

The average F1-score on CONLL03 testset A is 86%, and the best-detected NE is I-PER with an F1-score of 93%. The B- tags are detected well, each with an F1-score higher than 85%, while the I- tags other than I-PER all score lower than the B- tags, in the range of 75-77%: I-MISC has the lowest score at 75%, followed by I-LOC and I-ORG at 77%. Except for I-PER and B-LOC, all other NEs have a lower recall score. CONLL03 testset B is more challenging for detecting the correct NEs, which can be seen in its lower average F1-score of 80%. The ranking of the individual F1-scores is nearly identical to that on testset A; the scores are merely lower, and I-ORG is detected better than I-LOC.

The same CRF algorithm and extracted features were also used for the English POS tag model; its results are shown in Figure 4.23.

The results show that the average F1-score is 96% on CONLL03 testset A as well as on testset B. The English and Dutch POS tag models perform very similarly. One difference is that the English POS tagger is better at detecting conjunctions, with a score of 100%, while the Dutch one reaches only 96%. On the other hand, the English POS tagger has more trouble detecting adjectives and adverbs, with F1-scores of 85% and 88%, which are its lowest scores, while the Dutch POS tagger scored higher than 95% on Adj and Adv. The English POS tagger is also better at detecting Misc tags, with an F1-score of 96% on both test sets, whereas the Dutch POS tagger reaches F1-scores of only 59% and 53% for the Misc tag.

Figure 4.22: NER Results of the CONLL03 testset B

Figure 4.23: POS Results of the CONLL03 testset

Compared to the CONLL03 winners shown in Table 4.8, the CRF method performs worse than the top 3 participants. In the challenge, the CRF method would have been placed 14th out of 17 participants.

Table 4.8: CONLL03 Challenge Results

Participants   Precision   Recall    F1-Score
FIJZ03         88.99%      88.54%    88.76%
CN03           88.12%      88.51%    88.31%
KSNM03         85.93%      86.21%    86.07%
CRF            80%         81%       80%
Ham03          69.09%      53.26%    60.15%
Baseline       71.91%      50.90%    59.61%
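The F1-scores reported for the participants are the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a small check, the sketch below recomputes the F1 column from the precision and recall values given for the challenge participants; the function name `f1_score` and the `rows` dictionary are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the CONLL03 participants.
rows = {
    "FIJZ03": (88.99, 88.54),
    "CN03": (88.12, 88.51),
    "KSNM03": (85.93, 86.21),
    "Ham03": (69.09, 53.26),
    "Baseline": (71.91, 50.90),
}

if __name__ == "__main__":
    for name, (precision, recall) in rows.items():
        print(f"{name}: {f1_score(precision, recall):.2f}%")
```

For the CRF row, 80% precision and 81% recall give an F1-score of about 80.5%, which matches the reported 80% up to the rounding of the inputs.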

4.6.3 Conclusion

Named Entity Recognition and POS tagging achieve better performance on English than on Dutch. One possible reason is that English has simpler grammar rules, which are easier to detect than those of Dutch. With such high performance for the English NER and POS models as well as for the language detector, all three models are ready to be used: the language detector separates the fraud-incident texts into Dutch and English, and the model of the corresponding language then applies its POS and NER tags.
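How the three models could be combined is sketched below. The sketch assumes that the language detector is a function returning a language name and that each tagger is a callable taking a token list; `tag_incident` and the placeholder taggers are hypothetical stand-ins for the trained models. How texts in other languages are handled is not specified in this section, so the sketch simply returns no tags for them.

```python
def tag_incident(text, detect_language, taggers):
    """Detect the language of a text and apply the matching tagger.

    `detect_language` maps a text to a language name; `taggers` maps
    language names to callables that tag a list of tokens.
    Returns (language, tags), with tags set to None when no tagger
    exists for the detected language.
    """
    language = detect_language(text)
    tagger = taggers.get(language)
    if tagger is None:
        return language, None
    return language, tagger(text.split())


# Placeholder taggers (assumptions); the real ones would be the
# trained Dutch and English CRF models.
def dutch_tagger(tokens):
    return [(token, "NL-TAG") for token in tokens]


def english_tagger(tokens):
    return [(token, "EN-TAG") for token in tokens]


if __name__ == "__main__":
    taggers = {"dutch": dutch_tagger, "english": english_tagger}
    # A trivial stand-in detector, for demonstration only.
    detector = lambda text: "english" if "the" in text.lower().split() else "dutch"
    print(tag_incident("The account was blocked", detector, taggers))
```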
