4.3 Experimental Set-up
4.3.2 Feature Types
As explained in Chapter 2 (Section 2.2), most text mining studies include (combinations of) lexical, character and syntactic features in their experiments. In this study, three new feature types are introduced, which together are referred to as sociolinguistic features. This section describes the NLP procedure for extracting these feature types.
Lexical features. Unlike some previous studies on age or gender prediction (see Section 4.2), we did not create a dictionary of hand-picked features that are likely to distinguish between different age or gender categories. Instead, we applied a data-oriented approach by extracting all token unigram features (i.e., a bag-of-words model), including punctuation marks, emoticons, pre-processed email addresses, hyperlinks and text formatting tags, allowing for a more complete picture of the age- and gender-related linguistic variation in the data. To enable a similar data-oriented approach for content and function words individually, a list of function words is required. Although such a list is limited in size for Standard Dutch, Flemish chatspeak confronts the approach with numerous linguistic variations: aside from concatenations and abbreviations such as “kheb” for “ik heb” (I have) and “gn” for “geen” (no), the NETLOG corpus also exhibits a wide variety of regiolectal function words that deviate from Standard Dutch on the phonological or morphological level (e.g., “veur” instead of “voor” (for), “werre” instead of “worden” (to become)) or even on the lexical level (e.g., “golle” in Brabant versus “junder” in West-Flanders, meaning “jullie” (you)). Therefore, we extended the list of standard function words with the most commonly used non-standard function words from each region, which we extrapolated manually from the qualitative analysis described in Chapter 3 (Section 3.4.2). The content word features included all other words together with the emoticons, but without punctuation marks. Additionally, to include context information in the experiments, we extracted token bigrams, trigrams and skip-grams6.
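The token-level features described above can be sketched in a few lines of Python. This is a minimal illustration, not the extraction code used in this study: the function names, the feature-type prefixes and the `max_skip` parameter are assumptions, and the sketch only produces skip-bigrams with at least one skipped token, since the gap size and order of the skip-grams are not specified here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous token n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens, max_skip=1):
    """Return token pairs that skip between 1 and max_skip tokens.

    Contiguous pairs are left out on purpose (an assumption of this
    sketch), because they are already covered by the regular bigrams.
    """
    pairs = []
    for i in range(len(tokens)):
        for gap in range(1, max_skip + 1):
            j = i + gap + 1
            if j < len(tokens):
                pairs.append((tokens[i], tokens[j]))
    return pairs


def lexical_features(tokens, max_skip=1):
    """Count unigrams, bigrams, trigrams and skip-bigrams for one message."""
    feats = Counter()
    for name, n in (("uni", 1), ("bi", 2), ("tri", 3)):
        # Prefix each feature with its type so that, e.g., a bigram and a
        # skip-bigram over the same two words stay distinct.
        feats.update((name,) + gram for gram in ngrams(tokens, n))
    feats.update(("skip",) + pair for pair in skip_bigrams(tokens, max_skip))
    return feats


if __name__ == "__main__":
    message = ["kzeg", "wel", "dak", "ziek", "zn"]
    for feature, count in lexical_features(message).items():
        print(feature, count)
```

Keeping the feature type in the key is one possible design choice; it allows each feature type to be evaluated separately or in combination without re-tokenising the data.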
Character features. As character n-grams have been shown to be useful for tracing stylometric evidence beyond topic and genre [185] and have proven reliable when dealing with limited data [131, 159], we included character bigrams, trigrams and tetragrams in the user profiling experiments.
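As an illustration of the character features, the sketch below counts character bigrams, trigrams and tetragrams for a single message. The function names are hypothetical, and the sketch assumes that n-grams are taken over the raw string, including spaces; whether word boundaries were treated this way in the experiments is not stated here.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def character_features(text, orders=(2, 3, 4)):
    """Count character bi-, tri- and tetragrams for one message."""
    feats = Counter()
    for n in orders:
        feats.update(char_ngrams(text, n))
    return feats


if __name__ == "__main__":
    print(character_features("kheb gn tijd"))
```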
6 Skip-grams are a variety of n-grams in which the features are not necessarily consecutive, but may leave gaps that are “skipped” over [77].
Syntactic features. We generated part-of-speech tags using Frog [27], a Dutch morpho-syntactic analyser and dependency parser. Because this NLP tool was trained on Standard Dutch, however, messages containing non-standard language forms proved problematic. Table 4.1 illustrates this with the Frog output for a Netlog posting containing non-standard language use, alongside the output for its Standard Dutch equivalent. Nevertheless, given that the same tagging errors are made when analysing both the training and the test folds, it is possible that these errors will only marginally affect the features’ performance.
Semantic features. This feature type was suggested by [167] for predicting age and gender. More specifically, they used the Linguistic Inquiry and Word Count (LIWC) word category lexicon [166], which includes sixty-four different categories of language, including topical categories such as family, affect, occupation and body. For example, words like “father”, “mother”, “brother” and “sister” all belong to the “family” category. We used a semi-automatic Dutch translation of the LIWC2015 lexicon [165], as described in [209], to extract semantic features during the experiments.
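The idea of lexicon-based semantic features can be illustrated as follows. The lexicon in this sketch is a toy stand-in and not the LIWC lexicon or its Dutch translation: only the “family” entries come from the example above, the “affect” entries are invented for illustration, and the function name and exact-match lookup are assumptions.

```python
from collections import Counter

# Toy category lexicon; NOT the actual LIWC data.
TOY_LEXICON = {
    "family": {"father", "mother", "brother", "sister"},
    "affect": {"happy", "sad"},  # invented entries, for illustration only
}


def semantic_features(tokens, lexicon=TOY_LEXICON):
    """Count how many tokens fall into each lexicon category."""
    feats = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token.lower() in words:
                feats[category] += 1
    return feats


if __name__ == "__main__":
    print(semantic_features(["My", "mother", "and", "brother", "are", "happy"]))
```

A word may belong to several categories at once, which is why the inner loop does not stop at the first match.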
Table 4.1: Frog analysis of a Netlog posting and its Standard Dutch equivalent. The labels marked in blue are correct.
Example7  Frog Output                                   Dutch  Frog Output
kzeg      N(soort,ev,basis,onz,stan)                    ik     VNW(pers,pron,nomin,vol,1,ev)
                                                        zeg    WW(pv,tgw,ev)
wel       BW( )                                         wel    BW( )
dak       N(soort,ev,basis,onz,stan)                    dat    VG(onder)
                                                        ik     VNW(pers,pron,nomin,vol,1,ev)
ziek      ADJ(vrij,basis,zonder)                        ziek   ADJ(vrij,basis,zonder)
zn        VNW(bez,det,stan,red,3,ev,prenom,zonder,agr)  ben    WW(pv,tgw,ev)
4.3.2.1 Extracting Sociolinguistic Features
SNS features. The statistical analyses described in Chapter 3 demonstrate that sociolinguistic principles such as Age Grading and the Adolescent Peak Principle also apply to CMC as can be
ness”: a word was labelled non-standard when (a) it was analysed as such by GNU Aspell or (b) it occurred in our list of manually collected non-standard chatspeak words that have a standard
Dutch homonym (see Chapter 3, Section 3.4.2). Next, for each message in the NETLOG subset, each word was represented as one of four features, “[standard_content]”, “[non-standard_content]”, “[standard_function]” or “[non-standard_function]”, based on the output of this operationalisation together with the manually extended list of function words described above. When evaluating the first 1,001 words from the first test fold of the NETLOG_SUBSET2 dataset (see Section 4.4.1), the methodology for extracting SNS features yielded an overall accuracy of 92.7%, i.e., a classification error of 7.3% +/- 1.6%. Moreover, the results showed that this method also performed well for three of the four labels individually: with an F-score of 98.1%, the best results were achieved when detecting standard Dutch function words. Detecting standard and non-standard content words resulted in F-scores of 92.0% and 92.5%, respectively. For the non-standard function words, however, the approach only achieved a 77.2% F-score. The precision, recall and F-scores for each label are stated in Table 4.2.
Table 4.2: Precision, recall and F-scores (%) for SNS feature engineering method.
Features                  Prec.   Rec.    F-score
[non-standard_function]   74.2    80.3    77.2
[standard_function]       96.3    100.0   98.1
[standard_content]        92.1    91.9    92.0
[non-standard_content]    94.4    90.7    92.5
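The labelling step evaluated above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in this study: the spell check performed by GNU Aspell is replaced by a lookup in a small set of standard word forms, and the three word lists, like the function name `sns_label`, are toy examples rather than the actual lists.

```python
# Toy stand-ins for the resources described in the text (all assumptions):
STANDARD_VOCAB = {"ik", "zeg", "wel", "dat", "ziek", "ben", "voor", "geen", "dak"}
FUNCTION_WORDS = {"ik", "dat", "voor", "geen", "veur", "gn", "dak"}
# Non-standard chatspeak forms that coincide with a Standard Dutch word.
NONSTANDARD_HOMONYMS = {"dak"}


def sns_label(word, vocab=STANDARD_VOCAB, homonyms=NONSTANDARD_HOMONYMS,
              function_words=FUNCTION_WORDS):
    """Map a word to one of the four standard/non-standard features."""
    w = word.lower()
    # (a) rejected by the spell checker (here: not in the toy vocabulary), or
    # (b) listed as a non-standard form with a standard homonym.
    non_standard = (w not in vocab) or (w in homonyms)
    status = "non-standard" if non_standard else "standard"
    kind = "function" if w in function_words else "content"
    return f"[{status}_{kind}]"


if __name__ == "__main__":
    for token in ["kzeg", "wel", "dak", "ik", "ziek", "veur"]:
        print(token, sns_label(token))
```

Because the homonym list is consulted in addition to the spell check, a form such as “dak” is labelled non-standard even though the spell checker alone would accept it.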
When performing a manual error analysis on the data itself, it appeared that there was some minor confusion between non-standard content words and non-standard function words, which can mostly be traced back either to spelling variations of non-standard function words that were not included in the manually composed list, such as “daz” for “da’s” (that’s), or to abbreviated words that could also refer to a non-standard word that is not a function word (e.g., “nr” can be an abbreviated form of both “naar” (to) and “nummer” (number)). Nonetheless, these words were still correctly labelled as non-standard language use. However, there were also nine cases in which a non-standard function word was incorrectly labelled as standard Dutch. After a closer inspection of the data, these errors mostly entailed non-standard usage of a standard (function) word. The same held for most of the other words that were incorrectly labelled as standard (e.g., homonymy between the non-standard infinitive form and the standard simple past form “zette” (put), and vice versa), even across different languages (e.g., homonymy between the standard English and the non-standard Dutch “my” in “my abv: brintleey!”8) (see also Chapter 3, Section 3.4.2). Finally, six personal names were incorrectly labelled as non-standard, because they did not occur in the list of named entities.
Table 4.3: Confusion matrix (absolute values) for the sociolinguistic features’ engineering method.
CM (Abs.)       [NS_function]   [NS_content]   [S_function]   [S_content]
[NS_function]   49              3              5              4
[NS_content]    5               303            1              25
[S_function]    0               0              237            0
[S_content]     12              15             3              339
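The scores in Table 4.2 and the overall accuracy can be recomputed from the confusion matrix in Table 4.3, as in the sketch below. Two points are assumptions rather than statements from the text: rows are taken to be the manually assigned (gold) labels and columns the predicted labels, which is the orientation consistent with Table 4.2, and the error margin is computed with a normal-approximation 95% interval (z = 1.96). Function and variable names are illustrative.

```python
import math

LABELS = ["NS_function", "NS_content", "S_function", "S_content"]

# Table 4.3; rows = gold labels, columns = predicted labels (assumed).
CM = [
    [49, 3, 5, 4],
    [5, 303, 1, 25],
    [0, 0, 237, 0],
    [12, 15, 3, 339],
]


def per_label_scores(cm):
    """Return (precision, recall, F-score) per label as fractions."""
    scores = []
    for k in range(len(cm)):
        tp = cm[k][k]
        predicted = sum(row[k] for row in cm)  # column total
        actual = sum(cm[k])                    # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f_score))
    return scores


def accuracy_with_margin(cm, z=1.96):
    """Return overall accuracy and the normal-approximation error margin."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    acc = correct / total
    margin = z * math.sqrt(acc * (1 - acc) / total)
    return acc, margin


if __name__ == "__main__":
    for label, (p, r, f) in zip(LABELS, per_label_scores(CM)):
        print(f"{label:12s} P={100 * p:5.1f} R={100 * r:5.1f} F={100 * f:5.1f}")
    acc, margin = accuracy_with_margin(CM)
    print(f"accuracy={100 * acc:.1f}%  "
          f"error={100 * (1 - acc):.1f}% +/- {100 * margin:.1f}%")
```

Under these assumptions, the matrix sums to the 1,001 evaluated words, and the recomputed values match those reported in Table 4.2 and the 92.7% accuracy with its 1.6% margin.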
Paralinguistic features. In chatspeak, paralinguistic and non-verbal cues that are present in spoken discourse but absent from the formal written repertoire are often compensated for by other linguistic features, such as emoticons, character flooding, and the use of upper case and punctuation flooding to express emphasis [49]. Hence, information on the presence of these paralinguistic features was also included in the document representations. More specifically, prior to pre-processing, occurrences of character and punctuation flooding were extracted and represented as “[char_flood]” and “[punct_flood]” features, and occurrences of non-standard capitalisation as “[char_upper]”. Additionally, emoticons and other combinations of characters and punctuation (e.g., “<3” representing the shape of a heart) were represented as “[punct_play]” features. Together with the standard/non-standard operationalisation described above, these features are referred to as sociolinguistic features.
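A minimal sketch of how the four paralinguistic features could be detected with regular expressions is given below. The patterns are assumptions made for illustration: the thresholds for flooding (three or more identical characters), the definition of non-standard capitalisation (a fully upper-case token of at least two letters) and the small emoticon pattern are not taken from the procedure described above.

```python
import re
from collections import Counter

# Illustrative patterns (assumptions, not the thesis's definitions).
PATTERNS = {
    # The same letter repeated three or more times, e.g. "zoooo".
    "[char_flood]": re.compile(r"([^\W\d_])\1{2,}"),
    # The same punctuation mark repeated three or more times, e.g. "!!!".
    "[punct_flood]": re.compile(r"([!?.,])\1{2,}"),
    # A token of two or more letters written entirely in upper case.
    "[char_upper]": re.compile(r"\b[A-Z]{2,}\b"),
    # A heart or a simple Western-style emoticon such as ":)" or ";-D".
    "[punct_play]": re.compile(r"<3|[:;=]-?[)(DPp]"),
}


def paralinguistic_features(message):
    """Count occurrences of each paralinguistic cue in a raw message."""
    feats = Counter()
    for name, pattern in PATTERNS.items():
        count = sum(1 for _ in pattern.finditer(message))
        if count:
            feats[name] = count
    return feats


if __name__ == "__main__":
    print(paralinguistic_features("IK ben zoooo ziek!!! <3 :)"))
```

The function operates on the raw message, in line with the extraction taking place prior to pre-processing, so that flooding and capitalisation are still visible in the text.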