Social network messages differ from other CMC text genres, such as e-mails or blogs, in several aspects. The length of each instance is usually much shorter, their vocabulary and grammatical structure are often non-standard and the distribution of lengths is very similar: the average post
length in the NETLOG_SUBSET2 is 12.7 tokens, with 90% of the posts containing maximum 30
tokens. In this section, we examine the behaviour of a text mining approach to automatically identify Netlog users’ age group and gender under the complex conditions of short social network postings containing non-standard language use. To evaluate different aspects of methodological design, first, we conducted a series of 576 experiments, in which we examined the performance
of three feature selection techniques (DF,χ2 and MI), four feature representation methods (tf-
idf, bin., Abs. andl2-norm.) and six different machine learning approaches (C-SVC, SGD, NB,
k-NN, RF and NN) for both age and gender prediction on the message-level, using the highly
skewed NETLOG_SUBSET2 in a ten-fold cross validation set-up. When analysing the effect of
each aspect on the performance of the model, the other factors were kept constant to allow for a valid comparison. Next, we used the best performing model to compare the performance of lexical, character, syntactic, semantic and sociolinguistic features for both tasks.
4.4.1 Age Group Identification
We calculated random baselines by generating predictions according to the following strategies: (a)
stratified, (b) uniform and (c) majority.Stratifiedrandom classification (RC) generates predictions
by respecting the training data’s class (imbalanced) distribution;uniformRC yields predictions
uniformly at random and amajorityRC strategy labels each instance with the most frequent
category in the training set. The random baselines according to each strategy are displayed in Table 4.4. Given the objective of detecting adults posing as adolescents, we focus on the performance of each combination for the adult class.
In the first series of experiments we only included token unigrams (BOW) as features. As can be observed in Table 4.5, all five machine learning approaches were able to improve upon
Table 4.4: Random baselines for age group identification in the NETLOG_SUBSET2 dataset.
Strategy PLUS20 MIN16
Prec. Rec. F-sc. Prec. Rec. F-sc.
Stratified 18.2 18.5 18.4 83.0 82.8 82.9
Uniform 16.9 49.0 25.1 82.5 50.0 62.2
Majority 0.0 0.0 0.0 82.8 100.0 90.6
the baseline performance with regard to the minority class (PLUS20) — despite the high class
imbalance and the short text samples attested in the dataset. The highest F-score for both the
PLUS20 and the MIN16 category — hence, the best overall performance — was obtained when
using a Multinomial Naïve Bayes classifier. When representing the 50,000 most discriminative features with their absolute, tf-idf or binary values in each document instance, the NB classifier achieved a 66.5% F-score for identifying adults (DF) and a 94.2% F-score for detecting children
between 11 and 15 years old (χ2) based on only a single message per user. Both implementations
of SVMs (C-SVC and SGD) and the NN classifier produced slightly lower best F-scores for the
older age group (64.6%, 63.9% and 63.6%, respectively), but the k-Nearest Neighbor and the
Random Forest algorithms yielded a notably lower performance when identifying the PLUS20
category (52.1% and 52.9%). With regard to the younger age group, all algorithms were able to improve upon the random baselines, demonstrating very similar results: the best performance was achieved by the NB classifier with an F-score of 94.2%, followed by both SVM algorithms
(93.5%), the NN classifier (93.3%), the RF algorithm (92.6%) andk-NN (92.2%).
With regard to feature selection, almost every machine learning approach benefited from reducing its dataset’s dimensionality to its 50,000 most discriminative features, except for the
k-NN classifier. Applying Document frequency led to the best F-scores for the older age group
when using the Naïve Bayes and the Random Forest Classifier, while performing feature selection based on the Chi Square method improved the performance of the C-SVC and the NN algorithms for the same category. With reference to the younger age category, the best results were achieved in almost all cases by applying Chi Square. When analysing the different feature representation methods, it is clear that the performance varies according to the machine learning algorithm
they are combined with. More specifically, the NB classifier produced similar results when
incorporating tf-idf, binary and absolute feature values and marginally lower results whenl2-
normalisation was applied. Conversely, the other ML algorithms tended to yield slightly better results when using the latter representation method. Table 4.5 provides an overview of the best performing combination of feature selection and representation methods per machine learning algorithm. The best performing model showed a classification error of 17.5% +/- 0.2%.
Table 4.5: Results (%) for the best performing combinations of feature selection and representation
methods per machine algorithm for age group identification in the NETLOG_SUBSET2 dataset.
ML Feat. Rep. Feat. Sel.
Scores
PLUS20 MIN16
Prec. Rec. F-sc. Prec. Rec. F-sc.
C-SVC l2 χ2(50K) 72.3 58.3 64.6 91.7 95.4 93.5
SGD Bin./l2 MI (75K)/χ2(10K) 70.4 58.6 63.9 90.4 97.0 93.5
NB Bin. DF (50K)/χ2(50K) 77.1 58.4 66.5 90.9 97.7 94.2
NN Tf-idf/l2 χ2(75K)/χ2(10K) 68.8 59.2 63.6 91.0 95.8 93.3
RF Tf-idf/Abs. DF (10K)/χ2(50K) 66.0 44.1 52.9 88.1 97.6 92.6
k-NN Tf-idf All feats. 67.0 42.6 52.1 88.9 95.6 92.2
After this first series of experiments in which different aspects of experimental design were examined for their robustness towards short social network communications, a second series of experiments was performed to investigate whether other feature types could improve upon the performance of the best performing model from the first series. More specifically, the performance of three traditional feature types was examined, which could potentially be informative for predicting age in Flemish chatspeak: lexical (token unigrams, content and function words),
charactern-gram (bi-, tri- and tetragrams) and syntactic features (POS uni-, bi- and trigrams).
Aside from these types, this second series of experiments also investigated whether training on a new type of features, i.e., sociolinguistic features (see Section 4.3.2), could potentially enhance the performance of the profiling system. For reasons of comparability, each feature was represented by its binary value and feature selection was performed using DF.
When analysing the results of the single feature types, the BOW features outperformed all other feature types when identifying the adult class, yielding a precision of 77.1%, a recall of 58.4% and an F-score of 66.5%. Stemming each token in the model increased the precision to
94.0%, but had a negative effect on recall and F-score (28.9% and 44.2%, respectively). Focusing on either content or function words only, the NB classifier was not able to produce better results than the bag-of-words model. Additionally, including context information in the experiments (i.e.,
by using tokenn-grams and skip-grams) did not enhance the performance. With regard to the non-
lexical feature types, character tetragrams achieved the best results, yielding a higher precision of 81.8%, but a lower recall and F-score of 54.4% and 65.3%, respectively. Syntactic features (POS) performed very poorly and training on semantic (LIWC) or sociolinguistic features as a single feature type resulted in the classifier attributing the majority label to each instance. Finally,
the best performing single feature types to identify the MIN16 category were the character
tetragrams and BOW features. An overview of the results is displayed in Figure 4.2 and 4.3 in Section 4.5.1.
4.4.2 Gender Detection
We performed a similar series of experiments to investigate the effect of methodological design on a text mining approach when distinguishing between male and female Netlog users in the NETLOG_SUBSET2 dataset. Random baselines for gender detection can be found in Table 4.6. Additionally, because research in psychiatry and behavioural sciences has shown that most individuals that engage in paedophilia are male (e.g., [78, 192]), we focus on the performance of each combination for the male category.
Table 4.6: Random baselines for gender detection in the NETLOG_SUBSET2 dataset.
Strategy MALE FEMALE
Prec. Rec. F-sc. Prec. Rec. F-sc.
Stratified 26.8 27.3 27.1 75.6 75.2 75.4
Uniform 25.2 51.0 33.7 75.2 49.5 59.7
Majority 0.0 0.0 0.0 75.0 100.0 85.7
As can be seen in Table 4.7, the results from the gender classification experiments are similar to those of the age prediction models with regard to experimental design: the best F-scores for both gender categories were achieved by the Naïve Bayes classifier when including the 50,000 most
frequent token unigrams (DF). Representing features with their absolute values resulted in a best
F-score of 38.8% for the MALEclass, while using tf-idf feature values led to 86.3% for the FEMALE
category. All models were able to improve upon the random baselines when identifying the minority class, but only the NB and the SGD classifiers produced better results than the majority random baseline when detecting female Netlog users. Again, both SVM models (C-SVC and SGD), together with the NN classifier achieved slightly lower results for gender prediction than the NB
classifier, while the performance of the Random Forest and thek-Nearest Neighbor algorithms
was considerably lower for both categories. Furthermore, applying Document Frequency for performing feature selection yielded the best results when detecting the minority class. The best combinations of feature selection and representation methods per classifier are shown in Table 4.7. The best performing model showed a classification error of 23.5% +/- 1.0%.
Table 4.7: Results (%) for the best performing combinations of feature selection and representation
methods per machine algorithm for gender detection in the NETLOG_SUBSET2 dataset.
ML Feat. Rep. Feat. Sel.
Scores
MALE FEMALE
Prec. Rec. F-sc. Prec. Rec. F-sc.
C-SVC Bin./l2 χ2(50K/10K) 43.3 29.5 35.1 77.5 97.0 86.1 SGD Tf-idf/l2 DF/χ2(10K) 42.9 29.6 35.0 77.9 93.4 85.0 NB Abs./Tf-idf DF (50K) 55.9 29.8 38.8 77.0 98.2 86.3 NN Bin. DF(10K)/χ2(50K) 36.9 17.2 23.5 75.6 98.3 85.5 RF Abs./l2 MI(50K/75K) 45.8 15.3 22.9 76.4 95.7 85.0 k-NN Bin./l2 χ2(75K/10K) 40.7 30.5 34.9 78.4 91.7 84.5
Next, to investigate which feature types were most informative when predicting gender, a NB classification model was trained on other lexical, character, syntactic, semantic and sociolin- guistic features, which were extracted by applying the Document Frequency selection metric and selecting the 50,000 most discriminative features using absolute feature values. Contrary to the
age detection experiments, charactern-gram features outperformed all other single feature types
for the minority class, with character tetragrams achieving a best F-score of 44.1%. Of the other feature types, only the bag-of-words model and the BOW skip-grams were able to improve upon
the baseline performance for the MALEcategory, yielding a 38.8% and 38.3% F-score, respectively.
the classification models trained on semantic (LIWC), syntactic (POS) or sociolinguistic features were not reduced to a majority label classifier (see Section 4.4.1), they scored considerably lower
than the lexical and character n-gram features. With regard to identifying the female Netlog
users in the NETLOG_SUBSET2 dataset, all single feature types produced considerably higher
F-scores compared to the stratified and uniform random baselines. However, aside from the BOW
model described earlier, they were not able to improve upon the majority baseline for the FEMALE
category. The results of each feature type can be found in Figure 4.4 and 4.5 in Section 4.5.1.