3.3 Framework Modules
3.4.4 Experiments: Advanced Neural Network Techniques
To further investigate the potential of embedding visual and textual features in the NER task, we extended the approach and carried out a full investigation of the impact of such features. First, we included other relevant NER features, such as Brown clusters [134]. Second, we adapted the proposed methodology to recent neural architectures (e.g., B-LSTM [116]). The modifications and configurations are detailed below.
Brown Clusters
The usefulness of Brown clusters (B) in NER [135] and the sensitivity of the NER task to the number of clusters have recently been studied in [134]. We explore these findings by varying the number of clusters (320, 640 and 1000). In these configurations, the Brown clusters are used as features along with the Standard (S) features and the features of [78]. Table 3.5 details the configurations.
Cfg    Features
cfg09  S + Brown 64M c320
cfg10  S + Brown 64M c640 (Bbest)
cfg11  S + Brown 500M c1000
cfg12  S + Lemma + Brown 64M c320
cfg13  S + Lemma + Brown 64M c640
cfg14  S + Lemma + Brown 500M c1000
cfg15  S + Bbest + CV
cfg16  S + Bbest + TX
cfg17  S + Bbest + CV + TX
Table 3.5: Exploring Brown clusters along with the features of [78]. Bbest stands for the best number of clusters found.
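The way a Brown cluster can enter a token-level feature set alongside the Standard features can be illustrated with a minimal sketch. It assumes a cluster file with one `<bit-path>\t<word>\t<count>` entry per line and a lower-cased vocabulary; the function names, the handful of "standard" features and the `<UNK>` placeholder are hypothetical and are not taken from the original implementation.

```python
from typing import Dict, List


def load_brown_paths(lines: List[str]) -> Dict[str, str]:
    """Parse '<bit-path>\\t<word>\\t<count>' lines into a word -> bit-path map.

    The file layout is an assumption made for this sketch.
    """
    paths: Dict[str, str] = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            paths[parts[1]] = parts[0]
    return paths


def token_features(token: str, brown: Dict[str, str]) -> Dict[str, object]:
    """Combine a few illustrative lexical features with the Brown cluster id."""
    feats: Dict[str, object] = {
        # Hypothetical stand-ins for the Standard (S) feature group.
        "lower": token.lower(),
        "is_title": token.istitle(),
        "is_upper": token.isupper(),
        "is_digit": token.isdigit(),
        "suffix3": token[-3:],
    }
    # Brown (B) feature: the cluster bit-path, or a placeholder when the
    # token is not in the clustering vocabulary.
    path = brown.get(token.lower())
    feats["brown"] = path if path is not None else "<UNK>"
    return feats


brown = load_brown_paths(["0010\tlondon\t12", "0011\tparis\t9"])
london = token_features("London", brown)
unknown = token_features("Zzz", brown)
```

Changing the number of clusters (c320, c640, c1000) only changes the cluster file that is loaded; the feature extraction itself stays the same in this sketch.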
3.4 Experimental Setup
Neural Network-based Features
Finally, we explore and analyze the impact of several newly proposed features based on state-of-the-art methods for image recognition and text classification. To evaluate the impact of each newly designed feature, we split them into distinct experimental configurations (Table 3.6).

Cfg    Features                              Cfg    Features
cfg18  S+CVcnn                               cfg30  =18+Bbest
cfg19  S+TXcnn                               cfg31  =19+Bbest
cfg20  S+TXemb                               cfg32  =20+Bbest
cfg21  S+TXstats                             cfg33  =21+Bbest
cfg22  S+TXcnn+TX                            cfg34  =22+Bbest
cfg23  S+TXcnn+TX+TXe+TXstats                cfg35  =23+Bbest
cfg24  S+TXcnn+CVcnn                         cfg36  =24+Bbest
cfg25  S+TXcnn+TX+CV                         cfg37  =25+Bbest
cfg26  S+CVcnn+CV                            cfg38  =26+Bbest
cfg27  S+CVcnn+CV+TX                         cfg39  =27+Bbest
cfg28  S+CVcnn+CV+TXcnn+TX                   cfg40  =28+Bbest
cfg29  S+CVcnn+CV+TXcnn+TX+TXemb+TXstats     cfg41  =29+Bbest
Table 3.6: The highlighted row depicts the best results w.r.t. Brown clusters (“Brown Best”) (cfg10). (a) “S” stands for Standard features; (b) “CNN CV” for state-of-the-art neural networks (computer vision); (c) “Standard CV” for traditional computer vision algorithms and techniques; (d) “Brown” (xM cL) for the Brown cluster configuration (x = vocabulary size and c = number of clusters); (e) “Standard” means the classic NER features; (f) “CNN TX” for state-of-the-art text classification; (g) “TX Embeddings” calculates the intersection rate of a given word embedding and a common-sense set for a given NE class; (h) “Standard TX” for traditional text classification algorithms and techniques; (i) “CNN TX Statistics” computes the basic statistics per NE class (“sum”, “max”, “min” and “average”) from the predictions of (f) and (g).
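Item (i) of the legend, the per-class statistics, can be sketched as follows. The sketch assumes that the predictions of (f) and (g) are available as a list of scores per NE class; the label set `NE_CLASSES`, the function name and the zero default for classes without predictions are illustrative assumptions, not the original implementation.

```python
from statistics import mean
from typing import Dict, List

# Assumed label set, for illustration only.
NE_CLASSES = ("PER", "LOC", "ORG")


def class_statistics(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Return sum, max, min and average of the prediction scores per NE class."""
    feats: Dict[str, float] = {}
    for ne_class in NE_CLASSES:
        # Fall back to a single zero score when a class has no predictions
        # (an assumption of this sketch).
        values = scores.get(ne_class) or [0.0]
        feats[f"{ne_class}_sum"] = sum(values)
        feats[f"{ne_class}_max"] = max(values)
        feats[f"{ne_class}_min"] = min(values)
        feats[f"{ne_class}_avg"] = mean(values)
    return feats


stats = class_statistics({"PER": [0.25, 0.75], "LOC": [0.5]})
```

Each token then receives four numeric features per NE class, which can be concatenated with the other feature groups of a configuration.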
Results
The complete benchmark configuration has the following dimensions: cfg × (DStrain + (DStrain × DStest)) × A, where cfg is the total number of feature sets (i.e., distinct configurations), DStrain is the total number of training sets, DStest is the total number of test sets and A is the total number of algorithms. This leads to the following number of experiments: 41 × (4 + (4 × 3)) × 9 = 5,904, i.e., 41 experiment configurations, 4 training sets (Ritter, WNUT-15, WNUT-16 and WNUT-17), 3 test sets (WNUT-15, WNUT-16 and WNUT-17) and 9 NER architectures (DT, RF, CRF, CRF-PA, LSTM, B-LSTM+CRF, Char+B-LSTM+CRF and B-LSTM+CNN+CRF).

Demystifying the impact of images and news: Figure 3.2 shows the performance of CRF on different datasets and feature sets. The x-axis represents the different feature sets (Tables 3.4, 3.5 and 3.6), while the y-axis shows the average F1-measure. To highlight the impact of the different groups of features, we categorize the F1 scores into four ascending scales, from worst to best: red, yellow, gray and green. Some patterns w.r.t. the addition of images and text as input features are clearly observable. First, standard textual features (TX) often perform much worse than standard image features (CV) and than the combination of both, as observed in the sets cfg02×cfg03×cfg04, cfg06×cfg07×cfg08 and cfg15×cfg16×cfg17. This is to some extent expected, since classifying news data with a one-vs-all strategy is not an easy task. A better solution might therefore be to take probabilities into account instead of binary values. Moreover, our improvements to the TX component (cfg19, cfg20 and cfg21) outperform the similar features proposed by [78]. Among those, it is worth noting that the text correlation (TXstats) has a greater impact than any other textual feature. This is due to the higher level of abstraction obtained when computing word embedding distances across seeds in a distant supervision fashion. Regarding the image detection component, introducing state-of-the-art computer vision algorithms (CVcnn) also improves on the previous strategy (CV), although without bringing improvements as large as those in TX. This is due to the common-sense rules proposed by [78] in this layer.

Brown Clusters: Finally, the inclusion of tuned Brown clusters (Bbest, cfg30-41) along with the proposed features proves beneficial to the performance. Overall, the best results were obtained by concatenating the previous and proposed features along with Brown clusters (cfg41).

Architectures: We benchmark distinct NER architectures comparing the following feature sets: cfg10 (weak baseline), cfg04 [78] as the images-and-news baseline and cfg41 (HORUS). Table 3.7 presents the results. As expected, CRFs and state-of-the-art NN architectures performed best. The comparison sheds light on the impact of our proposed features (best configuration, cfg41) when compared to the broadly implemented NER features (cfg10) and the architecture proposed by [78] (cfg04). Overall, the additional features introduced in this work clearly improve the performance of the majority of the models (DT, RF, CRF, B-LSTM+CRF) on all datasets.
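As a sanity check on the benchmark dimensions given at the beginning of the Results, the experiment grid can be enumerated explicitly. This is a minimal sketch: the reading of the DStrain + (DStrain × DStest) term as "one run per training set on its own plus one run per training/test pair" is an assumption, as are the variable names and the `cfg01` to `cfg41` naming.

```python
from itertools import product

# 41 feature configurations (naming assumed to follow cfg01 ... cfg41).
configs = [f"cfg{i:02d}" for i in range(1, 42)]
train_sets = ["Ritter", "WNUT-15", "WNUT-16", "WNUT-17"]
test_sets = ["WNUT-15", "WNUT-16", "WNUT-17"]
n_architectures = 9

# Assumed interpretation of DStrain + (DStrain x DStest): one setting per
# training set alone (test set None) plus one per (train, test) pair.
data_settings = [(train, None) for train in train_sets]
data_settings += list(product(train_sets, test_sets))

total_experiments = len(configs) * len(data_settings) * n_architectures
print(total_experiments)
```

With 4 training sets and 3 test sets this yields 16 data settings per configuration and architecture.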
Dataset  Metric | Decision Trees     | Random Forest      | CRF                | B-LSTM+CRF [116]   | B-LSTM+C+CRF [71]  | B-LSTM+C+CRF+CNN [132]
         cfg →  | 10    04    41     | 10    04    41     | 10    04    41     | 10    04    41     | 10    04    41     | 10    04    41
Ritter   P      | 0.48  +2%   +4%    | 0.51  +1%   +24%   | 0.73  +5%   +7%    | 0.77  +1%   −3%    | 0.81  −5%   −1%    | 0.81  −5%   −5%
Ritter   R      | 0.49  +1%   +3%    | 0.48  −1%   −2%    | 0.58  −8%   −2%    | 0.63  +5%   +5%    | 0.59  +5%   +4%    | 0.62  +3%   +5%
Ritter   F      | 0.49  +1%   +3%    | 0.49  +4%   +7%    | 0.58  +2%   +7%    | 0.68  +1%   +1%    | 0.67  +1%   +1%    | 0.69  −1%   +1%
WNUT-15  P      | 0.49  +2%   +5%    | 0.52  +7%   +25%   | 0.72  +7%   +9%    | 0.72  −4%   −2%    | 0.77  −3%   −4%    | 0.78  −4%   −5%
WNUT-15  R      | 0.50  +0%   +5%    | 0.49  +0%   +1%    | 0.48  −1%   +6%    | 0.69  +1%   +1%    | 0.65  +2%   +2%    | 0.66  +2%   +2%
WNUT-15  F      | 0.50  +0%   +5%    | 0.50  +5%   +9%    | 0.56  +2%   +8%    | 0.68  +0%   +0%    | 0.69  +0%   −1%    | 0.71  −1%   −2%
WNUT-16  P      | 0.49  +1%   +6%    | 0.52  +14%  +23%   | 0.72  +7%   +9%    | 0.72  −4%   −2%    | 0.77  −3%   −3%    | 0.78  −4%   −6%
WNUT-16  R      | 0.50  +1%   +6%    | 0.48  +0%   +2%    | 0.48  −1%   +6%    | 0.69  +0%   +1%    | 0.65  +2%   +2%    | 0.66  +2%   +2%
WNUT-16  F      | 0.49  +1%   +6%    | 0.50  +5%   +10%   | 0.56  +2%   +8%    | 0.69  −1%   +0%    | 0.69  +0%   +0%    | 0.71  −1%   −2%
WNUT-17  P      | 0.44  +3%   +7%    | 0.47  +13%  +24%   | 0.76  +2%   +1%    | 0.76  −2%   −2%    | 0.76  +0%   −2%    | 0.77  −3%   −3%
WNUT-17  R      | 0.45  +4%   +6%    | 0.44  +3%   +4%    | 0.50  +0%   +5%    | 0.63  +1%   +1%    | 0.64  +0%   +1%    | 0.62  +1%   +1%
WNUT-17  F      | 0.44  +4%   +6%    | 0.45  +6%   +12%   | 0.60  +0%   +4%    | 0.67  +0%   +0%    | 0.69  +0%   −1%    | 0.67  +0%   −1%
Table 3.7: Improvements (green) and decreases (red) of the performance measures across datasets, feature sets (cfg) and architectures, represented in a color gradient with 5-point intervals. 0% represents a tiny improvement i (0.1% ≤ i ≤ 0.99%), which is not representative, although technically not zero.
Figure 3.2: The CRF performance (cfg × F1) in different datasets/feature sets

Sampling sets: Figure 3.3 depicts the impact of images and news on the F1-measure for the best architecture of a weak baseline (CRF) and of a strong baseline (B-LSTM+CRF), defined according to the previous experiments (Table 3.7). The results confirm that the proposed features consistently boost the performance of the models in the majority of the experiments. It is worth noting the substantial impact on the CRF-based model. Our proposed features (cfg41) improve on Lexical + Brown Cluster and [78] in more than 90% of the cases (and perform at least similarly in 100% of the cases).
The precision and recall trade-off: Moreover, we notice that a basic CRF architecture with the best feature configuration (cfg41) outperforms a state-of-the-art B-LSTM architecture w.r.t. precision. The same feature set also positively impacted the recall of B-LSTM in all experiments.
Increasing training data: Finally, we trained a B-LSTM+CRF architecture on an expanded set created by merging all datasets. We removed duplicates, i.e., overlapping sentences, from the union of the respective training, dev and test sets. The results are presented in Table 3.8 and support the claim that we can go beyond lexical features and further investigate the use of images and news to benefit NER on noisy data.
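The merging step can be sketched as below. The text does not state how overlapping sentences were detected, so this sketch assumes an exact match on the token sequence and keeps the first occurrence; the `Sentence` type and the function name are hypothetical.

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (token, NER tag) pairs.
Sentence = List[Tuple[str, str]]


def merge_without_duplicates(datasets: Iterable[Iterable[Sentence]]) -> List[Sentence]:
    """Union several datasets, dropping sentences whose tokens were already seen.

    Duplicate detection by exact token sequence is an assumption of this sketch.
    """
    seen = set()
    merged: List[Sentence] = []
    for dataset in datasets:
        for sentence in dataset:
            key = tuple(token for token, _tag in sentence)
            if key not in seen:
                seen.add(key)
                merged.append(sentence)
    return merged


set_a = [[("Paris", "B-LOC"), ("rocks", "O")], [("hi", "O")]]
set_b = [[("Paris", "B-LOC"), ("rocks", "O")], [("bye", "O")]]
merged = merge_without_duplicates([set_a, set_b])
```

A looser matching rule (for example, case-insensitive or whitespace-normalized comparison) would only require changing how `key` is built.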
Figure 3.3: The positive impact of images and news across distinct training-test set pairs. A comparison between the best weak (CRF) and the best strong (B-LSTM+CRF) baselines.
             +cfg10   +cfg04     +cfg41
B-LSTM+CRF   0.5217   0.5352 ↑   0.5376 ↑
Table 3.8: B-LSTM+CRF F1-measure with expanded training/dev/test data over different feature sets.