5.3. Empirical Study: Comparing Methods Based on Single Model Performances

We demonstrate that Evaluation 1 and Evaluation 2 fail to identify that two learning approaches are the same. By implication, a significant difference in test score does not allow the conclusion that one approach is better than the other.

We compare a learning approach A against itself; we refer to the two copies as approach A and approach $\tilde{A}$ hereafter. Approaches A and $\tilde{A}$ use the same code with the same configuration and are executed on the same computer. The only difference is that the sequence of random numbers changes each time the approaches are trained.

A suitable evaluation method should conclude that there is no significant difference between A and $\tilde{A}$ in most cases. We use p = 0.05 as the threshold; hence, we would expect a significant difference between A and $\tilde{A}$ to occur in at most 5% of the cases.


As datasets, we use those described in section 3.3 for common NLP sequence tagging tasks. As learning approach, we use the BiLSTM-CRF architecture described in section 3.1. We use 2 hidden layers with 100 hidden units each, variational dropout (Gal and Ghahramani, 2016) of 0.25 applied to both dimensions, Nadam as optimizer (Dozat, 2015), and a mini-batch size of 32 sentences. For the English datasets, we use the pre-trained embeddings by Komninos and Manandhar (2016). For the German datasets, we trained word embeddings using word2vec on about 116 million sentences from German Wikipedia and German news articles. The German embeddings were published in Reimers et al. (2014).

Training

In total, we trained 100,000 models for each task with different random seed values. We randomly assign 50,000 models to approach A, while the other 50,000 models are assigned to approach $\tilde{A}$.

For simplicity, we write those models as two matrices with 50 columns and 1,000 rows each:

$$\left[ A_i^{(j)} \right] \quad \text{and} \quad \left[ \tilde{A}_i^{(j)} \right]$$

with $i = 1, \ldots, 50$ and $j = 1, \ldots, 1000$. Each model $A_i^{(j)}$ has a development score $\Psi^{(dev)}_{A_i^{(j)}}$ and a test score $\Psi^{(test)}_{A_i^{(j)}}$.

Model $A_*^{(j)}$ marks the model with the highest development score from the row $A^{(j)}_{1 \le i \le 50}$, and $\tilde{A}_*^{(j)}$ is the model with the highest development score from $\tilde{A}^{(j)}_{1 \le i \le 50}$. Hence, we test Evaluation 2 with n = m = 50.
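The selection of the best model per row can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the score matrices are filled with random placeholder values, and the names `dev_a`, `test_a`, and `best_by_dev` are hypothetical and do not come from the study's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder scores (NOT the study's data): 1,000 rows (index j) and
# 50 columns (index i), matching the matrix layout described in the text.
n_rows, n_cols = 1000, 50
dev_a = rng.normal(94.0, 0.2, size=(n_rows, n_cols))
test_a = rng.normal(90.5, 0.3, size=(n_rows, n_cols))


def best_by_dev(dev, test):
    """For each row j, return the test score of the model with the highest dev score."""
    dev = np.asarray(dev)
    test = np.asarray(test)
    best_i = dev.argmax(axis=1)                      # column index of A_*^(j) per row
    return test[np.arange(dev.shape[0]), best_i]     # test score of A_*^(j)


test_a_star = best_by_dev(dev_a, test_a)  # one test score per row j
```

Applying the same function to the scores of $\tilde{A}$ would give the second set of 1,000 test scores that Evaluation 2 compares row by row.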

Statistical Significance Test

We use the bootstrap method by Berg-Kirkpatrick et al. (2012) with 10,000 samples to test for statistical significance between test performances with a threshold of p < 0.05.

For Evaluation 1, we test for statistical significance between the models $A_i^{(j)}$ and $\tilde{A}_i^{(j)}$ for all i and j. For Evaluation 2, we test for statistical significance between $A_*^{(j)}$ and $\tilde{A}_*^{(j)}$ for $j = 1, \ldots, 1000$.
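A paired bootstrap test of this kind can be sketched as follows. The sketch follows our reading of the procedure in Berg-Kirkpatrick et al. (2012): resample the test sentences with replacement and count how often the resampled F1 difference exceeds twice the observed difference. It is not the implementation used in the study. The input format (per-sentence true-positive, false-positive, and false-negative counts), the function names, and the handling of a zero observed difference are assumptions made for this illustration.

```python
import numpy as np


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 if all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def paired_bootstrap_p(counts_a, counts_b, n_samples=10_000, seed=0):
    """Estimate a p-value for the F1 difference between two systems.

    counts_a, counts_b: integer arrays of shape (n_sentences, 3) holding
    per-sentence (tp, fp, fn) counts on the same test sentences (assumed format).
    """
    counts_a = np.asarray(counts_a)
    counts_b = np.asarray(counts_b)
    n = counts_a.shape[0]
    delta = f1_from_counts(*counts_a.sum(axis=0)) - f1_from_counts(*counts_b.sum(axis=0))
    if delta == 0:
        return 1.0  # assumption of this sketch: no observed difference, nothing to test
    if delta < 0:
        # Orient the comparison so that the first system is the better one.
        counts_a, counts_b, delta = counts_b, counts_a, -delta
    rng = np.random.default_rng(seed)
    exceed = 0
    for _ in range(n_samples):
        idx = rng.integers(0, n, size=n)  # resample sentences with replacement
        d = (f1_from_counts(*counts_a[idx].sum(axis=0))
             - f1_from_counts(*counts_b[idx].sum(axis=0)))
        if d > 2 * delta:
            exceed += 1
    return exceed / n_samples
```

A difference would then be called significant when the returned value is below 0.05, the threshold used in the text.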

Results

We compute in how many cases the bootstrap method finds a statistically significant difference. Further, we compute the average F1 test-score difference τ for pairs with an estimated p-value between 0.04 and 0.05. This value can be seen as a threshold: If the F1-score difference is larger than this threshold, there is a high chance that the bootstrap method finds a statistically significant difference between the two models.

Further, we compute the differences between the test performances for approach A and $\tilde{A}$. For Evaluation 1, we compute

$$\Delta^{(test)}_{(i,j)} = \left| \Psi^{(test)}_{A_i^{(j)}} - \Psi^{(test)}_{\tilde{A}_i^{(j)}} \right|.$$

For Evaluation 2, we compute

$$\Delta^{(test)}_{(j)} = \left| \Psi^{(test)}_{A_*^{(j)}} - \Psi^{(test)}_{\tilde{A}_*^{(j)}} \right|.$$

For those delta values we compute the 95th percentile $\Delta^{(test)}_{95}$. The value indicates that a difference in the test score for a given task should be higher than $\Delta^{(test)}_{95}$; otherwise, there is a chance greater than 5% that the difference is due to chance for the given task and the given network architecture.³
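The summary statistics described in this section can be computed from two arrays: the estimated p-values and the absolute test-score differences of the compared pairs. The following sketch assumes exactly that input; the function name `summarize_pairs` and the dictionary keys are hypothetical, and `np.percentile` uses its default linear interpolation, which may differ from the convention used in the study.

```python
import numpy as np


def summarize_pairs(p_values, deltas, alpha=0.05):
    """Summarize pairwise comparisons between A and A-tilde.

    p_values: estimated p-value per compared pair.
    deltas:   test-score difference per compared pair (same order).
    """
    p = np.asarray(p_values, dtype=float)
    d = np.abs(np.asarray(deltas, dtype=float))
    band = (p > 0.04) & (p < 0.05)  # pairs just below the significance threshold
    return {
        "pct_significant": 100.0 * np.mean(p < alpha),
        "tau": d[band].mean() if band.any() else float("nan"),
        "delta_95": float(np.percentile(d, 95)),
        "delta_max": float(d.max()),
    }


# Tiny made-up example, only to show the call; not data from the study.
stats = summarize_pairs([0.01, 0.045, 0.5, 0.9], [2.0, 1.0, 0.5, 0.1])
```

Reading the result: if `delta_95` is well above `tau`, then test-score differences large enough to be flagged as significant occur in more than 5% of the comparisons of the approach with itself.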

Single Run Comparison

Table 5.1 depicts the main results for Evaluation 1. For the ACE 2005 - Events task, we observe a significant difference between the models $A_i^{(j)}$ and $\tilde{A}_i^{(j)}$ in 34.48% of the cases. For the other tasks, we observe similar results: between 10.72% and 33.20% of the cases are statistically significant.

The average F1-score difference for statistical significance for the ACE 2005 - Events task is τ = 1.97 percentage points. However, we observe that the difference between $A_i^{(j)}$ and $\tilde{A}_i^{(j)}$ can be as large as 9.04 percentage points F1. While this is a rare outlier, we observe that the 95th percentile $\Delta^{(test)}_{95}$ is more than twice as large as τ for this task and dataset.

Task                      Threshold τ   % significant   $\Delta^{(test)}_{95}$   $\Delta^{(test)}_{Max}$
ACE 2005 - Entities       0.65          28.96%          1.21                     2.53
ACE 2005 - Events         1.97          34.48%          4.32                     9.04
CoNLL 2000 - Chunking     0.20          18.36%          0.30                     0.56
CoNLL 2003 - NER-En       0.42          31.02%          0.83                     1.69
CoNLL 2003 - NER-De       0.78          33.20%          1.61                     3.36
GermEval 2014 - NER-De    0.60          26.80%          1.12                     2.38
TempEval-3 - Events       1.19          10.72%          1.48                     2.99

Table 5.1: The same BiLSTM-CRF approach was evaluated twice under Evaluation 1. The threshold τ column depicts the average difference in percentage points F1-score for statistical significance with 0.04 < p < 0.05. The % significant column depicts how often the difference between $A_i^{(j)}$ and $\tilde{A}_i^{(j)}$ is significant. $\Delta^{(test)}_{95}$ depicts the 95th percentile of differences between $A_i^{(j)}$ and $\tilde{A}_i^{(j)}$. $\Delta^{(test)}_{Max}$ shows the largest difference.

We observe those variances not only for our implementation of the BiLSTM-CRF architecture. We also observe this issue for two recently published BiLSTM-CRF systems for Named Entity Recognition, from Ma and Hovy (2016) and from Lample et al. (2016). Lample et al. reported an F1-score of 90.94% and Ma and Hovy reported an F1-score of 91.21% for English NER. Ma and Hovy draw the conclusion that their system achieves a significant improvement over the system by Lample et al.

³ Note that $\Delta^{(test)}$

We re-ran both implementations multiple times, each time only changing the seed value of the random number generator. We ran the Ma and Hovy system 86 times and the Lample et al. system, due to its high computational requirements, 41 times. The score distribution is depicted as a violin plot in Figure 5.4. Using a Kolmogorov-Smirnov significance test (Massey, 1951), we observe a statistically significant difference between these two distributions (p < 0.01). The plot reveals that the quartiles for the Lample et al. system are above those of the Ma and Hovy system. According to a Brown-Forsythe test, the standard deviations of the two distributions differ with p < 0.05. Table 5.2 shows the minimum, the median, and the maximum of the test performances.
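The two distribution tests named above are available in SciPy. The following sketch shows how such a comparison could be run; it uses randomly generated placeholder scores with the same sample sizes (86 and 41), not the scores from our runs, and the function name `compare_score_distributions` is hypothetical. SciPy exposes the Brown-Forsythe test as Levene's test with `center="median"`.

```python
import numpy as np
from scipy.stats import ks_2samp, levene


def compare_score_distributions(scores_x, scores_y):
    """Compare two samples of test scores obtained with different seed values."""
    # Two-sample Kolmogorov-Smirnov test: do the samples come from the same distribution?
    _, ks_p = ks_2samp(scores_x, scores_y)
    # Levene's test centered on the median is the Brown-Forsythe test for equal spread.
    _, bf_p = levene(scores_x, scores_y, center="median")
    return {"ks_p": float(ks_p), "brown_forsythe_p": float(bf_p)}


# Placeholder samples (NOT the measured scores), sized like the two sets of runs.
rng = np.random.default_rng(0)
scores_x = rng.normal(90.6, 0.20, size=86)
scores_y = rng.normal(90.8, 0.25, size=41)
result = compare_score_distributions(scores_x, scores_y)
```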

Figure 5.4: Distribution of scores for re-running the system by Ma and Hovy (left) and Lample et al. (right) multiple times with different seed values. Dashed lines indicate quartiles.

System          Reported F1   # Seed values   Min. F1   Median F1   Max. F1
Ma and Hovy     91.21%        86              89.99%    90.64%      91.00%
Lample et al.   90.94%        41              90.19%    90.81%      91.14%

Table 5.2: The systems by Ma and Hovy (2016) and Lample et al. (2016) were run multiple times with different seed values.

Liu et al. (2017a) repeated our experiment and found similar variances for the two architectures. They further found that the performance of the Ma and Hovy system increases if it is trained on a GPU instead of a CPU. The difference between our scores and the scores reported by Ma and Hovy might therefore be due to the difference between running the code on a CPU and on a GPU.

In conclusion, training two non-deterministic approaches a single time and comparing their test performances is insufficient if we want to find out which approach is superior for a task. Large differences can occur merely due to better or worse sequences of random numbers.

In a usual setup, approaches are often trained multiple times, e.g., for tuning hyperparameters, and the model with the highest development score would be used for labeling the test data, i.e., be used to report the test performance. For the Lample et al. system, we observe a Spearman's rank correlation between the development and the test score of ρ = 0.229. This indicates a weak correlation and that the performance on the development set is not a reliable indicator of the test performance. Using the run with the best development score (94.44%) would yield a test performance of a mere 90.31%. Using the second best run on the development set (94.28%) would yield state-of-the-art performance with 91.00%. This difference is statistically significant (p < 0.002).
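The check described in this paragraph can be sketched with SciPy's `spearmanr`: compute the rank correlation between development and test scores across runs, then compare the test score of the run selected by development score with the best test score among all runs. The function name `dev_test_agreement` and the example scores are hypothetical and only illustrate the call.

```python
import numpy as np
from scipy.stats import spearmanr


def dev_test_agreement(dev_scores, test_scores):
    """Return Spearman's rho plus the test score picked via dev and the best test score."""
    dev = np.asarray(dev_scores, dtype=float)
    test = np.asarray(test_scores, dtype=float)
    rho, _ = spearmanr(dev, test)            # rank correlation between dev and test
    picked = float(test[int(dev.argmax())])  # test score of the run with the best dev score
    return float(rho), picked, float(test.max())


# Made-up scores for five runs (NOT measured values).
rho, picked, best = dev_test_agreement(
    [94.1, 94.4, 94.3, 94.0, 94.2],
    [90.8, 90.3, 91.0, 90.6, 90.7],
)
```

A low `rho` together with a gap between `picked` and `best` corresponds to the situation described above: selecting by development score does not reliably select the run with the highest test score.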

Best Run Comparison

Non-deterministic approaches can produce weak as well as strong models as shown in the previous section. Instead of training those a single time, we might want to compare only the “best” model for each approach, i.e., the models that performed best on the development set. This idea is formalized in Evaluation 2.

Table 5.3 depicts the results of comparing only the models that performed best on the development set. For all tasks, we observe only a small Spearman's rank correlation ρ between the development and the test score. The low correlation indicates that a run with a high development score does not necessarily yield a high test score.

Task                 ρ        τ      % significant   $\Delta^{(dev)}_{95}$   $\Delta^{(test)}_{95}$   $\Delta^{(test)}_{Max}$
ACE - Entities       0.153    0.65   24.86%          0.42                    1.04                     1.66
ACE - Events         0.241    1.97   29.08%          1.29                    3.73                     7.98
CoNLL - Chunking     0.262    0.20   15.84%          0.10                    0.29                     0.49
CoNLL - NER-En       0.234    0.42   21.72%          0.27                    0.67                     1.12
CoNLL - NER-De       0.422    0.78   25.68%          0.58                    1.44                     2.22
GermEval - NER-De    0.333    0.60   16.72%          0.48                    0.90                     1.63
TempEval - Events    -0.017   1.19   9.38%           0.74                    1.41                     2.57

Table 5.3: The same BiLSTM-CRF approach was evaluated twice under Evaluation 2. ρ is the Spearman's rank correlation coefficient between the development and the test score. The threshold τ column depicts the average difference in percentage points F1-score for statistical significance with 0.04 < p < 0.05. The % significant column depicts how often the difference between $A_*^{(j)}$ and $\tilde{A}_*^{(j)}$ is significant. $\Delta_{95}$ depicts the 95th percentile of differences between $A_*^{(j)}$ and $\tilde{A}_*^{(j)}$. $\Delta^{(test)}_{Max}$ shows the largest difference.

For the ACE 2005 - Events task, we observe a significant difference between $A_*^{(j)}$ and $\tilde{A}_*^{(j)}$ in 29.08% of the cases. We observe for this task that the difference in test score between $A_*^{(j)}$ and $\tilde{A}_*^{(j)}$ can be as large as 7.98 percentage points F1-score.

As before, we observe that $\Delta^{(test)}_{95}$ is much larger than τ, i.e., test performances of $A_*$

In document Universal Machine Learning Methods for Detecting and Temporal Anchoring of Events (Page 116-121)