Discussion of Evaluation Results - Universal Machine Learning Methods for Detecting and Tempora

Bergstra and Bengio (2012) analyzed seven design choices and hyperparameters for a feedforward network on various image datasets. They showed that a small number of parameters makes the difference between mediocre and state-of-the-art performance. However, the relative importance of each hyperparameter varies between datasets: certain parameters were highly relevant for one dataset but not for others. This varying importance is a challenge when tuning the network for new tasks, because it is not known a priori which parameters will be important; hence, all parameters must be tuned.

In our experiments, we could confirm that only some parameters are relevant for achieving good performance. For example, we observed large performance differences between the different pre-trained word embeddings, but only minor differences for the number of recurrent units. Further, we observed that several techniques can significantly boost performance while their hyperparameters are of minor importance. For example, gradient normalization improved the performance for all tasks, but the normalization threshold τ was of minor relevance. The same holds for variational dropout: using it improved the performance, but the dropout rate p was not that important.
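The role of the threshold τ can be illustrated with a minimal sketch. It assumes that gradient normalization means rescaling the gradient whenever its L2 norm exceeds τ; the function name and the flat-list representation of the gradient are hypothetical and not taken from the implementation used in the experiments.

```python
import math

def normalize_gradient(grad, tau=1.0):
    """Rescale a gradient vector so that its L2 norm is at most tau.

    Assumption: 'gradient normalization' is norm-based rescaling.
    The direction of the gradient is preserved; only its length changes.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= tau:
        # Small gradients pass through unchanged.
        return list(grad)
    scale = tau / norm
    return [g * scale for g in grad]

# A gradient with norm 5 is shrunk to norm tau = 1.
rescaled = normalize_gradient([3.0, 4.0], tau=1.0)
```

Under this reading, τ only caps the step length of unusually large updates, which is one plausible reason why its exact value mattered little.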

In contrast to Bergstra and Bengio (2012), the relative importance of each parameter was comparably consistent across all tasks. Different pre-trained embeddings had a large impact for all tasks, and the number of recurrent units had a minor impact for all tasks. We observed a changing relative importance for only two parameters: the classifier (Softmax or CRF) and character-based word representations.

The CRF classifier did not yield an improvement for tasks with no or low dependencies between tags, which was expected. For tasks with stronger dependencies between tags, the CRF classifier yielded a significant improvement. We observed that there was still a difference between a CRF and a softmax classifier with stacked LSTM-layers, even though the difference decreased with more LSTM-layers. It appears that stacked layers are better at capturing dependencies between words in a sentence.
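The difference between the two classifiers at prediction time can be sketched with a toy example. The softmax classifier picks the best tag for each token independently, whereas a CRF picks the tag sequence that maximizes the sum of per-token scores and tag-transition scores. All scores and function names below are hypothetical; CRF training is omitted and only the decoding step is shown.

```python
def softmax_decode(emissions):
    """Pick the highest-scoring tag for each token independently."""
    return [max(range(len(scores)), key=scores.__getitem__)
            for scores in emissions]

def viterbi_decode(emissions, transitions):
    """Return the tag sequence with the highest total score.

    emissions[t][k]   : score of tag k at token t
    transitions[p][k] : score of moving from tag p to tag k
    """
    n_tags = len(emissions[0])
    best = list(emissions[0])          # best score of a path ending in each tag
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = [], []
        for cur in range(n_tags):
            prev = max(range(n_tags),
                       key=lambda p: best[p] + transitions[p][cur])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][cur] + scores[cur])
        best = new_best
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag.
    tag = max(range(n_tags), key=best.__getitem__)
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]

# Tags: 0 = O, 1 = B, 2 = I. The transition O -> I is heavily penalized.
emissions = [[2.0, 1.5, 0.0], [0.5, 0.0, 1.0]]
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
independent = softmax_decode(emissions)            # [0, 2], i.e. O I
joint = viterbi_decode(emissions, transitions)     # [1, 2], i.e. B I
```

In this toy case the independent decision produces the invalid sequence O I, while joint decoding produces B I. For tasks with few dependencies between tags, the transition scores carry little information and both decoders tend to agree.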

The two character-based word representation mechanisms proposed by Ma and Hovy (2016) and Lample et al. (2016) yielded a statistically significant improvement only for the POS, chunking, and event detection tasks. For NER and entity recognition, no statistically significant difference was observed. This contrasts with the conclusions of Ma and Hovy (2016) and Lample et al. (2016), who claimed that character-based word representations are helpful for English NER.

Character-based word representations can address two challenges: creating a meaningful representation for unknown words and using sub-word information for the classification, e.g. from morphology. Unknown words, i.e., words with no pre-trained word embedding, were a minor issue for the analyzed tasks. The ratio of unknown words was only between 0.5% and 3%. For other domains or other languages, unknown words are a bigger challenge, and character-based embeddings could add more value there. Sub-word information can be beneficial for syntactic tasks like POS, but for tasks like NER, it provides little or no benefit. From the characters of a word, it is often not possible to decide whether it is a named entity or which type of entity it is, especially for rare words. The characters in (unseen) stage names, company names or product names usually do not provide much information about the type of entity; hence, deriving a meaningful representation is not possible.
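The ratio of unknown words mentioned above can be computed as in the following sketch. The function name is hypothetical, and lowercasing tokens before the lookup is an assumption; the actual lookup depends on how the embedding vocabulary was built.

```python
def unknown_word_ratio(tokens, vocabulary):
    """Fraction of tokens without an entry in the embedding vocabulary.

    Assumption: the vocabulary is lowercased, so tokens are lowercased
    before the lookup.
    """
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token.lower() not in vocabulary)
    return unknown / len(tokens)

# Toy data: one of four tokens has no pre-trained embedding.
vocabulary = {"the", "band", "played"}
tokens = ["The", "band", "Xylophonix", "played"]
ratio = unknown_word_ratio(tokens, vocabulary)  # 0.25
```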

In the experiments of Bergstra and Bengio, the optimal configurations for the different image datasets were fairly distinct, i.e., a value that worked well for one task could be a bad choice for a different task. In our experiments, we observed that the optimal values are rather consistent across the datasets. For example, the embeddings by Komninos and Manandhar (2016) were the best option for all tasks, Nadam (Dozat, 2015) was the best optimizer, and two stacked BiLSTM-layers achieved the best performance for all datasets or were on par with the best option.

Whether this consistency holds in an evaluation with more diverse datasets, e.g. datasets in different languages or from different domains, is left to future research. A high consistency would be desirable, as it significantly reduces the tuning effort.

Explaining why certain design choices and hyperparameters work well is difficult. The parameter with the largest impact was the choice of pre-trained word embeddings. This was an important factor for all datasets. High-quality embeddings allow the network to use a lot of background knowledge that is incorporated into the embeddings. For example, word embeddings can provide information about syntactic and semantic relationships between words. This is especially beneficial for tokens that are not observed during training.

In our experiments, we can confirm that embeddings based on dependencies better capture functional properties of words, while window-based embeddings better capture topical similarity of words (Levy and Goldberg, 2014; Komninos and Manandhar, 2016). For example, the dependency-based embeddings by Levy and Goldberg (2014) worked well for part-of-speech tagging and chunking, but less well for NER or entity recognition. GloVe embeddings (Pennington et al., 2014), which are based on context windows, worked well for NER. Komninos and Manandhar (2016) combined the idea of dependency-based embeddings with a context-based approach. This appears to give a good representation of both the functional and the semantic properties of words. As a consequence, these embeddings performed well in the evaluated tasks.

Our results in section 3.5.9 show that the capacity of the network, i.e., the number of recurrent units, is of minor importance. The BiLSTM-CRF architecture performs similarly well if we choose the capacity too small or too large. Other aspects of the architecture were far more important: the incorporation of background knowledge using word embeddings and the process of finding a local minimum. Which local minimum the network converges to is influenced by the optimizer, the dropout mechanism, gradient normalization, and the mini-batch size.

The mini-batch size can influence whether the network converges to a flat or a sharp minimum (Keskar et al., 2016). Hochreiter and Schmidhuber (1997b) (informally) defined a minimum as flat when the error function remains approximately constant for a large connected region in weight-space, and as sharp when the error function increases rapidly in a small neighborhood of the minimum. A conceptual sketch is given in Figure 3.12. The error functions for training and testing are typically not perfectly aligned, i.e., the local minima on the train or development set are not the local minima for the held-out test set. A sharp minimum usually generalizes worse, as a slight variation results in a rapid increase of the error function. Hence, it is desirable that a network converges to a flat minimum, as those usually generalize better to unseen data (Keskar et al., 2016).

[Figure: train error and test error curves f(x), with a flat minimum and a sharp minimum marked.]

Figure 3.12: A conceptual sketch of flat and sharp minima from Keskar et al. (2016). The Y-axis indicates values of the error function and the X-axis the weight-space.
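The informal definition of flat and sharp minima can be made concrete with a one-dimensional toy example. The two quadratic error functions, the neighborhood radius, and the shift between the train and test curves are hypothetical values chosen only for illustration; they are not taken from the experiments or from Keskar et al. (2016).

```python
def flat_error(w):
    """Toy error function with a flat minimum at w = 0."""
    return 0.1 * w * w

def sharp_error(w):
    """Toy error function with a sharp minimum at w = 0."""
    return 10.0 * w * w

def sharpness(error_fn, w_min, radius=0.5, steps=100):
    """Largest increase of the error within +/- radius of a minimum."""
    base = error_fn(w_min)
    offsets = [radius * (2 * i / steps - 1) for i in range(steps + 1)]
    return max(error_fn(w_min + d) - base for d in offsets)

def shifted_error_at_train_minimum(error_fn, shift=0.3):
    """Test error at the training minimum w = 0, assuming the test curve
    is the training curve shifted by `shift` along the weight axis."""
    return error_fn(0.0 - shift)

flat_sharpness = sharpness(flat_error, 0.0)     # about 0.025
sharp_sharpness = sharpness(sharp_error, 0.0)   # about 2.5
flat_test = shifted_error_at_train_minimum(flat_error)
sharp_test = shifted_error_at_train_minimum(sharp_error)
```

With the same shift between the train and test curves, the test error at the training minimum is 100 times larger for the sharp function than for the flat one, which mirrors the argument that flat minima tolerate the mismatch between training and test error better.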

Keskar et al. (2016) observed that a neural network tends to converge to sharp minima when it is trained with large mini-batches. However, when the network is trained with small mini-batches, it tends to converge to flat minima. They conclude that training with small mini-batches is favorable for achieving better performance. We conclude from our results in section 3.5.10 that training a network with too small a mini-batch size, namely a size of 1, can also be a disadvantage. For chunking, NER, and entity recognition, a mini-batch size of 1 achieved far worse results than training the same network with larger mini-batches, for example mini-batches of 8 or 16 sentences. However, for POS tagging and event detection, a mini-batch size of 1 was optimal. So far, it is not clear why small mini-batch sizes lead to far worse results on some datasets. How to determine the optimal batch size is left to future research.

In document Universal Machine Learning Methods for Detecting and Temporal Anchoring of Events (Page 70-72)