Chapter 4 Experiments on modern corpora
4.1.1 Configurations
The most important parameters in my experiments are those related to grid construction and grid scoring. Additional parameters cover choice of representative point, smoothing, filtering and logistic regression.
1 This chapter is partly based on Wing (2011), Wing and Baldridge (2011) and Wing and Baldridge (2014). Jason
Grid construction For grid construction, the options are a uniform grid or a k-d tree grid. For uniform grids, the main tunable parameter is grid size (in degrees), while for k-d trees it is bucket size (BK), i.e. the number of documents above which a node is divided in two.
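As a rough illustration of the two grid types, the following Python sketch maps a coordinate to a uniform grid cell and splits a set of document locations into k-d tree leaves. The function names, the alternating split axis, and the median split point are assumptions made for the example; the splitting rule actually used in the experiments may differ.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def uniform_cell(lat: float, lon: float, size_deg: float) -> Tuple[int, int]:
    """Return (row, column) of the uniform grid cell containing the point.

    Rows count up from the south pole, columns from the antimeridian.
    Points exactly at lat = 90 or lon = 180 are not special-cased here.
    """
    row = math.floor((lat + 90.0) / size_deg)
    col = math.floor((lon + 180.0) / size_deg)
    return (row, col)


def kd_leaves(points: List[Point], bucket_size: int, depth: int = 0) -> List[List[Point]]:
    """Split points until no leaf holds more than bucket_size of them."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")
    if len(points) <= bucket_size:
        return [points]
    axis = depth % 2  # assumption: alternate latitude and longitude
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2  # assumption: split at the median document
    return (kd_leaves(ordered[:mid], bucket_size, depth + 1)
            + kd_leaves(ordered[mid:], bucket_size, depth + 1))
```

A smaller bucket size yields more, smaller leaves in dense regions, whereas a uniform grid keeps the same cell size everywhere regardless of how many documents fall in a cell.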
Grid scoring For grid scoring, the options are:
• RAND: Random baseline
• PRIOR: Cell prior maximum
• NB: Naive Bayes
• KL: KL divergence
• ACP: Average cell probability
• IGR: Naive Bayes using features selected by information gain ratio
• FLATLR: Logistic regression model over all leaf nodes
• HIERLR: Product of logistic regression models at each node in a hierarchical grid (eq. 3.17)
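To make the idea of ranking cells concrete, here is a minimal Python sketch of the KL option. It assumes that the divergence is taken from the document's word distribution to each cell's distribution, that the cell distributions have already been smoothed, and that the lowest-divergence cell ranks first. The function names and the dictionary representation are hypothetical.

```python
import math
from typing import Dict, List

Dist = Dict[str, float]  # word -> probability


def kl_divergence(doc: Dist, cell: Dist) -> float:
    """KL(doc || cell).

    The cell distribution must be smoothed so that every word with
    nonzero document probability also has nonzero cell probability.
    """
    return sum(p * math.log(p / cell[w]) for w, p in doc.items() if p > 0.0)


def rank_cells_by_kl(doc: Dist, cells: Dict[str, Dist]) -> List[str]:
    """Return cell identifiers ordered from lowest to highest divergence."""
    return sorted(cells, key=lambda c: kl_divergence(doc, cells[c]))
```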
Some of these methods are associated with additional parameters, which must be tuned on the dev set:
• For IGR, there is one additional parameter, the cutoff (CU), a percentile. For a given value c, we eliminate the bottom (100 − c)% of words, as measured by information gain ratio.
• For HIERLR, there are three additional parameters: subdivision factor (SF), beam size (BM), and hierarchy depth (D). See §3.7 and §4.2.5 for more discussion. All of our test-set results use a depth of three levels.
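The IGR cutoff can be sketched as follows, assuming the information gain ratio of each word has already been computed and stored in a dictionary. The function name, the rounding rule, and the handling of ties are assumptions made for the example.

```python
import math
from typing import Dict, Set


def select_by_cutoff(igr: Dict[str, float], cutoff: float) -> Set[str]:
    """Keep the top `cutoff` percent of words by information gain ratio.

    Equivalently, drop the bottom (100 - cutoff) percent. Ties at the
    boundary are broken by sort order, which is an arbitrary choice here.
    """
    if not 0.0 <= cutoff <= 100.0:
        raise ValueError("cutoff must be a percentile between 0 and 100")
    ranked = sorted(igr, key=lambda w: igr[w], reverse=True)
    keep = math.ceil(len(ranked) * cutoff / 100.0)  # assumption: round up
    return set(ranked[:keep])
```

With ten scored words and a cutoff of 30, this keeps the three highest-scoring words and discards the other seven.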
Choice of representative point Once grid cells have been scored, a single point representing the top-ranked cell needs to be chosen. This can be done using the geographic center of a cell or the centroid of the training documents in the cell. The latter produces consistently better results and is used in further experiments (§3.3), but has a significant dependence on the particular set of training
documents, which especially matters when this set is small (§5.2.2). Another possibility is to take into account cells further down in the ranking, using an algorithm such as mean shift (§1.2, §7.2.2), although preliminary experiments with this algorithm were not promising.
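The two simplest choices of representative point can be sketched in Python as below. The sketch assumes rectangular cells given by their bounding coordinates and takes the centroid as the plain arithmetic mean of latitudes and longitudes, which ignores wrap-around at the antimeridian. Both function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def geographic_center(south: float, west: float, north: float, east: float) -> Point:
    """Midpoint of a rectangular cell's bounding box."""
    return ((south + north) / 2.0, (west + east) / 2.0)


def training_centroid(docs: List[Point]) -> Point:
    """Mean location of the training documents in a cell.

    Assumption: a simple coordinate average, with no special handling
    for cells that straddle the antimeridian.
    """
    if not docs:
        raise ValueError("cell contains no training documents")
    lat = sum(p[0] for p in docs) / len(docs)
    lon = sum(p[1] for p in docs) / len(docs)
    return (lat, lon)
```

The dependence on the training set is easy to see here: for a cell holding a single training document, the centroid is exactly that document's location.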
Smoothing As discussed in §3.4.1, I consider three types of smoothing of language models: Dirichlet, Jelinek, and my own method, pseudo-Good-Turing. Based on preliminary experiments, I choose Dirichlet smoothing in conjunction with Naive Bayes, with the Dirichlet parameter set to m = 1,000,000. For KL divergence, Dirichlet smoothing did not perform well in my experiments, and I instead use pseudo-Good-Turing, which has no tunable parameter.
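For readers unfamiliar with Dirichlet smoothing, the following Python sketch shows the standard textbook form, in which a cell's word counts are interpolated with a global word distribution and the parameter m acts as a pseudo-count. This is an illustration under that assumption; the exact formulation in §3.4.1 may differ in its details, and the function and argument names are hypothetical.

```python
from typing import Dict


def dirichlet_prob(word: str,
                   cell_counts: Dict[str, int],
                   global_probs: Dict[str, float],
                   m: float = 1_000_000.0) -> float:
    """Dirichlet-smoothed probability of a word in a cell's language model.

    P(w | cell) = (count(w, cell) + m * P(w | global)) / (tokens(cell) + m)
    """
    total = sum(cell_counts.values())
    count = cell_counts.get(word, 0)
    return (count + m * global_probs.get(word, 0.0)) / (total + m)
```

With a value of m as large as 1,000,000, cells containing far fewer than a million tokens are dominated by the global distribution, so only words that are strongly over-represented in a cell shift its probabilities noticeably.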
Filtering For the most part, I do not pre-filter words out of a language model, except for applying standard language-dependent sets of stopwords. Some methods that I compare against, however (e.g. GEOTEXT, §4.2.1), do pre-filter words, and I investigate whether this is needed.
Logistic regression Due to its speed and flexibility, I use Vowpal Wabbit (Agarwal et al., 2014) for logistic regression.2 I estimate parameters with limited-memory BFGS (Nocedal, 1980; Byrd
et al., 1995), as I found that stochastic gradient descent (SGD) (Bottou, 2010) yielded significantly worse results.3 Unless otherwise mentioned, I use 26-bit feature hashing (Weinberger et al., 2009)
and 40 passes over the data (optimized based on early experiments on development data). For the subcell classifiers in hierarchical classification, which have fewer classes and much less data, I use 24-bit features and 12 passes.
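The role of the bit size in feature hashing can be illustrated with a short Python sketch. It maps feature strings into a table of 2^b slots by masking a hash value and counts how many features land in an already occupied slot. CRC-32 is used here only because it is available in the standard library; it is not Vowpal Wabbit's own hash function, and both function names are hypothetical.

```python
import zlib
from typing import Dict, Iterable


def hashed_index(feature: str, bits: int) -> int:
    """Map a feature string to a slot in a table of 2**bits entries."""
    if not 1 <= bits <= 32:
        raise ValueError("bits must be between 1 and 32 for a 32-bit hash")
    mask = (1 << bits) - 1
    # Stand-in hash (CRC-32), an assumption made for illustration only.
    return zlib.crc32(feature.encode("utf-8")) & mask


def count_collisions(features: Iterable[str], bits: int) -> int:
    """Count distinct features that fall into an already occupied slot."""
    seen: Dict[int, str] = {}
    collisions = 0
    for feature in set(features):
        idx = hashed_index(feature, bits)
        if idx in seen:
            collisions += 1
        else:
            seen[idx] = feature
    return collisions
```

Because the index keeps only the low-order bits of the hash, adding bits can only separate features that previously shared a slot, so the collision count never increases as the bit size grows.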
Vowpal Wabbit has a hold-out mechanism, which holds out a portion of the training data and uses it to determine when to stop training, to avoid potential overfitting problems. I turn this mechanism off due to poor performance with it enabled. This means I have to carefully optimize the number of passes using the dev set, to avoid both underfitting (not enough passes) and overfitting (too many passes), both of which cause significant decreases in accuracy. This is in contrast to the number of bits used for feature hashing, where it is merely necessary to use a large enough feature space to avoid clashes, and using more bits than necessary does not materially hurt performance.
2 I also investigated some other tools, including the mlogit package of R (Croissant, 2013) and Rob Malouf’s TADM (Tools for Advanced Data Modeling) package (Malouf, 2002).
3 SGD holds out the promise of being faster than BFGS. However, I found that attempting to tune SGD to achieve similar results to BFGS produced even slower running times than BFGS. One possibility I did not consider, which may produce comparable accuracy and faster running time, was to use SGD to produce a preliminary solution and optimize further with BFGS.
                  Feature bits
Passes     22    23    24    25    26    27
16        394   355   363   380   390   391
24        346   309   287   302   299   287
32        277   266   250   259   254   257
40        267   259   256   247   249   255
48        275   266   267   254   254   253
64        301   281   286   286   276   277
Table 4.1: Median prediction error (km) on the TWUS dev set for various combinations of feature-hashing bit size and number of BFGS passes.
The effect of different numbers of feature bits and passes can be seen in Table 4.1, which shows median prediction error on TWUS-LARGE with a uniform 5° grid under FLATLR. In this case 25 bits is slightly better than 26, but in other experiments (e.g. in HIERLR, and for ENWIKI13, which has more features) I found better performance from 26 bits, which is what I ultimately selected.
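The logistic regression settings described in this section could be passed to Vowpal Wabbit roughly as in the following Python sketch, which only assembles a command line and does not run it. The file paths, the function name, and the use of the one-against-all reduction (--oaa) for the multiclass problem are assumptions; the text does not state which multiclass reduction was used.

```python
from typing import List


def vw_train_command(train_path: str,
                     model_path: str,
                     num_classes: int,
                     bits: int = 26,
                     passes: int = 40) -> List[str]:
    """Build a Vowpal Wabbit training command as an argument list."""
    return [
        "vw",
        "-d", train_path,                # training data
        "-f", model_path,                # where to save the model
        "--oaa", str(num_classes),       # assumption: one-against-all
        "--loss_function", "logistic",   # logistic regression
        "--bfgs",                        # limited-memory BFGS, not SGD
        "-b", str(bits),                 # feature-hashing bit size
        "--passes", str(passes),         # passes over the data
        "-c",                            # cache file, needed for multiple passes
        "--holdout_off",                 # disable the hold-out mechanism
    ]
```

Calling the function with the defaults gives the 26-bit, 40-pass configuration, while passing bits=24 and passes=12 gives the settings described for the subcell classifiers. The resulting list can be handed to subprocess.run where Vowpal Wabbit is installed.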