Chapter 3 Document geolocation models
3.2 Grid types
In the context of the general grid-based approach to geolocation followed by this dissertation and described in §1.2, there are several options for constructing the grid and for modeling.
3.2.1 Uniform grid
The simplest grid is a uniform grid with rectangular cells of a fixed size, such as 1° by 1° or 100 km by 100 km, a strategy followed by Serdyukov et al. (2009) and O’Hare and Murdock (2013) for Flickr, Cheng et al. (2010) and Wing and Baldridge (2011) for Twitter, and Wing and Baldridge (2011) for Wikipedia. Compared to a grid that takes document density into account, it over-represents rural areas at the expense of urban areas. Furthermore, the rectangles of a degree-based grid are not equal-area, but shrink in width away from the Equator. (However, the shrinkage is mild until near the poles. For example, at 45° latitude, the ratio of width to height is still better than 0.7 to 1.)
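As a rough illustration of the two points above (and not the implementation used in this dissertation), the following Python sketch maps a coordinate to a uniform-grid cell and estimates the width-to-height ratio of a degree-based cell. The function names `uniform_cell` and `width_to_height_ratio` are hypothetical, and the ratio assumes a spherical Earth, on which the east-west extent of one degree of longitude scales with the cosine of the latitude.

```python
import math


def uniform_cell(lat, lon, cell_deg=1.0):
    """Return the (row, col) index of the uniform-grid cell containing a point.

    Rows count northward from the South Pole and columns eastward from
    longitude -180. Points lying exactly on lat = 90 or lon = 180 are not
    specially handled in this sketch.
    """
    row = math.floor((lat + 90.0) / cell_deg)
    col = math.floor((lon + 180.0) / cell_deg)
    return row, col


def width_to_height_ratio(lat):
    """Approximate ratio of east-west to north-south extent of a degree cell.

    Assumes a spherical Earth, so the ratio is simply cos(latitude).
    """
    return math.cos(math.radians(lat))


if __name__ == "__main__":
    # A point near Washington, DC, on a 1-degree grid.
    print(uniform_cell(38.95, -77.05))
    # Cells at 45 degrees latitude are still more than 0.7 times as wide as tall.
    print(round(width_to_height_ratio(45.0), 4))
```

Under the spherical assumption, the ratio at 45° latitude is cos 45°, or about 0.707, consistent with the figure quoted above.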
Figure 1.3 in Chapter 1 shows a choropleth map demonstrating the uniform grid construction. The rank of cells for the test document Pennsylvania Avenue (Washington, DC) in ENWIKI13 is plotted, for a uniform 0.1° grid. The top-ranked cell is the correct one. The highest-ranked cells are near Washington, DC, but other culturally similar areas, namely nearby large cities (Baltimore, Philadelphia, New York City, Pittsburgh) and suburban northern Virginia, are also highly ranked. The importance of certain words in the article is visible in the delineation of the states of Pennsylvania (due to “Pennsylvania” occurring in the article’s topic) and Maryland (three-fourths of Pennsylvania Avenue is in Maryland).
A truly equal-area grid can be constructed by means of a quaternary triangular mesh (Dutton, 1996). Dias et al. (2012) used such a construction for Wikipedia, but it did not yield consistently better results. For this reason, as well as for ease of implementation and because most of the populated regions of interest for this dissertation are far from the poles (where the worst distortion occurs), I construct rectangular grids.
3.2.2 Adaptive k-d tree grid
Roller, Speriosu, Rallapalli, Wing and Baldridge (2012) introduced an adaptive grid based on k-d trees (Bentley, 1975), which I make use of in this dissertation. The idea is to use variable-sized cells so that the number of documents per cell is approximately the same. A k-d tree in 2 dimensions starts out with a single grid cell and adds documents to this cell one by one. When the number of documents reaches a threshold termed the bucket size, the cell is split in two along the dimension with the greatest range of points seen, following Friedman et al. (1977). Roller et al. (2012) considered splitting either at the midpoint of the range of points or at the median of the dimension in question for all points in the cell, and found that neither method was clearly superior. In my preliminary experiments I found midpoint splitting to work at least as well as median splitting, and I use it in my subsequent experiments.
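The construction just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code of Roller et al. (2012) or of this dissertation: the class name `KdCell` and its fields are hypothetical, points are plain (latitude, longitude) pairs, and a cell whose points cannot be separated (for example, because they all coincide) is simply left unsplit.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


@dataclass
class KdCell:
    """One cell of a 2-d k-d tree grid (hypothetical illustrative class)."""
    bucket_size: int
    points: List[Point] = field(default_factory=list)
    split_dim: Optional[int] = None
    split_val: Optional[float] = None
    low: Optional["KdCell"] = None
    high: Optional["KdCell"] = None

    def is_leaf(self) -> bool:
        return self.low is None

    def add(self, p: Point) -> None:
        """Add a document location, splitting the cell when it fills up."""
        if not self.is_leaf():
            child = self.low if p[self.split_dim] < self.split_val else self.high
            child.add(p)
            return
        self.points.append(p)
        if len(self.points) >= self.bucket_size:
            self._split()

    def _split(self) -> None:
        # Choose the dimension with the greatest range of points seen.
        ranges = [max(p[d] for p in self.points) - min(p[d] for p in self.points)
                  for d in (0, 1)]
        dim = 0 if ranges[0] >= ranges[1] else 1
        lo = min(p[dim] for p in self.points)
        hi = max(p[dim] for p in self.points)
        mid = lo + (hi - lo) / 2.0  # midpoint splitting
        # Leave the cell unsplit if the midpoint cannot separate the points.
        if not (lo < mid <= hi):
            return
        self.split_dim, self.split_val = dim, mid
        self.low, self.high = KdCell(self.bucket_size), KdCell(self.bucket_size)
        pts, self.points = self.points, []
        for p in pts:
            (self.low if p[dim] < mid else self.high).add(p)

    def leaves(self) -> List["KdCell"]:
        """Return the leaf cells, which form the grid."""
        if self.is_leaf():
            return [self]
        return self.low.leaves() + self.high.leaves()


if __name__ == "__main__":
    root = KdCell(bucket_size=4)
    for pt in [(0.0, 0.0), (0.0, 10.0), (1.0, 2.0), (1.0, 8.0)]:
        root.add(pt)
    print(root.split_dim, root.split_val, [len(c.points) for c in root.leaves()])
```

Median splitting would differ only in how `mid` is chosen; it tends to give more evenly filled children, at the cost of split positions that depend more heavily on the particular sample of points.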
Figure 3.1: k-d tree grid construction. Relative Naive Bayes rank is shown for cells for ENWIKI13 test document Pennsylvania Avenue (Washington, DC), surrounding the true location. (Constructed with assistance from Grant DeLozier.)
Figure 3.1 shows a sample k-d tree grid in the form of a choropleth map. Cells are denser and correspondingly smaller on land than over the sea, especially in coastal regions of the northeastern United States. Map callouts zoom in on Washington, DC and New York City, showing the particularly high concentration of cells in city centers.
3.2.3 City-based grid
Some researchers have used a city-based representation, either with a full set of cities covering the Earth and taken from a comprehensive gazetteer (Han et al., 2014) or with a limited, pre-specified set of cities (Kinsella et al., 2011; Sadilek et al., 2012). This is somewhat comparable to k-d trees in that it adapts to areas of greater population. The construction of Han et al. (2014), for example, determines a set of city attractors by reducing the total set of cities in a gazetteer, amalgamating cities into nearby larger cities in the same second-level administrative district (in the same state, in the case of the United States). Training documents are then assigned to a pseudo-document corresponding to the nearest city. An even more direct method would use census-tract boundaries when available.
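The final assignment step can be sketched as below. This is an illustration only, not the procedure of Han et al. (2014): it omits the amalgamation of cities entirely, measures distance with the haversine formula on a spherical Earth, and uses a hypothetical three-city gazetteer whose coordinates are approximate. The names `haversine_km` and `assign_to_cities` are likewise hypothetical.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a spherical Earth


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def assign_to_cities(docs, cities):
    """Group document texts into one pseudo-document per nearest city.

    docs   -- iterable of (text, (lat, lon)) pairs
    cities -- dict mapping city name to (lat, lon)
    """
    pseudo_docs = defaultdict(list)
    for text, loc in docs:
        nearest = min(cities, key=lambda name: haversine_km(loc, cities[name]))
        pseudo_docs[nearest].append(text)
    return dict(pseudo_docs)


if __name__ == "__main__":
    # Hypothetical gazetteer with approximate coordinates.
    cities = {
        "Austin": (30.27, -97.74),
        "Houston": (29.76, -95.37),
        "San Antonio": (29.42, -98.49),
    }
    docs = [("a", (30.30, -97.70)), ("b", (29.80, -95.40)), ("c", (30.20, -97.80))]
    print(assign_to_cities(docs, cities))
```

The linear scan over all cities is adequate for a sketch; with a full gazetteer, a spatial index would normally replace it.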
An advantage of city-based grids, especially over coarser-scale rectangular grids, is that in the latter the boundary between cells may run through the middle of a city. This has the effect of splitting a presumably unitary linguistic area and grouping the different parts of the city with the heterogeneous linguistic areas of other cities. For example, a coarse grid with a cell boundary that passes through the middle of Austin, Texas might group one half with San Antonio and the other half with Houston, making it more difficult to correctly geolocate a document whose location is in Austin. The resulting statistical bias is known as the modifiable areal unit problem (Gehlke and Biehl, 1934; Openshaw, 1983). With finer grids, however, this is less likely to be an issue. It is also possible to mitigate this issue in k-d trees by dividing a cell in such a way as to produce the maximum margin between the dividing line and the nearest document on each side. (This was implemented in the code of Roller et al. (2012) but not investigated in their paper.)
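One simple reading of such a maximum-margin split, along a single dimension, is to place the dividing line in the middle of the widest gap between consecutive document coordinates. The sketch below illustrates that reading only; it is an assumption on my part and is not taken from the code of Roller et al. (2012). The function name `max_margin_split` is hypothetical.

```python
def max_margin_split(values):
    """Return a split position in the middle of the widest gap in `values`.

    `values` holds the coordinates of a cell's documents along the chosen
    dimension. Returns None if there are fewer than two distinct values.
    """
    xs = sorted(set(values))
    if len(xs) < 2:
        return None
    # Find the widest gap between consecutive distinct coordinates.
    gap, left = max((b - a, a) for a, b in zip(xs, xs[1:]))
    return left + gap / 2.0


if __name__ == "__main__":
    # Two clusters of latitudes; the split should fall between them.
    print(max_margin_split([30.1, 30.2, 30.25, 29.5, 29.4]))
```

Used without further constraints, this rule can split off a single outlying document, so a practical version would presumably restrict the candidate gaps to some central portion of the cell.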
A disadvantage of city-based grids is that they are unable to resolve locations at a finer scale than an entire city, whereas rectangular grids can be made as fine-scale as desired. This is a particular advantage of k-d trees, which will naturally increase their resolution in the vicinity of populated regions, leading to grids that may be able to distinguish cities from suburbs or even identify individual neighborhoods in a city, as shown in Figure 3.1.
Other disadvantages of city-based methods are the dependency on time-specific population data, making them unsuitable for some corpora (e.g. 19th-century documents); the difficulty in adjusting grid resolution in a principled fashion; and the fact that not all documents are near a city. Han et al. (2014) in fact find that 8% of tweets are “rural” and cannot be predicted by their model. This may be worse for Wikipedia, which includes coverage of many small towns and villages.
For these reasons, I do not consider city-based grids in my experiments.