Chapter 5 Document geolocation for the digital humanities
5.1.1 The applicability of domain adaptation
Why would domain adaptation be beneficial? In this case, for example, the assumption would be that many of the words that are geographically indicative of certain places in Wikipedia are indicative of those same places in BEADLE or WOTR. This assumption appears reasonable in many instances. For instance, many of the most geographically indicative words are toponyms or other geographically-salient proper nouns, such as names of Native American tribes or groups such as the Mormons.
It is true that some of these names have changed. For example, Beadle refers to the Hopi tribe as the “Moqui”, and collectively terms the mountains of New Mexico the “Sierra Madre”, whereas nowadays there is no collective term for those mountains and the term “Sierra Madre” itself normally refers to a different mountain range in Mexico. Beadle also terms the Purgatoire River in Colorado the “Las Animas River”, whereas the modern-day “Animas River” is a different Colorado river. Furthermore, none of these older usages can be found in Wikipedia. A similar situation obtains in WOTR with places such as “Keatsville, Missouri” (modern-day “Keytesville”, whose Wikipedia entry does not list the older spelling).
A different issue comes from toponyms referring to places that no longer exist. Beadle, for example, mentions a number of railroad ghost towns that had already ceased to exist in his time, such as Deadfall, Last Chance, Murder Gulch and Painted Post in Utah, and Benton, Wyoming. Of these, only Benton can be found in Wikipedia (and not in its own article but in the article concerning the nearby town of Sinclair). Other towns are mentioned that existed at the time but no longer do, such as Red Dog, California (has its own Wikipedia article) and Hazard Station, Wyoming (no mention in Wikipedia).
lan’s Store” in Missouri (apparently a settlement containing a store) or various army camps. These camps may be given names such as “Camp McIntosh” (named after the commander James McIntosh) and may appear in the bylines of letters in WOTR, but have a strictly temporary existence and disappear as soon as the army occupying them moves on. Only somewhat less temporary are numerous forts such as Fort Lyon in Missouri and Fort Jackson in Arkansas, which existed only for a few years during the Civil War. (Beadle similarly mentions a Fort Lancaster in Colorado, which existed only from 1837-1844.) For the most part none of these places can be found in Wikipedia.
However, this is less of an issue than it may appear. For one thing, the large majority of places mentioned in both BEADLE and WOTR still exist with the same names they had 150 years ago. This includes places that may have changed their nature, such as the former territories of Arizona, Utah, Colorado and Dakota, which have since transitioned into states but largely kept the previous names. Similarly, most ethnonyms, such as the Mormons, Navajos, Apaches and Utes, have remained the same. In many cases where names have changed (e.g. Davisville, California was renamed to Davis in 1907, and City Point, Virginia was annexed into Hopewell, Virginia in 1923), the old names are prominently mentioned in Wikipedia. Some Civil War forts, and most places associated with battles, likewise are either featured in their own Wikipedia articles or mentioned prominently in other articles, often the article describing the battle taking place at that location.
A different and perhaps more significant issue comes with terms that have distributions that differ significantly in Wikipedia vs. BEADLE and WOTR. One issue is with names that may have a most prominent sense in Wikipedia that is different from the usage in BEADLE or WOTR. Some examples:
1. Many forts that existed during the 19th century bear the same names as modern forts in different locations (e.g. the modern-day Fort Lyon in Colorado, Fort Lancaster in Texas, and Fort Jackson in South Carolina, compared with the above-mentioned forts of the same names in Missouri, Colorado and Arkansas, respectively).
2. The place name “Columbus” tends to refer in Wikipedia to Columbus, Ohio, but in WOTR to Columbus, Kentucky, which saw significant fighting, whereas Columbus, Ohio did not.
3. “Grant” in WOTR is likely to refer to General Ulysses S. Grant, whereas its distribution in Wikipedia is due not only to General Grant but to numerous other people and places with the same name. It is also affected by lowercase “grant”, due to case-folding in the algorithm I use; this is done due to many inconsistencies in case usage, such as all-lowercase text in Twitter, all-capital text in WOTR, and of course capitalized words at the beginning of a sentence. Even if all occurrences of “Grant” in Wikipedia that refer to General Grant could be separated out, there still remains the issue that at least half of Grant’s Civil War service, and hence mention in WOTR, was in the Western Theater (e.g. in Missouri and Tennessee), whereas the majority of the text in Wikipedia on Grant appears to cover his two terms as U.S. President, during which he was located in Washington, D.C.
4. “Sherman” in Wikipedia (and WOTR) concerns General William T. Sherman, whereas “Sherman” in BEADLE is primarily the name of a town in Wyoming.
5. Most mentions of “Washington” in Wikipedia actually refer to Washington State (the term “Washington” is linked 17,127 times in the November 4, 2013 Wikipedia to the article on Washington State, but only 3,581 times to the article on Washington, D.C.), whereas nearly all mentions of “Washington” in WOTR refer to Washington, D.C.
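The effect of case-folding noted in item 3 can be illustrated with a minimal sketch. The token list below is invented for illustration, and the use of Python's `str.casefold` is an assumption; the text does not specify how the folding is actually implemented.

```python
from collections import Counter


def fold_tokens(tokens):
    """Fold every token to a single case so that case variants share one count."""
    return [t.casefold() for t in tokens]


# Toy tokens, invented for illustration (not drawn from Wikipedia or WOTR).
tokens = ["Grant", "GRANT", "grant", "Sherman", "grant"]

raw_counts = Counter(tokens)                  # case variants counted separately
folded_counts = Counter(fold_tokens(tokens))  # the name and the common noun merge

print(raw_counts["Grant"], raw_counts["GRANT"], raw_counts["grant"])
print(folded_counts["grant"])
```

After folding, the proper name and the common noun are indistinguishable, which is the cost of making all-capital, all-lowercase and sentence-initial usages comparable.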
Note that all of the above examples concern the U.S. Things get even worse when the possibility of terms referring to places, people, etc. across the whole world is considered. In practice, however, this isn’t an issue: When using Wikipedia as a source I limit the regions considered to those within a bounding box surrounding the United States, due to the fact that nearly all locations in both BEADLE and WOTR are within the U.S.
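The bounding-box restriction can be sketched as a simple coordinate filter. The box limits, the `BoundingBox` helper and the approximate article coordinates below are all assumptions made for illustration; the text does not give the actual bounds used.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """Hypothetical helper: a latitude/longitude rectangle in decimal degrees."""
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# Rough, assumed limits around the contiguous United States.
US_BOX = BoundingBox(min_lat=24.0, max_lat=50.0, min_lon=-126.0, max_lon=-66.0)

# (title, latitude, longitude); coordinates are approximate and illustrative.
articles = [
    ("Columbus, Ohio", 39.96, -83.00),
    ("Columbus, Kentucky", 36.76, -89.10),
    ("Washington, Tyne and Wear", 54.90, -1.52),  # a Washington in England
]

# Keep only the articles whose coordinates fall inside the box.
kept = [a for a in articles if US_BOX.contains(a[1], a[2])]

for title, lat, lon in kept:
    print(title)
```

A filter of this kind removes same-named places elsewhere in the world, but it does nothing about ambiguity within the U.S., such as the two places named Columbus.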