• No results found

2 Properties of Casual CitizenScience Observation Data and OSM: The VG

2.1 Project and Data Properties

2.1.3 OpenStreetMap

The OSM project, its history, goals, and general properties have been described in a large number of publications (e.g., Neis & Zipf 2012, Budhathoki & Haythornthwaite 2013, Neis & Zielstra 2014). Only the most important facts will therefore be reiterated here, but more will emerge in the discussion following this overview (section 2.2).

Founded by Steve Coast, the OSM project started out in 2004 at University College London. The pro- ject was soon organized in a foundation and gained worldwide scope. Creating a new object in OSM basically requires the same steps as in iNaturalist or ArtenFinder, namely creating a geometry object and adding attribute information to it in the form of tags. Tags are pairs of a key and a value, e.g. “highway=residential”. At first, contributions were mainly based on GPS tracks collected by volun- teers in the field, but modes of data production changed when several Internet companies and public authorities provided satellite and aerial imagery for tracing objects for OSM (Neis & Zielstra 2014). Besides contributions by individual volunteers, OSM data are sometimes also created by bulk imports of geographic data from open data sources. For the data used here, the most important of these imports occurred in 2008, adding to OSM parts of the U.S. census bureau’s TIGER (Topologically Integrated Geographic Encoding and Referencing) data23, such as roads, railroads, and administrative boundaries, but also statistical geographic areas (Zielstra et al. 2013).

OSM’s quality assurance strategy is a classic example of what Goodchild & Li (2012) call the “crowd- sourcing approach”, which means that quality control is placed in the hands of the community of vol- unteers, and based on the principle that errors cannot persist if many volunteers work on the data and

23

eventually correct errors other volunteers might have made. OSM’s quality assurance strategy is there- fore closely related to iNaturalist’s, where the community of observers reviews, comments, confirms, or contradicts the information given in an observation record (especially species identification), albeit with important differences concerning the nature of information, which will be discussed in section 2.2. It is, however, in strong contrast to ArtenFinder’s quality assurance strategy, which employs privi- leged experts who validate observations, while normal users are not able to comment directly on other observers’ reports.

Thematic data properties

Thematic information in OSM is represented by tags in the form of key-value pairs, see above. The- matic information in OSM is different and also more varied in nature than observations of organisms. This makes describing relevant thematic properties of OSM data more complex, compared to describ- ing properties of biodiversity observation data. For instance, an observation of a species has just one species identification, belonging to just one species group (e.g., a bird or a butterfly). It is therefore straightforward to aggregate data by calculating portions of observations belonging to a certain species group. An OSM element, on the other hand, may carry several tags at the same time. Analyzing the thematic properties of the OSM data used in this work requires some form of aggregation (analog to aggregation by species group), which was done by key. The data structure explained here must be kept in mind when interpreting statistics describing these data. Also, not all available OSM tags were used in this work, but only tags holding information which was judged relevant for the intended use of OSM information as a geographic context source. Details are explained in the Methods chapter, sec- tion 3.3.4, and tags selected as relevant listed in the appendix, section 7.2. The following analyses refer only to the data carrying these selected tags.

The Heidelberg Center for Geoinformation Technology (HeiGIT24) provides a service called the Ohsome API25. One of its functionalities is counting, in a set of OSM elements, the elements which carry a certain tag. This tool was used to examine the thematic properties of OSM data in the two areas of interest concerning the relevant tags used in this work. Due to the fact that an OSM element may carry several relevant tags at the same time, the basic population of such a count is not the number of elements, but the number of occurrences of tags (where some elements are counted several times). This count is able to convey some information on which tags can be encountered more often in the data than others. This is what Figure 2.1.19 tries to do. It shows a dominance of “building=*” and “highway=*” tags in all relevant tag occurrences in both Rheinland-Pfalz and California OSM data. Rheinland-Pfalz data also have notable portions of “surface=*” (mostly of some kind of highway, street or path), “landuse=*”, “natural=*”, “amenity=*”, “barrier=*”, and “waterway=*” tags. In Cali- fornia OSM data, “waterway=*” tags are more prominent than in Rheinland-Pfalz, and a relatively high portion of “intermittent=*” tags (all of them “intermittent=yes”) reveal seasonally dry conditions in large parts of California. “landuse=*”, “natural=*”, and “surface=*” tags make up relatively small portions of all tags here.

24

https://heigit.org/

25

a) Rheinland-Pfalz b) California

Figure 2.1.19: Portions of the most frequent keys in OSM extracts from Rheinland-Pfalz and Califor- nia. State of OSM: 2017-07-01. n(a, occurrences of tags) = 2,380,703. n(b, occurrences of tags) = 9,071,835. (Analysis includes only tags used in this work.)

Figure 2.1.20 shows the development of yearly numbers of tags added to OSM (aggregated by keys as well). While most tags were added at steady and rather low rates to the Rheinland-Pfalz OSM, “high- way=*” tags declined steadily from relatively high rates. “building=*” tags experienced a strong in- cline peaking in 2015 with a much higher rate than the other keys, and have declined since. The Cali- fornia extract of OSM data shows a more recent increase in “building=*” tags. “waterway=*” tags were mostly added before 2012-07-01, followed by “intermittent=*” tags. “highway=*” tags experi- enced a raise in recent years. Another difference to Rheinland-Pfalz data is that negative numbers oc- cur, that is, a net removal of “intermittent=*” and “landuse=*” tags from the data in some years.

a) Rheinland-Pfalz b) California

Figure 2.1.20: Yearly net numbers of tags from the most frequent keys added to OSM elements in Rheinland-Pfalz and California. (Analysis includes only selected tags used in this work).

However, caution has to be used with these analyses. Difficulties are introduced by the complex geo- metrical nature of OSM data, which consist of point elements (“nodes”), line elements (“ways”) and polygon elements (represented by some closed “way” objects or some “relations”). While “node” ele- ments have properties which are comparable to point objects, ways and relations may or may not be segmented into a larger or smaller number of geometric objects. For instance, a highway may be rep-

resented in relatively long segments of several kilometers of length, or be split into many relatively short segments of just a few tens of meters. The latter case boosts tag counts in the above statistic. Also, buildings are relatively small and many, and consequently represented by many small, individual geometric elements, while land use units are usually relatively large and often also represented by large geometric elements in the map, boosting tag counts for the “buildings” key relatively to that of the “landuse” key. It makes sense, therefore, to also look at other ways of examining thematic data properties, especially by using area or length instead of counts, but always keeping in mind that a sin- gle element may carry several relevant tags and sums of areas or lengths incorporate overlaps, themat- ically as well as geometrically. This view on the data was implemented by looking only at elements of type way and relation and by looking at the most important keys representing mostly polygon ele- ments. Figure 2.1.21 presents results. They may contain thematic overlaps between elements: a poly- gon my carry tags from several of the relevant keys and may therefore be contained in several of the area sums per key. Also, a polygon may carry several tags of the same key and may therefore be con- tained several times in the area sum for a key. The same considerations apply to linear objects which were examined by using lengths of way type elements of the most important keys representing mostly linear elements. Results ae presented in Figure 2.1.22.

Results of area calculations for Rheinland-Pfalz show that “landuse=*” tags, not very prominent in element counts, are by far the most important in terms of coverage of area, while tags which were prominent in element counts, especially “building=*”, are less important. Results with California OSM data are similar: “landuse=*”, “leisure=*” or “natural=*” tags cover large areas, while “build- ing=*” tags (and others) are insignificant. In element length results for Rheinland-Pfalz and for Cali- fornia, keys are prominent which also reached high tag count results, such as “highway=*” or “water- way=*”.

a) Rheinland-Pfalz b) California

Figure 2.1.21: Area of OSM elements, aggregated by keys, in Rheinland-Pfalz and California. (Sums of areas of way and relation elements, no correction for thematic and geometric overlaps). State of OSM: 2017-07-01. (Analysis includes only selected tags used in this work.)

a) Rheinland-Pfalz b) California

Figure 2.1.22: Length of OSM elements, aggregated by keys, in Rheinland-Pfalz and California. (Sums of length of way elements, no correction for thematic and geometric overlaps). State of OSM: 2017-07-01. (Analysis includes only selected tags used in this work.)

Numbers of active mappers were more or less constant in Rheinland-Pfalz in the years the OSM data used here were produced, with a rise in mapper numbers in earlier years and a slight drop towards later years (see Figure 2.1.23). In California, numbers of mappers have been rising constantly and have recently reached numbers comparable to Rheinland-Pfalz, but in a far larger area with a much higher population. In absolute numbers, participation in OSM is far superior to numbers of volunteers in Ar- tenFinder and iNaturalist. The fact that in OSM, just as in all VGI projects, a small portion of volun- teers is responsible for most of the contributions, has been found and documented by many research works on OSM data (e.g., Neis & Zipf 2012). It can certainly be expected for the OSM data used in this work as well and was not examined here in detail.

a) Rheinland-Pfalz b) California

Figure 2.1.23: Numbers of active OSM mappers per year in Rheinland-Pfalz and California. (For technical reasons, only node and way objects were evaluated here.)

Spatial data properties

Spatial properties of OSM data cannot be described in the same way as spatial properties of casual citizen science observation data, due to reasons already explained, which are rooted in the OSM data’s complex data structure. It is, of course, possible to count numbers of occurrences of a tag in an area, but the sum of these counts over all tags will be larger than the number of objects involved, because any object may carry more than one tag (and many do). A density map of this kind may still convey some relevant information about spatial properties of OSM data and is represented in Figure 2.1.24 (parts a and c). Keep in mind, however, that counts of tags depend on the structure of the objects which carry them, as was already explained above. If a tag is attached to strongly segmented objects, this will raise its count. This effect can be avoided by looking at the number of distinct tags in an area, a parameter which represents spatial information density, rather than actual density of geometric ob- jects or tag occurrences. This is represented in Figure 2.1.24 as well (parts b and d).

Parts a and c of Figure 2.1.24 illustrate the well-known fact that OSM data tend to concentrate in ur- ban areas (e.g., Haklay 2010). In both areas of interest, counts of tag occurrences very distinctly high- light the most important urban centers, such as Mainz, Ludwigshafen, Kaiserslautern, Trier and Ko- blenz in Rheinland-Pfalz, and San Francisco and Los Angeles in California. Maps of spatial infor- mation density (number of distinct tags in a region) do the same, but high-value regions include areas adjoining urban centers, and smaller cities also stand out, such as Landau or Neustadt an der Wein- straße in Rheinland-Pfalz, and San Diego, the Indio/Palm Springs area, Oxnard/Ventura, Santa Barba- ra, San Louis Obispo/Grover Beach, and Sacramento in the Long Valley in California. Contrast is weaker here, because spatial information density is still higher in urban centers, but contrast to other areas is not so strong in this parameter. However, spatial properties are basically the same with both parameters.

In Rheinland-Pfalz OSM data, there is no pronounced spatial trend like it was found in observation data. OSM data therefore are possibly able to provide sufficient geographic context for plausibility estimation also in regions of Rheinland-Pfalz where observation data are scarce. California OSM data, however, show a pattern which is quite similar to iNaturalist observation data spatial distribution, with concentrations in the San Francisco Bay and Los Angeles areas in both tag count and spatial infor- mation density. The latter parameter does also show (smaller) areas with higher values outside these areas.

a) Rheinland-Pfalz, tag occurrence counts (n(tag occurrences) = 2,427,361)

b) Rheinland-Pfalz, spatial information density (distinct tags per region) (n(tags) = 534)

c) California, tag occurrence counts (n(tag occurrences) = 9,178,991)

d) California, spatial information density (distinct tags per region) (n(tags) = 550)

Figure 2.1.24: Spatial distribution of OSM tags in Rheinland-Pfalz and California. (Classified by Nat- ural Breaks. Source of Rheinland-Pfalz state line: LANIS Rheinland-Pfalz. Source of CA state line: U.S. Geological Survey 2016.)

2.2 Comparative Discussion of Projects and Data Use Cases: Com-