Acquisition, Preprocessing, Selection, and Partitioning of Observation Data for Evaluation

3 Methods

3.3 Evaluation Methods

3.3.3 Acquisition, Preprocessing, Selection, and Partitioning of Observation Data for Evaluation

ArtenFinder data used for evaluation

ArtenFinder provides several ways to retrieve observation data and other information. Users can download their own observations as a CSV file. There is also a REST API. Depending on the user’s role, it allows for downloading the public data of all users, but also for changing or deleting data in the database. The API was used to download all public ArtenFinder observations up to 2016, with their validation status as of February 24th, 2017 (the latest download of the data). The data were placed in a local spatial database for further processing and analysis. ArtenFinder receives a few observations from regions adjacent to the federal state of Rheinland-Pfalz; these were discarded. In cooperation with naturgucker.de, another German citizen science initiative collecting observations of organisms, ArtenFinder regularly imports naturgucker observations from Rheinland-Pfalz. These were also removed from the dataset used for analysis because they have, at least in part, different properties, such as rasterized observation locations (coordinates representing map quadrat center points rather than the original observation locations, to protect occurrences of sensitive species), or locations referring to the center point of an arbitrary area rather than to an exact observation location.

Accepted observations up to 2015 (216,316 observations) were used to generate observed communities and OSM environments. Accepted observations from the year 2016 (68,646 observations) were used to produce a set of candidate observations expected to contain predominantly plausible observations (set AF_A, see Table 3.3.1). Partitioning the data into older observations for extracting observed communities or OSM environments, and newer observations used as candidates, reflects the fact that assessing the plausibility of a recent candidate observation with the observed communities or OSM environments approach is always and necessarily based on older approved observations. This way of partitioning the data is therefore more appropriate here than, for instance, selecting a random sample of accepted observations from the whole time period to be used as candidate observations. The evaluation should show whether the approaches to plausibility estimation are suitable for estimating the plausibility of new observations based on older, pre-existing observations, which can be achieved in this way. Moreover, using a whole year of observations as candidates reduces seasonal bias in the evaluation results: if only data from part of a year were used, the results would be biased towards species observed in that season.
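The temporal partitioning described above can be illustrated with a minimal sketch. The record layout (keys `id`, `observed_on`, `status`) and the function name are assumptions made for illustration only; they are not the ArtenFinder data model or the code used in this work.

```python
from datetime import date

def partition_by_year(observations, candidate_year=2016):
    """Split accepted observations into a reference set (earlier years)
    and a candidate set (the candidate year). Field names are hypothetical."""
    reference, candidates = [], []
    for obs in observations:
        if obs["status"] != "accepted":
            continue  # only accepted observations enter either set
        year = obs["observed_on"].year
        if year < candidate_year:
            reference.append(obs)   # basis for observed communities / OSM environments
        elif year == candidate_year:
            candidates.append(obs)  # candidates assumed to be predominantly plausible
    return reference, candidates

sample = [
    {"id": 1, "observed_on": date(2014, 5, 3), "status": "accepted"},
    {"id": 2, "observed_on": date(2015, 12, 31), "status": "accepted"},
    {"id": 3, "observed_on": date(2016, 1, 1), "status": "accepted"},
    {"id": 4, "observed_on": date(2016, 7, 9), "status": "rejected"},
]
reference, candidates = partition_by_year(sample)
```

Splitting on the observation year rather than sampling at random keeps every candidate strictly newer than the observations it is compared against.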

The properties of the two portions of approved observations (up to 2015, and from 2016) do not differ in any important way. This suggests that the observation process did not change critically between these two time periods. In both sets of observations, birds, butterflies, dragonflies and plants make up most of the observations, with the same ranking of species groups. In 2016, birds account for a somewhat higher share of observations than in the data up to 2015, while the share of butterflies (and of some groups with smaller portions in the data) was slightly lower, see Table 3.3.2. More information about the composition of the dataset concerning species groups and its development over time can be found in section 2.1.1.

Table 3.3.2: Portions of species groups in sets of accepted ArtenFinder observations. Comparison of values for accepted observations up to 2015 and accepted observations from 2016.

Species group                 Portion (%) up to 2015    Portion (%) 2016
plants                        6.7                       5.5
fungi                         1.9                       3.4
mammals                       2.4                       1.9
birds                         41.8                      48.9
reptiles                      1.7                       1.1
amphibians                    1.6                       1.0
modern bony fishes            0.1                       0.0
butterflies and moths         28.9                      24.2
hymenopterans                 0.7                       0.8
beetles                       0.8                       0.6
dragonflies and damselflies   9.2                       10.3
mantids                       0.1                       0.1
locusts                       3.4                       1.5
mollusks                      0.3                       0.2
true bugs                     0.1                       0.1
spiders                       0.0                       0.1
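Portions like those in Table 3.3.2 are simple relative frequencies. The following sketch shows the calculation for a list of species group labels; the function name and the toy input are illustrative assumptions and do not reproduce the actual dataset.

```python
from collections import Counter

def group_portions(groups):
    """Return the percentage share of each species group, rounded to one decimal."""
    counts = Counter(groups)
    total = sum(counts.values())
    return {group: round(100.0 * n / total, 1) for group, n in counts.items()}

# Toy input: one species group label per observation.
example = ["birds"] * 5 + ["plants"] * 3 + ["fungi"] * 2
portions = group_portions(example)
```

Because each value is rounded separately, the portions of a real dataset need not add up to exactly 100%.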

The ArtenFinder API was also used to retrieve quality assurance protocol data. These data record status changes for all observations processed in the project’s quality assurance process, which makes it possible to find the IDs (identification numbers) of rejected observations. This is rarely possible in projects of this kind, because rejected observations are usually quickly corrected or deleted, or referred back into the private data spaces of the observers, where they cannot easily be accessed. In the case of ArtenFinder, the IDs of rejected observations could be obtained from the quality assurance protocol data (kindly made available to the author by the project lead). The protocol data do not, however, contain the positions of these observations or their species identifications. It was therefore necessary to collect, at regular intervals, observations not yet validated by experts (available via the project’s public API), which contain coordinates and species identifications, and later to extract from this collection the observations that were eventually rejected, using the observation IDs from the quality assurance data. It was also possible to filter out observations that were initially rejected but later accepted, e.g., because the observer provided more information. In this way, observations permanently rejected by the experts in the validation process could be retrieved to form set AF_R. They provide a valuable basis for analysis, especially for evaluating the plausibility estimation approaches laid out here, because they allow similarity values of real observations accepted as correct to be compared with those of real observations rejected as incorrect (or rejected for other reasons, see section 2.1.1). Together, the two sets of candidate observations allow for analyzing whether approved and rejected observations differ in their plausibility estimations.
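The matching step described above can be sketched as follows. This is a hypothetical illustration, not the procedure actually implemented: it assumes the quality assurance protocol is available as (observation ID, ISO timestamp, new status) entries and the periodically collected unvalidated observations as a dictionary keyed by observation ID.

```python
def permanently_rejected(qa_log, snapshots):
    """Return collected records whose latest QA status is 'rejected'.

    qa_log:    iterable of (obs_id, iso_timestamp, status) entries (assumed layout)
    snapshots: dict mapping obs_id to a record with coordinates and species
    """
    latest = {}
    # Process entries in chronological order so that later status
    # changes overwrite earlier ones for the same observation.
    for obs_id, _timestamp, status in sorted(qa_log, key=lambda e: e[1]):
        latest[obs_id] = status
    return [
        snapshots[obs_id]
        for obs_id, status in latest.items()
        if status == "rejected" and obs_id in snapshots
    ]

qa_log = [
    (10, "2016-03-01", "rejected"),
    (10, "2016-03-09", "accepted"),  # rejected first, accepted later: excluded
    (11, "2016-04-02", "rejected"),
    (12, "2016-05-01", "rejected"),  # never captured in a snapshot: no coordinates
]
snapshots = {
    10: {"id": 10, "species": "Aglais io", "lon": 7.1, "lat": 49.9},
    11: {"id": 11, "species": "Anax imperator", "lon": 8.2, "lat": 49.4},
}
rejected = permanently_rejected(qa_log, snapshots)
```

Observation 12 illustrates why coordinates could be retrieved for only part of the rejected observations: an ID that never appeared in a collected snapshot cannot be matched to a position.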

Extraction of rejected observations used all available rejected observations, including those from 2016 as well as earlier ones. Therefore, there is no clear partition between recent and older observations here. However, this was necessary to arrive at a useful number of valid observation cases: 6,845 rejected observations were available, and coordinates could be retrieved for 2,733 of them. As rejected observations are not used at all for extracting observed communities, this caused no conflict concerning data partitioning. The composition of species groups in this set is markedly different from that of, e.g., approved observations: butterflies lead at 25%, followed by observations of “other species” (24%), dragonflies (18%), birds (6%), plants (6%), mushrooms (5%), and locusts (5%). This may be caused by a higher rate of species which are hard to identify in the insect species groups, causing a higher rate of rejections in observations of these groups when compared, for instance, to birds.

All sets of candidate observations presented here show a similar overall spatial distribution, reflecting the same northwest to southeast trend of increasing observation density. This is demonstrated by comparing quadrat count maps for these sets of observations, see Figure 3.3.2. The sectors of maximum concentration of observations are, however, slightly different.

a) All accepted observations (n = 284,962)

b) Accepted observations up to 2015 (n = 216,316)

c) Accepted candidate observations (2016) (n = 68,646)

d) Rejected candidate observations with coordinates (n = 2,733)

Figure 3.3.2: Spatial distribution of observations in sets of ArtenFinder data. (No. of points in 10x10 km raster. Classified by Natural Breaks. Source of Rheinland-Pfalz state line: LANIS Rheinland-Pfalz.)
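A quadrat count of the kind mapped in Figure 3.3.2 assigns every point to a square grid cell and counts the points per cell. The sketch below assumes point coordinates in a projected coordinate system with metre units; the function name and sample coordinates are illustrative, and the Natural Breaks classification used for the maps is not included.

```python
from collections import Counter

def quadrat_counts(points, cell_size_m=10_000):
    """Count points per square grid cell.

    points: iterable of (x, y) coordinates in metres (projected CRS assumed)
    Returns a Counter keyed by (column, row) cell index.
    """
    counts = Counter()
    for x, y in points:
        cell = (int(x // cell_size_m), int(y // cell_size_m))
        counts[cell] += 1
    return counts

# Hypothetical projected coordinates in metres.
points = [
    (401_200.0, 5_512_900.0),
    (408_900.0, 5_519_999.0),
    (410_000.0, 5_512_000.0),
]
counts = quadrat_counts(points)
```

With the default cell size of 10 km, the first two points fall into the same cell and the third into the neighbouring cell to the east.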

iNaturalist data used for evaluation

iNaturalist allows its data to be downloaded as a CSV file. This was used to retrieve all iNaturalist observations for the state of California on March 3rd, 2017. California was chosen as the area of interest for this data use case because the iNaturalist data record is strongest there (see also section 2.1.2). Observation numbers are comparable to those of ArtenFinder, albeit spread over a much larger area. Again, the data were placed in a local spatial database for further processing and analysis. iNaturalist obscures the coordinates of certain observations, mostly to protect rare or sensitive species. Observers can also choose to obscure the coordinates of their observations for privacy reasons. Observations with obscured coordinates were removed from the dataset used here, because their coordinates do not represent the true location of observation.
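Removing observations with obscured coordinates from a CSV export can be sketched as below. The column name `coordinates_obscured` and its `true`/`false` values are assumptions about the export layout and should be checked against the actual file; the sample rows are invented.

```python
import csv
import io

def usable_rows(csv_text, flag_column="coordinates_obscured"):
    """Keep only rows whose coordinates are not flagged as obscured.

    flag_column is an assumed column name, not confirmed by the source text.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if row[flag_column].strip().lower() != "true"
    ]

sample_csv = """id,latitude,longitude,coordinates_obscured
1,34.05,-118.24,false
2,36.77,-119.42,true
3,37.77,-122.42,false
"""
rows = usable_rows(sample_csv)
```

In practice the same filter could equally be expressed as a query in the spatial database after import.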

Data partitioning for iNaturalist followed the same principles as for ArtenFinder. Research grade observations from California up to 2015 (242,833 observations) were used to generate observed communities or OSM environments. Research grade observations from the year 2016 (167,723 observations) were used as a set of plausible candidate observations drawn from approved observations. Rejected observations cannot be identified in the iNaturalist dataset, because iNaturalist employs a different quality assurance strategy.

Table 3.3.3: Portions of species groups in sets of research grade iNaturalist observations. Comparison of values for research grade observations up to 2015 and research grade observations from 2016.

Species group                 Portion (%) up to 2015    Portion (%) 2016
plants                        33.4                      35.1
fungi                         2.3                       4.7
mammals                       4.2                       3.9
birds                         32.0                      26.7
reptiles                      5.2                       5.1
amphibians                    1.6                       1.8
modern bony fishes            0.4                       0.3
butterflies and moths         6.7                       6.1
hymenopterans                 1.1                       1.5
beetles                       1.2                       1.8
dragonflies and damselflies   1.7                       1.4
earwigs                       0.1                       0.1
mantids                       0.1                       0.1
cockroaches                   0.0                       0.1
locusts                       0.3                       0.5
crustaceans                   0.8                       1.3
mollusks                      5.1                       6.0
other species                 2.1                       2.3
true bugs                     0.6                       0.9
flies                         0.2                       0.3
spiders                       0.8                       0.0

Although yearly observation numbers are strongly increasing in iNaturalist (see section 2.1.2), the properties of the two sets of research grade observations (data up to 2015, and from 2016) are not critically different, which again suggests that the observation process did not change critically over time (see Table 3.3.3). In 2016, plants account for a slightly larger share of observations than in the data up to 2015, while the share of birds was somewhat lower. Figure 3.3.3 demonstrates that these two sets of data also show a similar spatial structure, with similar regions of higher and lower point density.

a) All research grade observations (n = 410,556)

b) Research grade observations up to 2015 (n = 242,833)

c) Research grade observations in 2016 (n = 167,723)

Figure 3.3.3: Spatial distribution of observations in sets of iNaturalist data. (No. of points in 20x20 km raster. Classified by Natural Breaks. Source of state line: U.S. Geological Survey 2016.)

In document Data Quality of Citizen Science Observations of Organisms: Plausibility Estimation Based on Volunteered Geographic Information Context (Page 86-90)