Methods for data management and analysis - TECHNICAL REPORT. Guidelines for the surveillance of invasive mosquitoes in Europe

The agencies funding IMS surveillance may consider themselves the sole owners of the collected data and may therefore be reluctant to share them. However, it is important that IMS surveillance data, including pathogen detection data, are made freely available to the public health authorities. Data exchange between competent authorities should be promoted, for example between local and regional/provincial authorities, and between national and international authorities. This is even more crucial if important changes are detected (e.g. introduction of a new IMS, pathogen transmission, or MBD outbreak). The comparison and interpretation of shared data is greatly simplified if data are collected and stored through harmonised procedures.

Basic data management features for storage and analysis

Data collection involves a number of steps, including (a) trapping, (b) sample handling, (c) labelling, (d) transportation, (e) diagnostics, (f) populating the database, and (g) rearranging and preparing the acquired data for analysis. All steps are prone to errors: inaccurate recording of sample locations (which can be minimised by using GPS coordinates instead of names, descriptions, or addresses), lost samples, samples handled or stored so poorly that further processing and analysis are impossible, labelling errors, unlabelled samples, diagnostic errors, sample mix-ups (‘Did this mosquito come from this vial or that vial?’), data-entry errors, and row/column mix-ups when working on a database spreadsheet. Each of these mistakes will affect the final analysis. Well-thought-through methods and good data management, combined with a heightened awareness of possible mistakes during all steps, can help prevent most of these errors.

Data analysis is the end point in a series of steps: (1) defining the type of outputs needed (e.g. presence/absence maps of an IMS in a defined geographical area and in a defined time period); (2) defining the type of data that is required to produce the desired output (in this example, geo-referenced presence/absence data of an IMS in a defined geographical area and in a defined time period; in most cases of IMS surveillance, this represents the basic required data); (3) defining the type of surveillance necessary to obtain the required data (see scenarios in Chapter 1.3); and (4) collecting data, which represents the start of data management. Table 8 suggests minimal sets of data for key procedures as well as for some optional procedures. It is suggested that Member States use these standardised data sets to facilitate the collation of data at the EU level.

Box 10: Field sample labelling

Small-scale surveillance studies with relatively few samples can probably manage with hand-written labels, but if large numbers of samples are expected, labelling needs to be more rigorous. Each label must record a range of information, such as collection date, location (as precise as possible), and type of sampling. One option is to record only the basic sample data on the label initially and to add the identification results later. This approach has the disadvantage that only some information will fit, as space on the label is limited. Consequently, an initial decision has to be made on what to record on the label.

A preferred method is to use unique computer-generated label numbers. When a sample is collected in the field (e.g. a strip with/without mosquito eggs from an oviposition trap), the strip is placed in a sealed bag (only one sample per bag), which is then immediately labelled with a unique number on a sticker. Basic sample data (type of surveillance, date, location) are either (a) filled in on a paper form carrying a unique number sticker, to prevent incorrect numbering, or (b) entered electronically via a smartphone app connected to a data management system like Modirisk (http://www.modirisk.be) or VecMap (http://iap.esa.int/vecmap projects). In the laboratory, diagnostics and other useful information derived from the sample should also be entered into the data management system, thus ensuring that each mosquito data set is tied to a unique number and can be traced back.
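The unique label numbers described in Box 10 could be generated and tracked along the following lines. This is only a minimal sketch: the `IMS-000001` number format, the function names, and the in-memory registry are assumptions made for illustration and do not describe how Modirisk or VecMap work.

```python
from itertools import count


def label_numbers(prefix="IMS", start=1, width=6):
    """Yield label numbers such as IMS-000001, IMS-000002, ...

    The prefix and zero-padded format are hypothetical choices.
    """
    for n in count(start):
        yield f"{prefix}-{n:0{width}d}"


def register_sample(registry, label, surveillance_type, date, latitude, longitude):
    """Store the basic field data under a label; refuse to reuse a label."""
    if label in registry:
        raise ValueError(f"label {label} is already in use")
    registry[label] = {
        "surveillance_type": surveillance_type,
        "date": date,                # DD/MM/YYYY, as in Table 8a
        "latitude": latitude,        # decimal degrees
        "longitude": longitude,      # decimal degrees
        "identifications": [],       # filled in later by the laboratory
    }
    return label


def add_identification(registry, label, species, method):
    """Attach a laboratory result to an existing label.

    Raises KeyError if the label was never registered in the field.
    """
    registry[label]["identifications"].append(
        {"species": species, "method": method}
    )
```

Because every laboratory result is attached to a label that must already exist, each identification can be traced back to one field sample, and a label that is accidentally used twice is rejected at registration.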

Table 8: Suggested basic set of variables to be included in databases of surveillance of IMS

8a. Data set for key procedures

| Data label | Format |
|---|---|
| Level 1 – per sampling | |
| Type of surveillance | Name |
| Type of sampling or trap | Method or trap model ID |
| Date | DD/MM/YYYY |
| Country | Name |
| NUTS | Code |
| Geo-referenced latitude¹ | DD.NNNNN |
| Geo-referenced longitude¹ | DD.NNNNN |
| Altitude | N |
| Data entering | Name (person) |
| Level 2 – per mosquito species | |
| Mosquito species | Species name or ID |
| Presence/absence | 1/0 |
| Female | N |
| Male | N |
| Pupa | N |
| Larva | N |
| Egg | N |
| Identifying person | Name (person) |
| Method of identification² | Name (from a list) |
| Validator | Name (person) |

¹ Latitude and longitude of the sampling point, UTM WGS 84 system, decimal degrees units
² Morphology/molecular (gene)/MALDI-TOF
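To illustrate how the Level 2 variables of Table 8a lead to the presence/absence output mentioned earlier, the following sketch collapses individual records into one presence/absence value per NUTS code, year, and species. The class and function names are hypothetical, and only a subset of the Table 8a variables is modelled.

```python
from dataclasses import dataclass
from datetime import date, datetime


def parse_date(text):
    """Parse the DD/MM/YYYY date format suggested in Table 8a."""
    return datetime.strptime(text, "%d/%m/%Y").date()


@dataclass(frozen=True)
class SpeciesRecord:
    """One 'per mosquito species' row, reduced to the fields used here."""
    sample_date: date
    nuts_code: str
    species: str
    present: bool


def presence_by_area_and_year(records):
    """Return {(nuts_code, year, species): True/False}.

    A species counts as present in an area and year if at least one
    record for that combination reports presence.
    """
    summary = {}
    for r in records:
        key = (r.nuts_code, r.sample_date.year, r.species)
        summary[key] = summary.get(key, False) or r.present
    return summary
```

A dictionary of this shape can be joined to a map layer keyed by NUTS code to draw the presence/absence map; absence here only means that no sampled record reported the species, not that the species is proven absent.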

8b. Data set for optional procedures (additional to previous data)

| Data label | Format |
|---|---|
| Level 1 – per sampling | |
| Temperature | N |
| Relative humidity | N |
| No. of mosquitoes tested | N |
| Data entering | Name (person) |
| Level 2 – per mosquito species | |
| Physiological status | Blood fed/unfed/gravid/nulliparous |
| No. of mosquitoes tested | N |
| No. of pools tested | N |
| Pathogen name, positive pools | N |

N = numeric field

Database

Most researchers are comfortable with spreadsheet software such as Microsoft Excel. Such software is useful for extracting data from a large database, but for more complex analysis and mapping, it is recommended that data are stored in a dedicated database management system that allows networking. Examples of frequently used relational database management systems (RDBMS) are Microsoft Access, SQL Server, DB2, and Oracle Database. An RDBMS helps to ensure data integrity; for example, a properly defined rule can ensure that data are entered in a valid format. Additionally, these systems can handle very large volumes of data. They also allow users to be allocated different levels of access rights to use, import, or view the data by assigning them logins and passwords (e.g. the analyst may or may not be allowed to modify surveillance data). An example of a database application is given in Annex 8, Figure A.
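As an illustration of a 'properly defined rule', the sketch below declares validity rules as constraints in the table definition, so that the database itself rejects malformed rows. It uses SQLite through Python's standard `sqlite3` module purely for convenience; SQLite is not one of the systems named above, and the table and column names are assumptions loosely based on Table 8a.

```python
import sqlite3

# Hypothetical table; the CHECK clauses are the "rules" on valid formats.
SCHEMA = """
CREATE TABLE sampling (
    sampling_id       INTEGER PRIMARY KEY,
    surveillance_type TEXT NOT NULL,
    latitude          REAL NOT NULL CHECK (latitude BETWEEN -90 AND 90),
    longitude         REAL NOT NULL CHECK (longitude BETWEEN -180 AND 180),
    present           INTEGER NOT NULL CHECK (present IN (0, 1))
)
"""


def open_database(path=":memory:"):
    """Open a database (in memory by default) and create the table."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def try_insert(conn, surveillance_type, latitude, longitude, present):
    """Insert one row; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO sampling "
                "(surveillance_type, latitude, longitude, present) "
                "VALUES (?, ?, ?, ?)",
                (surveillance_type, latitude, longitude, present),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this design a latitude of 95 or a presence value of 2 never reaches the stored data, whichever person or application performs the data entry.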

For GIS analysis or modelling, data can be imported into statistical software packages or dedicated geographic information systems software such as ArcGIS, GRASS, or QGIS. Currently, a new system named VecMap is under development, integrating the entire process of producing vector risk maps into a single package by combining various functions such as data collection, risk map production, data storage, data analysis, and statistical distribution modelling based on weather data and satellite imagery. The system is funded by the European Space Agency and currently in its demonstration phase. It is expected to be available in 2013 (http://www.esa.int/esaCP/SEMIZ5MSNNG_index_0.html).
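One possible way of handing geo-referenced records to GIS software is to write them as GeoJSON point features, a format that QGIS, among others, can open. The sketch below assumes each record is a dictionary with hypothetical keys `latitude`, `longitude`, `species`, `present`, and `date`; it is not the import route of any specific system mentioned in this report.

```python
import json


def to_geojson(samples):
    """Convert sample dictionaries into a GeoJSON FeatureCollection."""
    features = []
    for s in samples:
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON orders coordinates as [longitude, latitude].
                "coordinates": [s["longitude"], s["latitude"]],
            },
            "properties": {
                "species": s["species"],
                "present": s["present"],
                "date": s["date"],
            },
        })
    return {"type": "FeatureCollection", "features": features}


def write_geojson(samples, path):
    """Write the FeatureCollection to a file that GIS software can load."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(to_geojson(samples), handle, indent=2)
```

The coordinate order deserves attention: GeoJSON expects longitude first, which is the reverse of the latitude/longitude order in which field coordinates are usually written down.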

In addition, most groups that carry out IMS surveillance run more than one type of surveillance programme at the same time. It is advisable for each group to standardise data management and processing as much as possible, entering and storing all surveillance data in a single well-designed, customised, centralised database with online access, managed by a group of experts with database administration rights. The advantages of a centralised database are that (1) there is no need to extract data from different databases in order to compare data from separate studies; (2) data are less likely to get lost; (3) there is a consistent user interface for database access; and, perhaps most importantly, (4) data integrity is ensured (e.g. avoiding ‘wrong doubles’).
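If ‘wrong doubles’ is read as unintended duplicate records (an interpretation, not a definition given in this report), a centralised database can refuse them with a uniqueness rule. The sketch below again uses SQLite via Python's `sqlite3` module for convenience; the table, the column names, and the choice of label number plus species as the unique key are assumptions.

```python
import sqlite3


def open_central_db(path=":memory:"):
    """Open the shared database and create the results table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS species_result (
            label_id TEXT NOT NULL,
            species  TEXT NOT NULL,
            present  INTEGER NOT NULL CHECK (present IN (0, 1)),
            UNIQUE (label_id, species)  -- one result per sample and species
        )
        """
    )
    return conn


def load_results(conn, rows):
    """Insert (label_id, species, present) rows from any study.

    Returns the rows the database rejected, i.e. duplicates of an
    existing (label_id, species) pair or rows breaking another rule.
    """
    rejected = []
    for row in rows:
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO species_result VALUES (?, ?, ?)", row
                )
        except sqlite3.IntegrityError:
            rejected.append(row)
    return rejected
```

Returning the rejected rows, instead of silently dropping them, lets the database administrators check whether a duplicate is a harmless re-upload or a sign of a labelling mix-up.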

Complementary information on data management and analysis is given in Annex 8.

In document TECHNICAL REPORT. Guidelines for the surveillance of invasive mosquitoes in Europe. (Page 33-35)