Data archiving and data policies

(1)

Benjamin Pfeil, Stephane Pesant, Michael Diepenbroek, Hannes Grobe

Bjerknes Centre for Climate Research/ University of Bergen, Norway

WWW.BJERKNES.UIB.NO

(2)

(3)

(4)



 

• Data often shows a snapshot of the environment at a particular time and place

• Sampling can be very expensive (on average over 100,000 € for one dataset in the bio- and geosciences, including costs for expeditions, laboratories, etc.)

• Data is therefore very valuable for future scientific work and has to be archived and made available

(5)

Why do we need data?

• Verification of research results
• Comparison of results
• Indication of trends
• Model input
• Remote sensing

(6)

Some facts about data in the scientific community

• Scientific instruments and computer simulations create large amounts of data

• Owing to new measurement techniques (and better precision), data volumes are doubling each year

• Scientific data has to be archived according to ”Good scientific practice in research and scholarship” (European Science Foundation 2000)

(7)

?

(8)

(9)

For a long time this was common practice:

(10)

In 1958 the World Data Center system was established

Mission Statement of the World Data Center System

• Data constitute the raw material of scientific understanding. The World Data Center system works to guarantee access to solar, geophysical and related environmental data. It serves the whole scientific community by assembling, scrutinizing, organizing and disseminating data and information

(11)

Network of ICSU WDCs

WDC Co-ordination Offices: Washington DC, USA; Beijing, China

• Nuclear Radiation: Tokyo, Japan
• Meteorology: Asheville NC, USA; Beijing, China; Obninsk, Russia
• Oceanography: Obninsk, Russia; Silver Spring MD, USA; Tianjin, China
• Paleoclimatology: Boulder CO, USA
• Marine Geology and Geophysics: Boulder CO, USA; Moscow, Russia
• Remotely Sensed Land Data: Sioux Falls SD, USA
• Renewable Resources and Environment: Beijing, China
• Recent Crustal Movements: Ondrejov, Czech Republic
• Airglow: Mitaka, Japan
• Astronomy: Beijing, China
• Atmospheric Trace Gases: Oak Ridge TN, USA
• Aurora: Tokyo, Japan
• Cosmic Rays: Toyokawa, Japan
• Geology: Beijing, China
• Human Interactions in the Environment: Palisades NY, USA
• Ionosphere: Tokyo, Japan
• Earth Tides: Brussels, Belgium
• Geomagnetism: Copenhagen, Denmark; Edinburgh, UK; Kyoto, Japan; Colaba, India
• Glaciology: Boulder CO, USA; Cambridge, UK; Lanzhou, China
• Marine Environmental Sciences: Bremen, Germany
• Rotation of the Earth: Obninsk, Russia; Washington DC, USA
• Satellite Information: Greenbelt MD, USA
• Rockets and Satellites: Obninsk, Russia
• Seismology: Denver CO, USA; Beijing, China
• Solar Radio Emission: Nagano, Japan
• Space Science: Beijing, China
• Space Science Satellites: Kanagawa, Japan
• Solar Activity: Meudon, France
• Soils: Wageningen, The Netherlands
• Sunspot Index: Brussels, Belgium
• Solar Terrestrial Physics: Boulder CO, USA; Didcot Oxon, UK; Moscow, Russia; Haymarket, Australia
• Solid Earth Geophysics: Beijing, China; Boulder CO, USA; Moscow, Russia

(12)

(Some) important WDCs for environmental data

• WDC for Atmospheric Trace Gases: Carbon Dioxide Information Analysis Center, USA
• WDC for Climate: Model and Data, Max-Planck-Institute for Meteorology, Germany
• WDC for Glaciology, Boulder: University of Colorado, USA
• WDC for Marine Environmental Sciences: Center for Marine Environmental Sciences (MARUM), Germany
• WDC for Marine Geology & Geophysics, Boulder: USA

(13)

But WDC is a status!

There are also many national and international data centres that are not WDCs, e.g.

• ICES – International Council for the Exploration of the Sea, Denmark
• BODC – British Oceanographic Data Centre, UK
• BADC – British Atmospheric Data Centre, UK
• NODC – National Oceanographic Data Center, USA

(14)

But there are different data archiving systems in use, e.g.

• Data servers (e.g. FTP servers)
• Different project websites
• Different long-term data archives (World Data Centers (WDC), National Data Centers (NDC)), as mentioned

(15)

Data servers such as an FTP server

+ very fast to archive data (data dump)
+ very cheap
+ easy to archive large data sets (model output, for example)
- Data is not structured (different file formats, units, etc.)
- Not easy to search for data
- Difficult to know about the existence of data
- Different versions of data
- Not a long-term archive – data can be lost!
- Maintenance
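
One way to at least notice lost files or diverging versions on a plain file server is to keep a checksum manifest next to the data. The following is a minimal sketch of that idea, not a description of any particular data center's procedure; the function names `sha256_of`, `build_manifest` and `compare` are made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map every file below root (as a relative path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Report files that were lost, changed or added between two manifests."""
    return {
        "lost": sorted(set(old) - set(new)),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
        "added": sorted(set(new) - set(old)),
    }
```

Comparing a manifest stored at upload time with one built later shows which files disappeared or were silently replaced. It does not solve the search and documentation problems listed above.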

(16)

Data archived at project websites

Normally a small database or FTP server is used

+ useful for members of the project, since relevant data is available at one site
+ easy way to inform about the project and its achievements
- websites only represent data coming from the project itself
- can take a long time to get all relevant data
- links will stop working after a while – data will be lost!
- no more funding – no more maintenance!

(17)

Data archived at a Data Center (WDC or NDC)

+ Data is archived long-term and available online
+ many WDCs are linked to each other, so data can be found at different websites (GCMD, WDC-CLUSTER, data portals, etc.)
+ often a relational database is used, which enables Google-like queries and the extraction of large amounts of data – data and metadata are structured!
+ data sets get a DOI and are citable
- Time and cost intensive
- Depending on the type of data, it can be very expensive
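
To illustrate why structured data and metadata in a relational database make searching and extraction easy, here is a small sketch using Python's built-in `sqlite3` module. The table layout, data set titles, DOIs and values are all invented for the example and do not reflect the schema of any real data center.

```python
import sqlite3

# In-memory database with a hypothetical two-table layout:
# one row per data set (metadata) and one row per measured value (data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (id INTEGER PRIMARY KEY, title TEXT, doi TEXT);
CREATE TABLE measurement (
    dataset_id INTEGER REFERENCES dataset(id),
    parameter TEXT, unit TEXT, latitude REAL, longitude REAL, value REAL
);
""")
conn.execute("INSERT INTO dataset VALUES (1, 'Hypothetical cruise A', '10.1234/example.a')")
conn.execute("INSERT INTO dataset VALUES (2, 'Hypothetical cruise B', '10.1234/example.b')")
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "temperature", "degC", 60.4, 5.3, 8.1),
        (1, "salinity", "psu", 60.4, 5.3, 34.9),
        (2, "temperature", "degC", -45.0, 10.0, 3.2),
    ],
)


def find(parameter: str, lat_min: float, lat_max: float) -> list[tuple]:
    """Return (title, doi, value, unit) for one parameter within a latitude band."""
    rows = conn.execute(
        """SELECT d.title, d.doi, m.value, m.unit
           FROM measurement m JOIN dataset d ON d.id = m.dataset_id
           WHERE m.parameter = ? AND m.latitude BETWEEN ? AND ?
           ORDER BY d.id""",
        (parameter, lat_min, lat_max),
    )
    return rows.fetchall()
```

Because parameter names, units and positions live in fixed columns, a single query such as `find("temperature", 50, 70)` can pull matching values across many data sets together with the DOI needed to cite them.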

(18)

The next step at data centers and between data centers:

- Data portals

- LAS

(19)

What happens in data portals?

• All relevant data centers are searched daily for new data – by searching one website you search many at once
• All metadata is available at the data portal
• Changes at the different data centers are applied automatically - the latest version is always used
• Scientists can use it like Google and get a direct

(20)

CARBOOCEAN data portal

(21)

Live Access Server (LAS)

• A web server for visualizing gridded and in-situ data

• Can offer a wide range of data products for interpolation, comparison, visualization, and analysis

(22)

(23)

(24)

Data warehouse

• Enables online retrieval of data archived in a relational database

• Queries can be limited by parameters,

(25)

(26)

(27)

If data is structured in the same format…



(28)

Data publications

• following the OECD principles and guidelines for access to research data (2007)
• peer-reviewed, citable data sets referenced by persistent identifiers (DOI)
  → DOI registry -> CrossRef for scientific data
• Collaborations with publishers
  → with data journals
  → cross-referencing supplementary data with traditional publications
(SCOR working group, Elsevier, Nature, Springer, Thomson Reuters)

(29)

(30)

Data policies

• Are based on “Good scientific practice in research and scholarship” by the International Council for Science and European Science Foundation which

(31)

Good scientific practice in research and scholarship
European Science Foundation (ESF), 2000

36. Data are produced at all stages in experimental research and in scholarship. Data sets are an important resource, which enable later verification of scientific interpretations and conclusions. They may also be the starting point for further studies. It is vital, therefore, that all primary and secondary data are stored in a secure and accessible form.

37. Institutions may pay particular attention to documenting and archiving original research and scholarship data. Several codes of good practice recommend a minimum period of 10 years, longer in the case of especially significant or sensitive data. National or regional discipline-based archives should be considered where there are practical or other problems in storing data at the institution where the research was conducted.

(32)

Principles for dissemination of scientific data (International Council for Science/CODATA)

4. Scientific advances rely on full and open access to data. Both science and the public are well served by a system of scholarly research and communication with minimal constraints on the availability of data for further analysis. The tradition of full and open access to data has led to breakthroughs in scientific understanding, as well as to later economic and public policy benefits. The idea that an individual or organization can control access to or claim ownership of the facts of nature is foreign to science.

5. The interests of database owners must be balanced with society’s need for open exchange of ideas. Given the substantial investment in data collection and its importance to society, it is equally important that data are used to the maximum extent possible. Data that were collected for a variety of purposes may be useful to science. Legal foundations and societal attitudes should foster a balance between

(33)

There are different data policies in use

• They all state when and how data (and metadata) have to be made available to project members and to the general public, where data have to be archived, and who gets the credit

• Funding agencies, institutes, projects, organizations, etc. often have their own policies on when data have to be publicly released (ranging from 0.5 to 3 years!) and where they have to be archived
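
When several policies with different release periods apply to the same data, it helps to compute the resulting deadlines explicitly. The sketch below does this for three release periods within the 0.5 to 3 year range mentioned above. The policy names and the individual periods are hypothetical, as is the assumption that the period starts at the end of the cruise.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Add calendar months to a date, clamping to the last day of the month."""
    total = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# Hypothetical release periods in months (0.5, 2 and 3 years).
RELEASE_PERIOD_MONTHS = {"policy A": 6, "policy B": 24, "policy C": 36}


def release_dates(cruise_end: date) -> dict[str, date]:
    """Return the public-release deadline under each policy."""
    return {
        name: add_months(cruise_end, months)
        for name, months in RELEASE_PERIOD_MONTHS.items()
    }


def earliest_release(cruise_end: date) -> date:
    """The strictest policy determines when the data must be public."""
    return min(release_dates(cruise_end).values())
```

For a cruise ending on 31 August 2008, `release_dates(date(2008, 8, 31))` gives deadlines from 28 February 2009 to 31 August 2011, which makes the spread between policies easy to see.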

(34)

Example: data from one cruise can fall under different data policies

• That of the project

• Institute

• Owner of the vessel

• National funding agency

(35)

Even though it is slightly confusing, it sounds as if everything is in place and data is available.

(36)

(37)

(38)

Why is data often not reported in time, or not at all, or not made available to the community?

(39)

(40)

Why is data archiving important?

• Data sets are an important resource, which enable later verification of scientific interpretations and conclusions
• Data has been lost in the past due to missing or insufficient data management
• Essential for syntheses
• Several codes of good practice recommend a minimum data storage period of 10 years
• Funding agencies require data to be archived long-term

(41)

What is high impact for data?

• Making a data set available is a publication!
• Make data sets citable and get cited (use of DOI)
• Make data available according to internationally agreed standards (for the data reporting itself and the infrastructure being used)
• Use established data institutions which people search
• The more scientists find and use your data, the more your paper will get cited
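
A data set with a DOI can be cited like a paper. The sketch below builds a citation string and a resolvable link from a DOI. The `DataSet` class, the citation layout and all field values used with it are assumptions for illustration; only the `https://doi.org/` resolver prefix is standard.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    """Minimal, hypothetical metadata needed to cite a data set."""
    authors: list[str]
    year: int
    title: str
    publisher: str
    doi: str

    def doi_url(self) -> str:
        # Any DOI can be resolved by prefixing it with the doi.org resolver.
        return f"https://doi.org/{self.doi}"

    def citation(self) -> str:
        # One possible citation layout; data centers define their own.
        names = "; ".join(self.authors)
        return f"{names} ({self.year}): {self.title}. {self.publisher}, {self.doi_url()}"
```

For example, `DataSet(["Doe, J"], 2008, "Example cruise data", "Example Data Center", "10.1234/example").citation()` yields a single line that can go straight into a reference list.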

(42)

Possible problems in retrieving data from different sources

• Version conflicts (data is archived in many data centres at different stages, e.g. raw data, quality controlled, etc.)
• Badly documented metadata and data (methods, units, unclear parameter definitions, etc.)
• Only metadata is available online – data has to be requested
• Naming of cruises varies between countries > hard to identify the same cruises
• Date formats (mm/dd/yyyy; yy/mm/dd; dd/mm/yyyy, etc.)
• Ways to report the position (Lat/Long, UTM)
• Different export formats (plain text, XML, netCDF, etc.)
• Different entities (one data set = data from one cruise, or data from one station, or data from one sample)
• Data set is too large to be downloaded (e.g. model data)
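
Two of the problems above, date formats and position notation, can be handled mechanically once the source format is known. The following sketch converts dates in the three listed formats to ISO 8601 and positions given in degrees and decimal minutes to signed decimal degrees. The function names and the idea that each source declares its date format in the metadata are assumptions made for this example.

```python
from datetime import datetime

# The three date layouts listed above, mapped to strptime patterns.
DATE_FORMATS = {
    "mm/dd/yyyy": "%m/%d/%Y",
    "dd/mm/yyyy": "%d/%m/%Y",
    "yy/mm/dd": "%y/%m/%d",
}


def to_iso_date(text: str, declared_format: str) -> str:
    """Convert a date string to ISO 8601 (yyyy-mm-dd).

    The format has to be declared by the data source: a string such as
    03/04/2008 cannot be interpreted reliably without it.
    """
    pattern = DATE_FORMATS[declared_format]
    return datetime.strptime(text, pattern).date().isoformat()


def to_decimal_degrees(degrees: int, minutes: float, hemisphere: str) -> float:
    """Convert degrees and decimal minutes to signed decimal degrees.

    South and west are returned as negative values.
    """
    value = degrees + minutes / 60.0
    return -value if hemisphere.upper() in ("S", "W") else value
```

The same input string gives different dates depending on the declared format: `to_iso_date("03/04/2008", "mm/dd/yyyy")` is 4 March, while `to_iso_date("03/04/2008", "dd/mm/yyyy")` is 3 April. This is why the format must be documented in the metadata.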