Data archiving and data policies
Benjamin Pfeil, Stephane Pesant, Michael Diepenbroek, Hannes Grobe
Bjerknes Centre for Climate Research/ University of Bergen, Norway
WWW.BJERKNES.UIB.NO
Data often shows a snapshot of the
environment at a particular time and place
Sampling can be very expensive (on average
over 100,000 € for one dataset in the
bio- and geosciences, including costs for
expeditions, laboratories, etc.)
Data is therefore very valuable for
future scientific work and has to be
archived and made available
Why do we need data?
Verification of research results
Comparison of results
Indication of trends
Model input
Remote sensing
Some facts about data in the
scientific community
Scientific instruments and computer
simulations create large amounts of
data
Due to new measurement techniques
(and better precision), data
volumes are doubling each year
Scientific data has to be archived
according to “Good scientific practice
in research and scholarship”
(European Science Foundation 2000)?
For a long time this was common
practice:
In 1958 the World Data Center
system was established
Mission Statement of the World Data Center
System
• Data constitute the raw material of
scientific understanding. The World Data
Center system works to guarantee access
to solar, geophysical and related
environmental data. It serves the whole
scientific community by assembling,
scrutinizing, organizing and disseminating
data and information
Network of ICSU WDCs
• Nuclear Radiation
Tokyo, Japan
• WDC Co-ordination Offices
Washington DC, USA Beijing, China
• Meteorology
Asheville NC, USA Beijing, China Obninsk, Russia
• Oceanography
Obninsk, Russia Silver Spring MD, USA Tianjin, China
• Paleoclimatology
Boulder CO, USA
• Marine Geology and Geophysics
Boulder CO, USA Moscow, Russia
• Remotely Sensed Land Data
Sioux Falls SD, USA
• Renewable Resources and Environment
Beijing, China
• Recent Crustal Movements
Ondrejov, Czech Republic
• Airglow
Mitaka, Japan
• Astronomy
Beijing, China
• Atmospheric Trace Gases
Oak Ridge TN, USA
• Aurora
Tokyo, Japan
• Cosmic Rays
Toyokawa, Japan
• Geology
Beijing, China
• Human Interactions in the Environment
Palisades NY, USA
• Ionosphere
Tokyo, Japan
• Earth Tides
Brussels, Belgium
• Geomagnetism
Copenhagen, Denmark Edinburgh, UK Kyoto, Japan Colaba, India
• Glaciology
Boulder CO, USA Cambridge, UK Lanzhou, China
• Marine Environmental Sciences
Bremen, Germany
• Rotation of the Earth
Obninsk, Russia Washington DC, USA
• Satellite Information
Greenbelt MD, USA
• Rockets and Satellites
Obninsk, Russia
• Seismology
Denver CO, USA Beijing, China
• Solar Radio Emission
Nagano, Japan
• Space Science
Beijing, China
• Space Science Satellites
Kanagawa, Japan
• Solar Activity
Meudon, France
• Soils
Wageningen, The Netherlands
• Sunspot Index
Brussels, Belgium
• Solar Terrestrial Physics
Boulder CO, USA Didcot Oxon, UK Moscow, Russia Haymarket, Australia
• Solid Earth Geophysics
Beijing, China Boulder CO, USA Moscow, Russia
(Some) important WDCs for
environmental data
WDC for Atmospheric Trace Gases, Carbon Dioxide
Information Analysis Center, USA
WDC for Climate, Model and Data,
Max-Planck-Institute for Meteorology, Germany
WDC for Glaciology, Boulder, University of Colorado,
USA
WDC for Marine Environmental Sciences, Center for
Marine Environmental Sciences (MARUM), Germany
WDC for Marine Geology & Geophysics, Boulder,
USA
But WDC is a status!
There are also many national and international data
centres that are not WDCs, e.g.
ICES – International Council for the
Exploration of the Sea, Denmark
BODC – British Oceanographic Data Centre, UK
BADC – British Atmospheric Data Centre, UK
NODC – National Oceanographic Data Center,
USA
But there are different data archiving
systems in use, e.g.
Data servers (e.g. FTP servers)
Different project websites
Different long-term data archives (World
Data Centers (WDC), National Data
Centers (NDC)) as mentioned
Data server such as an FTP server
+ very fast to archive data (data dump)
+ very cheap
+ easy to archive large data sets (model output for example)
- Data is not structured (different file formats, units, etc.)
- Not easy to search for data
- Difficult to know that the data exists
- Different versions of data
- Not a long-term archive – data can be lost!
- Maintenance
Data archived at project websites
Normally a small database or FTP server is used
+ useful for members of the project since relevant data is available at one site
+ easy way to inform about the project and its achievements
- websites only present data coming from their own project
- can take a long time to gather all relevant data
- links will stop working after a while – data will be lost!
- no more funding – no more maintenance!
Data archived at a Data Center (WDC or
NDC)
+ Data is archived long-term and available online
+ many WDCs are linked to each other, so data can
be found via different websites (GCMD, WDC-CLUSTER, data portals, etc.)
+ often a relational database is used, which enables Google-like queries and the extraction of large amounts of data – data and metadata are
structured!
+ data sets get a DOI and are citable
- Time and cost intensive
- Depending on the type of data it can be very expensive
The next step at data centers and
between data centers:
- Data portals
- LAS
What happens in data portals?
All relevant data centers are searched daily for
new data – by searching one website you search many at once
All metadata is available at the data portal
Changes at the different data centers are
automatically applied - always the latest version is used
Scientists can use it like Google and get a direct
CARBOOCEAN data
portal
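The harvesting idea behind a data portal can be sketched roughly as follows. This is only an illustration under stated assumptions, not how the CARBOOCEAN portal is implemented: the record fields (`id`, `title`, `modified`), the center names and the functions `merge_harvests` and `search` are all hypothetical.

```python
def merge_harvests(harvests):
    """Merge metadata records harvested from several data centers.

    `harvests` maps a (hypothetical) center name to a list of records.
    When the same data set id appears more than once, the record with
    the latest ISO 8601 `modified` date wins.
    """
    index = {}
    for center, records in harvests.items():
        for record in records:
            current = index.get(record["id"])
            # ISO 8601 dates (yyyy-mm-dd) sort correctly as plain strings.
            if current is None or record["modified"] > current["modified"]:
                index[record["id"]] = {**record, "center": center}
    return index


def search(index, term):
    """Return the ids of all data sets whose title contains `term`."""
    term = term.lower()
    return sorted(key for key, rec in index.items() if term in rec["title"].lower())


# Hypothetical harvest results from two centers.
harvests = {
    "center_a": [
        {"id": "ds-1", "title": "Surface CO2, cruise A", "modified": "2008-01-10"},
        {"id": "ds-2", "title": "Salinity profiles", "modified": "2008-02-01"},
    ],
    "center_b": [
        {"id": "ds-1", "title": "Surface CO2, cruise A (quality controlled)",
         "modified": "2008-03-05"},
    ],
}

index = merge_harvests(harvests)
hits = search(index, "co2")
```

One search over the merged index covers both centers, and the newer version of `ds-1` replaces the older one.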
Live Access Server (LAS)
A web server for visualizing gridded and
in-situ data
Can offer a wide range of data products
for interpolation, comparison,
visualization, and analysis
Data warehouse
Enables online retrieval of data archived
in a relational database
Queries can be limited by parameters,
If data is structured in the same
format…
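A minimal sketch of a query limited by parameter and region, using an in-memory SQLite database. The table `measurements`, its columns and all values are assumptions made for this example and do not describe any particular data warehouse.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a few made-up measurements."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE measurements ("
        "cruise TEXT, parameter TEXT, latitude REAL, longitude REAL, "
        "depth_m REAL, value REAL)"
    )
    rows = [
        ("CRUISE_A", "temperature", 60.0, 5.0, 10.0, 8.1),
        ("CRUISE_A", "salinity", 60.0, 5.0, 10.0, 35.0),
        ("CRUISE_B", "temperature", 70.0, 10.0, 50.0, 4.2),
        ("CRUISE_B", "temperature", 45.0, -20.0, 10.0, 15.3),
    ]
    conn.executemany("INSERT INTO measurements VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def query_by_parameter(conn, parameter, lat_min, lat_max):
    """Select one parameter within a latitude band, ordered by cruise."""
    cursor = conn.execute(
        "SELECT cruise, latitude, longitude, depth_m, value "
        "FROM measurements "
        "WHERE parameter = ? AND latitude BETWEEN ? AND ? "
        "ORDER BY cruise",
        (parameter, lat_min, lat_max),
    )
    return cursor.fetchall()


conn = build_demo_db()
northern_temperatures = query_by_parameter(conn, "temperature", 55.0, 75.0)
```

Such a query only works across data sets because every row shares the same columns and units.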
Data publications
• following the OECD principles and guidelines for access
to research data (2007)
• peer-reviewed citable data sets referenced by
persistent identifiers (DOI)
DOI registry -> crossref for scientific data
• Collaborations with publishers
with data journals
cross-referencing supplementary data with
traditional publications
(SCOR working group, Elsevier, Nature, Springer, Thomson Reuters)
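As a rough illustration of what "citable" means in practice, the sketch below builds a citation string for a data set and the resolver link for its DOI. The class `DataSetCitation`, the citation layout and all values (authors, title, DOI `10.1234/example`) are placeholders invented for this example. Only the `https://doi.org/` resolver prefix is real.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataSetCitation:
    """Minimal, hypothetical description of a published data set."""
    authors: List[str]
    year: int
    title: str
    publisher: str
    doi: str

    def doi_url(self) -> str:
        # A DOI is resolved by appending it to the doi.org resolver.
        return f"https://doi.org/{self.doi}"

    def as_text(self) -> str:
        names = "; ".join(self.authors)
        return f"{names} ({self.year}): {self.title}. {self.publisher}, {self.doi_url()}"


# Placeholder values only, not a real data set.
citation = DataSetCitation(
    authors=["Doe, J", "Roe, R"],
    year=2008,
    title="Example hydrographic data set",
    publisher="Example Data Center",
    doi="10.1234/example",
)
text = citation.as_text()
```

Because the identifier is persistent, the citation stays valid even if the data center reorganizes its website.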
Data policies
Are based on “Good scientific practice
in research and scholarship” by the
International Council for Science and
European Science Foundation which
36. Data are produced at all stages in experimental research and in scholarship. Data sets are an important resource, which enable later verification of scientific interpretations and conclusions. They may also be the starting point for further studies. It is vital, therefore, that all
primary and secondary data are stored in a secure and accessible form.
Good scientific practice in research and scholarship
European Science Foundation (ESF), 2000
37. Institutions may pay particular attention to documenting and archiving original research and scholarship data. Several codes of good practice recommend a minimum period of 10 years, longer in the case of especially significant or sensitive data. National or
regional discipline-based archives should be considered where there are practical or other problems in storing data at the institution where the research was conducted.
4. Scientific advances rely on full and open access to data. Both
science and the public are well served by a system of scholarly research and communication with minimal constraints on the availability of data for further analysis. The tradition of full and open access to data has led to breakthroughs in scientific understanding, as well as to later economic and public policy benefits. The idea that an individual or organization can control access to or claim ownership of the facts of nature is foreign to science.
Principles for dissemination of scientific data (International Council for Science/CODATA)
5. The interests of database owners must be balanced with
society’s need for open exchange of ideas. Given the substantial investment in data collection and its importance to society, it is equally important that data are used to the maximum extent possible. Data that were collected for a variety of purposes may be useful to science. Legal foundations and societal attitudes should foster a balance between
There are different data
policies in use
They all state when and how data (and
metadata) have to be made available to
project members and the general public, where
data have to be archived and who gets
the credit
Funding agencies, institutes, projects,
organizations, etc. often have their own
policies on when data have to be publicly
released (ranging from 0.5 to 3 years!) and
where they have to be archived
Example: data from one cruise
can fall under different data
policies
The project
The institute
The owner of the vessel
The national funding agency
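To make the overlap concrete, the sketch below computes the release deadline each policy would impose on one cruise and picks the strictest. The embargo periods and the cruise end date are made-up values inside the 0.5 to 3 year range mentioned earlier, and the helpers `add_months` and `release_deadlines` are hypothetical.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return `start` shifted by `months`, clamped to the month's last day."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_deadlines(end_of_cruise, embargo_months_by_policy):
    """Map each policy name to the date by which data must be public."""
    return {
        name: add_months(end_of_cruise, months)
        for name, months in embargo_months_by_policy.items()
    }


# Illustrative embargo periods in months (assumed, not taken from real policies).
policies = {
    "project": 24,
    "institute": 36,
    "vessel owner": 12,
    "national funding agency": 6,
}

deadlines = release_deadlines(date(2008, 8, 31), policies)
strictest = min(deadlines, key=deadlines.get)
```

With these assumed values the national funding agency sets the earliest deadline, so in practice it is the policy that decides when the data have to be released.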
Even though it is slightly
confusing, it sounds like
everything is in place and data is
available.
Why is data often not reported
in time or at all or not available
to the community?
Why is data archiving important?
Data sets are an important resource, which
enable later verification of scientific interpretations and conclusions
Data has been lost in the past due to missing or
insufficient data management
Essential for syntheses
Several codes of good practice recommend a
minimum data storage period of 10 years
Funding agencies require data to be long-term
archived
What is high impact for data?
Making a data set available is a publication!
Make data sets citable and get cited (use of DOI)
Make data available according to internationally agreed
standards (for the data reporting itself and the infrastructure being used)
Use established data institutions which people
search
The more scientists find and use your data, the
more your paper will get cited
Possible problems in retrieving
data from different sources
Version conflicts (data is archived in many data centres at different
stages, e.g. raw data, quality controlled, etc.)
Poorly documented metadata and data (methods, units, unclear parameter
definitions, etc.)
Only metadata is available online – data has to be requested
Naming of cruises varies between countries > hard to identify the same
cruise
Date formats (mm/dd/yyyy; yy/mm/dd; dd/mm/yyyy, etc.)
Ways to report the position (Lat/Long, UTM)
Different export formats (plain text, XML, netCDF, etc.)
Different entities (one data set = data from one cruise or data from
one station or data from one sample)
Data set is too large to be downloaded (e.g. model data)
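The date format problem can be illustrated with a short sketch: the same string means different days depending on the convention the source used, so the format has to come from the metadata and cannot be guessed from the value. The function `to_iso_date` and the mapping `SOURCE_FORMATS` are hypothetical helpers written for this example.

```python
from datetime import datetime

# The three conventions listed above, mapped to strptime patterns.
SOURCE_FORMATS = {
    "mm/dd/yyyy": "%m/%d/%Y",
    "yy/mm/dd": "%y/%m/%d",
    "dd/mm/yyyy": "%d/%m/%Y",
}


def to_iso_date(text, source_format):
    """Convert a date string in a known source format to ISO 8601 (yyyy-mm-dd)."""
    pattern = SOURCE_FORMATS[source_format]
    return datetime.strptime(text, pattern).date().isoformat()


same_string = "03/04/2007"
as_month_first = to_iso_date(same_string, "mm/dd/yyyy")  # 4 March 2007
as_day_first = to_iso_date(same_string, "dd/mm/yyyy")    # 3 April 2007
short_year = to_iso_date("07/03/04", "yy/mm/dd")         # 4 March 2007
```

Storing dates in a single unambiguous form such as ISO 8601, together with documented metadata, avoids this kind of silent one-month error when data sets from different sources are merged.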