Data Formats for Long-term Archiving
of Climate Model Data
at WDC Climate and DKRZ
Michael Lautenschlager and Jörg Wegner
WDC Climate / Max-Planck-Institute for Meteorology, Hamburg
MPG e-Science Seminar
DKRZ:
• Earth system model development
• Simulations of past, present and
future climate
WDC Climate:
• Long-term data archiving
• Inter-disciplinary data
dissemination
Diagram of the
Hamburg
IPCC-Climate Model
Near-surface temperature change for the scenarios A1B and B1.
Shown is the difference of the 30-year means 2071-2100 minus 1961-1990.
Comparison of the present-day sea ice cover in March and September (top) with the climate projection for the scenario A1B (bottom) in 2100.
Additionally the snow over land can be
Spatial resolution of the North Atlantic sector
in ECHAM5/MPI-OM
Data Volumes in Climate Projections:
• IPCC AR4: ECHAM5[T63L19]/MPI-OM produces 23 GB/year
Climate projection over 240 years (1860-2100): 5.5 TB and approx. 2 months computer run time
• Future: ECHAM5[T106L31] produces 44 GB/year
Climate projection over 240 years (1860-2100): 106
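The totals on this slide follow from multiplying the yearly output by the 240 simulated years. The sketch below assumes the per-year rates are in GB (which is what the 5.5 TB total implies) and uses decimal units (1 TB = 1000 GB). The function name is illustrative only.

```python
def projection_volume_tb(gb_per_year, years):
    """Total archive volume in TB for a run producing gb_per_year (decimal units)."""
    return gb_per_year * years / 1000.0

YEARS = 240  # 1860-2100

# Per-year rates as read from the slide, assumed to be in GB/year.
for label, rate in [("ECHAM5[T63L19]/MPI-OM", 23), ("ECHAM5[T106L31]", 44)]:
    print(f"{label}: {projection_volume_tb(rate, YEARS):.2f} TB")
```

Under these assumptions the T63L19 run comes to about 5.5 TB, matching the slide, and the T106L31 rate would give roughly 10.6 TB.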
Extrapolated HLRE2 linear archive increase (10 times HLRE)
Compute server architectures:
C90: Cray C90 / HLRE: NEC SX-6 / MPP: SUN-Cluster / HLRE2: new system
(Figure labels: local systems, remote systems, CS, DS, NW, GFS)
DKRZ System Diagram
(Figure: NEC SX-6 compute nodes (24 nodes, IXS), GFS disk 70 TB, DBMS disk 30 TB, UCFM cache 17 TB, tape drives 9840C x 7, 9940B x 18, T10000 x 8, LTO2 x 2, DXUL-DB Oracle9i, SUN application server, HSM, archive and backup)
Data classes
Test data
from model code development,
life cycle: weeks to months
Project data
from scientific model evaluation and research
projects (DKRZ resources at project level),
life cycle: 3 – 5 years
Final results
as data products for international projects
(IPCC) and scientific publications,
life cycle: 10 years and longer
• Data hierarchy levels
Temp(orary): scratch discs at compute server
Work: fixed disc space at project level for evaluation
Arch(ive): tape storage space (single copy) with expiration date for project data beyond available disc space
Docu(mentation): documented, long-term tape archive (security copy) for data products, focus on interdisciplinary data utilisation
Tape space distribution to archive classes at DKRZ at the beginning of 2007:
• part of the “work” space is on tape because GFS is too small
• “docu” domain consists of WDCC
• no expiration dates in “arch” domain; parts of the “arch” domain belong to “docu” but are not yet documented
• Data documentation requirements are met by using the WDCC infrastructure
CERA-2 metadata model developed in 1999
Catalogue interface: cera.wdc-climate.de
Input interface: input.wdc-climate.de
CERA-2 metadata content is complete with respect to browsing, discovering and using climate data which are stored in the database system or outside in flat files
The WDCC matches international description standards like ISO 19115, Dublin Core or GCMD and is integrated in international data federations
The data storage structure stores climate time series per variable in BLOB data tables. This allows for web-based data catalogue search and data
CERA Data Model
Blocks: Entry, Reference, Status, Distribution, Contact, Coverage, Parameter, Spatial Reference, Local Adm., Data Access, Data Org
Coloured columns correspond to BLOB data tables in WDCC.
Collections of matrix rows represent storage in model raw data files (complete model output, storage time step by storage time step).
Data infrastructure integrates data stewardship in the long-term archive
• Bit-stream preservation
• Quality assurance
Long-term archive data stewardship
• Bit-stream preservation
Secondary tape copies on different tapes and technology at a separate location
Copy to new tapes after the maximum number of tape accesses is reached (Refreshment)
• Quality assurance
Semantic examinations: behavior of a numerical model compared to observations and to other models, part of the scientific evaluation process
Syntactic examinations: formal aspects of data archiving and assurance that data archiving is free of errors as far as possible
- Consistency between metadata and climate data
- Completeness of climate data
- Standard range of values
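The three syntactic examinations listed above (consistency, completeness, standard range) can be illustrated with a minimal sketch. This is not the WDCC implementation. The metadata keys (`n_steps`, `valid_min`, `valid_max`) and the example temperature range are hypothetical, chosen only for illustration.

```python
import math

def is_missing(value):
    """A value counts as missing if it is None or NaN."""
    return value is None or math.isnan(value)

def syntactic_report(metadata, values):
    """Run the three checks on one time series; returns a dict of booleans."""
    present = [v for v in values if not is_missing(v)]
    return {
        # Consistency: metadata and data agree on the number of time steps.
        "consistency": metadata["n_steps"] == len(values),
        # Completeness: no time step is missing.
        "completeness": len(present) == len(values),
        # Standard range: all present values lie inside the declared bounds.
        "standard_range": all(
            metadata["valid_min"] <= v <= metadata["valid_max"] for v in present
        ),
    }

# Hypothetical metadata for a near-surface temperature series in kelvin.
meta = {"n_steps": 4, "valid_min": 150.0, "valid_max": 350.0}
good = [271.3, 272.0, 273.4, 274.1]
bad = [271.3, float("nan"), 999.0]

print(syntactic_report(meta, good))
print(syntactic_report(meta, bad))
```

Semantic examinations are deliberately left out: comparing model behavior with observations is part of the scientific evaluation and cannot be reduced to such formal checks.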
Long-term archive data stewardship (continued)
• Usability enabling
Complete and searchable documentation of climate data entities (database tables and flat files) in the catalogue system of the WDCC
WDCC offers web-based data access to small data granules (individual entries in BLOB DB tables)
Archive technology transfer must be backward compatible to keep old data technically readable
Data processing tools and data format access libraries
Standard Data Formats (SDFs) at WDC Climate
• Requirements
Self-descriptive (use metadata)
Machine independent
Should contain compression or packing
• Benefits
SDFs support long-term data preservation
SDFs support data exchange and dissemination
SDFs allow for application of standardized data
Data Formats at WDCC
GRIB 1, GRIB 2, NetCDF 3.x, NetCDF 4.x
tools:
convert: cdo, cdat, xconv, IDV
manipulate: cdo, cdat, nco, ncl
visualize: cdat, grads, ncview, GMT
GRIB 1
- 'GRIB'
Section 0 - length of message, edition number
Section 1 - product description section
Section 2 - grid description section
Section 3 - bit map section
Section 4 - binary data section
- '7777'
(the sequence is repeated for each record)
ds8 55 %grib -ginfo zzz.grb
 Rec : Position  Size : V PDS GDS BMS   BDS : Code Level : LType GType
   1 :        0 36948 : 1  28  32   0 36876 :  133 20000 :   100     4
   2 :    36948 36948 : 1  28  32   0 36876 :  133 20000 :   100     4
   3 :    73896 36948 : 1  28  32   0 36876 :  133 20000 :   100     4
   4 :   110844 36948 : 1  28  32   0 36876 :  133 20000 :   100     4
   5 :   147792 36948 : 1  28  32   0 36876 :  133 20000 :   100     4
ds8 56 %grib -gdsinfo zzz.grb
 Rec : GDS NV PVPL Typ : xsize ysize  Lat1 Lon1   Lat2   Lon2   dx dy
   1 :  32  0  255   4 :   192    96 88572    0 -88572 358125 1875 48
ds8 57 %
- compressed data -> small file size
- every 2D field (record) is a GRIB file -> UNIX commands for concatenating
- library support for FORTRAN & C
- strong restrictions on header information
- header information is coded (numbers) - tables needed for decoding
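The record sizes and positions in the `grib -ginfo` listing above can be cross-checked from the section lengths. The sketch below relies on the GRIB1 layout (8-byte indicator section, 4-byte '7777' end marker, 11-byte binary data section header padded to an even byte count). The 16 bits per packed value is an assumption; it is the value that reproduces the listed BDS length for the 192 x 96 grid.

```python
import math

INDICATOR_LEN = 8    # 'GRIB' (4 bytes) + message length (3 bytes) + edition (1 byte)
END_LEN = 4          # '7777'
BDS_HEADER_LEN = 11  # GRIB1 binary data section header for simple packing

def bds_size(n_points, bits_per_value):
    """Length of the binary data section, padded to an even number of bytes."""
    raw = BDS_HEADER_LEN + math.ceil(n_points * bits_per_value / 8)
    return raw + (raw % 2)

def record_size(pds, gds, bms, bds):
    """Total length of one GRIB1 record from its section lengths."""
    return INDICATOR_LEN + pds + gds + bms + bds + END_LEN

bds = bds_size(192 * 96, 16)        # xsize * ysize, assumed 16 bits per value
size = record_size(28, 32, 0, bds)  # PDS, GDS, BMS lengths from the listing
positions = [i * size for i in range(5)]

print(bds, size, positions)
```

Because every record has the same size, the byte offsets are simple multiples of the record length, which is why plain UNIX concatenation of GRIB files works.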
GRIB 2
Section 0 - 'GRIB' indicator section
Section 1 - identification section*
Section 2 - local use section (optional)
Section 3 - grid definition section*
Section 4 - product definition section*
Section 5 - data representation section
Section 6 - bit map section
Section 7 - data section
Section 8 - end section '7777'
Sequences of Sections 2 to 7, 3 to 7, or 4 to 7 may be repeated.
* Sections 1, 3, 4 represent the GRIB1 product description section.
This splitting, combined with the option of iterating sections and the concept of templates, makes the main difference from GRIB1 and keeps GRIB2 very flexible.
Concept of templates: you can define your own templates for grid definition, product definition and data representation.
Example: 500 hPa height field forecasts on a Northern Hemisphere polar stereographic grid, produced by a particular numerical model at forecast hours 12, 24, 36, and 48. These four fields could be represented by a single GRIB2 message by repeating the sequence of Sections 4 to 7 four times, making the appropriate forecast time changes in the Product Definition Section in each iteration of the sequence.
Section 0: Indicator Section
Section 1: Identification Section
Section 2: Local Use Section (optional)
Section 3: Grid Definition Section
Section 4: Product Definition Section (hour = 12) | repetition 1
Section 5: Data Representation Section            |
Section 6: Bit-Map Section                        |
Section 7: Data Section                           |
Section 4: Product Definition Section (hour = 24) | repetition 2
Section 5: Data Representation Section            |
Section 6: Bit-Map Section                        |
Section 7: Data Section                           |
Section 4: Product Definition Section (hour = 36) | repetition 3
Section 5: Data Representation Section            |
Section 6: Bit-Map Section                        |
Section 7: Data Section                           |
Section 4: Product Definition Section (hour = 48) | repetition 4
Section 5: Data Representation Section            |
Section 6: Bit-Map Section                        |
Section 7: Data Section                           |
Section 8: End Section
Note that since the Grid Definition Section is not repeated, it remains in effect for all four forecast hours.
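The repetition scheme of this example can be written down as a short sketch that only builds the order of sections; it does not encode any GRIB2 bytes. The function name and the tuple representation are illustrative choices, not part of any GRIB library.

```python
SECTION_NAMES = {
    0: "Indicator", 1: "Identification", 2: "Local Use", 3: "Grid Definition",
    4: "Product Definition", 5: "Data Representation", 6: "Bit-Map",
    7: "Data", 8: "End",
}

def message_layout(forecast_hours):
    """Return (section number, forecast hour or None) in message order."""
    layout = [(0, None), (1, None), (2, None), (3, None)]
    for hour in forecast_hours:
        layout.append((4, hour))  # only the product definition changes
        layout.extend((number, None) for number in (5, 6, 7))
    layout.append((8, None))
    return layout

layout = message_layout([12, 24, 36, 48])
for number, hour in layout:
    suffix = f" (hour = {hour})" if hour is not None else ""
    print(f"Section {number}: {SECTION_NAMES[number]} Section{suffix}")
```

The grid definition appears once and the block of Sections 4 to 7 appears four times, which is the saving compared with four separate messages.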
NetCDF 3.x
- dimensions (1 unlimited possible)
- variables & attributes
- global attributes
- data
netcdf simple_xy {
dimensions:
    x = 6 ;
    y = 12 ;
variables:
    int data(x, y) ;

// global attributes:
    :CDO = "Climate Data Operators version 0.9.5 " ;
    :source = "ECHAM5.2" ;
data:

 data =
  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
  12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
  24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
  36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
  48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
  60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71 ;
}
- no compression for data -> file size bigger than GRIB1
- data stored n-dimensional
- library support for FORTRAN & C
- file size >= 2 GByte with NetCDF 3.6
- no restrictions on header information
NetCDF 4.x, HDF5
NetCDF-4/HDF5 Format
With version 4.0 of netCDF, another new data format was introduced: the netCDF-4/HDF5 format. This format is HDF5, with full use of the new dimension scales, creation ordering, and other features of HDF5 added in its version 1.8.0 release.
• Multiple unlimited dimensions.
• Groups to organize data.
• New types, including compound types and variable length arrays.
• Parallel I/O.
netcdf4 "example" {
  group "/" {
    group "group1" {
      dataset "set1" { dimension variables data }
      dataset "set2" { dimension variables data }
    }
    group "group2" { ... }
  }
}
netcdf3.x file
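Since all four formats discussed above start with a fixed signature, a file can be classified from its first bytes. The following is a minimal sketch under the assumption that the signature sits at offset 0 (HDF5 also permits a signature at later offsets, which is not handled here). The function name and return strings are illustrative.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

def sniff_format(head: bytes) -> str:
    """Classify a file from its first 8 bytes."""
    if head[:4] == b"GRIB" and len(head) >= 8:
        # Both GRIB editions store the edition number in the 8th byte.
        return f"GRIB edition {head[7]}"
    if head[:3] == b"CDF" and len(head) >= 4:
        variants = {1: "NetCDF classic (3.x)", 2: "NetCDF 64-bit offset (3.6+)"}
        return variants.get(head[3], "NetCDF, unknown variant")
    if head[:8] == HDF5_SIGNATURE:
        return "HDF5 (container used by NetCDF-4)"
    return "unknown"

# Synthetic headers; the GRIB1 one encodes the 36948-byte record length.
print(sniff_format(b"GRIB\x00\x90\x54\x01"))
print(sniff_format(b"CDF\x01\x00\x00\x00\x00"))
print(sniff_format(HDF5_SIGNATURE))
```

For a real file, `open(path, "rb").read(8)` would supply the `head` argument.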