Big Data at ECMWF
Providing access to multi-petabyte datasets
Past, present and future
Baudouin Raoult
Principal Software Strategist
ECMWF
An independent intergovernmental organisation established in 1975, with
20 Member States
14 Co-operating States
Major assimilated datasets
Surface stations
Radiosonde balloons
Polar, infrared
Polar, microwave
Geostationary, IR
Aircraft
ERA-20C completed: Climate monitoring of the 20th Century
● Using >5% of ECMWF’s computing power
● Assimilating billions of observations
● Producing 2,400 global forecasts per day
● Generating 1 PB of reanalysis data in 200 days
Surface fluxes: greenhouse gases, fires, emissions
Global atmospheric composition
http://atmosphere.copernicus.eu
Online catalogue, quick-looks and data
Radiation and ozone layer
European Air Quality
ECMWF products
● ECMWF currently receives 300 million observations from 130 sources daily.
● ECMWF operational models produce 13 million fields daily, for a total of around 8 TB.
● 77 million products are disseminated every day, for a total of 6 TB.
What is Big Data?
● Wikipedia: “Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.”
● Gartner: “Big Data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” (The 3 “V”s of Big Data.)
V is for Volume (or coping with exponential growth)
V is for Velocity (or coping with exponential growth, part 2)
● ECMWF’s archive grows exponentially:
– At ECMWF, the annual growth rate r is around 0.5, which is a 50% increase per year
– The daily amount of data added to the archive grows exponentially at the same rate!
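As a rough check on what a 50% annual increase implies, the sketch below compounds the archive size year over year. It assumes the growth law S(t) = S0 * (1 + r)**t, which is one reading of "r is around 0.5, which is a 50% increase per year"; the function names are illustrative only. It also back-computes the rate implied by the two figures quoted on this slide (14 TB/year in 1995, more than 60 TB/day in 2014), treating the 2014 value as a lower bound.

```python
import math

def growth_factor(r: float, years: float) -> float:
    """Factor by which a quantity grows in `years`, assuming S(t) = S0 * (1 + r)**t."""
    return (1.0 + r) ** years

def doubling_time(r: float) -> float:
    """Years needed to double at annual growth rate r."""
    return math.log(2.0) / math.log(1.0 + r)

def implied_rate(start: float, end: float, years: float) -> float:
    """Annual growth rate r that takes `start` to `end` in `years`."""
    return (end / start) ** (1.0 / years) - 1.0

R = 0.5  # growth rate quoted on the slide

print(f"doubling time: {doubling_time(R):.2f} years")
print(f"growth over 10 years: x{growth_factor(R, 10):.1f}")

# Figures from the slide, converted to TB/year (365 days per year assumed).
rate_1995 = 14.0
rate_2014 = 60.0 * 365
print(f"implied r, 1995 to 2014: {implied_rate(rate_1995, rate_2014, 2014 - 1995):.2f}")
```

Under these assumptions the archive doubles in well under two years, and the implied rate between 1995 and 2014 comes out close to the quoted r of 0.5.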
● In 1995, the size of the archive was increasing at a rate of 14 TB/year.
● In 2014, the size of the archive increases at a rate higher than 60 TB/day (with peaks at 100 TB).
V is for Variety (or coping with complexity)
[Figure: timeline of archived data types, on a logarithmic axis from 100M to 10T. Timeline labels: model resolutions T106L16, T106L19, T213L31, T319L31, T319L50, T319L60, T511L60, T799L91, T1279L91; supercomputers X-MP/4, Y-MP/8, C90/12, C90/16, VPP700-48, VPP700-112, VPP5000, IBM-P4, IBM-P5, IBM-P5+, IBM-P6. Annotated additions include model levels, SSTs, waves, EPS, 4D-Var, SCDA and DCDA analyses and forecasts, EFIs, weekly and monthly forecasts, 00Z runs, and ensemble data assimilation.]
ECMWF’s Meteorological Archival and Retrieval System (MARS)
● 28 years in existence
● A managed archive
● MARS is not a file system
– Users are not aware of the location of the data
– Retrievals are expressed in meteorological terms
● An archive, not a database
– Metadata online
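To illustrate what "retrievals expressed in meteorological terms" can look like, here is a minimal sketch; it is not MARS itself. The keyword names (`date`, `param`, `levelist`, `step`) are modelled loosely on MARS-style requests, while the `Catalogue` class, its methods and the storage locations are hypothetical. A request names a block of fields by their meteorological coordinates; the catalogue expands it into individual fields and resolves where each one lives, so the user never handles a path.

```python
from itertools import product

class Catalogue:
    """Toy metadata index: maps a field's meteorological key to a storage location."""

    def __init__(self):
        self._index = {}  # key tuple -> (location, payload)

    @staticmethod
    def _key(field):
        # Canonical, order-independent key built from the field's coordinates.
        return tuple(sorted(field.items()))

    def archive(self, field, payload, location):
        self._index[self._key(field)] = (location, payload)

    def migrate(self, field, new_location):
        """Move a field to other media; its meteorological key is unchanged."""
        _, payload = self._index[self._key(field)]
        self._index[self._key(field)] = (new_location, payload)

    def retrieve(self, request):
        """Expand a request (keyword -> list of values) into fields and fetch each one."""
        keys = sorted(request)
        results = []
        for values in product(*(request[k] for k in keys)):
            field = dict(zip(keys, values))
            _, payload = self._index[self._key(field)]
            results.append(payload)
        return results

catalogue = Catalogue()
for step in (0, 12, 24):
    for param in ("t", "z"):
        field = {"date": "2014-01-01", "param": param, "levelist": 500, "step": step}
        # Locations are made up for the example.
        catalogue.archive(field, payload=f"{param}@{step}h", location="disk:/hypothetical/pool1")

request = {"date": ["2014-01-01"], "param": ["t", "z"], "levelist": [500], "step": [0, 12]}
before = catalogue.retrieve(request)

# Moving one field to other media does not change the request or its result.
catalogue.migrate({"date": "2014-01-01", "param": "t", "levelist": 500, "step": 0},
                  "tape:/hypothetical/vol7")
after = catalogue.retrieve(request)
print(before == after)
```

The request never mentions a location, which is why the data can be moved behind the catalogue without the user noticing.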
ECMWF’s Meteorological Archival and Retrieval System (MARS)
● 10^14 directly addressable objects
– Unique hypercube-based indexing
● Data is kept forever:
– For many studies, a dataset becomes useful once enough data has been accumulated
– Deleting old data in an exponentially growing archive is meaningless
● 200 million objects/65 TB added daily
● 7000 registered users
● 650 active users, 100 TB retrieved per day, in 1.5 million requests
Scalability and sustainability
● Distributed service-oriented architecture, to scale out
● Queues and priorities to ensure quality of service and scaling with the demand
● Indirection is the key to scalability:
– Allow services to be modified/redeployed…
– Allow data to be moved to different media/storage…
– …without any impact on users
● It has allowed us to migrate several times during the past decades:
– MVS to AIX, AIX to Linux, single server to clusters
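The "queues and priorities" point above can be sketched with a priority queue: requests wait in a queue and a worker always picks the most urgent one, with arrival order breaking ties. This is only a sketch under assumed names (`RequestQueue`, the priority values and the request labels are hypothetical), using Python's standard `heapq` module rather than anything ECMWF-specific.

```python
import heapq
from itertools import count

class RequestQueue:
    """Toy request queue: lower priority number means more urgent."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker: first come, first served

    def submit(self, request, priority):
        heapq.heappush(self._heap, (priority, next(self._arrival), request))

    def next_request(self):
        _, _, request = heapq.heappop(self._heap)
        return request

    def __len__(self):
        return len(self._heap)

queue = RequestQueue()
queue.submit("research: 30 years of reanalysis", priority=5)
queue.submit("operations: today's forecast", priority=0)
queue.submit("research: single field", priority=5)

# Drain the queue in the order a single worker would serve it.
served = [queue.next_request() for _ in range(len(queue))]
print(served)
```

Adding workers that pull from the same queue is one way such a design can scale with demand while still serving urgent requests first.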
Continuous evolution
● Change is driven by new needs and expectations
– Users are not domain experts anymore…
– …users expect information at “Google speed”
● ECMWF is continuously evolving its data delivery methods and services to cater for the new user requirements
● Interoperability is key
– Follow standards and governance to enable interoperability
– OGC, INSPIRE, ISO 19xxx series, NetCDF-CF, WMO Information System, GEOSS
● Provide high-level services on the data
– Data portals for data discovery
– Web services, REST APIs for data retrieval and manipulation
The next steps…
● The era of pushing the data to the user is coming to an end
– The volumes involved are too large
● We live in a post-PC era…
– Users want to access data from smart devices…
– …anywhere, anytime, any device…
– …and share their results.
● We need to bring the user processing to the data:
– Cloud Computing is now mature enough to implement operational services
– We need to build a “platform as a service” (PaaS), on which to build “software as a service” (SaaS) solutions for environmental data and products
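The contrast between pushing data to the user and bringing the processing to the data can be illustrated with a toy service. Everything here is hypothetical (the `DataService` class, its methods and the synthetic values); the point is only that a reduction run next to the data returns a far smaller result than a full download.

```python
import json
from statistics import mean

class DataService:
    """Toy stand-in for a service that sits next to a large dataset."""

    def __init__(self, values):
        self._values = list(values)

    def download(self):
        """'Push the data to the user': serialise every value."""
        return json.dumps(self._values)

    def run(self, func):
        """'Bring the processing to the data': apply func here, return only its result."""
        return json.dumps(func(self._values))

# Synthetic values standing in for a large field.
service = DataService(float(i % 40) for i in range(100_000))

full = service.download()    # everything crosses the network
summary = service.run(mean)  # only the answer crosses the network

print(len(full), len(summary))
```

A real platform would also need to sandbox and schedule user code, which this sketch leaves out entirely.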