Big Data at ECMWF Providing access to multi-petabyte datasets Past, present and future


Academic year: 2021

(1)

Big Data at ECMWF

Providing access to multi-petabyte datasets

Past, present and future

Baudouin Raoult

Principal Software Strategist

(2)

ECMWF

An independent intergovernmental organisation established in 1975, with 20 Member States and 14 Co-operating States
(3)
(4)
(5)

Major assimilated datasets

– Surface stations

– Radiosonde balloons

– Polar, infrared

– Polar, microwave

– Geostationary, IR

– Aircraft
(6)
(7)

ERA-20C completed: Climate monitoring of the 20th Century

Using >5% of ECMWF’s computing power

Assimilating billions of observations

Producing 2,400 global forecasts per day

Generating 1 PB of reanalysis data in 200 days
(8)

Surface fluxes: greenhouse gases, fires, emissions

Global atmospheric composition

http://atmosphere.copernicus.eu

Online catalogue, quick-looks and data

Radiation and ozone layer

European Air Quality

(9)

ECMWF products

ECMWF currently receives 300 million observations from 130 sources daily.

ECMWF operational models produce 13 million fields daily, for a total of around 8 TB.

77 million products are disseminated every day, for a total of 6 TB.
(10)

What is Big Data?

Wikipedia: “Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.”

Gartner: “Big Data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” (The three “V”s of Big Data.)

(11)

V is for Volume (or coping with exponential growth)

(12)

V is for Velocity (or coping with exponential growth, part 2)

ECMWF’s archive grows exponentially:

At ECMWF, the annual growth rate r is around 0.5, which is a 50% increase per year

– The daily amount of data added to the archive grows exponentially at the same rate!

In 1995, the size of the archive was increasing at a rate of 14 TB/year.

In 2014, the size of the archive increases at a rate higher than 60 TB/day (with peaks at 100 TB).
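
As a rough check, the two figures above can be compared with the quoted growth rate. This is a minimal sketch assuming simple compound growth, rate(t) = rate(0) · (1 + r)^t; that form is an assumption, since the slide's own formula did not survive in this text, and the variable names are illustrative.

```python
import math

# Figures quoted on the slide, converted to TB/year.
rate_1995 = 14.0          # TB/year in 1995
rate_2014 = 60.0 * 365    # 60 TB/day in 2014, expressed as TB/year
years = 2014 - 1995

# Assumed model (not taken from the slide): rate(t) = rate(0) * (1 + r) ** t
r = (rate_2014 / rate_1995) ** (1.0 / years) - 1.0

# Doubling time for r = 0.5, the value quoted on the slide.
doubling_time = math.log(2) / math.log(1.5)

print(f"implied annual growth rate: {r:.2f}")
print(f"doubling time at r = 0.5: {doubling_time:.2f} years")
```

Under this assumption the implied rate comes out a little below 0.5, which is consistent with "around 0.5", and the ingest rate doubles in under two years.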
(13)

V is for Variety (or coping with complexity)

[Figure: timeline of model resolutions (T106L16 to T1279L91), supercomputers (X-MP/4 to IBM-P6) and the data types added to the archive over time (for example EPS, waves, 4D-Var, ensemble data assimilation, monthly forecasts), plotted against a logarithmic axis from 100M to 10T.]

(14)

ECMWF’s Meteorological Archival and Retrieval System (MARS)

28 years in existence

A managed archive

MARS is not a file system

– Users are not aware of the location of the data

– Retrievals are expressed in meteorological terms

An archive, not a database

– Metadata online
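
To illustrate what "expressed in meteorological terms" can mean, here is a minimal sketch in which a request names values along meteorological axes and expands to one entry per field. The key names (`date`, `param`, `level`, `step`) and the `expand` helper are hypothetical; this is not MARS's actual request language.

```python
from itertools import product


def expand(request):
    """Expand a request into one dict per individual field.

    Each key is an axis; the fields requested are the Cartesian
    product of the values along every axis.
    """
    axes = sorted(request)
    return [dict(zip(axes, combo))
            for combo in product(*(request[a] for a in axes))]


# Hypothetical request: no file names or storage locations appear.
request = {
    "date": ["2014-01-01", "2014-01-02"],
    "param": ["temperature", "geopotential"],
    "level": [500, 850],
    "step": [0, 12, 24],
}

fields = expand(request)
print(len(fields))  # 2 * 2 * 2 * 3 = 24
```

The user describes what is wanted, not where it is stored; resolving each field to a location is left to the archive.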

(15)

ECMWF’s Meteorological Archival and Retrieval System (MARS)

10^14 directly addressable objects

– Unique hypercube-based indexing

Data is kept forever:

– For many studies, a dataset becomes useful once enough data has been accumulated

– Deleting old data in an exponentially growing archive is meaningless

200 million objects/65 TB added daily

7000 registered users

650 active users, 100 TB retrieved per day, in 1.5 million requests
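
The point about deleting old data follows from the growth rate. A minimal sketch, assuming the total archive size grows by a constant 50% per year (the r of around 0.5 quoted earlier), shows how small a share of today's archive is older than a given age. The function name is illustrative.

```python
def fraction_older_than(age_years, r=0.5):
    """Share of the current archive that is at least `age_years` old,
    assuming the total size grows by a factor (1 + r) every year."""
    return (1.0 + r) ** -age_years


for age in (1, 5, 10):
    print(f"older than {age:2d} years: {fraction_older_than(age):.1%}")
```

Under this assumption, everything older than ten years accounts for less than 2% of the archive, so deleting it would free very little space.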
(16)

Scalability and sustainability

Distributed service-oriented architecture, to scale out

Queues and priorities to ensure quality of service and scaling with the demand

Indirection is the key to scalability:

– Allow services to be modified/redeployed…

– Allow data to be moved to different media/storage…

– …without any impact on users

It has allowed us to migrate several times during the past decades:

– MVS to AIX, AIX to Linux, single server to clusters
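
The indirection described above can be sketched as a catalogue that maps a meteorological key to the current storage location. The `Catalogue` class, its methods and the location strings are hypothetical illustrations and are not taken from MARS.

```python
class Catalogue:
    """Maps a field key (meteorological terms) to a storage location."""

    def __init__(self):
        self._locations = {}

    def archive(self, key, location):
        self._locations[key] = location

    def migrate(self, key, new_location):
        # Data moves to other media; only the catalogue entry changes.
        if key not in self._locations:
            raise KeyError(key)
        self._locations[key] = new_location

    def locate(self, key):
        return self._locations[key]


catalogue = Catalogue()
key = ("2014-01-01", "temperature", 500)   # what the user asks for
catalogue.archive(key, "disk:/pool1/obj42")

before = catalogue.locate(key)
catalogue.migrate(key, "tape:/vol7/obj42")
after = catalogue.locate(key)

print(before, "->", after)  # the user's key is unchanged throughout
```

Because users only ever hold the key, the storage behind it can change without any change to their requests.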

(17)

Continuous evolution

Change is driven by new needs and expectations

– Users are not domain experts anymore…

– …users expect information at “Google speed”

ECMWF is continuously evolving its data delivery methods and services to cater for the new user requirements

Interoperability is key

– Follow standards and governance to enable interoperability

– OGC, INSPIRE, ISO 19xxx series, NetCDF-CF, WMO Information System, GEOSS

Provide high-level services on the data

– Data portals for data discovery

– Web services, REST APIs for data retrieval and manipulation

(18)
(19)
(20)

The next steps…

The era of pushing the data to the user is coming to an end

● The volumes involved are too large

We live in a post-PC era…

● Users want to access data from smart devices…

● …anywhere, anytime, any device…

● … and share their results.

We need to bring the user processing to the data:

– Cloud Computing is now mature enough to implement operational services

– We need to build a “platform as a service” (PaaS), on which to build “software as a service” (SaaS) solutions for environmental data and products

(21)
