• No results found

Big Data and Cloud Computing for GHRSST

N/A
N/A
Protected

Academic year: 2021

Share "Big Data and Cloud Computing for GHRSST"

Copied!
13
0
0

Loading.... (view fulltext now)

Full text

(1)

Big Data and Cloud Computing for

GHRSST

Jean-Francois Piollé ([email protected])

Frédéric Paul, Olivier Archer

CERSAT / Institut Français de Recherche pour l’Exploitation

de la Mer

(2)

Facing data deluge

Today’s LTSRF archive : 49 TB

⇒ Increasing number of operational satellites, forthcoming Chinese /

Indian programs

⇒ increasing sensor spatial and temporal resolution

Challenges

How to allow high revisiting rate of

historical (and present) data ?

How to perform data intensive processing ?

How to afford large online archive ?

How to transfer data to user?

How to store locally data?

Processing

bottleneck

Storage

bottleneck

Network

bottleneck

Can new big data and cloud computing technologies help with that ?

(3)

How to cope with data volume?

⇒ usage of high-resolution data ok for case studies : limited amount of data

⇒ current solution for long time series : generation of high-level fusion products (L3

/ L4)

⇒ involves data transformation : averaging, smoothing,….

⇒ suitable for some applications only

⇒ what about more data intensive applications ?

⇒ highest spatial and temporal resolution

⇒ feature detection (front, eddies, ….)

⇒ data merging and synergy

Are massive central static and one-way archive centers still relevant ?

Data center User

Data – Tbytes !

Processings – Mbytes…

(4)

How to cope with data volume?

⇒ usage of high-resolution data ok for case studies : limited amount of data

⇒ current solution for long time series : generation of high-level fusion products (L3

/ L4)

⇒ involves data transformation : averaging, smoothing,….

⇒ suitable for some applications only

⇒ what about more data intensive applications ?

⇒ highest spatial and temporal resolution

⇒ feature detection (front, eddies, ….)

⇒ data merging and synergy

Are massive central static and one-way archive centers still relevant ?

Data center User

Data – Tbytes !

Processings – Mbytes…

(5)

Main aspects to consider

Storage (hardware)

File system

Workflow management

Virtualization

Cloud computing

Data organization and format

Data analysis User services Big data : very confusing term…

how to deal with data volume

growth and complexity to extract

fast and relevant information ?

Approach to design and strategies

for large volume of data

Issues with data management,

organization, storage, processing

Cloud computing : also very

confusing

In our context, offering flexible

remote processing capability

Virtualization + dynamic allocation

of resources

(6)

storage

Online storage on disk required for archives

Restoration from tapes : 500 GB / day

What technologies considered ?

Big data centers (Google, Facebook,

…) rely on cheap hardware => weaker reliability balanced by

duplication/redundancy

Strongly inter-related with file system (ex : management of redundancy, distribution,…)

Connection strategy with

processing nodes to be considered (data intensive architecture taking into account data topology for job distribution « closest to the data ») – processing and network

performances while keeping low budget

(7)

File systems

⇒ Parallel and distributed

⇒ Large volume : disk cluster seen as one virtual space

Lustre

Complex administration (scalability, …)

No redundancy

Bad fault tolerance

MooseFS

Simple administration

Scalability

⇒Reliability and robustness (redundancy implemented through replicates, and soon parity bit)

No quota (soon)

GlusterFS ⇒Complex maintenance and administration

Bad reliability

Not suitable for large number of files

HDFS (Hadoop)

Performance for streaming and massive distributed processing Requires specific API for data access

Hadoop optimized for key/value data structure, not image/swath type structure

(8)

Cloud computing

Providing remote access and resources to users

Previous solutions :

• Ssh to server : limited allocation of resource (unmanageable), strong security

issues

• Ssh to supercomputer

• Expensive solution for data intensive applications (no communication between processing nodes)

• Strong environment constraints => specific system/software/libs/…

• Often not at the same location than data centers

• Grid technology

• Quite complex to use

• Strong environment constraints => specific system/software/libs/…

(9)

Cloud computing

Virtualization

=> deploy user dedicated and customized system environment (os and libraries, softwares,...)

=> remote machine close to user familiar environment

Cloud computing

=> management of ressources, allocation/deployment of virtual servers

IAAS : infrastructure

PAAS : platform (server + tools for processor integration, scheduling or reprocessing taks,...) SAAS : software

=> sustainability of processing environments

Private/public clouds

=> public clouds (Amazon S3,...) : expensive to be revised according to Ken), not adapted for large volume of data, concerns with sustainability

=> private cloud : restricted to within institute

=> hybrid clouds : private cloud with controlled access for external users. Security issues to be solved.

(10)

CERSAT Nephelae platform

Storage (hardware)

File system

Workflow management

Virtualization

Cloud computing

Data organization and format

Data analysis User services

400 TB – 414 Cores

Moose FS – full replication

netCDF4 – conversion effort for

existing datasets 15.8 TB for GHRSST PBS Pro – Torque Maui

data topology not taken into account KVM – Ubuntu / Cent-OS

w/ Matlab, scientific python

OpenStack, inherited from Nebula. Access through ssh.

Possible remote desktop with tried also Eucalyptus

OpenStack, inherited from Nebula tried also Eucalyptus

(11)

Feedback and experiences

Engineering perspective

1. Cost of commercial solutions and lack of optimization of storage vs processing

strategy

2. reliability of file systems (not to loose any data) is variable depending on the

file system. Longer assessment (and mistakes) is needed.

3. virtualization and input/output performances : drop by 50 %, about to be

solved

4. still completely to be addressed : using storage topology to distribute

processing to closest node

5. access security issues for external opening

6. stability status of most components, lack of documentation

7. lack of available expertise for our specific requirements

(12)

Feedback and experiences

Usage perspective

1. Used for reprocessing campaigns :

=> deployment of external partner's processor on platform matching developer's requirements and reprocessing

⇒ also allowed to save the processing environments and replay some part of the reprocessing later in the exact same conditions

⇒ Continuous re-processing capabality

2. Sandbox for various project contributors using and sharing the same data

=> product intercomparison and merging

=> test of new algorithms, perturbating initial conditions or settings

3. Systematic analysis of a dataset

=> detection of features in SST images

=> conversion to NetCDF4

Great help of the batch processing tools we have implemented (take a list of data files as input)

(13)

Questions for GHRSST

These technologies are quite new and unstable. Limited real expertise is available,

technical challenges are yet to be tackled – especially for scientific data – but many

initiatives are popping up (physics, space agencies,…). Is it a new paradigm for data

centers ?

Will only help with some applications : not an answer to everything (traditional

technologies still works) ! Complementary tool to current data center services

GHRSST should be concerned about the capability building around its data heritage and

the user services for the exploitation of past data => from user perspective (not data

producer)

What are the experience and prospects at GHRSST main data nodes (PODAAC, NODC) ?

=> Necessary to share and possibly homogenize or interconnect the available services

Should these aspects be part of GHRSST strategic plan ?

References

Related documents

The Court shall have exclusive jurisdiction with respect to all matters arising from or related to the interpretation or implementation of this Stipulation, and the

The secondary objectives were to assess the adjusted prevalence of CM according to clinical presentation and patient characteristics, to determine crude 90-day survival according

Other studies have found associations between short sleep duration and poor sleep quality to be associated with health issues among LEO, but inflammatory biomarkers were not

8.4 Data collected in health care and biomedical research contexts are not intrinsically more or less ‘sensitive’ than other data relating to individuals, but the medical context in

Informed by the theory of community of practice (COP), this paper examines how two Saudi first-year students who are pursuing their master degree in TESOL at American

Membranipora Lacroixii and Bowerbankia imbricata were taken up as far as half a mile above Halton Quay, where the salinity range from high to low water was approximately 210/00to

Abbreviations and acronyms: ABCD, Appropriate Blood pressure Control in Diabetes; ABI, ankle-brachial index; ABPM, ambulatory blood pressure monitoring; ACCESS, Acute

However, a more nuanced question is whether contemporary information and communication technologies (ICTs) such as computers and mobile phones are engendering new ways for us