Big Data and Cloud Computing for GHRSST

(1)

Big Data and Cloud Computing for

GHRSST

Jean-Francois Piollé ([email protected])

Frédéric Paul, Olivier Archer

CERSAT / Institut Français de Recherche pour l’Exploitation

de la Mer

(2)

Facing data deluge

Today’s LTSRF archive : 49 TB

⇒ Increasing number of operational satellites, forthcoming Chinese /

Indian programs

⇒ increasing sensor spatial and temporal resolution

Challenges

How to allow high revisiting rate of

historical (and present) data ?

How to perform data intensive processing ?

How to afford large online archive ?

How to transfer data to user?

How to store locally data?

Processing

bottleneck

Storage

bottleneck

Network

bottleneck

Can new big data and cloud computing technologies help with that ?

(3)

How to cope with data volume?

⇒ usage of high-resolution data ok for case studies : limited amount of data

⇒ current solution for long time series : generation of high-level fusion products (L3

/ L4)

⇒ involves data transformation : averaging, smoothing,….

⇒ suitable for some applications only

⇒ what about more data intensive applications ?

⇒ highest spatial and temporal resolution

⇒ feature detection (front, eddies, ….)

⇒ data merging and synergy

Are massive central static and one-way archive centers still relevant ?

Data center User

Data – Tbytes !

Processings – Mbytes…

(4)

How to cope with data volume?

⇒ usage of high-resolution data ok for case studies : limited amount of data

⇒ current solution for long time series : generation of high-level fusion products (L3

/ L4)

⇒ involves data transformation : averaging, smoothing,….

⇒ suitable for some applications only

⇒ what about more data intensive applications ?

⇒ highest spatial and temporal resolution

⇒ feature detection (front, eddies, ….)

⇒ data merging and synergy

Are massive central static and one-way archive centers still relevant ?

Data center User

Data – Tbytes !

Processings – Mbytes…

(5)

Main aspects to consider

Storage (hardware)

File system

Workflow management

Virtualization

Cloud computing

Data organization and format

Data analysis User services Big data : very confusing term…

how to deal with data volume

growth and complexity to extract

fast and relevant information ?

Approach to design and strategies

for large volume of data

Issues with data management,

organization, storage, processing

Cloud computing : also very

confusing

In our context, offering flexible

remote processing capability

Virtualization + dynamic allocation

of resources

(6)

storage

Online storage on disk required for archives

⇒ Restoration from tapes : 500 GB / day

What technologies considered ?

• Big data centers (Google, Facebook,

…) rely on cheap hardware => weaker reliability balanced by

duplication/redundancy

• Strongly inter-related with file system (ex : management of redundancy, distribution,…)

• Connection strategy with

processing nodes to be considered (data intensive architecture taking into account data topology for job distribution « closest to the data ») – processing and network

performances while keeping low budget

(7)

File systems

⇒ Parallel and distributed

⇒ Large volume : disk cluster seen as one virtual space

Lustre

⇒Complex administration (scalability, …)

⇒No redundancy

⇒Bad fault tolerance

MooseFS

⇒Simple administration

⇒Scalability

⇒Reliability and robustness (redundancy implemented through replicates, and soon parity bit)

⇒No quota (soon)

GlusterFS ⇒Complex maintenance and administration

⇒Bad reliability

⇒Not suitable for large number of files

HDFS (Hadoop)

Performance for streaming and massive distributed processing ⇒Requires specific API for data access

⇒ Hadoop optimized for key/value data structure, not image/swath type structure

(8)

Cloud computing

Providing remote access and resources to users

Previous solutions :

• Ssh to server : limited allocation of resource (unmanageable), strong security

issues

• Ssh to supercomputer

• Expensive solution for data intensive applications (no communication between processing nodes)

• Strong environment constraints => specific system/software/libs/…

• Often not at the same location than data centers

• Grid technology

• Quite complex to use

• Strong environment constraints => specific system/software/libs/…

(9)

Cloud computing

Virtualization

=> deploy user dedicated and customized system environment (os and libraries, softwares,...)

=> remote machine close to user familiar environment

Cloud computing

=> management of ressources, allocation/deployment of virtual servers

IAAS : infrastructure

PAAS : platform (server + tools for processor integration, scheduling or reprocessing taks,...) SAAS : software

=> sustainability of processing environments

Private/public clouds

=> public clouds (Amazon S3,...) : expensive to be revised according to Ken), not adapted for large volume of data, concerns with sustainability

=> private cloud : restricted to within institute

=> hybrid clouds : private cloud with controlled access for external users. Security issues to be solved.

(10)

CERSAT Nephelae platform

Storage (hardware)

File system

Workflow management

Virtualization

Cloud computing

Data organization and format

Data analysis User services

400 TB – 414 Cores

Moose FS – full replication

netCDF4 – conversion effort for

existing datasets 15.8 TB for GHRSST PBS Pro – Torque Maui

data topology not taken into account KVM – Ubuntu / Cent-OS

w/ Matlab, scientific python

OpenStack, inherited from Nebula. Access through ssh.

Possible remote desktop with tried also Eucalyptus

OpenStack, inherited from Nebula tried also Eucalyptus

(11)

Feedback and experiences

Engineering perspective

1. Cost of commercial solutions and lack of optimization of storage vs processing

strategy

2. reliability of file systems (not to loose any data) is variable depending on the

file system. Longer assessment (and mistakes) is needed.

3. virtualization and input/output performances : drop by 50 %, about to be

solved

4. still completely to be addressed : using storage topology to distribute

processing to closest node

5. access security issues for external opening

6. stability status of most components, lack of documentation

7. lack of available expertise for our specific requirements

(12)

Feedback and experiences

Usage perspective

1. Used for reprocessing campaigns :

=> deployment of external partner's processor on platform matching developer's requirements and reprocessing

⇒ also allowed to save the processing environments and replay some part of the reprocessing later in the exact same conditions

⇒ Continuous re-processing capabality

2. Sandbox for various project contributors using and sharing the same data

=> product intercomparison and merging

=> test of new algorithms, perturbating initial conditions or settings

3. Systematic analysis of a dataset

=> detection of features in SST images

=> conversion to NetCDF4

Great help of the batch processing tools we have implemented (take a list of data files as input)

(13)