Big Data and Cloud Computing for GHRSST
Jean-Francois Piollé ([email protected])
Frédéric Paul, Olivier Archer
CERSAT / Institut Français de Recherche pour l’Exploitation de la Mer
Facing data deluge
Today’s LTSRF archive: 49 TB
⇒ Increasing number of operational satellites, forthcoming Chinese / Indian programs
⇒ Increasing sensor spatial and temporal resolution
Challenges
How to allow a high revisiting rate of historical (and present) data?
How to perform data-intensive processing?
How to afford a large online archive?
How to transfer data to the user?
How to store data locally?
Bottlenecks: processing, storage, network
Can new big data and cloud computing technologies help with that?
How to cope with data volume?
⇒ usage of high-resolution data is OK for case studies: limited amount of data
⇒ current solution for long time series: generation of high-level fusion products (L3 / L4)
⇒ involves data transformation: averaging, smoothing, …
⇒ suitable for some applications only
⇒ what about more data-intensive applications?
⇒ highest spatial and temporal resolution
⇒ feature detection (fronts, eddies, …)
⇒ data merging and synergy
Are massive, central, static, one-way archive centers still relevant?
Data center ↔ User: data (Tbytes!) versus processing (Mbytes…)
Main aspects to consider
Storage (hardware)
File system
Workflow management
Virtualization
Cloud computing
Data organization and format
Data analysis
User services
Big data: a very confusing term…
⇒ How to deal with data volume growth and complexity to extract fast and relevant information?
⇒ Approach to design and strategies for large volumes of data
⇒ Issues with data management, organization, storage, processing
Cloud computing: also very confusing
⇒ In our context, offering flexible remote processing capability
⇒ Virtualization + dynamic allocation of resources
Storage
Online storage on disk required for archives
⇒ Restoration from tapes : 500 GB / day
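To see why tape restoration alone is limiting, here is a rough worked figure. It is only a sketch: it assumes decimal units (1 TB = 1000 GB), a constant restoration rate, and that the whole 49 TB LTSRF archive quoted earlier would have to be pulled back from tape. The function name is illustrative.

```python
# Assumptions: decimal units (1 TB = 1000 GB) and a constant restore rate.
ARCHIVE_TB = 49                  # LTSRF archive size quoted earlier
TAPE_RESTORE_GB_PER_DAY = 500    # tape restoration rate from the slide


def restore_days(archive_tb: float, rate_gb_per_day: float) -> float:
    """Days needed to restore a whole archive from tape at a fixed rate."""
    if rate_gb_per_day <= 0:
        raise ValueError("rate must be positive")
    return archive_tb * 1000 / rate_gb_per_day


days = restore_days(ARCHIVE_TB, TAPE_RESTORE_GB_PER_DAY)
print(f"Full restoration from tape: about {days:.0f} days")
```

Under these assumptions a full pass over the archive takes on the order of three months, which is hard to reconcile with a high revisiting rate.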
Which technologies should be considered?
• Big data centers (Google, Facebook, …) rely on cheap hardware ⇒ weaker reliability balanced by duplication/redundancy
• Strongly interrelated with the file system (e.g. management of redundancy, distribution, …)
• Connection strategy with processing nodes to be considered (data-intensive architecture taking data topology into account to distribute jobs "closest to the data"): processing and network performance while keeping a low budget
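The trade-off between cheap hardware and duplication can be illustrated with a minimal sketch. It assumes independent node failures, which real clusters only approximate, and the 5 % failure probability is a made-up value for illustration, not a measured figure.

```python
def loss_probability(p_node_failure: float, replicas: int) -> float:
    """Probability that every replica of a chunk is lost.

    Assumes each replica sits on a different node and that nodes
    fail independently with the same probability.
    """
    if not 0.0 <= p_node_failure <= 1.0:
        raise ValueError("p_node_failure must be in [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return p_node_failure ** replicas


# Hypothetical failure probability for a cheap node over some period.
P_FAIL = 0.05
for n in (1, 2, 3):
    print(f"{n} replica(s): loss probability {loss_probability(P_FAIL, n):.6f}")
```

Each extra replica multiplies the loss probability by the node failure probability, at the price of proportionally more raw disk.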
File systems
⇒ Parallel and distributed
⇒ Large volume : disk cluster seen as one virtual space
Lustre
⇒ Complex administration (scalability, …)
⇒ No redundancy
⇒ Poor fault tolerance
MooseFS
⇒ Simple administration
⇒ Scalability
⇒ Reliability and robustness (redundancy implemented through replicas, and soon parity bits)
⇒ No quota (coming soon)
GlusterFS
⇒ Complex maintenance and administration
⇒ Poor reliability
⇒ Not suitable for large numbers of files
HDFS (Hadoop)
⇒ Performance for streaming and massive distributed processing
⇒ Requires a specific API for data access
⇒ Hadoop is optimized for key/value data structures, not image/swath-type structures
Cloud computing
Providing remote access and resources to users
Previous solutions :
• Ssh to server: limited allocation of resources (unmanageable), strong security issues
• Ssh to supercomputer
⇒ Expensive solution for data-intensive applications (no communication between processing nodes)
⇒ Strong environment constraints => specific system/software/libs/…
⇒ Often not at the same location as the data centers
• Grid technology
⇒ Quite complex to use
⇒ Strong environment constraints => specific system/software/libs/…
Cloud computing
Virtualization
=> deploy a dedicated, customized system environment for each user (OS, libraries, software, ...)
=> remote machine close to the user's familiar environment
Cloud computing
=> management of resources, allocation/deployment of virtual servers
IaaS: infrastructure
PaaS: platform (server + tools for processor integration, scheduling or reprocessing tasks, ...)
SaaS: software
=> sustainability of processing environments
Private/public clouds
=> public clouds (Amazon S3, ...): expensive (to be revised according to Ken), not adapted to large volumes of data, concerns about sustainability
=> private cloud: restricted to within the institute
=> hybrid clouds: private cloud with controlled access for external users. Security issues to be solved.
CERSAT Nephelae platform
Storage (hardware): 400 TB – 414 cores
File system: MooseFS – full replication
Workflow management: PBS Pro – Torque Maui; data topology not taken into account
Virtualization: KVM – Ubuntu / CentOS with Matlab, scientific Python
Cloud computing: OpenStack, inherited from Nebula; also tried Eucalyptus
Data organization and format: netCDF4 – conversion effort for existing datasets; 15.8 TB for GHRSST
Data analysis / User services: access through ssh; possible remote desktop with …
Feedback and experiences
Engineering perspective
1. Cost of commercial solutions and lack of optimization of the storage vs processing strategy
2. Reliability of file systems (not losing any data) varies from one file system to another. A longer assessment (and more mistakes) is needed.
3. Virtualization and input/output performance: drop of 50 %, about to be solved
4. Still entirely to be addressed: using storage topology to distribute processing to the closest node
5. Access security issues when opening the platform to external users
6. Stability of most components, lack of documentation
7. Lack of available expertise for our specific requirements
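Point 4 (distributing processing to the node closest to the data) can be sketched as a greedy assignment. This is not what runs on the platform, where data topology is not yet taken into account; it only illustrates the idea. The function name, the file-to-node map and the node names are all hypothetical.

```python
from typing import Dict, List


def assign_jobs(file_replicas: Dict[str, List[str]],
                nodes: List[str]) -> Dict[str, str]:
    """Assign one job per file, preferring a node that stores the file.

    Among nodes holding a replica, pick the least loaded one; if no
    processing node holds the file, fall back to the least loaded node.
    """
    load = {node: 0 for node in nodes}
    assignment = {}
    for path, holders in file_replicas.items():
        local = [n for n in holders if n in load]
        candidates = local or nodes
        target = min(candidates, key=lambda n: load[n])
        assignment[path] = target
        load[target] += 1
    return assignment


# Hypothetical replica map: file -> storage nodes holding a copy.
replicas = {"a.nc": ["n1", "n2"], "b.nc": ["n1"], "c.nc": ["n3"]}
print(assign_jobs(replicas, ["n1", "n2"]))
```

A real scheduler would also weigh CPU and memory availability; the sketch only shows how a replica map from the file system could drive job placement.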
Feedback and experiences
Usage perspective
1. Used for reprocessing campaigns:
=> deployment of an external partner's processor on a platform matching the developer's requirements, then reprocessing
=> also allowed saving the processing environments and replaying part of the reprocessing later under exactly the same conditions
=> continuous reprocessing capability
2. Sandbox for various project contributors using and sharing the same data
=> product intercomparison and merging
=> testing new algorithms, perturbing initial conditions or settings
3. Systematic analysis of a dataset
=> detection of features in SST images
=> conversion to NetCDF4
The batch processing tools we implemented (which take a list of data files as input) were a great help.
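A batch tool driven by a list of data files could look roughly like the sketch below. It is not the CERSAT implementation: the function names, the chunk size and the placeholder processor are assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def read_file_list(lines: Iterable[str]) -> List[str]:
    """Keep non-empty, non-comment lines as data file paths."""
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#")]


def chunk(paths: List[str], size: int) -> List[List[str]]:
    """Split the file list into chunks, e.g. one chunk per batch job."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [paths[i:i + size] for i in range(0, len(paths), size)]


def run_batch(paths: List[str],
              process: Callable[[str], None]) -> Tuple[List[str], List[str]]:
    """Apply `process` to each file; return (processed, failed) paths."""
    done, failed = [], []
    for path in paths:
        try:
            process(path)
            done.append(path)
        except Exception:
            # Record the failure and continue with the remaining files.
            failed.append(path)
    return done, failed


# Hypothetical usage: the processor here just prints the file name.
files = read_file_list(["sst_001.nc\n", "# skipped\n", "sst_002.nc\n"])
for job_files in chunk(files, 1):
    print(run_batch(job_files, lambda p: print("processing", p)))
```

Keeping the failed list separate makes it possible to resubmit only those files, which fits the replay-style reprocessing described above.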