The Ophidia framework: toward big data analytics for climate change
Dr. Sandro Fiore
Leader, Scientific data management research group
Scientific Computing Division @ CMCC
Prof. Giovanni Aloisio
A set of requirements has been discussed with our scientists to identify the
key data analytics needs.
Preliminary requirements and needs focus on:
Time series analysis
Data reduction (e.g. by aggregation)
Data subsetting
Model intercomparison
Multimodel means
Data transformation (through array-based primitives)
Parameter sweep experiments (same task applied on a set of data)
Comparison between historical data and future scenarios
Map generation
Ensemble analysis
Data analytics workflow support
But also…
Performance, re-usability, extensibility
The Ophidia Project
S. Fiore, A. D’Anca, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, “Ophidia: toward big data analytics for eScience”, ICCS2013 Conference, Procedia Elsevier, Barcelona, June 5-7, 2013
Ophidia is a research effort carried out at the Euro Mediterranean Centre on Climate Change (CMCC) to
address “big data” challenges, issues and requirements for climate
Ophidia Architecture 1.0
[Figure: architecture diagram. Layers: front end, compute layer, I/O layer (I/O server instances), storage layer, system catalog.]
Features called out on the diagram:
Analytics framework
Array-based primitives
Standard interfaces
Declarative language
Partitioning/hierarchical data management
Storage model (dimension-independent) and implementation
Array-based support and hierarchical storage
Operators “Data processing” – Domain-agnostic
OPERATOR NAME | OPERATOR DESCRIPTION
OPH_APPLY(datacube_in, datacube_out, array_based_primitive) | Creates the datacube_out by applying the array-based primitive to the datacube_in
OPH_DUPLICATE(datacube_in, datacube_out) | Creates a copy of the datacube_in in the datacube_out
OPH_SUBSET(datacube_in, subset_string, datacube_out) | Creates the datacube_out by doing a sub-setting of the datacube_in by applying the subset_string
OPH_MERGE(datacube_in, merge_param, datacube_out) | Creates the datacube_out by merging groups of merge_param fragments from datacube_in
OPH_SPLIT(datacube_in, split_param, datacube_out) | Creates the datacube_out by splitting each fragment of the datacube_in into groups of split_param fragments
OPH_INTERCOMPARISON(datacube_in1, datacube_in2, datacube_out) | Creates the datacube_out which is the element-wise difference between datacube_in1 and datacube_in2
OPH_DELETE(datacube_in) | Removes the datacube_in
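The fragment-level behaviour described in the table can be pictured with a small NumPy sketch. This is only an illustration under stated assumptions, not Ophidia code: a datacube is modelled as a plain Python list of fragments, each fragment is a 2-D array of shape (tuples, elements), and the function names `apply`, `merge`, `split` and `intercomparison` are hypothetical stand-ins for the operators.

```python
import numpy as np

# Toy model (an assumption, not Ophidia's storage format): a datacube is a
# list of fragments; each fragment is a 2-D array (tuples x elements).

def apply(cube, primitive):
    """Like OPH_APPLY: run an array primitive on every tuple of every fragment."""
    return [np.array([primitive(row) for row in frag]) for frag in cube]

def merge(cube, group):
    """Like OPH_MERGE: concatenate each run of `group` fragments into one."""
    return [np.concatenate(cube[i:i + group]) for i in range(0, len(cube), group)]

def split(cube, parts):
    """Like OPH_SPLIT: cut every fragment into `parts` smaller fragments."""
    return [piece for frag in cube for piece in np.array_split(frag, parts)]

def intercomparison(cube_a, cube_b):
    """Like OPH_INTERCOMPARISON: element-wise difference, fragment by fragment."""
    return [a - b for a, b in zip(cube_a, cube_b)]

# Four fragments of 4 tuples x 3 elements each.
cube = [np.arange(12, dtype=float).reshape(4, 3) + 100 * k for k in range(4)]

doubled = apply(cube, lambda row: row * 2.0)   # primitive applied per tuple
merged = merge(cube, 2)                        # 4 fragments -> 2 fragments of 8 tuples
restored = split(merged, 2)                    # back to 4 fragments of 4 tuples
diff = intercomparison(cube, restored)         # all zeros: split undoes merge here
```

In this toy model, splitting with the same parameter used for merging restores the original fragmentation, which is why the final difference is zero everywhere.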
Operators “Data processing” – Domain-oriented
OPERATOR NAME | OPERATOR DESCRIPTION
OPH_EXPORT_NC(datacube_in, file_out) | Exports the datacube_in data into the file_out NetCDF file.
OPH_IMPORT_NC(file_in, datacube_out) | Imports the data stored into the file_in NetCDF file into the new datacube_out datacube
Operators “Data access”
OPH_INSPECT_FRAG(datacube_in, fragment_in) | Inspects the data stored in the fragment_in from the datacube_in
OPH_PUBLISH(datacube_in) | Publishes the datacube_in fragments into HTML pages
Operators “Metadata”
OPH_CUBE_ELEMENTS(datacube_in) | Provides the total number of the elements in the datacube_in
OPH_CUBE_SIZE(datacube_in) | Provides the disk space occupied by the datacube_in
OPH_LIST(void) | Provides the list of available datacubes.
OPH_CUBEIO(datacube_in) | Provides the provenance information related to the datacube_in
OPH_FIND(search_param) | Provides the list of datacubes matching the search_param criteria
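To make the metadata operators more concrete, here is a minimal in-memory catalog sketch. Everything in it is hypothetical: the `CubeRecord` and `Catalog` classes, the (fragments, tuples, elements) shape triple, the 8-byte value size and the substring search are assumptions chosen for illustration, and the real system catalog may work quite differently.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class CubeRecord:
    name: str
    shape: Tuple[int, int, int]        # (fragments, tuples per fragment, elements per tuple)
    parent: Optional[str] = None       # datacube this one was derived from
    operator: Optional[str] = None     # operator that produced it

class Catalog:
    """Hypothetical stand-in for the system catalog."""

    def __init__(self):
        self.records = {}

    def register(self, record):
        self.records[record.name] = record

    def cube_elements(self, name):     # in the spirit of OPH_CUBE_ELEMENTS
        fragments, tuples, elements = self.records[name].shape
        return fragments * tuples * elements

    def cube_size(self, name, bytes_per_value=8):   # rough analogue of OPH_CUBE_SIZE
        # Assumption: uncompressed 8-byte values; real disk usage may differ.
        return self.cube_elements(name) * bytes_per_value

    def list_cubes(self):              # in the spirit of OPH_LIST
        return sorted(self.records)

    def cubeio(self, name):            # in the spirit of OPH_CUBEIO (provenance)
        chain = []
        record = self.records[name]
        while record.parent is not None:
            chain.append((record.parent, record.operator, record.name))
            record = self.records[record.parent]
        return list(reversed(chain))

    def find(self, text):              # in the spirit of OPH_FIND
        return [name for name in sorted(self.records) if text in name]

catalog = Catalog()
catalog.register(CubeRecord("tasmax", (4, 10, 100)))
catalog.register(CubeRecord("tasmax_max", (4, 10, 1), parent="tasmax", operator="OPH_APPLY"))
catalog.register(CubeRecord("tasmax_merged", (1, 40, 1), parent="tasmax_max", operator="OPH_MERGE"))

provenance = catalog.cubeio("tasmax_merged")
```

The provenance chain is returned oldest step first, so it reads as the sequence of operators that led from the imported datacube to the queried one.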
The analytics framework: datacube operators (about 50)
Metadata management (sequential and parallel operators)
Data processing (parallel operators, MPI & OpenMP based)
Data Access (sequential and parallel operators)
Import/Export (parallel operators)
• The array data type support is not enough to provide scientific data management capabilities… primitives are needed as well.
• A set of array-based primitives has been implemented
• A primitive is applied to a single array
• They come in the form of plugins (I/O server extensions)
• So far, Ophidia provides a wide set of plugins (about 100) to perform data reduction (by aggregation), sub-setting, predicates evaluation, statistical analysis, compression, and so forth.
• Support is provided both for byte-oriented and bit-oriented arrays
• Plugins can be nested to get more complex functionalities
• Compression is provided as a primitive too
Array based primitives: OPH_MATH ("SIGN")
oph_math(measure, "OPH_SIGN", "OPH_DOUBLE")
Array based primitives: OPH_BOXPLOT
oph_boxplot(measure, "OPH_DOUBLE")
Array based primitives: nesting feature
oph_boxplot(oph_subarray(oph_uncompress(measure), 1,18), "OPH_DOUBLE")
subarray(measure, 1,18)
Array based primitives: oph_aggregate
oph_aggregate(measure, "oph_avg")
Vertical aggregation
Single chunk or fragment (input)
Single chunk or fragment (output)
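The primitive calls above can be mimicked with NumPy to show what each one conceptually computes. This is a sketch under assumptions, not the plugins themselves: the subarray bounds are assumed to be 1-based and inclusive, the boxplot is assumed to return a five-number summary, vertical aggregation is assumed to average across the tuples of a fragment, and decompression is left out.

```python
import numpy as np

def sign(arr):
    """Stands in for oph_math(measure, "OPH_SIGN", ...): sign of each element."""
    return np.sign(arr)

def subarray(arr, start, end):
    """Stands in for oph_subarray: assumed 1-based, inclusive bounds."""
    return arr[start - 1:end]

def boxplot(arr):
    """Stands in for oph_boxplot: assumed five-number summary (min, Q1, median, Q3, max)."""
    return np.percentile(arr, [0, 25, 50, 75, 100])

def aggregate_avg(fragment):
    """Stands in for oph_aggregate(measure, "oph_avg"): average across tuples."""
    return fragment.mean(axis=0)

# One array of 24 values (a single tuple's measure).
measure = np.linspace(-5.0, 5.0, 24)

signs = sign(measure)
# Nesting: the output of one primitive is the input of the next.
nested = boxplot(subarray(measure, 1, 18))

# Vertical aggregation works on a whole fragment: many tuples in, one tuple out.
fragment = np.array([[1.0, 2.0, 3.0],
                     [3.0, 4.0, 5.0]])
column_avg = aggregate_avg(fragment)
```

Nesting here is ordinary function composition, which matches the idea that each primitive takes a single array and returns a single array.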
• Mathematical primitives:
oph_math, oph_mul_scalar, oph_sum_array, oph_sum_scalar, oph_sum_array_r, oph_gsl_sd, oph_gsl_complex_get_abs, oph_gsl_complex_get_arg, oph_gsl_complex_get_imag, oph_gsl_complex_get_real, oph_gsl_complex_to_polar, oph_gsl_complex_to_rect, oph_gsl_dwt, oph_gsl_fft, oph_gsl_idwt, oph_gsl_ifft, oph_gsl_quantile, oph_gsl_sort, oph_gsl_stats, oph_gsl_histogram, oph_gsl_boxplot, oph_petsc_vec_norm, oph_petsc_vec, oph_petsc_vec_r, oph_sum_scalar2, oph_mul_scalar2, oph_compare,…
• Data transformations:
oph_to_bin, oph_to_bit, oph_permute, oph_reverse, oph_dump, oph_convert_d, oph_shift, oph_rotate, oph_concat, oph_predicate, oph_roll_up, oph_get_subarray, oph_get_subarray2, oph_get_subarray3, oph_mask, oph_cast, oph_drill_down (stored procedure), oph_reduce, oph_reduce2, oph_reduce3, oph_aggregate_operator, oph_operator, oph_aggregate_stats, oph_extract,…
• Compression:
oph_compress, oph_uncompress, oph_compress2, oph_uncompress2, oph_id_to_index, oph_size_array, oph_count_array, oph_find, oph_find2, …
• Bit measures processing:
oph_bit_aggregate, oph_bit_import, oph_bit_export, oph_bit_size, oph_bit_count, oph_bit_dump, oph_bit_operator, oph_bit_not, oph_bit_shift, oph_bit_rotate, oph_bit_reverse, oph_bit_subarray, oph_bit_subarray2, oph_bit_concat, oph_bit_find, oph_bit_reduce, …
• Integration of numerical libraries: math, GNU GSL, PETSc.
Ophidia primitives
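The difference between byte-oriented and bit-oriented arrays can be illustrated with NumPy's bit packing. The sketch below is not how the oph_bit_* primitives are implemented; it only shows the general idea of storing one flag per bit instead of one per byte, with a predicate threshold of 10.0 chosen arbitrarily for the example.

```python
import numpy as np

measure = np.array([1.95, 8.64, 10.47, 16.11, 14.81, 18.14, 19.93, 24.35])

# Predicate evaluation: one boolean (one byte) per element.
mask = measure > 10.0

# Bit-oriented storage: eight flags packed into a single byte.
packed = np.packbits(mask)

# Counting set bits, conceptually similar to a bit-count primitive.
count = int(np.unpackbits(packed)[:mask.size].sum())

# Bitwise NOT on the packed form, then unpacked again for inspection.
negated = np.unpackbits(np.invert(packed))[:mask.size].astype(bool)
```

For this eight-element array the packed form occupies one byte instead of eight, which is the kind of saving that motivates a separate family of bit-level primitives.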
• The Ophidia terminal provides an effective and lightweight way to interact with the Ophidia server
• Bash-like environment (commands interpreter)
• Terminal with history management, auto-completion, specific environment variables and commands with integrated help…
• Easy installation as a single executable using a small number of well-known and open-source libraries
• More than 15 KLOC
• Simple enough for a novice and at the same time powerful enough for an expert
PRE-DEFINED VARIABLES
(user-defined variables possible)
OPH_TERM_PS1 = "color of the prompt"
OPH_USER = "username"
OPH_PASSWD = "password"
OPH_SERVER_HOST = "server address"
OPH_SERVER_PORT = "server port"
OPH_SESSION_ID = "current session"
OPH_EXEC_MODE = "execution mode"
OPH_NCORES = "number of cores"
OPH_CWD = "current working directory"
OPH_DATACUBE = "current datacube"
OPH_TERM_VIEWER = "output renderer"
OPH_TERM_IMGS = "save and/or auto-open images"
PRE-DEFINED COMMANDS
(user-defined aliases possible)
help = "get the description of a command or variable"
history = "manage the history of commands"
env = "list environment variables"
setenv = "set or change the value of a variable"
unsetenv = "unset an environment variable"
getenv = "print the value of a variable"
quit = "quit Oph_Term"
exit = "quit Oph_Term"
clear = "clear screen"
update = "update local XML files"
resume = "resume session"
view = "view jobs output"
alias = "list aliases"
setalias = "set or change the value of an alias"
unsetalias = "unset an alias"
getalias = "print the value of an alias"
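The variable and alias commands listed above behave like two small key-value tables. The sketch below models that behaviour in Python; it is not the Ophidia terminal. The `TermEnv` class, the `NAME=value` argument syntax and the example values are all assumptions made for illustration, and the real terminal syntax may differ.

```python
class TermEnv:
    """Toy model of the terminal's variable and alias tables (hypothetical)."""

    def __init__(self):
        self.variables = {}
        self.aliases = {}

    def run(self, line):
        # Assumed syntax: "<command> NAME=value" or "<command> NAME".
        command, _, rest = line.strip().partition(" ")
        name, _, value = rest.partition("=")
        name, value = name.strip(), value.strip().strip('"')
        # Alias commands use the alias table, everything else the variable table.
        table = self.aliases if command.endswith("alias") else self.variables
        if command in ("setenv", "setalias"):
            table[name] = value
        elif command in ("unsetenv", "unsetalias"):
            table.pop(name, None)
        elif command in ("getenv", "getalias"):
            return table.get(name)
        elif command in ("env", "alias"):
            return dict(table)
        else:
            raise ValueError(f"unsupported command: {command}")
        return None

term = TermEnv()
term.run("setenv OPH_NCORES=4")            # example value, not a default
term.run('setenv OPH_USER="alice"')        # example value, not a real account
term.run("setalias jobs=view")             # hypothetical user-defined alias
term.run("unsetenv OPH_USER")

ncores = term.run("getenv OPH_NCORES")
all_variables = term.run("env")
all_aliases = term.run("alias")
```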
Preliminary performance insights
[Figure: ideal path from the “n”D base cuboid to the “0”D apex cuboid.]
Test datacube: ½ Terabyte, 64 fragments of 10^4 tuples x 10^5 elements each (“3”D base cuboid).
Ideal path: from the “3”D base cuboid directly to the “0”D apex cuboid.
Real path through 4 Ophidia operators:
1. REDUCE ALL MAX: 64 fragments of 10^4 tuples x 10^5 elements -> 64 fragments of 10^4 tuples x 1 element
2. AGGREGATE ALL MAX: 64 fragments of 10^4 tuples x 1 element -> 64 fragments of 1 tuple x 1 element
3. MERGE ALL: 64 fragments of 1 tuple x 1 element -> 1 fragment of 64 tuples x 1 element
4. AGGREGATE ALL MAX: 1 fragment of 64 tuples x 1 element -> 1 fragment of 1 tuple x 1 element (value 24,35)
REDUCE | AGGREGATE | MERGE | AGGREGATE | TOTAL
2,30 sec | 0,17 sec | 0,90 sec | 0,06 sec | 3,43 sec
512GB | 5,12 MB | 1KB | 1KB | 8Bytes
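The four-operator path can be replayed on a scaled-down cube, and the sizes in the table can be checked with simple arithmetic. This is a sketch under assumptions: values are taken to be 8-byte doubles (consistent with the 512GB and 5,12 MB figures), fragments are modelled as NumPy arrays of shape (tuples, elements), and the NumPy calls only mimic what the operators do conceptually.

```python
import numpy as np

# Size check for the full test datacube (assumption: 8-byte values).
N_FRAGMENTS, N_TUPLES, N_ELEMENTS = 64, 10**4, 10**5
BYTES_PER_VALUE = 8
full_size_bytes = N_FRAGMENTS * N_TUPLES * N_ELEMENTS * BYTES_PER_VALUE   # input cube
after_reduce_bytes = N_FRAGMENTS * N_TUPLES * BYTES_PER_VALUE             # after step 1

# Scaled-down cube: 8 fragments of 100 tuples x 50 elements.
rng = np.random.default_rng(0)
cube = [rng.random((100, 50)) for _ in range(8)]

# 1. REDUCE ALL MAX: collapse each tuple's array to its maximum.
reduced = [frag.max(axis=1, keepdims=True) for frag in cube]
# 2. AGGREGATE ALL MAX: collapse the tuples of each fragment to one tuple.
aggregated = [frag.max(axis=0, keepdims=True) for frag in reduced]
# 3. MERGE ALL: gather all fragments into a single fragment.
merged = [np.concatenate(aggregated)]
# 4. AGGREGATE ALL MAX: collapse the merged tuples to a single value.
result = [frag.max(axis=0, keepdims=True) for frag in merged]

# Per-step timings reported in the table, in seconds.
step_seconds = [2.30, 0.17, 0.90, 0.06]
total_seconds = sum(step_seconds)
```

The sketch shows why the real path needs four steps: the maximum is first taken inside each array, then across tuples within each fragment, and only after merging can it be taken across fragments.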
• The Ophidia framework provides:
– an internal storage model for multidimensional data
– about 100 array-based primitives
– about 50 operators for data analytics and metadata management
– a Web Service (WS-I+) based server interface
• Client application available as user interface
– Complete bash-like environment, production-level, lightweight, well-documented
• APIs available at 5 different levels of the stack
– server front-end, I/O server, storage device, primitives, operators
• Performance evaluation still ongoing
• Beta release available in December
– Website, documentation, github repository
[1] G. Aloisio, S. Fiore, I. Foster, D. N. Williams, “Scientific big data analytics challenges at large scale”, Big Data and Extreme-scale Computing (BDEC), April 30 to May 01, 2013, Charleston, USA (position paper).
[2] S. Fiore, A. D’Anca, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, “Ophidia: Toward Big Data Analytics for eScience”, ICCS 2013, June 5-7, 2013, Barcelona, Spain, Procedia Computer Science, Elsevier, pp. 2376-2385.
[3] S. Fiore, C. Palazzo, A. D’Anca, I. Foster, D. N. Williams, G. Aloisio, “A big data analytics framework for scientific data management”, Workshop on “Big Data and Science: Infrastructure and Services”, IEEE International Conference on BigData 2013, October 6-9, 2013, Santa Clara, USA, pp. 1-8.
[4] S. Fiore, A. D’Anca, D. Elia, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, “Ophidia: a full software stack for scientific data analytics”, Workshop on Big Data Principles, Architectures & Applications, HPCS2014, Bologna, Italy, July 21-25, 2014.
Contacts:
Sandro Fiore, Giovanni Aloisio
CMCC and University of Salento