• No results found

The Ophidia framework: toward big data analy7cs for climate change

N/A
N/A
Protected

Academic year: 2021

Share "The Ophidia framework: toward big data analy7cs for climate change"

Copied!
25
0
0

Loading.... (view fulltext now)

Full text

(1)

The  Ophidia  framework:  toward  big  

data  analy7cs  for  climate  change  

Dr. Sandro Fiore

Leader, Scientific data management research group

Scientific Computing Division @ CMCC

Prof. Giovanni Aloisio

(2)

A set of requirements have been discussed with our scientists to identify the

key data analytics needs.

 

 

Preliminary  requirements  and  needs  focus  on:  

Time  series  analysis  

Data  reduc6on  (e.g.  by  aggrega6on)  

Data  subse<ng  

Model  intercomparison  

Mul6model  means  

Data  transforma6on  (through  array-­‐based  primi6ves)  

Param.  Sweep  experiments  (same  task  applied  on  a  set  of  data)  

Comparison  between  historical  data  and  future  scenarios  

Maps  genera6on  

Ensemble  analysis  

Data  analy6cs  worflow  support  

But  also…  

Performance,  re-­‐usability,  extensibility  

(3)

The Ophidia Project

 

S. Fiore, A. D’Anca, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, “Ophidia: toward bigdata analytics for

eScience”,

ICCS2013 Conference

, Procedia Elsevier, Barcelona, June 5-7, 2013

Ophidia

is a research effort carried out at the

Euro Mediterranean Centre on Climate

Change

(CMCC) to

address “big data” challenges, issues and requirements for climate

(4)

Ophidia Architecture 1.0

 

Front

end

 

Compute

layer

 

I/O

layer

 

I/O server

instance

 

Storage

layer

 

System

catalog

 

Array-based primitives

 

Analytics Framework

Standard interfaces

Partitioning/hierarchical data mng

Declarative language

 

(5)

Storage model (dimension-independent) & implementation

!

Array-based support and hierarchical storage

!

(6)

OPERATOR NAME OPERATOR DESCRIPTION

Operators “Data processing” – Domain-agnostic

OPH_APPLY(datacube_in,

datacube_out, array_based_primitive)

Creates the datacube_out by applying the array-based primitive to the

datacube_in OPH_DUPLICATE(datacube_

in, datacube_out) Creates a copy of the the datacube_outdatacube_in in

OPH_SUBSET(datacube_in,

subset_string, datacube_out) Creates the sub-setting of the datacube_outdatacube_in by doing a by applying the subset_string

OPH_MERGE(datacube_in,

merge_param, datacube_out) Creates the groups of merge_paramdatacube_out by merging fragments from datacube_in

OPH_SPLIT(datacube_in,

split_param, datacube_out) Creates the into groups of datacube_outsplit_param by splitting fragments each fragment of the datacube_in

OPH_INTERCOMPARISON (datacube_in1, datacube_in2,

datacube_out)

Creates the datacube_out which is the element-wise difference between

datacube_in1 and datacube_in2

OPH_DELETE(datacube_in) Removes the datacube_in

OPERATOR NAME OPERATOR DESCRIPTION

Operators “Data processing” – Domain-oriented

OPH_EXPORT_NC (datacube_in, file_out)

Exports the datacube_in data into the

file_out NetCDF file. OPH_IMPORT_NC

(file_in, datacube_out)

Imports the data stored into the file_in

NetCDF file into the new datacube_in

datacube

Operators “Data access”

OPH_INSPECT_FRAG (datacube_in, fragment_in)

Inspects the data stored in the

fragment_in from the datacube_in

OPH_PUBLISH(datacube_in) Publishes the datacube_in fragments into HTML pages

Operators “Metadata”

OPH_CUBE_ELEMENTS

(datacube_in) Provides the total number of the elements in the datacube_in

OPH_CUBE_SIZE (datacube_in)

Provides the disk space occupied by the

datacube_in

OPH_LIST(void) Provides the list of available datacubes. OPH_CUBEIO(datacube_in) Provides the provenance information

related to the datacube_in

OPH_FIND(search_param) Provides the list of datacubes matching the search_param criteria

Metadata management

(sequential and parallel operators)

Data processing

(parallel operators, MPI &

OpenMP based)

 

Data Access

(sequential and parallel operators)

 

The analytics framework: datacube operators (about 50)

!

Import/Export

(parallel operators)

(7)
(8)
(9)

The analytics framework: datacube operators

!

(10)

The

array data type

support is not enough to provide scientific data

management capabilities…

primitives

are needed as well.

A set of

array-based primitives

have been implemented

A primitive is applied to a

single array

They come in the form of

plugins

(I/O server extensions)

So far, Ophidia provides a

wide set of plugins (about 100)

to perform

data reduction (by aggregation), sub-setting, predicates evaluation,

statistical analysis, compression, and so forth.

Support is provided both for

byte

-oriented and

bit

-oriented arrays

Plugins can be nested

to get more complex functionalities

Compression

is provided as a primitive too

(11)

Array based primitives: OPH_MATH (“SIGN”)

 

oph_math(measure, “OPH_SIGN”, "OPH_DOUBLE”)

(12)

Array based primitives: OPH_BOXPLOT

 

oph_boxplot(measure, "OPH_DOUBLE”)

(13)

oph_boxplot(oph_subarray(oph_uncompress(measure), 1,18), "OPH_DOUBLE”)

subarray(measure, 1,18)

 

Array based primitives: nesting feature

 

(14)

Array based primitives: oph_aggregate

 

oph_aggregate(measure,"oph_avg”)

Ver6cal  aggrega6on

 

Single chunk or fragment (input)

 

Single chunk or

fragment (output)

 

(15)

•  Mathematical primitives:

oph_math, oph_mul_scalar, oph_sum_array, oph_sum_scalar, oph_sum_array_r, oph_gsl_sd, oph_gsl_complex_get_abs, oph_gsl_complex_get_arg, oph_gsl_complex_get_imag, oph_gsl_complex_get_real, oph_gsl_complex_to_polar, oph_gsl_complex_to_rect, oph_gsl_dwt, oph_gsl_fft, oph_gsl_idwt, oph_gsl_ifft, oph_gsl_quantile, oph_gsl_sort, oph_gsl_stats, oph_gsl_histogram, oph_gsl_boxplot, oph_petsc_vec_norm, oph_petsc_vec, oph_petsc_vec_r, oph_sum_scalar2, oph_mul_scalar2, oph_compare,…

•  Data transformations:

oph_to_bin, oph_to_bit, oph_permute, oph_reverse, oph_dump, oph_convert_d, oph_shift, oph_rotate, oph_concat, oph_predicate, oph_roll_up, oph_get_subarray, oph_get_subarray2, oph_get_subarray3, oph_mask, oph_cast, oph_drill_down (stored procedure), oph_reduce, oph_reduce2, oph_reduce3, oph_aggregate_operator, oph_operator, oph_aggregate_stats, oph_extract,…

•  Compression:

oph_compress, oph_uncompress, oph_compress2, oph_uncompress2, oph_id_to_index, oph_size_array, oph_count_array, oph_find, oph_find2, …

•  Bit measures processing:

oph_bit_aggregate, oph_bit_import, oph_bit_export, oph_bit_size, oph_bit_count, oph_bit_dump, oph_bit_operator, oph_bit_not, oph_bit_shift, oph_bit_rotate, oph_bit_reverse, oph_bit_subarray, oph_bit_subarray2, oph_bit_concat, oph_bit_find, oph_bit_reduce, …

•  Integration of numerical libraries: math, GNU GSL, PetsC.

Ophidia primitives

 

(16)

• 

The  Ophidia  terminal  provides  an  effec6ve  and  lightweight  way  to  interact  with  the  Ophidia  server  

• 

Bash-­‐like  environment  (commands  interpreter)  

• 

Terminal  with  history  management,  auto-­‐comple6on,  specific  environment  variables  and  commands  

with  integrated  help…  

• 

Easy  installa6on  as  an  only  one  executable  using  a  small  number  of  well-­‐known  and  open-­‐source  

libraries  

• 

More  than  15  KLOC  

• 

Simple  enough  for  a  novice  and  at  the  same  6me  powerful  enough  for  an  expert  

(17)

PRE-DEFINED VARIABLES

(user-defined variables possible)

OPH_TERM_PS1  =  "color  of  the  prompt”  

OPH_USER  =  "username”  

OPH_PASSWD  =  "password”  

OPH_SERVER_HOST  =  "server  address”  

OPH_SERVER_PORT  =  "server  port”  

OPH_SESSION_ID  =  "current  session”  

OPH_EXEC_MODE  =  "execu6on  mode”  

OPH_NCORES  =  "number  of  cores”  

OPH_CWD  =  "current  working  directory”  

OPH_DATACUBE  =  "current  datacube”  

OPH_TERM_VIEWER  =  "output  renderer”  

OPH_TERM_IMGS  =  "save  and/or  auto-­‐open  images"  

PRE-DEFINED COMMANDS

(user-defined aliases possible)

help  =  "get  the  descrip6on  of  a  command  or  variable”  

history  =  "manage  the  history  of  commands”  

env  =  "list  environment  variables”  

setenv  =  "set  or  change  the  value  of  a  variable”  

unsetenv  =  "unset  an  environment  variable”  

getenv  =  "print  the  value  of  a  variable”  

quit  =  "quit  Oph_Term”  

exit  =  "quit  Oph_Term”  

clear  =  "clear  screen”  

update  =  "update  local  XML  files”  

resume  =  "resume  session”  

view  =  "view  jobs  output”  

alias  =  "list  aliases”  

setalias  =  "set  or  change  the  value  of  an  alias”  

unsetalias  =  "unset  an  alias”  

getalias  =  "print  the  value  of  an  alias"  

(18)
(19)
(20)
(21)

“n”D  base  

cuboid  

“0”D  apex  

cuboid  

Ideal  path  

FRAGMENT1  –     1  TUPLE  x  1  ELEM.   ID     MEASURE   1   24,35  

x1  

FRAGMENT1  –  10^4  TUPLE  x  10^5  ELEMENTS  

ID     MEASURE  

1   2   …   10  

FRAGMENT2  –  10^4  TUPLE  x  10^5  ELEMENTS  

ID     MEASURE  

1   2   …   10  

FRAGMENT3  –  10^4  TUPLE  x  10^5  ELEMENTS  

ID     MEASURE  

1   2   …   10  

FRAGMENT4  –  10^4  TUPLE  x  10^5  ELEMENTS  

ID     MEASURE   1   2   …   10^  

x64  

1,95   8,64   10,47   …   16,11   14,81   18,14   19,93   …   24,35   …   6,87   10,99   12,85   …   16,93   1,95   8,64   10,47   …   16,11   14,81   18,14   19,93   …   24,35   …   6,87   10,99   12,85   …   16,93  

FRAG64  –  10^4  TUPLE  x  10^5  ELEMENTS  

ID     MEASURE   1   2   …   10^4   1,95   8,64   …   16,11   14,81   18,14   …   24,35   …   6,87   10,99   …   16,93  

Ideal  path  

“3”D  base  

cuboid  

“0”D  apex  

cuboid  

½  Terabyte  

Datacube  

Preliminary performance insights

!

(22)

AGGREGATE  ALL  MAX    

FRAGMENT1  –     1  TUPLE  x  1  ELEM.   ID     MEASURE   1   24,35  

x1  

FRAGMENT1  –   64  TUPLE  x  1  ELEM.   ID     MEASURE   1   2   …   64   19,78   12,87   …   24,35  

x1  

MERGE  ALL  

FRAGMENT  1  –     1  TUPLE  x  1  ELEM.   ID     MEASURE   1  

x64  

FRAGMENT  2  –     1  TUPLE  x  1  ELEM.   ID     MEASURE   1   FRAGMENT  3  –     1  TUPLE  x  1  ELEM.   ID     MEASURE   1   FRAGMENT  4  –     1  TUPLE  x  1  ELEM.   ID     MEASURE   1   24,35   FRAGMENT  64  –     1  TUPLE  x  1  ELEM.   ID     MEASURE   1   24,35  

AGGREGATE  ALL  MAX    

FRAGMENT1  –  10^4   TUPLE  x  1  ELEM.   ID     MEASURE   1   2   …   104   FRAGMENT2  –  10^4   TUPLE  x  1  ELEM.   ID     MEASURE   1   2   …   104   FRAGMENT3  –  10^4   TUPLE  x  1  ELEM.   ID     MEASURE   1   2   …   104   FRAGMENT4  –  10^4   TUPLE  x  1  ELEM.   ID     MEASURE   1   2   …   10^   FRAGMENT  64  –  10^4   TUPLE  x  1  ELEM.   ID     MEASURE   1   2   …   10^4   16,11   24,35   …  

FRAGMENT1  –  10^4  TUPLE  x  10^5  ELEMENTS  

ID     MEASURE  

1   2   …   10  

FRAGMENT2  –  10^4  TUPLE  x  10^5  ELEMENTS  

ID     MEASURE  

1   2   …   10  

FRAGMENT3  –  10^4  TUPLE  x  10^5  ELEMENTS  

ID     MEASURE  

1   2   …   10  

FRAGMENT4  –  10^4  TUPLE  x  10^5  ELEMENTS  

ID     MEASURE   1   2   …   10^  

x64  

1,95   8,64   10,47   …   16,11   14,81   18,14   19,93   …   24,35   …   6,87   10,99   12,85   …   16,93   1,95   8,64   10,47   …   16,11   14,81   18,14   19,93   …   24,35   …   6,87   10,99   12,85   …   16,93  

FRAG64  –  10^4  TUPLE  x  10^5  ELEMENTS  

ID     MEASURE   1   2   …   10^4   1,95   8,64   …   16,11   14,81   18,14   …   24,35   …   6,87   10,99   …   16,93  

REDUCE  ALL  MAX    

16,93  

x64  

Real  path  through   4  Ophidia    operators  

Ideal  path  

“3”D  base  

cuboid  

“0”D  apex  

cuboid  

REDUCE AGGREGATE MERGE AGGREGATE TOTAL

2,30 sec 0,17 sec 0,90 sec 0,06 sec 3,43 sec

512GB     5,12  MB   1KB   1KB   8Bytes  

Preliminary performance insights on a

!

(23)

The Ophidia framework provides:

an internal storage model for multidimensional data

about 100 array-based primitives

about 50 operators for data analytics and metadata management

a Web Service (WS-I

+

) based server interface

Client application available as user interface

Complete bash-like environment, production-level, lightweight, well-documented

APIs available at 5 different levels of the stack

server front-end, I/O server, storage device, primitives, operators

Performance evaluation still ongoing

Beta release available in December

Website, documentation, github repository

(24)

[1] G. Aloisio, S. Fiore, I. Foster, D. N. Williams , “

Scientific big data analytics challenges at large

scale

”, Big Data and Extreme-scale Computing (BDEC), April 30 to May 01, 2013, Charleston, USA

(position paper).

[2] S. Fiore, A. D'Anca, C. Palazzo, I. Foster, Dean N. Williams, Giovanni Aloisio, “

Ophidia: Toward Big

Data Analytics for eScience

”, ICCS 2013, June 5-7, 2013 Barcelona, Spain, Procedia Computer

Science, Elsevier, pp. 2376-2385.

[3] S. Fiore, C. Palazzo, A. D’Anca, I. Foster, D. N. Williams, G. Aloisio, “

A big data analytics framework

for scientific data management

”, Workshop on “Big Data and Science: Infrastructure and Services”,

IEEE International Conference on BigData 2013, October 6-9, 2013, Santa Clara, USA, pp. 1-8.

[4] S.Fiore, A. D’Anca, D. Elia, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, “

Ophidia: a full software

stack for scientific data analytics

”, Workshop on Big Data Principles, Architectures & Applications,

HPCS2014, Bologna, USA, July 21-25, 2014.

Contacts:

Sandro Fiore, Giovanni Aloisio

CMCC and University of Salento

[email protected]

,

[email protected]

,

(25)

References

Related documents

If you use FrontPage to create your web site, you will want to have the FrontPage Server Extensions installed in your account.. The Server Extensions allow you to use the server-side

The new proposal would significantly lower the statutory rate for the corporation income tax, lower individual rates further and increase the tax thresholds, tax

ECP Project Management Structure Board of Directors Science Council Industry Council Project Director Deputy Director CTO Integration Manager D ep ar tm en t o f En er g y

1) All surface and intermediate casing strings shall be centralized through the interval to be cemented. Casing and cementing programs shall comply with

The retrofit concept is based on energy efficiency measures (reduction of transmission, infiltration and ventilation losses), on a high ratio of renewable energy sources and on

Like other parents, Andrés’ mother wanted him to be able to communicate with friends and family in their home country, and she gave him frequent reminders to keep up his Spanish: “

In order to increase transparency of the regulations contained in the Law and the Decree, the President of the Office for the Protection of Competition and Consumers

Measurement of gastrointestinal transit times, including gastric emptying and colonic transit times, using an ingestible pH and pressure capsule is considered investigational for