• No results found

NASA Earth Science Research in Data and Computational Science Technologies Report of the ESTO/AIST Big Data Study Roadmap Team September 2015

N/A
N/A
Protected

Academic year: 2021

Share "NASA Earth Science Research in Data and Computational Science Technologies Report of the ESTO/AIST Big Data Study Roadmap Team September 2015"

Copied!
24
0
0

Loading.... (view fulltext now)

Full text

(1)

NASA Earth Science Research in Data and Computational Science Technologies

Report  of  the  ESTO/AIST  Big  Data  Study  Roadmap  Team September  2015

     

I.  Background

Over  the  next  decade,  the  dramatic  growth  of  NASA’s  Earth  Science  data  collections  is   projected  to  outpace  the  ability  of  scientists  to  analyze  that  data  meaningfully.    What  has   been  termed  the  “Vs”  of  data  (volume,  variety,  velocity,  etc.)  pose  significant  challenges  for   both  Earth  Science  missions  and  researchers  as  traditional  methods  for  developing  science   data  pipelines,  distributing  scientific  datasets  and  performing  effective  analysis  will  require   new  approaches.  The  Intergovernmental  Panel  on  Climate  Change’s  (IPCC)  Assessment   Report  6,  for  example,  predicts  the  growth  of  data  to  tens  of  petabytes.  Future  remote   sensing  projects  include  an  increasing  set  of  data-­‐intensive  instruments  that  will  pose   severe  challenges  to  existing  systems.  Likewise,  with  Earth  Science  data  archives,  these   massive  increases  will  shift  the  focus  from  distributing  whole  data  sets  to  providing  online   services  for  computation  and  analysis.  In  addition,  instruments  flown  on  NASA  Earth-­‐ observing  satellites  will  continue  to  generate  data  and  stress  the  boundaries  of  end-­‐to-­‐end   data  systems.    These  challenges  taken  together  require  new  thinking  in  data  capture,   management,  processing,  and  analysis,  both  onboard  and  on  the  ground.

Big  Data  and  Data  Science  have  been  used  as  terms  to  describe  this  data  deluge  and  the   discipline  devote  to  address  it.  For  the  purposes  of  this  document,  Big  Data  is  a  term  used   to  describe  the  state  of  collection  and  analysis  of  data  that  exceeds  conventional  methods   or  software  systems.    This  state  of  affairs  necessitates  new  approaches  that  will  change  the   paradigm  by  which  data  is  collected  and  analyzed.      Data  Science  focuses  on  the  use  of   systematic  architectural,  software,  methodological,  and  algorithmic  approaches  (e.g.,  data   management,  intelligent  algorithms,  statistics  and  visualization)  for  generation,  capture,   management,  analysis,  and  discovery  from  massive  data  sets  and  streams  (i.e.,  big  data)   including  those  from  multiple  sensors,  models,  archives,  and  other  sources,  to  enable   research  and  decision  support.    Data  Science  is  a  term  that  is  being  applied  at  a  growing   number  of  universities  that  are  establishing  programs  in  this  area.

Data  Science  is  thus  emerging  as  a  critical  area  of  research  and  technology  to  advance   scientific  discovery.  The  sheer  volume  of  data  increase  over  the  next  decade,  coupled  with   the  highly  distributed  and  heterogeneous  nature  of  scientific  data  sets  is  requiring  these   new  approaches.  Numerous  technologies  are  under  development  in  multiple  communities   to  address  Big  Data  challenges.  While  NASA  can  leverage  some  of  this  capability  (e.g.,  map-­‐ reduce  to  scale  computation),  there  are  significant  technologies  that  need  to  be  developed  

(2)

that  scale  to  address  the  data-­‐intensive  challenges  underlying  modern  observing  systems   and  the  science  questions  they  were  created  to  investigate.  Existing  techniques  also  do  not   address  the  end-­‐to-­‐end  nature  of  NASA’s  science-­‐driven  observational  environment.  NASA   needs  to  develop  a  multi-­‐year  plan  to  address  these  challenges  in  total,  rather  than  with   incremental,  isolated  improvements  that  are  unlikely  to  scale.  Such  an  approach  promises   significant  scientific  yield  from  NASA’s  missions,  instruments,  archives,  and  research   community  and  will  be  necessary  in  order  to  remain  on  the  critical  path  to  accomplish   science  in  the  Big  Data  era.  

   

II.  Current  State:  Mission,  Instrument  and  Large-­‐scale  Data  Analysis  Challenges Currently,  the  analysis  of  large  data  collections  from  NASA  or  other  agencies  is  executed   through  traditional  computational  and  data  analysis  approaches,  which  require  users  to   bring  data  to  their  desktops  and  perform  local  data  analysis.  Alternatively,  data  are  hauled   to  large  computational  environments  that  provide  centralized  data  analysis  via  traditional   High  Performance  Computing  (HPC).  Scientific  data  archives,  however,  are  not  only  

growing  massive,  but  are  also  becoming  highly  distributed.    As  a  consequence,  neither   traditional  approach  provides  a  good  solution  for  optimizing  analysis  into  the  future.  

Further,  assumptions  across  the  NASA  mission  and  science  data  lifecycle,  which  historically   assume  that  all  data  can  be  collected,  transmitted,  processed,  and  archived,  will  not  scale  as   more  capable  instruments  stress  legacy-­‐based  systems.  A  new  paradigm  is  needed  in  order   to  increase  the  productivity  and  effectiveness  of  scientific  data  analysis.  This  paradigm   must  recognize  that  architectural  and  analytical  choices  are  interrelated,  and  must  be   carefully  coordinated  in  any  system  that  aims  to  allow  efficient,  interactive  scientific   exploration  and  discovery  to  exploit  massive  data  collections,  from  point  of  collection  (e.g.,   onboard)  to  analysis  and  decision  support.  Expanding  on  this  point,  the  most  effective   approach  to  analyzing  a  distributed  set  of  massive  data  may  involve  some  exploration  and   iteration,  putting  a  premium  on  the  flexibility  afforded  by  the  architectural  framework.     The  framework  should  enable  scientist  users  to  assemble  workflows  efficiently,  manage   the  uncertainties  related  to  data  analysis  and  inference,  and  optimize  deep-­‐dive  analytics  to   enhance  scalability.  These  challenges  are  not  limited  to  NASA.  Multiple  agencies  are  

confronted  with  the  question  of  how  to  draw  scientific  inference  from  growing,  distributed   archives,  as  identified  in  the  appendix.

NASA  has  already  made  significant  investments  in  capturing  and  sharing  data  from   massive,  online  data  systems.  The  NASA  Earth  Science  Distributed  Information  System   (ESDIS),  most  prominently  the  Distributed  Active  Archive  Centers  (DAACs),  provide  an   excellent  foundation  for  capturing  and  building  high  quality  repositories.    Technology   investments  to  date  by  NASA  (e.g.,  ROSES  ACCESS,  ROSES  AIST,  etc.)  and  the  DOE  (e.g.,  

(3)

Earth  System  Grid)  have  focused  on  developing  software  services  that  have  improved   access  to  distributed  data.      The  challenge  that  is  posed  now  is  how  to  integrate  these   capabilities  with  careful  architectural  approaches  and  emerging  technologies  that  will   allow  NASA  to  continue  to  scale  the  entire  data  lifecycle  and  support  the  data  analysis   needs  of  the  Earth  Science  community.      The  existing  infrastructures  are  prime  candidates   for  grounding  a  more  distributed,  scalable  computational  environment  to  optimize  

scientific  data  analysis  and  move  to  the  era  of  “Big  Data  Analytics.”  An  integrated  

architecture  and  ecosystem  will  address  the  modern  data  and  computing  challenges  across   the  NASA  mission  and  science  data  lifecycle:

● Reproducibility   ● Uncertainty   ● Data  fusion   ● Data  reduction   ● Data  movement   ● Data  visualization   ● Cost   ● Performance  

III.  Use  Cases  

Several  use  cases  have  been  identified  which  present  challenges  to  future  NASA  missions   and  science.  These  use  cases  identify  the  challenges  and  required  capabilities  that  are   needed  in  order  to  not  just  keep  pace  but  increase  the  science  yield  from  NASA  Earth   Science  missions,  instruments  and  data  collections.  

Use  Cases1  –  the  following  identify  use  cases  and  their  data  science  challenges.    

Use Case Description Data Science

Challenge

Enabling Mission/ Capability

Climate Modeling

Formulate hypotheses from observed empirical relationships;

Simulate current and past conditions under those hypotheses using climate models; Test hypotheses by comparing simulations to observations; Evaluate uncertainty of predictions originated from statistical sampling of models and observations.

Highly distributed data sources; fusion of different

observations; moving computation to the data; data reduction

CMIP6 will move towards exascale archives requiring new approaches to evaluating models relative to

observational data.

Satellite Missions

Missions such as NI-SAR and SWOT will generate massive observational data. However, they are have different architectural patterns including compute intensive, data intensive, heterogeneous,

Massive data rates, data movement challenges, computational

NI-SAR and SWOT require new approaches for

computation, data movement, data archiving and distribution,

1

(4)

etc. scalability, archiving and distribution; onboard processing for data reduction/analysis; high-volume data transfer for ground processing analytics. Applications - Hydrology (Central Valley of California)

Understanding groundwater dynamics on a regional scale using measurements from satellite, airborne and in-situ measurements. Compare against predictive models.

Distributed computation; highly distributed data sources; data fusion of multiple products; massive new satellite observations.

Integration of data from PALSAR-2, Sentinel, Grace-FO, ASO, and SMAP. Scale to support NI-SAR and SWOT. Comparison against models. Requires new architectural approaches for distributed data analytics.

. Airborne

Missions

Airborne missions tend to be much more agile and on-demand. Integrating this into a data ecosystem provides new opportunities to quickly generate and understand various measurements.

On-demand architectures; distributed data sources; on-the-fly data processing; onboard processing for data reduction/analysis; high-volume data transfer for ground processing

Current missions such as CARVE and Airborne Snow Observatory; Future such as proposed EVI-3 and ASO follow-on missions

Other  science  disciplines  such  as  biology  carry  similar  use  cases  and  have  identified  similar   technology  needs  to  those  of  NASA.    These  are  identified  in  the  section  on  benchmarking.   IV.  Conceptual  Architecture:  The  Data  Lifecycle  for  NASA  Earth  Science

The  data  lifecycle  is  a  critical  perspective  for  developing  a  comprehensive  set  of  capabilities   to  increase  scientific  yield  from  missions.  This  encompassing  approach  must  include  

developing  capabilities  from  the  point  of  collection  all  the  way  to  analysis  and  extracted   understanding.  Figure  1  below  shows  this  concept  and  exposes  the  need  to  improve  data   science  capabilities  at  multiple  points  of  the  data  lifecycle:  from  onboard  computing,  to   data  triage  of  massive  data  streams,  to  data  analysis.  This  perspective  requires  

architectural  considerations  for  determining  and  integrating  methodologies  and  

(5)

to-­‐end  concept,  the  model  below  provides  guidance  for  NASA  investments  and  capabilities   that  will  be  required  to  support  effectiveness  and  scalability  in  the  Big  Data  era.

The  data  lifecycle  view  reveals  that  choices  and  results  of  operating  on  data  at  one  point  in   the  lifecycle  inevitably  shape  and  possibly  limit  the  possibilities  for  working  with  the  data   at  a  later  point  in  the  lifecycle.  Coordination  must  be  addressed  in  both  directions:  

ultimately  objectives  for  scientific  data  understanding  and  reproducibility  must  back-­‐drive   the  design  and  development  of  capabilities  early  in  the  data  lifecycle  so  as  to  enable  desired   results,  to  the  greatest  extent  possible.  See  Figure  2.  There  is  also  an  outer  loop  to  the  data   lifecycle:  Successful  data  understanding  (and  possibly,  disappointments)  drives  the  next   round  of  science  instrument  and  mission  proposals.

Figure 1: Data Lifecycle for NASA Science Missions

(6)

Figure 2: End-to-End NASA Mission/Science Data Lifecycle

The  data  lifecycle  views  shown  in  Figures  1  and  2  identify  a  number  of  challenges  across   the  mission  lifecycle  that  affect  the  various  scientific  disciplines  that  are  core  to  NASA.     Earth  Science,  in  particular,  benefits  from  new  approaches  to  improving  data  analysis  given   the  complex,  heterogeneous,  high-­‐volume,  and  distributed  nature  of  the  remote  sensing   data  acquired  from  Earth-­‐observing  missions.  As  an  example,  use  of  these  data  for  

comparison  against  and  verification  of  climate  model  simulations  is  critical  to  supporting   research  in  global  climate  change.  However,  much  of  this  data  is  managed  in  highly  

distributed  repositories  with  different  data  representations,  formats,  access  methods,  and   owners.  Significant  time  and  effort  are  required  to  access,  move,  and  rectify  data  from   different  repositories  to  support  specific  analyses  of  interest  to  particular  researchers.   These  requirements  constrain  both  the  type  and  scope  of  analyses  that  can  be  performed,   and  therefore  ultimately  the  scientific  hypotheses  that  can  be  tested.  A  major  goal  for  data   science  is  to  reduce  the  cost  of  large-­‐scale,  interactive  data  analysis  and,  at  the  same  time,   leverage  the  richness  of  massive  data  sets  to  reduce  uncertainty  in  the  answers  to  crucial   scientific  questions.  This  is  not  possible  in  the  current  environment  due  to  lack  of  

scalability.  Data  science  seeks  to  enable  not  just  more  discipline  science  (e.g.,  more  data   incorporated  in  analysis,  more  parameters  to  compare,  etc.),  but  better  science  (e.g.,   quantitative  assessments  of  uncertainty,  repeatability  of  results,  etc.),  to  be  performed  in  a   similar  or  shorter  amount  of  time  despite  growing  sets  of  data.

(7)

VI. Capability  Gaps  and  Drivers  

A  major  gap  has  been  the  focus  on  localized  data  analysis  from  instruments  vs.  a  holistic   consideration  of  all  the  observing  systems  and  how  they  fit  into  a  big  data  analytics  view   and  capability.  Data  systems  today  are  organized  around  the  capture  and  archiving  of  data   from  specific  missions  or  observing  capabilities.  However,  the  integration  of  data  from   multiple  instruments  (spaceborne,  airborne,  ground-­‐based)  is  important  for  supporting   scientific  research,  in  particular,  in  moving  from  isolated  data  analysis  to  knowledge   discovery  through  the  use  of  a  big  data  analytics  approach.  This  includes  making  data   available  from  multiple  sources  and  integrating  those  data  using  intelligent  algorithms  and   methods.  In  addition,  as  data  grows  and  more  automated  methods  are  in  place  for  data   discovery,  this  affords  opportunities  to  improve  the  efficiency  and  effectiveness  of  ongoing   mission  operations  and  move  it  towards  a  data-­‐driven  approach  where  data  is  reduced   onboard  or  at  ground  stations,  prior  to  archive,  as  well  as  during  fully  offline  science   analysis.  The  introduction  of  new  approaches  with  interpretation  of  data  across  the   lifecycle  allows  for  informed  decisions  at  arbitrary  points  in  the  lifecycle  –  allowing  for   mission  plans  to  be  updated,  new  relevant  data  products  to  be  generated,  etc.  This  same   full  view  of  the  data  lifecycle  also  will  inform  provenance  practices,  so  that  the  relevant   details  –  all  the  way  back  to  the  point  of  collection  –  can  be  captured  to  provide  a  basis  for   reproducing  the  end  results  of  scientific  understanding.  This  is  a  paradigm  shift  from  how   mission  and  science  operations  vs.  analysis  are  performed  today,  largely  in  separate   arenas2.  

 

Figure  3  below  shows  the  current  approach  for  organizing  NASA  Earth  Science  data   systems.      The  approach  focuses  extensively  on  the  stewardship  of  data,  by  collecting  data   into  organized  and  community-­‐accessible  archives.    This  approach  has  enabled  NASA  to   build  high  quality  data  archives,  but  as  the  need  for  a  shift  towards  systematic  approaches   to  data  analysis  increases,  it  is  important  to  take  a  broader  view  of  how  activities  across  the   entire  data  lifecycle  can  be  organized  and  integrated.    Today,  onboard  computing,  mission   operations,  science  data  processing  and  archiving,  and  data  analysis  are  performed  as   generally  independent  architectures,  systems  and  components  of  the  data  lifecycle.    

2This state of affairs is reflected in the disjoint NASA programs for science missions vs. research and analysis.

(8)

Figure 3: Today, Stewardship Model of NASA Data Systems

 

Taking  a  broader  view,  cyber-­‐infrastructures,  machine  learning  and  other  intelligent  

algorithms  (analytics),  statistics,  and  visualization  should  be  brought  together  as  integrated   architectural  solutions  that  can  be  scaled  to  meet  NASA’s  data  challenges  in  enabling  

missions  and  science.    The  Figure  4  below  shows  a  shift  from  data  generation  and   stewardship  of  archives  to  enabling  data  analysis  through  an  integrated  cyber-­‐

infrastructure  where  data,  algorithms,  computation,  and  visualization  are  brought  together   across  a  highly  distributed  data  environment  to  quickly  stand  up  new  analytic  capabilities   for  different  stakeholders,  measurements,  science  questions,  and  applications.    This   requires  new  thinking  as  follows:

1. New  architectural  approaches  across  the  entire  science  mission  and  data   lifecycle  in  order  to  increase  the  science  yield  by  scaling  and  improving  the   integration  of  each  of  the  various  components  of  the  lifecycle.  

2. Increasing  capability  onboard  to  support  data  reduction,  autonomous   mission  (re-­‐)planning,  etc.,  as  part  of  managing  bandwidth  capabilities.   3. Integrated  analytics  as  part  of  the  real-­‐time  mission  pipeline.  

4. Ability  to  quickly  construct  new  analytic  centers  that  can  unify  archives,   computing  capabilities,  and  software  services  to  bring  heterogeneous  data   sets  as  well  as  models/simulations  together  for  analysis.  

Data$ Acquisi+on$ and$$ Command$ Instrume nt$ Opera+on s$ EDOS/ GDS$ L0A$$ Processin g$ Science$ Data$$ Processing$ L0B$ L1$ L2$ L3$ L4$ $ SDS$ EOSDIS$DAAC$ Science$ Data$ Management$$ Archive$ &$ Distribu+on$ Instrume nt$ Opera+on s$ EDOS/ GDS$ L0A$$ Processin g$ Science$ Data$$ Processing$ L0B$ L1$ L2$ L3$ L4$ $ SDS$ EOSDIS$DAAC$ Science$ Data$ Management$$ Archive$ &$ Distribu+on$ EOSDIS$Data$ Centers$ Science$ Data$ Management$$ Archive$ &$ Distribu+on$ Science$ Data$$ Processing$ L0B$ L1$ L2$ L3$ L4$ $ Science$Data$ Systems$ Instrument$ Opera+ons$ EDOS/ Ground$Data$ Systems$ L0A$$ Processing$

Science$Teams$ Outreach$ Research$

Mission$ Opera+on s$ Mission$ Opera+ons$ TDRS% Netwo rk% On$Board$ Processing$

(9)

5. Systems  that  can  react  to  increased  velocity  of  data  across  the  lifecycle  with   new  technology  approaches  for  data  triage  and  capture.  

6. Interoperability  with  other  agencies  including  capture  of  data  from  their   instruments  and  integration  with  their  ground  systems,  archives,  and   analytic  capabilities.  

7. Quantitative  uncertainty  management  at  all  stages  of  the  data  lifecycle    in   order  to  provide  uncertainties  required  for  scientific  inference..    

Figure 4. Future, NASA Mission-Science Data Ecosystem

V. Proposed  NASA  Data  and  Computational  Science  Technology  Areas  

As  discussed  above,  as  the  data-­‐intensive  nature  of  NASA  science  and  exploration  missions   increase,  there  is  a  greater  need  to  consider  the  data  lifecycle  from  the  point  of  collection  all   the  way  to  extracted  understanding  of  the  data  in  order  to  support  scalability  and  full   utilization  of  the  data.    Furthermore,  missions  may  have  no  choice  but  to  require  that  data   reduction  and  intelligent  triage  be  performed  across  the  data  lifecycle  to  identify  which   data  are  to  be  captured  and  archived.    In  addition  to  considerations  of  the  data  and  

software,  it  is  critical  that  common  information  models  be  developed  and  defined  in  order   to  ensure  that  consistent  definitions  of  the  data  are  applied  (what  is  “definitions  of  the  

Data$ Acquisi+on$ and$$ Command$ Instrume nt$ Opera+on s$ EDOS/ GDS$ L0A$$ Processin g$ Instrume nt$ Opera+on s$ EDOS/ GDS$ L0A$$ Processin g$ Instrument$ Opera+ons$ EDOS/ Ground$Data$ Systems$ L0A$$ Processing$ Mission$ Opera+on s$ Mission$ Opera+ons$ TDRS% Netwo rk% On$Board$ Processing$ Data Science Infrastructure (Data, Algorithms, Machines) Applica+ons$ Data$Analy)cs$ Centers$ Other%Data% Systems%(e.g.% NOAA)% Other%Data% Systems%(e.g.% NOAA)% Other$$ Data$Systems$$ (InDSitu,$Other$ Agency,$etc)$ Decision$ Support$ $ On$Demand$ Algorithms$ Science$ Data$ Manage$ Satellite$ Instrument$ Data$ Systems$ Science$ Data$ Manage$ Airborne$ Data$ Science$ Data$ Manage$ NASA$ Data$ Archives$ Research$ Science$Teams$

(10)

data”?.    In  particular,  a  rigorous,  probabilistic  definition  of  uncertainty  should  be  adopted   to  cover  all  NASA  missions  and  data  so  that  uncertainty  is  defined  in  the  same  way  for  all   data  that  are  to  be  used  in  scientific  studies.  This  practice  ensures  that  the  data  is  reliably   managed,  is  supportive  of  discovery,  and  is  fully  utilized.    It  is  important  to  point  out  that   data  across  this  entire  lifecycle  view  should  not  be  considered  data  at  rest,  but  rather  data   that  is  discoverable,  accessible,  and  utilizable  to  update  plans  and  inform  other  decisions,   support  local  operations,  and  of  course  enable  science.  A  well-­‐architected  data  system  from   onboard  data  capture  to  ground-­‐based  operations  through  data  analysis  must  be  in  place  to   support  all  of  these  objectives.

Critical  areas  of  the  lifecycle  (with  considerations  at  each  stage)  include: a) Data  Generation:  Perform  original  processing  at  the  sensor/instrument   b) Data  Triage:  Make  choices  at  the  collection  point  about  which  data  to  keep  

c) Data  Compression:  Maximize  information  throughput  against  available  bandwidth   d) Data  Transport:  Improve  resource  efficiencies  to  enable  moving  the  most  data   e) Data  Processing:  Increasing  computation  availability    

f) Data  Archiving:  Scaling  the  capture,  management  and  distribution  of  data   g) Data  Visualization:  Develop  and  apply  visualization  techniques  to  enable  data  

exploration  and  insights    

h) Data  Analytics:  Create  services  to  integrate  the  analysis  of  massive,  distributed,   heterogeneous  data,  and  propagate  uncertainties.  

 

In  addition,  there  are  relevant  technologies  and  architectural  approaches  that  are   inherently  cross  cutting  in  that  they  apply  equally  to  several  Earth  Science  areas  of  

investigation.    This  perspective  is  particularly  important  for  architecture,  which  can  enable   how  various  technologies  can  be  integrated  to  construct  a  general  data-­‐intensive  approach   for  the  NASA  Earth  Science  enterprise.  

Table 1: Proposed AIST Technology Areas

Technology Name

Data Lifecycle Area

(s) Description

Big Data Architecture Earth Science Remote

Sensing

Cross-Cutting Definition of a scalable data big data lifecycle architecture for earth observing systems identifying how Big Data can scale from onboard computing to data analysis to increase science yield.

Big Data Information

Models and Semantics Cross-Cutting Advanced semantic technologies for defining, deriving, and integrating heterogeneous ontologies and information models as applied across the entire data lifecycle (onboard, ground-based operations, archives, analysis) Onboard data science

methods for data triage Data triage Onboard data science methods for real-time event detection, and planning. Onboard data science

methods for data reduction

(11)

Massive Data Movement

Technologies Data Transport Massive data movement technologies for ground-based networks from operations through analysis Real-time ground-based

data science methods Data Processing Real-time ground-based data science methods for data reduction and real-time event detection for massive data streams as part of the data lifecycle architecture.

Open source data

processing frameworks Data Processing Open source data processing and workflow frameworks that can massively scale to computational infrastructures (HPC, public cloud, etc.) handling large data streams, products, including near-real time constraints, as part of the data lifecycle architecture.

Reusable data science methodologies for missions and science

(1) Data Processing; (2)

Data Analytics Development of reusable data science methodologies for analysis of data on the ground as part of the data lifecycle architecture. This includes on-demand data analytics for massive data repositories.

Federated data access Data Archives Federation of data access from distributed repositories as part of the data lifecycle architecture, moving towards on-demand distributed data analytics Massive Data Distribution Data Archives Massive data distribution for large-scale repositories and archives including methods for data reduction, computation, etc., as integrated, on-demand data analytics.

Intelligent search and

mining (1) Data Archives; (2) Data Analytics Provide methods for intelligent search and mining of massive data. This may include integration of on-demand analytics to perform deep searches. Visualization of massive

data sets

Visualization Visualization of massive data sets including data reduction methods that are driven by domain.

On-demand distributed data analytics

Data Analytics On-demand data analytics that can integrate data from archives, repositories, etc., applying data science methods (data reduction, fusion, feature detection, etc.) provided through a computational infrastructure

Distributed data analytics Data Analytics Analysis of data across distributed archives to support Earth system science Uncertainty Quantification;

Measurement Science

Data Analytics Quantification and management of uncertainty through all data processing steps, and subsequently through analytical algorithms, as part of a measurement science strategy for data fusion and data science Open source data

management/science frameworks

(1) Data Archives; (2) Data Analytics

Open source data management/science frameworks that can massively scale to handle and manage large data streams, products, including near-real time constraints, as part of the data lifecycle architecture, for archiving and analytics as part of a big data cyber-infrastructure.

Computational Infrastructures

(1) Data Processing; (2) Data Analytics

Computational Infrastructures to scale data analytics using HPC and public cloud. This includes on-demand massive HPC and storage for integration to drive analytics.

Figure  5  shows  the  mapping  of  these  technologies  to  the  Big  Data  Analytics  architecture   concept.  A  key  element,  as  mentioned,  is  the  integration  of  the  various  technologies  based   on  a  data-­‐driven  architecture  view.        

(12)

Figure 5: Architecture-Technology Mapping

VIII.  Benchmarking

Several  U.S.  agencies  have  initiated  efforts  in  Big  Data.  This  includes  technology   investments,  new  initiatives  and  new  organizations.    These  efforts  are  captured  in  the   appendix  and  are  summarized  below.  

 

Agency Big Data Overview and Strategy

National Institutes of Health (NIH)

The NIH has initiated a new program in Data Science and appointed an Associate Director to lead the effort. They are establishing an NIH Commons (computation, software, standards, etc.) through its Big Data to Knowledge (BD2K) initiative. The NIH commons will provide capabilities to various NIH institutes who support directed research efforts. Efforts focus on enabling data management and big data analytics capabilities.

National Science Foundation (NSF)

The NSF has several initiatives coordinating through the Office of Cyberinfrastructure (OCI). OCI coordinates with various disciplines within the NSF including the EarthCube program that seeks to build a national Geosciences Cyberinfrastructure. Goals of the NSF include:

1) Derive knowledge from data;

2) Develop new cyber-infrastructures to manage, curate and serve data; 3) Develop new approaches for education and workforce development; and 4) Enable new types of inter-disciplinary collaboration, community building

Data$ Acquisi+on$ and$$ Command$ Instrume nt$ Opera+on s$ EDOS/ GDS$ L0A$$ Processin g$ Instrume nt$ Opera+on s$ EDOS/ GDS$ L0A$$ Processin g$ Instrument$ Opera+ons$ Ground$Data$ Systems$ L0A$$ Processing$ Mission$ Opera+on s$ Mission$ Opera+ons$ TDRS% Network% On$Board$ Processing$ Data Science Infrastructure (Data, Algorithms, Machines) Applica+ons$ Data$Analy)cs$ Centers$ Other%Data% Systems%(e.g.% NOAA)% Other%Data% Systems%(e.g.% NOAA)% Other$$ Data$Systems$$ (InDSitu,$Other$ Agency,$etc)$ Decision$ Support$ $ On$Demand$ Algorithms$ Science$ Data$ Manage$ Satellite$ Instrument$ Data$ Systems$ Science$ Data$ Manage$ Airborne$ Data$ Science$ Data$ Manage$ NASA$ Data$ Archives$ Research$ Science$Teams$ Cross9Cu;ng% •  Data%System%Architectures% •  InformaBon%Architectures% Onboard% •  Data%Triage% •  Data%ReducBon% Ground%System% •  Real9Time%Data%Triage% •  Reusable%Data%Science%Methods% •  On9demand%workflows,%computaBon% •  Integrated%Cyberinfrastrcture% Cyberinfrastructure% •  Open%Source%Data%Management% %%%%%%%%%/Processing%Frameworks% •  Data%Movement% •  Federated%Data%Access% •  Scalable%ComputaBon%and%Storage% Data%AnalyBcs%and%Viz% •  Distributed%Data%AnalyBcs% •  On9demand%computaBon% •  Intelligent%search%and%mining% •  Uncertainty%QuanBficaBon% •  VisualizaBon%of%massive%data%sets%

(13)

DARPA

DARPA has several programs in Big Data including the XDATA and Memex Programs that are developing data science frameworks for big data analytics and mechanisms to explore deep searching of the Internet. DARPA is working to explore the use of open source technologies and their application to these programs.

NOAA

NOAA is working to explore commercial opportunities to build cyber-infrastructures. This includes the use of cloud-based computing capabilities and support to scale data management and computation. NOAA is also participating in the Big Earth Data Initiative (BEDI) project.

Department of Energy (DOE)

DOE has been exploring programs in extreme scale science, particularly as it relates to high performance computing (HPC). The goal is to address the combined challenges of Big Data and Big Compute to develop a exascale computing environment for simulation and data analysis at scale cutting across various disciplines in energy, biology and climate.

USGS

USGS is exploring its role in big data, focusing on data capture and integration, as well as sharing and leveraging HPC infrastructure across the agency. Programs such as EROS are exploring new architectural approaches to scale for the future.

National Institutes of Standards and Technology (NIST)

NIST has established a program in Data Science focusing on the development of architectures, use cases, standards and interoperability. They are also focused on areas including measurement foundations/principles to increase the accuracy of derived inferences from massive data.

 

Science  in  broadest  terms  unifies  the  data  science  technology  needs  of  NASA  and  other   agencies.    Beyond  NASA  science  disciplines  such  as  climate  science,  other  disciplines  such   as  biological  science  are  declaring  a  direct  need  for  new  technologies  to  support  science   data  analysis  (often  captured  as  “data-­‐intensive  science”  or  “data-­‐driven  science”).  This   should  not  be  surprising,  given  the  vast  quantities  of  data  being  produced  in  such  areas  as   genomics  and  proteomics.    Data  science  methods  that  automate  the  extraction,  

classification,  reduction  and  discovery  from  massive  data  sets  will  have  direct  value  across   several  science  disciplines  and  agencies.  

 

International  agencies,  such  as  the  European  Space  Agency  (ESA),  also  have  initiated   several  efforts  in  Big  Data.    In  2013  and  2014,  ESA  held  a  workshop  on  Big  Data  for  Space3  

covering  several  of  the  technologies  that  are  described  in  this  report.    Results  of  the  recent   conference  highlight  movement  towards  the  big  data  lifecycle,  enabling  integrated  

analytics,  integrating  high-­‐performance  computing  (HPC)  with  data  infrastructures,  new   techniques  for  machine  learning,  and  a  shift  towards  new  cyber-­‐infrastructures.      

International  groups  such  as  the  International  Virtual  Observatory  Alliance  (IVOA),  

International  Planetary  Data  Alliance  (IPDA),  and  the  Global  Organization  for  Earth  System   Science  Portals  (GO-­‐ESSP)  have  developed  efforts  to  share  data  and  technology  across   international  boundaries  in  order  to  support  world-­‐wide  science  data  analysis.

IX.  Ten  Year  Roadmap  of  Capabilities   What  is  this  table?  It’s  not  explained.    

Today Near-Term (2-5 Years) Far Term (5-10 Years) Highly curated data repositories Scalable data science framework Onboard data science and reduction

methods

Access and search methods for distributed Integrated data analytics Distributed data analytics and computation

3

(14)

repositories

Core data standards Data provenance for reproducibility Integrated uncertainty analysis Machine Learning Techniques Unified architecture for data exploration

and analysis Virtual observational missions Is this about machine learning? On-demand data science methods

integrated with data repositories

Virtual, Immersive Visualization Environments for Massive Data Is this about machine learning? Data visualization methods

Is this about machine learning? Scalable computing infrastructure

 

Disruptive  Approaches  Leveraging  Computational  Science  (5-­‐10  Years)

Far-Term Disruptive Capabilities Description

Onboard data science and reduction methods Increase onboard autonomy, planning and data processing as close to the instrument as possible as part of a broader big data strategy to increase

science return.

Distributed data analytics and computation Shift to a data analytics ecosystem to enable analysis of distributed data from multiple instruments (ground-based, airborne, satellite) as well as comparison

against climate models.

Integrated uncertainty analysis Develop an architectural approach that manages uncertainly levels as data and computational demands change.

Virtual observational missions Shift towards a data-driven approach for missions where new instruments integrate into an overall virtual infrastructure; support planning of new science

goals and observations from a combination of instruments and archival data. Virtual, Immersive Visualization Environments for

Massive Data

Enable immersive environments for scientific analysis as well as outreach using observational data from NASA missions.

I  would  like  to  see  an  additional  role  in  this  table  devoted  to  the  interaction  between  the   architecture  and  the  (statistical)  methods  that  define  analytical  algorithms.  New  statistical   methods  optimized  for  a  given  architecture;  what  architecture  is  best  given  a  set  of  target   analyses;  trade-­‐off  between  uncertainty  and  cost,  etc.    

Technology  Prioritization

Technology Name Priority Applicable Far-Term Capability Big Data Architecture for Earth Science Remote Sensing Highest Cross Cutting

Big Data Information Models and Semantics Medium Distributed data analytics and computation

Onboard data science methods for data triage Medium Onboard data science and reduction Onboard data science methods for data reduction Medium Onboard data science and reduction

Massive Data Movement Technologies Medium Distributed data analytics and computation Real-time ground-based data science methods Highest Virtual observational missions

Open source data processing frameworks Highest Distributed data analytics and computation; Virtual observational missions; Virtual, immersive visualization

environments for massive data Reusable data science methodologies for missions and

science

Highest Virtual observational missions; Distributed data analytics and

computation Federated data access Medium Distributed data analytics and

computation Massive Data Distribution Medium Distributed data analytics and

(15)

Intelligent search and mining Medium Distributed data analytics and computation Visualization of massive data sets Medium Virtual, immersive visualization

environments for massive data On-demand distributed data analytics Highest Distributed data analytics and

computation Distributed data analytics Highest Distributed data analytics and

computation Uncertainty Quantification; Measurement Science Medium I thing this should be

Highest Integrated uncertainty analysis Open source data management/science frameworks Medium Distributed data analytics and computation; Virtual observational missions; Virtual, immersive visualization

environments for massive data Computational Infrastructures Highest Distributed data analytics and

computation; Virtual observational missions; Virtual, immersive visualization

environments for massive data

X.  Overall  Recommendations

● NASA  data  technology  needs  to  shift  from  ad  hoc  investments  across  the  mission   and  science  data  lifecycle  to  an  integrated  architecture  where  technology  

investments  fit  into  a  broader  capability  and  vision  to  enable  Earth  Science.  

● The  NASA  data  ecosystem  needs  to  shift  from  a  stewardship  model  to  a  data-­‐driven   discovery  model  where  both  data  management  and  data  discovery  are  enabled   through  a  systematic  data  architecture,  computational  infrastructure  and  lifecycle   model.  

● Data  architectures  should  be  modeled  and  assessed  overall  to  plan  for  technology   capabilities  and  improvements  to  ensure  scalability,  and  to  address  performance,   cost,  and  uncertainty  management  goals.  

● Data  discovery  methods  should  be  applied  across  the  entire  data  lifecycle  to  support   scalability  and  discovery  at  each  point,  from  onboard  computing,  to  data  processing   and  archiving,  to  distributed  data  analytics.  

● Architectures  should  enable  flexible  tradeoffs  of  where  to  and  how  to  compute,  to   include  the  improved  integration  of  HPC  and  data  infrastructures.    

● New  capabilities  should  ensure  reproducibility  of  derived  scientific  results.   ● Computational  and  data  science  should  play  an  important  role  in  planning  new  

missions  including  identification  of  how  data,  algorithms,  and  computing  can  be   applied  and  integrated  to  improve  overall  data  discovery.    

 

References

[1]  2014  NASA  Office  of  the  Chief  Technologist  Roadmap  (under  development)   [2]  National  Research  Council,  “Frontiers  in  Massive  Data  Analysis”,  2013.  

(16)

[3]  2011  PCAST  Report  on  Information  Technology.  

[4]  American  Geophysical  Union,  “Trends  in  Earth  and  Space  Science   Appendix  A.    NASA  Mission/Science  Data  Lifecycle

As  the  data-­‐intensive  nature  of  NASA  science  and  exploration  missions  increases,  there  is   an  increasing  need  to  consider  the  data  lifecycle  from  the  point  of  collection  all  the  way  to   the  application  and  use  of  the  data.    Considerations  across  the  entire  lifecycle  need  to  be   made  in  order  to  support  scalability  and  use  of  the  data.    Furthermore,  missions  may   require  that  data  reduction  and  intelligent  triage  be  done  on  the  data  itself  across  the   lifecycle  to  identify  which  data  should  be  captured  and  archived.    In  addition  to  

considerations  of  the  data  and  software,  it  is  critical  that  common  information  models  be   developed  and  defined  in  order  to  ensure  consistent  definitions  of  the  data  are  applied  so   that  the  data  itself  can  be  accurately  managed,  discovered  and  used.    It  is  important  to  point   out  that  data  across  this  entire  lifecycle  should  not  be  considered  data  at  rest,  but  rather   data  that  should  be  discoverable,  accessible,  and  usable  to  update  plans,  support  local   operations,  and  enable  science.  As  a  result,  a  well-­‐architected  data  system  from  on-­‐board   data  capture  through  ground-­‐based  operations  and  data  analysis  must  be  in  place  to   enabling  scalability  at  multiple  points  across  that  lifecycle.  

Critical  areas  of  the  lifecycle  (with  considerations  at  each  stage)  include:

a) Data Generation: Performing original processing at the sensor/instrument b) Data Triage: Make choices at the collection point about which data to keep

c) Data Compression: Maximize information throughput against available bandwidth d) Data Transport: Improve resource efficiencies to enable moving the most data e) Data Processing: Increasing computation availability

f) Data Archiving: Scaling the capture, management and distribution of data

g) Visualization: Develop and apply analysis techniques to enable data understanding from visualization of massive data

h) Data Analytics: Create analytics services to integrate massive, distributed, heterogeneous data

Appendix  B.    ESTO/AIST  Proposed  Big  Data  Technology  Thrust  Areas.

Technology Name Data Lifecycle Area (s) Description Big Data Architecture for

Earth Science Remote Sensing

Cross-Cutting Definition of a scalable data big data lifecyce architecture for earth observing systems identifying how Big Data can scale from onboard computing to data analysis to increase science yield.

Big Data Information Models and Semantics

Cross-Cutting Advanced semantic technologies for defining, deriving, and integrating heterogeneous ontologies and information models as applied across the entire data lifecycle (onboard, ground-based operations, archives, analysis) Onboard data science

methods for data triage Data triage Onboard data science methods for real-time event detection, and planning. Onboard data science

(17)

reduction Massive Data Movement

Technologies

Data Transport Massive data movement technologies for ground-based networks from operations through analysis

Real-time ground-based

data science methods Data Processing Real-time ground-based data science methods for data reduction and real-time event detection for massive data streams as part of the data lifecycle architecture.

Open source data processing frameworks

Data Processing Open source data processing and workflow frameworks that can massively scale to computational infrastructures (HPC, public cloud, etc) handling large data streams, products, including near-real time constraints, as part of the data lifecycle architecture.

Real-time ground-based

data science methods Data Processing Real-time ground-based data science methods for data reduction and real-time event detection for massive data streams as part of the data lifecycle architecture. Duplicated?

Reusable data science methodologies for missions and science

(1) Data Processing; (2) Data Analytics

Development of reusable data science methodologies for analysis of data on the ground as part of the data lifecycle architecture. This includes on-demand data analytics for massive data repositories.

Federated data access Data Archives Federation of data access from distributed repositories as part of the data lifecycle architecture, moving towards on-demand distributed data analytics Massive Data Distribution Data Archives Massive data distribution for large-scale repositories and archives including methods for data reduction, computation, etc, as integrated, on-demand data analytics.

Intelligent search and mining

(1) Data Archives; (2) Data Analytics

Provide methods for intelligent search and mining of massive data. This may include integration of on-demand analytics to perform deep searches. Visualization of massive

data sets

Visualization Visualization of massive data sets incuding data reduction methods that are driven by domain.

On-demand distributed

data analytics Data Analytics On-demand data analytics that can integrate data from archives, repositories, etc, applying data science methods (data reduction, fusion, feature detection, etc..) provided through a computational infrastructure

Distributed data analytics Data Analytics Analysis of data across distributed archives to support Earth system science including application of novel machine learning, statistical methods (e.g., data fusion) and other computational capabilities.

Uncertainty Quantification; Measurement Science

Data Analytics Management of uncertainty in all phases of the data life cycle and analysis as part of a measurement science strategy for data fusion and data science Open source data

management/science frameworks

(1) Data Archives; (2)

Data Analytics Open source data management/science frameworks that can massively scale to handle and manage large data streams, products, including near-real time constraints, as part of the data lifecycle architecture, for archiving and analytics as part of a big data cyberinfrastructure.

Computational Infrastructures

(1) Data Processing; (2) Data Analytics

Computational Infrastructures to scale data analytics using HPC and public cloud. This includes on-demand massive HPC and storage for integration to drive analytics.

Appendix  C.    Mapping  2014  OCT  TA-­‐11  to  Proposed  ESTO/AIST  Big  Data  Technology   Thrust  Areas.

The  2014  NASA  OCT  TA-­‐11  Roadmap  team  identified  the  following  areas:    

Flight  Computing:  Includes  technologies  to  support  greater  computation  and  data   management  at  the  point  of  collection  onboard.    In  some  cases,  data  reduction  at  the   point  of  collection  through  intelligent  triage  methods  may  be  required.    Flight  

computing  technologies  include  ultra-­‐reliable,  radiation-­‐hardened  platforms,  which,   until  recently,  have  been  extremely  costly  and  limited  in  performance.    

Ground  Computing:  Includes  exascale  supercomputing  and  data  storage,  as  well  as   quantum,  cognitive,  and  other  types  of  advanced  computing,  for  Big  Data  analysis  

(18)

and  high-­‐fidelity  physics-­‐based  simulations  for  Earth  and  space  science,  and   aerospace  research  and  engineering.    

Science,  Engineering,  and  Mission  Data  Lifecycle:  As  the  data-­‐intensive  nature  of   NASA  science  and  exploration  missions  increases,  there  is  an  increasing  need  to   consider  the  data  lifecycle  from  the  point  of  collection  all  the  way  to  the  application   and  use  of  the  data.  

Intelligent  Data  Understanding:    Intelligent  data  understanding  (IDU)  refers  to   the  capability  to  automatically  mine  and  analyze  large  datasets  that  are  large,  noisy,   and  of  varying  modalities  (discrete,  continuous,  text,  graph,  etc.),  in  order  to  extract   or  discover  information  that  can  be  used  for  further  analysis  or  in  decision  making,   on  the  ground  or  onboard.  It  is  closely  coupled  to  the  capability  to  detect  and   respond  to  interesting  events  and/or  to  generate  alerts.  

Semantic  Technologies:  Technologies  that  enable  data  understanding,  analysis,   and  automated  consulting  and  operations.  

 

Each  of  these  areas  has  cross-­‐cutting  challenges  which  are  relevant  to  AIST  as  follows.   Flight  Computing

TA Technology Name Description Applicable Technology Area 11.1

.1.3 High Performance Flight Software Enables on-board, high performance autonomy and data processing Onboardreduction, real-time event detection, and data science methods for data planning

Ground  Computing

TA Technology Name Description Applicable Technology Area 11.1.

2.1

Exascale Supercomputer

Provides peak computational capability of ≥ 1 exaflop (10^18 floating point operations per second) for exascale performance of NASA computations, with excellent energy efficiency and reliability, to support NASA’s exponentially growing high end computational needs.

Computationalinfrastructures to scale data analytics

11.1.

2.2 Automated Exascale Software Development Toolset

Provides automated, exascale application performance monitoring, analysis, tuning, and scaling.

Computationalinfrastructures to scale data analytics

11.1.

2.3 Supercomputing File Exascale System

Provides online data storage capacity of ≥ 1 exabyte, enabling data storage for exascale M&S and data analysis, with sufficient performance and reliability to maintain productivity for a broad array of NASA applications.

Computationalinfrastructures to scale data analytics

11.1. 2.5

Public Cloud Supercomputer

Provides additional resources for NASA supercomputer users, such as for mission-critical computing in an emergency.

Computationalinfrastructures to scale data analytics

(19)

TA Technology Name Description Applicable Technology Area 11.4.1. 1 Reference Information System Architecture Frameworks

Provide reference architectures for the end-to-end science and engineering data lifecycle.

Definition of a scalable data big data lifecyce architecture for earth observing systems identifying how Big Data can scale from onboard computing to data analysis to increase science yield.

11.4.1.

2 Information Distributed Architecture Frameworks

Provide reference information architectures to define data across the end-to-end engineering and science data lifecycle

Advanced semantic technologies for auto generating ontologies and information models from existing large, distributed, existing data. 11.4.1.

4

Onboard Data Capture and Triage

Methodologies

Apply novel machine learning capabilities on board to support data reduction, model-based compression and triage of massive data sets

Onboard data science methods for data reduction, real-time event detection, and planning. 11.4.1.

5 Triage and Data Real-time Data Reduction Methodologies

Apply novel machine learning capabilities in ground data processing systems to support data reduction and triage of massive data sets

Real-time ground-based data science methods for data reduction and real-time event detection for massive data streams as part of the data lifecycle architecture. 11.4.1. 6 Scalable Data Processing Frameworks

Provide scalable software processing frameworks for processing scientific and engineering data sets.

Open source data processing frameworks that can massively scale to handle large data streams, products, including near-real time constraints, as part of the data lifecycle architecture.

11.4.1.

7 Engineering and Massive Science Data

Analysis Methodologies

Provide scalable methodologies for analysis of

massive data. Development of methodologies for analysis of data on the ground reusable data science as part of the data lifecycle architecture. This includes on-demand data analytics for massive data repositories.

11.4.1. 8

Remote Data Access Framework

Provides access to and sharing of distributed data sources in a secure environment.

Federation of data access from distributed repositories as part of the data lifecycle architecture, moving towards on-demand distributed data analytics

11.4.1.

9 Massive Data Movement Services

Develop new technologies for the movement of

massive, multi-petabyte data over the network. Data movement output, etc) across the data lifecycle is critical to (observations, climate model scaling for Big Data. Develop capabilities to improve movement of data between points of the data lifecycle from networks to parallel methods for data transfer. 11.4.1. 10 Large-Scale Data Dissemination Environments

Enable scaling data infrastructures (software, computation, networks, etc) that are required to support large-scale data dissemination

Massive data distribution for large-scale repostiroeis and archives including methods for data reduction, computation, etc, as integrated, on-demand data analytics.

11.4.1. 11

Toolset for Massive Model Data

Makes data/information transparent, scalable and usable when infusing multiple large and diverse datasets in complex models.

Scale tools that provide on-demand data analytics that bring together distributed models,

observational data, and provides a computational infrastructure to enable analysis and

intercomparison. On Demand Data

Analytics

Provides analytics service coupled with computational infrastructures.

On-demand data analytics that can integrate data from archives, repositories, etc, applying data science methods (data reduction, fusion, feature detection, etc..) provided through a computational infrastructure

Intelligent  Data  Understanding

TA Technology Name Description Applicable Technology Area 11.4.

2.1 Intelligent Data Collection and Prioritization Toolset

Provides a means to reduce the size of the data (e.g., removing clouds), remove corrupted data and/or collect complementary data for

value-Development of reusable data science methodologies for analysis of data on the ground as part of the data lifecycle architecture. In

(20)

added information. particular, intelligent algorithms for data reduction, classification, generation, etc…

11.4.

2.2 Event Detection and Intelligent Action Toolset

Provides computational mechanisms to identify high information content data, either pre-specified, novel or anomalous, including multi-spacecraft collaborative event detection and to take an autonomous or assisted onboard decision, as a result of data analysis

Onboard and ground-based data science methods for data reduction, real-time event detection, and planning.

11.4. 2.3

Data on Demand Toolset

Enables users and models to task sensors and to leverage sensor webs to develop "on-demand" products

This is representative of multiple AIST areas including architecture, on demand data processing, etc

11.4.

2.4 Search and Mining Intelligent Data Toolset

Develops search services and engines for massive, distributed data holdings; enables the application of different searching rules/schemes that learn from past searches and develop “agents” to find and create the most relevant products. It includes rich queries, including fact-based, free-text searches, web-service based indexing as well as anomaly/novelty detection, where the system suggests items of interest to the user without the user necessarily prescribing the information being sought.

Provide methods for intelligent search and mining of massive data. This may include integration of on-demand analytics to perform deep searches.

11.4.

2.5 Data Fusion Toolset Combines data from multiple sources (e.g. remote sensing, in-situ, models) in order to make inferences that might otherwise not be possible with single data sources, or in order to improve the uncertainty characteristics of these inferences, over what might be achieved with single data sources.

Ground-based data science methods for data fusion including uncertainty quantification require to make scientific inferences.

Semantic  Technologies

TA Technology Name Description Applicable Technology Area 11.4.

3.1

Semantic Enabler for Data (Text, Binary

and Databases)

Ingests completely data of all types and produces a data model and a precision ontology, improves the quality of any existing metadata, including provenance and quality of the source document and disambiguates words in the context of the document, database or file and visualizes the results.

Advanced semantic technologies for auto generating ontologies and information models from existing large, distributed, existing data. This could be extremely useful for reverse engineering a large-semantic model for a scientific discipline or sub-discipline. 11.4. 3.2 Ultra Large-Scale Visualization and Incremental Toolset

Enables and automates analysis of ultra large scale datasets and visualizes the results in the context of the knowledge domain

Visualization of massive data sets incuding data reduction methods that are driven by domain. 11.4.

3.3

Semantic Bridge Framework

Enables the alignment of two or more data sources based on their respective ontological descriptions to achieve semantic interoperability and facilitates calculations to be properly made using data from each dataset.

Advanced semantic technologies for ontology integration. As NASA moves to more system science, spans multiple missions, instruments, systems, environments, etc, this is critical. Constructing On-demand data analytics capabilities will require it.

Collaborative  Systems

TA Technology Name Description Applicable Technology Area 11.4.

4.3 Distributed Collaborative Science Data Analysis

Enable data, computation and services to be brought together to support distributed data analysis in collaborative environments for science

On-demand data analytics that can integrate data from archives, repositories, etc, applying data science methods (data reduction, fusion, feature

(21)

Frameworks detection, etc..) provided through a computational infrastructure enabling scientific collaboration.

Mission  Systems

TA Technology Name Description Applicable Technology Area 11.4.

5.2

Adaptive Systems Framework

Manages a set of interacting or interdependent entities, real or abstract, forming an integrated whole that together are able to respond to environmental changes or changes in the interacting parts.

Open source data processing frameworks that can massively scale to handle large data streams, products, including near-real time constraints, as part of the data lifecycle architecture.

Cyberinfrastructure

TA

Technology

Name Description

Applicable Technology Area 11.4.

6.1 On-Demand, Multi-Mission Data Storage and Computation

Provides scalable storage and computing available on demand across projects including both internal and external (hybrid) clouds at center and NASA levels.

Computationalinfrastructures to scale data archive and analytics. Note: this could an area for joint exploration between ESTO/AIST and NASA OCIO. 11.4. 6.2 Scalable Data Management Frameworks

Are extensible, scalable data management frameworks that can take advantage of massive storage and computing resources

Open source data mangement/science frameworks that can massively scale to handle and manage large data streams, products, including near-real time constraints, as part of the data lifecycle architecture for mission data management.

11.4.

6.3 Scalable Data Archives Systems Support scalable archives that can capture, manage, distribute and preserve massive data sets, both engineering and science.

Open source data mangement/science frameworks that can massively scale to handle and manage large data streams, products, including near-real time constraints, as part of the data lifecycle architecture, for archiving. 11.4.

6.4 High Performance Networking Provide terabit data networks to handle movement of massive datasets, particularly those for scientific research.

Data movement (observations, climate model output, etc) across the data lifecycle is critical to scaling for Big Data. Provide high speed networks as core infrastructure. Note: this could an area for joint exploration between ESTO/AIST and NASA OCIO.

Human  System  Interaction

TA Technology Name Description Applicable Technology Area 11.4.

7.6

Assistive tool for heterogeneous data integration

Allows integration of heterogeneous data sources to enable querying and linking data.

Federation of data access from distributed repositories as part of the data lifecycle architecture, moving towards on-demand distributed data analytics. Advanced semantic technologies for ontology integration.

Appendix  D.    Classification  of  Agency  Big  Data  Efforts Based  on  NITRD,  and  other  presentations;  etc.

Figure

Figure 1: Data Lifecycle for NASA Science Missions
Figure 2: End-to-End NASA Mission/Science Data Lifecycle
Figure 3: Today, Stewardship Model of NASA Data Systems
Figure 4. Future, NASA Mission-Science Data Ecosystem
+4

References

Related documents

Nurses feel that both the software and the nurse are essential to clinical decision-making, and describe a process of ‘dual decision- making’, with the nurse as active decision

The publisher or other rights-holder may allow further reproduction and re-use of this version - refer to the White Rose Research Online record for this item.. Where records

human body can persist through death is equally a reason to suppose that a. human animal can persist through death, and any reason to deny

In this scenario total energy consumed is above the “target” energy demand for the transport sector for this scenario of 403 TWh (Table 3) and again, it was not possible to push

The 10 resident domains cluster into three groups : universal requirements for older people living in residential settings (privacy, the ability to personalise their

to the Convention for the Protection of Human Rights and Dignity of the Human Being with Regard to the Application of Biology and Medicine, on the Prohibition of Cloning Human

The role of dopamine in chemoreception remains to be fully established, but it is clear that stimulus evoked transmitter release from type I cells on to afferent nerve endings is a

The International Board held that any organisation composed of at least 75% of Business and Professional Women was eligible for membership in the Federation, and the Council of