NASA Earth Science Research in Data and Computational Science Technologies
Report of the ESTO/AIST Big Data Study Roadmap Team September 2015
I. Background
Over the next decade, the dramatic growth of NASA's Earth Science data collections is projected to outpace the ability of scientists to analyze those data meaningfully. What have been termed the "Vs" of data (volume, variety, velocity, etc.) pose significant challenges for both Earth Science missions and researchers, because traditional methods for developing science data pipelines, distributing scientific datasets, and performing effective analysis will no longer suffice. The data volumes supporting the Intergovernmental Panel on Climate Change's (IPCC) Sixth Assessment Report, for example, are projected to grow to tens of petabytes. Future remote sensing projects include an increasing set of data-intensive instruments that will pose severe challenges to existing systems. Likewise, for Earth Science data archives, these massive increases will shift the focus from distributing whole data sets to providing online services for computation and analysis. In addition, instruments flown on NASA Earth-observing satellites will continue to generate data that stress the boundaries of end-to-end data systems. Taken together, these challenges require new thinking in data capture, management, processing, and analysis, both onboard and on the ground.
Big Data and Data Science are terms used to describe this data deluge and the discipline devoted to addressing it. For the purposes of this document, Big Data describes a state of data collection and analysis that exceeds conventional methods or software systems, necessitating new approaches that change the paradigm by which data are collected and analyzed. Data Science focuses on the use of systematic architectural, software, methodological, and algorithmic approaches (e.g., data management, intelligent algorithms, statistics, and visualization) for generation, capture, management, analysis, and discovery from massive data sets and streams (i.e., big data), including those from multiple sensors, models, archives, and other sources, to enable research and decision support. A growing number of universities are establishing programs in Data Science.
Data Science is thus emerging as a critical area of research and technology to advance scientific discovery. The sheer increase in data volume over the next decade, coupled with the highly distributed and heterogeneous nature of scientific data sets, requires these new approaches. Numerous technologies are under development in multiple communities to address Big Data challenges. While NASA can leverage some of this capability (e.g., map-reduce to scale computation), significant technologies still need to be developed that scale to address the data-intensive challenges underlying modern observing systems and the science questions they were created to investigate. Existing techniques also do not address the end-to-end nature of NASA's science-driven observational environment. NASA needs to develop a multi-year plan to address these challenges holistically, rather than with incremental, isolated improvements that are unlikely to scale. Such an approach promises significant scientific yield from NASA's missions, instruments, archives, and research community, and will be necessary in order to remain on the critical path to accomplish science in the Big Data era.
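As an illustration of the kind of scaling technique referenced above, the following minimal sketch (in Python, with a hypothetical granule loader standing in for a real archive reader) applies the map-reduce pattern to a simple data-reduction task: each granule is reduced to a small histogram in parallel, and the partial histograms are merged into one global distribution.

```python
# Minimal map-reduce sketch: reduce each granule to a histogram (map),
# then sum the histograms into a global distribution (reduce).
# load_granule() is a hypothetical stand-in for a real archive reader.
from concurrent.futures import ProcessPoolExecutor

import numpy as np

BINS = np.linspace(250.0, 320.0, 71)  # e.g., brightness temperature bins (K)

def load_granule(granule_id):
    # Hypothetical: a real implementation would read a granule from storage.
    rng = np.random.default_rng(granule_id)
    return rng.normal(loc=288.0, scale=8.0, size=100_000)

def map_stage(granule_id):
    # Map: reduce one granule to a fixed-size histogram (a few hundred bytes).
    counts, _ = np.histogram(load_granule(granule_id), bins=BINS)
    return counts

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        partial_counts = list(pool.map(map_stage, range(64)))
    global_counts = np.sum(partial_counts, axis=0)  # Reduce: merge partials
    print("Total observations binned:", int(global_counts.sum()))
```

Because each granule is reduced independently, the map stage parallelizes across as many workers (or nodes) as are available, which is the property that makes this pattern attractive for scaling computation.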
II. Current State: Mission, Instrument and Large-Scale Data Analysis Challenges
Currently, the analysis of large data collections from NASA or other agencies is executed through traditional computational and data analysis approaches, which require users to bring data to their desktops and perform local data analysis. Alternatively, data are hauled to large computational environments that provide centralized data analysis via traditional High Performance Computing (HPC). Scientific data archives, however, are not only growing massive, but are also becoming highly distributed. As a consequence, neither traditional approach provides a good solution for optimizing analysis into the future.
Further, the assumption across the NASA mission and science data lifecycle that all data can be collected, transmitted, processed, and archived will not scale as more capable instruments stress legacy-based systems. A new paradigm is needed in order to increase the productivity and effectiveness of scientific data analysis. This paradigm must recognize that architectural and analytical choices are interrelated and must be carefully coordinated in any system that aims to support efficient, interactive scientific exploration and discovery over massive data collections, from point of collection (e.g., onboard) to analysis and decision support. Expanding on this point, the most effective approach to analyzing a distributed set of massive data may involve exploration and iteration, putting a premium on the flexibility afforded by the architectural framework. The framework should enable scientist users to assemble workflows efficiently, manage the uncertainties related to data analysis and inference, and optimize deep-dive analytics to enhance scalability. These challenges are not limited to NASA: multiple agencies are confronted with the question of how to draw scientific inference from growing, distributed archives, as identified in the appendix.
NASA has already made significant investments in capturing and sharing data from massive, online data systems. The NASA Earth Science Data and Information System (ESDIS) project, most prominently through the Distributed Active Archive Centers (DAACs), provides an excellent foundation for capturing and building high quality repositories. Technology investments to date by NASA (e.g., ROSES ACCESS, ROSES AIST) and the DOE (e.g., Earth System Grid) have focused on developing software services that have improved access to distributed data. The challenge now is how to integrate these capabilities with careful architectural approaches and emerging technologies that will allow NASA to continue to scale the entire data lifecycle and support the data analysis needs of the Earth Science community. The existing infrastructures are prime candidates for grounding a more distributed, scalable computational environment that optimizes scientific data analysis and moves to the era of "Big Data Analytics." An integrated architecture and ecosystem will address the modern data and computing challenges across the NASA mission and science data lifecycle:
● Reproducibility
● Uncertainty
● Data fusion
● Data reduction
● Data movement
● Data visualization
● Cost
● Performance
III. Use Cases
Several use cases have been identified that present challenges to future NASA missions and science. These use cases identify the capabilities required not just to keep pace but to increase the science yield from NASA Earth Science missions, instruments, and data collections. The following table summarizes each use case and its data science challenges.
Use Case: Climate Modeling
Description: Formulate hypotheses from observed empirical relationships; simulate current and past conditions under those hypotheses using climate models; test hypotheses by comparing simulations to observations; evaluate uncertainty of predictions originating from statistical sampling of models and observations.
Data Science Challenge: Highly distributed data sources; fusion of different observations; moving computation to the data; data reduction.
Enabling Mission/Capability: CMIP6 will move towards exascale archives requiring new approaches to evaluating models relative to observational data.

Use Case: Satellite Missions
Description: Missions such as NI-SAR and SWOT will generate massive observational data, yet they have different architectural patterns: compute intensive, data intensive, heterogeneous, etc.
Data Science Challenge: Massive data rates; data movement challenges; computational scalability; archiving and distribution; onboard processing for data reduction/analysis; high-volume data transfer for ground processing.
Enabling Mission/Capability: NI-SAR and SWOT require new approaches for computation, data movement, data archiving and distribution, and analytics.

Use Case: Applications - Hydrology (Central Valley of California)
Description: Understand groundwater dynamics on a regional scale using measurements from satellite, airborne, and in-situ instruments; compare against predictive models.
Data Science Challenge: Distributed computation; highly distributed data sources; data fusion of multiple products; massive new satellite observations.
Enabling Mission/Capability: Integration of data from PALSAR-2, Sentinel, GRACE-FO, ASO, and SMAP; scale to support NI-SAR and SWOT; comparison against models. Requires new architectural approaches for distributed data analytics.

Use Case: Airborne Missions
Description: Airborne missions tend to be much more agile and on-demand. Integrating them into a data ecosystem provides new opportunities to quickly generate and understand various measurements.
Data Science Challenge: On-demand architectures; distributed data sources; on-the-fly data processing; onboard processing for data reduction/analysis; high-volume data transfer for ground processing.
Enabling Mission/Capability: Current missions such as CARVE and the Airborne Snow Observatory; future missions such as the proposed EVI-3 and ASO follow-ons.
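The recurring challenge of "moving computation to the data" that appears in these use cases can be illustrated with a minimal sketch; the Archive class, its in-memory values, and the DAAC names below are hypothetical stand-ins for real distributed repositories, not an existing NASA interface.

```python
# Sketch: each (simulated) archive evaluates the analysis locally and returns
# only a small partial result; the raw data never crosses the network.
import numpy as np

class Archive:
    """Simulates a remote repository that can execute an analysis locally."""
    def __init__(self, name, values):
        self.name = name
        self.values = np.asarray(values)

    def run(self, analysis):
        # In a real deployment, only this small result would be transferred.
        return analysis(self.values)

def partial_mean(values):
    # Reduce a whole archive's holdings to two numbers: (sum, count).
    return values.sum(), values.size

archives = [
    Archive("DAAC-A", np.linspace(270.0, 290.0, 1_000_000)),  # illustrative
    Archive("DAAC-B", np.linspace(280.0, 300.0, 2_000_000)),  # illustrative
]

partials = [a.run(partial_mean) for a in archives]
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
print(f"Fused mean across archives: {total / count:.2f} K")
```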
Other science disciplines, such as biology, carry similar use cases and have identified similar technology needs to those of NASA. These are identified in the section on benchmarking.
IV. Conceptual Architecture: The Data Lifecycle for NASA Earth Science
The data lifecycle is a critical perspective for developing a comprehensive set of capabilities to increase scientific yield from missions. This encompassing approach must include developing capabilities from the point of collection all the way to analysis and extracted understanding. Figure 1 below shows this concept and exposes the need to improve data science capabilities at multiple points of the data lifecycle: from onboard computing, to data triage of massive data streams, to data analysis. This perspective requires architectural considerations for determining and integrating methodologies and technologies across the lifecycle. As an end-to-end concept, the model below provides guidance for NASA investments and capabilities that will be required to support effectiveness and scalability in the Big Data era.
The data lifecycle view reveals that the choices and results of operating on data at one point in the lifecycle inevitably shape, and possibly limit, the possibilities for working with the data at a later point. Coordination must be addressed in both directions: ultimately, objectives for scientific data understanding and reproducibility must back-drive the design and development of capabilities early in the data lifecycle so as to enable desired results, to the greatest extent possible (see Figure 2). There is also an outer loop to the data lifecycle: successful data understanding (and, possibly, disappointments) drives the next round of science instrument and mission proposals.
Figure 1: Data Lifecycle for NASA Science Missions
Figure 2: End-to-End NASA Mission/Science Data Lifecycle
The data lifecycle views shown in Figures 1 and 2 identify a number of challenges across the mission lifecycle that affect the various scientific disciplines that are core to NASA. Earth Science, in particular, benefits from new approaches to improving data analysis given the complex, heterogeneous, high-volume, and distributed nature of the remote sensing data acquired from Earth-observing missions. As an example, use of these data for comparison against and verification of climate model simulations is critical to supporting research in global climate change. However, much of this data is managed in highly distributed repositories with different data representations, formats, access methods, and owners. Significant time and effort are required to access, move, and rectify data from different repositories to support specific analyses of interest to particular researchers. These requirements constrain both the type and scope of analyses that can be performed, and therefore ultimately the scientific hypotheses that can be tested. A major goal for data science is to reduce the cost of large-scale, interactive data analysis and, at the same time, leverage the richness of massive data sets to reduce uncertainty in the answers to crucial scientific questions. This is not possible in the current environment due to lack of scalability. Data science seeks to enable not just more discipline science (e.g., more data incorporated in analysis, more parameters to compare) but better science (e.g., quantitative assessments of uncertainty, repeatability of results), performed in a similar or shorter amount of time despite growing sets of data.
V. Capability Gaps and Drivers
A major gap has been the focus on localized data analysis from individual instruments versus a holistic consideration of all the observing systems and how they fit into a big data analytics view and capability. Data systems today are organized around the capture and archiving of data from specific missions or observing capabilities. However, the integration of data from multiple instruments (spaceborne, airborne, ground-based) is important for supporting scientific research, in particular for moving from isolated data analysis to knowledge discovery through a big data analytics approach. This includes making data available from multiple sources and integrating those data using intelligent algorithms and methods.

In addition, as data grow and more automated methods for data discovery are in place, there are opportunities to improve the efficiency and effectiveness of ongoing mission operations and move them towards a data-driven approach in which data are reduced onboard or at ground stations, prior to archive, as well as during fully offline science analysis. Interpreting data across the lifecycle allows for informed decisions at arbitrary points in the lifecycle, so that mission plans can be updated, new relevant data products generated, and so on. This same full view of the data lifecycle will also inform provenance practices, so that the relevant details, all the way back to the point of collection, can be captured to provide a basis for reproducing the end results of scientific understanding. This is a paradigm shift from how mission and science operations versus analysis are performed today, largely in separate arenas.2
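The data-driven reduction described above can be sketched concretely. The following minimal example, with an assumed window size and detection threshold, triages a simulated data stream so that only segments containing a detected event are retained at full resolution, while quiet segments are replaced by compact summaries before archiving.

```python
# Minimal data-triage sketch: keep full-resolution data only where an event
# is detected; elsewhere retain a compact (median, spread) summary.
# Window size and threshold k are illustrative assumptions.
import numpy as np

def triage_stream(stream, window=512, k=4.0):
    kept_segments, summaries = [], []
    for start in range(0, len(stream) - window + 1, window):
        segment = stream[start:start + window]
        baseline = np.median(segment)
        spread = np.std(segment)
        is_event = spread > 0 and np.max(np.abs(segment - baseline)) > k * spread
        if is_event:
            kept_segments.append((start, segment.copy()))   # full resolution
        else:
            summaries.append((start, float(baseline), float(spread)))
    return kept_segments, summaries

rng = np.random.default_rng(0)
signal = rng.normal(size=8_192)
signal[3_000] += 25.0  # inject one synthetic "event"
events, stats = triage_stream(signal)
print(f"Kept {len(events)} event segment(s); summarized {len(stats)} quiet segments")
```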
Figure 3 below shows the current approach for organizing NASA Earth Science data systems. The approach focuses extensively on the stewardship of data, collecting data into organized and community-accessible archives. This approach has enabled NASA to build high quality data archives, but as the need for systematic approaches to data analysis increases, it is important to take a broader view of how activities across the entire data lifecycle can be organized and integrated. Today, onboard computing, mission operations, science data processing and archiving, and data analysis are performed as generally independent architectures, systems, and components of the data lifecycle.
2 This state of affairs is reflected in the disjoint NASA programs for science missions vs. research and analysis.
Figure 3: Today's Stewardship Model of NASA Data Systems
Taking a broader view, cyber-infrastructures, machine learning and other intelligent algorithms (analytics), statistics, and visualization should be brought together as integrated architectural solutions that can be scaled to meet NASA's data challenges in enabling missions and science. Figure 4 below shows a shift from data generation and stewardship of archives to enabling data analysis through an integrated cyber-infrastructure where data, algorithms, computation, and visualization are brought together across a highly distributed data environment to quickly stand up new analytic capabilities for different stakeholders, measurements, science questions, and applications. This requires new thinking as follows:
1. New architectural approaches across the entire science mission and data lifecycle in order to increase the science yield by scaling and improving the integration of each of the various components of the lifecycle.
2. Increasing capability onboard to support data reduction, autonomous mission (re-)planning, etc., as part of managing bandwidth capabilities.
3. Integrated analytics as part of the real-time mission pipeline.
4. Ability to quickly construct new analytic centers that can unify archives, computing capabilities, and software services to bring heterogeneous data sets as well as models/simulations together for analysis.
5. Systems that can react to increased velocity of data across the lifecycle with new technology approaches for data triage and capture.
6. Interoperability with other agencies including capture of data from their instruments and integration with their ground systems, archives, and analytic capabilities.
7. Quantitative uncertainty management at all stages of the data lifecycle in order to provide the uncertainties required for scientific inference.
Figure 4: Future NASA Mission-Science Data Ecosystem
VI. Proposed NASA Data and Computational Science Technology Areas
As discussed above, as the data-intensive nature of NASA science and exploration missions increases, there is a greater need to consider the data lifecycle from the point of collection all the way to extracted understanding of the data in order to support scalability and full utilization of the data. Furthermore, missions may have no choice but to require that data reduction and intelligent triage be performed across the data lifecycle to identify which data are to be captured and archived. In addition to considerations of the data and software, it is critical that common information models be developed and defined in order to ensure that consistent definitions of the data are applied. In particular, a rigorous, probabilistic definition of uncertainty should be adopted to cover all NASA missions and data so that uncertainty is defined in the same way for all data that are to be used in scientific studies. This practice ensures that the data are reliably managed, supportive of discovery, and fully utilized. It is important to point out that data across this entire lifecycle view should not be considered data at rest, but rather data that are discoverable, accessible, and utilizable to update plans and inform other decisions, support local operations, and of course enable science. A well-architected data system, from onboard data capture to ground-based operations through data analysis, must be in place to support all of these objectives.
Critical areas of the lifecycle (with considerations at each stage) include the following; a minimal pipeline sketch follows the list:
a) Data Generation: Perform original processing at the sensor/instrument
b) Data Triage: Make choices at the collection point about which data to keep
c) Data Compression: Maximize information throughput against available bandwidth
d) Data Transport: Improve resource efficiencies to enable moving the most data
e) Data Processing: Increase computation availability
f) Data Archiving: Scale the capture, management and distribution of data
g) Data Visualization: Develop and apply visualization techniques to enable data exploration and insights
h) Data Analytics: Create services to integrate the analysis of massive, distributed, heterogeneous data, and propagate uncertainties
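The following minimal sketch, with invented stage functions, illustrates how lifecycle stages can be composed as a pipeline in which each stage appends to a provenance trail, supporting the reproducibility objective discussed earlier; it is illustrative only, not a proposed implementation.

```python
# Lifecycle stages as composable functions; each stage records itself in the
# record's provenance, so the end result can be traced back to collection.
import numpy as np

def stage(name):
    """Decorator that appends the stage name to the record's provenance."""
    def wrap(func):
        def run(record):
            record = func(record)
            record["provenance"].append(name)
            return record
        return run
    return wrap

@stage("generate")
def generate(record):                  # (a) Data Generation (synthetic data)
    record["data"] = np.sin(np.linspace(0, 20, 5_000))
    return record

@stage("triage")
def triage(record):                    # (b) Data Triage: keep informative samples
    record["data"] = record["data"][np.abs(record["data"]) > 0.1]
    return record

@stage("compress")
def compress(record):                  # (c) Data Compression: coarse quantization
    record["data"] = np.round(record["data"], 2)
    return record

@stage("analyze")
def analyze(record):                   # (h) Data Analytics: derive a result
    record["result"] = float(record["data"].mean())
    return record

record = {"provenance": []}
for step in (generate, triage, compress, analyze):  # transport/archive elided
    record = step(record)
print(record["result"], record["provenance"])
```

Note how the triage and compression stages constrain what the analytics stage can compute, which is precisely the bidirectional coordination argued for in Section IV.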
In addition, there are relevant technologies and architectural approaches that are inherently cross-cutting in that they apply equally to several Earth Science areas of investigation. This perspective is particularly important for architecture, which governs how various technologies can be integrated to construct a general data-intensive approach for the NASA Earth Science enterprise.
Table 1: Proposed AIST Technology Areas
Technology Name: Big Data Architecture for Earth Science Remote Sensing
Data Lifecycle Area(s): Cross-Cutting
Description: Definition of a scalable big data lifecycle architecture for Earth observing systems, identifying how Big Data can scale from onboard computing to data analysis to increase science yield.

Technology Name: Big Data Information Models and Semantics
Data Lifecycle Area(s): Cross-Cutting
Description: Advanced semantic technologies for defining, deriving, and integrating heterogeneous ontologies and information models as applied across the entire data lifecycle (onboard, ground-based operations, archives, analysis).

Technology Name: Onboard data science methods for data triage
Data Lifecycle Area(s): Data Triage
Description: Onboard data science methods for real-time event detection and planning.

Technology Name: Onboard data science methods for data reduction
Data Lifecycle Area(s): Data Triage
Description: Onboard data science methods for data reduction.

Technology Name: Massive Data Movement Technologies
Data Lifecycle Area(s): Data Transport
Description: Massive data movement technologies for ground-based networks from operations through analysis.

Technology Name: Real-time ground-based data science methods
Data Lifecycle Area(s): Data Processing
Description: Real-time ground-based data science methods for data reduction and real-time event detection for massive data streams as part of the data lifecycle architecture.

Technology Name: Open source data processing frameworks
Data Lifecycle Area(s): Data Processing
Description: Open source data processing and workflow frameworks that can massively scale to computational infrastructures (HPC, public cloud, etc.), handling large data streams and products, including near-real time constraints, as part of the data lifecycle architecture.

Technology Name: Reusable data science methodologies for missions and science
Data Lifecycle Area(s): (1) Data Processing; (2) Data Analytics
Description: Development of reusable data science methodologies for analysis of data on the ground as part of the data lifecycle architecture. This includes on-demand data analytics for massive data repositories.

Technology Name: Federated data access
Data Lifecycle Area(s): Data Archives
Description: Federation of data access from distributed repositories as part of the data lifecycle architecture, moving towards on-demand distributed data analytics.

Technology Name: Massive Data Distribution
Data Lifecycle Area(s): Data Archives
Description: Massive data distribution for large-scale repositories and archives, including methods for data reduction, computation, etc., as integrated, on-demand data analytics.

Technology Name: Intelligent search and mining
Data Lifecycle Area(s): (1) Data Archives; (2) Data Analytics
Description: Provide methods for intelligent search and mining of massive data. This may include integration of on-demand analytics to perform deep searches.

Technology Name: Visualization of massive data sets
Data Lifecycle Area(s): Visualization
Description: Visualization of massive data sets, including data reduction methods that are driven by domain.

Technology Name: On-demand distributed data analytics
Data Lifecycle Area(s): Data Analytics
Description: On-demand data analytics that can integrate data from archives, repositories, etc., applying data science methods (data reduction, fusion, feature detection, etc.) provided through a computational infrastructure.

Technology Name: Distributed data analytics
Data Lifecycle Area(s): Data Analytics
Description: Analysis of data across distributed archives to support Earth system science.

Technology Name: Uncertainty Quantification; Measurement Science
Data Lifecycle Area(s): Data Analytics
Description: Quantification and management of uncertainty through all data processing steps, and subsequently through analytical algorithms, as part of a measurement science strategy for data fusion and data science.

Technology Name: Open source data management/science frameworks
Data Lifecycle Area(s): (1) Data Archives; (2) Data Analytics
Description: Open source data management/science frameworks that can massively scale to handle and manage large data streams and products, including near-real time constraints, as part of the data lifecycle architecture, for archiving and analytics as part of a big data cyber-infrastructure.

Technology Name: Computational Infrastructures
Data Lifecycle Area(s): (1) Data Processing; (2) Data Analytics
Description: Computational infrastructures to scale data analytics using HPC and public cloud. This includes on-demand massive HPC and storage for integration to drive analytics.
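As a concrete illustration of visualization driven by data reduction (one of the technology areas in Table 1), the sketch below reduces a large synthetic point set to a fixed-size density grid so that only the small grid, rather than the raw points, needs to reach a display; the point data are randomly generated for illustration.

```python
# Visualization-by-reduction sketch: tens of millions of points are binned
# into a one-degree global grid whose size is fixed regardless of input size.
import numpy as np

rng = np.random.default_rng(42)
lon = rng.uniform(-180.0, 180.0, 10_000_000)
lat = rng.uniform(-90.0, 90.0, 10_000_000)

# 360 x 180 one-degree grid: ~64,800 cells no matter how many points arrive.
density, _, _ = np.histogram2d(lon, lat, bins=[360, 180],
                               range=[[-180, 180], [-90, 90]])
print(f"Reduced {lon.size:,} points to a {density.shape} grid "
      f"({density.nbytes / 1024:.0f} KiB)")
```

In a server-side deployment, only the kilobyte-scale grid would be shipped to the client for rendering, which is what makes interactive exploration of massive data sets feasible.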
Figure 5 shows the mapping of these technologies to the Big Data Analytics architecture concept. A key element, as mentioned, is the integration of the various technologies based on a data-driven architecture view.
Figure 5: Architecture-Technology Mapping (cross-cutting: data system architectures, information architectures; onboard: data triage, data reduction; ground system: real-time data triage, reusable data science methods, on-demand workflows and computation, integrated cyberinfrastructure; cyberinfrastructure: open source data management/processing frameworks, data movement, federated data access, scalable computation and storage; data analytics and visualization: distributed data analytics, on-demand computation, intelligent search and mining, uncertainty quantification, visualization of massive data sets)
VII. Benchmarking
Several U.S. agencies have initiated efforts in Big Data. This includes technology investments, new initiatives and new organizations. These efforts are captured in the appendix and are summarized below.
Agency Big Data Overview and Strategy
National Institutes of Health (NIH)
The NIH has initiated a new program in Data Science and appointed an Associate Director to lead the effort. It is establishing an NIH Commons (computation, software, standards, etc.) through its Big Data to Knowledge (BD2K) initiative. The NIH Commons will provide capabilities to the various NIH institutes that support directed research efforts. Efforts focus on enabling data management and big data analytics capabilities.
National Science Foundation (NSF)
The NSF has several initiatives coordinated through the Office of Cyberinfrastructure (OCI). OCI coordinates with various disciplines within the NSF, including the EarthCube program that seeks to build a national Geosciences Cyberinfrastructure. Goals of the NSF include:
1) Derive knowledge from data;
2) Develop new cyber-infrastructures to manage, curate and serve data;
3) Develop new approaches for education and workforce development; and
4) Enable new types of inter-disciplinary collaboration and community building.
DARPA
DARPA has several programs in Big Data including the XDATA and Memex Programs that are developing data science frameworks for big data analytics and mechanisms to explore deep searching of the Internet. DARPA is working to explore the use of open source technologies and their application to these programs.
NOAA
NOAA is working to explore commercial opportunities to build cyber-infrastructures. This includes the use of cloud-based computing capabilities and support to scale data management and computation. NOAA is also participating in the Big Earth Data Initiative (BEDI) project.
Department of Energy (DOE)
DOE has been exploring programs in extreme scale science, particularly as it relates to high performance computing (HPC). The goal is to address the combined challenges of Big Data and Big Compute to develop an exascale computing environment for simulation and data analysis at scale, cutting across various disciplines in energy, biology, and climate.
USGS
USGS is exploring its role in big data, focusing on data capture and integration, as well as sharing and leveraging HPC infrastructure across the agency. Programs such as EROS are exploring new architectural approaches to scale for the future.
National Institute of Standards and Technology (NIST)
NIST has established a program in Data Science focusing on the development of architectures, use cases, standards and interoperability. They are also focused on areas including measurement foundations/principles to increase the accuracy of derived inferences from massive data.
Science, in the broadest terms, unifies the data science technology needs of NASA and other agencies. Beyond NASA science disciplines such as climate science, other disciplines such as biological science are declaring a direct need for new technologies to support science data analysis (often captured as "data-intensive science" or "data-driven science"). This should not be surprising, given the vast quantities of data being produced in such areas as genomics and proteomics. Data science methods that automate extraction, classification, reduction, and discovery from massive data sets will have direct value across several science disciplines and agencies.
International agencies, such as the European Space Agency (ESA), have also initiated several efforts in Big Data. In 2013 and 2014, ESA held workshops on Big Data for Space covering several of the technologies described in this report. Results of the most recent workshop highlight movement towards the big data lifecycle, enabling integrated analytics, integrating high-performance computing (HPC) with data infrastructures, new techniques for machine learning, and a shift towards new cyber-infrastructures.
International groups such as the International Virtual Observatory Alliance (IVOA), International Planetary Data Alliance (IPDA), and the Global Organization for Earth System Science Portals (GO-ESSP) have developed efforts to share data and technology across international boundaries in order to support worldwide science data analysis.
VIII. Ten-Year Roadmap of Capabilities
The table below summarizes the projected progression of data and computational science capabilities, from the systems available today to the near-term (2-5 years) and far-term (5-10 years) capabilities this roadmap targets.
Today:
● Highly curated data repositories
● Access and search methods for distributed repositories
● Core data standards
● Machine Learning Techniques

Near-Term (2-5 Years):
● Scalable data science framework
● Integrated data analytics
● Data provenance for reproducibility
● Unified architecture for data exploration and analysis
● On-demand data science methods integrated with data repositories
● Data visualization methods
● Scalable computing infrastructure

Far Term (5-10 Years):
● Onboard data science and reduction methods
● Distributed data analytics and computation
● Integrated uncertainty analysis
● Virtual observational missions
● Virtual, Immersive Visualization Environments for Massive Data
Disruptive Approaches Leveraging Computational Science (5-10 Years)
Far-Term Disruptive Capability: Onboard data science and reduction methods
Description: Increase onboard autonomy, planning, and data processing as close to the instrument as possible as part of a broader big data strategy to increase science return.

Far-Term Disruptive Capability: Distributed data analytics and computation
Description: Shift to a data analytics ecosystem to enable analysis of distributed data from multiple instruments (ground-based, airborne, satellite) as well as comparison against climate models.

Far-Term Disruptive Capability: Integrated uncertainty analysis
Description: Develop an architectural approach that manages uncertainty levels as data and computational demands change.

Far-Term Disruptive Capability: Virtual observational missions
Description: Shift towards a data-driven approach for missions where new instruments integrate into an overall virtual infrastructure; support planning of new science goals and observations from a combination of instruments and archival data.

Far-Term Disruptive Capability: Virtual, Immersive Visualization Environments for Massive Data
Description: Enable immersive environments for scientific analysis as well as outreach using observational data from NASA missions.
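The "integrated uncertainty analysis" capability implies explicit uncertainty bookkeeping of the kind shown in the following worked example: two measurements of the same quantity, with illustrative values only, are fused by inverse-variance weighting, which also yields a defensible uncertainty for the fused estimate.

```python
# Worked example of uncertainty propagation in data fusion:
# inverse-variance weighted mean and its 1-sigma uncertainty.
import numpy as np

def fuse(values, sigmas):
    """Fuse independent estimates of one quantity, propagating uncertainty."""
    values = np.asarray(values, dtype=float)
    weights = 1.0 / np.asarray(sigmas, dtype=float) ** 2
    fused = np.sum(weights * values) / np.sum(weights)
    fused_sigma = np.sqrt(1.0 / np.sum(weights))  # always <= min(sigmas)
    return fused, fused_sigma

# e.g., a satellite retrieval and an in-situ gauge (illustrative values)
value, sigma = fuse([101.3, 99.8], [1.5, 0.6])
print(f"Fused estimate: {value:.2f} +/- {sigma:.2f}")
```

Because the fused uncertainty is never larger than the best single-source uncertainty, carrying these terms through every lifecycle stage is what allows the ecosystem to quantify how much a new data source actually improves an inference.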
Technology Prioritization
Technology Name: Big Data Architecture for Earth Science Remote Sensing
Priority: Highest
Applicable Far-Term Capability: Cross-cutting

Technology Name: Big Data Information Models and Semantics
Priority: Medium
Applicable Far-Term Capability: Distributed data analytics and computation

Technology Name: Onboard data science methods for data triage
Priority: Medium
Applicable Far-Term Capability: Onboard data science and reduction

Technology Name: Onboard data science methods for data reduction
Priority: Medium
Applicable Far-Term Capability: Onboard data science and reduction

Technology Name: Massive Data Movement Technologies
Priority: Medium
Applicable Far-Term Capability: Distributed data analytics and computation

Technology Name: Real-time ground-based data science methods
Priority: Highest
Applicable Far-Term Capability: Virtual observational missions

Technology Name: Open source data processing frameworks
Priority: Highest
Applicable Far-Term Capability: Distributed data analytics and computation; Virtual observational missions; Virtual, immersive visualization environments for massive data

Technology Name: Reusable data science methodologies for missions and science
Priority: Highest
Applicable Far-Term Capability: Virtual observational missions; Distributed data analytics and computation

Technology Name: Federated data access
Priority: Medium
Applicable Far-Term Capability: Distributed data analytics and computation

Technology Name: Massive Data Distribution
Priority: Medium
Applicable Far-Term Capability: Distributed data analytics and computation

Technology Name: Intelligent search and mining
Priority: Medium
Applicable Far-Term Capability: Distributed data analytics and computation

Technology Name: Visualization of massive data sets
Priority: Medium
Applicable Far-Term Capability: Virtual, immersive visualization environments for massive data

Technology Name: On-demand distributed data analytics
Priority: Highest
Applicable Far-Term Capability: Distributed data analytics and computation

Technology Name: Distributed data analytics
Priority: Highest
Applicable Far-Term Capability: Distributed data analytics and computation

Technology Name: Uncertainty Quantification; Measurement Science
Priority: Medium
Applicable Far-Term Capability: Integrated uncertainty analysis

Technology Name: Open source data management/science frameworks
Priority: Medium
Applicable Far-Term Capability: Distributed data analytics and computation; Virtual observational missions; Virtual, immersive visualization environments for massive data

Technology Name: Computational Infrastructures
Priority: Highest
Applicable Far-Term Capability: Distributed data analytics and computation; Virtual observational missions; Virtual, immersive visualization environments for massive data
IX. Overall Recommendations
● NASA data technology needs to shift from ad hoc investments across the mission and science data lifecycle to an integrated architecture in which technology investments fit into a broader capability and vision to enable Earth Science.
● The NASA data ecosystem needs to shift from a stewardship model to a data-driven discovery model in which both data management and data discovery are enabled through a systematic data architecture, computational infrastructure, and lifecycle model.
● Data architectures should be modeled and assessed overall to plan for technology capabilities and improvements, to ensure scalability, and to address performance, cost, and uncertainty management goals.
● Data discovery methods should be applied across the entire data lifecycle to support scalability and discovery at each point, from onboard computing, to data processing and archiving, to distributed data analytics.
● Architectures should enable flexible tradeoffs of where and how to compute, including improved integration of HPC and data infrastructures.
● New capabilities should ensure reproducibility of derived scientific results.
● Computational and data science should play an important role in planning new missions, including identification of how data, algorithms, and computing can be applied and integrated to improve overall data discovery.
References
[1] NASA Office of the Chief Technologist, 2014 Technology Roadmap (under development).
[2] National Research Council, "Frontiers in Massive Data Analysis," 2013.
[3] PCAST, "Report on Information Technology," 2011.
[4] American Geophysical Union, "Trends in Earth and Space Science."

Appendix A. NASA Mission/Science Data Lifecycle
As the data-intensive nature of NASA science and exploration missions increases, there is an increasing need to consider the data lifecycle from the point of collection all the way to the application and use of the data. Considerations across the entire lifecycle are needed in order to support scalability and use of the data. Furthermore, missions may require that data reduction and intelligent triage be performed on the data itself across the lifecycle to identify which data should be captured and archived. In addition to considerations of the data and software, it is critical that common information models be developed and defined in order to ensure that consistent definitions of the data are applied, so that the data itself can be accurately managed, discovered, and used. It is important to point out that data across this entire lifecycle should not be considered data at rest, but rather data that should be discoverable, accessible, and usable to update plans, support local operations, and enable science. As a result, a well-architected data system from on-board data capture through ground-based operations and data analysis must be in place to enable scalability at multiple points across that lifecycle.
Critical areas of the lifecycle (with considerations at each stage) include:
a) Data Generation: Perform original processing at the sensor/instrument
b) Data Triage: Make choices at the collection point about which data to keep
c) Data Compression: Maximize information throughput against available bandwidth
d) Data Transport: Improve resource efficiencies to enable moving the most data
e) Data Processing: Increase computation availability
f) Data Archiving: Scale the capture, management and distribution of data
g) Visualization: Develop and apply analysis techniques to enable data understanding from visualization of massive data
h) Data Analytics: Create analytics services to integrate massive, distributed, heterogeneous data
Appendix B. ESTO/AIST Proposed Big Data Technology Thrust Areas.
Technology Name: Big Data Architecture for Earth Science Remote Sensing
Data Lifecycle Area(s): Cross-Cutting
Description: Definition of a scalable big data lifecycle architecture for Earth observing systems, identifying how Big Data can scale from onboard computing to data analysis to increase science yield.

Technology Name: Big Data Information Models and Semantics
Data Lifecycle Area(s): Cross-Cutting
Description: Advanced semantic technologies for defining, deriving, and integrating heterogeneous ontologies and information models as applied across the entire data lifecycle (onboard, ground-based operations, archives, analysis).

Technology Name: Onboard data science methods for data triage
Data Lifecycle Area(s): Data Triage
Description: Onboard data science methods for real-time event detection and planning.

Technology Name: Onboard data science methods for data reduction
Data Lifecycle Area(s): Data Triage
Description: Onboard data science methods for data reduction.

Technology Name: Massive Data Movement Technologies
Data Lifecycle Area(s): Data Transport
Description: Massive data movement technologies for ground-based networks from operations through analysis.

Technology Name: Real-time ground-based data science methods
Data Lifecycle Area(s): Data Processing
Description: Real-time ground-based data science methods for data reduction and real-time event detection for massive data streams as part of the data lifecycle architecture.

Technology Name: Open source data processing frameworks
Data Lifecycle Area(s): Data Processing
Description: Open source data processing and workflow frameworks that can massively scale to computational infrastructures (HPC, public cloud, etc.), handling large data streams and products, including near-real time constraints, as part of the data lifecycle architecture.

Technology Name: Reusable data science methodologies for missions and science
Data Lifecycle Area(s): (1) Data Processing; (2) Data Analytics
Description: Development of reusable data science methodologies for analysis of data on the ground as part of the data lifecycle architecture. This includes on-demand data analytics for massive data repositories.

Technology Name: Federated data access
Data Lifecycle Area(s): Data Archives
Description: Federation of data access from distributed repositories as part of the data lifecycle architecture, moving towards on-demand distributed data analytics.

Technology Name: Massive Data Distribution
Data Lifecycle Area(s): Data Archives
Description: Massive data distribution for large-scale repositories and archives, including methods for data reduction, computation, etc., as integrated, on-demand data analytics.

Technology Name: Intelligent search and mining
Data Lifecycle Area(s): (1) Data Archives; (2) Data Analytics
Description: Provide methods for intelligent search and mining of massive data. This may include integration of on-demand analytics to perform deep searches.

Technology Name: Visualization of massive data sets
Data Lifecycle Area(s): Visualization
Description: Visualization of massive data sets, including data reduction methods that are driven by domain.

Technology Name: On-demand distributed data analytics
Data Lifecycle Area(s): Data Analytics
Description: On-demand data analytics that can integrate data from archives, repositories, etc., applying data science methods (data reduction, fusion, feature detection, etc.) provided through a computational infrastructure.

Technology Name: Distributed data analytics
Data Lifecycle Area(s): Data Analytics
Description: Analysis of data across distributed archives to support Earth system science, including application of novel machine learning, statistical methods (e.g., data fusion), and other computational capabilities.

Technology Name: Uncertainty Quantification; Measurement Science
Data Lifecycle Area(s): Data Analytics
Description: Management of uncertainty in all phases of the data lifecycle and analysis as part of a measurement science strategy for data fusion and data science.

Technology Name: Open source data management/science frameworks
Data Lifecycle Area(s): (1) Data Archives; (2) Data Analytics
Description: Open source data management/science frameworks that can massively scale to handle and manage large data streams and products, including near-real time constraints, as part of the data lifecycle architecture, for archiving and analytics as part of a big data cyber-infrastructure.

Technology Name: Computational Infrastructures
Data Lifecycle Area(s): (1) Data Processing; (2) Data Analytics
Description: Computational infrastructures to scale data analytics using HPC and public cloud. This includes on-demand massive HPC and storage for integration to drive analytics.
Appendix C. Mapping 2014 OCT TA-11 to Proposed ESTO/AIST Big Data Technology Thrust Areas
The 2014 NASA OCT TA-11 Roadmap team identified the following areas:
• Flight Computing: Includes technologies to support greater computation and data management at the point of collection onboard. In some cases, data reduction at the point of collection through intelligent triage methods may be required. Flight computing technologies include ultra-reliable, radiation-hardened platforms, which, until recently, have been extremely costly and limited in performance.
• Ground Computing: Includes exascale supercomputing and data storage, as well as quantum, cognitive, and other types of advanced computing, for Big Data analysis and high-fidelity physics-based simulations for Earth and space science, and aerospace research and engineering.
• Science, Engineering, and Mission Data Lifecycle: As the data-intensive nature of NASA science and exploration missions increases, there is an increasing need to consider the data lifecycle from the point of collection all the way to the application and use of the data.
• Intelligent Data Understanding: Intelligent data understanding (IDU) refers to the capability to automatically mine and analyze datasets that are large, noisy, and of varying modalities (discrete, continuous, text, graph, etc.), in order to extract or discover information that can be used for further analysis or in decision making, on the ground or onboard. It is closely coupled to the capability to detect and respond to interesting events and/or to generate alerts.
• Semantic Technologies: Technologies that enable data understanding, analysis, and automated consulting and operations.
Each of these areas has cross-cutting challenges relevant to AIST, as follows.
Flight Computing
TA: 11.1.1.3
Technology Name: High Performance Flight Software
Description: Enables on-board, high performance autonomy and data processing.
Applicable Technology Area: Onboard data science methods for data reduction, real-time event detection, and planning.
Ground Computing
TA: 11.1.2.1
Technology Name: Exascale Supercomputer
Description: Provides peak computational capability of ≥ 1 exaflop (10^18 floating point operations per second) for exascale performance of NASA computations, with excellent energy efficiency and reliability, to support NASA's exponentially growing high end computational needs.
Applicable Technology Area: Computational infrastructures to scale data analytics.

TA: 11.1.2.2
Technology Name: Automated Exascale Software Development Toolset
Description: Provides automated, exascale application performance monitoring, analysis, tuning, and scaling.
Applicable Technology Area: Computational infrastructures to scale data analytics.

TA: 11.1.2.3
Technology Name: Exascale Supercomputing File System
Description: Provides online data storage capacity of ≥ 1 exabyte, enabling data storage for exascale M&S and data analysis, with sufficient performance and reliability to maintain productivity for a broad array of NASA applications.
Applicable Technology Area: Computational infrastructures to scale data analytics.

TA: 11.1.2.5
Technology Name: Public Cloud Supercomputer
Description: Provides additional resources for NASA supercomputer users, such as for mission-critical computing in an emergency.
Applicable Technology Area: Computational infrastructures to scale data analytics.
Science, Engineering, and Mission Data Lifecycle
TA: 11.4.1.1
Technology Name: Reference Information System Architecture Frameworks
Description: Provide reference architectures for the end-to-end science and engineering data lifecycle.
Applicable Technology Area: Definition of a scalable big data lifecycle architecture for Earth observing systems, identifying how Big Data can scale from onboard computing to data analysis to increase science yield.

TA: 11.4.1.2
Technology Name: Distributed Information Architecture Frameworks
Description: Provide reference information architectures to define data across the end-to-end engineering and science data lifecycle.
Applicable Technology Area: Advanced semantic technologies for auto-generating ontologies and information models from large, distributed, existing data.

TA: 11.4.1.4
Technology Name: Onboard Data Capture and Triage Methodologies
Description: Apply novel machine learning capabilities on board to support data reduction, model-based compression, and triage of massive data sets.
Applicable Technology Area: Onboard data science methods for data reduction, real-time event detection, and planning.

TA: 11.4.1.5
Technology Name: Real-time Data Triage and Data Reduction Methodologies
Description: Apply novel machine learning capabilities in ground data processing systems to support data reduction and triage of massive data sets.
Applicable Technology Area: Real-time ground-based data science methods for data reduction and real-time event detection for massive data streams as part of the data lifecycle architecture.

TA: 11.4.1.6
Technology Name: Scalable Data Processing Frameworks
Description: Provide scalable software processing frameworks for processing scientific and engineering data sets.
Applicable Technology Area: Open source data processing frameworks that can massively scale to handle large data streams and products, including near-real time constraints, as part of the data lifecycle architecture.

TA: 11.4.1.7
Technology Name: Massive Science and Engineering Data Analysis Methodologies
Description: Provide scalable methodologies for analysis of massive data.
Applicable Technology Area: Development of reusable data science methodologies for analysis of data on the ground as part of the data lifecycle architecture. This includes on-demand data analytics for massive data repositories.

TA: 11.4.1.8
Technology Name: Remote Data Access Framework
Description: Provides access to and sharing of distributed data sources in a secure environment.
Applicable Technology Area: Federation of data access from distributed repositories as part of the data lifecycle architecture, moving towards on-demand distributed data analytics.

TA: 11.4.1.9
Technology Name: Massive Data Movement Services
Description: Develop new technologies for the movement of massive, multi-petabyte data over the network.
Applicable Technology Area: Data movement (observations, climate model output, etc.) across the data lifecycle is critical to scaling for Big Data. Develop capabilities to improve movement of data between points of the data lifecycle, from networks to parallel methods for data transfer.

TA: 11.4.1.10
Technology Name: Large-Scale Data Dissemination Environments
Description: Enable scaling of the data infrastructures (software, computation, networks, etc.) required to support large-scale data dissemination.
Applicable Technology Area: Massive data distribution for large-scale repositories and archives, including methods for data reduction, computation, etc., as integrated, on-demand data analytics.

TA: 11.4.1.11
Technology Name: Toolset for Massive Model Data
Description: Makes data/information transparent, scalable, and usable when infusing multiple large and diverse datasets in complex models.
Applicable Technology Area: Scale tools that provide on-demand data analytics, bring together distributed models and observational data, and provide a computational infrastructure to enable analysis and intercomparison.

Technology Name: On Demand Data Analytics
Description: Provides analytics services coupled with computational infrastructures.
Applicable Technology Area: On-demand data analytics that can integrate data from archives, repositories, etc., applying data science methods (data reduction, fusion, feature detection, etc.) provided through a computational infrastructure.
Intelligent Data Understanding
TA: 11.4.2.1
Technology Name: Intelligent Data Collection and Prioritization Toolset
Description: Provides a means to reduce the size of the data (e.g., removing clouds), remove corrupted data, and/or collect complementary data for value-added information.
Applicable Technology Area: Development of reusable data science methodologies for analysis of data on the ground as part of the data lifecycle architecture, in particular intelligent algorithms for data reduction, classification, generation, etc.

TA: 11.4.2.2
Technology Name: Event Detection and Intelligent Action Toolset
Description: Provides computational mechanisms to identify high information content data, either pre-specified, novel, or anomalous, including multi-spacecraft collaborative event detection, and to take an autonomous or assisted onboard decision as a result of data analysis.
Applicable Technology Area: Onboard and ground-based data science methods for data reduction, real-time event detection, and planning.

TA: 11.4.2.3
Technology Name: Data on Demand Toolset
Description: Enables users and models to task sensors and to leverage sensor webs to develop "on-demand" products.
Applicable Technology Area: Representative of multiple AIST areas, including architecture, on-demand data processing, etc.

TA: 11.4.2.4
Technology Name: Intelligent Data Search and Mining Toolset
Description: Develops search services and engines for massive, distributed data holdings; enables the application of different searching rules/schemes that learn from past searches and develop "agents" to find and create the most relevant products. It includes rich queries, including fact-based, free-text searches and web-service based indexing, as well as anomaly/novelty detection, where the system suggests items of interest to the user without the user necessarily prescribing the information being sought.
Applicable Technology Area: Provide methods for intelligent search and mining of massive data. This may include integration of on-demand analytics to perform deep searches.

TA: 11.4.2.5
Technology Name: Data Fusion Toolset
Description: Combines data from multiple sources (e.g., remote sensing, in-situ, models) in order to make inferences that might otherwise not be possible with single data sources, or in order to improve the uncertainty characteristics of these inferences over what might be achieved with single data sources.
Applicable Technology Area: Ground-based data science methods for data fusion, including the uncertainty quantification required to make scientific inferences.
Semantic Technologies
TA: 11.4.3.1
Technology Name: Semantic Enabler for Data (Text, Binary and Databases)
Description: Ingests data of all types and produces a data model and a precision ontology; improves the quality of any existing metadata, including provenance and quality of the source document; disambiguates words in the context of the document, database, or file; and visualizes the results.
Applicable Technology Area: Advanced semantic technologies for auto-generating ontologies and information models from large, distributed, existing data. This could be extremely useful for reverse engineering a large semantic model for a scientific discipline or sub-discipline.

TA: 11.4.3.2
Technology Name: Ultra Large-Scale and Incremental Visualization Toolset
Description: Enables and automates analysis of ultra large scale datasets and visualizes the results in the context of the knowledge domain.
Applicable Technology Area: Visualization of massive data sets, including data reduction methods that are driven by domain.

TA: 11.4.3.3
Technology Name: Semantic Bridge Framework
Description: Enables the alignment of two or more data sources based on their respective ontological descriptions to achieve semantic interoperability, and facilitates calculations to be properly made using data from each dataset.
Applicable Technology Area: Advanced semantic technologies for ontology integration. As NASA moves to more system science spanning multiple missions, instruments, systems, environments, etc., this is critical; constructing on-demand data analytics capabilities will require it.
Collaborative Systems
TA: 11.4.4.3
Technology Name: Distributed Collaborative Science Data Analysis Frameworks
Description: Enable data, computation, and services to be brought together to support distributed data analysis in collaborative environments for science.
Applicable Technology Area: On-demand data analytics that can integrate data from archives, repositories, etc., applying data science methods (data reduction, fusion, feature detection, etc.) provided through a computational infrastructure enabling scientific collaboration.
Mission Systems
TA: 11.4.5.2
Technology Name: Adaptive Systems Framework
Description: Manages a set of interacting or interdependent entities, real or abstract, forming an integrated whole that together are able to respond to environmental changes or changes in the interacting parts.
Applicable Technology Area: Open source data processing frameworks that can massively scale to handle large data streams and products, including near-real time constraints, as part of the data lifecycle architecture.
Cyberinfrastructure
TA: 11.4.6.1
Technology Name: On-Demand, Multi-Mission Data Storage and Computation
Description: Provides scalable storage and computing available on demand across projects, including both internal and external (hybrid) clouds at center and NASA levels.
Applicable Technology Area: Computational infrastructures to scale data archiving and analytics. Note: this could be an area for joint exploration between ESTO/AIST and NASA OCIO.

TA: 11.4.6.2
Technology Name: Scalable Data Management Frameworks
Description: Extensible, scalable data management frameworks that can take advantage of massive storage and computing resources.
Applicable Technology Area: Open source data management/science frameworks that can massively scale to handle and manage large data streams and products, including near-real time constraints, as part of the data lifecycle architecture for mission data management.

TA: 11.4.6.3
Technology Name: Scalable Data Archive Systems
Description: Support scalable archives that can capture, manage, distribute, and preserve massive data sets, both engineering and science.
Applicable Technology Area: Open source data management/science frameworks that can massively scale to handle and manage large data streams and products, including near-real time constraints, as part of the data lifecycle architecture, for archiving.

TA: 11.4.6.4
Technology Name: High Performance Networking
Description: Provide terabit data networks to handle movement of massive datasets, particularly those for scientific research.
Applicable Technology Area: Data movement (observations, climate model output, etc.) across the data lifecycle is critical to scaling for Big Data. Provide high speed networks as core infrastructure. Note: this could be an area for joint exploration between ESTO/AIST and NASA OCIO.
Human System Interaction
TA: 11.4.7.6
Technology Name: Assistive tool for heterogeneous data integration
Description: Allows integration of heterogeneous data sources to enable querying and linking data.
Applicable Technology Area: Federation of data access from distributed repositories as part of the data lifecycle architecture, moving towards on-demand distributed data analytics; advanced semantic technologies for ontology integration.
Appendix D. Classification of Agency Big Data Efforts (based on NITRD and other agency presentations)