NIST Big Data Public Working Group
Volume 3, Use Cases and General Requirements
Overview
Geoffrey Fox Indiana University
Piyush Mehrotra, NASA Ames
NIST Campus
June 1, 2017
Document Scope
• Version 1 collected 51 big data use cases with a 26-feature template and used this to extract requirements to feed into the NIST Big Data Reference Architecture.
• The Version 2 template merges the Version 1 General and Security & Privacy use case analyses.
• Discussion of this at the first NIST Big Data meeting identified the need for patterns, which were proposed during the Version 2 work; the Version 2 template incorporates new questions to help identify patterns.
• Works with Vol. 4 (Security & Privacy), Vol. 6 (Big Data Reference Architecture), Vol. 7 (Standards), and Vol. 8 (Interfaces).
Version 1 Overview
• Gathered and evaluated 51 use cases from nine application domains.
• Gathered input regarding Big Data requirements.
• Analyzed and prioritized a list of challenging use-case-specific requirements that may delay or prevent adoption of Big Data deployments.
• Developed a comprehensive list of generalized Big Data requirements.
• Developed a set of features that characterized applications.
  – Used to compare different Big Data problems.
• Collaborated with the NBD-PWG Reference Architecture Subgroup to provide input for the NBDRA.
51 Detailed Use Cases: Version 1, Contributed July–September 2013
• Government Operation (4): National Archives and Records Administration, Census Bureau
• Commercial (8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo shipping (as in UPS)
• Defense (3): Sensors, Image surveillance, Situation Assessment
• Healthcare and Life Sciences (10): Medical records, Graph and Probabilistic analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity
• Deep Learning and Social Media (6): Driving Car, Geolocate images/cameras, Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets
• The Ecosystem for Research (4): Metadata, Collaboration, Translation, Light source data
• Astronomy and Physics (5): Sky Surveys (including comparison to simulation), Large Hadron Collider at CERN, Belle II Accelerator in Japan
• Earth, Environmental and Polar Science (10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors
• Energy (1): Smart grid
Version 1 Use Case Template
• Note: agreed in this form August 11, 2013.
• Some clarification on Veracity vs. Data Quality added.
• Request for a picture and summary was done by hand for Version 1 but is included in the Version 2 template.
• Early Version 1 use cases did a detailed breakup of the workflow into multiple stages, which we want to restore but do not yet have an agreed format for.
Big Data Applications & Analytics MOOC Use Case Analysis Fall 2013
12/26/13
Size of Requirements Analysis
• 35 General Requirements
• 437 Specific Requirements
  – 8.6 per use case, 12.5 per general requirement
• Data Sources: 3 General, 78 Specific
• Transformation: 4 General, 60 Specific
• Capability (Infrastructure): 6 General, 133 Specific
• Data Consumer: 6 General, 55 Specific
• Security & Privacy: 2 General, 45 Specific
• Lifecycle: 9 General, 43 Specific
• Other: 5 General, 23 Specific
Classifying Use Cases into Patterns Labelled by Features
• The Big Data Ogres built on the collection of 51 big data use cases gathered by the NIST Public Working Group, where 26 properties were gathered for each application.
• This information was combined with other studies, including the Berkeley dwarfs, the NAS Parallel Benchmarks, and the Computational Giants of the NRC Massive Data Analysis Report.
• The Ogre analysis led to a set of 50 features, divided into four views, that could be used to categorize and distinguish between applications.
• The four views are Problem Architecture (macro patterns); Execution Features (micro patterns); Data Source and Style; and finally the Processing View, or runtime features.
• We generalized this approach to integrate Big Data and Simulation applications into a single classification, looking separately at Data and Model, with the total number of facets growing to 64; these are called convergence diamonds and are split between the same four views.
7 Computational Giants of NRC Massive Data Analysis Report
1) G1: Basic Statistics (termed MRStat later, as suitable for simple MapReduce implementation)
2) G2: Generalized N-Body Problems
3) G3: Graph-Theoretic Computations
4) G4: Linear Algebraic Computations
5) G5: Optimizations, e.g., Linear Programming
6) G6: Integration (later called GML, Global Machine Learning)
7) G7: Alignment Problems, e.g., BLAST
Features of 51 Use Cases I
• PP (26) “All”: Pleasingly Parallel or Map Only
• MR (18): Classic MapReduce (add MRStat below for full count)
• MRStat (7): Simple version of MR where the key computations are simple reductions, as found in statistical summaries such as histograms and averages
• MRIter (23): Iterative MapReduce or MPI (Spark, Twister)
• Graph (9): Complex graph data structure needed in analysis
• Fusion (11): Integrate diverse data to aid discovery/decision making; could involve sophisticated algorithms or could just be a portal
• Streaming (41): Data comes in incrementally and is processed this way
• Classify (30): Classification: divide data into categories
• S/Q (12): Index, Search and Query
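The MRStat feature above can be illustrated with a minimal map/reduce-style histogram in plain Python; the partitioning and function names here are illustrative sketches, not taken from the NIST volumes:

```python
from collections import Counter
from functools import reduce

def map_partition(values, bin_width=10):
    """Map step: each partition independently bins its values into a histogram."""
    return Counter(int(v // bin_width) for v in values)

def reduce_histograms(h1, h2):
    """Reduce step: partial histograms merge by simple addition of counts."""
    return h1 + h2

# Data split across "parallel" partitions (illustrative values)
partitions = [[3, 12, 25], [7, 14, 31], [9, 11, 28]]
partial = [map_partition(p) for p in partitions]
histogram = reduce(reduce_histograms, partial)
print(dict(histogram))  # bin index -> count
```

The key point of the pattern is that the reduction is a simple commutative merge, which is why MRStat use cases fit a single MapReduce pass.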
Patterns (Ogres) Modelled on the 13 Berkeley Dwarfs
• Dense Linear Algebra
• Sparse Linear Algebra
• Spectral Methods
• N-Body Methods
• Structured Grids
• Unstructured Grids
• MapReduce
• Combinational Logic
• Graph Traversal
• Dynamic Programming
• Backtrack and Branch-and-Bound
• Graphical Models
• Finite State Machines

• The Berkeley dwarfs and the NAS Parallel Benchmarks are perhaps the two best-known approaches to characterizing parallel computing use cases / kernels / patterns.
• Note the dwarfs are somewhat inconsistent: for example, MapReduce is a programming model while spectral methods are numerical methods.
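One of the dwarfs above, Graph Traversal, reduces to a small kernel; a minimal breadth-first search sketch in plain Python (the graph is an illustrative toy, not from any of the 51 use cases):

```python
from collections import deque

def bfs(graph, start):
    """Breadth-first traversal: visit vertices in order of distance from start."""
    visited, order, queue = {start}, [], deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph.get(v, []):
            if w not in visited:
                visited.add(w)
                queue.append(w)
    return order

# Small adjacency-list graph (illustrative)
g = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs(g, "A"))  # ['A', 'B', 'C', 'D']
```

Kernels like this are what the Graph feature in the use case analysis refers to, scaled up to graphs with billions of edges.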
Features of 51 Use Cases II
• CF (4): Collaborative Filtering for recommender engines
• LML (36): Local Machine Learning (independent for each parallel entity); an application could have GML as well
• GML (23): Global Machine Learning: Deep Learning, Clustering, LDA, PLSI, MDS; large-scale optimizations as in Variational Bayes, MCMC, Lifted Belief Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg–Marquardt. Can call this EGO, or Exascale Global Optimization, with scalable parallel algorithms
• Workflow (51): Universal
• GIS (16): Geotagged data, often displayed in ESRI, Microsoft Virtual Earth, Google Earth, GeoServer, etc.
• HPC (5): Classic large-scale simulation of cosmos, materials, etc., generating (visualization) data
• Agent (2): Simulations of models of data-defined macroscopic entities represented as agents
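As a concrete instance of one GML optimization method named above, a minimal stochastic gradient descent sketch for least-squares linear regression in plain Python; the data, learning rate, and epoch count are illustrative assumptions, not values from any use case:

```python
import random

def sgd_linear(data, lr=0.01, epochs=200, seed=0):
    """Fit y ≈ w*x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)                # visit samples in random order
        for x, y in data:
            err = (w * x + b) - y        # prediction error for one sample
            w -= lr * 2 * err * x        # gradient step for the weight
            b -= lr * 2 * err            # gradient step for the bias
    return w, b

# Noiseless data generated from y = 2x + 1 (illustrative)
points = [(x, 2 * x + 1) for x in [0, 1, 2, 3, 4]]
w, b = sgd_linear(points)
print(round(w, 3), round(b, 3))  # converges toward w=2, b=1
```

GML use cases run this kind of update at scale, with the model state synchronized across many parallel workers.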
3: Census Bureau Statistical Survey Response Improvement (Adaptive Design)
• Application: Survey costs are increasing as survey response declines. The goal of this work is to use advanced “recommendation system techniques” that are open and scientifically objective, using data mashed up from several sources and historical survey paradata (administrative data about the survey), to drive operational processes in an effort to increase quality and reduce the cost of field surveys.
• Current Approach: About a petabyte of data comes from surveys and other government administrative sources. Data can be streamed, with approximately 150 million records transmitted as field data streamed continuously during the decennial census. All data must be both confidential and secure, and all processes must be auditable for security and confidentiality as required by various legal statutes. Data quality should be high and statistically checked for accuracy and reliability throughout the collection process. Software used includes Hadoop, Spark, Hive, R, SAS, Mahout, AllegroGraph, MySQL, Oracle, Storm, BigMemory, Cassandra, and Pig.
• Futures: Analytics need to be developed that give statistical estimations with more detail, on a more near-real-time basis, for less cost. The reliability of estimated statistics from such “mashed up” sources still must be evaluated.
Government
Features: PP, MRStat, S/Q, Index, CF
13: Cloud Large Scale Geospatial Analysis and Visualization
• Application: Need to support large-scale geospatial data analysis and visualization, with the number of geospatially aware sensors and the number of geospatially tagged data sources rapidly increasing.
• Current Approach: Traditional GIS systems are generally capable of analyzing millions of objects and easily visualizing thousands. Data types include imagery (in various formats such as NITF, GeoTIFF, and CADRG) and vector data in various formats such as shape files, KML, and text streams. Object types include points, lines, areas, polylines, circles, and ellipses. Data accuracy is very important, with image registration and sensor accuracy relevant. Analytics include closest point of approach, deviation from route, point density over time, PCA, and ICA. Software includes a server with a geospatially enabled RDBMS and geospatial server/analysis software (ESRI ArcServer, GeoServer); visualization by ArcMap or browser-based tools.
• Futures: Today’s intelligence systems often contain trillions of geospatial objects and need to be able to visualize and interact with millions of objects. Critical issues are indexing, retrieval, and distributed analysis; visualization generation and transmission; visualization of data at the end of low-bandwidth wireless connections; data that is sensitive and must be completely secure in transit and at rest (particularly on handhelds); and geospatial data that requires unique approaches to indexing and distributed analysis.
Defense
19: NIST Genome in a Bottle Consortium
• Application: The NIST Genome in a Bottle Consortium integrates data from multiple sequencing technologies and methods to develop highly confident characterizations of whole human genomes as reference materials, and develops methods to use these reference materials to assess the performance of any genome sequencing run.
• Current Approach: The ~40 TB of NFS storage at NIST is full; there are also petabytes of genomics data at NIH/NCBI. Uses open-source sequencing bioinformatics software from academic groups (UNIX-based) on a 72-core cluster at NIST, supplemented by larger systems at collaborators.
• Futures: DNA sequencers can generate ~300 GB of compressed data per day, a volume that has increased much faster than Moore’s Law. Future data could include other ‘omics’ measurements, which will be even larger than DNA sequencing. Clouds have been explored.
Healthcare and Life Sciences
Features: PP, MR, MRIter, Classification
Parallelism over gene fragments at various stages
38: Large Survey Data for Cosmology
• Application: For DES (the Dark Energy Survey), the data are sent from the mountaintop via a microwave link to La Serena, Chile. From there, an optical link forwards them to NCSA as well as NERSC for storage and “reduction”. Here galaxies and stars in both the individual and stacked images are identified and catalogued, and finally their properties are measured and stored in a database.
• Current Approach: Subtraction pipelines are run using extant imaging data to find new optical transients through machine learning algorithms. Infrastructure includes a Linux cluster, an Oracle RDBMS server, PostgreSQL, large-memory machines, standard Linux interactive hosts, and GPFS; for simulations, HPC resources. Software includes standard astrophysics reduction software as well as Perl/Python wrapper scripts and Linux cluster scheduling.
• Futures: Techniques for handling Cholesky decomposition for thousands of simulations, with matrices of order 1M on a side, and parallel image storage would be important. LSST will generate 60 PB of imaging data and 15 PB of catalog data, and a correspondingly large (or larger) amount of simulation data, with over 20 TB of data per night.
Astronomy & Physics
Image: Victor M. Blanco Telescope, Chile, where the new wide-angle 520-megapixel camera DECam is installed.
Typical Big Data Pattern 2. Perform real-time analytics on data source streams and notify users when specified events occur
Software: Storm (Heron), Kafka, HBase, ZooKeeper
[Diagram: multiple streaming data sources feed a filter that identifies events according to a user-specified filter; identified events are posted to a repository, from which users fetch them, and selected events are archived.]
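Stripped of the infrastructure, the pattern reduces to a predicate applied to each arriving record; a minimal sketch in plain Python, noting that a production deployment would run the filter inside Storm/Heron with Kafka as the transport, and that the record fields and threshold here are illustrative:

```python
def event_filter(stream, predicate):
    """Yield only the records that match the user-specified filter."""
    for record in stream:
        if predicate(record):
            yield record

# Simulated sensor stream; a real deployment would consume from Kafka topics.
stream = [{"id": 1, "temp": 20}, {"id": 2, "temp": 95}, {"id": 3, "temp": 30}]
alerts = list(event_filter(stream, lambda r: r["temp"] > 90))  # notify on hot readings
print(alerts)
```

In the full pattern, the yielded events would be posted to the repository (e.g., HBase) and selected events archived, rather than collected into a list.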
Typical Big Data Pattern 5A. Perform interactive analytics on observational scientific data
Software: Grid or many-task software, Hadoop, Spark, Giraph, Pig, …; Data storage: HDFS, HBase, file collections; Analysis: science analysis code, Mahout, R, SPIDAL
[Diagram: scientific data is recorded in the “field” and locally accumulated with initial computing, then transported (as batches or by direct transfer) to the primary analysis data system; streaming Twitter data for social networking science follows the same path.]
NIST examples include LHC, Remote Sensing, Astronomy and
10. Orchestrate multiple sequential and parallel data transformations and/or analytic processing using a workflow manager
Software: Hadoop, Spark, Giraph, Pig, …; Data storage: HDFS, HBase
[Diagram: a user-specified analytics pipeline drives an orchestration (workflow) layer, which runs Analytic-1 and Analytic-2 over the shared data storage and feeds Analytic-3 for visualization.]
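The orchestration idea can be sketched as a tiny sequential workflow runner in plain Python; real workflow managers add parallel branches, retries, and data staging, and the stage names and functions below are illustrative assumptions:

```python
def run_pipeline(stages, data):
    """Run analytic stages in sequence; each stage's output feeds the next."""
    for name, fn in stages:
        data = fn(data)
        print(f"stage {name} done")  # a real workflow manager would log/monitor here
    return data

# Illustrative pipeline: clean -> aggregate -> format for visualization
pipeline = [
    ("Analytic-1", lambda xs: [x for x in xs if x is not None]),  # drop missing values
    ("Analytic-2", lambda xs: sum(xs) / len(xs)),                 # aggregate to a mean
    ("Analytic-3", lambda m: f"mean={m:.1f}"),                    # format for display
]
result = run_pipeline(pipeline, [1, 2, None, 3])
print(result)  # mean=2.0
```

The Workflow (51) feature in the use case analysis reflects that every use case composed its processing this way, whatever engine ran the individual stages.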
Version 2 Opportunities for Contribution
• More use cases. (Roll up sleeves; budget an hour.)
• Soliciting greater application domain diversity:
  – Smart cars (Smart X)
  – Large-scale utility IoT
  – Geolocation applications involving people
  – Energy, from discovery to generation
  – Scientific studies involving human subjects at large scale
  – Highly distributed use cases bridging multiple enterprises
• Choose a domain and collect/analyze a set of related use cases.
• Develop technology requirements for applications in the domain.
• Feed lessons into Version 3 of the template.
• Compare different big data applications in needed architecture and interfaces.