NIST Big Data Public Working Group
Volume 3, Use Cases and General Requirements
Overview
Geoffrey Fox Indiana University
Piyush Mehrotra, NASA Ames
NIST Campus
June 1, 2017
Document Scope
• Version 1 collected 51 big data use cases with a 26-feature template and used this to extract requirements to feed into the NIST Big Data Reference Architecture.
• The Version 2 template merges the Version 1 General and Security & Privacy use case analyses.
• Discussion of this at the first NIST Big Data meeting identified the need for patterns, which were proposed during the Version 2 work; the Version 2 template incorporates new questions to help identify patterns.
• Works with Vol. 4 (Security & Privacy), Vol. 6 (Big Data Reference Architecture), Vol. 7 (Standards), and Vol. 8 (Interfaces).
Version 1 Overview
• Gathered and evaluated 51 use cases from nine application domains.
• Gathered input regarding Big Data requirements.
• Analyzed and prioritized a list of challenging use-case-specific requirements that may delay or prevent adoption of Big Data deployments.
• Developed a comprehensive list of generalized Big Data requirements.
• Developed a set of features that characterized applications.
  – Used to compare different Big Data problems.
• Collaborated with the NBD-PWG Reference Architecture Subgroup to provide input for the NBDRA.
51 Detailed Use Cases: Version 1, Contributed July–September 2013
• Government Operation (4): National Archives and Records Administration, Census Bureau
• Commercial (8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo shipping (as in UPS)
• Defense (3): Sensors, Image surveillance, Situation Assessment
• Healthcare and Life Sciences (10): Medical records, Graph and Probabilistic analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity
• Deep Learning and Social Media (6): Driving Car, Geolocate images/cameras, Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets
• The Ecosystem for Research (4): Metadata, Collaboration, Translation, Light source data
• Astronomy and Physics (5): Sky Surveys (including comparison to simulation), Large Hadron Collider at CERN, Belle II Accelerator in Japan
• Earth, Environmental and Polar Science (10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors
• Energy (1): Smart grid
Version 1 Use Case Template
• Note: agreed in this form August 11, 2013.
• Some clarification on Veracity vs. Data Quality added.
• Request for a picture and summary was done by hand for Version 1 but is included in the Version 2 template.
• Early Version 1 use cases did a detailed breakup of the workflow into multiple stages, which we want to restore but do not yet have an agreed format for.
Big Data Applications & Analytics MOOC Use Case Analysis Fall 2013
12/26/13
Size of Requirements Analysis
• 35 General Requirements
• 437 Specific Requirements
  – 8.6 per use case, 12.5 per general requirement
• Data Sources: 3 General, 78 Specific
• Transformation: 4 General, 60 Specific
• Capability (Infrastructure): 6 General, 133 Specific
• Data Consumer: 6 General, 55 Specific
• Security & Privacy: 2 General, 45 Specific
• Lifecycle: 9 General, 43 Specific
• Other: 5 General, 23 Specific
Classifying Use Cases into Patterns Labelled by Features
• The Big Data Ogres built on the collection of 51 big data use cases gathered by the NIST Public Working Group, where 26 properties were gathered for each application.
• This information was combined with other studies, including the Berkeley dwarfs, the NAS Parallel Benchmarks, and the Computational Giants of the NRC Massive Data Analysis Report.
• The Ogre analysis led to a set of 50 features, divided into four views, that could be used to categorize and distinguish between applications.
• The four views are Problem Architecture (macro patterns); Execution Features (micro patterns); Data Source and Style; and finally the Processing View, or runtime features.
• We generalized this approach to integrate Big Data and Simulation applications into a single classification, looking separately at Data and Model, with the total number of facets growing to 64; these are called convergence diamonds and are split between the same four views.
7 Computational Giants of NRC Massive Data Analysis Report
1) G1: Basic Statistics (termed MRStat later, as suitable for simple MapReduce implementation)
2) G2: Generalized N-Body Problems
3) G3: Graph-Theoretic Computations
4) G4: Linear Algebraic Computations
5) G5: Optimizations, e.g., Linear Programming
6) G6: Integration (later called GML, Global Machine Learning)
7) G7: Alignment Problems, e.g., BLAST
Features of 51 Use Cases I
• PP (26) “All”: Pleasingly Parallel or Map Only
• MR (18): Classic MapReduce (add MRStat below for full count)
• MRStat (7): Simple version of MR where the key computations are simple reductions, as found in statistical summaries such as histograms and averages
• MRIter (23): Iterative MapReduce or MPI (Spark, Twister)
• Graph (9): Complex graph data structure needed in analysis
• Fusion (11): Integrate diverse data to aid discovery/decision making; could involve sophisticated algorithms or could just be a portal
• Streaming (41): Data comes in incrementally and is processed this way
• Classify (30): Classification: divide data into categories
• S/Q (12): Index, Search and Query
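The MRStat feature above can be illustrated with a minimal map/reduce-style histogram in plain Python; the partitioning and function names here are illustrative sketches, not taken from the NIST volumes:

```python
from collections import Counter
from functools import reduce

def map_partition(values, bin_width=10):
    """Map step: each partition independently bins its values into a histogram."""
    return Counter(int(v // bin_width) for v in values)

def reduce_histograms(h1, h2):
    """Reduce step: partial histograms merge by simple addition of counts."""
    return h1 + h2

# Data split across "parallel" partitions (illustrative values)
partitions = [[3, 12, 25], [7, 14, 31], [9, 11, 28]]
partial = [map_partition(p) for p in partitions]
histogram = reduce(reduce_histograms, partial)
print(dict(histogram))  # bin index -> count
```

The key point of the pattern is that the reduction is a simple commutative merge, which is why MRStat use cases fit a single MapReduce pass.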
Patterns (Ogres) Modelled on the 13 Berkeley Dwarfs
• Dense Linear Algebra
• Sparse Linear Algebra
• Spectral Methods
• N-Body Methods
• Structured Grids
• Unstructured Grids
• MapReduce
• Combinational Logic
• Graph Traversal
• Dynamic Programming
• Backtrack and Branch-and-Bound
• Graphical Models
• Finite State Machines

• The Berkeley dwarfs and the NAS Parallel Benchmarks are perhaps the two best-known approaches to characterizing parallel computing use cases / kernels / patterns.
• Note the dwarfs are somewhat inconsistent: for example, MapReduce is a programming model while spectral methods are numerical methods.
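One of the dwarfs above, Graph Traversal, reduces to a small kernel; a minimal breadth-first search sketch in plain Python (the graph is an illustrative toy, not from any of the 51 use cases):

```python
from collections import deque

def bfs(graph, start):
    """Breadth-first traversal: visit vertices in order of distance from start."""
    visited, order, queue = {start}, [], deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph.get(v, []):
            if w not in visited:
                visited.add(w)
                queue.append(w)
    return order

# Small adjacency-list graph (illustrative)
g = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs(g, "A"))  # ['A', 'B', 'C', 'D']
```

Kernels like this are what the Graph feature in the use case analysis refers to, scaled up to graphs with billions of edges.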
Features of 51 Use Cases II
• CF (4): Collaborative Filtering for recommender engines
• LML (36): Local Machine Learning (independent for each parallel entity); an application could have GML as well
• GML (23): Global Machine Learning: Deep Learning, Clustering, LDA, PLSI, MDS; large-scale optimizations as in Variational Bayes, MCMC, Lifted Belief Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg–Marquardt. Can call this EGO, or Exascale Global Optimization, with scalable parallel algorithms
• Workflow (51): Universal
• GIS (16): Geotagged data, often displayed in ESRI, Microsoft Virtual Earth, Google Earth, GeoServer, etc.
• HPC (5): Classic large-scale simulation of cosmos, materials, etc., generating (visualization) data
• Agent (2): Simulations of models of data-defined macroscopic entities represented as agents
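As a concrete instance of one GML optimization method named above, a minimal stochastic gradient descent sketch for least-squares linear regression in plain Python; the data, learning rate, and epoch count are illustrative assumptions, not values from any use case:

```python
import random

def sgd_linear(data, lr=0.01, epochs=200, seed=0):
    """Fit y ≈ w*x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)                # visit samples in random order
        for x, y in data:
            err = (w * x + b) - y        # prediction error for one sample
            w -= lr * 2 * err * x        # gradient step for the weight
            b -= lr * 2 * err            # gradient step for the bias
    return w, b

# Noiseless data generated from y = 2x + 1 (illustrative)
points = [(x, 2 * x + 1) for x in [0, 1, 2, 3, 4]]
w, b = sgd_linear(points)
print(round(w, 3), round(b, 3))  # converges toward w=2, b=1
```

GML use cases run this kind of update at scale, with the model state synchronized across many parallel workers.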
3: Census Bureau Statistical Survey Response Improvement (Adaptive Design)
• Application: Survey costs are increasing as survey response declines. The goal of this work is to use advanced “recommendation system techniques” that are open and scientifically objective, using data mashed up from several sources and historical survey paradata (administrative data about the survey), to drive operational processes in an effort to increase quality and reduce the cost of field surveys.
• Current Approach: About a petabyte of data comes from surveys and other government administrative sources. Data can be streamed, with approximately 150 million records transmitted as field data streamed continuously during the decennial census. All data must be both confidential and secure, and all processes must be auditable for security and confidentiality as required by various legal statutes. Data quality should be high and statistically checked for accuracy and reliability throughout the collection process. Software used includes Hadoop, Spark, Hive, R, SAS, Mahout, AllegroGraph, MySQL, Oracle, Storm, BigMemory, Cassandra, and Pig.
• Futures: Analytics need to be developed that give statistical estimations with more detail, on a more near-real-time basis, for less cost. The reliability of estimated statistics from such “mashed up” sources still must be evaluated.
Government
Features: PP, MRStat, S/Q, Index, CF
13: Cloud Large Scale Geospatial Analysis and Visualization
• Application: Need to support large-scale geospatial data analysis and visualization, with the number of geospatially aware sensors and the number of geospatially tagged data sources rapidly increasing.
• Current Approach: Traditional GIS systems are generally capable of analyzing millions of objects and easily visualizing thousands. Data types include imagery (in various formats such as NITF, GeoTIFF, and CADRG) and vector data in various formats such as shape files, KML, and text streams. Object types include points, lines, areas, polylines, circles, and ellipses. Data accuracy is very important, with image registration and sensor accuracy relevant. Analytics include closest point of approach, deviation from route, point density over time, PCA, and ICA. Software includes a server with a geospatially enabled RDBMS and geospatial server/analysis software (ESRI ArcServer, GeoServer); visualization by ArcMap or browser-based tools.
• Futures: Today’s intelligence systems often contain trillions of geospatial objects and need to be able to visualize and interact with millions of objects. Critical issues are indexing, retrieval, and distributed analysis; visualization generation and transmission; visualization of data at the end of low-bandwidth wireless connections; data that is sensitive and must be completely secure in transit and at rest (particularly on handhelds); and geospatial data that requires unique approaches to indexing and distributed analysis.
Defense
19: NIST Genome in a Bottle Consortium
• Application: The NIST Genome in a Bottle Consortium integrates data from multiple sequencing technologies and methods to develop highly confident characterizations of whole human genomes as reference materials, and develops methods to use these reference materials to assess the performance of any genome sequencing run.
• Current Approach: The ~40 TB of NFS storage at NIST is full; there are also petabytes of genomics data at NIH/NCBI. Uses open-source sequencing bioinformatics software from academic groups (UNIX-based) on a 72-core cluster at NIST, supplemented by larger systems at collaborators.
• Futures: DNA sequencers can generate ~300 GB of compressed data per day, a volume that has increased much faster than Moore’s Law. Future data could include other ‘omics’ measurements, which will be even larger than DNA sequencing. Clouds have been explored.
Healthcare and Life Sciences
Features: PP, MR, MRIter, Classification
Parallelism over gene fragments at various stages
38: Large Survey Data for Cosmology
• Application: For DES (the Dark Energy Survey), the data are sent from the mountaintop via a microwave link to La Serena, Chile. From there, an optical link forwards them to NCSA as well as NERSC for storage and “reduction”. Here galaxies and stars in both the individual and stacked images are identified and catalogued, and finally their properties are measured and stored in a database.
• Current Approach: Subtraction pipelines are run using extant imaging data to find new optical transients through machine learning algorithms. Infrastructure includes a Linux cluster, an Oracle RDBMS server, PostgreSQL, large-memory machines, standard Linux interactive hosts, and GPFS; for simulations, HPC resources. Software includes standard astrophysics reduction software as well as Perl/Python wrapper scripts and Linux cluster scheduling.
• Futures: Techniques for handling Cholesky decomposition for thousands of simulations, with matrices of order 1M on a side, and parallel image storage would be important. LSST will generate 60 PB of imaging data and 15 PB of catalog data, and a correspondingly large (or larger) amount of simulation data, with over 20 TB of data per night.
Astronomy & Physics
Image: Victor M. Blanco Telescope, Chile, where the new wide-angle 520-megapixel camera DECam is installed.
Typical Big Data Pattern 2. Perform real-time analytics on data source streams and notify users when specified events occur
Software: Storm (Heron), Kafka, HBase, ZooKeeper
[Diagram: multiple streaming data sources feed a filter that identifies events according to a user-specified filter; identified events are posted to a repository, from which users fetch them, and selected events are archived.]
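Stripped of the infrastructure, the pattern reduces to a predicate applied to each arriving record; a minimal sketch in plain Python, noting that a production deployment would run the filter inside Storm/Heron with Kafka as the transport, and that the record fields and threshold here are illustrative:

```python
def event_filter(stream, predicate):
    """Yield only the records that match the user-specified filter."""
    for record in stream:
        if predicate(record):
            yield record

# Simulated sensor stream; a real deployment would consume from Kafka topics.
stream = [{"id": 1, "temp": 20}, {"id": 2, "temp": 95}, {"id": 3, "temp": 30}]
alerts = list(event_filter(stream, lambda r: r["temp"] > 90))  # notify on hot readings
print(alerts)
```

In the full pattern, the yielded events would be posted to the repository (e.g., HBase) and selected events archived, rather than collected into a list.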
Typical Big Data Pattern 5A. Perform interactive analytics on observational scientific data
Software: Grid or many-task software, Hadoop, Spark, Giraph, Pig, …; Data storage: HDFS, HBase, file collections; Analysis: science analysis code, Mahout, R, SPIDAL
[Diagram: scientific data is recorded in the “field” and locally accumulated with initial computing, then transported (as batches or by direct transfer) to the primary analysis data system; streaming Twitter data for social networking science follows the same path.]
NIST examples include LHC, Remote Sensing, Astronomy and
10. Orchestrate multiple sequential and parallel data transformations and/or analytic processing using a workflow manager
Software: Hadoop, Spark, Giraph, Pig, …; Data storage: HDFS, HBase
[Diagram: a user-specified analytics pipeline drives an orchestration (workflow) layer, which runs Analytic-1 and Analytic-2 over the shared data storage and feeds Analytic-3 for visualization.]
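The orchestration idea can be sketched as a tiny sequential workflow runner in plain Python; real workflow managers add parallel branches, retries, and data staging, and the stage names and functions below are illustrative assumptions:

```python
def run_pipeline(stages, data):
    """Run analytic stages in sequence; each stage's output feeds the next."""
    for name, fn in stages:
        data = fn(data)
        print(f"stage {name} done")  # a real workflow manager would log/monitor here
    return data

# Illustrative pipeline: clean -> aggregate -> format for visualization
pipeline = [
    ("Analytic-1", lambda xs: [x for x in xs if x is not None]),  # drop missing values
    ("Analytic-2", lambda xs: sum(xs) / len(xs)),                 # aggregate to a mean
    ("Analytic-3", lambda m: f"mean={m:.1f}"),                    # format for display
]
result = run_pipeline(pipeline, [1, 2, None, 3])
print(result)  # mean=2.0
```

The Workflow (51) feature in the use case analysis reflects that every use case composed its processing this way, whatever engine ran the individual stages.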
Version 2 Opportunities for Contribution
• More use cases. (Roll up sleeves; budget an hour.)
• Soliciting greater application domain diversity:
  – Smart cars (Smart X)
  – Large-scale utility IoT
  – Geolocation applications involving people
  – Energy, from discovery to generation
  – Scientific studies involving human subjects at large scale
  – Highly distributed use cases bridging multiple enterprises
• Choose a domain and collect/analyze a set of related use cases.
• Develop technology requirements for applications in the domain.
• Feed lessons into Version 3 of the template.
• Compare different big data applications in needed architecture and interfaces.