NIST Big Data Public Working Group
Volume 3, Use Cases and General Requirements: Overview

Geoffrey Fox, Indiana University
Piyush Mehrotra, NASA Ames
NIST Campus

June 1, 2017

Document Scope

- Version 1 collected 51 big data use cases with a 26-feature template and used this to extract requirements to feed into the NIST Big Data Reference Architecture.
- The version 2 template merges the version 1 General and Security & Privacy use case analyses.
- Discussion at the first NIST Big Data meeting identified the need for patterns, which were proposed during the version 2 work; the version 2 template incorporates new questions to help identify patterns.
- Works with Vol. 4 (Security and Privacy), Vol. 6 (Big Data Reference Architecture), Vol. 7 (Standards), and Vol. 8 (Interfaces).

Version 1 Overview

- Gathered and evaluated 51 use cases from nine application domains.
- Gathered input regarding Big Data requirements.
- Analyzed and prioritized a list of challenging use-case-specific requirements that may delay or prevent adoption of Big Data deployments.
- Developed a comprehensive list of generalized Big Data requirements.
- Developed a set of features that characterized applications, used to compare different Big Data problems.
- Collaborated with the NBD-PWG Reference Architecture Subgroup to provide input for the NBDRA.

51 Detailed Use Cases: Version 1, Contributed July-September 2013

- Government Operation (4): National Archives and Records Administration, Census Bureau
- Commercial (8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo Shipping (as in UPS)
- Defense (3): Sensors, Image Surveillance, Situation Assessment
- Healthcare and Life Sciences (10): Medical Records, Graph and Probabilistic Analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity Models, Biodiversity
- Deep Learning and Social Media (6): Driving Car, Geolocate Images/Cameras, Twitter, Crowd Sourcing, Network Science, NIST Benchmark Datasets
- The Ecosystem for Research (4): Metadata, Collaboration, Translation, Light Source Data
- Astronomy and Physics (5): Sky Surveys (including comparison to simulation), Large Hadron Collider at CERN, Belle II Accelerator in Japan
- Earth, Environmental and Polar Science (10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice Sheet Radar Scattering, Earth Radar Mapping, Climate Simulation Datasets, Atmospheric Turbulence Identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET Gas Sensors
- Energy (1): Smart Grid

Version 1 Use Case Template

- Template agreed in this form on August 11, 2013.
- Some clarification on Veracity vs. Data Quality added.
- Request for a picture and summary: done by hand for version 1, but included in the version 2 template.
- Early version 1 use cases did a detailed breakup of the workflow into multiple stages, which we want to restore but do not yet have an agreed format for.

Big Data Applications & Analytics MOOC Use Case Analysis, Fall 2013 (12/26/13)

Size of Requirements Analysis

- 35 General Requirements
- 437 Specific Requirements (8.6 per use case; 12.5 per general requirement)
- Data Sources: 3 General, 78 Specific
- Transformation: 4 General, 60 Specific
- Capability (Infrastructure): 6 General, 133 Specific
- Data Consumer: 6 General, 55 Specific
- Security & Privacy: 2 General, 45 Specific
- Lifecycle: 9 General, 43 Specific
- Other: 5 General, 23 Specific


Classifying Use Cases into Patterns Labelled by Features

The Big Data Ogres were built on the collection of 51 big data use cases gathered by the NIST Public Working Group, where 26 properties were recorded for each application. This information was combined with other studies, including the Berkeley dwarfs, the NAS Parallel Benchmarks, and the Computational Giants of the NRC Massive Data Analysis Report.

The Ogre analysis led to a set of 50 features, divided into four views, that could be used to categorize and distinguish between applications. The four views are Problem Architecture (macro patterns); Execution Features (micro patterns); Data Source and Style; and finally the Processing View, or runtime features.

We generalized this approach to integrate Big Data and simulation applications into a single classification, looking separately at Data and Model, with the total number of facets growing to 64. These are called convergence diamonds and are split among the same four views.

7 Computational Giants of the NRC Massive Data Analysis Report

1) G1: Basic Statistics (later termed MRStat, as suitable for simple MapReduce implementation)
2) G2: Generalized N-Body Problems
3) G3: Graph-Theoretic Computations
4) G4: Linear Algebraic Computations
5) G5: Optimizations, e.g., Linear Programming
6) G6: Integration (later called GML, Global Machine Learning)
7) G7: Alignment Problems, e.g., BLAST

Features of 51 Use Cases I

- PP (26) "All": Pleasingly Parallel or Map Only
- MR (18): Classic MapReduce (add MRStat below for the full count)
- MRStat (7): Simple version of MR where the key computations are simple reductions, as found in statistical summaries such as histograms and averages
- MRIter (23): Iterative MapReduce or MPI (Spark, Twister)
- Graph (9): Complex graph data structure needed in the analysis
- Fusion (11): Integrate diverse data to aid discovery/decision making; could involve sophisticated algorithms or could just be a portal
- Streaming (41): Data comes in incrementally and is processed this way
- Classify (30): Classification: divide data into categories
- S/Q (12): Index, Search and Query
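To see why MRStat maps so cleanly onto MapReduce, the sketch below computes a histogram and a mean with plain map and reduce steps; the readings and bin width are illustrative values, not data from any of the 51 use cases.

```python
from functools import reduce

# Illustrative sensor readings; in an MRStat job these would be
# sharded across many map workers.
readings = [2.1, 3.7, 0.4, 8.9, 5.5, 5.1, 7.2, 1.3, 9.6, 4.8]

def map_to_bin(value, bin_width=2.0):
    """Map step: emit a (bin_index, 1) pair for each reading."""
    return (int(value // bin_width), 1)

def reduce_counts(hist, pair):
    """Reduce step: a commutative per-key sum, the hallmark of MRStat."""
    bin_index, count = pair
    hist[bin_index] = hist.get(bin_index, 0) + count
    return hist

histogram = reduce(reduce_counts, map(map_to_bin, readings), {})
mean = sum(readings) / len(readings)   # another one-pass MRStat reduction
```

Because each reduction is a simple associative sum, the same computation parallelizes trivially: partial histograms from each worker merge by adding counts per key.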

Patterns (Ogres) Modelled on the 13 Berkeley Dwarfs

- Dense Linear Algebra
- Sparse Linear Algebra
- Spectral Methods
- N-Body Methods
- Structured Grids
- Unstructured Grids
- MapReduce
- Combinational Logic
- Graph Traversal
- Dynamic Programming
- Backtrack and Branch-and-Bound
- Graphical Models
- Finite State Machines

The Berkeley dwarfs and the NAS Parallel Benchmarks are perhaps the two best-known approaches to characterizing parallel computing use cases / kernels / patterns. Note that the dwarfs are somewhat inconsistent: for example, MapReduce is a programming model, while spectral methods are a numerical method.

Features of 51 Use Cases II

- CF (4): Collaborative Filtering for recommender engines
- LML (36): Local Machine Learning (independent for each parallel entity); an application could have GML as well
- GML (23): Global Machine Learning: Deep Learning, Clustering, LDA, PLSI, MDS; large-scale optimizations as in Variational Bayes, MCMC, Lifted Belief Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg-Marquardt. Can be called EGO, or Exascale Global Optimization, with scalable parallel algorithms
- Workflow (51): Universal
- GIS (16): Geotagged data, often displayed in ESRI, Microsoft Virtual Earth, Google Earth, GeoServer, etc.
- HPC (5): Classic large-scale simulation of cosmos, materials, etc., generating (visualization) data
- Agent (2): Simulations of models of data-defined macroscopic entities represented as agents
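As a minimal illustration of the optimization methods listed under GML, the sketch below runs stochastic gradient descent on a tiny least-squares fit; the data, learning rate, and epoch count are arbitrary demonstration choices, and a real GML workload would distribute the gradient computation across the parallel entities.

```python
import random

random.seed(0)

# Illustrative noise-free data drawn from y = 3x + 1.
data = [(x, 3.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]

w, b = 0.0, 0.0   # model parameters (slope, intercept)
lr = 0.05         # learning rate (illustrative)

for epoch in range(2000):
    random.shuffle(data)          # "stochastic": visit samples in random order
    for x, y in data:
        err = (w * x + b) - y     # prediction error on one sample
        w -= lr * err * x         # gradient of 0.5*err^2 w.r.t. w
        b -= lr * err             # gradient of 0.5*err^2 w.r.t. b
```

On this noise-free problem SGD recovers the generating parameters (w ≈ 3, b ≈ 1); the distinguishing point for GML is that the model being optimized is global, not local to one data partition.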


3: Census Bureau Statistical Survey Response Improvement (Adaptive Design)

Domain: Government. Features: PP, MRStat, S/Q, Index, CF.

Application: Survey costs are increasing as survey response declines. The goal of this work is to use advanced "recommendation system techniques" that are open and scientifically objective, using data mashed up from several sources and historical survey para-data (administrative data about the survey), to drive operational processes in an effort to increase quality and reduce the cost of field surveys.

Current Approach: About a petabyte of data comes from surveys and other government administrative sources. Data can be streamed, with approximately 150 million records transmitted as field data streamed continuously during the decennial census. All data must be both confidential and secure, and all processes must be auditable for security and confidentiality as required by various legal statutes. Data quality should be high and statistically checked for accuracy and reliability throughout the collection process. Uses Hadoop, Spark, Hive, R, SAS, Mahout, AllegroGraph, MySQL, Oracle, Storm, BigMemory, Cassandra, and Pig software.

Futures: Analytics need to be developed that give statistical estimations with more detail, on a more nearly real-time basis, for less cost. The reliability of estimated statistics from such "mashed up" sources still must be evaluated.

13: Cloud Large-Scale Geospatial Analysis and Visualization

Domain: Defense.

Application: Need to support large-scale geospatial data analysis and visualization, with the number of geospatially aware sensors and geospatially tagged data sources rapidly increasing.

Current Approach: Traditional GIS systems are generally capable of analyzing millions of objects and easily visualizing thousands. Data types include imagery (various formats such as NITF, GeoTIFF, CADRG) and vector data in various formats such as shapefiles, KML, and text streams. Object types include points, lines, areas, polylines, circles, and ellipses. Data accuracy is very important, with image registration and sensor accuracy relevant. Analytics include closest point of approach, deviation from route, point density over time, PCA, and ICA. Software includes a server with a geospatially enabled RDBMS and geospatial server/analysis software (ESRI ArcServer, GeoServer), with visualization by ArcMap or browser-based tools.

Futures: Today's intelligence systems often contain trillions of geospatial objects and need to be able to visualize and interact with millions of objects. Critical issues are indexing, retrieval, and distributed analysis; visualization generation and transmission; visualization of data at the end of low-bandwidth wireless connections; data that is sensitive and must be completely secure in transit and at rest (particularly on handhelds); and geospatial data that requires unique approaches to indexing and distributed analysis.

19: NIST Genome in a Bottle Consortium

Domain: Healthcare and Life Sciences. Features: PP, MR, MRIter, Classification. Parallelism is over gene fragments at various stages.

Application: The NIST Genome in a Bottle Consortium integrates data from multiple sequencing technologies and methods to develop highly confident characterizations of whole human genomes as Reference Materials, and develops methods to use these Reference Materials to assess the performance of any genome sequencing run.

Current Approach: The ~40 TB of NFS storage at NIST is full; there are also petabytes of genomics data at NIH/NCBI. Uses open-source sequencing bioinformatics software from academic groups (UNIX-based) on a 72-core cluster at NIST, supplemented by larger systems at collaborators.

Futures: DNA sequencers can generate ~300 GB of compressed data per day, a volume that has increased much faster than Moore's Law. Future data could include other 'omics' measurements, which will be even larger than DNA sequencing. Clouds have been explored.

38: Large Survey Data for Cosmology

Domain: Astronomy & Physics.

Application: For DES (Dark Energy Survey), the data are sent from the mountaintop via a microwave link to La Serena, Chile. From there, an optical link forwards them to NCSA as well as NERSC for storage and "reduction". Here galaxies and stars in both the individual and stacked images are identified and catalogued, and finally their properties are measured and stored in a database.

Current Approach: Subtraction pipelines are run using extant imaging data to find new optical transients through machine learning algorithms. Uses a Linux cluster, an Oracle RDBMS server, PostgreSQL, large-memory machines, standard Linux interactive hosts, and GPFS; for simulations, HPC resources. Standard astrophysics reduction software as well as Perl/Python wrapper scripts, with Linux cluster scheduling.

Futures: Techniques for handling Cholesky decomposition for thousands of simulations, with matrices of order 1M on a side, and parallel image storage would be important. LSST will generate 60 PB of imaging data and 15 PB of catalog data, and a correspondingly large (or larger) amount of simulation data, with over 20 TB of data per night.

(Photo: Victor M. Blanco Telescope, Chile, where the new wide-angle 520-megapixel DECam camera is installed.)

Typical Big Data Pattern 2: Perform real-time analytics on data source streams and notify users when specified events occur

Software: Storm (Heron), Kafka, HBase, ZooKeeper.

[Diagram: streaming data sources feed a filter identifying events; the filter is user-specified; identified events are posted to a repository and an archive, from which selected events are posted and fetched.]
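A toy, single-process rendition of this pattern might look as follows; the Python lists stand in for what would be Kafka topics and HBase tables in the deployment named above, and the event schema and threshold are hypothetical.

```python
from typing import Iterator

# Hypothetical stand-ins for Kafka topics / HBase tables.
repository = []   # posted (selected) events, available for fetch/notification
archive = []      # every record, kept for later retrieval

def stream() -> Iterator[dict]:
    """Simulated streaming data source (three sensor readings)."""
    yield {"sensor": "A", "value": 3}
    yield {"sensor": "B", "value": 97}
    yield {"sensor": "A", "value": 101}

def make_filter(threshold: int):
    """'Specify filter': the user declares which events are interesting."""
    return lambda event: event["value"] > threshold

event_filter = make_filter(threshold=90)

for event in stream():
    archive.append(event)          # archive every record
    if event_filter(event):        # 'Filter Identifying Events'
        repository.append(event)   # 'Post Selected Events' for notification
```

In the real pattern the loop body would run inside a Storm/Heron bolt consuming a Kafka topic, so the filter scales out across many workers while the archive and repository live in HBase.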

Typical Big Data Pattern 5A: Perform interactive analytics on observational scientific data

Software: Grid or Many Task software, Hadoop, Spark, Giraph, Pig ...; data storage in HDFS, HBase, or file collections; science analysis code, Mahout, R, SPIDAL; streaming Twitter data for social networking.

[Diagram: scientific data is recorded in the "field", locally accumulated with initial computing, then transferred directly or transported in batches to the primary analysis data system.]

NIST examples include LHC, Remote Sensing, Astronomy and ...

Typical Big Data Pattern 10: Orchestrate multiple sequential and parallel data transformations and/or analytic processing using a workflow manager

Software: Hadoop, Spark, Giraph, Pig ...; data storage in HDFS, HBase.

[Diagram: an orchestration layer (workflow) executes a specified analytics pipeline of Analytic-1, Analytic-2, and Analytic-3 (visualize) over the data storage.]
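The orchestration idea can be sketched with Python's standard library alone; the three analytic_* functions are hypothetical placeholders for real transformations, and a production pipeline would run under a workflow manager over Hadoop or Spark rather than a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def analytic_1(data):
    """First transformation (placeholder): square each value."""
    return [x * x for x in data]

def analytic_2(data):
    """Second transformation (placeholder): keep values above a cutoff."""
    return [x for x in data if x > 10]

def analytic_3(data):
    """Final 'visualize' step (placeholder): summarize the result."""
    return {"count": len(data), "total": sum(data)}

def run_pipeline(partitions):
    """Orchestration layer: run analytic_1 in parallel over the data
    partitions, then feed the merged result through the sequential stages."""
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(analytic_1, partitions))  # parallel stage
    merged = [x for part in mapped for x in part]        # gather
    return analytic_3(analytic_2(merged))                # sequential stages

result = run_pipeline([[1, 2, 3], [4, 5, 6]])
```

The point of the sketch is the separation of concerns: the orchestration layer owns the pipeline shape (what runs in parallel, what runs in sequence), while each analytic stays a self-contained step.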

Version 2 Opportunities for Contribution

More use cases (roll up sleeves; budget an hour). Soliciting greater application domain diversity:

- Smart cars (Smart X)
- Large-scale utility IoT
- Geolocation applications involving people
- Energy, from discovery to generation
- Scientific studies involving human subjects at large scale
- Highly distributed use cases bridging multiple enterprises

Ways to contribute:

- Choose a domain and collect/analyze a set of related use cases
- Develop technology requirements for applications in the domain
- Feed lessons into version 3 of the template
- Compare different big data applications in needed architecture and interfaces

Possible Version 3 Topics

- Identify gaps in use cases
- Develop plausible, semi-fictionalized use cases from industry reports, white papers, and academic project reports
- Identify important parameters for classifying systems
- Microservice use cases
- Container-oriented use cases
- Forensic and provenance-centric use cases
- Map use cases to the work in Vol. 4 (SnP), Vol. 6 (Big Data Reference Architecture), Vol. 7 (Standards), and Vol. 8 (Interfaces)
- Review the fitness of the NBDRA to the use cases
