• No results found

Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science

N/A
N/A
Protected

Academic year: 2019

Share "Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science"

Copied!
41
0
0

Loading.... (view fulltext now)

Full text

(1)

1

NSF14-43054 started October 1, 2014

Datanet: CIF21 DIBBs: Middleware

and High Performance Analytics

Libraries for Scalable Data Science

• Indiana University (Fox, Qiu, Crandall, von Laszewski) • Rutgers (Jha)

• Virginia Tech (Marathe) • Kansas (Paden)

• Stony Brook (Wang)

• Arizona State (Beckstein) • Utah (Cheatham)

Overview by Geoffrey Fox, May 17, 2016

http://spidal.org/

(2)

Some Important Components of SPIDAL Dibbs

NIST Big Data Application Analysis – features of data intensive Applications.

HPC-ABDS: Cloud-HPC interoperable software performance of HPC (High

Performance Computing) and the rich functionality of the commodity Apache Big

Data Stack.

This is a reservoir of software subsystems – nearly all from outside the project and being a

mix of HPC and Big Data communities.

Leads to Big Data – Simulation HPC Convergence.

MIDAS: Integrating Middleware – from project.

Applications: Biomolecular Simulations, Network and Computational Social

Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science and Pathology Informatics.

SPIDAL (Scalable Parallel Interoperable Data Analytics Library):

Scalable Analytics for:

Domain specific data analytics libraries – mainly from project.Add Core Machine learning libraries – mainly from community.Performance of Java and MIDAS Inter- and Intra-node.

Benchmarks – project adds to communityWBDB2015 Benchmarking Workshop

Implementations: XSEDE and Blue Waters as well as clouds (OpenStack, Docker)

2

(3)

Big Data - Big Simulation (Exascale) Convergence

Discuss Data and Model together as built around problems which combine

them, but we can get insight by separating which allows better understanding of Big Data - Big Simulation “convergence”

Big Data implies Data is large but Model varies

e.g. LDA with many topics or deep learning has large modelClustering or Dimension reduction can be quite small

Simulations can also be considered as Data and Model

Model is solving particle dynamics or partial differential equationsData could be small when just boundary conditions

Data large with data assimilation (weather forecasting) or when data visualizations are

produced by simulation

Data often static between iterations (unless streaming); Model varies between iterations

Take 51 NIST and other use cases derive multiple specific featuresGeneralize and systematize with features termed “facets”

50 Facets (Big Data) or 64 Facets (Big Simulation and Data) divided into 4 sets or views where each view has “similar” facets

Allows one to study coverage of benchmark sets and architectures

3

(4)

64 Features in 4 views for Unified Classification of Big Data

and Simulation Applications

4 Lo ca l ( A na ly ti cs /I nf or m ati cs /S im ul ati o ns ) 2 M

Data Source and Style View

Pleasingly Parallel Classic MapReduce Map-Collective Map Point-to-Point

Shared Memory Single Program Multiple Data Bulk Synchronous Parallel Fusion Dataflow Agents Workflow

Geospatial Information System HPC Simulations

Internet of Things Metadata/Provenance

Shared / Dedicated / Transient / Permanent Archived/Batched/Streaming – S1, S2, S3, S4, S5 HDFS/Lustre/GPFS

Files/Objects

Enterprise Data Model SQL/NoSQL/NewSQL 1 M M ic ro -b e nc h m ar ks Execution View Processing View 1 2 3 4 6 7 8 9 10 11M 12 10D 9 8D 7D 6D 5D 4D 3D 2D 1D

Map Streaming 5

Convergence Diamonds Views and

Facets

Problem Architecture View

15 M C or e Li br ar ie s V is ua liz a ti on 14 M G ra p h A lg or it h m s 13 M Li ne ar A lg e br a Ke rn e ls /M a ny s u bc la ss e s 12 M G lo ba l ( A na ly ti cs /I nf or m ati cs /S im ul ati o ns ) 3 M R ec om m e nd e r E ng in e 5 M 4 M B as e D at a St a ti sti cs 10 M St re am in g D a ta A lg or it hm s O pti m iz a ti o n M e th od o lo gy 9 M Le ar n in g 8 M D a ta C la ss ifi ca ti o n 7 M D at a Se ar ch /Q ue ry /I nd ex 6 M 11 M D a ta A li gn m en t

Big Data Processing Diamonds M u lti sc al e M et ho d 17 M 16 M It er ati ve P D E S ol ve rs 22 M N at ur e of m es h if u se d Ev ol uti on o f D is cr e te S ys te m s 21 M Pa rti cl e s an d Fi el ds 20 M N -b od y M e th od s 19 M Sp e ct ra l M et ho d s 18 M Simulation (Exascale) Processing Diamonds D a ta A b st ra cti o n D 12 M o de l A bs tra cti on M 12 D at a M e tri c = M / N on -M e tri c = N D 13 D at a M et ric = M / N on -M et ric = N M 13 ܱ ܰ ଶ = N N / ܱ ሺ ܰ ሻ = N M 14 R eg u la r= R / Irr e gu la r = I M o de l M 10 Ve ra cit y 7 Ite ra ti ve / Sim p le M 11 C om m un ic ati o n St ru ct u re M 8 D yn am ic = D / St ati c = S D 9 D yn a m ic = D / St ati c = S M 9 R e gu la r = R / Irr eg ul ar = I D a ta D 10 M o de l V ar ie ty M 6 D at a Ve lo cit y D 5 Pe rfo rm an ce M et ric s 1 D a ta V ar ie ty D 6 Flo ps p er B yt e /M e m or y IO /F lo ps p er w att 2 Ex e cu ti on En vir on m en t; C or e lib ra rie s 3 D a ta Vo lu m e D 4 M od el Siz e M 4 Simulations Analytics

(Model for Data)

Both

(All Model)

(Nearly all Data+Model)

(Nearly all Data)

(Mix of Data and Model)

(5)

6 Forms of

MapReduce

Cover “all” circumstances

Describes

- Problem (Model reflecting data)

- Machine - Software

Architecture

5

(6)

6

5/17/2016 HPC-ABDS

Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies

Cross-Cutting Functions

1) Message and Data Protocols:

Avro, Thrift, Protobuf

2) Distributed Coordination : Google Chubby, Zookeeper, Giraffe, JGroups

3) Security & Privacy: InCommon, Eduroam OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML OAuth 4) Monitoring: Ambari, Ganglia, Nagios, Inca

17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML

16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA MAGMA, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK

15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere

15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird

14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook Puma/Ptail/Scribe/ODS, AzureStream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine

14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem

13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs

12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, H-Store

12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC

12) Extraction Tools: UIMA, Tika

11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL

11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame

Public Cloud: Azure Table, Amazon Dynamo, Google DataStore

11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet

10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST

9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs

8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS

Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage

7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis

6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api

5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds

Networking: Google Cloud DNS, Amazon Route 53

(7)

HPC-ABDS Mapping of Activities

Level 17: Orchestration: Apache Beam (Google Cloud Dataflow)

integrated with Cloudmesh on HPC cluster

Level 16: Applications: Datamining for molecular dynamics, Image

processing for remote sensing and pathology, graphs, streaming, bioinformatics, social media, financial informatics, text mining

Level 16: Algorithms: Generic and custom for applications SPIDAL

Level 14: Programming: Storm, Heron (Twitter replaces Storm), Hadoop,

Spark, Flink. Improve Inter- and Intra-node performance

Level 13: Communication: Enhanced Storm and Hadoop using HPC

runtime technologies, Harp

Level 11: Data management: Hbase and MongoDB integrated via use of

Beam and other Apache tools; enhance Hbase

Level 9: Cluster Management: Integrate Pilot Jobs with Yarn, Mesos,

Spark, Hadoop; integrate Storm and Heron with Slurm

Level 6: DevOps: Python Cloudmesh virtual Cluster Interoperability

7

Green is MIDAS

(8)

5/17/2016

(9)

MIDAS: Software Activities in DIBBS

Developing HPC-ABDS concept to integrate HPC and Apache

Technologies

Java: relook at Java Grande to make performance “best simply possible”DevOps: Cloudmesh provides interoperability between HPC and Cloud

(OpenStack, AWS, Docker) platforms based on virtual clusters with software defined systems using Ansible (Chef )

Scheduling: Integrate Slurm and Pilot jobs with Yarn & Mesos (ABDS

schedulers), Programming layer (Hadoop, Spark, Flink, Heron/Storm)

Communication and scientific data abstractions: Harp plug-in to

Hadoop outperforms ABDS programming layers

Data Management: use Hbase, MongoDB with customization

Workflow: Use Apache Crunch and Beam (Google Cloud Data flow) as

they link to other ABDS technologies.

Starting to integrate MIDAS components and move into algorithms of

SPIDAL Library

9

(10)

Java MPI performs better than Threads

128 24 core Haswell nodes on SPIDAL DA-MDS Code

10

5/17/2016

Best Threads intra node And MPI inter node

Best MPI; inter and intra node

SM = Optimized Shared memory for intra-node MPI

(11)

Cloudmesh Interoperability DevOps Tool

Model: Define software configuration with tools like Ansible; instantiate on a virtual

cluster

An easy-to-use command line program/shell and portal to interface with

heterogeneous infrastructures

Supports OpenStack, AWS, Azure, SDSC Comet, virtualbox, libcloud supported

clouds as well as classic HPC and Docker infrastructures

Has an abstraction layer that makes it possible to integrate other IaaS

frameworks

Uses defaults that help interacting with various cloudsManaging VMs across different IaaS providers is easyThe client saves state between consecutive calls

Demonstrated interaction with various cloud providers:

FutureSystems, Chameleon Cloud, Jetstream, CloudLab, Cybera, AWS, Azure,

virtualbox

Status: AWS, and Azure, VirtualBox, Docker need improvements; we focus currently

on Comet and NSF resources that use OpenStack

Currently evaluating 40 team projects from “Big Data Open Source Software Projects

Class” which used this approach running on VirtualBox, Chameleon and FutureSystems

11

(12)

Cloudmesh Interoperability DevOps Tool

Model: Define software configuration with tools like Ansible; instantiate on a virtual

cluster

An easy-to-use command line program/shell and portal to interface with

heterogeneous infrastructures

Supports OpenStack, AWS, Azure, SDSC Comet, virtualbox, libcloud supported

clouds as well as classic HPC and Docker infrastructures

Has an abstraction layer that makes it possible to integrate other IaaS

frameworks

Uses defaults that help interacting with various cloudsManaging VMs across different IaaS providers is easyThe client saves state between consecutive calls

Demonstrated interaction with various cloud providers:

FutureSystems, Chameleon Cloud, Jetstream, CloudLab, Cybera, AWS, Azure,

virtualbox

Status: AWS, and Azure, VirtualBox, Docker need improvements; we focus currently

on Comet and NSF resources that use OpenStack

Currently evaluating 40 team projects from “Big Data Open Source Software Projects

Class” which used this approach running on VirtualBox, Chameleon and FutureSystems

12

(13)

Cloudmesh Client - Architecture

Cloudmesh Client

Cloudmesh Data Access Data

Services Terminal

Compute Security

Choreography

Batch Queue

Abstraction AbstractionVM PaaS Launchers

Reservation Group Management

Workflow

Authentication

Virtual Cluster Abstraction Access

Access Abstraction REST Command

Shell & Line

Policy Management

Authorization

Container Abstraction create, read, update, delete

Cloudmesh Portal

Database Abstraction

SQLite MongoDB create, read, update delete (CRUD)

Inventory

C

ho

re

og

ra

ph

y

13

Component View

Layered View

Systems Configuration

(14)

Cloudmesh Client – OSG management

While using Comet it is

possible to use the same

images that are used on

the internal OSG cluster.

This reduces overall

management effort.

The client is used to

manage the VMs.

Comet

Virtual Cluster SupercomputerComet OSG Metascheduling

Service OSG

Access Point

Local Cluster

Other Resources

Goal to use the same images/setup

Different setup

14

OSG: Open Science Grid.

LIGO data analysis was conducted on Comet supported

by the Cloudmesh client.

Funded by NSF Comet.

Cloudmesh

(15)

Cloudmesh Client

Cloudmesh Client –

In support of Experiment Workflow

Create IaaS Create

IaaS

Deploy PaaS Deploy

PaaS

Deploy Data Deploy

Data

Execute Scripts Execute

Scripts Evaluate

Evaluate Repeat Repeat

15

Integrate with Ansible Integrate with Ansible

Buildin gcopy and rsync commands

Cloudmesh scripts Integrate with shell Integrate with Python Manage VMs and virtual

clusters

Scripts Variables Shared state

Integration into Ipython and Apache Beam

Choose general

Infrastructure HPC to Clouds

(16)

Pilot-Hadoop/Spark Architecture

16

http://arxiv.org/abs/1602.00345

HPC into Scheduling Layer

(17)

Pilot-Hadoop Example

17

(18)

Pilot-Data/Memory for Iterative

Processing

18

Provide common API for distributed

cluster memory

Scalable K-Means

(19)

Harp Implementations

Basic Harp: Iterative communication and scientific data abstractionsCareful support of distributed data AND distributed model

Avoids parameter server approach but distributes model over worker nodes

and supports collective communication to bring global model to each node

Applied first to Latent Dirichlet Allocation LDA with large model and data

19

(20)

Clueweb

20

5/17/2016 enwiki

Bi-gram

Clueweb

(21)

21 Harp LDA on Big Red II

Supercomputer (Cray)

0 20 40 60 80 100 120 140

0 5 10 15 20 25 30 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1

Execution Time (hours)Nodes Parallel Efficiency

E xe cu ti o n T im e (h o u rs ) P a ra lle l E ff ic ie n c y

0 5 10 15 20 25 30 35

0 2 4 6 8 10 12 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1

Execution Time (hours) Parallel Efficiency

Nodes E xe cu ti o n T im e (h o u rs ) P a ra lle l E ff ic ie n c y

Harp LDA on Juliet (Intel Haswell)

Big Red II: tested on 25, 50, 75, 100 and 125 nodes; each node uses 32 parallel threads; Gemini interconnect

Juliet: tested on 10, 15, 20, 25, 30 nodes; each node uses 64 parallel threads on 36 core Intel Haswell nodes (each with 2 chips);

Infiniband interconnect

Harp LDA Scaling Tests

Corpus: 3,775,554 Wikipedia documents, Vocabulary: 1 million words;

Topics: 10k topics; alpha: 0.01; beta: 0.01; iteration: 200

(22)

SPIDAL Algorithms – Subgraph mining

Finding patterns in graphs is very important

Counting the number of embeddings of a given labeled/unlabeled

template subgraph

Finding the most frequent subgraphs/motifs efficiently from a

given set of candidate templates

Computing the graphlet frequency distribution.

Reworking existing parallel VT algorithm Sahad with MIDAS

middleware giving HarpSahad which runs 5 (Google) to 9 (Miami) times faster than original Hadoop version

Work in progress

22

5/17/2016

Network Nodes No. Of

(in million)

No. Of Edges (in million)

Size (MB)

Web-google 0.9 4.3 65

Miami 2.1 51.2 740

Template

s

(23)

SPIDAL Algorithms – Random Graph Generation

Random graphs, important and needed with particular degree distribution

and clustering coefficients.

Preferential attachment (PA) model, Chung-Lu (CL), stochastic Kronecker,

stochastic block model (SBM), and block two–level Erdos-Renyi (BTER)

Generative algorithms for these models are mostly sequential and take a

prohibitively long time to generate large-scale graphs.

SPIDAL working on developing efficient parallel algorithms for generating random

graphs using different models with new DG method with low memory and high performance, almost optimal load balancing and excellent scaling.

Algorithms are about 3-4 times faster than the previous ones.

Generate a network with 250 billion edges in 12 seconds using 1024 processors.Needs to be packaged for SPIDAL using MIDAS (currently MPI)

23 5/17/2016 2 .4 5 5 .3

9 67

.3 4 1 8 0 .5 9 8 7 .7 3 2 3 7 .0 9 1 6 8 .9 5 3 5 9 .6 2 1 2 .1 5 7 2 .6 7 0 200 400 600

Miami Twitter Weblinks Friendster UK-Union

DataSet G e n e ra ti o n T im e ( s) MH DG 0 200 400 600 800

1 64 128 256 512 768 1,024

# of Processors

Sp e e d u p

(24)

SPIDAL Algorithms – Triangle Counting

Triangle counting; important special case of subgraph mining and

specialized programs can outperform general program

Previous work used Hadoop but MPI based PATRIC is much faster

SPIDAL version uses much more efficient decomposition (non-overlapping

graph decomposition) – a factor of 25 lower memory than PATRIC

Next graph problem – Community detection

24

0 50 100 150 200 250 300 350 400 450

Suri et al. Park et al. 2013 Park et al. 2014 PATRIC Our Algo.

R

un

tim

e

(m

in

ut

es

)

Algorithms

Runtime Performance on Twitter

SPIDAL

MPI version

complete. Need

to package for

SPIDAL and add

MIDAS -- Harp

(25)

SPIDAL Algorithms – Core I

Several parallel core machine learning algorithms; need to add SPIDAL

Java optimizations to complete parallel codes except MPI MDS

https://www.gitbook.com/book/esaliya/global-machine-learning-with-dsc-spidal/details

O(N2) distance matrices calculation with Hadoop parallelism and various

options (storage MongoDB vs. distributed files), normalization, packing to save memory usage, exploiting symmetry

WDA-SMACOF: Multidimensional scaling MDS is optimal nonlinear

dimension reduction enhanced by SMACOF, deterministic annealing and Conjugate gradient for non-uniform weights. Used in many applications

MPI (shared memory) and MIDAS (Harp) versions

MDS Alignment to optimally align related point sets, as in MDS time

series

WebPlotViz data management (MongoDB) and browser visualization for

3D point sets including time series. Available as source or SaaS

MDS as 2 using Manxcat. Alternative more general but less reliable

solution of MDS. Latest version of WDA-SMACOF usually preferable

Other Dimension Reduction: SVD, PCA, GTM to do

25

(26)

SPIDAL Algorithms – Core II

Latent Dirichlet Allocation LDA for topic finding in text collections; new algorithm

with MIDAS runtime outperforming current best practice

DA-PWC Deterministic Annealing Pairwise Clustering for case where points

aren’t in a vector space; used extensively to cluster DNA and proteomic sequences; improved algorithm over other published. Parallelism good but needs SPIDAL Java

DAVS Deterministic Annealing Clustering for vectors; includes specification of

errors and limit on cluster sizes. Gives very accurate answers for cases where

distinct clustering exists. Being upgraded for new LC-MS proteomics data with one million clusters in 27 million size data set

K-means basic vector clustering: fast and adequate where clusters aren’t

needed accurately

Elkan’s improved K-means vector clustering: for high dimensional spaces; uses

triangle inequality to avoid expensive distance calcs

Future work Classification: logistic regression, Random Forest, SVM, (deep

learning); Collaborative Filtering, TF-IDF search and Spark MLlib algorithms

Harp-DaaL extends Intel DAAL’s local batch mode to multi-node distributed modes

Leveraging Harp’s benefits of communication for iterative compute models

26

(27)

SPIDAL Algorithms – Optimization I

Manxcat: Levenberg Marquardt Algorithm for non-linear

2

optimization with sophisticated version of Newton’s method

calculating value and derivatives of objective function. Parallelism in

calculation of objective function and in parameters to be determined.

Complete – needs SPIDAL Java optimization

Viterbi

algorithm, for finding the maximum a posteriori (MAP) solution

for a Hidden Markov Model (HMM). The running time is O(n*s^2)

where n is the number of variables and s is the number of possible

states each variable can take. We will provide an "embarrassingly

parallel" version that processes multiple problems (e.g. many images)

independently; parallelizing within the same problem not needed in

our application space.

Needs Packaging in SPIDAL

Forward-backward algorithm

, for computing marginal distributions

over HMM variables. Similar characteristics as Viterbi above.

Needs

Packaging in SPIDAL

27

(28)

SPIDAL Algorithms – Optimization II

Loopy belief propagation (LBP) for approximately finding the maximum a

posteriori (MAP) solution for a Markov Random Field (MRF). Here the running time is O(n^2*s^2*i) in the worst case where n is number of

variables, s is number of states per variable, and i is number of iterations required (which is usually a function of n, e.g. log(n) or sqrt(n)). Here there are various parallelization strategies depending on values of s and n for any given problem.

We will provide two parallel versions: embarrassingly parallel version for

when s and n are relatively modest, and parallelizing each iteration of the same problem for common situation when s and n are quite large so that each iteration takes a long time relative to number of iterations

required.

Needs Packaging in SPIDAL

Markov Chain Monte Carlo (MCMC) for approximately computing marking

distributions and sampling over MRF variables. Similar to LBP with the same two parallelization strategies. Needs Packaging in SPIDAL

28

(29)

Imaging Applications: Remote Sensing,

Pathology, Spatial Systems

Both Pathology/Remote sensing working on 3D images

Each pathology image could have 10 billion pixels, and we may extract a million

spatial objects per image and 100 million features (dozens to 100 features per object) per image. We often tile the image into 4K x 4K tiles for processing. We develop buffering-based tiling to handle boundary-crossing objects. For each typical research study, we may have hundreds to thousands of pathology images

Remote sensing aimed at radar images of ice and snow sheets

2D problems need modest parallelism “intra-image” but often need parallelism

over images

3D problems need parallelism for an individual image

Use Optimization algorithms to support applications (e.g. Markov Chain, Integer

Programming, Bayesian Maximum a posteriori, variational level set, Euler-Lagrange Equation)

Classification (deep learning convolution neural network, SVM, random forest,

etc.) will be important

29

(30)

2D Radar Polar Remote Sensing

Need to estimate structure of earth (ice, snow, rock) from radar signals from

plane in 2 or 3 dimensions.

Original 2D analysis ([11])used Hidden Markov Methods; better results using

MCMC (our solution)

30

5/17/2016

(31)

3D Radar Polar Remote Sensing

Uses LBP to analyze 3D radar images

31

5/17/2016

Reconstructing bedrock in 3D, for (left) ground truth, (center) existing

algorithm based on maximum likelihood estimators, and (right) our technique based on a Markov Random Field formulation.

Radar gives a cross-section view, parameterized by angle and

range, of the ice structure, which yields a set of 2-d tomographic slices (right) along the flight path.

Each image represents a 3d depth map, with

(32)

Algorithms – Nuclei Segmentation for

Pathology Images

Segment boundaries of nuclei from pathology

images and extract features for each nucleus

Consist of tiling, segmentation,

vectorization, boundary object aggregation

Could be executed on MapReduce

(MIDAS Harp) 32 Step 1: Preliminary Region Partition Step 2: Background Normalization Step 3: Nuclear Boundary Refinement Step 4: Nuclei Separation D is tr ib u te d F ile S ys te m M ap R ed u ce C o m p u ti n g Fr am ew o rk Segmentation Mask Images Boundary Vectorization Raw Polygons Boundary Normalization Normalized Polygons

Isolated Buffer Polygon Removal

Non-boundary Polygons

Whole Slide Polygon Aggregation

Ti lin g M ap R ed u ce

Cleaned Polygons

Boundary Polygons Final Segmented Polygons Tiled Images F1 F2 F3 F5 F4 F6 F7 F8 F9 F10 Tiling

Whole Slide Images

F11

Nuclear segmentation algorithm

Execution pipeline on MapReduce (MIDAS Harp)

(33)

Algorithms – Spatial Querying Methods

33

Spatial Queries Architecture of Spatial Query Engine

Hadoop-GIS is a general framework to support high performance spatial

queries and analytics for spatial big data on MapReduce.

It supports multiple types of spatial queries on MapReduce through spatial

partitioning, customizable spatial query engine and on-demand indexing.

SparkGIS is a variation of Hadoop-GIS which runs on Spark to take

advantage of in-memory processing.

Will extend Hadoop/Spark to Harp MIDAS runtime.2D complete; 3D in progress

(34)

Some Applications Enabled

KU/IU: Remote Sensing in Polar RegionsSB: Digital Pathology Imaging

SB: Large scale GIS applications, including public healthVT: Graph Analysis in studies of networks in many areasUT, ASU, Rutgers: Analysis of Biomolecular simulations

Applications not part of Dibbs project but algorithms/software used

IU: Bioinformatics and Financial Modeling with MDSIntegration with Network Science Infrastructure

VT/IU: CINET: SPIDAL algorithms will be made availableIU: Osome Observatory on Social Media, currently Twitter

https://peerj.com/preprints/2008/ using enhanced HBase

IU: Topic Analysis of text data

IU/Rutgers: Streaming with HPC enhanced Storm/Heron

34

(35)

Enabled Applications – Digital Pathology

Digital pathology images scanned from human tissue specimens provide

rich information about morphological and functional characteristics of biological systems.

Pathology image analysis has high potential to provide diagnostic

assistance, identify therapeutic targets, and predict patient outcomes and therapeutic responses.

It relies on both pathology image analysis algorithms and spatial querying

methods.

Extremely large image scale.

35

Glass Slides Scanning Whole Slide Images Image Analysis

(36)

Applications – Public Health

GIS-oriented public health research has a strong focus on the locations of

patients and the agents of disease, and studies the spatial patterns and variations.

Integrating multiple spatial big data sources at fine spatial resolutions allow

public health researchers and health officials to adequately identify, analyze, and monitor health problems at the community level.

This will rely on high performance spatial querying methods on data

integration.

Note synergy between GIS and Large image processing as in pathology.

36

(37)

Biomolecular Simulation Data Analysis

37

5/17/2016

Path Similarity Analysis (PSA) with Hausdorff distance

Utah (CPPTraj), Arizona State (MDAnalysis), Rutgers

• Parallelize key algorithms including O(N

2

) distance

computations between trajectories

(38)

Clustered distances for two methods for sampling macromolecular

transitions (200 trajectories each) showing that both methods produce distinctly different pathways.

38

RADICAL Pilot benchmark run for three different test sets of trajectories, using 12x12

“blocks” per task.

RADICAL-Pilot Hausdorff distance:

all-pairs

problem

(39)

Classification of lipids in membranes

Biological membranes are lipid bilayers with distinct inner and outer surfaces that are formed by lipid mono layers (leaflets). Movement of lipids between leaflets or change of topology (merging of leaflets during fusion events) is difficult to detect in simulations.

39

Lipids colored by leaflet

Same color: continuous leaflet.

(40)

LeafletFinder

LeafletFinder is a graph-based algorithm to detect continuous

lipid membrane leaflets in a MD simulation*. The current

implementation is slow and does not work well for large systems

(>100,000 lipids).

40

Phosphate

atom

coordinates

Build

nearest-neighbors

adjacency matrix

Find largest

connected

subgraphs

* N. Michaud-Agrawal, E. J. Denning, T. B.

Woolf, and O. Beckstein. MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. J Comp Chem, 32:2319–2327, 2011.

(41)

41

Apple

Mid Cap

Energy

S&P

Dow Jones

Finance

Origin 0% change

+10%

+20%

Time series of Stock Values projected to 3D

Using one day stock values measured from January 2004 and starting after one year January 2005 (Filled circles are final values)

References

Related documents

“Disability” means, unless otherwise set forth in an Award Agreement or other written agreement between the Company, the Partnership, the General Partner or one of their Affiliates

Figure 1 shows the thickness values of the CdTe films deposited as a function of the different substrate temperatures and deposition times.. From Figure 1 can be appreciated

Furthermore, the fact that “opening up the public sector that has been responsible for the development and operation of domestic infrastructure to the private sector [...] leads

 Improved fineness further resulted to the improvement of the material transport properties of the blended cement concrete through the discontinuation of the interconnected

Amdahl’s law states that the speedup of any parallel program has an upper bound which is determined by the amount of time spent on the sequential fraction of the program, no matter

As part of the Canada-US Free Trade negotiations in the 1980s, Canada agreed to eliminate any import permit requirements for American wheat coming into Canada

Individuals or groups may alternatively wish to place a memorial plaque on existing park or street furniture or other object, as either a donation to the community, or as a memorial