Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science

(1)

1

NSF14-43054 started October 1, 2014

Datanet: CIF21 DIBBs: Middleware

and High Performance Analytics

Libraries for Scalable Data Science

• Indiana University (Fox, Qiu, Crandall, von Laszewski) • Rutgers (Jha)

• Virginia Tech (Marathe) • Kansas (Paden)

• Stony Brook (Wang)

• Arizona State (Beckstein) • Utah (Cheatham)

Overview by Geoffrey Fox, May 17, 2016

http://spidal.org/

(2)

Some Important Components of SPIDAL Dibbs

• _{NIST Big Data Application Analysis}_{– features of data intensive Applications.}

• _HPC-ABDS:_{Cloud-HPC interoperable software}_{performance of HPC}_(High

Performance Computing) and the rich functionality of the commodity Apache Big

Data Stack.

– _{This is a reservoir of software subsystems – nearly all from outside the project and being a}

mix of HPC and Big Data communities.

– _{Leads to}_{Big Data – Simulation}_–_{HPC Convergence.}

• _MIDAS:_{Integrating Middleware – from project.}

• _{Applications:}_{Biomolecular Simulations, Network and Computational Social}

Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science and Pathology Informatics.

• _{SPIDAL (Scalable Parallel Interoperable Data Analytics Library):}

Scalable Analytics for:

– _{Domain specific data analytics libraries – mainly from project.} – _{Add Core Machine learning libraries – mainly from community.} – _{Performance of Java and MIDAS Inter- and Intra-node}_.

• _Benchmarks_{– project adds to communityWBDB2015 Benchmarking Workshop}

• _{Implementations}_{: XSEDE and Blue Waters as well as clouds (OpenStack, Docker)}

2

(3)

Big Data - Big Simulation (Exascale) Convergence

• _Discuss_Data_and_Model_{together as built around problems which combine}

them, but we can get insight by separating which allows better understanding of Big Data - Big Simulation “convergence”

• _{Big Data implies Data is large but Model varies}

– _e.g._LDA_{with many topics or}_{deep learning}_{has large model} – _{Clustering or Dimension reduction can be quite small}

• _Simulations _{can also be considered as}_Data_and_Model

– _Model_{is solving particle dynamics or partial differential equations} – _Data_{could be small when just boundary conditions}

– _Data_{large with data assimilation (weather forecasting) or when data visualizations are}

produced by simulation

• _Data_{often static between iterations (unless streaming);}_Model_varies between iterations

• _{Take 51 NIST and other use cases}__{derive multiple specific features} • _{Generalize and systematize with features termed “facets”}

• _{50 Facets (Big Data) or 64 Facets (Big Simulation and Data)}_{divided into} 4 sets or views where each view has “similar” facets

– _{Allows one to study coverage of benchmark sets and architectures}

3

(4)

64 Features in 4 views for Unified Classification of Big Data

and Simulation Applications

4 Lo ca l ( A na ly ti cs /I nf or m ati cs /S im ul ati o ns ) 2 M

Data Source and Style View

Pleasingly Parallel Classic MapReduce Map-Collective Map Point-to-Point

Shared Memory Single Program Multiple Data Bulk Synchronous Parallel Fusion Dataflow Agents Workflow

Geospatial Information System HPC Simulations

Internet of Things Metadata/Provenance

Shared / Dedicated / Transient / Permanent Archived/Batched/Streaming – S1, S2, S3, S4, S5 HDFS/Lustre/GPFS

Files/Objects

Enterprise Data Model SQL/NoSQL/NewSQL 1 M M ic ro -b e nc h m ar ks Execution View Processing View 1 2 3 4 6 7 8 9 10 11M 12 10D 9 8D 7D 6D 5D 4D 3D 2D 1D

Map Streaming 5

Convergence Diamonds Views and

Facets

Problem Architecture View

15 M C or e Li br ar ie s V is ua liz a ti on 14 M G ra p h A lg or it h m s 13 M Li ne ar A lg e br a Ke rn e ls /M a ny s u bc la ss e s 12 M G lo ba l ( A na ly ti cs /I nf or m ati cs /S im ul ati o ns ) 3 M R ec om m e nd e r E ng in e 5 M 4 M B as e D at a St a ti sti cs 10 M St re am in g D a ta A lg or it hm s O pti m iz a ti o n M e th od o lo gy 9 M Le ar n in g 8 M D a ta C la ss ifi ca ti o n 7 M D at a Se ar ch /Q ue ry /I nd ex 6 M 11 M D a ta A li gn m en t

Big Data Processing Diamonds M u lti sc al e M et ho d 17 M 16 M It er ati ve P D E S ol ve rs 22 M N at ur e of m es h if u se d Ev ol uti on o f D is cr e te S ys te m s 21 M Pa rti cl e s an d Fi el ds 20 M N -b od y M e th od s 19 M Sp e ct ra l M et ho d s 18 M Simulation (Exascale) Processing Diamonds D a ta A b st ra cti o n D 12 M o de l A bs tra cti on M 12 D at a M e tri c = M / N on -M e tri c = N D 13 D at a M et ric = M / N on -M et ric = N M 13 ܱ ܰ ଶ = N N / ܱ ሺ ܰ ሻ = N M 14 R eg u la r= R / Irr e gu la r = I M o de l M 10 Ve ra cit y 7 Ite ra ti ve / Sim p le M 11 C om m un ic ati o n St ru ct u re M 8 D yn am ic = D / St ati c = S D 9 D yn a m ic = D / St ati c = S M 9 R e gu la r = R / Irr eg ul ar = I D a ta D 10 M o de l V ar ie ty M 6 D at a Ve lo cit y D 5 Pe rfo rm an ce M et ric s 1 D a ta V ar ie ty D 6 Flo ps p er B yt e /M e m or y IO /F lo ps p er w att 2 Ex e cu ti on En vir on m en t; C or e lib ra rie s 3 D a ta Vo lu m e D 4 M od el Siz e M 4 Simulations Analytics

(Model for Data)

Both

(All Model)

(Nearly all Data+Model)

(Nearly all Data)

(Mix of Data and Model)

(5)

6 Forms of

MapReduce

Cover “all” circumstances

Describes

- Problem (Model reflecting data)

- Machine - Software

Architecture

5

(6)

6

5/17/2016 HPC-ABDS

Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies

Cross-Cutting Functions

1) Message and Data Protocols:

Avro, Thrift, Protobuf

2) Distributed Coordination : Google Chubby, Zookeeper, Giraffe, JGroups

3) Security & Privacy: InCommon, Eduroam OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML OAuth 4) Monitoring: Ambari, Ganglia, Nagios, Inca

17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML

16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA MAGMA, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK

15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere

15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird

14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook Puma/Ptail/Scribe/ODS, AzureStream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine

14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem

13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs

12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, H-Store

12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC

12) Extraction Tools: UIMA, Tika

11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL

11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame

Public Cloud: Azure Table, Amazon Dynamo, Google DataStore

11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet

10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST

9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs

8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS

Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage

7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis

6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api

5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds

Networking: Google Cloud DNS, Amazon Route 53

(7)

HPC-ABDS Mapping of Activities

• _{Level 17: Orchestration: Apache Beam (Google Cloud Dataflow)}

integrated with Cloudmesh on HPC cluster

• _{Level 16: Applications:}_{Datamining for molecular dynamics, Image}

processing for remote sensing and pathology, graphs, streaming, bioinformatics, social media, financial informatics, text mining

• _{Level 16: Algorithms:}_{Generic and custom for applications}_SPIDAL

• _{Level 14: Programming: Storm, Heron (Twitter replaces Storm), Hadoop,}

Spark, Flink. Improve Inter- and Intra-node performance

• _{Level 13: Communication: Enhanced Storm and Hadoop using HPC}

runtime technologies, Harp

• _{Level 11: Data management: Hbase and MongoDB integrated via use of}

Beam and other Apache tools; enhance Hbase

• _{Level 9: Cluster Management: Integrate Pilot Jobs with Yarn, Mesos,}

Spark, Hadoop; integrate Storm and Heron with Slurm

• _{Level 6: DevOps: Python Cloudmesh virtual Cluster Interoperability}

7

Green is MIDAS

(8)

5/17/2016

(9)

MIDAS: Software Activities in DIBBS

• _Developing_HPC-ABDS_{concept to integrate HPC and Apache}

Technologies

• _Java:_{relook at Java Grande to make performance “best simply possible”} • _DevOps:_{Cloudmesh provides interoperability between HPC and Cloud}

(OpenStack, AWS, Docker) platforms based on virtual clusters with software defined systems using Ansible (Chef )

• _Scheduling:_{Integrate Slurm and Pilot jobs with Yarn & Mesos (ABDS}

schedulers), Programming layer (Hadoop, Spark, Flink, Heron/Storm)

• _{Communication}_{and scientific}_{data abstractions}_{: Harp plug-in to}

Hadoop outperforms ABDS programming layers

• _{Data Management}_{: use Hbase, MongoDB with customization}

• _Workflow_{: Use Apache Crunch and Beam (Google Cloud Data flow) as}

they link to other ABDS technologies.

• _{Starting to integrate MIDAS components and move into algorithms of}

SPIDAL Library

9

(10)

Java MPI performs better than Threads

128 24 core Haswell nodes on SPIDAL DA-MDS Code

10

5/17/2016

Best Threads intra node And MPI inter node

Best MPI; inter and intra node

SM = Optimized Shared memory for intra-node MPI

(11)

Cloudmesh Interoperability DevOps Tool

• _Model:_{Define software configuration with tools like Ansible; instantiate on a virtual}

cluster

• _{An easy-to-use command line program/shell and portal to interface with}

heterogeneous infrastructures

– _{Supports OpenStack, AWS, Azure, SDSC Comet, virtualbox, libcloud supported}

clouds as well as classic HPC and Docker infrastructures

– _{Has an abstraction layer that makes it possible to integrate other IaaS}

frameworks

– _{Uses defaults that help interacting with various clouds} – _{Managing VMs across different IaaS providers is easy} – _{The client}_{saves state}_{between consecutive calls}

• _{Demonstrated interaction with various cloud providers:}

– _{FutureSystems, Chameleon Cloud, Jetstream, CloudLab, Cybera, AWS, Azure,}

virtualbox

• _Status:_{AWS, and Azure, VirtualBox, Docker need improvements; we focus currently}

on Comet and NSF resources that use OpenStack

• _{Currently evaluating 40 team projects from “Big Data Open Source Software Projects}

Class” which used this approach running on VirtualBox, Chameleon and FutureSystems

11

(12)

Cloudmesh Interoperability DevOps Tool

• _Model:_{Define software configuration with tools like Ansible; instantiate on a virtual}

cluster

• _{An easy-to-use command line program/shell and portal to interface with}

heterogeneous infrastructures

– _{Supports OpenStack, AWS, Azure, SDSC Comet, virtualbox, libcloud supported}

clouds as well as classic HPC and Docker infrastructures

– _{Has an abstraction layer that makes it possible to integrate other IaaS}

frameworks

– _{Uses defaults that help interacting with various clouds} – _{Managing VMs across different IaaS providers is easy} – _{The client}_{saves state}_{between consecutive calls}

• _{Demonstrated interaction with various cloud providers:}

– _{FutureSystems, Chameleon Cloud, Jetstream, CloudLab, Cybera, AWS, Azure,}

virtualbox

• _Status:_{AWS, and Azure, VirtualBox, Docker need improvements; we focus currently}

on Comet and NSF resources that use OpenStack

• _{Currently evaluating 40 team projects from “Big Data Open Source Software Projects}

Class” which used this approach running on VirtualBox, Chameleon and FutureSystems

12

(13)

Cloudmesh Client - Architecture

Cloudmesh Client

Cloudmesh Data Access Data

Services Terminal

Compute Security

Choreography

Batch Queue

Abstraction AbstractionVM PaaS Launchers

Reservation Group Management

Workﬂow

Authentication

Virtual Cluster Abstraction Access

Access Abstraction REST Command

Shell & Line

Policy Management

Authorization

Container Abstraction create, read, update, delete

Cloudmesh Portal

Database Abstraction

SQLite MongoDB create, read, update delete (CRUD)

Inventory

C

ho

re

og

ra

ph

y

13

Component View

Layered View

Systems Configuration

(14)

Cloudmesh Client – OSG management

• _{While using Comet it is}

possible to use the same

images that are used on

the internal OSG cluster.

• _{This reduces overall}

management effort.

• _{The client is used to}

manage the VMs.

Comet

Virtual Cluster SupercomputerComet OSG Metascheduling

Service OSG

Access Point

Local Cluster

Other Resources

Goal to use the same images/setup

Different setup

14

OSG: Open Science Grid.

LIGO data analysis was conducted on Comet supported

by the Cloudmesh client.

Funded by NSF Comet.

Cloudmesh

(15)

Cloudmesh Client

Cloudmesh Client –

In support of Experiment Workflow

Create IaaS Create

IaaS

Deploy PaaS Deploy

PaaS

Deploy Data Deploy

Data

Execute Scripts Execute

Scripts Evaluate

Evaluate Repeat Repeat

15

Integrate with Ansible Integrate with Ansible

Buildin gcopy and rsync commands

Cloudmesh scripts Integrate with shell Integrate with Python Manage VMs and virtual

clusters

Scripts Variables Shared state

Integration into Ipython and Apache Beam

Choose general

Infrastructure HPC to Clouds

(16)

Pilot-Hadoop/Spark Architecture

16

http://arxiv.org/abs/1602.00345

HPC into Scheduling Layer

(17)

Pilot-Hadoop Example

17

(18)

Pilot-Data/Memory for Iterative

Processing

18

Provide common API for distributed

cluster memory

Scalable K-Means

(19)

Harp Implementations

• _{Basic Harp: Iterative communication and scientific data abstractions} • _{Careful support of distributed data AND distributed model}

• _{Avoids parameter server approach but distributes model over worker nodes}

and supports collective communication to bring global model to each node

• _{Applied first to Latent Dirichlet Allocation LDA with large model and data}

19

(20)

Clueweb

20

5/17/2016 enwiki

Bi-gram

Clueweb

(21)

21 Harp LDA on Big Red II

Supercomputer (Cray)

0 20 40 60 80 100 120 140

0 5 10 15 20 25 30 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1

Execution Time (hours)Nodes Parallel Efficiency

E xe cu ti o n T im e (h o u rs ) P a ra lle l E ff ic ie n c y

0 5 10 15 20 25 30 35

0 2 4 6 8 10 12 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1

Execution Time (hours) Parallel Efficiency

Nodes E xe cu ti o n T im e (h o u rs ) P a ra lle l E ff ic ie n c y

Harp LDA on Juliet (Intel Haswell)

• Big Red II: tested on 25, 50, 75, 100 and 125 nodes; each node uses 32 parallel threads; Gemini interconnect

• Juliet: tested on 10, 15, 20, 25, 30 nodes; each node uses 64 parallel threads on 36 core Intel Haswell nodes (each with 2 chips);

Infiniband interconnect

Harp LDA Scaling Tests

Corpus: 3,775,554 Wikipedia documents, Vocabulary: 1 million words;

Topics: 10k topics; alpha: 0.01; beta: 0.01; iteration: 200

(22)

SPIDAL Algorithms – Subgraph mining

• _{Finding patterns in graphs is very important}

– _{Counting the number of embeddings of a given labeled/unlabeled}

template subgraph

– _{Finding the most frequent subgraphs/motifs efficiently from a}

given set of candidate templates

– _{Computing the graphlet frequency distribution.}

• _{Reworking existing parallel VT algorithm Sahad with MIDAS}

middleware giving HarpSahad which runs 5 (Google) to 9 (Miami) times faster than original Hadoop version

• _{Work in progress}

22

5/17/2016

Network Nodes No. Of

(in million)

No. Of Edges (in million)

Size (MB)

Web-google 0.9 4.3 65

Miami 2.1 51.2 740

Template

s

(23)

SPIDAL Algorithms – Random Graph Generation

• _{Random graphs, important and needed with particular degree distribution}

and clustering coefficients.

– _{Preferential attachment (PA) model, Chung-Lu (CL), stochastic Kronecker,}

stochastic block model (SBM), and block two–level Erdos-Renyi (BTER)

– _{Generative algorithms for these models are mostly sequential and take a}

prohibitively long time to generate large-scale graphs.

• _{SPIDAL working on developing efficient parallel algorithms for generating random}

graphs using different models with new DG method with low memory and high performance, almost optimal load balancing and excellent scaling.

– _{Algorithms are about 3-4 times faster than the previous ones.}

– _{Generate a network with 250 billion edges in 12 seconds using 1024 processors.} • _{Needs to be packaged for SPIDAL using MIDAS (currently MPI)}

23 5/17/2016 2 .4 5 5 .3

9 67

.3 4 1 8 0 .5 9 8 7 .7 3 2 3 7 .0 9 1 6 8 .9 5 3 5 9 .6 2 1 2 .1 5 7 2 .6 7 0 200 400 600

Miami Twitter Weblinks Friendster UK-Union

DataSet G e n e ra ti o n T im e ( s) _MH DG 0 200 400 600 800

1 64 128 256 512 768 1,024

# of Processors

Sp e e d u p

(24)

SPIDAL Algorithms – Triangle Counting

• _{Triangle counting; important special case of subgraph mining and}

specialized programs can outperform general program

• _{Previous work used Hadoop but MPI based PATRIC is much faster}

• _{SPIDAL version uses much more efficient decomposition (non-overlapping}

graph decomposition) – a factor of 25 lower memory than PATRIC

• _{Next graph problem – Community detection}

24

0 50 100 150 200 250 300 350 400 450

Suri et al. Park et al. 2013 Park et al. 2014 PATRIC Our Algo.

R

un

tim

e

(m

in

ut

es

)

Algorithms

Runtime Performance on Twitter

SPIDAL

MPI version

complete. Need

to package for

SPIDAL and add

MIDAS -- Harp

(25)

SPIDAL Algorithms – Core I

• _{Several parallel core machine learning algorithms; need to add SPIDAL}

Java optimizations to complete parallel codes except MPI MDS

– _{https://www.gitbook.com/book/esaliya/global-machine-learning-with-dsc-spidal/details}

• _O(N2_{) distance matrices}_{calculation with Hadoop parallelism and various}

options (storage MongoDB vs. distributed files), normalization, packing to save memory usage, exploiting symmetry

• _{WDA-SMACOF: Multidimensional scaling MDS}_{is optimal nonlinear}

dimension reduction enhanced by SMACOF, deterministic annealing and Conjugate gradient for non-uniform weights. Used in many applications

– _{MPI (shared memory) and MIDAS (Harp) versions}

• _{MDS Alignment}_{to optimally align related point sets, as in MDS time}

series

• _WebPlotViz_{data management (MongoDB) and browser visualization for}

3D point sets including time series. Available as source or SaaS

• _{MDS as}2_{using Manxcat.}_{Alternative more general but less reliable}

solution of MDS. Latest version of WDA-SMACOF usually preferable

• _Other_{Dimension Reduction}_{: SVD, PCA, GTM to do}

25

(26)

SPIDAL Algorithms – Core II

• _{Latent Dirichlet Allocation LDA}_{for topic finding in text collections; new algorithm}

with MIDAS runtime outperforming current best practice

• _{DA-PWC Deterministic Annealing Pairwise Clustering}_{for case where points}

aren’t in a vector space; used extensively to cluster DNA and proteomic sequences; improved algorithm over other published. Parallelism good but needs SPIDAL Java

• _{DAVS Deterministic Annealing Clustering for vectors}_{; includes specification of}

errors and limit on cluster sizes. Gives very accurate answers for cases where

distinct clustering exists. Being upgraded for new LC-MS proteomics data with one million clusters in 27 million size data set

• _{K-means basic vector clustering}_{: fast and adequate where clusters aren’t}

needed accurately

• _{Elkan’s improved K-means vector clustering}_{: for high dimensional spaces; uses}

triangle inequality to avoid expensive distance calcs

• _{Future work}_–_{Classification}_{: logistic regression, Random Forest, SVM, (deep}

learning); Collaborative Filtering, TF-IDF search and Spark MLlib algorithms

• _Harp-DaaL_{extends Intel DAAL’s local batch mode to multi-node distributed modes}

– _{Leveraging Harp’s benefits of communication for iterative compute models}

26

(27)

SPIDAL Algorithms – Optimization I

• _{Manxcat: Levenberg Marquardt Algorithm for non-linear}



2

optimization with sophisticated version of Newton’s method

calculating value and derivatives of objective function. Parallelism in

calculation of objective function and in parameters to be determined.

Complete – needs SPIDAL Java optimization

• _Viterbi

_{algorithm, for finding the maximum a posteriori (MAP) solution}

for a Hidden Markov Model (HMM). The running time is O(n*s^2)

where n is the number of variables and s is the number of possible

states each variable can take. We will provide an "embarrassingly

parallel" version that processes multiple problems (e.g. many images)

independently; parallelizing within the same problem not needed in

our application space.

Needs Packaging in SPIDAL

• _{Forward-backward algorithm}

_{, for computing marginal distributions}

over HMM variables. Similar characteristics as Viterbi above.

Needs

Packaging in SPIDAL

27

(28)

SPIDAL Algorithms – Optimization II

• _{Loopy belief propagation (LBP)}_{for approximately finding the maximum a}

posteriori (MAP) solution for a Markov Random Field (MRF). Here the running time is O(n^2*s^2*i) in the worst case where n is number of

variables, s is number of states per variable, and i is number of iterations required (which is usually a function of n, e.g. log(n) or sqrt(n)). Here there are various parallelization strategies depending on values of s and n for any given problem.

– _{We will provide two parallel versions: embarrassingly parallel version for}

when s and n are relatively modest, and parallelizing each iteration of the same problem for common situation when s and n are quite large so that each iteration takes a long time relative to number of iterations

required.

– _{Needs Packaging in SPIDAL}

• _{Markov Chain Monte Carlo (MCMC)}_{for approximately computing marking}

distributions and sampling over MRF variables. Similar to LBP with the same two parallelization strategies. Needs Packaging in SPIDAL

28

(29)

Imaging Applications: Remote Sensing,

Pathology, Spatial Systems

• _{Both Pathology/Remote sensing working on 3D images}

• _{Each pathology image could have 10 billion pixels, and we may extract a million}

spatial objects per image and 100 million features (dozens to 100 features per object) per image. We often tile the image into 4K x 4K tiles for processing. We develop buffering-based tiling to handle boundary-crossing objects. For each typical research study, we may have hundreds to thousands of pathology images

• _{Remote sensing aimed at radar images of ice and snow sheets}

• _{2D problems need modest parallelism “intra-image” but often need parallelism}

over images

• _{3D problems need parallelism for an individual image}

• _{Use Optimization algorithms to support applications (e.g. Markov Chain, Integer}

Programming, Bayesian Maximum a posteriori, variational level set, Euler-Lagrange Equation)

• _{Classification (deep learning convolution neural network, SVM, random forest,}

etc.) will be important

29

(30)

2D Radar Polar Remote Sensing

• _{Need to estimate structure of earth (ice, snow, rock) from radar signals from}

plane in 2 or 3 dimensions.

• _{Original 2D analysis ([11])used Hidden Markov Methods; better results using}

MCMC (our solution)

30

5/17/2016

(31)

3D Radar Polar Remote Sensing

• _{Uses LBP to analyze 3D radar images}

31

5/17/2016

Reconstructing bedrock in 3D, for (left) ground truth, (center) existing

algorithm based on maximum likelihood estimators, and (right) our technique based on a Markov Random Field formulation.

Radar gives a cross-section view, parameterized by angle and

range, of the ice structure, which yields a set of 2-d tomographic slices (right) along the flight path.

Each image represents a 3d depth map, with

(32)

Algorithms – Nuclei Segmentation for

Pathology Images

• _{Segment boundaries of nuclei from pathology}

images and extract features for each nucleus

• _{Consist of tiling, segmentation,}

vectorization, boundary object aggregation

• _{Could be executed on MapReduce}

(MIDAS Harp) 32 Step 1: Preliminary Region Partition Step 2: Background Normalization Step 3: Nuclear Boundary Refinement Step 4: Nuclei Separation D is tr ib u te d F ile S ys te m M ap R ed u ce C o m p u ti n g Fr am ew o rk Segmentation Mask Images Boundary Vectorization Raw Polygons Boundary Normalization Normalized Polygons

Isolated Buffer Polygon Removal

Non-boundary Polygons

Whole Slide Polygon Aggregation

Ti lin g M ap R ed u ce

Cleaned Polygons

Boundary Polygons Final Segmented Polygons Tiled Images F1 F2 F3 F5 F4 F6 F7 F8 F9 F10 Tiling

Whole Slide Images

F11

Nuclear segmentation algorithm

Execution pipeline on MapReduce (MIDAS Harp)

(33)

Algorithms – Spatial Querying Methods

33

Spatial Queries _{Architecture of Spatial Query Engine}

• _{Hadoop-GIS is a general framework to support high performance spatial}

queries and analytics for spatial big data on MapReduce.

• _{It supports multiple types of spatial queries on MapReduce through spatial}

partitioning, customizable spatial query engine and on-demand indexing.

• _{SparkGIS is a variation of Hadoop-GIS which runs on Spark to take}

advantage of in-memory processing.

• _{Will extend Hadoop/Spark to Harp MIDAS runtime.} • _{2D complete; 3D in progress}

(34)

Some Applications Enabled

• _{KU/IU: Remote Sensing}_{in Polar Regions} • _{SB: Digital Pathology}_Imaging

• _SB:_{Large scale}_GIS_{applications, including public health} • _{VT: Graph Analysis}_{in studies of networks in many areas} • _{UT, ASU, Rutgers:}_{Analysis of}_{Biomolecular simulations}

Applications not part of Dibbs project but algorithms/software used

• _{IU: Bioinformatics}_and_{Financial Modeling}_{with MDS} • _{Integration with}_{Network Science Infrastructure}

– _{VT/IU: CINET:}_{SPIDAL algorithms will be made available} – _{IU: Osome}_{Observatory on Social Media, currently Twitter}

https://peerj.com/preprints/2008/ using enhanced HBase

– _{IU: Topic Analysis}_{of text data}

• _IU/Rutgers:_{Streaming with HPC enhanced Storm/Heron}

34

(35)

Enabled Applications – Digital Pathology

• _{Digital pathology images scanned from human tissue specimens provide}

rich information about morphological and functional characteristics of biological systems.

• _{Pathology image analysis has high potential to provide diagnostic}

assistance, identify therapeutic targets, and predict patient outcomes and therapeutic responses.

• _{It relies on both pathology image analysis algorithms and spatial querying}

methods.

• _{Extremely large image scale.}

35

Glass Slides Scanning Whole Slide Images Image Analysis

(36)

Applications – Public Health

• _{GIS-oriented public health research has a strong focus on the locations of}

patients and the agents of disease, and studies the spatial patterns and variations.

• _{Integrating multiple spatial big data sources at fine spatial resolutions allow}

public health researchers and health officials to adequately identify, analyze, and monitor health problems at the community level.

• _{This will rely on high performance spatial querying methods on data}

integration.

• _{Note synergy between GIS and Large image processing as in pathology.}

36

(37)

Biomolecular Simulation Data Analysis

37

5/17/2016

Path Similarity Analysis (PSA) with Hausdorff distance

• Utah (CPPTraj), Arizona State (MDAnalysis), Rutgers

• Parallelize key algorithms including O(N

2

) distance

computations between trajectories

(38)

Clustered distances for two methods for sampling macromolecular

transitions (200 trajectories each) showing that both methods produce distinctly different pathways.

38

RADICAL Pilot benchmark run for three different test sets of trajectories, using 12x12

“blocks” per task.

RADICAL-Pilot Hausdorff distance:

all-pairs

problem

(39)

Classification of lipids in membranes

Biological membranes are lipid bilayers with distinct inner and outer surfaces that are formed by lipid mono layers (leaflets). Movement of lipids between leaflets or change of topology (merging of leaflets during fusion events) is difficult to detect in simulations.

39

Lipids colored by leaflet

Same color: continuous leaflet.

(40)

LeafletFinder

LeafletFinder is a graph-based algorithm to detect continuous

lipid membrane leaflets in a MD simulation*. The current

implementation is slow and does not work well for large systems

(>100,000 lipids).

40

Phosphate

atom

coordinates

Build

nearest-neighbors

adjacency matrix

Find largest

connected

subgraphs

* N. Michaud-Agrawal, E. J. Denning, T. B.

Woolf, and O. Beckstein. MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. J Comp Chem, 32:2319–2327, 2011.

(41)

41

Apple

Mid Cap

Energy

S&P

Dow Jones

Finance

Origin 0% change

+10%

+20%

Time series of Stock Values projected to 3D

Using one day stock values measured from January 2004 and starting after one year January 2005 (Filled circles are final values)