1
NSF14-43054 started October 1, 2014
Datanet: CIF21 DIBBs: Middleware
and High Performance Analytics
Libraries for Scalable Data Science
• Indiana University (Fox, Qiu, Crandall, von Laszewski) • Rutgers (Jha)
• Virginia Tech (Marathe) • Kansas (Paden)
• Stony Brook (Wang)
• Arizona State (Beckstein) • Utah (Cheatham)
Overview by Geoffrey Fox, May 17, 2016
http://spidal.org/
Some Important Components of SPIDAL Dibbs
• NIST Big Data Application Analysis – features of data intensive Applications.
• HPC-ABDS: Cloud-HPC interoperable software performance of HPC (High
Performance Computing) and the rich functionality of the commodity Apache Big
Data Stack.
– This is a reservoir of software subsystems – nearly all from outside the project and being a
mix of HPC and Big Data communities.
– Leads to Big Data – Simulation – HPC Convergence.
• MIDAS: Integrating Middleware – from project.
• Applications: Biomolecular Simulations, Network and Computational Social
Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science and Pathology Informatics.
• SPIDAL (Scalable Parallel Interoperable Data Analytics Library):
Scalable Analytics for:
– Domain specific data analytics libraries – mainly from project. – Add Core Machine learning libraries – mainly from community. – Performance of Java and MIDAS Inter- and Intra-node.
• Benchmarks – project adds to communityWBDB2015 Benchmarking Workshop
• Implementations: XSEDE and Blue Waters as well as clouds (OpenStack, Docker)
2
Big Data - Big Simulation (Exascale) Convergence
• Discuss Data and Model together as built around problems which combine
them, but we can get insight by separating which allows better understanding of Big Data - Big Simulation “convergence”
• Big Data implies Data is large but Model varies
– e.g. LDA with many topics or deep learning has large model – Clustering or Dimension reduction can be quite small
• Simulations can also be considered as Data and Model
– Model is solving particle dynamics or partial differential equations – Data could be small when just boundary conditions
– Data large with data assimilation (weather forecasting) or when data visualizations are
produced by simulation
• Data often static between iterations (unless streaming); Model varies between iterations
• Take 51 NIST and other use cases derive multiple specific features • Generalize and systematize with features termed “facets”
• 50 Facets (Big Data) or 64 Facets (Big Simulation and Data) divided into 4 sets or views where each view has “similar” facets
– Allows one to study coverage of benchmark sets and architectures
3
64 Features in 4 views for Unified Classification of Big Data
and Simulation Applications
4 Lo ca l ( A na ly ti cs /I nf or m ati cs /S im ul ati o ns ) 2 M
Data Source and Style View
Pleasingly Parallel Classic MapReduce Map-Collective Map Point-to-Point
Shared Memory Single Program Multiple Data Bulk Synchronous Parallel Fusion Dataflow Agents Workflow
Geospatial Information System HPC Simulations
Internet of Things Metadata/Provenance
Shared / Dedicated / Transient / Permanent Archived/Batched/Streaming – S1, S2, S3, S4, S5 HDFS/Lustre/GPFS
Files/Objects
Enterprise Data Model SQL/NoSQL/NewSQL 1 M M ic ro -b e nc h m ar ks Execution View Processing View 1 2 3 4 6 7 8 9 10 11M 12 10D 9 8D 7D 6D 5D 4D 3D 2D 1D
Map Streaming 5
Convergence Diamonds Views and
Facets
Problem Architecture View
15 M C or e Li br ar ie s V is ua liz a ti on 14 M G ra p h A lg or it h m s 13 M Li ne ar A lg e br a Ke rn e ls /M a ny s u bc la ss e s 12 M G lo ba l ( A na ly ti cs /I nf or m ati cs /S im ul ati o ns ) 3 M R ec om m e nd e r E ng in e 5 M 4 M B as e D at a St a ti sti cs 10 M St re am in g D a ta A lg or it hm s O pti m iz a ti o n M e th od o lo gy 9 M Le ar n in g 8 M D a ta C la ss ifi ca ti o n 7 M D at a Se ar ch /Q ue ry /I nd ex 6 M 11 M D a ta A li gn m en t
Big Data Processing Diamonds M u lti sc al e M et ho d 17 M 16 M It er ati ve P D E S ol ve rs 22 M N at ur e of m es h if u se d Ev ol uti on o f D is cr e te S ys te m s 21 M Pa rti cl e s an d Fi el ds 20 M N -b od y M e th od s 19 M Sp e ct ra l M et ho d s 18 M Simulation (Exascale) Processing Diamonds D a ta A b st ra cti o n D 12 M o de l A bs tra cti on M 12 D at a M e tri c = M / N on -M e tri c = N D 13 D at a M et ric = M / N on -M et ric = N M 13 ܱ ܰ ଶ = N N / ܱ ሺ ܰ ሻ = N M 14 R eg u la r= R / Irr e gu la r = I M o de l M 10 Ve ra cit y 7 Ite ra ti ve / Sim p le M 11 C om m un ic ati o n St ru ct u re M 8 D yn am ic = D / St ati c = S D 9 D yn a m ic = D / St ati c = S M 9 R e gu la r = R / Irr eg ul ar = I D a ta D 10 M o de l V ar ie ty M 6 D at a Ve lo cit y D 5 Pe rfo rm an ce M et ric s 1 D a ta V ar ie ty D 6 Flo ps p er B yt e /M e m or y IO /F lo ps p er w att 2 Ex e cu ti on En vir on m en t; C or e lib ra rie s 3 D a ta Vo lu m e D 4 M od el Siz e M 4 Simulations Analytics
(Model for Data)
Both
(All Model)
(Nearly all Data+Model)
(Nearly all Data)
(Mix of Data and Model)
6 Forms of
MapReduce
Cover “all” circumstances
Describes
- Problem (Model reflecting data)
- Machine - Software
Architecture
5
6
5/17/2016 HPC-ABDS
Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies
Cross-Cutting Functions
1) Message and Data Protocols:
Avro, Thrift, Protobuf
2) Distributed Coordination : Google Chubby, Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, Eduroam OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML OAuth 4) Monitoring: Ambari, Ganglia, Nagios, Inca
17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML
16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA MAGMA, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK
15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere
15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook Puma/Ptail/Scribe/ODS, AzureStream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, H-Store
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame
Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs
8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS
Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds
Networking: Google Cloud DNS, Amazon Route 53
HPC-ABDS Mapping of Activities
• Level 17: Orchestration: Apache Beam (Google Cloud Dataflow)
integrated with Cloudmesh on HPC cluster
• Level 16: Applications: Datamining for molecular dynamics, Image
processing for remote sensing and pathology, graphs, streaming, bioinformatics, social media, financial informatics, text mining
• Level 16: Algorithms: Generic and custom for applications SPIDAL
• Level 14: Programming: Storm, Heron (Twitter replaces Storm), Hadoop,
Spark, Flink. Improve Inter- and Intra-node performance
• Level 13: Communication: Enhanced Storm and Hadoop using HPC
runtime technologies, Harp
• Level 11: Data management: Hbase and MongoDB integrated via use of
Beam and other Apache tools; enhance Hbase
• Level 9: Cluster Management: Integrate Pilot Jobs with Yarn, Mesos,
Spark, Hadoop; integrate Storm and Heron with Slurm
• Level 6: DevOps: Python Cloudmesh virtual Cluster Interoperability
7
Green is MIDAS
5/17/2016
MIDAS: Software Activities in DIBBS
• Developing HPC-ABDS concept to integrate HPC and Apache
Technologies
• Java: relook at Java Grande to make performance “best simply possible” • DevOps: Cloudmesh provides interoperability between HPC and Cloud
(OpenStack, AWS, Docker) platforms based on virtual clusters with software defined systems using Ansible (Chef )
• Scheduling: Integrate Slurm and Pilot jobs with Yarn & Mesos (ABDS
schedulers), Programming layer (Hadoop, Spark, Flink, Heron/Storm)
• Communication and scientific data abstractions: Harp plug-in to
Hadoop outperforms ABDS programming layers
• Data Management: use Hbase, MongoDB with customization
• Workflow: Use Apache Crunch and Beam (Google Cloud Data flow) as
they link to other ABDS technologies.
• Starting to integrate MIDAS components and move into algorithms of
SPIDAL Library
9
Java MPI performs better than Threads
128 24 core Haswell nodes on SPIDAL DA-MDS Code
10
5/17/2016
Best Threads intra node And MPI inter node
Best MPI; inter and intra node
SM = Optimized Shared memory for intra-node MPI
Cloudmesh Interoperability DevOps Tool
• Model: Define software configuration with tools like Ansible; instantiate on a virtual
cluster
• An easy-to-use command line program/shell and portal to interface with
heterogeneous infrastructures
– Supports OpenStack, AWS, Azure, SDSC Comet, virtualbox, libcloud supported
clouds as well as classic HPC and Docker infrastructures
– Has an abstraction layer that makes it possible to integrate other IaaS
frameworks
– Uses defaults that help interacting with various clouds – Managing VMs across different IaaS providers is easy – The client saves state between consecutive calls
• Demonstrated interaction with various cloud providers:
– FutureSystems, Chameleon Cloud, Jetstream, CloudLab, Cybera, AWS, Azure,
virtualbox
• Status: AWS, and Azure, VirtualBox, Docker need improvements; we focus currently
on Comet and NSF resources that use OpenStack
• Currently evaluating 40 team projects from “Big Data Open Source Software Projects
Class” which used this approach running on VirtualBox, Chameleon and FutureSystems
11
Cloudmesh Interoperability DevOps Tool
• Model: Define software configuration with tools like Ansible; instantiate on a virtual
cluster
• An easy-to-use command line program/shell and portal to interface with
heterogeneous infrastructures
– Supports OpenStack, AWS, Azure, SDSC Comet, virtualbox, libcloud supported
clouds as well as classic HPC and Docker infrastructures
– Has an abstraction layer that makes it possible to integrate other IaaS
frameworks
– Uses defaults that help interacting with various clouds – Managing VMs across different IaaS providers is easy – The client saves state between consecutive calls
• Demonstrated interaction with various cloud providers:
– FutureSystems, Chameleon Cloud, Jetstream, CloudLab, Cybera, AWS, Azure,
virtualbox
• Status: AWS, and Azure, VirtualBox, Docker need improvements; we focus currently
on Comet and NSF resources that use OpenStack
• Currently evaluating 40 team projects from “Big Data Open Source Software Projects
Class” which used this approach running on VirtualBox, Chameleon and FutureSystems
12
Cloudmesh Client - Architecture
Cloudmesh Client
Cloudmesh Data Access Data
Services Terminal
Compute Security
Choreography
Batch Queue
Abstraction AbstractionVM PaaS Launchers
Reservation Group Management
Workflow
Authentication
Virtual Cluster Abstraction Access
Access Abstraction REST Command
Shell & Line
Policy Management
Authorization
Container Abstraction create, read, update, delete
Cloudmesh Portal
Database Abstraction
SQLite MongoDB create, read, update delete (CRUD)
Inventory
C
ho
re
og
ra
ph
y
13
Component View
Layered View
Systems Configuration
Cloudmesh Client – OSG management
•
While using Comet it is
possible to use the same
images that are used on
the internal OSG cluster.
•
This reduces overall
management effort.
•
The client is used to
manage the VMs.
Comet
Virtual Cluster SupercomputerComet OSG Metascheduling
Service OSG
Access Point
Local Cluster
Other Resources
Goal to use the same images/setup
Different setup
14
OSG: Open Science Grid.
LIGO data analysis was conducted on Comet supported
by the Cloudmesh client.
Funded by NSF Comet.
Cloudmesh
Cloudmesh Client
Cloudmesh Client –
In support of Experiment Workflow
Create IaaS Create
IaaS
Deploy PaaS Deploy
PaaS
Deploy Data Deploy
Data
Execute Scripts Execute
Scripts Evaluate
Evaluate Repeat Repeat
15
Integrate with Ansible Integrate with Ansible
Buildin gcopy and rsync commands
Cloudmesh scripts Integrate with shell Integrate with Python Manage VMs and virtual
clusters
Scripts Variables Shared state
Integration into Ipython and Apache Beam
Choose general
Infrastructure HPC to Clouds
Pilot-Hadoop/Spark Architecture
16
http://arxiv.org/abs/1602.00345
HPC into Scheduling Layer
Pilot-Hadoop Example
17
Pilot-Data/Memory for Iterative
Processing
18
Provide common API for distributed
cluster memory
Scalable K-Means
Harp Implementations
• Basic Harp: Iterative communication and scientific data abstractions • Careful support of distributed data AND distributed model
• Avoids parameter server approach but distributes model over worker nodes
and supports collective communication to bring global model to each node
• Applied first to Latent Dirichlet Allocation LDA with large model and data
19
Clueweb
20
5/17/2016 enwiki
Bi-gram
Clueweb
21 Harp LDA on Big Red II
Supercomputer (Cray)
0 20 40 60 80 100 120 140
0 5 10 15 20 25 30 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1
Execution Time (hours)Nodes Parallel Efficiency
E xe cu ti o n T im e (h o u rs ) P a ra lle l E ff ic ie n c y
0 5 10 15 20 25 30 35
0 2 4 6 8 10 12 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1
Execution Time (hours) Parallel Efficiency
Nodes E xe cu ti o n T im e (h o u rs ) P a ra lle l E ff ic ie n c y
Harp LDA on Juliet (Intel Haswell)
• Big Red II: tested on 25, 50, 75, 100 and 125 nodes; each node uses 32 parallel threads; Gemini interconnect
• Juliet: tested on 10, 15, 20, 25, 30 nodes; each node uses 64 parallel threads on 36 core Intel Haswell nodes (each with 2 chips);
Infiniband interconnect
Harp LDA Scaling Tests
Corpus: 3,775,554 Wikipedia documents, Vocabulary: 1 million words;
Topics: 10k topics; alpha: 0.01; beta: 0.01; iteration: 200
SPIDAL Algorithms – Subgraph mining
• Finding patterns in graphs is very important
– Counting the number of embeddings of a given labeled/unlabeled
template subgraph
– Finding the most frequent subgraphs/motifs efficiently from a
given set of candidate templates
– Computing the graphlet frequency distribution.
• Reworking existing parallel VT algorithm Sahad with MIDAS
middleware giving HarpSahad which runs 5 (Google) to 9 (Miami) times faster than original Hadoop version
• Work in progress
22
5/17/2016
Network Nodes No. Of
(in million)
No. Of Edges (in million)
Size (MB)
Web-google 0.9 4.3 65
Miami 2.1 51.2 740
Template
s
SPIDAL Algorithms – Random Graph Generation
• Random graphs, important and needed with particular degree distribution
and clustering coefficients.
– Preferential attachment (PA) model, Chung-Lu (CL), stochastic Kronecker,
stochastic block model (SBM), and block two–level Erdos-Renyi (BTER)
– Generative algorithms for these models are mostly sequential and take a
prohibitively long time to generate large-scale graphs.
• SPIDAL working on developing efficient parallel algorithms for generating random
graphs using different models with new DG method with low memory and high performance, almost optimal load balancing and excellent scaling.
– Algorithms are about 3-4 times faster than the previous ones.
– Generate a network with 250 billion edges in 12 seconds using 1024 processors. • Needs to be packaged for SPIDAL using MIDAS (currently MPI)
23 5/17/2016 2 .4 5 5 .3
9 67
.3 4 1 8 0 .5 9 8 7 .7 3 2 3 7 .0 9 1 6 8 .9 5 3 5 9 .6 2 1 2 .1 5 7 2 .6 7 0 200 400 600
Miami Twitter Weblinks Friendster UK-Union
DataSet G e n e ra ti o n T im e ( s) MH DG 0 200 400 600 800
1 64 128 256 512 768 1,024
# of Processors
Sp e e d u p
SPIDAL Algorithms – Triangle Counting
• Triangle counting; important special case of subgraph mining and
specialized programs can outperform general program
• Previous work used Hadoop but MPI based PATRIC is much faster
• SPIDAL version uses much more efficient decomposition (non-overlapping
graph decomposition) – a factor of 25 lower memory than PATRIC
• Next graph problem – Community detection
24
0 50 100 150 200 250 300 350 400 450
Suri et al. Park et al. 2013 Park et al. 2014 PATRIC Our Algo.
R
un
tim
e
(m
in
ut
es
)
Algorithms
Runtime Performance on Twitter
SPIDAL
MPI version
complete. Need
to package for
SPIDAL and add
MIDAS -- Harp
SPIDAL Algorithms – Core I
• Several parallel core machine learning algorithms; need to add SPIDAL
Java optimizations to complete parallel codes except MPI MDS
– https://www.gitbook.com/book/esaliya/global-machine-learning-with-dsc-spidal/details
• O(N2) distance matrices calculation with Hadoop parallelism and various
options (storage MongoDB vs. distributed files), normalization, packing to save memory usage, exploiting symmetry
• WDA-SMACOF: Multidimensional scaling MDS is optimal nonlinear
dimension reduction enhanced by SMACOF, deterministic annealing and Conjugate gradient for non-uniform weights. Used in many applications
– MPI (shared memory) and MIDAS (Harp) versions
• MDS Alignment to optimally align related point sets, as in MDS time
series
• WebPlotViz data management (MongoDB) and browser visualization for
3D point sets including time series. Available as source or SaaS
• MDS as 2 using Manxcat. Alternative more general but less reliable
solution of MDS. Latest version of WDA-SMACOF usually preferable
• Other Dimension Reduction: SVD, PCA, GTM to do
25
SPIDAL Algorithms – Core II
• Latent Dirichlet Allocation LDA for topic finding in text collections; new algorithm
with MIDAS runtime outperforming current best practice
• DA-PWC Deterministic Annealing Pairwise Clustering for case where points
aren’t in a vector space; used extensively to cluster DNA and proteomic sequences; improved algorithm over other published. Parallelism good but needs SPIDAL Java
• DAVS Deterministic Annealing Clustering for vectors; includes specification of
errors and limit on cluster sizes. Gives very accurate answers for cases where
distinct clustering exists. Being upgraded for new LC-MS proteomics data with one million clusters in 27 million size data set
• K-means basic vector clustering: fast and adequate where clusters aren’t
needed accurately
• Elkan’s improved K-means vector clustering: for high dimensional spaces; uses
triangle inequality to avoid expensive distance calcs
• Future work – Classification: logistic regression, Random Forest, SVM, (deep
learning); Collaborative Filtering, TF-IDF search and Spark MLlib algorithms
• Harp-DaaL extends Intel DAAL’s local batch mode to multi-node distributed modes
– Leveraging Harp’s benefits of communication for iterative compute models
26
SPIDAL Algorithms – Optimization I
•
Manxcat: Levenberg Marquardt Algorithm for non-linear
2optimization with sophisticated version of Newton’s method
calculating value and derivatives of objective function. Parallelism in
calculation of objective function and in parameters to be determined.
Complete – needs SPIDAL Java optimization
•
Viterbi
algorithm, for finding the maximum a posteriori (MAP) solution
for a Hidden Markov Model (HMM). The running time is O(n*s^2)
where n is the number of variables and s is the number of possible
states each variable can take. We will provide an "embarrassingly
parallel" version that processes multiple problems (e.g. many images)
independently; parallelizing within the same problem not needed in
our application space.
Needs Packaging in SPIDAL
•
Forward-backward algorithm
, for computing marginal distributions
over HMM variables. Similar characteristics as Viterbi above.
Needs
Packaging in SPIDAL
27
SPIDAL Algorithms – Optimization II
• Loopy belief propagation (LBP) for approximately finding the maximum a
posteriori (MAP) solution for a Markov Random Field (MRF). Here the running time is O(n^2*s^2*i) in the worst case where n is number of
variables, s is number of states per variable, and i is number of iterations required (which is usually a function of n, e.g. log(n) or sqrt(n)). Here there are various parallelization strategies depending on values of s and n for any given problem.
– We will provide two parallel versions: embarrassingly parallel version for
when s and n are relatively modest, and parallelizing each iteration of the same problem for common situation when s and n are quite large so that each iteration takes a long time relative to number of iterations
required.
– Needs Packaging in SPIDAL
• Markov Chain Monte Carlo (MCMC) for approximately computing marking
distributions and sampling over MRF variables. Similar to LBP with the same two parallelization strategies. Needs Packaging in SPIDAL
28
Imaging Applications: Remote Sensing,
Pathology, Spatial Systems
• Both Pathology/Remote sensing working on 3D images
• Each pathology image could have 10 billion pixels, and we may extract a million
spatial objects per image and 100 million features (dozens to 100 features per object) per image. We often tile the image into 4K x 4K tiles for processing. We develop buffering-based tiling to handle boundary-crossing objects. For each typical research study, we may have hundreds to thousands of pathology images
• Remote sensing aimed at radar images of ice and snow sheets
• 2D problems need modest parallelism “intra-image” but often need parallelism
over images
• 3D problems need parallelism for an individual image
• Use Optimization algorithms to support applications (e.g. Markov Chain, Integer
Programming, Bayesian Maximum a posteriori, variational level set, Euler-Lagrange Equation)
• Classification (deep learning convolution neural network, SVM, random forest,
etc.) will be important
29
2D Radar Polar Remote Sensing
• Need to estimate structure of earth (ice, snow, rock) from radar signals from
plane in 2 or 3 dimensions.
• Original 2D analysis ([11])used Hidden Markov Methods; better results using
MCMC (our solution)
30
5/17/2016
3D Radar Polar Remote Sensing
• Uses LBP to analyze 3D radar images
31
5/17/2016
Reconstructing bedrock in 3D, for (left) ground truth, (center) existing
algorithm based on maximum likelihood estimators, and (right) our technique based on a Markov Random Field formulation.
Radar gives a cross-section view, parameterized by angle and
range, of the ice structure, which yields a set of 2-d tomographic slices (right) along the flight path.
Each image represents a 3d depth map, with
Algorithms – Nuclei Segmentation for
Pathology Images
• Segment boundaries of nuclei from pathology
images and extract features for each nucleus
• Consist of tiling, segmentation,
vectorization, boundary object aggregation
• Could be executed on MapReduce
(MIDAS Harp) 32 Step 1: Preliminary Region Partition Step 2: Background Normalization Step 3: Nuclear Boundary Refinement Step 4: Nuclei Separation D is tr ib u te d F ile S ys te m M ap R ed u ce C o m p u ti n g Fr am ew o rk Segmentation Mask Images Boundary Vectorization Raw Polygons Boundary Normalization Normalized Polygons
Isolated Buffer Polygon Removal
Non-boundary Polygons
Whole Slide Polygon Aggregation
Ti lin g M ap R ed u ce
Cleaned Polygons
Boundary Polygons Final Segmented Polygons Tiled Images F1 F2 F3 F5 F4 F6 F7 F8 F9 F10 Tiling
Whole Slide Images
F11
Nuclear segmentation algorithm
Execution pipeline on MapReduce (MIDAS Harp)
Algorithms – Spatial Querying Methods
33
Spatial Queries Architecture of Spatial Query Engine
• Hadoop-GIS is a general framework to support high performance spatial
queries and analytics for spatial big data on MapReduce.
• It supports multiple types of spatial queries on MapReduce through spatial
partitioning, customizable spatial query engine and on-demand indexing.
• SparkGIS is a variation of Hadoop-GIS which runs on Spark to take
advantage of in-memory processing.
• Will extend Hadoop/Spark to Harp MIDAS runtime. • 2D complete; 3D in progress
Some Applications Enabled
• KU/IU: Remote Sensing in Polar Regions • SB: Digital Pathology Imaging
• SB: Large scale GIS applications, including public health • VT: Graph Analysis in studies of networks in many areas • UT, ASU, Rutgers: Analysis of Biomolecular simulations
Applications not part of Dibbs project but algorithms/software used
• IU: Bioinformatics and Financial Modeling with MDS • Integration with Network Science Infrastructure
– VT/IU: CINET: SPIDAL algorithms will be made available – IU: Osome Observatory on Social Media, currently Twitter
https://peerj.com/preprints/2008/ using enhanced HBase
– IU: Topic Analysis of text data
• IU/Rutgers: Streaming with HPC enhanced Storm/Heron
34
Enabled Applications – Digital Pathology
• Digital pathology images scanned from human tissue specimens provide
rich information about morphological and functional characteristics of biological systems.
• Pathology image analysis has high potential to provide diagnostic
assistance, identify therapeutic targets, and predict patient outcomes and therapeutic responses.
• It relies on both pathology image analysis algorithms and spatial querying
methods.
• Extremely large image scale.
35
Glass Slides Scanning Whole Slide Images Image Analysis
Applications – Public Health
• GIS-oriented public health research has a strong focus on the locations of
patients and the agents of disease, and studies the spatial patterns and variations.
• Integrating multiple spatial big data sources at fine spatial resolutions allow
public health researchers and health officials to adequately identify, analyze, and monitor health problems at the community level.
• This will rely on high performance spatial querying methods on data
integration.
• Note synergy between GIS and Large image processing as in pathology.
36
Biomolecular Simulation Data Analysis
37
5/17/2016
Path Similarity Analysis (PSA) with Hausdorff distance
•
Utah (CPPTraj), Arizona State (MDAnalysis), Rutgers
• Parallelize key algorithms including O(N
2) distance
computations between trajectories
Clustered distances for two methods for sampling macromolecular
transitions (200 trajectories each) showing that both methods produce distinctly different pathways.
38
RADICAL Pilot benchmark run for three different test sets of trajectories, using 12x12
“blocks” per task.
RADICAL-Pilot Hausdorff distance:
all-pairs
problem
Classification of lipids in membranes
Biological membranes are lipid bilayers with distinct inner and outer surfaces that are formed by lipid mono layers (leaflets). Movement of lipids between leaflets or change of topology (merging of leaflets during fusion events) is difficult to detect in simulations.
39
Lipids colored by leaflet
Same color: continuous leaflet.
LeafletFinder
LeafletFinder is a graph-based algorithm to detect continuous
lipid membrane leaflets in a MD simulation*. The current
implementation is slow and does not work well for large systems
(>100,000 lipids).
40
Phosphate
atom
coordinates
Build
nearest-neighbors
adjacency matrix
Find largest
connected
subgraphs
* N. Michaud-Agrawal, E. J. Denning, T. B.Woolf, and O. Beckstein. MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. J Comp Chem, 32:2319–2327, 2011.
41
Apple
Mid Cap
Energy
S&P
Dow Jones
Finance
Origin 0% change
+10%
+20%
Time series of Stock Values projected to 3D
Using one day stock values measured from January 2004 and starting after one year January 2005 (Filled circles are final values)