CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science

(1) NSF14-43054, start October 1, 2014.
Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science.
• Indiana University (Fox, Qiu, Crandall, von Laszewski)
• Rutgers (Jha)
• Virginia Tech (Marathe)
• Kansas (Paden)
• Stony Brook (Wang)
• Arizona State (Beckstein)
• Utah (Cheatham)
Overview by Geoffrey Fox (PI), June 24, 2015.
http://news.indiana.edu/releases/iu/2014/10/big-data-dibbs-grant.shtml
http://www.nsf.gov/awardsearch/showAward?AWD_ID=1443054
02/14/2020

(2) Important Components.
• NIST Big Data Application Analysis – mainly from project
• HPC-ABDS: Cloud-HPC interoperable software with the performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack
 – This is a reservoir of software subsystems – nearly all from outside the project and a mix of HPC and Big Data communities
• MIDAS: Integrating Middleware – from project
• SPIDAL (Scalable Parallel Interoperable Data Analytics Library): Scalable Analytics for Biomolecular Simulations, Network and Computational Social Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science and Pathology Informatics
 – Domain specific data analytics libraries – mainly from project
 – Add Core Machine learning Libraries – mainly from community
 – Benchmarks – project adds to community

(3) Application Analysis.

(4) Use Case Template.
• 26 fields completed for 51 areas
• Government Operation: 4
• Commercial: 8
• Defense: 3
• Healthcare and Life Sciences: 10
• Deep Learning and Social Media: 6
• The Ecosystem for Research: 4
• Astronomy and Physics: 5
• Earth, Environmental and Polar Science: 10
• Energy: 1

(5) 51 Detailed Use Cases: Contributed July-September 2013. Covers goals, data features such as 3 V’s, software, hardware. 26 Features for each use case. Biased to science.
• http://bigdatawg.nist.gov/usecases.php
• https://bigdatacoursespring2014.appspot.com/course (Section 5)
• Government Operation (4): National Archives and Records Administration, Census Bureau
• Commercial (8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo shipping (as in UPS)
• Defense (3): Sensors, Image surveillance, Situation Assessment
• Healthcare and Life Sciences (10): Medical records, Graph and Probabilistic analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity
• Deep Learning and Social Media (6): Driving Car, Geolocate images/cameras, Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets
• The Ecosystem for Research (4): Metadata, Collaboration, Language Translation, Light source experiments
• Astronomy and Physics (5): Sky Surveys including comparison to simulation, Large Hadron Collider at CERN, Belle Accelerator II in Japan
• Earth, Environmental and Polar Science (10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors
• Energy (1): Smart grid

(6) 51 Use Cases: What is Parallelism Over?
• People: either the users (but see below) or subjects of application and often both
• Decision makers like researchers or doctors (users of application)
• Items such as Images, EMR, Sequences below; observations or contents of online store
 – Images or “Electronic Information nuggets”
 – EMR: Electronic Medical Records (often similar to people parallelism)
 – Protein or Gene Sequences
 – Material properties, Manufactured Object specifications, etc., in custom dataset
 – Modelled entities like vehicles and people
• Sensors – Internet of Things
• Events such as detected anomalies in telescope or credit card data or atmosphere
• (Complex) Nodes in RDF Graph
• Simple nodes as in a learning network
• Tweets, Blogs, Documents, Web Pages, etc.
 – And characters/words in them
• Files or data to be backed up, moved or assigned metadata
• Particles/cells/mesh points as in parallel simulations

(7) Features of 51 Use Cases I.
• PP (26) “All” Pleasingly Parallel or Map Only
• MR (18) Classic MapReduce MR (add MRStat below for full count)
• MRStat (7) Simple version of MR where the key computations are simple reductions, as found in statistical summaries such as histograms and averages
• MRIter (23) Iterative MapReduce or MPI (Spark, Twister)
• Graph (9) Complex graph data structure needed in analysis
• Fusion (11) Integrate diverse data to aid discovery/decision making; could involve sophisticated algorithms or could just be a portal
• Streaming (41) Some data comes in incrementally and is processed this way
• Classify (30) Classification: divide data into categories
• S/Q (12) Index, Search and Query
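The MRStat feature above (a map step followed by a simple additive reduction) can be illustrated with a minimal Python sketch. This is not code from the project: the function names `map_partition`, `combine` and `mrstat`, the bin width and the sample data are all hypothetical, and the partitions are plain in-memory lists standing in for distributed data.

```python
from collections import Counter
from functools import reduce

def map_partition(values, bin_width):
    """Map step: summarize one partition as (count, sum, histogram)."""
    hist = Counter(int(v // bin_width) for v in values)
    return len(values), sum(values), hist

def combine(a, b):
    """Reduce step: partial summaries merge by simple addition."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

def mrstat(partitions, bin_width=10):
    """Run the map step per partition, then reduce to global statistics."""
    partials = (map_partition(p, bin_width) for p in partitions)
    count, total, hist = reduce(combine, partials)
    return {"count": count, "mean": total / count, "histogram": dict(hist)}

# Hypothetical data: three partitions of non-negative measurements.
partitions = [[1, 12, 25], [3, 14], [27, 29, 8]]
result = mrstat(partitions)
```

Because the reduction is associative and commutative, the partial summaries can be merged in any order, which is what makes this pattern simpler than general MapReduce.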

(8) Features of 51 Use Cases II.
• CF (4) Collaborative Filtering for recommender engines
• LML (36) Local Machine Learning (Independent for each parallel entity) – application could have GML as well
• GML (23) Global Machine Learning: Deep Learning, Clustering, LDA, PLSI, MDS
 – Large Scale Optimizations as in Variational Bayes, MCMC, Lifted Belief Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg-Marquardt. Can call EGO or Exascale Global Optimization with scalable parallel algorithm
• Workflow (51) Universal
• GIS (16) Geotagged data and often displayed in ESRI, Microsoft Virtual Earth, Google Earth, GeoServer etc.
• HPC (5) Classic large-scale simulation of cosmos, materials, etc. generating (visualization) data
• Agent (2) Simulations of models of data-defined macroscopic entities represented as agents
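Clustering is listed above as a Global Machine Learning (GML) example, and it is also a typical iterative map-plus-reduction computation. The following is a minimal Python sketch of K-means in that style, assuming the data is already split into partitions; `assign_and_sum` and `kmeans` are hypothetical names, and in-memory lists stand in for distributed workers and a collective operation.

```python
def assign_and_sum(points, centers):
    """Map step on one partition: per-center partial sums and counts."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        # Nearest center by squared Euclidean distance.
        j = min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts

def kmeans(partitions, centers, iterations=10):
    """Iterate: map over partitions, then a global reduction updates centers."""
    for _ in range(iterations):
        partials = [assign_and_sum(part, centers) for part in partitions]
        new_centers = []
        for j in range(len(centers)):
            n = sum(counts[j] for _, counts in partials)
            if n == 0:
                new_centers.append(centers[j])  # keep the center of an empty cluster
                continue
            dim = len(centers[j])
            new_centers.append(
                [sum(sums[j][d] for sums, _ in partials) / n for d in range(dim)])
        centers = new_centers
    return centers

# Hypothetical data: two partitions, two well-separated clusters.
partitions = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
centers = kmeans(partitions, [[0.0, 0.0], [10.0, 10.0]])
```

The model (the centers) is global and must be combined across all partitions on every iteration, which is what distinguishes GML from LML, where each parallel entity is analyzed independently.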

(9) 4 Ogre Views and 50 Facets. [Figure: facets grouped by view]
• Data Source and Style View (10 facets): Geospatial Information System; HPC Simulations; Internet of Things; Metadata/Provenance; Shared / Dedicated / Transient / Permanent; Archived/Batched/Streaming; HDFS/Lustre/GPFS; Files/Objects; Enterprise Data Model; SQL/NoSQL/NewSQL
• Problem Architecture View (12 facets): Pleasingly Parallel; Classic MapReduce; Map-Collective; Map Point-to-Point; Map Streaming; Shared Memory; Single Program Multiple Data; Bulk Synchronous Parallel; Fusion; Dataflow; Agents; Workflow
• Execution View (14 facets): O(N^2) = NN / O(N) = N; Metric = M / Non-Metric = N; Data Abstraction; Iterative / Simple; Regular = R / Irregular = I; Dynamic = D / Static = S; Communication Structure; Veracity; Variety; Velocity; Volume; Execution Environment and Core libraries; Flops per Byte and Memory I/O; Performance Metrics
• Processing View (14 facets): Micro-benchmarks; Local Analytics; Global Analytics; Base Statistics; Recommendations; Search / Query / Index; Classification; Learning; Optimization Methodology; Streaming; Alignment; Linear Algebra Kernels; Graph Algorithms; Visualization

(10) 6 Forms of MapReduce cover “all” circumstances.

(11) Benchmarks/Mini-apps spanning Facets.
• Look at NSF SPIDAL Project, NIST 51 use cases, Baru-Rabl review
• Catalog facets of benchmarks and choose entries to cover “all facets”
• Micro Benchmarks: SPEC, EnhancedDFSIO (HDFS), Terasort, Wordcount, Grep, MPI, Basic Pub-Sub …
• SQL and NoSQL Data systems, Search, Recommenders: TPC (-C to x–HS for Hadoop), BigBench, Yahoo Cloud Serving, Berkeley Big Data, HiBench, BigDataBench, Cloudsuite, Linkbench
 – includes MapReduce cases Search, Bayes, Random Forests, Collaborative Filtering
• Spatial Query: select from image or earth data
• Alignment: Biology as in BLAST
• Streaming: Online classifiers, Cluster tweets, Robotics, Industrial Internet of Things, Astronomy; BGBenchmark; choose to cover all 5 subclasses
• Pleasingly parallel (Local Analytics): as in initial steps of LHC, Pathology, Bioimaging (differ in type of data analysis)
• Global Analytics: Outlier, Clustering, LDA, SVM, Deep Learning, MDS, PageRank, Levenberg-Marquardt, Graph 500 entries
• Workflow and Composite (analytics on xSQL) linking above

(12) HPC-ABDS 21 layer target software stack.

(13)

(14) http://hpc-abds.org/kaleidoscope/

(15) HPC-ABDS Stack Summarized.
• The HPC-ABDS software is broken up into 21 layers so that one can discuss software systems in reasonably sized groups.
 – The layers where there is particular opportunity to integrate HPC are colored green in the figure.
• We note that data systems that we construct from this software can run interoperably on virtualized or non-virtualized environments aimed at key scientific data analysis problems.
• Most of ABDS emphasizes scalability but not performance, and one of our goals is to produce high performance environments. Here there is a clear need for better node performance and support of accelerators like Xeon-Phi and GPUs.
• Figure “ABDS v. HPC Architecture” contrasts modern ABDS and HPC stacks, illustrating most of the 21 layers and labelling on the left with the layer number used in the HPC-ABDS Figure.
• The layers omitted in the architecture figure are Interoperability, DevOps, Monitoring and Security (layers 7, 6, 4, 3), which are all important and clearly applicable to both HPC and ABDS.
• We also add an extra layer “language” not discussed in the HPC-ABDS Figure.

(16) MIDAS and HPC-ABDS Integration.

(17) HPC ABDS Hourglass. [Figure]
• High Performance Applications
• Scalable Parallel Interoperable Data Analytics Library (SPIDAL): High performance Mahout, R, Matlab …
• Application Abstractions/Standards: Graphs, Networks, Images, Geospatial ...
• System Abstraction/Standards: Data Format and Storage; HPC Yarn for Resource management; Horizontally scalable parallel programming model; Collective and Point to Point Communication; Support for iteration (in memory processing)
• HPC ABDS System (Middleware): >~ 300 Software Subsystems

(18) Applications, SPIDAL, MIDAS, ABDS. [Figure: layered architecture, top to bottom]
• Community and Examples: Govt. Operations; Commercial; Defense; Healthcare, Life Science; Deep Learning, Social Media; Research Ecosystems; Astronomy, Physics; Earth, Env., Polar Science; Energy
• (Inter)disciplinary Workflow
• Analytics Libraries (SPIDAL): Native ABDS (SQL-engines, Storm, Impala, Hive, Shark); Native HPC (MPI); HPC-ABDS MapReduce
• Programming & Runtime Models: Map Only, PP, Many Task; Classic MapReduce; Map Collective; Map – Point to Point, Graph
• MIddleware for Data-Intensive Analytics and Science (MIDAS) API
• Communication (MPI, RDMA, Hadoop Shuffle/Reduce, HARP Collectives, Giraph point-to-point)
• Data Systems and Abstractions (In-Memory; HBase, Object Stores, other NoSQL stores, Spatial, SQL, Files)
• Higher-Level Workload Management (Tez, Llama)
• Workload Management (Pilots, Condor)
• External Data Access (Virtual Filesystem, GridFTP, SRM, SSH)
• Framework specific Scheduling (e.g. YARN)
• Cluster Resource Manager (YARN, Mesos, SLURM, Torque, SGE)
• Compute, Storage and Data Resources (Nodes, Cores, Lustre, HDFS)
• Resource Fabric

(20) Data Analytics identified in proposal.

(21) Machine Learning in Network Science, Imaging in Computer Vision, Pathology, Polar Science, Biomolecular Simulations.
Table columns: Algorithm, Applications, Features, Status, Parallelism.
• Graph Analytics (features: graph, non-metric, static)
 – Community detection: social networks, webgraph; P-DM; GML-GrC
 – Finding diameter: social networks, webgraph; P-DM; GML-GrB
 – Subgraph/motif finding: webgraph, biological/social networks
 – Clustering coefficient, Page rank, Maximal cliques, Shortest path, Betweenness centrality, Connected component: social networks and/or webgraph
 – Status across the nine graph algorithms: P-DM for seven, P-Shm for two; parallelism is GML with GrA, GrB or GrC
• Spatial Queries and Analytics (applications: GIS/social networks/pathology informatics; features: geometric)
 – Spatial relationship based queries
 – Distance based queries
 – Spatial clustering
 – Spatial modeling
 – Status and parallelism across the four: P-DM with PP (two), Seq with GML (one), Seq with PP (one)
• Legend: GML = Global (parallel) ML; GrA = Static; GrB = Runtime partitioning

(22) Some specialized data analytics in SPIDAL.
Table columns: Algorithm, Applications, Features, Status, Parallelism.
• Core Image Processing (applications: computer vision/pathology informatics; features: metric space point sets, neighborhood sets & image features; geometric)
 – Image preprocessing: P-DM; PP
 – Object detection & segmentation: P-DM; PP
 – Image/object feature computation: P-DM; PP
 – 3D image registration: Seq; PP
 – Object matching: Todo; PP
 – 3D feature extraction: Todo; PP
• Deep Learning: Learning Network, Stochastic Gradient Descent (applications: Image Understanding, Language Translation, Voice Recognition, Car driving; features: connections in artificial neural net): P-DM; GML
• Legend
 – PP: Pleasingly Parallel (Local ML)
 – Seq: Sequential Available
 – GRA: Good distributed algorithm needed
 – Todo: No prototype Available
 – P-DM: Distributed memory Available
 – P-Shm: Shared memory Available

(23) Some Core Machine Learning Building Blocks.
Table columns: Algorithm, Applications, Features, Status, //ism (parallelism).
• DA Vector Clustering: Accurate Clusters; Vectors
• DA Non metric Clustering: Accurate Clusters, Biology, Web; Non metric, O(N^2)
• Kmeans (Basic, Fuzzy and Elkan): Fast Clustering; Vectors
• Levenberg-Marquardt Optimization: Non-linear Gauss-Newton, use in MDS; Least Squares
• SMACOF Dimension Reduction: DA-MDS with general weights; Least Squares, O(N^2)
• Vector Dimension Reduction: DA-GTM and Others; Vectors
• TFIDF Search: Find nearest neighbors in document corpus; Bag of “words” (image features)
• All-pairs similarity search: Find pairs of documents with TFIDF distance below a threshold
• Support Vector Machine SVM: Learn and Classify; Vectors
• Random Forest: Learn and Classify; Vectors
• Gibbs sampling (MCMC): Solve global inference problems; Graph
• Latent Dirichlet Allocation LDA with Gibbs sampling or Var. Bayes: Topic models (Latent factors); Bag of “words”
• Singular Value Decomposition SVD: Dimension Reduction and PCA; Vectors
• Hidden Markov Models (HMM): Global inference on sequence; Vectors
Status across the 14 rows: P-DM (nine), Seq (three), Todo (two). Parallelism: GML (eleven), PP (three).
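The all-pairs similarity search entry above (find pairs of documents with TFIDF distance below a threshold) can be sketched in a few lines of Python. This is a sequential illustration only, not the SPIDAL implementation: the weighting `tf * log(N / df)`, the use of cosine distance, the function names and the sample documents are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine_distance(u, v):
    """1 - cosine similarity; vectors with zero norm are treated as maximally distant."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - dot / (nu * nv)

def similar_pairs(docs, threshold):
    """Return index pairs (i, j), i < j, whose distance is below the threshold."""
    vecs = tfidf_vectors(docs)
    return [(i, j) for i, j in combinations(range(len(docs)), 2)
            if cosine_distance(vecs[i], vecs[j]) < threshold]

# Hypothetical three-document corpus.
docs = ["big data analytics", "big data analytics library", "polar ice radar"]
pairs = similar_pairs(docs, threshold=0.6)
```

The pair loop is O(N^2) in the number of documents, which is one reason a scalable parallel version is of interest.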

(24) Timeline.

(25) Milestones by track for Year 1, Year 2 and Years 3-5 (entries in order of appearance).
• SPIDAL: Community requirement and technology evaluation; SPIDAL-MIDAS Interface and SPIDAL V1.0; Integrated testing with Algorithms & MIDAS, extend to V2.0
• MIDAS: (i) Arch and design spec (ii) In-memory pilot abstract., integrate with XSEDE; SPIDAL scheduling components and execution processing, MIDAS on Blue Waters, V1.0 release; Scalability testing, adaptors for new platforms, Support for tools and developers, Optimization, Phase II of execution-processing models, V2.0
• Community: HPC Biomolecular Simulations: Community requirements gathering, CPPTRAJ to integrate with MIDAS for ensemble analysis on Blue Waters; (i) Parallel Trajectory and MDAnalysis with MR (ii) iBIOMES data mgmt. in MIDAS (iii) End-to-end Integration of CPPTraj-MIDAS with SPIDAL (iv) Use SPIDAL Kmeans (v) Tutorials and outreach
• Community: Network Science and Comp. Social Science: (i) Gather community requirement (ii) study existing network analytic algorithms; (i) Giraph-based clustering and community detection problems (ii) Integ of CINET in SPIDAL; (i) Algorithm implementation for subgraph problems (ii) Develop new algorithms as necessary
• Community: Computational Epidemiology: Community requirement gathering; Design (i) Wrapper for EpiSimdemics and EpiFast (ii) Giraph simulation tool; (i) Implement the wrappers (ii) Start implementing Giraph-based tool (iii) Integrate EpiSimdemics and Epifast with SPIDAL
• Community: Spatial: Community reqs; spatial queries library and 2D spatial clustering (parallel), geospatial & pathology apps; (i) Implementation of 3D spatial queries (ii) Application to 3D pathology
• Community: Pathology: (i) Implementation of 2D image preproc., segment and feature extraction and tumor research; (i) Image registration, object matching & feature extraction (3D) (ii) Integrate MIDAS; Continued implementation of 3D image processing library, Application to liver and neuroblastoma
• Community: Computer vision: Port image processing, feature extraction, image matching, pleasingly parallel ML algos.; Implement ML and optimization algorithms, large-scale image recognition; Continue implementing ML and global optimization, large-scale 3D recognition in social images
• Community: Radar informatics: single-echogram layer finding, tile matching; Develop and implement continent-scale layer finding; Develop and implement (i) change detection and (ii) flow field estimation in satellite.

(26) Compute Systems.

(27) Relevant DSC and XSEDE Computing Systems.
• DSC adding 128 node Haswell based (2 chips, 24 or 36 cores per node) system (Juliet) (arrived June 19)
 – 128 GB memory per node
 – Substantial conventional disk per node (8 TB) plus PCI based 400 GB SSD
 – Infiniband with SR-IOV
 – Back end Lustre or equivalent hosted on Echo
• DSC Older or Very Old (tired) machines
 – India (128 nodes, 1024 cores), Bravo (16 nodes, 128 cores), Delta (16 nodes, 192 cores), Echo (16 nodes, 192 cores), Tempest (32 nodes, 768 cores); some with large memory, large disk and GPU
 – Optimized for Cloud research and Large scale Data analytics exploring storage models, algorithms
 – Bare-metal v. Openstack virtual clusters
 – Extensively used in Education
 – Bravo set up as a Hadoop Cluster
• XSEDE – Wrangler, Blue Waters and Comet likely to be especially useful
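As a quick consistency check on the machine sizes listed on this slide, the following Python sketch derives cores per node for the older DSC machines and the possible core-count range for Juliet. The node and core figures are copied from the slide; the variable names are hypothetical, and the Juliet range simply assumes every node has either 24 or 36 cores.

```python
# (nodes, cores) as listed on the slide for the older DSC machines.
machines = {
    "India": (128, 1024),
    "Bravo": (16, 128),
    "Delta": (16, 192),
    "Echo": (16, 192),
    "Tempest": (32, 768),
}

# Cores per node for each machine.
cores_per_node = {name: cores // nodes for name, (nodes, cores) in machines.items()}

# Total cores across the older machines.
total_old_cores = sum(cores for _, cores in machines.values())

# Juliet: 128 nodes with 24 or 36 cores each, so the total lies in this range.
juliet_nodes = 128
juliet_core_range = (juliet_nodes * 24, juliet_nodes * 36)
```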
