Big Data Use Cases: February 2017

0
0
63
1 month ago
PDF Preview
Full text
(1)NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science. Big Data Use Cases February 2017. Software: MIDAS HPC-ABDS. 1. 1 Spidal.org.

(2) Application Nexus of HPC, Big Data, Simulation Convergence Use-case Data and Model NIST Collection Big Data Ogres Convergence Diamonds 2. Spidal.org. 2.

(3) Data and Model in Big Data and Simulations I. • Need to discuss Data and Model as problems have both intermingled, but we can get insight by separating which allows better understanding of Big Data - Big Simulation “convergence” (or differences!) • The Model is a user construction and it has a “concept”, parameters and gives results determined by the computation. We use term “model” in a general fashion to cover all of these. • Big Data problems can be broken up into Data and Model. – For clustering, the model parameters are cluster centers while the data is set of points to be clustered – For queries, the model is structure of database and results of this query while the data is whole database queried and SQL query – For deep learning with ImageNet, the model is chosen network with model parameters as the network link weights. The data is set of images used for training or classification 3 3 Spidal.org.

(4) Data and Model in Big Data and Simulations II. • Simulations can also be considered as Data plus Model – Model can be formulation with particle dynamics or partial differential equations defined by parameters such as particle positions and discretized velocity, pressure, density values – Data could be small when just boundary conditions – Data large with data assimilation (weather forecasting) or when data visualizations are produced by simulation • Big Data implies Data is large but Model varies in size – e.g. LDA with many topics or deep learning has a large model – Clustering or Dimension reduction can be quite small in model size • Data often static between iterations (unless streaming); Model parameters vary between iterations • Data and Model Parameters are often confused in papers as term data used to describe the parameters of models. 4. Spidal.org. 4.

(5) • • • • • • • • • •. Use Case Template. 26 fields completed for 51 areas Government Operation: 4 Commercial: 8 Defense: 3 Healthcare and Life Sciences: 10 Deep Learning and Social Media: 6 The Ecosystem for Research: 4 Astronomy and Physics: 5 Earth, Environmental and Polar Science: 10 Energy: 1 5. 5 Spidal.org.

(6) http://hpc-abds.org/kaleidoscope/survey/. Online Use Case Form 6. 6. Spidal.org.

(7) 51 Detailed Use Cases: Contributed July-September 2013 • • • • • • • •. • •. Covers goals, data features such as 3 V’s, software, hardware. Government Operation(4): National Archives and Records Administration, Census Bureau Commercial(8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo shipping (as in UPS) Defense(3): Sensors, Image surveillance, Situation Assessment Healthcare and Life Sciences(10): Medical records, Graph and Probabilistic analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity Deep Learning and Social Media(6): Driving Car, Geolocate images/cameras, Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets The Ecosystem for Research(4): Metadata, Collaboration, Language Translation, Light source experiments Astronomy and Physics(5): Sky Surveys including comparison to simulation, Large Hadron Collider at CERN, Belle Accelerator II in Japan Earth, Environmental and Polar Science(10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors Energy(1): Smart grid Published by NIST as http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-3.pdf with common set of 26 features recorded for each use-case; “Version 2” being prepared. 7 Biased to science 26 Features for each use case 7 Spidal.org.

(8) Part of Property Summary Table. 8. Spidal.org.

(9) 51 Use Cases: What is Parallelism Over? • • •. • • • • • • •. People: either the users (but see below) or subjects of application and often both Decision makers like researchers or doctors (users of application) Items such as Images, EMR, Sequences below; observations or contents of online store – Images or “Electronic Information nuggets” – EMR: Electronic Medical Records (often similar to people parallelism) – Protein or Gene Sequences; – Material properties, Manufactured Object specifications, etc., in custom dataset – Modelled entities like vehicles and people Sensors – Internet of Things Events such as detected anomalies in telescope or credit card data or atmosphere (Complex) Nodes in RDF Graph Simple nodes as in a learning network Tweets, Blogs, Documents, Web Pages, etc. – And characters/words in them Files or data to be backed up, moved or assigned metadata Particles/cells/mesh points as in parallel simulations. 9. Spidal.org.

(10) Sample Features of 51 Use Cases I. • PP (26) “All” Pleasingly Parallel or Map Only • MR (18) Classic MapReduce MR (add MRStat below for full count) • MRStat (7) Simple version of MR where key computations are simple reduction as found in statistical averages such as histograms and averages • MRIter (23) Iterative MapReduce or MPI (Flink, Spark, Twister) • Graph (9) Complex graph data structure needed in analysis • Fusion (11) Integrate diverse data to aid discovery/decision making; could involve sophisticated algorithms or could just be a portal • Streaming (41) Some data comes in incrementally and is processed this way • Classify (30) Classification: divide data into categories • S/Q (12) Index, Search and Query. 10. Spidal.org.

(11) • • •. • • • •. Sample Features of 51 Use Cases II. CF (4) Collaborative Filtering for recommender engines LML (36) Local Machine Learning (Independent for each parallel entity) – application could have GML as well GML (23) Global Machine Learning: Deep Learning, Clustering, LDA, PLSI, MDS, – Large Scale Optimizations as in Variational Bayes, MCMC, Lifted Belief Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg-Marquardt . Can call EGO or Exascale Global Optimization with scalable parallel algorithm. Workflow (51) Universal GIS (16) Geotagged data and often displayed in ESRI, Microsoft Virtual Earth, Google Earth, GeoServer etc. HPC (5) Classic large-scale simulation of cosmos, materials, etc. generating (visualization) data Agent (2) Simulations of models of data-defined macroscopic entities represented as agents. 11. Spidal.org.

(12) Other Benchmark Sets. 12. Spidal.org.

(13) 7 Computational Giants of NRC Massive Data Analysis Report http://www.nap.edu/catalog.php?record_id=18374 Big Data Models? 1) 2) 3) 4) 5) 6) 7). G1: G2: G3: G4: G5: G6: G7:. Basic Statistics e.g. MRStat Generalized N-Body Problems Graph-Theoretic Computations Linear Algebraic Computations Optimizations e.g. Linear Programming Integration e.g. LDA and other GML Alignment Problems e.g. BLAST. 13. Spidal.org.

(14) HPC (Simulation) Benchmark Classics. • Linpack or HPL: Parallel LU factorization for solution of linear equations; HPCG • NPB version 1: Mainly classic HPC solver kernels – – – – – – – –. MG: Multigrid CG: Conjugate Gradient FT: Fast Fourier Transform Simulation Models IS: Integer sort EP: Embarrassingly Parallel BT: Block Tridiagonal SP: Scalar Pentadiagonal LU: Lower-Upper symmetric Gauss Seidel. 14. Spidal.org.

(15) 13 Berkeley Dwarfs 1) 2) 3) 4) 5) 6) 7) 8) 9) 10) 11). Dense Linear Algebra Sparse Linear Algebra Spectral Methods N-Body Methods Structured Grids Unstructured Grids MapReduce Combinational Logic Graph Traversal Dynamic Programming Backtrack and Branch-and-Bound 12) Graphical Models 13) Finite State Machines. Largely Models for Data or Simulation First 6 of these correspond to Colella’s original. (Classic simulations) Monte Carlo dropped. N-body methods are a subset of Particle in Colella. Note a little inconsistent in that MapReduce is a programming model and spectral method is a numerical method. Need multiple facets to classify use cases! 15. Spidal.org.

(16) Classifying Use cases. 16. Spidal.org.

(17) Classifying Use Cases • •. The Big Data Ogres built on a collection of 51 big data uses gathered by the NIST Public Working Group where 26 properties were gathered for each application.. This information was combined with other studies including the Berkeley dwarfs, the NAS parallel benchmarks and the Computational Giants of the NRC Massive Data Analysis Report. • The Ogre analysis led to a set of 50 features divided into four views that could be used to categorize and distinguish between applications. • The four views are Problem Architecture (Macro pattern); Execution Features (Micro patterns); Data Source and Style; and finally the Processing View or runtime features. • We generalized this approach to integrate Big Data and Simulation applications into a single classification looking separately at Data and Model with the total facets growing to 64 in number, called convergence diamonds, and split between the same 4 views. • A mapping of facets into work of the SPIDAL project has been given. 17 Spidal.org.

(18) 18 18 Spidal.org.

(19) 64 Features in 4 views for Unified Classification of Big Data and Simulation Applications Both. Simulations Analytics. (Model for Big Data). (Nearly all Data). (All Model). (Mix of Data and Model) (Nearly all Data+Model). 19. Spidal.org.

(20) Convergence Diamonds and their 4 Views I. • One view is the overall problem architecture or macropatterns which is naturally related to the machine architecture needed to support application. – Unchanged from Ogres and describes properties of problem such as “Pleasing Parallel” or “Uses Collective Communication” • The execution (computational) features or micropatterns view, describes issues such as I/O versus compute rates, iterative nature and regularity of computation and the classic V’s of Big Data: defining problem size, rate of change, etc. – Significant changes from ogres to separate Data and Model and add characteristics of Simulation models. e.g. both model and data have “V’s”; Data Volume, Model Size – e.g. O(N2) Algorithm relevant to big data or big simulation model 20. Spidal.org.

(21) •. •. • •. Convergence Diamonds and their 4 Views II The data source & style view includes facets specifying how the data is collected, stored and accessed. Has classic database characteristics – Simulations can have facets here to describe input or output data – Examples: Streaming, files versus objects, HDFS v. Lustre. Processing view has model (not data) facets which describe types of processing steps including nature of algorithms and kernels by model e.g. Linear Programming, Learning, Maximum Likelihood, Spectral methods, Mesh type, – mix of Big Data Processing View and Big Simulation Processing View and includes some facets like “uses linear algebra” needed in both: has specifics of key simulation kernels and in particular includes facets seen in NAS Parallel Benchmarks and Berkeley Dwarfs Instances of Diamonds are particular problems and a set of Diamond instances that cover enough of the facets could form a comprehensive benchmark/mini-app set Diamonds and their instances can be atomic or composite 21. Spidal.org.

(22) Problem Architecture View. 22. Spidal.org.

(23) 23. Spidal.org.

(24) 1.. Problem Architecture View (Meta or MacroPatterns). Pleasingly Parallel – as in BLAST, Protein docking, some (bio-)imagery including Local Analytics or Machine Learning – ML or filtering pleasingly parallel, as in bioimagery, radar images (pleasingly parallel but sophisticated local analytics) 2. Classic MapReduce: Search, Index and Query and Classification algorithms like collaborative filtering (G1 for MRStat in Features, G7) 3. Map-Collective: Iterative maps + communication dominated by “collective” operations as in reduction, broadcast, gather, scatter. Common datamining pattern 4. Map-Point to Point: Iterative maps + communication dominated by many small point to point messages as in graph algorithms 5. Map-Streaming: Describes streaming, steering and assimilation problems 6. Shared Memory: Some problems are asynchronous and are easier to parallelize on shared rather than distributed memory – see some graph algorithms 7. SPMD: Single Program Multiple Data, common parallel programming feature 8. BSP or Bulk Synchronous Processing: well-defined compute-communication phases 9. Fusion: Knowledge discovery often involves fusion of multiple methods. 10. Dataflow: Important application features often occurring in composite Ogres 11. Use Agents: as in epidemiology (swarm approaches) This is Model only 12. Workflow: All applications often involve orchestration (workflow) of multiple components. Most (11 of total 12) are properties of Data+Model. 24. Spidal.org.

(25) Structure of Applications •. • •. •. Real-time (streaming) data is increasingly common in scientific and engineering research, and it is ubiquitous in commercial Big Data (e.g., social network analysis, recommender systems and consumer behavior classification) – So far little use of commercial and Apache technology in analysis of scientific streaming data Pleasingly parallel applications important in science (long tail) and data communities Commercial-Science application differences: Search and recommender engines have different structure to deep learning, clustering, topic models, graph analyses such as subgraph mining – Latter very sensitive to communication and can be hard to parallelize – Search typically not as important in Science as in commercial use as search volume scales by number of users Should discuss data and model separately – Term data often used rather sloppily and often refers to model 25. Spidal.org.

(26) Local and Global Machine Learning. • Many applications use LML or Local machine Learning where machine learning (often from R or Python or Matlab) is run separately on every data item such as on every image • But others are GML Global Machine Learning where machine learning is a basic algorithm run over all data items (over all nodes in computer) – maximum likelihood or 2 with a sum over the N data items – documents, sequences, items to be sold, images etc. and often links (point-pairs). – GML includes Graph analytics, clustering/community detection, mixture models, topic determination, Multidimensional scaling, (Deep) Learning Networks • Note Facebook may need lots of small graphs (one per person and ~LML) rather than one giant graph of connected people (GML) 26. Spidal.org.

(27) 6 Forms of MapReduce Describes Architecture of - Problem (Model reflecting data) - Machine - Software 2 important variants (software) of Iterative MapReduce and Map-Streaming a) “In-place” HPC b) Flow for model and data. 27 27 Spidal.org.

(28) Examples in Problem Architecture View PA •. •. The facets in the Problem architecture view include 5 very common ones describing synchronization structure of a parallel job: – MapOnly or Pleasingly Parallel (PA1): the processing of a collection of independent events; – MapReduce (PA2): independent calculations (maps) followed by a final consolidation via MapReduce; – MapCollective (PA3): parallel machine learning dominated by scatter, gather, reduce and broadcast; – MapPoint-to-Point (PA4): simulations or graph processing with many local linkages in points (nodes) of studied system. – MapStreaming (PA5): The fifth important problem architecture is seen in recent approaches to processing real-time data. – We do not focus on pure shared memory architectures PA6 but look at hybrid architectures with clusters of multicore nodes and find important performances issues dependent on the node programming model. Most of our codes are SPMD (PA-7) and BSP (PA-8). 28. Spidal.org.

(29) Relation of Problem and Machine Architecture • Problem is Model plus Data • In my old papers (especially book Parallel Computing Works!), I discussed computing as multiple complex systems mapped into each other. Problem  Numerical formulation  Software  Hardware • • • •. Each of these 4 systems has an architecture that can be described in similar language One gets an easy programming model if architecture of problem matches that of Software One gets good performance if architecture of hardware matches that of software and problem So “MapReduce” can be used as architecture of software (programming model) or “Numerical formulation of problem” 29. Spidal.org.

(30) Processing View. 30. Spidal.org. 30.

(31) • • • •. • • •. Diamond Facets in Processing (runtime) View I used in Big Data and Big Simulation. Pr-1M Micro-benchmarks ogres that exercise simple features of hardware such as communication, disk I/O, CPU, memory performance Pr-2M Local Analytics executed on a single core or perhaps node Pr-3M Global Analytics requiring iterative programming models (G5,G6) across multiple nodes of a parallel system Pr-12M Uses Linear Algebra common in Big Data and simulations – Subclasses like Full Matrix – Conjugate Gradient, Krylov, Arnoldi iterative subspace methods – Structured and unstructured sparse matrix methods Pr-13M Graph Algorithms (G3) Clear important class of algorithms -- as opposed to vector, grid, bag of words etc. – often hard especially in parallel Pr-14M Visualization is key application capability for big data and simulations Pr-15M Core Libraries Functions of general value such as Sorting, Math functions, Hashing 31. Spidal.org.

(32) • • • • • •. • •. Diamond Facets in Processing (runtime) View II used in Big Data. Pr-4M Basic Statistics (G1): MRStat in NIST problem features Pr-5M Recommender Engine: core to many e-commerce, media businesses; collaborative filtering key technology Pr-6M Search/Query/Index: Classic database which is well studied (Baru, Rabl tutorial) Pr-7M Data Classification: assigning items to categories based on many methods – MapReduce good in Alignment, Basic statistics, S/Q/I, Recommender, Classification. Pr-8M Learning of growing importance due to Deep Learning success in speech recognition etc.. Pr-9M Optimization Methodology: overlapping categories including – Machine Learning, Nonlinear Optimization (G6), Maximum Likelihood or 2 least squares minimizations, Expectation Maximization (often Steepest descent), Combinatorial Optimization, Linear/Quadratic Programming (G5), Dynamic Programming. Pr-10M Streaming Data or online Algorithms. Related to DDDAS (Dynamic DataDriven Application Systems) Pr-11M Data Alignment (G7) as in BLAST compares samples with repository. 32. Spidal.org.

(33) Diamond Facets in Processing (runtime) View III used in Big Simulation. • Pr-16M Iterative PDE Solvers: Jacobi, Gauss Seidel etc. • Pr-17M Multiscale Method? Multigrid and other variable resolution approaches • Pr-18M Spectral Methods as in Fast Fourier Transform • Pr-19M N-body Methods as in Fast multipole, Barnes-Hut • Pr-20M Both Particles and Fields as in Particle in Cell method • Pr-21M Evolution of Discrete Systems as in simulation of Electrical Grids, Chips, Biological Systems, Epidemiology. Needs Ordinary Differential Equation solvers • Pr-22M Nature of Mesh if used: Structured, Unstructured, Adaptive. Covers NAS Parallel Benchmarks and Berkeley Dwarfs. 33. Spidal.org.

(34) Execution View. 34. Spidal.org. 34.

(35) • • • •. •. Examples in Execution View EV. The Execution view is a mix of facets describing either data or model; PA was largely the overall Data+Model EV-M14 is Complexity of model (O(N2) for N points) seen in the nonmetric space models EV-M13 such as one gets with DNA sequences. EV-M11 describes iterative structure distinguishing Spark, Flink, and Harp from the original Hadoop. The facet EV-M8 describes the communication structure which is a focus of our research as much data analytics relies on collective communication which is in principle understood but we find that significant new work is needed compared to basic HPC releases which tend to address point to point communication. The model size EV-M4 and data volume EV-D4 are important in describing the algorithm performance as just like in simulation problems, the grain size (the number of model parameters held in the unit – thread or process – of parallel computing) is a critical measure of performance. 35. Spidal.org.

(36) View for Micropatterns or Execution Features 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.. Performance Metrics; property found by benchmarking Diamond Flops per byte; memory or I/O Execution Environment; Core libraries needed: matrix-matrix/vector algebra, conjugate gradient, reduction, broadcast; Cloud, HPC etc. Volume: property of a Diamond instance: a) Data Volume and b) Model Size Velocity: qualitative property of Diamond with value associated with instance. Only Data Variety: important property especially of composite Diamonds; Data and Model separately Veracity: important property of applications but not kernels; Model Communication Structure; Interconnect requirements; Is communication BSP, Asynchronous, Pub-Sub, Collective, Point to Point? Is Data and/or Model (graph) static or dynamic? Much Data and/or Models consist of a set of interconnected entities; is this regular as a set of pixels or is it a complicated irregular graph? Are Models Iterative or not? Data Abstraction: key-value, pixel, graph(G3), vector, bags of words or items; Model can have same or different abstractions e.g. mesh points, finite element, Convolutional Network Are data points in metric or non-metric spaces? Data and Model separately? Is Model algorithm O(N2) or O(N) (up to logs) for N points per iteration (G2). 36. Spidal.org.

(37) Data Source and Style View. 37. Spidal.org. 37.

(38) Examples in Data View DV. • We can highlight DV-5 streaming where there is a lot of recent progress; • DV-9 categorizes our Biomolecular simulation application with data produced by an HPC simulation • DV-10 is Geospatial Information Systems covered by our spatial algorithms. • DV-7 provenance, is an example of an important feature that we are not covering. • The data storage and access DV-3 and D-4 is covered in our pilot data work. • The Internet of Things DV-8 is not a focus of our project although our recent streaming work relates to this and our addition of HPC to Apache Heron and Storm is an example of the value of HPC-ABDS to IoT. 38. Spidal.org.

(39) i. ii. iii. iv. v.. Data Source and Style Diamond View I. SQL NewSQL or NoSQL: NoSQL includes Document, Column, Key-value, Graph, Triple store; NewSQL is SQL redone to exploit NoSQL performance Other Enterprise data systems: 10 examples from NIST integrate SQL/NoSQL Set of Files or Objects: as managed in iRODS and extremely common in scientific research File systems, Object, Blob and Data-parallel (HDFS) raw storage: Separated from computing or colocated? HDFS v Lustre v. Openstack Swift v. GPFS Archive/Batched/Streaming: Streaming is incremental update of datasets with new algorithms to achieve real-time response (G7); Before data gets to compute system, there is often an initial data gathering phase which is characterized by a block size and timing. Block size varies from month (Remote Sensing, Seismic) to day (genomic) to seconds or lower (Real time control, streaming) • Streaming divided into categories overleaf 39 Spidal.org.

(40) •. Data Source and Style Diamond View II Streaming divided into 5 categories depending on event size and synchronization and integration • • • •. vi.. •. Set of independent events where precise time sequencing unimportant. Time series of connected small events where time ordering important. Set of independent large events where each event needs parallel processing with time sequencing not critical Set of connected large events where each event needs parallel processing with time sequencing critical. Stream of connected small or large events to be integrated in a complex way.. Shared/Dedicated/Transient/Permanent: qualitative property of data; Other characteristics are needed for permanent auxiliary/comparison datasets and these could be interdisciplinary, implying nontrivial data movement/replication vii. Metadata/Provenance: Clear qualitative property but not for kernels as important aspect of data collection process viii. Internet of Things: 24 to 50 Billion devices on Internet by 2020 ix. HPC simulations: generate major (visualization) output that often needs to be mined x. Using GIS: Geographical Information Systems provide attractive access to geospatial data. 40. Spidal.org.

(41) Data Source and Style View 10 Bob Marcus Data Processing Use Cases. 41. Spidal.org. 41.

(42) 10 Bob Marcus Data Processing Use Cases. 1) Multiple users performing interactive queries and updates on a database with basic availability and eventual consistency (BASE = (Basically Available, Soft state, Eventual consistency) as opposed to ACID = (Atomicity, Consistency, Isolation, Durability) ) 2) Perform real time analytics on data source streams and notify users when specified events occur 3) Move data from external data sources into a highly horizontally scalable data store, transform it using highly horizontally scalable processing (e.g. Map-Reduce), and return it to the horizontally scalable data store (ELT Extract Load Transform) 4) Perform batch analytics on the data in a highly horizontally scalable data store using highly horizontally scalable processing (e.g MapReduce) with a user-friendly interface (e.g. SQL like) 5) Perform interactive analytics on data in analytics-optimized database 6) Visualize data extracted from horizontally scalable Big Data store 7) Move data from a highly horizontally scalable data store into a traditional Enterprise Data Warehouse (EDW) 8) Extract, process, and move data from data stores to archives 9) Combine data from Cloud databases and on premise data stores for analytics, data mining, and/or machine learning 10) Orchestrate multiple sequential and parallel data transformations and/or analytic processing using a workflow manager. 42. Spidal.org.

(43) 1. Multiple users performing interactive queries and updates on a database with basic availability and eventual consistency. Generate a SQL Query Process SQL Query (RDBMS Engine, Hive, Hadoop, Drill). Data Storage: RDBMS, HDFS, Hbase Data, Streaming, Batch .... Includes access to traditional ACID database 43. Spidal.org.

(44) Typical Big Data Pattern 2. Perform real time analytics on data source streams and notify users when specified events occur Specify filter. Streaming Data Streaming Data Streaming Data. Fetch streamed Data. Filter Identifying Events. Post Selected Events. Identified Events. Posted Data Archive. Repository Storm (Heron), Kafka, Hbase, Zookeeper. 44. Spidal.org.

(45) 3. Move data from external data sources into a highly horizontally scalable data store, transform it using highly horizontally scalable processing (e.g. Map-Reduce), and return it to the horizontally scalable data store (ELT) Ente Dat rprise a War eho use. Transform with Hadoop, Spark, Giraph…. Data Storage: HDFS, Hbase Streaming Data. Web Services. OLTP Database. ETL is Extract Load Transform. http://www.dzone.com/articles/hadoop-t-etl. 45. Spidal.org.

(46) 4. Perform batch analytics on the data in a highly horizontally scalable data store using highly horizontally scalable processing (e.g MapReduce) with a user-friendly interface (e.g. SQL like). SQL Query HCatalog. Hive. General Analytics Mahout, R. Hadoop, Spark, Giraph, Pig … Data Storage: HDFS, Hbase. Data, Streaming, Batch …... 46. Spidal.org.

(47) Hive Example •. http://venublog.com/2013/07/16/hadoop-summit2013-hive-authorization/. 47. Spidal.org.

(48) 5. Perform interactive analytics on data in analytics-optimized database. Mahout, R SPIDAL. Hadoop, Spark, Flink, Giraph, Pig … Data Storage: HDFS, Hbase, MongoDB. Data, Streaming, Batch …... 48. Spidal.org.

(49) Typical Big Data Pattern 5A. Perform interactive analytics on observational scientific data Science Analysis Code, Mahout, R, SPIDAL Grid or Many Task Software, Hadoop, Spark, Giraph, Pig … Data Storage: HDFS, Hbase, File Collection Direct Transfer. Streaming Twitter data for Social Networking Record Scientific Data in “field”. Transport batch of data to primary analysis data system Local Accumulate and initial computing. NIST examples include LHC, Remote Sensing, Astronomy and Bioinformatics. 49. Spidal.org.

(50) Particle Physics (LHC). LHC Data analyzes ~30 petabytes of data per year produced at CERN using ~300,000 cores around the world Data reduced in size, replicated and looked at by physicists 50. Spidal.org.

(51) Astronomy – Dark Energy Survey I. Victor M. Blanco Telescope Chile where new wide angle 520 mega pixel camera DECam installed. https://indico.cern.ch/event/214784/ session/5/contribution/410 Ends up as part of International Virtual observatory (IVOA), which is a collection of interoperating data archives and software tools which utilize the internet to form a scientific research environment in which astronomical research programs can be51 conducted.. Spidal.org.

(52) Astronomy – Dark Energy Survey II For DES (Dark Energy Survey) the data are sent from the mountaintop via a microwave link to La Serena, Chile. From there, an optical link forwards them to the NCSA (UIUC) as well as NERSC (LBNL) for storage and "reduction”. Here galaxies and stars in both the individual and stacked images are identified, catalogued, and finally their properties measured and stored in a database. DES Machine room at NCSA. 52. Spidal.org.

(53) Astronomy Hubble Space Telescope HST Processing in Baltimore Md. 53. http://asd.gsfc.nasa.gov/archive/hubble/a_pdf/news/facts/FS14.pdf. Spidal.org.

(54) CReSIS Remote Sensing: Radar Surveys Expeditions last 1-2 months and gather up to 100 TB data. Most is saved on removable disks and flown back to continental US at end. A sample is analyzed in field to check instrument. 54. Spidal.org.

(55) Gene Sequencing. Distributed (Illumina) devices distributed across world in many laboratories take data in form of “reads” that are aligned into a full sequence This processing often local but data needs to be compared with world’s other gene so uploaded to central repository Illumina HiSeq X 10 can sequence 18,000 genomes per year at $1000 each. Produces 0.6Terabases per day. 55. Spidal.org.

(56) 6. Visualize data extracted from horizontally scalable Big Data store. Interactive Visualization. Specify Analytics. Orchestration Layer. Prepare Interactive Visualization. Mahout, R. Hadoop, Spark, Giraph, Pig … Data Storage: HDFS, Hbase 56. Spidal.org.

(57) 7. Move data from a highly horizontally scalable data store into a traditional Enterprise Data Warehouse Enterprise Data Warehouse. Data Warehouse Query. Transform with Hadoop, Spark, Giraph …. Data Storage: HDFS, Hbase, (RDBMS). Streaming Data. Web Services. 57. Spidal.org. OLTP Database.

(58) Moving to EDW Example from Teradata. Moving data from HDFS to Teradata Data Warehouse and Aster Discovery Platform http://blogs.teradata.com/data-points/announcing-teradata-aster-big-analytics-appliance/. 58. Spidal.org.

(59) 8. Extract, process, and move data from data stores to archives Transform as needed Transform with Hive, Drill, Hadoop, Spark, Giraph, Pig …. Archive. Data Storage: HDFS, Hbase, RDBMS. Streaming Data. http://www.dzone.com/articles/hadoop-t-etl. OLTP Database. Web Services. ETL is Extract Load Transform. 59. Spidal.org.

(60) 9. Combine data from Cloud databases and on premise data stores for analytics, data mining, and/or machine learning. Mahout, R. Similar to 4 and 5. Hadoop, Spark, Giraph, Pig … Data Storage: HDFS, Hbase On premise Data. Streaming Data. 60. Spidal.org.

(61) http://wikibon.org/w/images/2/20/Cloud-BigData.png. Example: Integrate Cloud and local data. 61. Spidal.org.

(62) 10. Orchestrate multiple sequential and parallel data transformations and/or analytic processing using a workflow manager This can be used for science by Specify Analytics Pipeline. Analytic-3 (Visualize). adding data staging phases as in case 5A. Orchestration Layer (Workflow) Analytic-2. Analytic-1. Hadoop, Spark, Giraph, Pig … Data Storage: HDFS, Hbase. 62. Spidal.org.

(63) Example from Hortonworks http://hortonworks.com/hadoop/yarn/. 63. Spidal.org.

(64)

New documents

Butyl 3 [2 (3,4 di­meth­oxy­phen­yl) 7 meth­­oxy 1 benzo­furan 5 yl] 2 propenoate
Data collection: SMART Bruker, 2000; cell refinement: SAINT Bruker, 2000; data reduction: SAINT; programs used to solve structure: SHELXS97 Sheldrick, 1990; programs used to refine
A balance of rights and protections in public order policing: A case study on Rotherham
The second HMIC Adapting to Protest report from 2009 has it too that police intelligence should be used to plan the management of a protest where there is no notification under Section
Countering extremism and recording dissent: intelligence analysis and the Prevent agenda in UK Higher Education
For context see Equality and Human Rights Commission, Delivering the Prevent duty in a proportionate and fair way: A guide for Higher Education providers in England on how to use
5 Amino 1 [2 (di­ethyl­amino)ethyl] 1H imidazole 4 carboxamide
The carbonyl atom O1 of the carboxamide group accepts a hydrogen bond from the amino group at the 5position of the imidazole ring for geometric details, see Table 1 and forms a
Simulation of natural convective boundary layer flow of a nanofluid past a convectively heated inclined plate in the presence of magnetic field
Uniform magnetic field strength Brownian diffusion coefficient Thermophoresis diffusion coefficient Error Dimensionless stream function Acceleration due to gravity Dimensionless
The 3,3,5,5[2H4] 4 methacrylamido 2,2,6,6 tetra­([2H3]meth­yl)piperidin 1 yloxyl radical
Data collection: CrysAlis CCD Oxford Diffraction, 2004; cell refinement: CrysAlis RED Oxford Diffraction, 2004; data reduction: CrysAlis RED; programs used to solve structure: SHELXS97
N [tert But­oxy­carbonyl­glycyl (Z) α,β de­hydro­phenyl­alanyl­glycyl (E) α,β de­hydro­phenyl­alanyl]­glycine methyl ester dihydrate
Studies of sequences containing more than one Phe residue, or one Phe and another dehydroamino acid residue, separated by one or more saturated residues, have shown that these peptides
1′ Benzyl 1 (2,4 di­meth­oxy­phenyl) 3 phen­oxy­spiro­[azetidine 2,3′(3H) indole] 2′,4(1H) dione
The authors acknowledge the Faculty of Arts and Sciences, Ondokuz Mayıs University, Turkey, for the use of the Stoe IPDS2 diffractometer purchased under grant F.279 of the University
N [2 (Benzyl­sulfan­yl)benzyl­­idene] 2 (methyl­sulfan­yl)aniline
Data collection: CAD-4 EXPRESS Enraf–Nonius, 1994; cell refinement: CAD-4 EXPRESS; data reduction: XCAD4 Harms & Wocadlo, 1995; programs used to solve structure: DIRDIF99 Beurskens et
1,3,4,4 Tetra­chloro 4 (4 chloro­phenyl­sulfan­yl) 2 nitro­buta 1,3 diene
Data collection Rigaku R-AXIS RAPID diffractometer Detector resolution: 10.00 pixels mm-1 ω scans Absorption correction: multi-scan ABSCOR; Higashi, 1995 Tmin = 0.677, Tmax = 0.760 8204

Tags

Related documents

Big data analytics and mining for effective visualization and trends forecasting of crime data
The major contribution of this paper can be summarized as follows: 1 A series of investigative explorations are conducted to explore and explain the crime data in three US cities; 2 We
0
0
13
Monitoring Data Integrity in Big Data Analytics Services
For 1-to-1 transformations this approach does not require any changes in the monitoring rule we employ — it simply creates a single event per R DD partition read/written instead of
0
0
5
Big data analytics for time critical maritime and aerial mobility forecasting
The ever-increasing volume of data emphasizes the need for advanced methods supporting real-time detection and prediction of events and trajectories, together with advanced visual
0
0
13
Motivation to use big data and big data analytics in external auditing
In the case of a client being a small business company, audit companies even have to show the value of using BDA in the audit process: We indicate the main advantages of using BDA for
0
0
28
Motivation to use big data and big data analytics in external auditing
High importance3 High importance Not disclosed 2 Governance Information related to top management – government, structure foreign management, national shareholders, global networking
0
0
11
Using Machine Learning and Big Data Analytics to Prioritize Outpatients in HetNets
This is a repository copy of Using Machine Learning and Big Data Analytics to Prioritize Outpatients in HetNets White Rose Research Online URL for this paper http
0
0
7
Patient-Centric Cellular Networks Optimization using Big Data Analytics
Received March 12, 2019, accepted March 22, 2019, date of publication April 11, 2019, date of current version April 24, 2019 Digital Object Identifier 10
0
0
18
Big Data Analytics for Wireless and Wired Network Design: A Survey
The authors of [27] noted that processing the characteristics of both the regional user and service using big data analytics can be of great help in flexible network and functionality
0
0
20
Big Data Text Analytics an enabler of Knowledge Management
Knowledge management and big data share similar objectives, as both role is to create competitive advantage for organizations Chen et al., 2012; George et al., 2014; Grant, 1996, while
0
0
27
Requirements for big data analytics supporting decision making: A sensemaking perspective
underpinning sensemaking frameworks we adopted to guide our study: an intelligent analysis framework that presents how an individual analyst makes sense of large volume of data; and a
0
0
23
Sketch of Big Data Real-Time Analytics Model
We discuss and compare several Big Data technologies for real-time processing along with various challenges and issues in adapting Big Data.. Realtime Big Data analysis based on cloud
0
0
7
Data Analytics with the Cloudmesh Multi-Cloud Interfaces for the NIST Big Data Reference Architecture
Data Analytics with the Cloudmesh Multi-Cloud Interfaces for the NIST Big Data Reference Architecture Gregor von Laszewski, laszewski@gmail.com https://laszewski.github.io/
0
0
60
full
SQL like 5 Perform interactive analytics on data in analytics-optimized database with 5A Science 6 Visualize data extracted from horizontally scalable Big Data store 7 Move data from a
0
0
166
Integration of HPC, Big Data Analytics and The Software Ecosystem:Tutorial  Overview
• This tutorial covers both data analytics and the hardware/software system that supports them – Analytics are parallel what I call Global Machine Learning – Software systems support
0
0
29
Scalable High Performance Data Analytics
Motivation for faster and bigger problems • Machine Learning Needs high performance – Big data and Big model – Iterative algorithms are fundamental in learning a non-trivial model –
0
0
62
Show more