Big Data in Clouds
Cloud based platforms and tools for Big Data
applications
Data Centric Computing
UvA workshop 3 November 2015
Yuri Demchenko
Outline
• Big Data and Data Centric Computing
– Need for new paradigm?
• Cloud Computing as a platform of choice for Big
Data applications
• Big Data Stack and cloud advantages
• Cloud platforms for Big Data
• Questions and discussion
Acknowledgement
Cloud Computing is the New Pervasive Ubiquitous
Intelligence & Communications Platform
Education
Entertainment
Infrastructure
Relationships
Day to Day Life
Transportation
Commerce
Public Clouds are All Over the Place
Cloud Centers are All Over the Place, from datacentermap.com November 2015
Cloud Computing as a key IT technology
transformation factor
• Cloud Computing is powering modern business and facilitating new technologies development that require elastic computing resources on-demand
– Mobile applications – Big Data applications – Internet of Things
– Changes telecom market landscape
• Demand from other technologies accelerate Cloud Computing development
– Big Data and Data Intensive Applications, Research Infrastructures, Mobile technologies, IoT
• Cloud Computing is increasing business agility and speed up new services/technologies development
– However requires IT platforms/solutions transformation and applications re-factoring/re-design
Source: VMware Business white paper
Cloud
IT Agility Business Agility
Corporate Performance
Gartner's 2014 Hype Cycle
Source: Gartner (August 2014)
[ref] Gartner's 2014 Hype Cycle for Emerging Technologies Maps the Journey to Digital Business. http://www.gartner.com/newsroom/id/2819918
Cloud Computing:
Cisco Cloud Index 2013-2018: Cloud Service
Delivery Models Market Share
Cloud workload processed by different cloud service delivery models by 2018
• Software-as-a-Service (SaaS): 59 percent, up from 41 percent in 2013
• Infrastructure-as-a-Service (PaaS): 28 percent, down from 44 percent in 2013
• Platform-as-a-Service (IaaS): 13 percent, down from 15 percent in 2013
[ref] Cisco Global Cloud Index: Forecast and Methodology, 2013–2018 [online]
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html Data Center Virtualization and Cloud Computing Growth by 2018 • 78% of workloads will be processed by cloud data centers • 22% will be processed by traditional data centers.
Big Data is Driving Cloud Usage
Data is becoming the new raw material of business: an economic input almost on a par with capital and labour.“Every day I wake up and ask, ‘how can I flow data better, manage data better, analyse data better?” says Rollin Ford, the CIO of Wal-Mart.
Source: Data, Data Everywhere, The Economist, February 25, 2010
There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days, and the pace is increasing
Eric Schmidt, Google CEO, Techonomy Conference, August 4, 2010
Supp
ly
NIST Big Data Working Group (NBD-WG) and
ISO/IEC JTC1 Study Group on Big Data (SGBD)
• NIST Big Data Working Group (NBD-WG) is leading the development of the Big Data Technology Roadmap - http://bigdatawg.nist.gov/home.php
– Built on experience of developing the Cloud Computing standards fully accepted by industry
• Set of documents delivered September 2014 (to be published as NIST Special Publication documents) - http://bigdatawg.nist.gov/V1_output_docs.php
Volume 1: NIST Big Data Definitions
Volume 2: NIST Big Data Taxonomies
Volume 3: NIST Big Data Use Case & Requirements
Volume 4: NIST Big Data Security and Privacy Requirements
Volume 5: NIST Big Data Architectures White Paper Survey Volume 6: NIST Big Data Reference Architecture
Volume 7: NIST Big Data Technology Roadmap
• NBD-WG defined 3 main components of the new technology:
– Big Data Paradigm
– Big Data Science and Data Scientist as a new profession – Big Data Architecture
The Big Data Paradigm consists of the distribution of data systems across horizontally-coupled
independent resources to achieve the scalability needed for the efficient processing of extensive datasets.
NIST Big Data Reference Architecture
Main components of the Big Data ecosystem
• Data Provider
• Big Data Applications Provider • Big Data Framework Provider • Data Consumer
• Service Orchestrator Big Data Lifecycle and
Applications Provider activities
• Collection • Preparation
• Analysis and Analytics • Visualization
• Access
Big Data Ecosystem includes all components that are involved into Big Data production, processing, delivery, and consuming
K E Y :
SW
Service Use
Big Data Information Flow
SW Tools and Algorithms Transfer
Big Data Application Provider
Visualization Access Analytics Preparation Collection System Orchestrator S e c u r it y & P r iv a c y M a n a g e m e n t DATA SW DATA SW I N F O R M AT I O N V A L U E C H A I N IT V A L U E C H A IN D a ta C o n su m er D a ta Pro vi d er DATA
Horizontally Scalable (VM clusters)
Vertically Scalable Horizontally Scalable
Vertically Scalable Horizontally Scalable
Vertically Scalable
Big Data Framework Provider
Processing Frameworks (analytic tools, etc.)
Platforms (databases, etc.)
Infrastructures
Physical and Virtual Resources (networking, computing, etc.)
D
A
T
A
SW
Big Data Architecture Framework (BDAF)
(1) Data Models, Structures, Types
– Data formats, non/relational, file systems, etc.
(2) Big Data Management
– Big Data Lifecycle (Management) Model
• Big Data transformation/staging– Provenance, Curation, Archiving
(3) Big Data Analytics and Tools
– Big Data Applications
• Target use, presentation, visualisation
(4) Big Data Infrastructure (BDI)
– Storage, Compute, (High Performance Computing,) Network
– Sensor network, target/actionable devices
– Big Data Operational support
(5) Big Data Security
Big Data Infrastructure and Analytics Tools
Big Data Infrastructure
• Heterogeneous multi-provider inter-cloud infrastructure
• Data management infrastructure • Collaborative Environment
• Advanced high performance (programmable) network • Security infrastructure
• Federated Access and Delivery Infrastructure (FADI)
Big Data Analytics Infrastructure/Tools
• High Performance Computer Clusters (HPCC)
• Big Data storage and databases SQL and NoSQL
• Analytics/processing: Real-time, Interactive, Batch, Streaming • Big Data Analytics tools and
Data Lifecycle/Transformation Model
• Data Model changes along data lifecycle or evolution
• Data provenance is a discipline to track all data transformations along lifecycle
• Identifying and linking data
– Persistent data/object identifiers (PID/OID) – Traceability vs Opacity
– Referral integrity
Multiple Data Models and structures • Data Variety and Variability • Semantic Interoperability Data Model (1)
Data Model (1) Data Model (3)
Data Model (4) Data (inter)linking • PID/OID • Identification • Traceability vs Opacity • Privacy, Opacity
Data Storage (Big Data capable)
Cons um er ap pl ic ati on s Data Source Data Filter/Enrich, Classification Data Delivery, Visualisation Data Analytics, Modeling, Prediction Data Collection& Registration Data repurposing, Analytics re-factoring, Secondary processing
Big Data Cloud Based Service
Parallel compute server d1 d2 d3 External Client Parallel data server Query Source dataset Derived datasets Parallel file system (e.g., GFS, HDFS) Result Data-intensive computing system (e.g. Hadoop)Parallel query server External data sources Examples:
Search, scene completion service, log processing
Characteristics:
Massive data and
computation on cloud, small queries and results
Big Data Stack
Data Store & Historical Analytics • PERIODIC QUERY
• OLAP / DATA WAREHOUSE MODEL • BI, Reports • Dashboards • User Interactive Real-Time Decisions • CONTINUOUS QUERY • OLTP MODEL • Programmatic Request-Response • Stored Procedures
• Export for Pipelining
Data Ingestion Message/Data Queue
Data Export
Hook into an existing queue to get copy / subset of data. Queue handles partitioning, replication, and ordering of data, can manage backpressure from slower downstream components
Use direct Ingestion API to capture entire stream at wire speed Ingestion will transform,
normalize, distribute and integrate to one or more of the Analytic / Decision Engines of choice
Use one or more Analytic / Decision engines to accomplish specific task on the Fast Data window
Streaming Analytics • NON QUERY
• TIME WINDOW MODEL • Counting, Statistics • Rules
• Triggers
Export will transform, normalize, distribute and integrate to one or more of the Data Warehouse / Storage Engines of choice The OLAP Engines of
choice will append to historical data and store for later, further Big Data analysis on entire data set
Real-Time Decisions and/or Results can be fed back “up-stream” to influence the “next step”
Big Data Stack
Data Store & Historical Analytics • PERIODIC QUERY
• OLAP / DATA WAREHOUSE MODEL • BI, Reports • Dashboards • User Interactive Real-Time Decisions • CONTINUOUS QUERY • OLTP MODEL • Programmatic Request-Response • Stored Procedures
• Export for Pipelining
Data Ingestion Message/Data Queue
Data Export
Hook into an existing queue to get copy / subset of data. Queue handles partitioning, replication, and ordering of data, can manage backpressure from slower downstream components
Use direct Ingestion API to capture entire stream at wire speed Ingestion will transform,
normalize, distribute and integrate to one or more of the Analytic / Decision Engines of choice
Use one or more Analytic / Decision engines to accomplish specific task on the Fast Data window
Streaming Analytics • NON QUERY
• TIME WINDOW MODEL • Counting, Statistics • Rules
• Triggers
Export will transform, normalize, distribute and integrate to one or more of the Data Warehouse / Storage Engines of choice The OLAP Engines of
choice will append to historical data and store for later, further Big Data analysis on entire data set
Real-Time Decisions and/or Results can be fed back “up-stream” to influence the “next step”
Big Data Stack Components and used Interaction
• Ingestion: provides interface to the streaming data sources
including data transformation and normalisation
– Direct Ingestion using the data generating API at “wire speed”
– Message Queue: effective for heterogeneous data sources with
different speed/velocity
• Streaming analytics and real-time decisions
– Streaming analytics uses non-query time-window model
– Real-time analytics uses continuous query OLTP model
• Data export, data store and historical analytics
– Transform, normalize, distribute and integrate to one or more of the
Data Warehouse or storage platform
– Uses periodic queries OLTP or data warehouse model
– Data are stored for future Big Data analysis
Important Big Data Technologies
Data Store & Historical Analytics
Real-Time Decisions Data Ingestion
Message/Data Queue
Data Export Streaming Analytics
Data Factory, Kafka, Flume, Scribe Samza,
Dataflow
,
Kinesis
Event Hubs
HDInsight, DocumentDB, Hadoop, Hive, Spark
SQL, Druid, BigQuery,DynamoDB, Vertica,
MongoDB, EMR, CouchDB,
VoltDB Sqoop, HDFS Microsoft Azure: Event Hubs Data Factory Stream Analytics HDInsight DocumentDB Google GCE: DataFlow BigQuery Amazon AWS: Kinesis EMR DynamoDB Open Source Samza Kafka Flume Scribe Storm Cascading S4 Crunch Spark Hadoop Hive Druid MongoDB CouchDB VoltDB Proprietary: Vertica HortonWorks Stream Analytics, Storm, Cascading, S4, Crunch, Spark-Streaming
The Big Data Platform is Cloud!
• Interesting applications are data hungry
• The data grows over time
• The data is immobile
• Compute comes to the data
• Big Data clusters are the new libraries
• Provide high-performance execution over Big Data repositories Many spindles, many CPUs
Parallel processing
• Enable multiple services to access a repository concurrently • Enable low-latency scaling of services
• Enable each service to leverage its own software stack
IaaS, file-system protections where needed • Enable rapid resource scaling for power/demand
Scaling-aware storage • Enable slow resource scaling for growth
Re qu ire men ts So lu tion
Cloud Platform Benefits for Big Data
• Segregated networks isolate traffic
– Clouds construction provides separate networks for each type of
traffic
– Big Data applications benefit from lowest latencies possible for node
to node synchronization, dynamic cluster resizing, and other scale-out
operations
• Cloud deployment on virtual machines, containers, and bare
metal
– For a traditional highly load-variable problem one might consider
using VM’s as a deployment vehicle for that.
– Some Clouds will offer Container based isolation instead of VMs.
• Cloud tools for large scale applications deployment and
support by major IDE
Cloud-powered Services Development Lifecycle
• Easily creates test environment close to real • Powered by cloud deployment automation tools
– To enable configuration Management and Orchestration, Deployment automation
• Continuous development – test – integration
– CloudFormation Template, Configuration Template, Bootstrap Template
• Can be used with Puppet and Chef, two configuration and deployment management systems for clouds
[ref] Building Powerful Web Applications in the AWS Cloud” by Louis Columbus
http://softwarestrategiesblog.com/2011/03/10/building-powerful-web-applications-in-the-aws-cloud/
Dev
Test/QA
Staging
Prod
Amazon CloudFormation + Chef (or Puppet)
2-tier app with production data Multi-AZ and HA 2-tier app with
production data 2-tier app with test DB All in one instance
Data Protection and Privacy in Cloud
Data protection and privacy will continue remaining the main restricting factor for cloud growth and will stimulate the following research and developments
• National and international data protection regulation and legislation that should be
supported by corresponding certification and assessment procedures, services and bodies • New advanced privacy enhancement technologies (PET) should protect user privacy on
stored data and activity in cloud
– Emergence of Big Data technologies will require new PET preventing user (re-)identification based on correlation between multiple sources of data
• The shared security model currently used in cloud will increase a level of customer controlled security (and privacy) compliance monitoring
• Cloud access control models will continue adopting federated access control and Identity management models
– While based on initially verified user identity the access control methods should not reveal user identity to cloud service providers and 3rd party services
• Data Encryption and key management
– New encrypted data processing models such as homomorphic encryption – New key management models
Trustworthiness of Cloud Computing Platform and
Trust Management
Trusted cloud platform includes at least two aspects: platform trustworthiness and trust management
• Trustworthiness is a property of the cloud platform that includes multiple factors
including design principles, implementation, technical mechanisms and operational methods
– Trustworthiness by design is a key to create and trusted cloud platform
– Cloud providers pay strong attention to all aspects of their platform to create a trusted environment for user services and data
• Trust management include both technical mechanisms and solutions for
establishing and managing trust relations between provider and customer or user administrative and security domains
– Technical trust is rooted in the trusted platform and typically require manual setup – Cloud will stimulate further development of the dynamic trust establishment and
management mechanisms to adopt to the dynamic and on-demand character of cloud services
– Trusted Computing Platform (TCP) Architecture by Trusted Computing Group (TCG) provides and technical based for building distributed trusted environment
• TCO-TCG is based on hardware based Trust Platform Module (TPM) that allows tamper-resistant (hardware based) secret key storage and operate as a Root of Trust
• TCG website: http://www.trustedcomputinggroup.org/
• Trust management mechanisms are typically hidden to users but require precise
Big Data platforms and Providers
• AWS Big Data services
• Microsoft Azure Analytics Platform System and
HDInsight
Cloud HPC and Big Data Platforms
• HPC on cloud platform
– Special HPC and GPU VM instances as well as Hadoop/HPC clusters offered by CSPs
• Amazon Big Data services
– Amazon Elastic MapReduce
• Microsoft Analytics Platform System (APS)
– Microsoft HD Insight
• LexisNexis HPC Cluster System
– Combing both HPC cluster platform and optimized data processing languages
• Streaming analytics/processing tools
– Apache Kafka (http://kafka.apache.org/)
• High throughput distributed subscription messaging system used for real-time stream processing
– Apache Storm (https://storm.incubator.apache.org/)
• Distributed realtime computation system for unbounded streams of data
– Apache Spark (https://spark.apache.org/)
• Is a new general-purpose fast cluster computing system that targets to combine batch and real-time streaming data processing
• Spark provides primitives for in-memory cluster computing that makes it well suited to machine learning algorithms
Cloud and High Performance Computing (HPC)
• Current trends on increasing demand for HPC for Big Data Analytics and
Business Intelligence
– Data Analytics platforms and services are provided by the majority of big Cloud Service Providers as designated Data Analytics, Hadoop and HPC clusters
• Problem: Cloud Computing is not designed for HPC but rather for
scalability and parallelization
– Tightly coupled HPC workloads scale not well in a VM/cloud environment
• Restricting factors for HPC on cloud platform
– High cost of data access/download from cloud
– High-end VM configuration optimized for compute are proposed not by all clouds
• Computational challenges of Big Data can be solved by aggregating
compute power and parallelisation of tasks
– 24 hours on 1 machine can be 1 hour on 24 machines
– These 24 machines require supporting infrastructure that will do: orchestration; jobs division into tasks, and distribution of tasks across a computer cluster.
Amazon EC2 HPC oriented AMI instances
• Amazon EC2 offers dedicated computation optimized VMs and “cluster instances” that deliver better performance to HPC users
– Two Cluster Compute instances that provide a very large amount of CPU coupled with increased network performance (10GigE). Instances come in two sizes,
• Nehalem-based “Quadruple Extra Large Instance” (eight cores/node, 23GB of RAM, 1.7TB of local storage) • Sandy Bridge-based “Eight Extra Large Instance” (16 cores/node, 60.5GB of RAM, 3.4TB of local storage). – Besides the Amazon cluster instance, users may select from one of several preconfigured clusters – Additionally, two other specialized instances for HPC clusters
• Cluster GPU instance that provides two NVidia Tesla Fermi M2050 GPUs with proportionally high CPU and 10GigE network performance.
• High-I/O instance that provides two SSD-based volumes, each with 1024GB of storage.
– Pricing can vary depending upon on-demand, scheduled, or spot purchase of resources • Quadruple Extra Large Instance is US$ 1.3/hour (US$ 0.33/core·hour)
• Eight Extra Large Instance is US$ 2.4/hour (US$ 0.15/core·hour)
• Cluster GPU instance is US$ 2.1/hour, and the High I/O instance is US$ 3.1/hour
• For example
– Small usage case (80 cores, 4GB of RAM per core, and basic storage of 500GB) would cost US$ 24.00/hour (10 Eight Extra Large Instances)
– Larger usage case (256 cores, 4GB of RAM per core, and 1TB of fast global storage) would cost US$ 38.4/hour (16 Eight Extra Large Instances)
– Once created, the instances must be provisioned and configured to work as a cluster by the user
• Total cost depends on compute time, total data storage and data transfer cost
– Amazon does not charge for data transferred into EC2 but has a varying rate schedule for transfer out of the cloud; additionally, there are EC2 storage costs
Example: Univa HPC with RightScale and
Amazon AWS
• Univa offers “one-click” HPC computing with their partner RightScale
• First stage/offering: Create an internal cloud infrastructure system that provides “raw” virtual machines using VMware, KVM, Xen, or cloud platforms such as OpenStack or VMware vSphere.
– The UniCloud product uses an IaaS layer to create the machine and then provision a raw machine on the fly – Once the machine is provisioned (on one of IaaS platforms/providers), UniCloud does additional software
provisioning and configuration to turn the raw instances into nodes that can then be part of an HPC cluster
• Second stage/offering: Using external cloud infrastructure system, such as Amazon EC2 and AWS to provide the “raw” virtual machines for HPC use
– Additional provisioning and configuration management is done by UniCloud (PaaS), which turns the raw instances into HPC cluster nodes
– This process is automatic, and adding and removing nodes from HPC cluster can be done on demand.
• To launch an external cloud on Amazon EC2, just go to the Amazon Market Place, choose HPC from the list, and launch your cluster.
– Once the cloud is built, you can then install your application
• Univa’s UniCloud package provides a “ready-to-run” HPC cloud with prebuilt “kits” that include Univa Grid Engine
– UniCloud can burst from an internal cloud to an external public cloud or from one public cloud to another
• Univa costs are additive to the AWS instance charges and range from US$ 0.02 up to a max of US$ 0.08/core·hour
– Univa proposed a Medium AWS instance for the small usage case (80 cores, 4GB of RAM per core, basic storage of 500GB) that would cost US$ 14.40/hour [(AWS @ US$ 0.16/hour + Univa @ US$ 0.02/hour)x80].
AWS Cloud Big Data Services
AWS Cloud offers the following services and resources for Big Data processing:
• EC2 Virtual Machine (VM) instances for HPC optimized for computing (with
multiple cores) and with extended storage for large data processing.
• Amazon Elastic MapReduce (EMR) provides the Hadoop framework on Amazon EC2 and offers a wide range of Hadoop related tools.
• Amazon Kinesis is a managed service for real-time processing of streaming big data (throughput scaling from megabytes to gigabytes of data per second and from hundreds of thousands different sources).
• Amazon DynamoDB highly scalable NoSQL data stores with sub-millisecond response latency.
• Amazon Redshift fully-managed petabyte-scale data warehouse in cloud at cost
less than $1000 per terabyte per year. It is provided with columnar data storage with possibility to parallelise queries.
• Amazon RDS scalable relational database.
• Amazon Glacier archival storage to AWS for long time data storage at lower cost
Microsoft Azure Analytics Platform System (APS)
• Microsoft Azure cloud provides general IaaS services and reach Platform
as a Service (PaaS) services.
– Similar to AWS, Microsoft Azure offers special VM instances that have both computational and memory advanced capabilities.
• The Analytics Platform System (APS) combines the Microsoft SQL Server
based Parallel Data Warehouse (PDW) platform with HDInsight and
Apache Hadoop based scalable data analytics platform.
– APS includes the PolyBase data querying technology to simplify integration of the PDW SQL data and data from Hadoop.
• HDInsight Hadoop based platform has been co-developed with
Hortonworks
– HDInsight provides comprehensive integration and management functionality for multi-workload data processing on Hadoop platform including batch,
HDInsight: Microsoft’s Big Data Solution
• HDInsight can run both on Azure Cloud and on Windows Server (on premises)
– Data exchange via PolyBase
• Compatible with and support all products from Apache Hadoop stack
• Supports all stages of Big Data processing
• PolyBase is a new technology that allows integrating Microsoft SQL Server based Parallel Data Warehouse (PDW) with Hadoop
• Azure Blob Storage used to persistently store data
– Data are streamed to Hadoop/HDFS for processing and pushed back to Azure Blob Storage
HDInsight/Hadoop Ecosystem
Distributed Storage (HDFS) Query (Hive) Distributed Processing (MapReduce) O D BC Legend Red = Core Hadoop Blue = Data processing Purple = Microsoft integration points and enhancements Orange = Data Movement Green = PackagesHortonworks Data Platform Architecture [ref]
• HDP includes the most recent developments of the Open Source Hadoop suite • Can run on Linux and on Windows OS
Hortonworks Data Platform (HDP)
http://hortonworks.com/
• HDP delivers a single integrated Hadoop platform for enterprises
– Provides a data platform for multi-workload data processing across an array of processing methods including batch and interactive to real-time
– Supports key capabilities of an enterprise data platform: Governance, Security and Operations – YARN and Hadoop Distributed Filesystem (HDFS) are the core components of HDP
• YARN is treated as datacenter OS and supports multiple access methods (batch,
real-time, streaming, in-memory, and more) on a common data set
– YARN is the architectural center of Hadoop that allows to process data simultaneously in multiple ways
– Allows creating multi-tenant data analytics applications
• HDP runs natively on Linux and Windows OS
– HDP provides the basis for Microsoft’s HDInsight Service meaning complete portability of data is retained on-premise and in the cloud
– Available in integrated hardware from Teradata
• Hortonworks provides a simple starters solution Hadoop Sandbox
– Hortonworks Sandbox is a single-node implementation of Hadoop based on the Hortonworks Data Platform that includes all the typical components found in a Hadoop deployment
LexisNexis HPCC Systems as an integrated Open
Source platform for Big Data Analytics
HPCC Systems data analytics environment components and HPCC Systems architecture model is based on a distributed, shared-nothing architecture and contains two cluster:
• THOR Data Refinery: Massively parallel Extract, Transform, and Load (ECL) engine that can be used for variety of tasks such as massive: joins, merges, sorts, transformations, clustering, and scaling.
• ROXIE Data Delivery: Massively parallel, high throughput, structured query response engine.
– Perform volumes of structured queries and full text ranked Boolean search.
– ROXIE also provides real-time analytics capabilities, to address real-time classifications, prediction, fraud detection and other problems that normally require stream analytics.
Other components of the HPCC environment
• Enterprise Control Language (ECL): An open source, data-centric declarative programming language used by both THOR and ROXIE for large-scale data management and query processing.
• ECL compiler and job server (ECL Server): Serves as the code generator and compiler that translates ECL code.
• System data store (Dali): Used for environment configuration, message queue maintenance, and enforcement of LDAP security restrictions.
• Archiving server (Sasha): Serves as a companion ‘housekeeping’ server to Dali.
• Distributed File Utility (DFU Server): Controls the spraying and despraying operations used to move data onto and out of THOR.
• The inter-component communication server (ESP Server): Supports multi-protocol communication between services to enable various types of functionality to client applications via multiple protocols.
LexisNexis HPCC Systems Architecture
• THOR is used for massive data processing in batch mode for ETL processing
LexisNexis HPCC Systems Data Analytics Environment
• Unstructured and structured content – Ingest disparate data sources
– Manage hundreds of TB of data • Data Refinery, Linking Fusion – Parallel Processing Architecture
– Discover non-obvious links and detect hidden patterns – Language optimised for data-intensive Application
• Analysis Applications – Entity resolution – Link Analysis – Clustering Analysis – Complex Analysis
LexisNexis Data Analytics Languages
•
Enterprise Control Language (ECL)
: An open source, data-centric
declarative programming language
– The declarative character of ECL language simplifies coding
– ECL is developed to simplify both data query design and customary data transformation programming
– ECL is explicitly parallel and relies on the platform parallelism.
• LexisNexis proprietary record linkage technology
SALT (Scalable
Automated Linking Technology)
: automates data preparation process:
profiling, parsing, cleansing, normalisation, standardisation of data.
– SALT uses weighted matching and threshold based computation, it also
enables internal, external and remote linking with external or master datasets. – Enables the power of the HPCC Systems and ECL
•
Knowledge Engineering Language (KEL)
is an ongoing development
– KEL is a domain specific data processing language that allows using
Possible research topics on Cloud Computing and
Big Data Infrastructure
• Control and management plane mechanisms for heterogeneous
intercloud/multicloud infrastructure and optimal resources provisioning and management
• Federated cloud infrastructures: user and resources provisioning under
distributed QoS requirements and policy restrictions
• Cloud powered software and applications engineering for continuous
service/quality improvement by converging development and operation (DevOps)
• Cloud based infrastructures for Big Data processing to match 3 V of Big Data:
Volume, Velocity, Variety, including on-demand provisioning of Big Data services
• Security, access control and trust management in dynamically provisioned cloud
based applications in intercloud/multi-cloud environment
• Data centric security models and services for distributed Big Data (scientific) applications
Summary and take away
• Emerging Big Data technologies facilitate cloud computing development
and to support HPC applications
– Majority of big CSPs and first of all AWS and Microsoft Azure include in their offering HPC oriented VM images, as well as special HPC and Hadoop clusters
• AWS cloud provides a number of services for Big Data processing including
Elastic MapReduce, NoSQL database DynamoBD, Petabyte scale data warehouse Redshift, Kinesis for real-time strem processing
• Microsoft Azure Analytics Platform System (APS) combines the Microsoft
SQL Server based Parallel Data Warehouse (PDW) platform with the HDInsight Hadoop platform developped in cooperation with Hortonworks
• LexisNexis HPCC Systems provides integrated not Hadoop platform for Big
Data analytics
• Cloud and Big Data are complex but rich research areas
– High entry level but front end research and engineering problems highly demanded by research infrastructure and industry