Big Data in Clouds. Cloud based platforms and tools for Big Data applications. Data Centric Computing UvA workshop 3 November 2015.

(1)


Yuri Demchenko

(2)

Outline

• Big Data and Data Centric Computing

– Need for new paradigm?

• Cloud Computing as a platform of choice for Big Data applications

• Big Data Stack and cloud advantages

• Cloud platforms for Big Data

• Questions and discussion

Acknowledgement

(3)

Cloud Computing is the New Pervasive Ubiquitous Intelligence & Communications Platform

Education

Entertainment

Infrastructure

Relationships

Day to Day Life

Transportation

Commerce

(4)

Public Clouds are All Over the Place

Cloud Centers are All Over the Place, from datacentermap.com November 2015

(5)

Cloud Computing as a key IT technology transformation factor

• Cloud Computing powers modern business and facilitates the development of new technologies that require elastic, on-demand computing resources

– Mobile applications
– Big Data applications
– Internet of Things
– Changes the telecom market landscape

• Demand from other technologies accelerates Cloud Computing development

– Big Data and Data Intensive Applications, Research Infrastructures, Mobile technologies, IoT

• Cloud Computing increases business agility and speeds up the development of new services and technologies

– However, it requires transformation of IT platforms/solutions and re-factoring/re-design of applications

Source: VMware Business white paper

[Figure labels: Cloud; IT Agility; Business Agility; Corporate Performance]

(6)

Gartner's 2014 Hype Cycle

Source: Gartner (August 2014)

[ref] Gartner's 2014 Hype Cycle for Emerging Technologies Maps the Journey to Digital Business. http://www.gartner.com/newsroom/id/2819918

Cloud Computing:

(7)

Cisco Cloud Index 2013-2018: Cloud Service Delivery Models Market Share

Cloud workload processed by different cloud service delivery models by 2018

• Software-as-a-Service (SaaS): 59 percent, up from 41 percent in 2013

• Infrastructure-as-a-Service (IaaS): 28 percent, down from 44 percent in 2013

• Platform-as-a-Service (PaaS): 13 percent, down from 15 percent in 2013

[ref] Cisco Global Cloud Index: Forecast and Methodology, 2013–2018 [online]

http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html

Data Center Virtualization and Cloud Computing Growth by 2018

• 78% of workloads will be processed by cloud data centers

• 22% will be processed by traditional data centers

(8)

Big Data is Driving Cloud Usage

Data is becoming the new raw material of business: an economic input almost on a par with capital and labour. “Every day I wake up and ask, ‘how can I flow data better, manage data better, analyse data better?’” says Rollin Ford, the CIO of Wal-Mart.

Source: Data, Data Everywhere, The Economist, February 25, 2010

There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days, and the pace is increasing

Eric Schmidt, Google CEO, Techonomy Conference, August 4, 2010


(9)

NIST Big Data Working Group (NBD-WG) and ISO/IEC JTC1 Study Group on Big Data (SGBD)

• NIST Big Data Working Group (NBD-WG) is leading the development of the Big Data Technology Roadmap - http://bigdatawg.nist.gov/home.php

– Built on experience of developing the Cloud Computing standards fully accepted by industry

• Set of documents delivered September 2014 (to be published as NIST Special Publication documents) - http://bigdatawg.nist.gov/V1_output_docs.php

Volume 1: NIST Big Data Definitions

Volume 2: NIST Big Data Taxonomies

Volume 3: NIST Big Data Use Case & Requirements

Volume 4: NIST Big Data Security and Privacy Requirements

Volume 5: NIST Big Data Architectures White Paper Survey

Volume 6: NIST Big Data Reference Architecture

Volume 7: NIST Big Data Technology Roadmap

• NBD-WG defined 3 main components of the new technology:

– Big Data Paradigm

– Big Data Science and Data Scientist as a new profession

– Big Data Architecture

The Big Data Paradigm consists of the distribution of data systems across horizontally-coupled independent resources to achieve the scalability needed for the efficient processing of extensive datasets.
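The paradigm defined above can be illustrated with a minimal sketch of horizontal distribution: records are assigned to independent shards by a stable hash of their key, each shard computes a partial result on its own, and the partial results are combined. The function names, the node count, and the `sensor-N` keys are illustrative assumptions and do not come from the NIST documents.

```python
import hashlib
from collections import defaultdict


def node_for_key(key: str, num_nodes: int) -> int:
    """Map a record key to a node index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(records, num_nodes):
    """Distribute (key, value) records across num_nodes independent shards."""
    shards = defaultdict(list)
    for key, value in records:
        shards[node_for_key(key, num_nodes)].append((key, value))
    return shards


def distributed_sum(shards):
    """Each shard computes a local partial sum; the partials are then combined."""
    partials = [sum(value for _, value in shard) for shard in shards.values()]
    return sum(partials)


# Hypothetical data: 1000 sensor readings spread over 4 nodes.
records = [(f"sensor-{i}", i) for i in range(1000)]
shards = partition(records, num_nodes=4)
total = distributed_sum(shards)
```

Adding nodes changes only `num_nodes`; the per-shard work stays independent, which is the property that horizontal scaling relies on.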

(10)

NIST Big Data Reference Architecture

Main components of the Big Data ecosystem

• Data Provider
• Big Data Applications Provider
• Big Data Framework Provider
• Data Consumer
• Service Orchestrator

Big Data Lifecycle and Applications Provider activities

• Collection
• Preparation
• Analysis and Analytics
• Visualization
• Access

The Big Data Ecosystem includes all components that are involved in Big Data production, processing, delivery, and consumption

[Figure: NIST Big Data Reference Architecture. Components: System Orchestrator; Data Provider; Big Data Application Provider (Collection, Preparation, Analytics, Visualization, Access); Data Consumer; Big Data Framework Provider (Processing Frameworks: analytic tools, etc.; Platforms: databases, etc.; Infrastructures; Physical and Virtual Resources: networking, computing, etc.; horizontally scalable, e.g. VM clusters, and vertically scalable). Axes: Information Value Chain and IT Value Chain. Surrounding fabric: Security & Privacy, Management. Key: DATA = Big Data Information Flow; SW = SW Tools and Algorithms Transfer; Service Use.]

(11)

Big Data Architecture Framework (BDAF)

(1) Data Models, Structures, Types

– Data formats, non/relational, file systems, etc.

(2) Big Data Management

– Big Data Lifecycle (Management) Model

• Big Data transformation/staging

– Provenance, Curation, Archiving

(3) Big Data Analytics and Tools

– Big Data Applications

• Target use, presentation, visualisation

(4) Big Data Infrastructure (BDI)

– Storage, Compute, (High Performance Computing,) Network

– Sensor network, target/actionable devices

– Big Data Operational support

(5) Big Data Security

(12)

Big Data Infrastructure and Analytics Tools

Big Data Infrastructure

• Heterogeneous multi-provider inter-cloud infrastructure

• Data management infrastructure

• Collaborative Environment

• Advanced high performance (programmable) network

• Security infrastructure

• Federated Access and Delivery Infrastructure (FADI)

Big Data Analytics Infrastructure/Tools

• High Performance Computer Clusters (HPCC)

• Big Data storage and databases SQL and NoSQL

• Analytics/processing: Real-time, Interactive, Batch, Streaming

• Big Data Analytics tools and

(13)

Data Lifecycle/Transformation Model

• Data Model changes along data lifecycle or evolution

• Data provenance is a discipline to track all data transformations along lifecycle

• Identifying and linking data

– Persistent data/object identifiers (PID/OID)

– Traceability vs Opacity

– Referential integrity

[Figure: Data lifecycle/transformation model. Stages: Data Source; Data Collection & Registration; Data Filter/Enrich, Classification; Data Analytics, Modeling, Prediction; Data Delivery, Visualisation; Consumer applications; plus Data repurposing, Analytics re-factoring, Secondary processing. Each stage has its own data model (Data Model (1) through Data Model (4)) on top of Data Storage (Big Data capable). Annotations: Multiple Data Models and structures (Data Variety and Variability, Semantic Interoperability); Data (inter)linking (PID/OID, Identification, Traceability vs Opacity, Privacy, Opacity).]
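The provenance and identifier bullets above can be made concrete with a small sketch. It assumes a hypothetical in-memory store in which every data object receives a persistent identifier (here a random UUID standing in for a PID/OID) and records the PID of the object it was derived from, so the chain of transformations can be traced back along the lifecycle. The class names, stage names, and sample payload are illustrative only.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class DataObject:
    payload: dict
    model: str                               # data model used at this lifecycle stage
    pid: str = field(default_factory=lambda: uuid.uuid4().hex)
    derived_from: Optional[str] = None       # PID of the parent object, if any
    step: Optional[str] = None               # transformation that produced this object


class ProvenanceStore:
    """Hypothetical registry that keeps every object reachable by its PID."""

    def __init__(self) -> None:
        self._objects: Dict[str, DataObject] = {}

    def register(self, obj: DataObject) -> DataObject:
        self._objects[obj.pid] = obj
        return obj

    def transform(self, parent: DataObject, step: str, model: str,
                  fn: Callable[[dict], dict]) -> DataObject:
        """Apply fn to the parent payload and link the result to its parent."""
        child = DataObject(payload=fn(parent.payload), model=model,
                           derived_from=parent.pid, step=step)
        return self.register(child)

    def lineage(self, pid: str) -> List[str]:
        """Return the transformation steps that led to the object, oldest first."""
        steps = []
        obj = self._objects[pid]
        while obj.derived_from is not None:
            steps.append(obj.step)
            obj = self._objects[obj.derived_from]
        return list(reversed(steps))


store = ProvenanceStore()
raw = store.register(DataObject({"temp_c": "21.5"}, model="raw"))
typed = store.transform(raw, "filter/enrich", "typed",
                        lambda p: {"temp_c": float(p["temp_c"])})
derived = store.transform(typed, "analytics", "derived",
                          lambda p: {"temp_f": p["temp_c"] * 9 / 5 + 32})
history = store.lineage(derived.pid)
```

Because each object only stores its parent's PID, the links stay valid as long as identifiers are never reused, which is the referential integrity concern listed above.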

(14)

Big Data Cloud Based Service

[Figure: An External Client sends a Query to a Parallel query server and receives a Result. The query is served by a Data-intensive computing system (e.g. Hadoop) consisting of Parallel compute servers and Parallel data servers on a Parallel file system (e.g., GFS, HDFS). The Source dataset, fed by External data sources, is processed into Derived datasets d₁, d₂, d₃.]

Examples: Search, scene completion service, log processing

Characteristics: Massive data and computation on cloud, small queries and results
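The log processing example mentioned above is often expressed in the map/shuffle/reduce style used by systems such as Hadoop. The following single-process sketch only illustrates the data flow; it is not Hadoop code, and the log format (date, level, message) and function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (log_level, 1) for each well-formed log line."""
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the values of one key into a single small result."""
    return key, sum(values)


def count_levels(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical log lines.
log_lines = [
    "2015-11-03 ERROR disk full",
    "2015-11-03 INFO job started",
    "2015-11-03 ERROR timeout",
]
level_counts = count_levels(log_lines)
```

The input may be massive while the query and its result (`level_counts`) stay small, matching the characteristics listed above.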

(15)

Big Data Stack

Data Ingestion

• Use direct Ingestion API to capture entire stream at wire speed
• Ingestion will transform, normalize, distribute and integrate to one or more of the Analytic / Decision Engines of choice

Message/Data Queue

• Hook into an existing queue to get copy / subset of data
• Queue handles partitioning, replication, and ordering of data, can manage backpressure from slower downstream components

Streaming Analytics

• NON QUERY
• TIME WINDOW MODEL
• Counting, Statistics
• Rules
• Triggers

Real-Time Decisions

• CONTINUOUS QUERY
• OLTP MODEL
• Programmatic Request-Response
• Stored Procedures
• Export for Pipelining

Use one or more Analytic / Decision engines to accomplish specific task on the Fast Data window

Real-Time Decisions and/or Results can be fed back “up-stream” to influence the “next step”

Data Export

• Export will transform, normalize, distribute and integrate to one or more of the Data Warehouse / Storage Engines of choice

Data Store & Historical Analytics

• PERIODIC QUERY
• OLAP / DATA WAREHOUSE MODEL
• BI, Reports
• Dashboards
• User Interactive

The OLAP Engines of choice will append to historical data and store for later, further Big Data analysis on entire data set

(17)

Big Data Stack Components and used Interaction

• Ingestion: provides an interface to the streaming data sources, including data transformation and normalisation

– Direct Ingestion using the data generating API at “wire speed”

– Message Queue: effective for heterogeneous data sources with different speed/velocity

• Streaming analytics and real-time decisions

– Streaming analytics uses non-query time-window model

– Real-time analytics uses continuous query OLTP model

• Data export, data store and historical analytics

– Transform, normalize, distribute and integrate to one or more of the Data Warehouse or storage platforms

– Uses periodic queries with the OLAP or data warehouse model

– Data are stored for future Big Data analysis
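The non-query time-window model used by streaming analytics can be sketched as follows. This is a minimal illustration, assuming events arrive as `(timestamp_seconds, key)` pairs and using fixed, non-overlapping (tumbling) windows; the event data, the window length, and the threshold rule are hypothetical.

```python
from collections import Counter, defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows."""
    windows = defaultdict(Counter)
    for timestamp, key in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


def fire_triggers(window_counts, threshold):
    """Rule: report (window_start, key) whenever a count reaches the threshold."""
    return [
        (start, key)
        for start, counts in window_counts.items()
        for key, count in counts.items()
        if count >= threshold
    ]


# Hypothetical event stream: (seconds since start, event type).
events = [
    (0.5, "login"), (1.2, "login"), (3.9, "error"),
    (10.1, "login"), (11.0, "error"), (12.5, "error"), (19.9, "error"),
]
counts = tumbling_window_counts(events, window_seconds=10)
alerts = fire_triggers(counts, threshold=3)
```

No query is issued against stored data: counting, statistics, rules, and triggers are evaluated on each window as the events flow through.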

(18)

Important Big Data Technologies

By stack component:

• Data Ingestion, Message/Data Queue: Data Factory, Kafka, Flume, Scribe, Samza, Dataflow, Kinesis, Event Hubs
• Streaming Analytics: Stream Analytics, Storm, Cascading, S4, Crunch, Spark-Streaming
• Real-Time Decisions: VoltDB
• Data Export: Sqoop, HDFS
• Data Store & Historical Analytics: HDInsight, DocumentDB, Hadoop, Hive, Spark SQL, Druid, BigQuery, DynamoDB, Vertica, MongoDB, EMR, CouchDB

By provider:

• Microsoft Azure: Event Hubs, Data Factory, Stream Analytics, HDInsight, DocumentDB
• Google GCE: DataFlow, BigQuery
• Amazon AWS: Kinesis, EMR, DynamoDB
• Open Source: Samza, Kafka, Flume, Scribe, Storm, Cascading, S4, Crunch, Spark, Hadoop, Hive, Druid, MongoDB, CouchDB, VoltDB
• Proprietary: Vertica, HortonWorks

(19)

The Big Data Platform is Cloud!

• Interesting applications are data hungry

• The data grows over time

• The data is immobile

• Compute comes to the data

• Big Data clusters are the new libraries

• Provide high-performance execution over Big Data repositories
→ Many spindles, many CPUs
→ Parallel processing

• Enable multiple services to access a repository concurrently

• Enable low-latency scaling of services

• Enable each service to leverage its own software stack
→ IaaS, file-system protections where needed

• Enable rapid resource scaling for power/demand
→ Scaling-aware storage

• Enable slow resource scaling for growth

(Slide labels: Requirements, Solution)

(20)

Cloud Platform Benefits for Big Data

• Segregated networks isolate traffic

– Cloud construction provides separate networks for each type of traffic

– Big Data applications benefit from the lowest possible latencies for node-to-node synchronization, dynamic cluster resizing, and other scale-out operations

• Cloud deployment on virtual machines, containers, and bare metal

– For a traditional, highly load-variable problem one might consider using VMs as the deployment vehicle.

– Some Clouds will offer Container based isolation instead of VMs.

• Cloud tools for large-scale application deployment, supported by major IDEs

(21)

Cloud-powered Services Development Lifecycle

• Easily creates a test environment close to the real one

• Powered by cloud deployment automation tools

– To enable configuration management and orchestration, deployment automation

• Continuous development – test – integration

– CloudFormation Template, Configuration Template, Bootstrap Template

• Can be used with Puppet and Chef, two configuration and deployment management systems for clouds

[ref] “Building Powerful Web Applications in the AWS Cloud” by Louis Columbus

http://softwarestrategiesblog.com/2011/03/10/building-powerful-web-applications-in-the-aws-cloud/

[Figure: Amazon CloudFormation + Chef (or Puppet) across the lifecycle stages]

• Dev: All in one instance
• Test/QA: 2-tier app with test DB
• Staging: 2-tier app with production data
• Prod: 2-tier app with production data, Multi-AZ and HA

(22)

Data Protection and Privacy in Cloud

Data protection and privacy will remain the main restricting factor for cloud growth and will stimulate the following research and developments

• National and international data protection regulation and legislation that should be supported by corresponding certification and assessment procedures, services and bodies

• New advanced privacy enhancement technologies (PET) should protect user privacy on stored data and activity in cloud

– Emergence of Big Data technologies will require new PET preventing user (re-)identification based on correlation between multiple sources of data

• The shared security model currently used in cloud will increase the level of customer controlled security (and privacy) compliance monitoring

• Cloud access control models will continue adopting federated access control and Identity management models

– While based on an initially verified user identity, the access control methods should not reveal user identity to cloud service providers and 3rd party services

• Data Encryption and key management

– New encrypted data processing models such as homomorphic encryption

– New key management models
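The re-identification risk named in the list above (correlation between multiple sources of data) can be shown with a small sketch. It assumes two entirely fictitious datasets: an "anonymised" one without names and a public one with names, which share the quasi-identifiers `zip` and `birth_year`. All field names and records are made up for illustration.

```python
def reidentify(anonymised, public, quasi_identifiers):
    """Link anonymised rows to names when the quasi-identifiers match exactly one person."""
    index = {}
    for person in public:
        key = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(person["name"])

    matches = {}
    for row in anonymised:
        key = tuple(row[q] for q in quasi_identifiers)
        candidates = index.get(key, [])
        if len(candidates) == 1:          # unique match: the record is re-identified
            matches[row["record_id"]] = candidates[0]
    return matches


# Fictitious data for illustration only.
anonymised_records = [
    {"record_id": "r1", "zip": "1098", "birth_year": 1980, "diagnosis": "A"},
    {"record_id": "r2", "zip": "1012", "birth_year": 1975, "diagnosis": "B"},
]
public_register = [
    {"name": "Alice", "zip": "1098", "birth_year": 1980},
    {"name": "Bob", "zip": "1012", "birth_year": 1975},
    {"name": "Carol", "zip": "1012", "birth_year": 1975},
]
reidentified = reidentify(anonymised_records, public_register, ["zip", "birth_year"])
```

Record `r1` is linked to a single person, while `r2` stays ambiguous because two people share its quasi-identifiers; privacy enhancement technologies aim to make every record behave like `r2`.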

(23)

Trustworthiness of Cloud Computing Platform and Trust Management

Trusted cloud platform includes at least two aspects: platform trustworthiness and trust management

• Trustworthiness is a property of the cloud platform that includes multiple factors, including design principles, implementation, technical mechanisms and operational methods

– Trustworthiness by design is key to creating a trusted cloud platform

– Cloud providers pay strong attention to all aspects of their platform to create a trusted environment for user services and data

• Trust management includes both technical mechanisms and solutions for establishing and managing trust relations between provider and customer or user administrative and security domains

– Technical trust is rooted in the trusted platform and typically requires manual setup

– Cloud will stimulate further development of dynamic trust establishment and management mechanisms to adapt to the dynamic and on-demand character of cloud services

– Trusted Computing Platform (TCP) Architecture by Trusted Computing Group (TCG) provides a technical basis for building a distributed trusted environment

• TCP-TCG is based on the hardware based Trusted Platform Module (TPM) that allows tamper-resistant (hardware based) secret key storage and operates as a Root of Trust

• TCG website: http://www.trustedcomputinggroup.org/

• Trust management mechanisms are typically hidden to users but require precise

(24)

Big Data platforms and Providers

• AWS Big Data services

• Microsoft Azure Analytics Platform System and

HDInsight

(25)

Cloud HPC and Big Data Platforms

• HPC on cloud platform

– Special HPC and GPU VM instances as well as Hadoop/HPC clusters offered by CSPs

• Amazon Big Data services

– Amazon Elastic MapReduce

• Microsoft Analytics Platform System (APS)

– Microsoft HDInsight

• LexisNexis HPC Cluster System

– Combining both HPC cluster platform and optimized data processing languages

• Streaming analytics/processing tools

– Apache Kafka (http://kafka.apache.org/)

• High throughput distributed subscription messaging system used for real-time stream processing

– Apache Storm (https://storm.incubator.apache.org/)

• Distributed realtime computation system for unbounded streams of data

– Apache Spark (https://spark.apache.org/)

• A new general-purpose fast cluster computing system that aims to combine batch and real-time streaming data processing

• Spark provides primitives for in-memory cluster computing that makes it well suited to machine learning algorithms

(26)

Cloud and High Performance Computing (HPC)

• Current trend: increasing demand for HPC for Big Data Analytics and Business Intelligence

– Data Analytics platforms and services are provided by the majority of big Cloud Service Providers as designated Data Analytics, Hadoop and HPC clusters

• Problem: Cloud Computing is not designed for HPC but rather for scalability and parallelization

– Tightly coupled HPC workloads do not scale well in a VM/cloud environment

• Restricting factors for HPC on cloud platform

– High cost of data access/download from cloud

– High-end VM configurations optimized for compute are not offered by all clouds

• Computational challenges of Big Data can be solved by aggregating compute power and parallelisation of tasks

– 24 hours on 1 machine can be 1 hour on 24 machines

– These 24 machines require supporting infrastructure that will do: orchestration; jobs division into tasks, and distribution of tasks across a computer cluster.
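The supporting infrastructure described above (dividing a job into tasks and distributing them across a cluster) can be sketched on a single machine. In this minimal illustration a thread pool stands in for the cluster nodes, and the "job" is a hypothetical sum of squares; a real deployment would use a cluster scheduler rather than `ThreadPoolExecutor`. The 24x speed-up is the ideal case and ignores orchestration overhead.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(items, num_tasks):
    """Divide a job (a list of items) into num_tasks roughly equal chunks."""
    size, extra = divmod(len(items), num_tasks)
    tasks, start = [], 0
    for i in range(num_tasks):
        end = start + size + (1 if i < extra else 0)
        tasks.append(items[start:end])
        start = end
    return tasks


def process_task(chunk):
    """Work done independently by one worker (hypothetical workload)."""
    return sum(x * x for x in chunk)


def run_job(items, num_workers):
    """Orchestrate: split the job, distribute the tasks, combine partial results."""
    tasks = split_into_tasks(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(process_task, tasks))
    return sum(partials)


def ideal_wall_clock_hours(single_machine_hours, machines):
    """Best-case run time when the work parallelises perfectly."""
    return single_machine_hours / machines


result = run_job(list(range(100)), num_workers=24)
hours = ideal_wall_clock_hours(24, 24)
```

The scheme only approaches the ideal figure when tasks are independent; tightly coupled workloads spend extra time on communication between workers.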

(27)

Amazon EC2 HPC oriented AMI instances

• Amazon EC2 offers dedicated computation optimized VMs and “cluster instances” that deliver better performance to HPC users

– Two Cluster Compute instances that provide a very large amount of CPU coupled with increased network performance (10GigE). Instances come in two sizes:

• Nehalem-based “Quadruple Extra Large Instance” (eight cores/node, 23GB of RAM, 1.7TB of local storage)

• Sandy Bridge-based “Eight Extra Large Instance” (16 cores/node, 60.5GB of RAM, 3.4TB of local storage).

– Besides the Amazon cluster instance, users may select from one of several preconfigured clusters

– Additionally, two other specialized instances for HPC clusters

• Cluster GPU instance that provides two NVidia Tesla Fermi M2050 GPUs with proportionally high CPU and 10GigE network performance.

• High-I/O instance that provides two SSD-based volumes, each with 1024GB of storage.

– Pricing can vary depending upon on-demand, scheduled, or spot purchase of resources

• Quadruple Extra Large Instance is US$ 1.3/hour (about US$ 0.16/core·hour at eight cores/node)

• Eight Extra Large Instance is US$ 2.4/hour (US$ 0.15/core·hour)

• Cluster GPU instance is US$ 2.1/hour, and the High I/O instance is US$ 3.1/hour

• For example

– Small usage case (80 cores, 4GB of RAM per core, and basic storage of 500GB) would cost US$ 24.00/hour (10 Eight Extra Large Instances)

– Larger usage case (256 cores, 4GB of RAM per core, and 1TB of fast global storage) would cost US$ 38.4/hour (16 Eight Extra Large Instances)

– Once created, the instances must be provisioned and configured to work as a cluster by the user

• Total cost depends on compute time, total data storage and data transfer cost

– Amazon does not charge for data transferred into EC2 but has a varying rate schedule for transfer out of the cloud; additionally, there are EC2 storage costs
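The compute part of the cost can be worked through with a short sketch. It uses only the hourly prices and core counts quoted on this slide and covers compute time only; storage and outbound transfer charges are not modelled. The dictionary keys and function names are illustrative.

```python
import math
from decimal import Decimal

# Hourly prices and core counts as quoted on the slide (2015 figures).
HOURLY_PRICE_USD = {
    "quadruple_extra_large": Decimal("1.3"),
    "eight_extra_large": Decimal("2.4"),
    "cluster_gpu": Decimal("2.1"),
    "high_io": Decimal("3.1"),
}
CORES_PER_INSTANCE = {
    "quadruple_extra_large": 8,
    "eight_extra_large": 16,
}


def instances_needed(cores, instance_type):
    """Smallest number of instances that provides at least the requested cores."""
    return math.ceil(cores / CORES_PER_INSTANCE[instance_type])


def cluster_cost_per_hour(instance_type, count):
    """Compute-only hourly cost of a cluster of identical instances."""
    return HOURLY_PRICE_USD[instance_type] * count


def price_per_core_hour(instance_type):
    return HOURLY_PRICE_USD[instance_type] / CORES_PER_INSTANCE[instance_type]


# Larger usage case from the slide: 256 cores on Eight Extra Large Instances.
large_count = instances_needed(256, "eight_extra_large")
large_cost = cluster_cost_per_hour("eight_extra_large", large_count)
# Small usage case as quoted on the slide: 10 Eight Extra Large Instances.
small_cost = cluster_cost_per_hour("eight_extra_large", 10)
```

`Decimal` is used instead of floating point so that the currency amounts match the quoted figures exactly.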

(28)

Example: Univa HPC with RightScale and Amazon AWS

• Univa offers “one-click” HPC computing with their partner RightScale

• First stage/offering: Create an internal cloud infrastructure system that provides “raw” virtual machines using VMware, KVM, Xen, or cloud platforms such as OpenStack or VMware vSphere.

– The UniCloud product uses an IaaS layer to create the machine and then provision a raw machine on the fly

– Once the machine is provisioned (on one of IaaS platforms/providers), UniCloud does additional software provisioning and configuration to turn the raw instances into nodes that can then be part of an HPC cluster

• Second stage/offering: Using external cloud infrastructure system, such as Amazon EC2 and AWS to provide the “raw” virtual machines for HPC use

– Additional provisioning and configuration management is done by UniCloud (PaaS), which turns the raw instances into HPC cluster nodes

– This process is automatic, and adding and removing nodes from HPC cluster can be done on demand.

• To launch an external cloud on Amazon EC2, just go to the Amazon Market Place, choose HPC from the list, and launch your cluster.

– Once the cloud is built, you can then install your application

• Univa’s UniCloud package provides a “ready-to-run” HPC cloud with prebuilt “kits” that include Univa Grid Engine

– UniCloud can burst from an internal cloud to an external public cloud or from one public cloud to another

• Univa costs are additive to the AWS instance charges and range from US$ 0.02 up to a max of US$ 0.08/core·hour

– Univa proposed a Medium AWS instance for the small usage case (80 cores, 4GB of RAM per core, basic storage of 500GB) that would cost US$ 14.40/hour [(AWS @ US$ 0.16/hour + Univa @ US$ 0.02/hour)x80].
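The bracketed calculation above can be reproduced with a short sketch. The per-core-hour rates are the ones quoted on this slide; the function name and the fee-range check are illustrative assumptions.

```python
from decimal import Decimal

UNIVA_MIN_FEE = Decimal("0.02")   # US$ per core·hour, lower bound quoted on the slide
UNIVA_MAX_FEE = Decimal("0.08")   # US$ per core·hour, upper bound quoted on the slide


def hourly_cost(cores, aws_per_core_hour, univa_per_core_hour):
    """Univa fee is additive to the AWS charge, both expressed per core·hour."""
    if not UNIVA_MIN_FEE <= univa_per_core_hour <= UNIVA_MAX_FEE:
        raise ValueError("Univa fee outside the quoted range")
    return (aws_per_core_hour + univa_per_core_hour) * cores


# Small usage case: 80 cores at AWS US$ 0.16 plus Univa US$ 0.02 per core·hour.
small_case = hourly_cost(80, Decimal("0.16"), Decimal("0.02"))
# Same case at the maximum quoted Univa fee.
small_case_max_fee = hourly_cost(80, Decimal("0.16"), Decimal("0.08"))
```

With the minimum fee the result is US$ 14.40/hour, as stated on the slide; at the maximum quoted fee the same 80 cores would cost US$ 19.20/hour.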

(29)

AWS Cloud Big Data Services

AWS Cloud offers the following services and resources for Big Data processing:

• EC2 Virtual Machine (VM) instances for HPC, optimized for computing (with multiple cores) and with extended storage for large data processing.

• Amazon Elastic MapReduce (EMR) provides the Hadoop framework on Amazon EC2 and offers a wide range of Hadoop related tools.

• Amazon Kinesis is a managed service for real-time processing of streaming big data (throughput scaling from megabytes to gigabytes of data per second and from hundreds of thousands of different sources).

• Amazon DynamoDB is a highly scalable NoSQL data store with single-digit millisecond response latency.

• Amazon Redshift is a fully managed petabyte-scale data warehouse in the cloud at a cost of less than $1000 per terabyte per year. It uses columnar data storage and can parallelise queries.

• Amazon RDS is a scalable relational database.

• Amazon Glacier is archival storage in AWS for long-term data storage at lower cost

(30)

Microsoft Azure Analytics Platform System (APS)

• Microsoft Azure cloud provides general IaaS services and rich Platform as a Service (PaaS) services.

– Similar to AWS, Microsoft Azure offers special VM instances that have both advanced computational and memory capabilities.

• The Analytics Platform System (APS) combines the Microsoft SQL Server based Parallel Data Warehouse (PDW) platform with HDInsight, an Apache Hadoop based scalable data analytics platform.

– APS includes the PolyBase data querying technology to simplify integration of the PDW SQL data and data from Hadoop.

• The HDInsight Hadoop based platform has been co-developed with Hortonworks

– HDInsight provides comprehensive integration and management functionality for multi-workload data processing on Hadoop platform including batch,

(31)

HDInsight: Microsoft’s Big Data Solution

• HDInsight can run both on Azure Cloud and on Windows Server (on premises)

– Data exchange via PolyBase

• Compatible with and supports all products from the Apache Hadoop stack

• Supports all stages of Big Data processing

• PolyBase is a new technology that allows integrating Microsoft SQL Server based Parallel Data Warehouse (PDW) with Hadoop

• Azure Blob Storage used to persistently store data

– Data are streamed to Hadoop/HDFS for processing and pushed back to Azure Blob Storage

(32)

HDInsight/Hadoop Ecosystem

[Figure: HDInsight/Hadoop ecosystem, including Distributed Storage (HDFS), Distributed Processing (MapReduce), Query (Hive) and ODBC. Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and enhancements; Orange = Data Movement; Green = Packages]

(33)

Hortonworks Data Platform Architecture [ref]

• HDP includes the most recent developments of the Open Source Hadoop suite

• Can run on Linux and on Windows OS

(34)

Hortonworks Data Platform (HDP)

http://hortonworks.com/

• HDP delivers a single integrated Hadoop platform for enterprises

– Provides a data platform for multi-workload data processing across an array of processing methods including batch and interactive to real-time

– Supports key capabilities of an enterprise data platform: Governance, Security and Operations

– YARN and Hadoop Distributed Filesystem (HDFS) are the core components of HDP

• YARN is treated as a datacenter OS and supports multiple access methods (batch, real-time, streaming, in-memory, and more) on a common data set

– YARN is the architectural center of Hadoop that allows data to be processed simultaneously in multiple ways

– Allows creating multi-tenant data analytics applications

• HDP runs natively on Linux and Windows OS

– HDP provides the basis for Microsoft’s HDInsight Service meaning complete portability of data is retained on-premise and in the cloud

– Available in integrated hardware from Teradata

• Hortonworks provides a simple starter solution, the Hortonworks Sandbox

– Hortonworks Sandbox is a single-node implementation of Hadoop based on the Hortonworks Data Platform that includes all the typical components found in a Hadoop deployment

(35)

LexisNexis HPCC Systems as an integrated Open Source platform for Big Data Analytics

The HPCC Systems data analytics environment and architecture model are based on a distributed, shared-nothing architecture and contain two clusters:

• THOR Data Refinery: Massively parallel Extract, Transform, and Load (ETL) engine that can be used for a variety of tasks such as massive joins, merges, sorts, transformations, clustering, and scaling.

• ROXIE Data Delivery: Massively parallel, high throughput, structured query response engine.

– Performs volumes of structured queries and full text ranked Boolean search.

– ROXIE also provides real-time analytics capabilities, to address real-time classifications, prediction, fraud detection and other problems that normally require stream analytics.

Other components of the HPCC environment

• Enterprise Control Language (ECL): An open source, data-centric declarative programming language used by both THOR and ROXIE for large-scale data management and query processing.

• ECL compiler and job server (ECL Server): Serves as the code generator and compiler that translates ECL code.

• System data store (Dali): Used for environment configuration, message queue maintenance, and enforcement of LDAP security restrictions.

• Archiving server (Sasha): Serves as a companion ‘housekeeping’ server to Dali.

• Distributed File Utility (DFU Server): Controls the spraying and despraying operations used to move data onto and out of THOR.

• The inter-component communication server (ESP Server): Supports multi-protocol communication between services to enable various types of functionality to client applications via multiple protocols.

(36)

LexisNexis HPCC Systems Architecture

• THOR is used for massive data processing in batch mode for ETL processing

(37)

LexisNexis HPCC Systems Data Analytics Environment

• Unstructured and structured content
– Ingest disparate data sources
– Manage hundreds of TB of data

• Data Refinery, Linking Fusion
– Parallel Processing Architecture
– Discover non-obvious links and detect hidden patterns
– Language optimised for data-intensive Application

• Analysis Applications
– Entity resolution
– Link Analysis
– Clustering Analysis
– Complex Analysis

(38)

LexisNexis Data Analytics Languages

• Enterprise Control Language (ECL): An open source, data-centric declarative programming language

– The declarative character of ECL language simplifies coding

– ECL is developed to simplify both data query design and customary data transformation programming

– ECL is explicitly parallel and relies on the platform parallelism.

• LexisNexis proprietary record linkage technology SALT (Scalable Automated Linking Technology): automates data preparation process: profiling, parsing, cleansing, normalisation, standardisation of data.

– SALT uses weighted matching and threshold based computation; it also enables internal, external and remote linking with external or master datasets.

– Enables the power of the HPCC Systems and ECL

• Knowledge Engineering Language (KEL) is an ongoing development

– KEL is a domain specific data processing language that allows using
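The weighted matching and threshold based computation attributed to SALT above can be illustrated generically. The sketch below is not SALT and is not ECL; it is a plain Python illustration of the general technique, in which each agreeing field contributes a weight and two records are linked when the total reaches a threshold. The field names, weights, threshold, and records are hypothetical.

```python
# Hypothetical field weights (integers avoid floating-point rounding at the threshold).
FIELD_WEIGHTS = {"surname": 50, "birth_year": 30, "city": 20}


def normalise(value):
    """Basic standardisation step: trim whitespace and ignore case."""
    return str(value).strip().lower()


def match_score(a, b, weights=FIELD_WEIGHTS):
    """Sum the weights of all fields on which the two records agree."""
    score = 0
    for field_name, weight in weights.items():
        if field_name in a and field_name in b:
            if normalise(a[field_name]) == normalise(b[field_name]):
                score += weight
    return score


def link(records, master, threshold=70):
    """Return (record_index, master_index) pairs whose score reaches the threshold."""
    return [
        (i, j)
        for i, rec in enumerate(records)
        for j, ref in enumerate(master)
        if match_score(rec, ref) >= threshold
    ]


# Fictitious records for illustration only.
incoming = [
    {"surname": " Smith", "birth_year": 1980, "city": "Amsterdam"},
    {"surname": "Jones", "birth_year": 1975, "city": "Utrecht"},
]
master_dataset = [
    {"surname": "smith", "birth_year": 1980, "city": "Rotterdam"},
    {"surname": "Jones", "birth_year": 1990, "city": "Delft"},
]
links = link(incoming, master_dataset)
```

The first incoming record is linked (surname and birth year agree, score 80), while the second is not (only the surname agrees, score 50).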

(39)

Possible research topics on Cloud Computing and Big Data Infrastructure

• Control and management plane mechanisms for heterogeneous intercloud/multicloud infrastructure and optimal resources provisioning and management

• Federated cloud infrastructures: user and resources provisioning under distributed QoS requirements and policy restrictions

• Cloud powered software and applications engineering for continuous service/quality improvement by converging development and operation (DevOps)

• Cloud based infrastructures for Big Data processing to match 3 V of Big Data: Volume, Velocity, Variety, including on-demand provisioning of Big Data services

• Security, access control and trust management in dynamically provisioned cloud based applications in intercloud/multi-cloud environment

• Data centric security models and services for distributed Big Data (scientific) applications

(40)

Summary and take away

• Emerging Big Data technologies facilitate cloud computing development and its support for HPC applications

– The majority of big CSPs, first of all AWS and Microsoft Azure, include in their offerings HPC oriented VM images, as well as special HPC and Hadoop clusters

• AWS cloud provides a number of services for Big Data processing, including Elastic MapReduce, the NoSQL database DynamoDB, the petabyte-scale data warehouse Redshift, and Kinesis for real-time stream processing

• Microsoft Azure Analytics Platform System (APS) combines the Microsoft SQL Server based Parallel Data Warehouse (PDW) platform with the HDInsight Hadoop platform developed in cooperation with Hortonworks

• LexisNexis HPCC Systems provides an integrated non-Hadoop platform for Big Data analytics

• Cloud and Big Data are complex but rich research areas

– The entry level is high, but leading-edge research and engineering problems are in high demand from research infrastructures and industry