Structured data meets unstructured data in Azure and Hadoop

(1)

Sameer Parve , Blesson John

[email protected] [email protected]

PFE – SQL Server/Analytics Platform System October 30th ₂₀₁₄

(2)

(3)

(4)

4

(5)



(6)

(7)

SQL Server SMP

(8)

(9)

Complete data warehouse solution Flexibility and choice

Massive scalability at a low cost

Microsoft data warehouse vision

Simplified data warehouse management

Make SQL Server the fastest and most affordable database for

(10)

10

Appliance for high-end massively parallel processing (MPP) data warehousing

Ideal for scale or high-performance data marts and EDWs

Data warehouse appliance (fully-integrated software and hardware)

10s of TB – 6 PB

Infiniband & Ethernet

(11)

MS Analytics Platform System (APS)

Pre-Built Hardware + Software Appliance

• Co-engineered with HP, Dell, and Quanta

• Pre-built Hardware

• Pre-installed Software

• Appliance installed in 1-2 days

• Support - Microsoft provides first call support

• Hardware partner provides onsite break/fix support

Plug and Play Built-in Best

(12)

Hardware architecture overview

One standard node type

2–8 core Intel processors Doubled memory to 256 GB

Updating to the newest Infiniband (FDR – 56 GB/sec) Moving from SAN to JBODs

Reducing costs significantly

Moving away from dependency on a handful of key SAN vendors Using Windows Server 2012 technologies to achieve the same level of reliability and robustness

Backup and Landing Zone (LZ) are now reference architectures and not in the appliance

Customers can use their own hardware

Customers can use more than 1 BU or LZ for high availability

Scale unit concept

Base unit:Minimum configuration; populates rack with networking

Scale unit:Adds capacity by 2–3 compute nodes/related storage

Passive unit:Increases high availability (HA) capacity by adding more spares Host 2 (HST02) Host 1 (HST01) Host 3 (HSA01) Host 4 (HSA02) JBOD IB and Ethernet

(13)

Virtual machine architecture overview

General details

All hosts run Windows Server 2012 Standard

All virtual machines run Windows Server 2012 Standard as a guest operating system

All fabric and workload activity happens in Hyper-V virtual machines Fabric virtual machines, MAD01, and CTL share one server

Lower overhead costs especially for small topologies

PDW Agent runs on all hosts and all virtual machines and collects appliance health data on fabric and workload

DWConfig and Admin Console continue to exist Minor extensions expose host-level information

Windows Storage Spaces handles mirroring and spares and enables use of lower cost DAS (JBODs) rather than SAN

PDW workload details

SQL Server 2012 Enterprise Edition (PDW build) control node and compute nodes for PDW workload

Storage details

More files per filegroup

Larger number of spindles in parallel

HST02 HST01 HSA01 HSA02 JBOD IB and Ethernet

Direct attached SAS

CTL MAD AD VMM

Compute 2 Compute 1

• Window Server 2012 Standard

• PDW engine • DMS Manager

• SQL Server 2012 Enterprise Edition (PDW build) • Shell databases just as in AU3+

• Window Server 2012 Standard

• DMS Core

(14)

Seamlessly add capacity – PDW/HDI

Smallest to largest

• Start small with a warehouse capacity of

several terabytes • Add capacity up to 6 PB 53 TB 6 PB Add capacity Add capacity Largest warehouse PB Start small and grow

(15)

PDW table geometries

• Replicated: A table structure that exists as a full copy within each PDW node

CREATE TABLE <TableName> (

<Column Names and Types> )

WITH (DISTRIBUTION = REPLICATE)

• Distributed: A table structure that is hashed and distributed as evenly as possible across all PDW nodes on the appliance

CREATE TABLE <TableName> (

<Column Names and Types> )

(16)

Date Dim Date Dim ID Calendar Year Calendar Qtr Calendar Mo Calendar Day Store Dim Store Dim ID Store Name Store Mgr Store Size Item Dim Prod Dim ID Prod Category Prod Sub Cat Prod Desc Sales Fact Date Dim ID Store Dim ID Prod Dim ID Mktg Camp Id Qty Sold Dollars Sold Promo Dim Mktg Camp ID Camp Name Camp Mgr Camp Start Camp End ID DD SD ID PD SF 2 DD SD ID PD SF 3 DD SD ID PD SF 4 DD SD PD SF 1

SMP system Compute nodes

(17)

SQL Server PDW 2012 Control Architecture Cost-Based Query Optimizer

Shell Appliance (SQL Server) Engine Service Plan St eps Plan St eps Plan St eps Compute Node (SQL Server) Compute Node (SQL Server) Compute Node (SQL Server) Con tr ol Nod e SELECT SELECT foo foo foo foo

(18)

(19)

Querying data – MPP engine

•

The MPP engine is designed for high-performance

queries against large data sets

•

Understanding the query architecture and execution

steps of the PDW is key to writing good queries

•

Control node orchestrates the entire set of operations

across all nodes to satisfy a query

•

Avoid queries that create hot spots and excessive data

(20)

20

What is HADOOP?

•

Solution that allows commodity computers to store data and

process them in parallel fashion.

•

Resilience/fault tolerance is not provided by the use of

(21)

21

Types of Nodes on HDFS

•NameNode – One per cluster-responsible for providing

metadata information about the blocks within the

filesystem,tracking replication and managing filesystem namespace.

•Backup Node – Acts as the backup of NameNode

•DataNodes – Responsible for storage of file blocks and

(22)

22

An Example of NameNode and DataNode

3 1 4 5 2 3 4 1 2 5

(23)

23

Other tasks on HADOOP cluster

•Job Tracker – Responsible for submitting client application

request to task tracker on nodes that contain the data to be processed. It also monitors the task tracker using heart beats and reschedules the task on another task tracker in case of a failure. One per cluster.

•Task Tracker – Responsible for performing map, reduce and

(24)

24

The architecture of HADOOP

BI Reporting

(25)

25

MapReduce

•MapReduce is a framework that allows user to write

applications that take advantage of the fault tolerance provided by HADOOP.

•MapReduce programs transform lists of input data elements

into lists of output data elements. This is done twice once using map method and then using reduce method.

•We do not alter the initial input data. The initial input is

transformed and the transformed output becomes the input for the reducer function.

(26)

26

The Mapper Function

1 2 4 3 1 2 5 4 5 1(1) 2 (1) 4 (1) 4 (1) 5 (1) 5(1) 3(1) 1(1) 2 (1)

(27)

27

The Reduce Function

1(1) 2 (1) 4 (1) 4 (1) 5 (2) 3(1) 1(1) 2 (1) 1 (2) 2 (1) 3 (1) 4 (2) 5 (2)

(28)

28

Pseudo Code for the MapReduce Job

mapper (Inputfilename, filedata): for each number in filedata:

emit (number, 1)

reducer (number, values): sum = 0

for each value in values: sum = sum + value

(29)

Sqoop – Load/Unload Utility

SQL Server SQL Server SQL Server … SQL Server Hadoop Cluster 29 Sqoop

(30)

30

(31)

HDFS Bridge in Polybase

DMS

SQL Server DMS SQL Server

HDFS _Hadoop

Cluster

1. DMS is present on all compute and has been extended

to have HDFS Bridge.

2. HDFS Bridge hides complexity of HDFS.

The DMS components are reused for type conversions.

3. All HDFS file types (text, sequence, RCFiles) supported

through the use of appropriate RecordReaders by the HDFS Bridge. The JAVA class used is InputFormat.

HDFS_HDFS

31

(32)

SQL Server DMS SQL Server DMS

Reading HDFS Files in Parallel

32

HDFS NameNode

HDFS

DataNode DataNodeHDFS DataNodeHDFS

Block buffers NameNode returns locations of blocks of file

(33)

33

External Table Command

•There are two different ways to import data from HDFS to

PDW

I. CREATE EXTERNAL TABLE

II.CREATE TABLE AS SELECT (CTAS)

•There is only one way to export data from PDW to HDFS

I. CREATE EXTERNAL TABLE AS SELECT

(34)

34

Example of CREATE EXTERNAL TABLE - Temp

CREATE EXTERNAL TABLE ClickStream ( url varchar(50), event_date date, user_IP varchar(50) ) WITH ( LOCATION = 'hdfs://10.192.63.147:5000/tpch1GB/ClickStream.txt‘ ,FORMAT_OPTIONS ( FIELD_TERMINATOR = '|') , DATE_FORMAT = ꞌMM/dd/yyyyꞌ ) ) ;

(35)

35

Example of CREATE TABLE AS SELECT –

PART 1 – Persistent

--Create the external table called ClickStream. CREATE EXTERNAL TABLE ClickStreamExt

( url varchar(50), event_date date, user_IP varchar(50) ) WITH ( LOCATION = 'hdfs://MyHadoop:5000/tpch1GB/ClickStream.txt‘ ,FORMAT_OPTIONS ( FIELD_TERMINATOR = '|') ) ;

(36)

36

Example of CREATE TABLE AS SELECT –

PART 2 –Persistent

--Use your own processes to create the Hadoop text-delimited files on the -- Hadoop Cluster.

--Use CREATE TABLE AS SELECT to import the Hadoop data into a new --SQL Server PDW table called ClickStreamPDW

CREATE TABLE ClickStreamPDW WITH

(

CLUSTERED COLUMNSTORE INDEX ,DISTRIBUTION = HASH (user_IP) )

AS

SELECT *

(37)

37

Example of CREATE EXTERNAL TABLE AS

SELECT (Export)

USE AdventureWorksPDW2012;

CREATE EXTERNAL TABLE hdfsCustomer WITH

(

LOCATION = 'hdfs://10.192.63.147:5000/files/Customer , FORMAT_OPTIONS ( FIELD_TERMINATOR = '|')

)

(38)

38

The DROP EXTERNAL TABLE

--Drop an external table from PDW. This does not delete the external data. DROP EXTERNAL TABLE [ database_name . [ dbo ] . | dbo . ]table_name [;]

(39)

(40)