• No results found

Introduction to Netezza

N/A
N/A
Protected

Academic year: 2021

Share "Introduction to Netezza"

Copied!
42
0
0

Loading.... (view fulltext now)

Full text

(1)

Introduction to Netezza

(2)

Topics

Netezza Architecture

SQL Differences

Stored Procedure Differences

(3)

Netezza Architecture – TwinFin 12

(4)

Topics

System Architecture

Snippet Processing Unit (SPU)

System Capacity

Know Your Customers. Grow Your Business.

Distribution in a shared-nothing architecture

Zone maps & data organization

(5)

Database Hardware Architecture Terminology

Symmetrical Multi Process Architecture (SMP)

• Multiple processors sharing access to disk and memory

• Processors operate asynchronously

• Examples: Sybase, Sybase IQ, Oracle, SQL Server, DB2

Massive Parallel Process Architecture (MPP)

Know Your Customers. Grow Your Business.

Massive Parallel Process Architecture (MPP)

• Multiple processors, each with separate, dedicated memory and disk

• No hardware is shared between processors

• Processors are slaved to a controller

• Processors operate synchronously

• Implemented many different ways

• Netezza approach is unique

(6)

Netezza Architecture

Host compiles SQL, develops execution plan and sends code Host collects and

returns result sets

Result sent to

user

Host

SQL command sent to host from user

Know Your Customers. Grow Your Business.

snippets to SPUs based on plan

SPU executes code snippet.

All SPUs execute same code synchronously. SPUs pass data between

each other as needed

(7)

Snippet Processing Unit (SPU)

Each SPU has a dedicated 1TB disk

Disk controlled by FPGA (Field

Programmable Gate Array)

CPU loads query code into FPGA

Know Your Customers. Grow Your Business.

CPU loads query code into FPGA

FPGA executes code using memory

and cache

CPU performs additional processing to

result set

CPU communicates to controller and

other SPUs via NIC

(8)

System Capacity – TwinFin 12

12 S-Blades, each have 8 SPUs

4 SPUs are spares for fail over – 92 usable SPUs

Each SPU has a 1TB disk drive

Know Your Customers. Grow Your Business.

Disk divided into 3 330GB partitions:

• Active data area

• Workspace

• Redundant copy of another SPU’s active data area

Nominal 330GB capacity per SPU (uncompressed)

(9)

MPP Shared Nothing Architecture

A SPU only has visibility to the data on its own disk.

A SPU will broadcast data to other SPUs as needed (based on

instructions from the controller) via the NIC.

Broadcast data is received and held in memory or work area by

Know Your Customers. Grow Your Business.

Broadcast data is received and held in memory or work area by

the SPU for use in resolving the query.

Each SPU operates synchronously with all other SPUs. Each

execute the same snippet at the same time.

The controller’s execution plan ensures minimal data transfer

(10)

Data Distribution

Data distribution to a SPU is defined by the DISTRIBUTE ON

clause of the CREATE TABLE command

Each table is distributed across all 92 SPUs

You can distribute RANDOM or specify one or more columns in

Know Your Customers. Grow Your Business.

You can distribute RANDOM or specify one or more columns in

the table as a distribution key

• Controller uses distribution key information when deciding the execution plan

• Joins between tables with the same distribution key that are joined using that key will execute entirely on that SPU without the need to broadcast data (table co-location)

• Joins to tables with RANDOM distribution will require data broadcast. The smaller of the two sets is broadcast

(11)

Effects of Poor Data Distribution

System Capacity

• The system is ‘full’ when any one SPU is full

• A badly skewed distribution of a large table may fill up a SPU prematurely

• Try to keep skew under 10% (5% off average) for very large tables

System Performance

Know Your Customers. Grow Your Business.

System Performance

• SQL code snippets are executed synchronously among all SPUs

• The total execution time for a snippet is the longest time among the SPUs

• The total execution time for a query is the sum of the longest times for each snippet

• Snippet execution time is proportionally related to the amount of data a SPU needs to process

(12)

Data Distribution – Co-Location

Co-Location is the physical placement of related data across

multiple tables onto the same SPU

• Allows a SPU to join co-located data without interaction with other SPUs

• Speeds queries by eliminating data broadcast steps

Tables must have identical distribution key columns

Know Your Customers. Grow Your Business.

Tables must have identical distribution key columns

• RANDOM distribution will not co-locate

• Any difference in columns will result in a different, unrelated SPU assignment

Join expression must include all distribution key columns

• Query compiler cannot assume co-location unless the join between two co-located tables includes equi-joins between all columns specified in the distribution key

• Ensure use by defining as few columns as possible as a distribution key

Don’t bother with small tables

(13)

Data Distribution - Example

Know Your Customers. Grow Your Business.

• All 3 tables share ID_TS as a

common key

• ID_TS is always used when joining

these tables

• ID_TS is the appropriate column to

use as a distribution key

CREATE TABLE TSVALUES (

ID_TSDATE INTEGER NOT NULL, ID_TS INTEGER NOT NULL,

VALUE DOUBLE PRECISION NOT NULL, ID_SRC INTEGER NOT NULL,

ID_STATUS INTEGER NOT NULL, TS_BEGIN TIMESTAMP NOT NULL, TS_END TIMESTAMP NOT NULL, SIGMA DOUBLE PRECISION NOT NULL )

DISTRIBUTE ON (ID_TS)

(14)

Data Distribution – Example

Aginity Workbench

• Provides ability to redistribute existing tables

• Allows you to view a table’s distribution

Distribution display for TSValues:

(15)

Co-Located Queries

The query will execute in parallel across all SPUs without data

sharing because:

SELECT N.ID_TS, N.TICKER, N.NAME, V.VALUE FROM TSNAMES N, TSVALUES V

WHERE N.ID_TS = V.ID_TS

AND V.TS_BEGIN BETWEEN N.TS_BEGIN AND N.TS_END AND V.TS_END = ‘1/1/3000’;

Know Your Customers. Grow Your Business.

sharing because:

• Both tables are distributed on the same key (ID_TS)

• ID_TS is used to join the tables

Identical distribution key definitions across tables ensure rows

with the same distribution key values in those tables reside on

the same disk and SPU.

Run times (493M row result set):

• Co-located: 22.1 seconds

(16)

Distribution Key Advice

Keep distribution key as small as possible

• One column is best

• Ensure full use in joins

Tables with some keys in common will not result in co-location

Use a key that provides even distribution

Know Your Customers. Grow Your Business.

Use a key that provides even distribution

• Avoid skews > 10% in very large tables (> 5% nominal system capacity)

• Avoid skews > 5% in multiple same key tables with total size > 5% nominal system capacity

• Avoid skews > 20% in other tables

• May be necessary to add a column to the distribution key to reduce skew

• Don’t worry about skew in small tables (< 0.1 % nominal system capacity)

Poor Distribution Keys

• Dates – cause query ‘hotspots’ when queries based on date range

• Low cardinality columns

(17)

Zone Maps

A zone map records the high and low values of columns for

rows in a data block on disk

Netezza uses zone maps to skip data blocks that do not satisfy

query predicates, speeding query execution

Know Your Customers. Grow Your Business.

Zone maps are automatically created for Integer, Date and

Timestamp columns

You can specify additional zone map columns in the ORGANIZE

(18)

ORGANIZE ON Clause

Defines a Clustered Base Table

Data is arranged (sorted) by the ORGANIZE ON columns during

GROOM

Improves effectiveness of zone maps

Know Your Customers. Grow Your Business.

Improves effectiveness of zone maps

Specify up to 4 columns

Allowable data types:

• INTEGER, DATE, TIMESTAMP

• CHAR, VARCHAR, NCHAR, NVARCHAR – Only first 8 bytes used

• NUMERIC – up to NUMERIC(18)

• FLOAT, DOUBLE

• BOOL

(19)

Using ORGANIZE ON

Select columns most likely used in WHERE clause

Zone map efficiency diminishes for 3

rd

and 4

th

columns

• Greater chance a wide range of values appearing in a data block

Arrange lower cardinality or “clumpier” columns first

Know Your Customers. Grow Your Business.

Arrange lower cardinality or “clumpier” columns first

• Groups larger numbers of rows first, spread over more data blocks

• Increases likelihood subsequent columns spread over multiple blocks

• “Clumpy” – large portion of rows have few unique values

GROOM table periodically

• Reorganizes data according to ORGANIZE ON specification

(20)

Using ORGANIZE ON - TSValues

All current rows have ts_end = 1/1/3000 (very clumpy)

Current rows are of greatest interest (most often selected)

ts_begin commonly used in most queries

Know Your Customers. Grow Your Business.

Specify ts_end first. Clumps all current rows into contiguous

blocks.

CREATE TABLE TSVALUES

(

ID_TSDATE INTEGER NOT NULL, ID_TS INTEGER NOT NULL,

VALUE DOUBLE PRECISION NOT NULL, ID_SRC INTEGER NOT NULL,

ID_STATUS INTEGER NOT NULL, TS_BEGIN TIMESTAMP NOT NULL, TS_END TIMESTAMP NOT NULL, SIGMA DOUBLE PRECISION NOT NULL )

DISTRIBUTE ON (ID_TS)

(21)

GROOM Command

Organizes rows based on ORGANIZE ON clause

GROOM TABLE TSVALUES

RECORDS ALL

RECLAIM BACKUPSET DEFAULT;

Know Your Customers. Grow Your Business.

Organizes rows based on ORGANIZE ON clause

Reclaims space from deleted rows

• Netezza performs updates by deleting old row and inserting new row

• Deleted rows consume space until table is GROOMed

Fast – 15 minutes for 4.6 billion rows in TSValues

(22)

SQL Differences – Netezza/Sybase

(23)

Topics

General Language Characteristics

Identifiers

Data Types

Know Your Customers. Grow Your Business.

Functions

(24)

General Language Characteristics

Similar to Oracle syntax and interpretation

Commands terminated with semi-colon (;)

• Terminator is required at all times

Uses Oracle’s interpretation of NULL

Know Your Customers. Grow Your Business.

Uses Oracle’s interpretation of NULL

• A zero length string (‘’) is considered NULL

(25)

Identifiers

An identifier names a database object

There are 2 types of Identifiers: regular and delimited

Regular Identifier

• Is case insensitive

Know Your Customers. Grow Your Business.

• Is case insensitive

• Must begin with a letter

• Contain letters, digits, underscores, dollar sign ($)

Delimited Identifier

• Enclosed in double quotes

• Is case sensitive

• May include spaces, other special symbols and reserved words

(26)

Data Types

Does not support LOB types (CLOB, BLOB)

Character types

• CHAR, VARCHAR: ASCII data, 64,000 maximum length

• NCHAR, NVARCHAR: Unicode data, 16,000 maximum length

Know Your Customers. Grow Your Business.

Exact Numeric Types

• BYTEINT 8 bit signed integer

• SMALLINT 16 bit signed integer

• INTEGER 32 bit signed integer

• BIGINT 64 bit signed integer

NUMERIC(p,s), NUMERIC(p) Up to 38 digits precision (p), scale (s) from 0 to p

(27)

Data Types (continued)

Approximate Numeric Types

• FLOAT(p) precision can range from 1 to 15

• REAL same as FLOAT(6) – 4 bytes

• DOUBLE PRECISION same as FLOAT(15) – 8 bytes

Logical Types

Know Your Customers. Grow Your Business.

Logical Types

• BOOLEAN (or BOOL) True/false value, 1 byte

Temporal Types

• Time maintained to microsecond (6 decimal places)

• DATE Date with no time – 4 bytes

• TIME Time with no date – 8 bytes

• TIME WITH TIME ZONE Time with time zone information – 12 bytes

• TIMESTAMP Date and time – 8 bytes

• INTERVAL Time interval, non-standard implementation • Maintained in seconds

(28)

Internal Data Types (pseudo-columns)

ROWID

• Unique row identifier, assigned when row is inserted

• Not sequential in table

• Value range 100,000 - 9,223,372,036,854,775,807

• Will numbers ever repeat?

• If a table contains 2 billion rows and takes 3 minutes to copy

Know Your Customers. Grow Your Business.

• If a table contains 2 billion rows and takes 3 minutes to copy

• And if the table is copied repeatedly and continuously (add 2 billion rows every 3 minutes)

• It will take over 24,000 years to run out of numbers

• Netezza hopes you would have upgraded before then

CREATEXID, DELETEXID

• Transaction ID that created and deleted the row.

• If DELETEXID > 0, row has been deleted

• You can not see deleted rows using SQL

(29)

Operators

Differences from Sybase

• Concatenation ||

• Not Equal <> or !=

• Null test ISNULL or IS NULL

NOTNULL or IS NOT NULL

(30)

Conditional Column Expression Functions

CASE

Searched form:

CASE

WHEN <search-condition-1> THEN <result-1> WHEN <search-condition-2> THEN <result-2> ...

WHEN <search-condition-n> THEN <result-n> ELSE <default-result>

Value form:

CASE <test-value>

WHEN <comparand-value-1> THEN <result-1> WHEN <comparand-value-2> THEN <result-2> ...

WHEN <comparand-value-n> THEN <result-n> ELSE <default-result>

Know Your Customers. Grow Your Business.

DECODE

• Same as Oracle DECODE

decode(<expr>,<search1>,<result1>, …<search N>,<result N>,<default>) ELSE <default-result>

END

Search conditions can be arbitrarily complex and results can be expressions.

ELSE <default-result> END

Test values, comparand values, and results can be expressions.

(31)

Commonly Used Functions

Current time

• Use NOW() or CURRENT_TIMESTAMP

• TIMEOFDAY() returns a verbose date string. Example: ‘Mon Jan 24 16:12:05 2011 EST’

Date Conversion

• Date literal is in MM/DD/YYYY format (default). Example: ‘1/1/3000’

Know Your Customers. Grow Your Business.

• Date literal is in MM/DD/YYYY format (default). Example: ‘1/1/3000’

• TO_DATE(<text>, <template>) Converts string to DATE data type Template describes format of date string.

Example: TO_DATE(‘24 Jan 2011’, ‘DD Mon YYYY’)

• TO_TIMESTAMP (<text>, <template>) Converts string to TIMESTAMP data type

TO_CHAR(date or timestamp, <template>) Converts date or timestamp to string

(32)

SQL Differences - Update

Sybase

Netezza

update tempRaw a

Set Sigma = b.Sigma from TSValues b

where a.ID_TS = b.ID_TS

and a.ID_TSDate = b.ID_TSDate update #tempRaw

Set a.Sigma = b.Sigma from #tempRaw a,TSValues b where a.ID_TS = b.ID_TS

and a.ID_TSDate = b.ID_TSDate

Know Your Customers. Grow Your Business.

Special characters not used

for temporary tables

Table only referenced once

and a.ID_TSDate = b.ID_TSDate and now() < b.ts_end

and a.ID_Status >= 1

and a.ID_StatusSigma <= -1 and a.isdelete = 0;

and a.ID_TSDate = b.ID_TSDate and getdate() < b.ts_end

and a.ID_Status >= 1

and a.ID_StatusSigma <= -1 and IsDelete = 0

(33)

SQL Differences - DELETE

DELETE may only reference one table

• DELETE FROM <table> WHERE <condition>

Cannot directly convert:

DELETE #tempRaw FROM TSValues B , #tempRaw A

where A.ID_TS = B.ID_TS and A.ID_TSDate = B.ID_TSDate

Know Your Customers. Grow Your Business.

where A.ID_TS = B.ID_TS and A.ID_TSDate = B.ID_TSDate

and ( ( A.Value = B.Value and A.ID_Status = B.ID_Status ) or ( B.ID_Status >= 1 and A.ID_Status <= -1 )) AND ( abs(A.Sigma - B.Sigma) < 0.00000001 or A.ID_StatusSigma <= -1)

and getdate() between B.ts_begin and B.ts_end AND A.Comments IS NULL

Use UPDATE then DELETE:

Add a boolean column ‘deleteme’ to tempRaw table. Initialize to FALSE. UPDATE tempRaw a

set deleteme = TRUE FROM TSValues B

WHERE A.ID_TS = B.ID_TS AND A.ID_TSDate = B.ID_TSDate

AND ( ( A.Value = B.Value and A.ID_Status = B.ID_Status ) OR ( B.ID_Status >= 1 and A.ID_Status <= -1 )) AND ( abs(A.Sigma - B.Sigma) < 0.00000001 OR A.ID_StatusSigma <= -1)

AND NOW() BETWEEN B.ts_begin AND B.ts_end AND A.Comments IS NULL; DELETE FROM tempRaw WHERE deleteme;

(34)

SQL Differences

Materialized Views

• View may have only one source table

• Limited to 64 columns

• May include an ORDER BY clause

• Optimize zone maps

• Cannot include an ORGANIZE ON clause

Know Your Customers. Grow Your Business.

• Cannot include an ORGANIZE ON clause

Sequences

• You may specify data type (byteint, smallint, int, bigint)

• Default is bigint (64 bit signed integer)

• Blocks of numbers are allocated to each SPU during execution

(35)

Stored Procedure Differences

(36)

Topics

General Differences

CREATE PROCEDURE statement

Iterating through rows (cursors)

Know Your Customers. Grow Your Business.

Returning result sets

(37)

General Differences

Statements are terminated with semi-colons (;)

Variables are not prefixed with special characters

:= symbol used for variable assignment

a := 1;

Know Your Customers. Grow Your Business.

a := 1;

(38)

CREATE PROCEDURE

Sybase

Netezza

CREATE PROCEDURE sp_TS_ProcessFillValues (@LoaderID int, @ID_SRC int)

AS

DECLARE @err_no int, @err_msg varchar(250), @CurrentTime datetime

<...procedure code ...>

CREATE PROCEDURE sp_TS_ProcessFillValues

(INTEGER, INTEGER)

RETURNS INTEGER EXECUTE AS CALLER LANGUAGE NZPLSQL

AS

begin_proc

DECLARE

Know Your Customers. Grow Your Business.

<...procedure code ...>

GO

DECLARE

LoaderID alias for $1; ID_SRC ALIAS FOR $2;

CurrentTime datetime; EndDate datetime; StartDate datetime; begin <... procedure code ...> end; end_proc;

Parameters are unnamed, referenced by position: $1, $2…

Alias may be used to name parameters

Procedure enclosed in a BEGIN_PROC -END_PROC block

Executable code enclosed in a BEGIN – END block following declarations.

(39)

Iterating Through Rows

Netezza does not support cursor data type

Use looping to iterate through a result set:

FOR record | row IN select_clause LOOP

statements

END LOOP;

Know Your Customers. Grow Your Business. END LOOP;

Alternate form:

FOR record | row IN EXECUTE text_expression LOOP

statements

END LOOP;

Example

DECLARE myrec RECORD;

mytotal DOUBLE PRECISION; BEGIN

FOR myrec IN SELECT * FROM tsvalues LOOP mytotal := mytotal + myrec.value;

(40)

Returning Sets from Stored Procedures

Use RETURNS REFTABLE (<table name>) clause

• <table name> refers to a table defined in the schema

• Used to define the format of the result set being returned (metadata only)

Use REFTABLENAME to refer to the result set

• REFTABLENAME is a system provided variable

Know Your Customers. Grow Your Business.

• REFTABLENAME is a system provided variable

• System generates a unique table name

• Essentially a CREATE TEMP TABLE command

• Generate result set using EXECUTE IMMEDIATE

• Example:

EXECUTE IMMEDIATE ‘INSERT INTO ‘ || REFTABLENAME || ‘ VALUES (…);’

(41)

System Variables and Special Functions

SQLERRM

• Error message relating to last SQL error

ROW_COUNT

• Number of rows affected by last executed SQL statement

Know Your Customers. Grow Your Business.

QUOTE_LITERAL(<expression>)

• Used when building EXECUTE IMMEDIATE strings

• Converts the value of the expression into a properly formatted literal string

• Escapes any special characters as necessary.

QUOTE_IDENT(<expression>)

• Used when building EXECUTE IMMEDIATE strings

• Converts expression into properly formatted database object name string

(42)

Netezza Manuals

Full documentation can be found at:

\\crpnycnaf00n2\ReconDQ\Netezza\Manuals

References

Related documents

Enter the code numbers of the broadcast distribution list and push the star key to conclude the entry.. # If the entry was incorrect: delete code numbers of a broadcast

fields for which tables are need to be remembered like CoS prefix table, etc. 5) There is need of operator to handle programming through HyperTerminal and key telephone

List of tables Table 1 - Distribution of typing skills Table 2 - Different combinations to the flight time Table 3 - Mann-Whitney U-test for the key event features Table 4

The stage joins source sub tables in a massive raw table of all source variables and then pulls the variables we want into the intended super table..

IP forwarding tables are maintained by dynamic routing protocols IP Forwarding Table OSPF Domain RIP Domain BGP OS kernel OSPF Process.. OSPF

– Query groups and filters the fact table using values in the dimension tables. • Queries contain only a few

What type of a foreign key constraint which might decide to automatically rename a stored proc to tables in all sql rename server to execute alter table properties of value.. Your

A relational database is a collection of data stored in two or more tables that share a common column-name [key field] that joins the tables together in a single database.. This