Learn how to make ETL and ELT easy with Amazon Redshift

(1)

(2)

Learn how to make ETL and ELT easy with Amazon Redshift

A N T 4 1 1 - R

Harshida Patel

Data Warehouse Specialist SA Amazon Web Services

Juan Yu

Data Warehouse Specialist SA Amazon Web Services

(3)

Agenda

Architecture overview

Batch use case and demo

Streaming use case and demo

Auto maintenance and auto WLM Materialized view

Q&A

(4)

Fastest

Most popular Most cost-effective Integrated with

the data lake

(5)

• Massively parallel, shared-nothing architecture

• Streaming backup/restore from S3

• Leader node

• SQL endpoint

• Stores metadata

• Coordinates parallel SQL processing; ML optimizations

• Customers are not charged for the leader node for any cluster with two or more nodes

• Compute nodes

• Local, columnar storage

• Executes queries in parallel

• Load, backup, restore

• Amazon Redshift Spectrum nodes

• Execute queries directly against data lake

• An Amazon Redshift cluster exists in only 1 subnet/AZ at a time

Amazon Redshift cluster architecture

...

Amazon Redshift Spectrum fleet

Compute nodes

10 GigE (HPC)

JDBC/ODBC

SQL clients/

BI tools

Leader node

Redshift cluster

Amazon S3

Exabyte-scale object storage

Efficient data loads

Streaming backup/restore

(6)

▪ Single row insert, or insert values()

▪ Leader node only

▪ OK for loading small amount of data at low frequency

▪ Bulk loading

▪ Engages compute nodes in parallel

▪ Load from S3, using Copy

▪ Data transformation, using Insert into … select …

Data loading methods: Typical use cases

Data loading

When you COPY to an empty table, optimal column

compression is used automatically

(7)

▪ Use many input files (chunks) as slices, ideally a multiple

▪ E.g., for 16 slices, use 16, 32, 48, etc., input files (chunks)

▪ Compress the data before COPY

▪ File size 1MB–1GB after compression

▪ For reoccurring loading, turn off automatic compression—COMPUPDATE OFF

COPY command: Best practices

Single input file

dc 2.8 x large node (16 slices)

Single input file

Multiple (16) input files

dc 2.8 x large node (16 slices)

Multiple input files

(8)

Use INSERT INTO SELECT

against external Amazon S3 tables

• Aggregate incoming data

• Select subset of columns and/or rows

• Manipulate incoming column data with SQL

Design considerations

• Transform semi-structured data

• $5/TB of (compressed) data scanned

Data loading via Redshift Spectrum

Amazon Redshift Spectrum query engine

Query across

Amazon Redshift and S3

S3data lake

Amazon Redshift data

(9)

Batch ETL/ELT

VPC AWS Cloud

Availability Zone 1

Amazon EMR

AWS Glue Crawler

AWS Glue Data Catalog

Amazon Redshift

Amazon Simple Storage

Bucket with objects

AWS Glue

- AWS Glue crawler

- AWS Glue Data Catalog - AWS Glue jobs

Amazon EMR

Amazon Redshift Spectrum

(10)

(11)

Streaming ETL/ELT

VPC AWS Cloud

Availability Zone 1

Amazon Redshift Amazon Kinesis

Data Streams

Amazon Kinesis Data Firehose AWS Lambda

Bucket with objects

AWS Lambda

Amazon Kinesis Data Streams Amazon Kinesis Data Firehose

Amazon Managed Streaming for Kafka (MSK)

Amazon Managed Streaming for Kafka

(12)

(13)

Best practices : ELT

Batch ingestion, avoid small write

Wrap workflow/statements in an explicit transaction

Consider using DROP TABLE or TRUNCATE instead of DELETE Handle upsert by using Delete + Insert

Staging tables :

•

Use temporary table or permanent table with the “BACKUP NO” option

•

For copying a large number of rows (> hundreds of millions), consider using

ALTER TABLE APPEND instead of INSERT INTO SELECT

(14)

Run stored procedures in Amazon Redshift

stored procedure

Migrating to Amazon Redshift is even

easier!

Support for stored procedures

provides the ability

to run code

(15)

Automated maintenance

Moving toward

zero maintenance

(16)

Automated administration

Automates data

distribution in tables for improved

performance and disk space utilization

Provides intelligent recommendations for tuning based on continuous workload analysis

ALL

keyA keyB keyC keyD

Node 1 Slice 1 Slice 2

EVEN

KEY

recommends distribution key

Clusters are always optimized for

performance

Advise

(17)

Amazon Redshift: Auto WLM

Auto WLM

Easy to maximize query throughput and get consistent performance

• Amazon Redshift automatically manages query concurrency and memory

allocation through a single WLM queue

• Auto WLM will automatically determine the amount of resources that queries

need, and adjust the concurrency based on the workload

• Ensures higher-priority workloads get

preferential treatment via query priority

(18)

Amazon Redshift elastic resize

to Amazon Redshift

cluster in busy periods

transition time and storage on demand

Scale up and down in minutes

Amazon

Redshift cluster

Compute nodes

Amazon Redshift managed S3

JDBC/ODBC

Leader node

CN2

CN1 CN3 CN4

Backup

(19)

• Automatic REFRESH

• Access live data from base tables

• Automatic query rewrite

Amazon Redshift materialized views: Preview

Customer benefits

➢ Significantly boost performance of

predictable analytical & dashboarding workloads

➢ Simplify and optimize ETL/BI data pipelines o Modern data warehouses

o Operational analytics

➢ Easier and faster migration to Amazon Redshift

➢ Automatic query rewrite to leverage MVs for base table queries

2

✓

Define, manage materialized views on one or more tables

✓

Efficient incremental REFRESH

o

INNER JOINs

o

Associative aggregate functions

o

WHERE, GROUPBY, HAVING

✓

Access stored data in MVs

Upcoming functionality

1

(20)

Materialized views

C1 C2 C3 C4 R1

R2 R3

c1 c2 c3 r1

r2 r3

c2 c3 c1 c4

r1 r7 r9

(Projection, filters, aggregates, JOINs)

Base table1 Base table2

MV2

➢ Created from a single or multiple base tables

➢ Projections – Select subset of columns

➢ Filters - Select subset of rows

➢ INNER JOINs from multiple base tables

➢ Aggregates, GROUPBY, HAVING

o Summarize column values from matching rows

o Associative aggregate functions

➢ Precomputed results persisted in the materialized view

➢ Base table changes incrementally refreshed

C1 C2 R1

R5 R9

MV1

(21)

Many third-party tools

(22)

Next steps

• Apply best practices to your ETL jobs

• Leverage Amazon Redshift new

features

(23)

Learn big data with AWS training and certification

Visit aws.amazon.com/training/paths-specialty/