• No results found

Learn how to make ETL and ELT easy with Amazon Redshift

N/A
N/A
Protected

Academic year: 2021

Share "Learn how to make ETL and ELT easy with Amazon Redshift"

Copied!
25
0
0

Loading.... (view fulltext now)

Full text

(1)
(2)

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Learn how to make ETL and ELT easy with Amazon Redshift

A N T 4 1 1 - R

Harshida Patel

Data Warehouse Specialist SA Amazon Web Services

Juan Yu

Data Warehouse Specialist SA Amazon Web Services

(3)

Agenda

Architecture overview

Batch use case and demo

Streaming use case and demo

Auto maintenance and auto WLM Materialized view

Q&A

(4)

Fastest

Most popular Most cost-effective Integrated with

the data lake

(5)

Massively parallel, shared-nothing architecture

Streaming backup/restore from S3

Leader node

SQL endpoint

Stores metadata

Coordinates parallel SQL processing; ML optimizations

Customers are not charged for the leader node for any cluster with two or more nodes

Compute nodes

Local, columnar storage

Executes queries in parallel

Load, backup, restore

Amazon Redshift Spectrum nodes

Execute queries directly against data lake

An Amazon Redshift cluster exists in only 1 subnet/AZ at a time

Amazon Redshift cluster architecture

...

Amazon Redshift Spectrum fleet

Compute nodes

10 GigE (HPC)

JDBC/ODBC

SQL clients/

BI tools

Leader node

Redshift cluster

Amazon S3

Exabyte-scale object storage

Efficient data loads

Streaming backup/restore

(6)

▪ Single row insert, or insert values()

▪ Leader node only

▪ OK for loading small amount of data at low frequency

▪ Bulk loading

▪ Engages compute nodes in parallel

▪ Load from S3, using Copy

▪ Data transformation, using Insert into … select …

Data loading methods: Typical use cases

Data loading

When you COPY to an empty table, optimal column

compression is used automatically

(7)

▪ Use many input files (chunks) as slices, ideally a multiple

▪ E.g., for 16 slices, use 16, 32, 48, etc., input files (chunks)

▪ Compress the data before COPY

▪ File size 1MB–1GB after compression

▪ For reoccurring loading, turn off automatic compression—COMPUPDATE OFF

COPY command: Best practices

Single input file

dc 2.8 x large node (16 slices)

Single input file

Multiple (16) input files

dc 2.8 x large node (16 slices)

Multiple input files

(8)

Use INSERT INTO SELECT

against external Amazon S3 tables

Aggregate incoming data

Select subset of columns and/or rows

Manipulate incoming column data with SQL

Design considerations

Transform semi-structured data

$5/TB of (compressed) data scanned

Data loading via Redshift Spectrum

Amazon Redshift Spectrum query engine

Query across

Amazon Redshift and S3

S3data lake

Amazon Redshift data

(9)

Batch ETL/ELT

VPC AWS Cloud

Availability Zone 1

Amazon EMR

AWS Glue Crawler

AWS Glue Data Catalog

Amazon Redshift

Amazon Simple Storage

Bucket with objects

Bucket with objects

AWS Glue

- AWS Glue crawler

- AWS Glue Data Catalog - AWS Glue jobs

Amazon EMR

Amazon Redshift Spectrum

(10)

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

(11)

Streaming ETL/ELT

VPC AWS Cloud

Availability Zone 1

Amazon Redshift Amazon Kinesis

Data Streams

Amazon Kinesis Data Firehose AWS Lambda

Bucket with objects

AWS Lambda

Amazon Kinesis Data Streams Amazon Kinesis Data Firehose

Amazon Managed Streaming for Kafka (MSK)

Amazon Managed Streaming for Kafka

(12)

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

(13)

Best practices : ELT

Batch ingestion, avoid small write

Wrap workflow/statements in an explicit transaction

Consider using DROP TABLE or TRUNCATE instead of DELETE Handle upsert by using Delete + Insert

Staging tables :

Use temporary table or permanent table with the “BACKUP NO” option

For copying a large number of rows (> hundreds of millions), consider using

ALTER TABLE APPEND instead of INSERT INTO SELECT

(14)

Run stored procedures in Amazon Redshift

stored procedure

Migrating to Amazon Redshift is even

easier!

Support for stored procedures

provides the ability

to run code

(15)

Automated maintenance

Moving toward

zero maintenance

(16)

Automated administration

Automates data

distribution in tables for improved

performance and disk space utilization

Provides intelligent recommendations for tuning based on continuous workload analysis

ALL

keyA keyB keyC keyD

Node 1 Slice 1 Slice 2

Node 2 Slice 3 Slice 4

EVEN

Node 1 Slice 1 Slice 2

Node 2 Slice 3 Slice 4

KEY

Node 1 Slice 1 Slice 2

Node 2 Slice 3 Slice 4

recommends distribution key

Clusters are always optimized for

performance

Advise

(17)

Amazon Redshift: Auto WLM

Auto WLM

Easy to maximize query throughput and get consistent performance

• Amazon Redshift automatically manages query concurrency and memory

allocation through a single WLM queue

• Auto WLM will automatically determine the amount of resources that queries

need, and adjust the concurrency based on the workload

• Ensures higher-priority workloads get

preferential treatment via query priority

(18)

Amazon Redshift elastic resize

to Amazon Redshift

cluster in busy periods

transition time and storage on demand

Scale up and down in minutes

Amazon

Redshift cluster

Compute nodes

Amazon Redshift managed S3

JDBC/ODBC

Leader node

CN2

CN1 CN3 CN4

Backup

(19)

• Automatic REFRESH

• Access live data from base tables

• Automatic query rewrite

Amazon Redshift materialized views: Preview

Customer benefits

➢ Significantly boost performance of

predictable analytical & dashboarding workloads

➢ Simplify and optimize ETL/BI data pipelines o Modern data warehouses

o Operational analytics

➢ Easier and faster migration to Amazon Redshift

➢ Automatic query rewrite to leverage MVs for base table queries

2

Define, manage materialized views on one or more tables

Efficient incremental REFRESH

o

INNER JOINs

o

Associative aggregate functions

o

WHERE, GROUPBY, HAVING

Access stored data in MVs

Upcoming functionality

1

(20)

Materialized views

C1 C2 C3 C4 R1

R2 R3

c1 c2 c3 r1

r2 r3

c2 c3 c1 c4

r1 r7 r9

(Projection, filters, aggregates, JOINs)

Base table1 Base table2

MV2

➢ Created from a single or multiple base tables

Projections – Select subset of columns

Filters - Select subset of rows

INNER JOINs from multiple base tables

Aggregates, GROUPBY, HAVING

o Summarize column values from matching rows

o Associative aggregate functions

Precomputed results persisted in the materialized view

➢ Base table changes incrementally refreshed

C1 C2 R1

R5 R9

MV1

(21)

Many third-party tools

(22)

Next steps

• Apply best practices to your ETL jobs

• Leverage Amazon Redshift new

features

(23)

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Learn big data with AWS training and certification

Visit aws.amazon.com/training/paths-specialty/

New free digital course, Data Analytics Fundamentals, introduces Amazon S3, Amazon Kinesis, Amazon EMR,

AWS Glue, and Amazon Redshift

Validate expertise with the AWS Certified Big Data - Specialty exam or the new AWS Certified Data Analytics - Specialty beta exam

Resources created by the experts at AWS to help you build and validate data analytics skills

Classroom offerings, including Big Data on AWS, feature

AWS expert instructors and hands-on labs

(24)

Thank you!

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

(25)

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

References

Related documents

No later than December 31, 2015, the Administration will work through the Federal agencies to develop an open source software policy that, together with the Digital Services

Something called is, physics lab manual loyd emphasizes conceptual understanding of each of the data find the true or a beam?. Linearity is zero, physics lab and a vernier caliper

Students will learn how to create a data warehouse with Microsoft SQL Server with Azure SQL Data Warehouse, to implement ETL with SQL Server Integration Services, and to validate

Extract data from your data warehouse and migrate to Amazon Redshift • Extracts data through local migration agents. • Data is optimized for Amazon Redshift and saved in

When completed, the CloudFormation template will have created the environment described in the following diagram:.. 18 This environment has configured the Secure Agent Security

•  Fast – executed in parallel on all data nodes in the Amazon Redshift cluster. •  Can be limited to use a certain percentage of provisioned throughput on the

If you’re using a NoSQL database and need to analyze hundreds of millions of ad impressions, game metrics, or social media hashtags, then use Amazon Redshift as a data warehouse,

Students will learn how to create a data warehouse with Microsoft SQL Server 2014, implement ETL with SQL Server Integration Services, and validate and cleanse data with SQL