© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Learn how to make ETL and ELT easy with Amazon Redshift
A N T 4 1 1 - R
Harshida Patel
Data Warehouse Specialist SA Amazon Web Services
Juan Yu
Data Warehouse Specialist SA Amazon Web Services
Agenda
Architecture overview
Batch use case and demo
Streaming use case and demo
Auto maintenance and auto WLM Materialized view
Q&A
Fastest
Most popular Most cost-effective Integrated with
the data lake
• Massively parallel, shared-nothing architecture
• Streaming backup/restore from S3
• Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing; ML optimizations
• Customers are not charged for the leader node for any cluster with two or more nodes
• Compute nodes
• Local, columnar storage
• Executes queries in parallel
• Load, backup, restore
• Amazon Redshift Spectrum nodes
• Execute queries directly against data lake
• An Amazon Redshift cluster exists in only 1 subnet/AZ at a time
Amazon Redshift cluster architecture
...
Amazon Redshift Spectrum fleet
Compute nodes
10 GigE (HPC)
JDBC/ODBC
SQL clients/
BI tools
Leader node
Redshift cluster
Amazon S3
Exabyte-scale object storage
Efficient data loads
Streaming backup/restore
▪ Single row insert, or insert values()
▪ Leader node only
▪ OK for loading small amount of data at low frequency
▪ Bulk loading
▪ Engages compute nodes in parallel
▪ Load from S3, using Copy
▪ Data transformation, using Insert into … select …
Data loading methods: Typical use cases
Data loading
When you COPY to an empty table, optimal column
compression is used automatically
▪ Use many input files (chunks) as slices, ideally a multiple
▪ E.g., for 16 slices, use 16, 32, 48, etc., input files (chunks)
▪ Compress the data before COPY
▪ File size 1MB–1GB after compression
▪ For reoccurring loading, turn off automatic compression—COMPUPDATE OFF
COPY command: Best practices
Single input file
dc 2.8 x large node (16 slices)
Single input file
Multiple (16) input files
dc 2.8 x large node (16 slices)
Multiple input files
Use INSERT INTO SELECT
against external Amazon S3 tables
• Aggregate incoming data
• Select subset of columns and/or rows
• Manipulate incoming column data with SQL
Design considerations
• Transform semi-structured data
• $5/TB of (compressed) data scanned
Data loading via Redshift Spectrum
Amazon Redshift Spectrum query engine
Query across
Amazon Redshift and S3
S3data lake
Amazon Redshift data
Batch ETL/ELT
VPC AWS Cloud
Availability Zone 1
Amazon EMR
AWS Glue Crawler
AWS Glue Data Catalog
Amazon Redshift
Amazon Simple Storage
Bucket with objects
Bucket with objects
AWS Glue
- AWS Glue crawler
- AWS Glue Data Catalog - AWS Glue jobs
Amazon EMR
Amazon Redshift Spectrum
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Streaming ETL/ELT
VPC AWS Cloud
Availability Zone 1
Amazon Redshift Amazon Kinesis
Data Streams
Amazon Kinesis Data Firehose AWS Lambda
Bucket with objects
AWS Lambda
Amazon Kinesis Data Streams Amazon Kinesis Data Firehose
Amazon Managed Streaming for Kafka (MSK)
Amazon Managed Streaming for Kafka
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Best practices : ELT
Batch ingestion, avoid small write
Wrap workflow/statements in an explicit transaction
Consider using DROP TABLE or TRUNCATE instead of DELETE Handle upsert by using Delete + Insert
Staging tables :
•
Use temporary table or permanent table with the “BACKUP NO” option
•
For copying a large number of rows (> hundreds of millions), consider using
ALTER TABLE APPEND instead of INSERT INTO SELECT
Run stored procedures in Amazon Redshift
stored procedure
Migrating to Amazon Redshift is even
easier!
Support for stored procedures
provides the ability
to run code
Automated maintenance
Moving toward
zero maintenance
Automated administration
Automates data
distribution in tables for improved
performance and disk space utilization
Provides intelligent recommendations for tuning based on continuous workload analysis
ALL
keyA keyB keyC keyD
Node 1 Slice 1 Slice 2
Node 2 Slice 3 Slice 4
EVEN
Node 1 Slice 1 Slice 2
Node 2 Slice 3 Slice 4
KEY
Node 1 Slice 1 Slice 2
Node 2 Slice 3 Slice 4
recommends distribution key
Clusters are always optimized for
performance
Advise
Amazon Redshift: Auto WLM
Auto WLM
Easy to maximize query throughput and get consistent performance
• Amazon Redshift automatically manages query concurrency and memory
allocation through a single WLM queue
• Auto WLM will automatically determine the amount of resources that queries
need, and adjust the concurrency based on the workload
• Ensures higher-priority workloads get
preferential treatment via query priority
Amazon Redshift elastic resize
to Amazon Redshift
cluster in busy periods
transition time and storage on demand
Scale up and down in minutes
Amazon
Redshift cluster
Compute nodes
Amazon Redshift managed S3
JDBC/ODBC
Leader node
CN2
CN1 CN3 CN4
Backup
• Automatic REFRESH
• Access live data from base tables
• Automatic query rewrite
Amazon Redshift materialized views: Preview
Customer benefits
➢ Significantly boost performance of
predictable analytical & dashboarding workloads
➢ Simplify and optimize ETL/BI data pipelines o Modern data warehouses
o Operational analytics
➢ Easier and faster migration to Amazon Redshift
➢ Automatic query rewrite to leverage MVs for base table queries
2
✓
Define, manage materialized views on one or more tables
✓
Efficient incremental REFRESH
o
INNER JOINs
o
Associative aggregate functions
o
WHERE, GROUPBY, HAVING
✓
Access stored data in MVs
Upcoming functionality
1
Materialized views
C1 C2 C3 C4 R1
R2 R3
c1 c2 c3 r1
r2 r3
c2 c3 c1 c4
r1 r7 r9
(Projection, filters, aggregates, JOINs)
Base table1 Base table2
MV2
➢ Created from a single or multiple base tables
➢ Projections – Select subset of columns
➢ Filters - Select subset of rows
➢ INNER JOINs from multiple base tables
➢ Aggregates, GROUPBY, HAVING
o Summarize column values from matching rows
o Associative aggregate functions
➢ Precomputed results persisted in the materialized view
➢ Base table changes incrementally refreshed
C1 C2 R1
R5 R9
MV1
Many third-party tools
Next steps
• Apply best practices to your ETL jobs
• Leverage Amazon Redshift new
features
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Learn big data with AWS training and certification
Visit aws.amazon.com/training/paths-specialty/
New free digital course, Data Analytics Fundamentals, introduces Amazon S3, Amazon Kinesis, Amazon EMR,
AWS Glue, and Amazon Redshift
Validate expertise with the AWS Certified Big Data - Specialty exam or the new AWS Certified Data Analytics - Specialty beta exam
Resources created by the experts at AWS to help you build and validate data analytics skills
Classroom offerings, including Big Data on AWS, feature
AWS expert instructors and hands-on labs
Thank you!
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.