How Oracle Exadata Delivers Extreme Performance

(1)

How Oracle Exadata Delivers Extreme Performance

John Clarke

Sr. Oracle Architect

(2)

Dallas | Detroit | Los Angeles | Singapore | India

Agenda

Ø  Oracle Exadata: Hardware and Software Optimized Ø  Exadata Hardware Components

Ø  How Oracle Works on Exadata Ø  Exadata Software Components

Ø  Smart Scan and Cell Offload Processing Ø  Storage Indexes

Ø  Smart Flash Cache

Ø  Hybrid Columnar Compression Ø  IO Resource Management Ø  How to Get There from Here

Ø  A New Way to Look at Optimization

(3)

Exadata: Hardware and Software Optimized

The Exadata Database Machine is designed for extreme performance, manageability, consolidation, and high-availability. Oracle achieves this by:

Ø  Utilizing an optimally-balanced hardware infrastructure

Ø  Using fast hardware components designed to eliminate IO, CPU, and network bottlenecks

Ø  Delivering significant Oracle software enhancements to leverage balanced hardware configuration

The Exadata Database Machine hardware is just part of the solution – the combination of a balanced

hardware configuration with Exadata’s storage software is what delivers extreme performance.

(4)

Exadata Hardware

Exadata Storage Servers, or “Cells” Database servers, or “Compute nodes” InfiniBand Interconnect

(5)

Exadata Storage Servers

Ø 

Exadata storage servers are self-contained storage platforms that house disk storage and

run the Exadata Storage Server software provided by Oracle

Ø 

A single Exadata storage server is also known as a “cell”

Ø 

A cell is the building block for the Exadata storage grid

Ø 

More cells provide greater capacity and IO bandwidth

Ø 

Databases are typically deployed across multiple cells

Ø 

The storage cells communicate with the Oracle database and ASM instances over

InfiniBand

Ø 

Each cell is wholly dedicated to Oracle database files

(6)

Exadata Storage Servers

Ø 

Exadata storage servers provide high-performance storage for Oracle databases

Ø 

Up to 1.8 Gb/sec raw data bandwidth per cell

Ø 

Up to 75,000 IOPs using Flash per cell

Ø 

Based on 64-bit Sun Fire servers

Ø 

Comes installed with Exadata storage server software, Oracle Linux x86_64, drivers and

utilities

(7)

Exadata Storage Servers

Sun Fire X4270 M2

Component Details

Processors 2 six-core Intel Xeon L5640 Processors (2.26GHz)

Memory 24 GB (6 x 4GB)

Local Disks 12 x 600 GB 15K RPM High Performance SAS, or 12 x 2Tb 7.2k RPM High Capacity SAS

Flash 4 x 96Gb Sun Flash Accelerator F20 PCIe Cards Disk Controller Disk controller HBA with 512 MB battery backed

cache

Network Two InfiniBand 4X QDR (40Gb/s) ports 1 dual-port PCIe HCA

Four embedded GbE ports Remote Management 1 Ethernet port for ILOM

(8)

Exadata Database Servers

Ø 

The compute node grid consists of multiple database servers

Ø 

Oracle 11gR2 and ASM run on the database servers

Ø 

Companies typically run RAC databases on the database server to achieve high

availability and maximize the aggregate CPU and memory horsepower in the compute

node grid

(9)

Exadata Database Servers

Component Details

Processors 2 6-core Intel Xeon X5670 Processors (2.93 GHz)

Memory 96 GB (12 x 8 GB)

Local Disks 4 x 300GB 10K RPM SAS disks

Disk Controller Disk controller HBA with 512MB battery backed cache

Network Two InfiniBand 4X QDR (40Gb/s) ports Four 1GbE Ethernet ports

Two 10GbE Ethernet ports Remote Management 1 Ethernet port for ILOM

Power Supplies 2 redundant hot-swappable power supplies Operating System 64-bit Oracle Enterprise Linux 5.5

(10)

InfiniBand Network

Ø 

InfiniBand is the Exadata Storage Network

Ø 

Looks like normal Ethernet to hosts

Ø 

Efficiency of a SAN

Ø 

Used for both storage and RAC interconnect

(11)

InfiniBand Network

Ø  Oracle uses InfiniBand because of its proven track record with high-performance computing - provides 40 Gb/

sec in each direction

Ø  Looks like Ethernet but is much faster

Ø  Uses “zero copy”, which means data is transferred across the network without intermediate buffer copies in

the various network layers

Ø  Uses “buffer reservation”, which means that the hardware knows exactly where to place buffers ahead of

time

Ø  Unified network fabric for both Exadata storage and RAC interconnect – simplifies cabling and networking Ø  On top of InfiniBand, Exadata uses the Zero Data loss (ZDP) UDP protocol. The ZDP protocol has a very low

CPU overhead with tests showing only a 2 percent CPU utilization while transferring 1 GB/sec of data.

Ø  Each Exadata server is configured with one dual-port InfiniBand card designed to be connected to two separate

InfiniBand switches for high availability. Each InfiniBand link is able to carry the full data bandwidth of the entire cell, which means you can lose an entire network without losing any performance.

(12)

Other Hardware Components

Ø 

Embedded Cisco switch for data center network uplink

Ø 

KVM switch for management of storage server and compute nodes

Ø 

Multiple PDUs with management interfaces

(13)

Exadata Configuration Options

Oracle has four (4) configuration options:

Ø 

Exadata X2-2 Quarter Rack

Ø 

Exadata X2-2 Half Rack

Ø 

Exadata X2-2 Full Rack

Ø 

Exadata X2-8 Full Rack

Ø 

Each configuration is available with either High Performance or High Capacity disks

Ø 

The difference between the configurations is the # of storage servers and compute nodes

Ø 

Oracle has a short list of hardware configuration options because it requires the balanced

hardware configuration

Ø 

A half rack can be upgraded to a full rack, and multiple racks can be interconnected

Ø 

Quarter racks have 2 InfiniBand Switches; Half and Full racks have 3

(14)

Exadata Configuration Options

Compute Nodes X2-2 Quarter Rack X2-2 Half Rack X2-2 Full Rack X2-8 Full Rack

# Compute nodes 2 4 8 2 Processor Cores per Node 12 12 12 64 Total Processor Cores 24 48 96 128 Memory/Node 96 Gb 96 Gb 96 Gb 1 TB Total Memory 192 Gb 384 Gb 768 Gb 2 TB

(15)

Exadata Configuration Options

Exadata Storage Servers

X2-2 Quarter Rack X2-2 Half Rack X2-2 Full Rack X2-8 Full Rack

# Storage cells 3 7 14 14 Disks/Cell 12 12 12 12 Total Disks 36 84 168 168 Flash 1.15 Tb 2.7 Tb 5.4 Tb 5.4 Tb Raw Storage (HP) 21.6 Tb 50.4 Tb 100.8 Tb 100.8 Tb Raw Storage (HC) 72 Tb 168 Tb 336 Tb 336 Tb

(16)

Exadata Configuration Options

Exadata Storage Servers

X2-2 Quarter Rack X2-2 Half Rack X2-2 Full Rack X2-8 Full Rack

Raw disk throughput, High Performance disks Up to 5,400 MBPS Up to 12,600 MBPS Up to 25,200 MBPS Up to 25,200 MBPS Raw disk throughput, High Capacity disks Up to 3,000 MBPS Up to 7,000 MBPS Up to 14,000 MBPS Up to 14,000 MBPS

Disk IOPs, High Performance disks

Up to 10,800 Up to 25,200 Up to 50,400 Up to 50,400

Disk IOPS, High Capacity disks

Up to 4,320 Up to 10,080 Up to 20,160 Up to 20,160

(17)

How Oracle Works on Exadata

•  Oracle 11gR2 •  Oracle RAC

•  Database storage uses ASM

(18)

How Oracle Works on Exadata

Each storage cell contains: •  Exadata Storage Software

•  Disks (LUNs, cell disks, and Grid Disks) •  ASM disk groups built on Exadata Grid

Disks

Exadata features implemented on storage servers and exploited by databases on compute nodes

(19)

How Oracle Works on Exadata

•  Oracle communicates with the Exadata cells using iDB

•  iDB is implemented using LIBCELL

•  Exadata binaries are linked with LIBCELL to facilitate cell

(20)

How Oracle Works on Exadata

•  CELLSRV is the primary

component and provides majority of services

•  Oracle Databases and ASM processes use LIBCELL to communicate to CELLSRV processes

•  CELLSRV is what delivers the unique Exadata features

(21)

Oracle on Exadata

•  Each Exadata cell has 12 physical disks

•  Oracle reserves 29Gb on each of the first two disks in each cell for the “system area”

(22)

Oracle on Exadata

•  One LUN is created on each physical disk

•  A LUN is the unit on which a Cell Disk can be created

(23)

Oracle on Exadata

•  Cell Disks are created on LUNs

•  Cell Disks represent the storage area for each LUN

(24)

Oracle on Exadata

•  Grid Disks are built on Cell Disks

•  A Grid Disk can consume all space on a cell disk or an administrator-specified chunk of the Cell Disk

•  The outermost tracks on a Grid Disk have the highest performance characteristics

•  Building multiple Grid Disks per Cell Disk allows the administrator to segregate storage by usage and performance demands

(25)

Oracle on Exadata

•  Grid Disk are directly exposed to ASM •  ASM Disk Groups are built on Grid Disks

(26)

Exadata Software Features at a Glance

Exadata Software Goals Fully and evenly utilize all

Computing resources in Exadata Database Machine to eliminate bottlenecks and deliver consistent high

performance

Reduce demand for resources

by eliminating any IO that can be discarded without impacting result

(27)

Exadata Smart Scan

⇒  One of the most important Exadata software features

⇒  Typically where Exadata software features discussions start ⇒  One of several “cell offload” features in Exadata

⇒  “cell offload” is defined as Exadata’s shifting of database work from the database servers to

Exadata storage servers

⇒  Primary goal of Smart Scan is to perform majority of IO processing on storage servers and returned

smaller amounts of data from the storage infrastructure to the database servers

⇒  Provides dramatic performance improvements for eligible SQL operations (full table scans, fast full

index scans, joins, etc.)

⇒  Smart Scan is implemented automatically on Exadata for eligible SQL operations ⇒  Smart Scan only works on Exadata

(28)

Traditional SQL Processing

User submits a query

select state,count(*)!

from census!

group by state!

The query is parsed and an execution path is determined. Extents are identified

IO is issued – Oracle retrieves blocks based on extents identified. A block is the smallest unit of transfer from storage system, contains rows and columns

Database instance processes blocks – all required blocks read from storage system to buffer cache

Oracle filters rows and

columns, once loaded to buffer cache, and returns to user

(29)

Smart Scan Processing

User submits a query

select state,count(*)!

from census!

group by state!

An iDB command is constructed and sent to Exadata cells

Each Exadata cell scans data blocks and extracts relevant rows and columns that satisfy the SQL query

Each Exadata cell returns to the database instance an iDB message containing the requested rows and columns. These are not block images and are returned to the PGA The database consolidates

results from across each Exadata cell and returns rows to the client

(30)

Smart Scan and Cell Offload Processing

⇒  Filtering operations are offloaded to the Exadata storage cell

⇒  Column Filtering

⇒  Predicate Filtering (i.e., row filtering)

⇒  Join Filtering

⇒  Only requested rows and columns are returned to the database server ⇒  Significantly less IO transfer over storage network

⇒  Less memory and CPU required on database tier nodes

⇒  Rows/columns retrieved to user’s PGA via direct path read mechanism, not through buffer cache. ⇒  Large IO requests don’t saturate buffer cache

(31)

Smart Scan and Cell Offload Processing …

What’s Eligible for Smart Scan on Exadata?

⇒  Single-table full table scans (not IOTs, not BLOBs, no hash clusters) ⇒  Serial direct reads enabled

⇒  Join filtering using Bloom Filters

⇒  Bloom filtering determines which rows are required to satisfy a join

⇒  A Bloom filter is created with the values of the join column for the smaller table of a join, and

this filter is used by the Exadata storage server to eliminate row candidates from the larger table

Page 31

-  Oracle is a C program -  kcfis_read is function that

does smart can

-  kcfis_read is called by direct path read function, klbldrget

-  klbkdrget is called from full scan functions

(32)

Smart Scan in Action

Our table is 4.53 Gb in

size It’s got almost _{180 million} rows

All its indexes are invisible

A full scan with returned in 5.27 seconds!

(33)

Measuring Smart Scan: cell_offload_plan_display

“cell_offload_plan_display” setting Meaning

AUTO Shows offload-related information in plan display if an Exadata cell is present and the objects are on the cell

ALWAYS Shows offload-related information if the SQL statement is off-loadable, whether or not running on an Exadata cell or not. Useful for 11gR2 databases not running on Exadata to do simulated plans

NEVER Never shows offload-related information in plans.

TABLE ACCESS STORAGE FULL

indicates that the query is cell-off-loadable, eligible for Smart Scan

(34)

Measuring Smart Scan with Statistics

Statistic/Event Meaning

cell physical IO bytes eligible for predicate offloading Number of bytes eligible for cell offload

cell physical IO interconnect bytes Bytes returned from storage cells to database server

(35)

Measuring Smart Scan with Statistics

4.8 Gb eligible for Smart Scan and 45MB returned from storage grid

Smart Scan Efficiency of 99.05%

(36)

Measuring Smart Scan with Statistics

PGA max before Smart Scan is about 4Mb

PGA max after smart scan is over 10Mb

(37)

Controlling Smart Scan Behavior

⇒ 

SQL> alter system set cell_offload_processing=<TRUE|FALSE>;

⇒ 

SQL> alter session set cell_offload_processing=<TRUE|FALSE>;

⇒ 

SQL> select /+ opt_param (‘cell_offload_processing’,’<true|false>’) / ….

Page 37

Control cell offload system-wide

Control cell offload for current session

Control cell offload for SQL statement

(38)

Controlling Smart Scan Behavior

Full scan in 31+ seconds

(39)

(40)

Smart Scan Predicate Filtering

Ran in 6.64 seconds

2.1GB returned via Smart Scan, savings of 56.11%

(41)

Smart Scan Predicate Filtering

45MB returned via Smart Scan, savings of 99.03%

410K rows returned in 5.48 seconds

(42)

Smart Scan Predicate Filtering

8MB returned via Smart Scan, savings of 99.82%

2 rows returned in 5.39 seconds

(43)

Smart Scan Column Projection

1.17 MB of data through interconnect

(44)

Smart Scan Column Projection

651,744 bytes through interconnect

(45)

Smart Scan Column Projection

1,176,032 bytes when selecting BIGCOL 651,744 bytes when selecting BIGCOL 2,048 rows returned (1,874,990 – 651,744) / 2,048 = 524,288 bytes difference measured Predicted delta = 522,240 bytes, which his pretty close

(46)

(47)

Smart Scan Column Projection – Another Example

Query ran in 5.47 seconds Query ran in 13.04 seconds

(48)

Smart Scan: Join Filtering

⇒  In some cases, Oracle will use bloom filters to optimize join filtering ⇒  Bloom filters are implemented on the storage server on Exadata ⇒  Bloom filters work by:

⇒  Examining the columns in a join

⇒  For the smaller table in a join, Oracle determines the space required to store the result of data

and if small compared to the size of the joined (larger) table, may decide to use a bloom filter

⇒  Bloom filters identify which columns are required to satisfy a join and build an in-memory set of

values to compare against the larger table’s rows

⇒  Typically are built to reduce data communication between slaves for PQ joins and work well

with PQO and partitioned tables

(49)

Smart Scan: Join Filtering with offloaded Bloom Filters

Result returned in 7.45 second

SYS_OP_BLOOM_FILTER indicates Bloom Filters. When it happens on “Storage” line, it means it’s offloaded

(50)

Smart Scan: Join Filtering with offloaded Bloom Filters

With

“_bloom_filter_predicate_pushdown_to_storage = false

SYS_OP_BLOOM_FILTER indicates Bloom Filters. Here, does NOT exist on storage line, hence not offloaded

(51)

Smart Scan: Table Scans or Index Scans?

Do we just drop all indexes and let Smart Scan do its magic?

Completed in a little over 3 minutes

Over 6.7 GB returned from cell – note CPU and LIO values …

(52)

Smart Scan: Table Scans or Index Scans?

Do we just drop all indexes and let Smart Scan do its magic?

Completed in less than a second with indexes

Far less IO and less CPU/LIO

Indexes still work better for this type of query!

TEST TEST TEST

(53)

Smart Scan: Disabling Serial Direct Read Mechanism

Query completed in 3.81 seconds

15Gb eligible for cell offload

100Mb transferred over from storage cell

Smart Scan efficiency of 99.34%

(54)

Smart Scan: Disabling Serial Direct Read Mechanism

Disable serial direct reads

Query time increased from 3.8 seconds to 101 seconds

Nothing eligible for cell offload – no Smart Scan

(55)

Querying V$SQL to Determine Cell Off-loadable Queries

(56)

Querying V$SQL to Determine Cell Off-loadable Queries

(57)

Querying V$SQL to Determine Cell Off-loadable Queries

(58)

Smart Scan Summary

⇒  Smart Scan is software engineered for Exadata that can provide significant IO savings and

dramatically improved performance

⇒  Smart Scan works by offloading IO operations to the Exadata storage cell and only returning rows

and columns requested by the SQL statement back to the database instance

⇒  Row filtering provides significant cell scan efficiency gains ⇒  Column filtering provides IO savings

⇒  Join filtering using Bloom Filters, performed on Exadata cells, also can offer huge performance

gains

⇒  Don’t drop all your OLTP indexes without testing – single-row index access will often still outpace

full scan with cell offload

⇒  One of the great things about Exadata and Smart Scan is that nothing needs to be explicitly done by

a developer or DBA to take advantage of it - it “just works” with Exadata

(59)

Exadata Storage Indexes

⇒ 

The goal of storage indexes is to eliminate IO requests to Exadata disks

⇒ 

Storage indexes are NOT at all like traditional Oracle indexes

⇒ 

Storage indexes are like “anti-indexes” – they help Oracle determine which

blocks the requested data are NOT in

⇒ 

Knowing which storage areas will NOT contain the requested data allows

Exadata to skip IO requests to these disks

⇒ 

Bypassing IO to disks reduces work required on the storage grid and

improves performance

(60)

Exadata Storage Indexes: Architecture

⇒ 

Each disk in an Exadata storage cell is divided into 1MB pieces called “storage regions”

⇒ 

For each 1MB storage region, data distribution statistics are held in a memory structure

called a “region index”

⇒ 

Region indexes maintain “min” and “max” values for the storage region as data as

selected from the disk storage region, for up to 8 columns in a table

⇒ 

Region indexes are maintained over time as data is access from the disk

⇒ 

Storage indexes are maintained automatically by Exadata; there is no way to influence

their behavior

⇒ 

Storage server reboots will erase storage index data – region indexes are stored in

memory

⇒ 

Performance impact can be dramatic, but performance impact is not deterministic as its

subject to Exadata’s automatic management of region indexes

(61)

Region index

Exadata Storage Indexes: Architecture

select * from soe.orders where order_total > 60; ASM Disk Storage region ASM AU Min 1 Max 22 Min: 62 Max 100 Min 1 Max 60 Min 1 Max 101 C1 C2 C3 C4 … 1 … … … 59 … … Table: ORDERS C1 C2 C3 C4 … 1 … … … 11 … … Table: ORDERS C1 C2 C3 C4 … 62 … … … 81 … … Table: ORDERS Min = 1, Max = 60 Min = 62, Max = 100 Min = 1, Max = 60

(62)

Exadata Storage Indexes: Scenarios

⇒ 

Data needs to be well-ordered according to query predicates

⇒ 

If all regions contain randomly ordered column values, region index min and max values

will be too widespread to eliminate the region from an IO request

⇒ 

Since Exadata only maintain regions indexes on 8 columns per table, your range of

distinct predicates should be relatively small and static

⇒ 

Storage indexes can return “false-positive”, but will never skip regions and jeopardize

query integrity

select * from soe.orders where order_total between 10 and 15; Min 1 Max 22 Min: 62 Max 100 Min 1 Max 60 Min 1 Max 101

(63)

Exadata Storage Indexes in Action

This is the key statistic: “cell physical IO bytes saved by storage index”

(64)

Exadata Storage Indexes in Action

~ 17Mb of IO saved by storage index

Smart scan efficiency of 99.98%

(65)

Exadata Storage Indexes in Action

Smart scan efficiency of 99.98%

~ 9Gb of IO saved by storage index

(66)

Exadata Storage Indexes in Action

0 rows returned in 0.09 seconds!!

~ 15Gb of IO saved by storage index

(67)

Exadata Storage Indexes in Action

7.8 million rows returned in 4.23 seconds No storage index savings

(68)

Exadata Storage Indexes in Action

7.8 million rows returned in 1.37 seconds Over 12Gb of IO savings with storage indexes

(69)

Exadata Storage Indexes in Action

No storage index savings

(70)

Exadata Storage Indexes in Action

No storage index savings

(71)

Exadata Storage Indexes and Bind Variables

Storage indexes used

(72)

Exadata Storage Indexes and NULL values

Storage index used

(73)

Exadata Storage Indexes with LIKE, BETWEEN, Wildcards

No storage index savings

(74)

Exadata Storage Indexes with LIKE, BETWEEN, Wildcards

14GB of IO saved by storage index

(75)

Exadata Storage Indexes with LIKE, BETWEEN, Wildcards

“LIKE” is fine, but …

Wildcards prevent storage index IO pruning

(76)

Exadata Storage Indexes with LIKE, BETWEEN, Wildcards

Storage index used

(77)

Exadata Storage Indexes with LIKE, BETWEEN, Wildcards

Storage index used but less IO saved with wildcards

(78)

Exadata Storage Indexes with LIKE, BETWEEN, Wildcards

No storage index used

(79)

(80)

Exadata Storage Indexes with “OLTP Tables”

Eliminated over 4Gb of IO via storage index

(81)

Exadata Storage Indexes with “OLTP Tables”

Ran in 0.26 seconds Eliminated over 4GB of IO from via storage index

(82)

(83)

Exadata Storage Indexes with “OLTP Tables”

No storage index when upper-bound in range is higher than max value

(84)

(85)

Exadata Storage Indexes with “Normal” Index Access

(86)

Exadata Storage Indexes with “Normal” Index Access

Storage indexes are only used for cell

(87)

Disabling Storage Indexes

Storage indexes can be disabled by setting

“_kcfis_storageidx_disabled” = TRUE

~ 12Gb saved via storage index Query ran in < 1 second

(88)

Disabling Storage Indexes

Storage indexes can be disabled by setting

“_kcfis_storageidx_disabled” = TRUE

0 Bytes saved by storage index

Query ran in 4.71 seconds

(89)

Tracing Storage Indexes

Storage index operations can be traced by setting “_kcfis_storageidx_diag_mode” = 2

Enable storage index tracing

Indicates storage indexes were in use

(90)

(91)

Tracing Storage Indexes

SQL_ID for given transaction

(92)

Tracing Storage Indexes

Storage region size = 1Mb strt = 0 end = 2048 Memory=2K Column # 11 Low and high values of actual data

(93)

Tracing Storage Indexes

(94)

Summary

⇒ 

Storage indexes are designed to eliminate IO requests on storage servers

⇒ 

Storage indexes are automatically maintained in memory areas on the storage cell

⇒ 

Exadata tracks minimum and maximum values in each storage region while processing IO

requests and stores these region-level boundaries in region indexes

⇒ 

Exadata examines region indexes to determine whether a storage region can be excluded

from access based on a query predicate

⇒ 

Storage indexes are automatically created and maintained

⇒ 

Storage indexes work for cell offload functions (Smart Scan)

⇒ 

Storage indexes can help performance dramatically and at worst-case, will never hurt

performance

⇒ 

Storage indexes work on well-ordered tables based on the query predicates issued

against the tables

(95)

Exadata Smart Flash Cache

⇒  Smart Flash Cache goal: intelligently cache frequently used data ⇒  Exadata caches in PCI flash cards on the storage cell

⇒  Data is cached to improve IO response time and deliver better database performance

⇒  Flash cards can deliver tens of thousands of I/Os per seconds – SAS or SATA drives can deliver a

couple hundred I/Os per second

(96)

Exadata Smart Flash Cache: Hardware

Four 96Gb PCI flash cards per cell

384 Gb of PCI flash per cell

5.4 Tb of flash on full rack, 2.7 Tb for half rack, 1.15 Tb for quarter

One cell can support 75,000 IOPs

(97)

Exadata Smart Flash Cache: Software

⇒  Smart Flash Cache provides a storage cell caching algorithm to cache appropriate data in the

storage cell PCI flash cards

⇒  Each database IO is tagged with metadata

⇒  The CELL_FLASH_CACHE setting for the object(s): DEFAULT means Smart Flash Cache is

used normally, KEEP means that Smart Flash Cache is used more aggressively, NONE means Smart Flash Cache is disabled for the object

⇒  A cache hint: CACHE means the IO should be cached, NOCACHE means it shouldn’t, and

EVICT indicates that the cached data should be removed from Smart Flash Cache

⇒  Smart Flash Cache takes the following into consideration when caching data:

⇒  IO size: Large objects with CELL_FLASH_CACHE=DEFAULT are not cached

⇒  Current cache load: Smart table scans are usually directed to disk but if

CELL_FLASH_CACHE=KEEP and the cache load is low, they may be satisfied from Smart Flash Cache

⇒  Specific operations (backups, Data Pump export/import, etc) are not cached

(98)

Exadata Smart Flash Cache: Software

⇒  Smart Flash Cache is a write-through cache

⇒  After a write is acknowledged, it’s written to Smart Flash Cache if suitable for caching ⇒  Write performance is not improved or diminished with Smart Flash Cache

⇒  Because it’s not a write-back cache, writes are not cushioned by PCI flash cards during write

operations – the writes need to happen to disk

⇒  A small battery-backed cache on each cell performs write-back caching ⇒  Technology comparisons:

⇒  EMC FastCache: EMC EFDs as an extension of storage processor controller cache, write-back

caching to cushion write IO, very good write performance

⇒  FusionIO: data placed directly on PCI flash cards, very high write performance

⇒  Exadata can match write IO performance of EMC FastCache, FusionIO by using Smart Flash

(99)

Smart Flash Cache: Write Operations

#1: Database issues a write operation! #2: CELLSRV inspects IO metadata! #3: Data is written to disk! #4: IO is acknowledged and database process continues! #5: If IO is Smart Flash Cache suitable, IO is written to Flash Cache! #6: CELLSRV uses

LRU algorithm to determine which data to replace !

(100)

Smart Flash Cache: Read from Previously Cached Data

#1: Database issues a read request! #2: CELLSRV inspects IO metadata! #3: If data exists in Flash Cache, it will be read from cache – no disk access!

#4: Read is satisfied from cache!

(101)

Smart Flash Cache: Read from Un-cached Data

#1: Database issues a read operation!

#2: CELLSRV

inspects IO metadata!

#3: Data is read from disk! #4: Read is acknowledged and database process continues! #5: If IO is Smart Flash Cache suitable, IO is written to Flash Cache! #6: CELLSRV uses

LRU algorithm to determine which data to replace !

(102)

Smart Flash Cache in Action: Writes

(103)

(104)

Smart Flash Cache in Action: Writes

Before updating, 97 flash cache hits

1000 rows updated

Performed index range scan

(105)

Smart Flash Cache in Action: Writes

Query Statistics

Update yielded (368 – 97) = 271 flash cache read hits

(106)

Smart Flash Cache in Action: Writes

No physical reads

(107)

Smart Flash Cache in Action: Writes

180 physical reads

(108)

Smart Flash Cache in Action: Reads with Un-cached Data

Ran in 1:21.34

1,072 flash cache hits for dbm1

4,452 flash cache hits for dbm1 = 3,380 flash cache hits

(109)

Smart Flash Cache in Action: Reads with Cached Data

Ran in 16.50 seconds 27.762 flash cache read hits

(110)

(111)

Smart Flash Cache in Action: Dropping Flash Cache

Ran in 1:24.36

No flash cache read hits

(112)

(113)

Smart Flash Cache in Action: Smart Scan Data

~ 12Mb cached/cell for the index

~ 18Mb cached for one of the table partitions

(114)

Smart Flash Cache in Action: Smart Scan Data

We’re using about 350Mb/cell currently …

(115)

(116)

Smart Flash Cache in Action: Smart Scan Data

Very small delta in flash cache read hits

(117)

(118)

(119)

CELL_FLASH_CACHE “KEEP”

Query time dropped to 28 seconds

cell flash cache read hits jumped from 136,709 to 252,095

(120)

CELL_FLASH_CACHE “KEEP”

About 26Gb per cell per partition

(121)

(122)

Summary

⇒  Exadata Smart Flash Cache is a caching mechanism delivered by Exadata

⇒  Smart Flash Cache uses PCI flash cards on each storage cell to cache appropriate data

⇒  Each storage cell has 384Gb of PCI flash available for Smart Flash Cache – a full rack delivers

5.3TB of flash storage

⇒  Smart Flash Cache intelligently caches appropriate data (OLTP)

⇒  Smart Flash Cache is a write-through cache – writes need to be performed on disk before

acknowledged to foreground process. After writes and initial reads, Exadata determines whether data is eligible for cache and writes to Flash Cache. Subsequent IO benefits from Smart Flash Cache

⇒  Segments can be tagged to be “kept” in Smart Flash Cache, similar to buffer cache keep

⇒  Grid Disks (and ASM disk groups) can be created on Flash Disks, improving performance in some

cases

⇒  We recommend using all flash storage for Smart Flash Cache

(123)

Oracle 11g OLTP Compression

VENDOR_ID VEND_NAME STATE VNDR_RATING VENDOR_TYPE ========== =========== ===== =========== ========== 100 ACME ONE MI 100 DIRECT 101  ACME ONE CA 90 DIRECT 102  NORTON IA 95 INDIRECT 103  WINGDINGS MI 96 INDIRECT 104  WINGDINGS GA 96 INDIRECT ACME ONEDIRECT| WINGDINGS96INDIRECT 100ACME ONEMI100DIRECT| 101ACME ONECA()DIRECT| 102NORTONIA95INDIRECT| 103WINGDINGSMS96INDIR ECT| 104WINGDINGSGA96INDIR ECT 100*MI100*|101CA90*| 102NORTONIA95INDIRECT| 103*MS**|104*GA** Uncompressed Compressed <- Header -> Symbol Table-> <- Data -> <- Free Space ->

Rows are inserted into blocks, no

Free space in

As rows are inserted, only unique values are inserted into blocks

Duplicate values inserted into symbol table

Results in more free space

(124)

Oracle 11g OLTP Compression

VENDOR_ID VEND_NAME STATE VNDR_RATING VENDOR_TYPE ========== =========== ===== =========== ========== 100 ACME ONE MI 100 DIRECT 101  ACME ONE CA 90 DIRECT 102  NORTON IA 95 INDIRECT 103  WINGDINGS MI 96 INDIRECT 104  WINGDINGS GA 96 INDIRECT ACME ONEDIRECT| WINGDINGS96INDIRECT 100ACME ONEMI100DIRECT| 101ACME ONECA()DIRECT| 102NORTONIA95INDIRECT| 103WINGDINGSMS96INDIR ECT| 104WINGDINGSGA96INDIR ECT 100*MI100*|101CA90*| 102NORTONIA95INDIRECT| 103*MS**|104*GA** Uncompressed Compressed <- Header -> Symbol Table-> <- Data -> <- Free Space ->

When the first row is inserted into the block, it is not

compressed

Subsequent row inserts check for duplicates and move

duplicate values to symbol table

After free space is exhausted in block, block will be fully

compressed

There is a slight CPU overhead in compressing data; Oracle has to build/maintain symbol table

The unit of compression for OLTP Advanced Compression is the block …

(125)

Exadata Hybrid Columnar Compression

⇒  Data is stored by columns in a compression unit

⇒  A compression unit is a set of blocks, not a single block

⇒  Basic compression and Advanced Compression loads data into blocks row by row. As rows are

added, compression algorithms are invoked to reference or update symbol tables and duplicate values are not inserted

⇒  HCC examines data before being placed on a block and divides the data into arrays of columns ⇒  HCC takes a stream if input values and places all values for column 1 in array 1, all values for

column 2 in array 2, and so on …

⇒  HCC takes four 8k blocks of data in the stream, by default

⇒  HCC then performs its de-duplication process on the entire stream of 4 blocks, so the odds of

finding duplicate data are greatly increased over standard or Advanced compression because we’re

considering a large subset of data

⇒  HCC then does leading-edge compression on the array values and inserts the columns into a

compression unit (set of blocks), not row-by-row

(126)

Exadata Hybrid Columnar Compression

VENDOR_ID VEND_NAME STATE VNDR_RATING VENDOR_TYPE ========== =========== ===== =========== ========== 100 ACME ONE MI 100 DIRECT 101  ACME ONE CA 90 DIRECT 102  NORTON IA 95 INDIRECT 103  WINGDINGS MI 96 INDIRECT 104  WINGDINGS GA 96 INDIRECT 100ACME ONEMI100DIRECT| 101ACME ONECA()DIRECT| 102NORTONIA95INDIRECT| 103WINGDINGSMS96INDIR ECT| 104WINGDINGSGA96INDIR ECT

Uncompressed Hybrid Columnar Compression

Logical Compression Unit <- Header -> CU Header-> VENDOR_ID VEND_NAME VNDR_RATING STATE VEND OR_TY PE COL6 COL7 COL8 COL9 COL10

(127)

Exadata Hybrid Columnar Compression

Page 127 Compression Unit #1 COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 Compression Unit #2 COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 Compression Unit #2 COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 4 blocks, first 50,000 rows

Second CU, new set of 4 blocks, 48,500 rows

Third CU, new set of 4 blocks, 51,000 rows

(128)

Exadata Hybrid Columnar Compression

⇒  Recall, data is stored by columns inside a compression unit

⇒  The greater the degree of duplicate column values, the less space required per compression unit

and the fewer compression units required to store the data

⇒  If queries select a single column or subset of columns, Oracle will only need to read from blocks

units on which the columns exist

⇒  This is different than other types of compression and un-compressed tables ⇒  Not only are we savings space, but we’re saving IO

(129)

Exadata Hybrid Columnar Compression and DML

⇒  EHCC compress data for bulk direct path loads only

⇒  Subsequent bulk load inserts will appends new compression units ⇒  Single-row inserts will insert rows as OLTP compressed

⇒  DELETEs against HCC tables lock entire CU ⇒  When updating EHCC tables:

⇒  The updated row is moved (i.e., deleted + re-inserted, i.e., migrated)

⇒  New row is OLTP-compressed

⇒  Locks impact entire CU, not just row!

(130)

Exadata Hybrid Columnar Compression Types

⇒  QUERY LOW compression is recommended for data warehouse tables in which data load times are

important

⇒  QUERY HIGH is recommended for data warehouse data in which space savings is important.

Offers better compression ratio than QUERY LOW but more expensive

⇒  ARCHIVE LOW is intended for archive data in which load times are important. Better compression

than QUERY compression, in most cases, but more overhead for DML

⇒  ARCHIVE HIGH is designed for archival data in which space savings is the most important factor.

(131)

Using the Compression Advisor

⇒  DBMS_COMPRESSION.GET_COMPRESSION_RATIO is Oracle’s “compression advisor”

Indicates compression type – in this case, EHCC “archive low” compression

comp_for_archive_low = EHCC Archive Low comp_for_archive_high = EHCC Archive High comp_for_query_low = EHCC Query Low comp_for_query_high = EHCC Query High comp_for_oltp = OLTP Advanced Compression

(132)

Using the Compression Advisor

OLTP Compression gives a low 1.1 compression ratio

(133)

Using the Compression Advisor

EHCC Query Low Compression gives 2.7 compression ratio

(134)

Using the Compression Advisor

EHCC Query High Compression gives 5.0 compression ratio

(135)

Using the Compression Advisor

EHCC Archive High Compression gives 9.2 compression ratio

(136)

Hybrid Columnar Compression in Action

⇒  Uncompressed table: SOE.CUSTOMERS

⇒  Use PCTAS to create 5 new compressed tables: one compressed for OLTP, one EHCC

compressed for query low, one compressed for query high, one compressed for archive low, and one compressed for archive high

⇒  Measure time it takes to create compressed table

⇒  Measure time it takes to run sample SELECT against compressed table ⇒  Measure time it takes to bulk insert rows into compressed table

⇒  Measure time it takes to update rows on compressed table

CUSTOMERS CUST_OLTP CUST_QLOW CUST_QHIGH CUST_ALOW CUST_AHIGH

Estimated Blocks

(137)

(138)

Storage Impact: Different Compression Scenarios

In our example, archive compression is not as

efficient as HCC “query high” compression

(139)

Storage Impact: Different Compression Scenarios

CUSTOMERS CUST_OLTP CUST_QLOW CUST_QHIGH CUST_ALOW CUST_AHIGH

Good estimates from DBMS_COMPRESSION

(140)

(141)

(142)

Performance Impact: Creating Compressed Tables

0 5 10 15 20 25 30 35 40 45 50

OLTP QUERYLOW QUERYHIGH ARCHIVELOW ARCHIVEHIGH

(143)

Performance Impact: Creating Compressed Tables

0 500 1000 1500 2000 2500

OLTP QUERYLOW QUERYHIGH ARCHIVELOW ARCHIVEHIGH

(144)

Performance Impact: Querying Compressed Tables

We expect least IO and best full-scan time against this table

(145)

Performance Impact: Querying Compressed Tables

Query ran in 4.63 seconds ~600Mb of IO 96% smart scan efficiency

(146)

Performance Impact: Querying Compressed Tables

4.91 seconds ~700Mb of IO 17% less LIO compared to uncompressed 94.46% smart scan efficiency

(147)

Performance Impact: Querying Compressed Tables

3.03 second runtime 97MB over interconnect Less CPU and LIO 99.07% smart scan efficiency

(148)

Performance Impact: Querying Compressed Tables

Query ran in 2.07 seconds 35 MB over interconnect Less CPU and LIO 99.52% smart scan efficiency

(149)

Performance Impact: Querying Compressed Tables

Ran in 2.08 seconds 19MB over interconnect Even less CPU and LIO

99.74% smart scan

(150)

Performance Impact: Querying Compressed Tables

2.06 seconds 17MB over interconnect 99.76% smart scan efficiency

(151)

(152)

(153)

Performance Impact: Updating Compressed Tables

Took the longest

amount of time More CPU required

to update HCC compressed tables

(154)

(155)

Proving Oracle migrated HCC Updated Rows

Row location: File 11

Block 1569727 Slot 165

(156)

(157)

(158)

Proving Oracle migrated HCC Updated Rows

Value=1 means fetch was done via migrated row

(159)

Examining the Compression Unit

Archive high

# Blocks/CU = 32

CU Length Cols/Rows

(160)

(161)

Examining the Compression Unit

CU size = 7 blocks – smaller than previous block dump

1 deleted row, slot 165 Archive high

(162)

HCC Decompression

⇒  HCC compresses data in-flight as it’s inserted via direct path ⇒  Where and when is the data decompressed when queried? ⇒  Definitions:

⇒  Upper Half = Compute Node

⇒  Lower Half = Storage Server

⇒  HCC data is decompressed on lower half when queried via Smart Scan (i.e., full scan, i.e., serial

direct read). Uncompressed data transmitted over the interconnect, storage server CPUs used for decompression

⇒  HCC data is decompressed on upper half when queried without Smart Scan (i.e., single-block

reads). Compressed data transmitted over interconnect, compute node CPUs used for decompression

⇒  Plan for CPU impact and application design!

(163)

HCC Decompression

42 cs of CPU time on compute node

(164)

HCC Decompression

(165)

HCC Decompression

(166)

Hybrid Columnar Compression Summary

⇒  Exadata Hybrid Columnar Compression is a unique compression feature available only to Exadata ⇒  Hybrid Columnar Compression compresses tables by column for a set of rows and stores inside a

logical compression unit

⇒  The unit of compression for basic and OLTP Advanced Compression is a database block ⇒  The unit of compression for Hybrid Columnar Compression is a compression unit

⇒  Oracle provides 4 different flavors of Hybrid Columnar Compression: “compress for query low”,

“compress for query high”, “compress for archive low”, and “compress for archive high”

⇒  IO against Hybrid Columnar Compressed tables is reduced compared to un-compressed tables ⇒  SELECTs against Hybrid Columnar Compressed tables can run much faster – less IO!

⇒  Rows need to be loaded into HCC tables using DIRECT PATH load

⇒  DML against HCC tables can be expensive: single-row INSERT operations use OLTP compression,

direct path inserts perform HCC compression, and UPDATEs migrate rows (and can be expensive!)

(167)

Exadata IO Resource Management

⇒  IO Resource Management (IORM) provides a means to govern and meter IO from different

workloads in the Exadata Storage Server

⇒  Database consolidation is a key driver to customer adoption of Exadata

⇒  Consolidation means that multiple databases and applications could share Exadata storage ⇒  Different databases in a shared Exadata storage grid could have different IO performance

requirements

⇒  One of the common challenges with shared storage infrastructure is that of competing IO workloads

⇒  Batch vs. OLTP

⇒  Warehouse vs. OLTP

⇒  Production vs. Test and Development

⇒  You can mitigate competing priorities by over-provisioning storage, but this becomes expensive

⇒  Exadata addresses this challenge with IO Resource Management.

(168)

Database Resource Management

⇒  A single database may have many types of workloads with different performance requirements ⇒  “Resource consumer groups” allow you to group sessions by workload

⇒  After creating resource consumer groups, you specify how resources are used within a resource

consumer group

⇒  Once resource consumer groups are established, you must map sessions to a consumer group

based on distinguishing characteristics

⇒  The combination of resource consumer groups and session mappings comprises a “resource plan” ⇒  One resource plan can be active in a database at a time

⇒  A database resource plan is also called an “intradatabase resource plan” ⇒  Let’s show an example …

(169)

Database Resource Management - Example

Database DBM

OM OLTP Consumer group Other OLTP Consumer group Reporting Consumer group

Database XBM

Online query Consumer group Batch query Consumer group

(170)

Database Resource Management - Example

Database DBM

OM OLTP Consumer group Other OLTP Consumer group Reporting Consumer group

Database XBM

Online query Consumer group Batch query Consumer group Interactive category Batch category

(171)

IO Resource Management Plans

⇒  IORM provides different approaches for managing resource allocations

⇒  If you have multiple workloads within a database that you wish to control database resource usage

with, you need to configure an intradatabase resource plan

⇒  If you only have one database in your Exadata Database Machine, your intradatabase resource

plan is all you need – IO resource management is handled automatically inside the storage servers based on this intradatabase resource plan

⇒  If you have multiple databases in your Exadata Database Machine that you wish to govern IO

resource amongst, you create an interdatabase resource plan

⇒  Rules in an interdatabase resource plan specify allocations to databases, not consumer groups ⇒  Category resource management is used when you want to control resources primary by the

category of work being done – it allows for allocation of resources amongst categories spanning multiple databases

(172)

IORM Architecture

CELLSRV queue 1 CELLSRV queue 2 CELLSRV queue 3 CELLSRV queue 4

(173)

IORM Architecture

Resource Plans IORM schedules IO according to resource plans onto disk queues

(174)

IORM Architecture: Rules

⇒  IORM is only “engaged” when needed

⇒  IORM does not intervene if there is only one active consumer group on one database

⇒  Any disk allocation that is not fully utilized is made available to other workloads in relation to the

configured resource plans

⇒  Background IO is scheduled based on their priority relative to user IO ⇒  Redo and control file writes always take precedence

⇒  DBWR writes are scheduled at the same priority as user IO

⇒  For each cell disk, each database accessing the cell has one IO queue per consumer group and

three background queues

⇒  Background IO queues are mapped to “high”, “medium”, and “low” priority requests with different IO

types mapped to each queue

(175)

IORM in Action: Planning

⇒  DBM has three consumer groups, “OM OLTP”, “OTHER OLTP”, and “REPORTING” ⇒  XBM will have two consumer groups, “ONLINE QUERY” and “BATCH QUERY” ⇒  DBM Intradatabase Resource Plan

⇒  50% of resources allocated to “OM OLTP”

⇒  30% of resources allocated to “OTHER OLTP”

⇒  20% of resources allocated to “REPORTING”

⇒  XBM Intradatabase Resource Plan

⇒  70% of resources allocated to “ONLINE QUERY”

(176)

IORM in Action: Planning

Page 176

⇒  Category Plan

⇒  70% of resources allocated to INTERACTIVE category

⇒  “OM OLTP” and “OTHER OLTP” in INTERACTIVE category for DBM

⇒  “ONLINE QUERY” in INTERACTIVE category for XBM

⇒  30% of resources allocated to BATCH category

⇒  “REPORTING” in “BATCH” category for DBM

⇒  “BATCH QUERY” in “BATCH” category for XBM

⇒  Interdatabase Plan

⇒  60% of resources allocated to database DBM

(177)

IORM in Action: Understanding the Math

All User IO = 100% Category Plan Interdatabase Plan Intradatabase Plan IORM Allocation 70% Interactive 30% Batch 40% XBM 60% DBM 40% XBM 60% DBM DBM OM OLTP” 26.25% DBM OTHER OLTP: 15.75% XBM: ONLINE QUERY 28.00% XBM: BATCH QUERY 12.00% DBM: REPORTING 18.00% 30% 70% 20% 30% 50%

(178)

IORM in Action: Understanding the Math

CG% = (Intra CG% / sum (X)) * db% * cat%

“CG%” = IORM determined resource allocation for consumer group sessions “Intra CG%” = resource allocation for consumer group within an intradatabase plan “X” = sum of intradatabase consumer group allocations for all consumer groups in the same category

Db% = percentage of database allocation in the interdatabase plan

“cat%” = percentage of resource allocation for the category in which the consumer group belongs

(179)

(180)

(181)

(182)

(183)

(184)

(185)

(186)

(187)

(188)

Using Cell database IO metrics

Metric Name Meaning

DB_IO_RQ_SM DB_IO_RQ_LG

Total number of IO requests issues by the database since any resource plan was set

DB_IO_RQ_SM_SEC

DB_IO_RQ_LG_SEC IO requests per second issued by the database in the last minute DB_IO_WT_SM

DB_IO_WT_LG

Total number of seconds that IO requests issued by the database waited to be scheduled

Use to measure DB load Use to measure DB load Use to monitor what databases waited for IO to be queued by IORM

(189)

Using Cell database IO metrics

Small IO load per database Large IO load per database

(190)

Using Cell database IO metrics

Small IO load per database/ sec, last minute Large IO load per database/ sec, last minute

(191)

Using Cell database IO metrics

Small IO Waits/DB/ Min Large IO Waits/DB/ Min

(192)

(193)

(194)

(195)

(196)