How Oracle Exadata Delivers Extreme Performance
John Clarke
Sr. Oracle Architect
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Agenda
Ø Oracle Exadata: Hardware and Software Optimized Ø Exadata Hardware Components
Ø How Oracle Works on Exadata Ø Exadata Software Components
Ø Smart Scan and Cell Offload Processing Ø Storage Indexes
Ø Smart Flash Cache
Ø Hybrid Columnar Compression Ø IO Resource Management Ø How to Get There from Here
Ø A New Way to Look at Optimization
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Exadata: Hardware and Software Optimized
The Exadata Database Machine is designed for extreme performance, manageability, consolidation, and high-availability. Oracle achieves this by:
Ø Utilizing an optimally-balanced hardware infrastructure
Ø Using fast hardware components designed to eliminate IO, CPU, and network bottlenecks
Ø Delivering significant Oracle software enhancements to leverage balanced hardware configuration
The Exadata Database Machine hardware is just part of the solution – the combination of a balanced
hardware configuration with Exadata’s storage software is what delivers extreme performance.
Exadata Hardware
Exadata Storage Servers, or “Cells” Database servers, or “Compute nodes” InfiniBand InterconnectDallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Exadata Storage Servers
Ø
Exadata storage servers are self-contained storage platforms that house disk storage and
run the Exadata Storage Server software provided by Oracle
Ø
A single Exadata storage server is also known as a “cell”
ØA cell is the building block for the Exadata storage grid
ØMore cells provide greater capacity and IO bandwidth
ØDatabases are typically deployed across multiple cells
Ø
The storage cells communicate with the Oracle database and ASM instances over
InfiniBand
Ø
Each cell is wholly dedicated to Oracle database files
Exadata Storage Servers
Ø
Exadata storage servers provide high-performance storage for Oracle databases
Ø
Up to 1.8 Gb/sec raw data bandwidth per cell
Ø
Up to 75,000 IOPs using Flash per cell
Ø
Based on 64-bit Sun Fire servers
Ø
Comes installed with Exadata storage server software, Oracle Linux x86_64, drivers and
utilities
Exadata Storage Servers
Sun Fire X4270 M2
Component Details
Processors 2 six-core Intel Xeon L5640 Processors (2.26GHz)
Memory 24 GB (6 x 4GB)
Local Disks 12 x 600 GB 15K RPM High Performance SAS, or 12 x 2Tb 7.2k RPM High Capacity SAS
Flash 4 x 96Gb Sun Flash Accelerator F20 PCIe Cards Disk Controller Disk controller HBA with 512 MB battery backed
cache
Network Two InfiniBand 4X QDR (40Gb/s) ports 1 dual-port PCIe HCA
Four embedded GbE ports Remote Management 1 Ethernet port for ILOM
Exadata Database Servers
Ø
The compute node grid consists of multiple database servers
ØOracle 11gR2 and ASM run on the database servers
Ø
Companies typically run RAC databases on the database server to achieve high
availability and maximize the aggregate CPU and memory horsepower in the compute
node grid
Exadata Database Servers
Component Details
Processors 2 6-core Intel Xeon X5670 Processors (2.93 GHz)
Memory 96 GB (12 x 8 GB)
Local Disks 4 x 300GB 10K RPM SAS disks
Disk Controller Disk controller HBA with 512MB battery backed cache
Network Two InfiniBand 4X QDR (40Gb/s) ports Four 1GbE Ethernet ports
Two 10GbE Ethernet ports Remote Management 1 Ethernet port for ILOM
Power Supplies 2 redundant hot-swappable power supplies Operating System 64-bit Oracle Enterprise Linux 5.5
InfiniBand Network
Ø
InfiniBand is the Exadata Storage Network
ØLooks like normal Ethernet to hosts
Ø
Efficiency of a SAN
Ø
Used for both storage and RAC interconnect
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
InfiniBand Network
Ø Oracle uses InfiniBand because of its proven track record with high-performance computing - provides 40 Gb/
sec in each direction
Ø Looks like Ethernet but is much faster
Ø Uses “zero copy”, which means data is transferred across the network without intermediate buffer copies in
the various network layers
Ø Uses “buffer reservation”, which means that the hardware knows exactly where to place buffers ahead of
time
Ø Unified network fabric for both Exadata storage and RAC interconnect – simplifies cabling and networking Ø On top of InfiniBand, Exadata uses the Zero Data loss (ZDP) UDP protocol. The ZDP protocol has a very low
CPU overhead with tests showing only a 2 percent CPU utilization while transferring 1 GB/sec of data.
Ø Each Exadata server is configured with one dual-port InfiniBand card designed to be connected to two separate
InfiniBand switches for high availability. Each InfiniBand link is able to carry the full data bandwidth of the entire cell, which means you can lose an entire network without losing any performance.
Other Hardware Components
Ø
Embedded Cisco switch for data center network uplink
Ø
KVM switch for management of storage server and compute nodes
ØMultiple PDUs with management interfaces
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Exadata Configuration Options
Oracle has four (4) configuration options:
Ø
Exadata X2-2 Quarter Rack
ØExadata X2-2 Half Rack
ØExadata X2-2 Full Rack
ØExadata X2-8 Full Rack
Ø
Each configuration is available with either High Performance or High Capacity disks
Ø
The difference between the configurations is the # of storage servers and compute nodes
ØOracle has a short list of hardware configuration options because it requires the balanced
hardware configuration
Ø
A half rack can be upgraded to a full rack, and multiple racks can be interconnected
ØQuarter racks have 2 InfiniBand Switches; Half and Full racks have 3
Exadata Configuration Options
Compute Nodes X2-2 Quarter Rack X2-2 Half Rack X2-2 Full Rack X2-8 Full Rack
# Compute nodes 2 4 8 2 Processor Cores per Node 12 12 12 64 Total Processor Cores 24 48 96 128 Memory/Node 96 Gb 96 Gb 96 Gb 1 TB Total Memory 192 Gb 384 Gb 768 Gb 2 TB
Exadata Configuration Options
Exadata Storage Servers
X2-2 Quarter Rack X2-2 Half Rack X2-2 Full Rack X2-8 Full Rack
# Storage cells 3 7 14 14 Disks/Cell 12 12 12 12 Total Disks 36 84 168 168 Flash 1.15 Tb 2.7 Tb 5.4 Tb 5.4 Tb Raw Storage (HP) 21.6 Tb 50.4 Tb 100.8 Tb 100.8 Tb Raw Storage (HC) 72 Tb 168 Tb 336 Tb 336 Tb
Exadata Configuration Options
Exadata Storage Servers
X2-2 Quarter Rack X2-2 Half Rack X2-2 Full Rack X2-8 Full Rack
Raw disk throughput, High Performance disks Up to 5,400 MBPS Up to 12,600 MBPS Up to 25,200 MBPS Up to 25,200 MBPS Raw disk throughput, High Capacity disks Up to 3,000 MBPS Up to 7,000 MBPS Up to 14,000 MBPS Up to 14,000 MBPS
Disk IOPs, High Performance disks
Up to 10,800 Up to 25,200 Up to 50,400 Up to 50,400
Disk IOPS, High Capacity disks
Up to 4,320 Up to 10,080 Up to 20,160 Up to 20,160
How Oracle Works on Exadata
• Oracle 11gR2 • Oracle RAC
• Database storage uses ASM
How Oracle Works on Exadata
Each storage cell contains: • Exadata Storage Software
• Disks (LUNs, cell disks, and Grid Disks) • ASM disk groups built on Exadata Grid
Disks
Exadata features implemented on storage servers and exploited by databases on compute nodes
How Oracle Works on Exadata
• Oracle communicates with the Exadata cells using iDB
• iDB is implemented using LIBCELL
• Exadata binaries are linked with LIBCELL to facilitate cell
How Oracle Works on Exadata
• CELLSRV is the primary
component and provides majority of services
• Oracle Databases and ASM processes use LIBCELL to communicate to CELLSRV processes
• CELLSRV is what delivers the unique Exadata features
Oracle on Exadata
• Each Exadata cell has 12 physical disks
• Oracle reserves 29Gb on each of the first two disks in each cell for the “system area”
Oracle on Exadata
• One LUN is created on each physical disk
• A LUN is the unit on which a Cell Disk can be created
Oracle on Exadata
• Cell Disks are created on LUNs
• Cell Disks represent the storage area for each LUN
Oracle on Exadata
• Grid Disks are built on Cell Disks
• A Grid Disk can consume all space on a cell disk or an administrator-specified chunk of the Cell Disk
• The outermost tracks on a Grid Disk have the highest performance characteristics
• Building multiple Grid Disks per Cell Disk allows the administrator to segregate storage by usage and performance demands
Oracle on Exadata
• Grid Disk are directly exposed to ASM • ASM Disk Groups are built on Grid Disks
Exadata Software Features at a Glance
Exadata Software Goals Fully and evenly utilize all
Computing resources in Exadata Database Machine to eliminate bottlenecks and deliver consistent high
performance
Reduce demand for resources
by eliminating any IO that can be discarded without impacting result
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Exadata Smart Scan
⇒ One of the most important Exadata software features
⇒ Typically where Exadata software features discussions start ⇒ One of several “cell offload” features in Exadata
⇒ “cell offload” is defined as Exadata’s shifting of database work from the database servers to
Exadata storage servers
⇒ Primary goal of Smart Scan is to perform majority of IO processing on storage servers and returned
smaller amounts of data from the storage infrastructure to the database servers
⇒ Provides dramatic performance improvements for eligible SQL operations (full table scans, fast full
index scans, joins, etc.)
⇒ Smart Scan is implemented automatically on Exadata for eligible SQL operations ⇒ Smart Scan only works on Exadata
Traditional SQL Processing
User submits a query
select state,count(*)!
from census!
group by state!
The query is parsed and an execution path is determined. Extents are identified
IO is issued – Oracle retrieves blocks based on extents identified. A block is the smallest unit of transfer from storage system, contains rows and columns
Database instance processes blocks – all required blocks read from storage system to buffer cache
Oracle filters rows and
columns, once loaded to buffer cache, and returns to user
Smart Scan Processing
User submits a query
select state,count(*)!
from census!
group by state!
An iDB command is constructed and sent to Exadata cells
Each Exadata cell scans data blocks and extracts relevant rows and columns that satisfy the SQL query
Each Exadata cell returns to the database instance an iDB message containing the requested rows and columns. These are not block images and are returned to the PGA The database consolidates
results from across each Exadata cell and returns rows to the client
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Smart Scan and Cell Offload Processing
⇒ Filtering operations are offloaded to the Exadata storage cell⇒ Column Filtering
⇒ Predicate Filtering (i.e., row filtering)
⇒ Join Filtering
⇒ Only requested rows and columns are returned to the database server ⇒ Significantly less IO transfer over storage network
⇒ Less memory and CPU required on database tier nodes
⇒ Rows/columns retrieved to user’s PGA via direct path read mechanism, not through buffer cache. ⇒ Large IO requests don’t saturate buffer cache
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Smart Scan and Cell Offload Processing …
What’s Eligible for Smart Scan on Exadata?
⇒ Single-table full table scans (not IOTs, not BLOBs, no hash clusters) ⇒ Serial direct reads enabled
⇒ Join filtering using Bloom Filters
⇒ Bloom filtering determines which rows are required to satisfy a join
⇒ A Bloom filter is created with the values of the join column for the smaller table of a join, and
this filter is used by the Exadata storage server to eliminate row candidates from the larger table
Page 31
- Oracle is a C program - kcfis_read is function that
does smart can
- kcfis_read is called by direct path read function, klbldrget
- klbkdrget is called from full scan functions
Smart Scan in Action
Our table is 4.53 Gb in
size It’s got almost 180 million rows
All its indexes are invisible
A full scan with returned in 5.27 seconds!
Measuring Smart Scan: cell_offload_plan_display
“cell_offload_plan_display” setting Meaning
AUTO Shows offload-related information in plan display if an Exadata cell is present and the objects are on the cell
ALWAYS Shows offload-related information if the SQL statement is off-loadable, whether or not running on an Exadata cell or not. Useful for 11gR2 databases not running on Exadata to do simulated plans
NEVER Never shows offload-related information in plans.
TABLE ACCESS STORAGE FULL
indicates that the query is cell-off-loadable, eligible for Smart Scan
Measuring Smart Scan with Statistics
Statistic/Event Meaning
cell physical IO bytes eligible for predicate offloading Number of bytes eligible for cell offload
cell physical IO interconnect bytes Bytes returned from storage cells to database server
Measuring Smart Scan with Statistics
4.8 Gb eligible for Smart Scan and 45MB returned from storage grid
Smart Scan Efficiency of 99.05%
Measuring Smart Scan with Statistics
PGA max before Smart Scan is about 4Mb
PGA max after smart scan is over 10Mb
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Controlling Smart Scan Behavior
⇒
SQL> alter system set cell_offload_processing=<TRUE|FALSE>;
⇒
SQL> alter session set cell_offload_processing=<TRUE|FALSE>;
⇒
SQL> select /*+ opt_param (‘cell_offload_processing’,’<true|false>’) */ ….
Page 37
Control cell offload system-wide
Control cell offload for current session
Control cell offload for SQL statement
Controlling Smart Scan Behavior
Full scan in 31+ seconds
Smart Scan Predicate Filtering
Ran in 6.64 seconds
2.1GB returned via Smart Scan, savings of 56.11%
Smart Scan Predicate Filtering
45MB returned via Smart Scan, savings of 99.03%
410K rows returned in 5.48 seconds
Smart Scan Predicate Filtering
8MB returned via Smart Scan, savings of 99.82%
2 rows returned in 5.39 seconds
Smart Scan Column Projection
1.17 MB of data through interconnect
Smart Scan Column Projection
651,744 bytes through interconnect
Smart Scan Column Projection
1,176,032 bytes when selecting BIGCOL 651,744 bytes when selecting BIGCOL 2,048 rows returned (1,874,990 – 651,744) / 2,048 = 524,288 bytes difference measured Predicted delta = 522,240 bytes, which his pretty closeSmart Scan Column Projection – Another Example
Query ran in 5.47 seconds Query ran in 13.04 secondsDallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Smart Scan: Join Filtering
⇒ In some cases, Oracle will use bloom filters to optimize join filtering ⇒ Bloom filters are implemented on the storage server on Exadata ⇒ Bloom filters work by:
⇒ Examining the columns in a join
⇒ For the smaller table in a join, Oracle determines the space required to store the result of data
and if small compared to the size of the joined (larger) table, may decide to use a bloom filter
⇒ Bloom filters identify which columns are required to satisfy a join and build an in-memory set of
values to compare against the larger table’s rows
⇒ Typically are built to reduce data communication between slaves for PQ joins and work well
with PQO and partitioned tables
Smart Scan: Join Filtering with offloaded Bloom Filters
Result returned in 7.45 second
SYS_OP_BLOOM_FILTER indicates Bloom Filters. When it happens on “Storage” line, it means it’s offloaded
Smart Scan: Join Filtering with offloaded Bloom Filters
With
“_bloom_filter_predicate_pushdown_to_storage = false
SYS_OP_BLOOM_FILTER indicates Bloom Filters. Here, does NOT exist on storage line, hence not offloaded
Smart Scan: Table Scans or Index Scans?
Do we just drop all indexes and let Smart Scan do its magic?
Completed in a little over 3 minutes
Over 6.7 GB returned from cell – note CPU and LIO values …
Smart Scan: Table Scans or Index Scans?
Do we just drop all indexes and let Smart Scan do its magic?
Completed in less than a second with indexes
Far less IO and less CPU/LIO
Indexes still work better for this type of query!
TEST TEST TEST
Smart Scan: Disabling Serial Direct Read Mechanism
Query completed in 3.81 seconds
15Gb eligible for cell offload
100Mb transferred over from storage cell
Smart Scan efficiency of 99.34%
Smart Scan: Disabling Serial Direct Read Mechanism
Disable serial direct reads
Query time increased from 3.8 seconds to 101 seconds
Nothing eligible for cell offload – no Smart Scan
Querying V$SQL to Determine Cell Off-loadable Queries
Querying V$SQL to Determine Cell Off-loadable Queries
Querying V$SQL to Determine Cell Off-loadable Queries
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Smart Scan Summary
⇒ Smart Scan is software engineered for Exadata that can provide significant IO savings and
dramatically improved performance
⇒ Smart Scan works by offloading IO operations to the Exadata storage cell and only returning rows
and columns requested by the SQL statement back to the database instance
⇒ Row filtering provides significant cell scan efficiency gains ⇒ Column filtering provides IO savings
⇒ Join filtering using Bloom Filters, performed on Exadata cells, also can offer huge performance
gains
⇒ Don’t drop all your OLTP indexes without testing – single-row index access will often still outpace
full scan with cell offload
⇒ One of the great things about Exadata and Smart Scan is that nothing needs to be explicitly done by
a developer or DBA to take advantage of it - it “just works” with Exadata
Exadata Storage Indexes
⇒
The goal of storage indexes is to eliminate IO requests to Exadata disks
⇒Storage indexes are NOT at all like traditional Oracle indexes
⇒
Storage indexes are like “anti-indexes” – they help Oracle determine which
blocks the requested data are NOT in
⇒
Knowing which storage areas will NOT contain the requested data allows
Exadata to skip IO requests to these disks
⇒
Bypassing IO to disks reduces work required on the storage grid and
improves performance
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Exadata Storage Indexes: Architecture
⇒
Each disk in an Exadata storage cell is divided into 1MB pieces called “storage regions”
⇒
For each 1MB storage region, data distribution statistics are held in a memory structure
called a “region index”
⇒
Region indexes maintain “min” and “max” values for the storage region as data as
selected from the disk storage region, for up to 8 columns in a table
⇒
Region indexes are maintained over time as data is access from the disk
⇒
Storage indexes are maintained automatically by Exadata; there is no way to influence
their behavior
⇒
Storage server reboots will erase storage index data – region indexes are stored in
memory
⇒
Performance impact can be dramatic, but performance impact is not deterministic as its
subject to Exadata’s automatic management of region indexes
Region index
Exadata Storage Indexes: Architecture
select * from soe.orders where order_total > 60; ASM Disk Storage region ASM AU Min 1 Max 22 Min: 62 Max 100 Min 1 Max 60 Min 1 Max 101 C1 C2 C3 C4 … 1 … … … 59 … … Table: ORDERS C1 C2 C3 C4 … 1 … … … 11 … … Table: ORDERS C1 C2 C3 C4 … 62 … … … 81 … … Table: ORDERS Min = 1, Max = 60 Min = 62, Max = 100 Min = 1, Max = 60
Exadata Storage Indexes: Scenarios
⇒
Data needs to be well-ordered according to query predicates
⇒
If all regions contain randomly ordered column values, region index min and max values
will be too widespread to eliminate the region from an IO request
⇒
Since Exadata only maintain regions indexes on 8 columns per table, your range of
distinct predicates should be relatively small and static
⇒
Storage indexes can return “false-positive”, but will never skip regions and jeopardize
query integrity
select * from soe.orders where order_total between 10 and 15; Min 1 Max 22 Min: 62 Max 100 Min 1 Max 60 Min 1 Max 101
Exadata Storage Indexes in Action
This is the key statistic: “cell physical IO bytes saved by storage index”
Exadata Storage Indexes in Action
0 rows returned in 4.85 seconds
~ 17Mb of IO saved by storage index
Smart scan efficiency of 99.98%
Exadata Storage Indexes in Action
0 rows returned in 4.85 seconds
Smart scan efficiency of 99.98%
0 rows returned in 2.26 seconds
~ 9Gb of IO saved by storage index
Exadata Storage Indexes in Action
0 rows returned in 0.09 seconds!!
~ 15Gb of IO saved by storage index
Exadata Storage Indexes in Action
7.8 million rows returned in 4.23 seconds No storage index savingsExadata Storage Indexes in Action
7.8 million rows returned in 1.37 seconds Over 12Gb of IO savings with storage indexesExadata Storage Indexes in Action
No storage index savings
Exadata Storage Indexes in Action
No storage index savings
Exadata Storage Indexes and Bind Variables
Storage indexes used
Exadata Storage Indexes and NULL values
Storage index used
Exadata Storage Indexes with LIKE, BETWEEN, Wildcards
No storage index savings
Exadata Storage Indexes with LIKE, BETWEEN, Wildcards
14GB of IO saved by storage index
Exadata Storage Indexes with LIKE, BETWEEN, Wildcards
No storage index savings
“LIKE” is fine, but …
Wildcards prevent storage index IO pruning
Exadata Storage Indexes with LIKE, BETWEEN, Wildcards
Storage index used
Exadata Storage Indexes with LIKE, BETWEEN, Wildcards
Storage index used but less IO saved with wildcards
Exadata Storage Indexes with LIKE, BETWEEN, Wildcards
No storage index used
Exadata Storage Indexes with “OLTP Tables”
Eliminated over 4Gb of IO via storage index
Exadata Storage Indexes with “OLTP Tables”
Ran in 0.26 seconds Eliminated over 4GB of IO from via storage indexExadata Storage Indexes with “OLTP Tables”
No storage index when upper-bound in range is higher than max valueExadata Storage Indexes with “Normal” Index Access
Exadata Storage Indexes with “Normal” Index Access
Storage indexes are only used for cell
Disabling Storage Indexes
Storage indexes can be disabled by setting
“_kcfis_storageidx_disabled” = TRUE
~ 12Gb saved via storage index Query ran in < 1 second
Disabling Storage Indexes
Storage indexes can be disabled by setting
“_kcfis_storageidx_disabled” = TRUE
0 Bytes saved by storage index
Query ran in 4.71 seconds
Tracing Storage Indexes
Storage index operations can be traced by setting “_kcfis_storageidx_diag_mode” = 2
Enable storage index tracing
Indicates storage indexes were in use
Tracing Storage Indexes
SQL_ID for given transaction
Tracing Storage Indexes
Storage region size = 1Mb strt = 0 end = 2048 Memory=2K Column # 11 Low and high values of actual dataTracing Storage Indexes
No storage index savings
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Summary
⇒
Storage indexes are designed to eliminate IO requests on storage servers
⇒
Storage indexes are automatically maintained in memory areas on the storage cell
⇒
Exadata tracks minimum and maximum values in each storage region while processing IO
requests and stores these region-level boundaries in region indexes
⇒
Exadata examines region indexes to determine whether a storage region can be excluded
from access based on a query predicate
⇒
Storage indexes are automatically created and maintained
⇒
Storage indexes work for cell offload functions (Smart Scan)
⇒
Storage indexes can help performance dramatically and at worst-case, will never hurt
performance
⇒
Storage indexes work on well-ordered tables based on the query predicates issued
against the tables
Exadata Smart Flash Cache
⇒ Smart Flash Cache goal: intelligently cache frequently used data ⇒ Exadata caches in PCI flash cards on the storage cell
⇒ Data is cached to improve IO response time and deliver better database performance
⇒ Flash cards can deliver tens of thousands of I/Os per seconds – SAS or SATA drives can deliver a
couple hundred I/Os per second
Exadata Smart Flash Cache: Hardware
Four 96Gb PCI flash cards per cell
384 Gb of PCI flash per cell
5.4 Tb of flash on full rack, 2.7 Tb for half rack, 1.15 Tb for quarter
One cell can support 75,000 IOPs
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Exadata Smart Flash Cache: Software
⇒ Smart Flash Cache provides a storage cell caching algorithm to cache appropriate data in the
storage cell PCI flash cards
⇒ Each database IO is tagged with metadata
⇒ The CELL_FLASH_CACHE setting for the object(s): DEFAULT means Smart Flash Cache is
used normally, KEEP means that Smart Flash Cache is used more aggressively, NONE means Smart Flash Cache is disabled for the object
⇒ A cache hint: CACHE means the IO should be cached, NOCACHE means it shouldn’t, and
EVICT indicates that the cached data should be removed from Smart Flash Cache
⇒ Smart Flash Cache takes the following into consideration when caching data:
⇒ IO size: Large objects with CELL_FLASH_CACHE=DEFAULT are not cached
⇒ Current cache load: Smart table scans are usually directed to disk but if
CELL_FLASH_CACHE=KEEP and the cache load is low, they may be satisfied from Smart Flash Cache
⇒ Specific operations (backups, Data Pump export/import, etc) are not cached
Exadata Smart Flash Cache: Software
⇒ Smart Flash Cache is a write-through cache⇒ After a write is acknowledged, it’s written to Smart Flash Cache if suitable for caching ⇒ Write performance is not improved or diminished with Smart Flash Cache
⇒ Because it’s not a write-back cache, writes are not cushioned by PCI flash cards during write
operations – the writes need to happen to disk
⇒ A small battery-backed cache on each cell performs write-back caching ⇒ Technology comparisons:
⇒ EMC FastCache: EMC EFDs as an extension of storage processor controller cache, write-back
caching to cushion write IO, very good write performance
⇒ FusionIO: data placed directly on PCI flash cards, very high write performance
⇒ Exadata can match write IO performance of EMC FastCache, FusionIO by using Smart Flash
Smart Flash Cache: Write Operations
#1: Database issues a write operation! #2: CELLSRV inspects IO metadata! #3: Data is written to disk! #4: IO is acknowledged and database process continues! #5: If IO is Smart Flash Cache suitable, IO is written to Flash Cache! #6: CELLSRV usesLRU algorithm to determine which data to replace !
Smart Flash Cache: Read from Previously Cached Data
#1: Database issues a read request! #2: CELLSRV inspects IO metadata! #3: If data exists in Flash Cache, it will be read from cache – no disk access!#4: Read is satisfied from cache!
Smart Flash Cache: Read from Un-cached Data
#1: Database issues a read operation!
#2: CELLSRV
inspects IO metadata!
#3: Data is read from disk! #4: Read is acknowledged and database process continues! #5: If IO is Smart Flash Cache suitable, IO is written to Flash Cache! #6: CELLSRV uses
LRU algorithm to determine which data to replace !
Smart Flash Cache in Action: Writes
Smart Flash Cache in Action: Writes
Before updating, 97 flash cache hits
1000 rows updated
Performed index range scan
Smart Flash Cache in Action: Writes
Query Statistics
Update yielded (368 – 97) = 271 flash cache read hits
Smart Flash Cache in Action: Writes
No physical reads
Smart Flash Cache in Action: Writes
180 physical reads
Smart Flash Cache in Action: Reads with Un-cached Data
Ran in 1:21.34
1,072 flash cache hits for dbm1
4,452 flash cache hits for dbm1 = 3,380 flash cache hits
Smart Flash Cache in Action: Reads with Cached Data
Ran in 16.50 seconds 27.762 flash cache read hitsSmart Flash Cache in Action: Dropping Flash Cache
Ran in 1:24.36
No flash cache read hits
Smart Flash Cache in Action: Smart Scan Data
~ 12Mb cached/cell for the index
~ 18Mb cached for one of the table partitions
Smart Flash Cache in Action: Smart Scan Data
We’re using about 350Mb/cell currently …
Smart Flash Cache in Action: Smart Scan Data
Very small delta in flash cache read hits
CELL_FLASH_CACHE “KEEP”
Query time dropped to 28 seconds
cell flash cache read hits jumped from 136,709 to 252,095
CELL_FLASH_CACHE “KEEP”
About 26Gb per cell per partition
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Summary
⇒ Exadata Smart Flash Cache is a caching mechanism delivered by Exadata
⇒ Smart Flash Cache uses PCI flash cards on each storage cell to cache appropriate data
⇒ Each storage cell has 384Gb of PCI flash available for Smart Flash Cache – a full rack delivers
5.3TB of flash storage
⇒ Smart Flash Cache intelligently caches appropriate data (OLTP)
⇒ Smart Flash Cache is a write-through cache – writes need to be performed on disk before
acknowledged to foreground process. After writes and initial reads, Exadata determines whether data is eligible for cache and writes to Flash Cache. Subsequent IO benefits from Smart Flash Cache
⇒ Segments can be tagged to be “kept” in Smart Flash Cache, similar to buffer cache keep
⇒ Grid Disks (and ASM disk groups) can be created on Flash Disks, improving performance in some
cases
⇒ We recommend using all flash storage for Smart Flash Cache
Oracle 11g OLTP Compression
VENDOR_ID VEND_NAME STATE VNDR_RATING VENDOR_TYPE ========== =========== ===== =========== ========== 100 ACME ONE MI 100 DIRECT 101 ACME ONE CA 90 DIRECT 102 NORTON IA 95 INDIRECT 103 WINGDINGS MI 96 INDIRECT 104 WINGDINGS GA 96 INDIRECT ACME ONEDIRECT| WINGDINGS96INDIRECT 100ACME ONEMI100DIRECT| 101ACME ONECA()DIRECT| 102NORTONIA95INDIRECT| 103WINGDINGSMS96INDIR ECT| 104WINGDINGSGA96INDIR ECT 100*MI100*|101CA90*| 102NORTONIA95INDIRECT| 103*MS**|104*GA** Uncompressed Compressed <- Header -> Symbol Table-> <- Data -> <- Free Space ->
Rows are inserted into blocks, no
Free space in
As rows are inserted, only unique values are inserted into blocks
Duplicate values inserted into symbol table
Results in more free space
Oracle 11g OLTP Compression
VENDOR_ID VEND_NAME STATE VNDR_RATING VENDOR_TYPE ========== =========== ===== =========== ========== 100 ACME ONE MI 100 DIRECT 101 ACME ONE CA 90 DIRECT 102 NORTON IA 95 INDIRECT 103 WINGDINGS MI 96 INDIRECT 104 WINGDINGS GA 96 INDIRECT ACME ONEDIRECT| WINGDINGS96INDIRECT 100ACME ONEMI100DIRECT| 101ACME ONECA()DIRECT| 102NORTONIA95INDIRECT| 103WINGDINGSMS96INDIR ECT| 104WINGDINGSGA96INDIR ECT 100*MI100*|101CA90*| 102NORTONIA95INDIRECT| 103*MS**|104*GA** Uncompressed Compressed <- Header -> Symbol Table-> <- Data -> <- Free Space ->
When the first row is inserted into the block, it is not
compressed
Subsequent row inserts check for duplicates and move
duplicate values to symbol table
After free space is exhausted in block, block will be fully
compressed
There is a slight CPU overhead in compressing data; Oracle has to build/maintain symbol table
The unit of compression for OLTP Advanced Compression is the block …
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Exadata Hybrid Columnar Compression
⇒ Data is stored by columns in a compression unit⇒ A compression unit is a set of blocks, not a single block
⇒ Basic compression and Advanced Compression loads data into blocks row by row. As rows are
added, compression algorithms are invoked to reference or update symbol tables and duplicate values are not inserted
⇒ HCC examines data before being placed on a block and divides the data into arrays of columns ⇒ HCC takes a stream if input values and places all values for column 1 in array 1, all values for
column 2 in array 2, and so on …
⇒ HCC takes four 8k blocks of data in the stream, by default
⇒ HCC then performs its de-duplication process on the entire stream of 4 blocks, so the odds of
finding duplicate data are greatly increased over standard or Advanced compression because we’re
considering a large subset of data
⇒ HCC then does leading-edge compression on the array values and inserts the columns into a
compression unit (set of blocks), not row-by-row
Exadata Hybrid Columnar Compression
VENDOR_ID VEND_NAME STATE VNDR_RATING VENDOR_TYPE ========== =========== ===== =========== ========== 100 ACME ONE MI 100 DIRECT 101 ACME ONE CA 90 DIRECT 102 NORTON IA 95 INDIRECT 103 WINGDINGS MI 96 INDIRECT 104 WINGDINGS GA 96 INDIRECT 100ACME ONEMI100DIRECT| 101ACME ONECA()DIRECT| 102NORTONIA95INDIRECT| 103WINGDINGSMS96INDIR ECT| 104WINGDINGSGA96INDIR ECT
Uncompressed Hybrid Columnar Compression
Logical Compression Unit <- Header -> CU Header-> VENDOR_ID VEND_NAME VNDR_RATING STATE VEND OR_TY PE COL6 COL7 COL8 COL9 COL10
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Exadata Hybrid Columnar Compression
Page 127 Compression Unit #1 COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 Compression Unit #2 COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 Compression Unit #2 COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 4 blocks, first 50,000 rows
Second CU, new set of 4 blocks, 48,500 rows
Third CU, new set of 4 blocks, 51,000 rows
Exadata Hybrid Columnar Compression
⇒ Recall, data is stored by columns inside a compression unit⇒ The greater the degree of duplicate column values, the less space required per compression unit
and the fewer compression units required to store the data
⇒ If queries select a single column or subset of columns, Oracle will only need to read from blocks
units on which the columns exist
⇒ This is different than other types of compression and un-compressed tables ⇒ Not only are we savings space, but we’re saving IO
Exadata Hybrid Columnar Compression and DML
⇒ EHCC compress data for bulk direct path loads only⇒ Subsequent bulk load inserts will appends new compression units ⇒ Single-row inserts will insert rows as OLTP compressed
⇒ DELETEs against HCC tables lock entire CU ⇒ When updating EHCC tables:
⇒ The updated row is moved (i.e., deleted + re-inserted, i.e., migrated)
⇒ New row is OLTP-compressed
⇒ Locks impact entire CU, not just row!
Exadata Hybrid Columnar Compression Types
⇒ QUERY LOW compression is recommended for data warehouse tables in which data load times are
important
⇒ QUERY HIGH is recommended for data warehouse data in which space savings is important.
Offers better compression ratio than QUERY LOW but more expensive
⇒ ARCHIVE LOW is intended for archive data in which load times are important. Better compression
than QUERY compression, in most cases, but more overhead for DML
⇒ ARCHIVE HIGH is designed for archival data in which space savings is the most important factor.
Using the Compression Advisor
⇒ DBMS_COMPRESSION.GET_COMPRESSION_RATIO is Oracle’s “compression advisor”
Indicates compression type – in this case, EHCC “archive low” compression
comp_for_archive_low = EHCC Archive Low comp_for_archive_high = EHCC Archive High comp_for_query_low = EHCC Query Low comp_for_query_high = EHCC Query High comp_for_oltp = OLTP Advanced Compression
Using the Compression Advisor
⇒ DBMS_COMPRESSION.GET_COMPRESSION_RATIO is Oracle’s “compression advisor”
OLTP Compression gives a low 1.1 compression ratio
Using the Compression Advisor
⇒ DBMS_COMPRESSION.GET_COMPRESSION_RATIO is Oracle’s “compression advisor”
EHCC Query Low Compression gives 2.7 compression ratio
Using the Compression Advisor
⇒ DBMS_COMPRESSION.GET_COMPRESSION_RATIO is Oracle’s “compression advisor”
EHCC Query High Compression gives 5.0 compression ratio
Using the Compression Advisor
⇒ DBMS_COMPRESSION.GET_COMPRESSION_RATIO is Oracle’s “compression advisor”
EHCC Archive High Compression gives 9.2 compression ratio
Hybrid Columnar Compression in Action
⇒ Uncompressed table: SOE.CUSTOMERS⇒ Use PCTAS to create 5 new compressed tables: one compressed for OLTP, one EHCC
compressed for query low, one compressed for query high, one compressed for archive low, and one compressed for archive high
⇒ Measure time it takes to create compressed table
⇒ Measure time it takes to run sample SELECT against compressed table ⇒ Measure time it takes to bulk insert rows into compressed table
⇒ Measure time it takes to update rows on compressed table
CUSTOMERS CUST_OLTP CUST_QLOW CUST_QHIGH CUST_ALOW CUST_AHIGH
Estimated Blocks
Storage Impact: Different Compression Scenarios
In our example, archive compression is not as
efficient as HCC “query high” compression
Storage Impact: Different Compression Scenarios
CUSTOMERS CUST_OLTP CUST_QLOW CUST_QHIGH CUST_ALOW CUST_AHIGH
Good estimates from DBMS_COMPRESSION
Performance Impact: Creating Compressed Tables
0 5 10 15 20 25 30 35 40 45 50OLTP QUERYLOW QUERYHIGH ARCHIVELOW ARCHIVEHIGH
Performance Impact: Creating Compressed Tables
0 500 1000 1500 2000 2500OLTP QUERYLOW QUERYHIGH ARCHIVELOW ARCHIVEHIGH
Performance Impact: Querying Compressed Tables
We expect least IO and best full-scan time against this table
Performance Impact: Querying Compressed Tables
Query ran in 4.63 seconds ~600Mb of IO 96% smart scan efficiencyPerformance Impact: Querying Compressed Tables
4.91 seconds ~700Mb of IO 17% less LIO compared to uncompressed 94.46% smart scan efficiencyPerformance Impact: Querying Compressed Tables
3.03 second runtime 97MB over interconnect Less CPU and LIO 99.07% smart scan efficiencyPerformance Impact: Querying Compressed Tables
Query ran in 2.07 seconds 35 MB over interconnect Less CPU and LIO 99.52% smart scan efficiencyPerformance Impact: Querying Compressed Tables
Ran in 2.08 seconds 19MB over interconnect Even less CPU and LIO99.74% smart scan
Performance Impact: Querying Compressed Tables
2.06 seconds 17MB over interconnect 99.76% smart scan efficiencyPerformance Impact: Updating Compressed Tables
Took the longest
amount of time More CPU required
to update HCC compressed tables
Proving Oracle migrated HCC Updated Rows
Row location: File 11
Block 1569727 Slot 165
Proving Oracle migrated HCC Updated Rows
Value=1 means fetch was done via migrated row
Examining the Compression Unit
Archive high
# Blocks/CU = 32
CU Length Cols/Rows
Examining the Compression Unit
CU size = 7 blocks – smaller than previous block dump
1 deleted row, slot 165 Archive high
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
HCC Decompression
⇒ HCC compresses data in-flight as it’s inserted via direct path ⇒ Where and when is the data decompressed when queried? ⇒ Definitions:
⇒ Upper Half = Compute Node
⇒ Lower Half = Storage Server
⇒ HCC data is decompressed on lower half when queried via Smart Scan (i.e., full scan, i.e., serial
direct read). Uncompressed data transmitted over the interconnect, storage server CPUs used for decompression
⇒ HCC data is decompressed on upper half when queried without Smart Scan (i.e., single-block
reads). Compressed data transmitted over interconnect, compute node CPUs used for decompression
⇒ Plan for CPU impact and application design!
HCC Decompression
42 cs of CPU time on compute node
HCC Decompression
HCC Decompression
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Hybrid Columnar Compression Summary
⇒ Exadata Hybrid Columnar Compression is a unique compression feature available only to Exadata ⇒ Hybrid Columnar Compression compresses tables by column for a set of rows and stores inside a
logical compression unit
⇒ The unit of compression for basic and OLTP Advanced Compression is a database block ⇒ The unit of compression for Hybrid Columnar Compression is a compression unit
⇒ Oracle provides 4 different flavors of Hybrid Columnar Compression: “compress for query low”,
“compress for query high”, “compress for archive low”, and “compress for archive high”
⇒ IO against Hybrid Columnar Compressed tables is reduced compared to un-compressed tables ⇒ SELECTs against Hybrid Columnar Compressed tables can run much faster – less IO!
⇒ Rows need to be loaded into HCC tables using DIRECT PATH load
⇒ DML against HCC tables can be expensive: single-row INSERT operations use OLTP compression,
direct path inserts perform HCC compression, and UPDATEs migrate rows (and can be expensive!)
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
Exadata IO Resource Management
⇒ IO Resource Management (IORM) provides a means to govern and meter IO from different
workloads in the Exadata Storage Server
⇒ Database consolidation is a key driver to customer adoption of Exadata
⇒ Consolidation means that multiple databases and applications could share Exadata storage ⇒ Different databases in a shared Exadata storage grid could have different IO performance
requirements
⇒ One of the common challenges with shared storage infrastructure is that of competing IO workloads
⇒ Batch vs. OLTP
⇒ Warehouse vs. OLTP
⇒ Production vs. Test and Development
⇒ You can mitigate competing priorities by over-provisioning storage, but this becomes expensive
⇒ Exadata addresses this challenge with IO Resource Management.
Database Resource Management
⇒ A single database may have many types of workloads with different performance requirements ⇒ “Resource consumer groups” allow you to group sessions by workload
⇒ After creating resource consumer groups, you specify how resources are used within a resource
consumer group
⇒ Once resource consumer groups are established, you must map sessions to a consumer group
based on distinguishing characteristics
⇒ The combination of resource consumer groups and session mappings comprises a “resource plan” ⇒ One resource plan can be active in a database at a time
⇒ A database resource plan is also called an “intradatabase resource plan” ⇒ Let’s show an example …
Database Resource Management - Example
Database DBM
OM OLTP Consumer group Other OLTP Consumer group Reporting Consumer groupDatabase XBM
Online query Consumer group Batch query Consumer groupDatabase Resource Management - Example
Database DBM
OM OLTP Consumer group Other OLTP Consumer group Reporting Consumer groupDatabase XBM
Online query Consumer group Batch query Consumer group Interactive category Batch categoryIO Resource Management Plans
⇒ IORM provides different approaches for managing resource allocations
⇒ If you have multiple workloads within a database that you wish to control database resource usage
with, you need to configure an intradatabase resource plan
⇒ If you only have one database in your Exadata Database Machine, your intradatabase resource
plan is all you need – IO resource management is handled automatically inside the storage servers based on this intradatabase resource plan
⇒ If you have multiple databases in your Exadata Database Machine that you wish to govern IO
resource amongst, you create an interdatabase resource plan
⇒ Rules in an interdatabase resource plan specify allocations to databases, not consumer groups ⇒ Category resource management is used when you want to control resources primary by the
category of work being done – it allows for allocation of resources amongst categories spanning multiple databases
IORM Architecture
CELLSRV queue 1 CELLSRV queue 2 CELLSRV queue 3 CELLSRV queue 4IORM Architecture
Resource Plans IORM schedules IO according to resource plans onto disk queuesIORM Architecture: Rules
⇒ IORM is only “engaged” when needed⇒ IORM does not intervene if there is only one active consumer group on one database
⇒ Any disk allocation that is not fully utilized is made available to other workloads in relation to the
configured resource plans
⇒ Background IO is scheduled based on their priority relative to user IO ⇒ Redo and control file writes always take precedence
⇒ DBWR writes are scheduled at the same priority as user IO
⇒ For each cell disk, each database accessing the cell has one IO queue per consumer group and
three background queues
⇒ Background IO queues are mapped to “high”, “medium”, and “low” priority requests with different IO
types mapped to each queue
IORM in Action: Planning
⇒ DBM has three consumer groups, “OM OLTP”, “OTHER OLTP”, and “REPORTING” ⇒ XBM will have two consumer groups, “ONLINE QUERY” and “BATCH QUERY” ⇒ DBM Intradatabase Resource Plan
⇒ 50% of resources allocated to “OM OLTP”
⇒ 30% of resources allocated to “OTHER OLTP”
⇒ 20% of resources allocated to “REPORTING”
⇒ XBM Intradatabase Resource Plan
⇒ 70% of resources allocated to “ONLINE QUERY”
Dallas | Detroit | Los Angeles | Singapore | India
© 2010 Centroid - Confidential
IORM in Action: Planning
Page 176
⇒ Category Plan
⇒ 70% of resources allocated to INTERACTIVE category
⇒ “OM OLTP” and “OTHER OLTP” in INTERACTIVE category for DBM
⇒ “ONLINE QUERY” in INTERACTIVE category for XBM
⇒ 30% of resources allocated to BATCH category
⇒ “REPORTING” in “BATCH” category for DBM
⇒ “BATCH QUERY” in “BATCH” category for XBM
⇒ Interdatabase Plan
⇒ 60% of resources allocated to database DBM
IORM in Action: Understanding the Math
All User IO = 100% Category Plan Interdatabase Plan Intradatabase Plan IORM Allocation 70% Interactive 30% Batch 40% XBM 60% DBM 40% XBM 60% DBM DBM OM OLTP” 26.25% DBM OTHER OLTP: 15.75% XBM: ONLINE QUERY 28.00% XBM: BATCH QUERY 12.00% DBM: REPORTING 18.00% 30% 70% 20% 30% 50%IORM in Action: Understanding the Math
CG% = (Intra CG% / sum (X)) * db% * cat%
“CG%” = IORM determined resource allocation for consumer group sessions “Intra CG%” = resource allocation for consumer group within an intradatabase plan “X” = sum of intradatabase consumer group allocations for all consumer groups in the same category
Db% = percentage of database allocation in the interdatabase plan
“cat%” = percentage of resource allocation for the category in which the consumer group belongs
Using Cell database IO metrics
Metric Name Meaning
DB_IO_RQ_SM DB_IO_RQ_LG
Total number of IO requests issues by the database since any resource plan was set
DB_IO_RQ_SM_SEC
DB_IO_RQ_LG_SEC IO requests per second issued by the database in the last minute DB_IO_WT_SM
DB_IO_WT_LG
Total number of seconds that IO requests issued by the database waited to be scheduled
Use to measure DB load Use to measure DB load Use to monitor what databases waited for IO to be queued by IORM