Oracle Clusterware and Private Network Considerations
Much of this presentation is attributed to Michael Zoll and work done by the RAC Performance Development group
The following is intended to outline our general
product direction. It is intended for information
purposes only, and may not be incorporated into any
contract. It is not a commitment to deliver any
material, code, or functionality, and should not be
relied upon in making purchasing decisions.
The development, release, and timing of any
features or functionality described for Oracle’s
products remains at the sole discretion of Oracle.
Architectural Overview
RAC and Cache Fusion Performance
Infrastructure
Common Problems and Resolution
Aggregation and VLANs
Oracle Clusterware
[Diagram: each node runs an OS with the Clusterware daemons EVMD, CRSD, OPROCD, ONS and CSSD; CSSD runs in real-time priority. The nodes attach to the public network through VIP1, VIP2, … VIPn, to each other through the cluster private high-speed network, and to shared storage holding the OCR and voting disks on raw devices.]
Under the Covers
[Diagram: Node 1, Node 2, … Node n each run one instance (Instance 1, Instance 2, … Instance n) with its own redo log files. Each instance's SGA contains the buffer cache, library cache, dictionary cache, log buffer and Global Resource Directory, and is served by the background processes LMON, LMD0, LMS0, DIAG, VKTM, LGWR, DBW0, SMON and PMON. The diagram labels some of these processes “Runs in Real Time Priority”. The nodes are connected by the cluster private high-speed network.]
Global Cache Service (GCS)
Manages coherent access to data in buffer caches of all instances in the cluster
Minimizes access time to data which is not in local cache
• access to data in global cache faster than disk access
Implements fast direct memory access over high-speed interconnects
• for all data blocks and types
Uses an efficient and scalable messaging protocol
• Never more than 3 hops
Cache Hierarchy: Data in Remote Cache
Local Cache Miss Datablock Requested
Datablock Returned Remote Cache Hit
Cache Hierarchy: Data On Disk
Local Cache Miss Datablock Requested
Grant Returned Remote Cache Miss
Cache Hierarchy: Read Mostly
Local Cache Miss
11.1
CPU Optimizations for read-intensive operations
• Read-only access
  – No messages
  – Direct reads
• Read-mostly access
  – Message reductions
  – Latency improvements
Significant gains
Performance of Cache Fusion
Message: ~200 bytes
Block: e.g. 8K
Requesting process: initiate send and wait, then receive
LMS: receive, process block, send
Wire time: 200 bytes/(1 Gb/sec) for the message, 8192 bytes/(1 Gb/sec) for the block
Total access time: e.g. ~360 microseconds (UDP over GBE)
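The two wire-time terms on this slide can be worked out directly. The sketch below is only an illustration of that arithmetic; the function name is an assumption, and it counts raw payload bits only, ignoring protocol headers.

```python
def wire_time_us(size_bytes: int, link_bits_per_sec: float = 1e9) -> float:
    """Time to put size_bytes on the wire, in microseconds (payload only)."""
    return size_bytes * 8 / link_bits_per_sec * 1e6

message_us = wire_time_us(200)    # request message, ~200 bytes
block_us = wire_time_us(8192)     # 8K data block

print(f"message: {message_us:.1f} us, block: {block_us:.1f} us")
```

The block transfer itself accounts for roughly 65 of the ~360 microseconds; under these assumptions the remainder is spent in the send, receive and block processing steps on both sides.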
Fundamentals: Minimum Latency (*), UDP/GBE and RDS/IB

RT (ms) by block size
Block size   2K     4K     8K     16K
UDP/GE       0.30   0.31   0.36   0.46
RDS/IB       0.12   0.13   0.16   0.20

(*) roundtrip; blocks are not “busy”, i.e. no log flush, no serialization (“buffer busy”)
AWR and Statspack reports show averages as if the wait times were normally distributed. The session wait history, which is included in Statspack in 10.2 and AWR in 11g, shows the actual quantiles.
The minimum values in this table are the optimal values for 2-way and 3-way block transfers, but they can be taken as the expected values (i.e. 10 ms for a 2-way block would be very high).
[Diagram: path of a message between two processes (FG/LMS*). Sender: socket layer (tx buffers), TCP/UDP, IP layer, interface layer (TX). L2/L3 switch. Receiver: interface layer (RX), IP layer, TCP/UDP, socket layer (rx buffers).]
Infrastructure: Interconnect Bandwidth
Bandwidth requirements depend on several factors (e.g. buffer cache size, number of CPUs per node, access patterns) and cannot be predicted precisely for every application
Typical utilization approx. 10-30% in OLTP
• 10000-12000 8K blocks per sec to saturate 1 x Gb Ethernet (75-80% of theoretical bandwidth)
Generally, 1Gb/sec sufficient for performance and scalability in OLTP.
DSS/DW systems should be designed with > 1Gb/sec capacity
A sizing approach with rules of thumb is described in
• Project MegaGrid: Capacity Planning for Large
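As a rough cross-check of the saturation figure above, the sketch below converts a usable fraction of a 1 Gb/sec link into 8K blocks per second. The function name and the payload-only model are assumptions; protocol headers and small messages on the same link push the real figure lower.

```python
LINK_BYTES_PER_SEC = 1_000_000_000 / 8   # 1 Gb/sec = 125 MB/sec

def blocks_per_sec(usable_fraction: float, block_bytes: int = 8192) -> int:
    """Blocks per second that fit into the usable share of the link."""
    return int(LINK_BYTES_PER_SEC * usable_fraction / block_bytes)

for fraction in (0.75, 0.80):
    print(f"{fraction:.0%} of 1 Gb/sec: {blocks_per_sec(fraction)} blocks/sec")
```

The result lands at the upper end of the 10000-12000 range quoted on the slide, which is what one would expect from an estimate that ignores per-packet overhead.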
Infrastructure: Private Interconnect
• Network between the nodes of a RAC cluster MUST be private
• Supported links: GbE, IB (IPoIB: 10.2)
• Supported transport protocols: UDP, RDS (10.2.0.3)
• Use multiple or dual-ported NICs for redundancy and increase bandwidth with NIC bonding
• Large (Jumbo) Frames for GbE recommended if the global cache workload requires it.
• global cache block shipping versus small lock
Network Packet Processing: Layers, Queues and Buffers
[Diagram: path of a message between two processes (FG/LMS*), split into USER and KERNEL space and annotated with the places where packets queue. Transmit side: socket layer (tx buffers), TX socket queues, TCP/UDP, IP layer, TX IP queues, interface layer. L2/L3 switch: ingress, egress. Receive side: interface layer with hardware interrupts, RX queues and RX buffers; software interrupts; RX IP input queue; socket layer (rx buffers), socket buffers. Further annotations: backplane pressure, CPU.]
Infrastructure: IPC configuration
Important Settings:
• Negotiated top bit rate and full duplex mode
• NIC ring buffers
• Ethernet flow control settings
• CPU(s) receiving network interrupts
Verify your setup:
• CVU does checking
• Load testing eliminates potential for problems
• AWR and ADDM give estimations of link utilization
Buffer overflows, congested links and flow control can have severe consequences for performance
Infrastructure: Operating System
Block access latencies increase when CPU(s) busy and run queues are long
• Immediate LMS scheduling is critical for predictable block access latencies when CPU > 80% busy
Fewer and busier LMS processes may be more efficient.
• monitor their CPU utilization
• Caveat: 1 LMS can be good for runtime performance but may impact cluster reconfiguration and instance recovery time
• the default is good for most requirements
Higher priority for LMS is default
Common Problems and Symptoms
“Lost Blocks”: Interconnect or Switch Problems
System load and scheduling
Contention
Misconfigured or Faulty Interconnect Can Cause:
Dropped packets/fragments
Buffer overflows
Packet reassembly failures or timeouts
Ethernet Flow control kicks in
TX/RX errors
“lost blocks” at the RDBMS level, responsible for 64% of escalations
“Lost Blocks”: NIC Receive Errors
Db_block_size = 8K
ifconfig -a:
eth0  Link encap:Ethernet  HWaddr 00:0B:DB:4B:A2:04
      inet addr:130.35.25.110  Bcast:130.35.27.255  Mask:255.255.252.0
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95
      TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0
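A minimal sketch of how the RX/TX counters in output like the above could be extracted for monitoring. It assumes the older Linux `ifconfig` line format shown on the slide; the function name and the sample string are illustrative only.

```python
import re

SAMPLE = """RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95
TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0"""

def parse_counters(text: str) -> dict:
    """Return {'RX': {...}, 'TX': {...}} with integer counters per direction."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"\s*(RX|TX) packets:(\d+)(.*)", line)
        if not match:
            continue
        fields = {"packets": int(match.group(2))}
        # Remaining "name:value" pairs: errors, dropped, overruns, frame/carrier
        for name, value in re.findall(r"(\w+):(\d+)", match.group(3)):
            fields[name] = int(value)
        counters[match.group(1)] = fields
    return counters

counters = parse_counters(SAMPLE)
rx = counters["RX"]
print(f"RX errors: {rx['errors']} (frame: {rx['frame']}) out of {rx['packets']} packets")
```

Sampling these counters twice and comparing the deltas shows whether receive errors are still accumulating or are left over from an earlier incident.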
“Lost Blocks”: IP Packet Reassembly Failures
netstat -s
Ip:
84884742 total packets received
…
1201 fragments dropped after timeout
…
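Reassembly failures matter because a block larger than the MTU is split into several IP fragments, and the loss of any one fragment discards the whole datagram once the reassembly timer expires. The sketch below estimates the fragment count, assuming the block travels as a single UDP datagram over IPv4 without options and that the payload equals db_block_size; real messages carry additional protocol headers, so treat the result as a lower bound.

```python
import math

IP_HEADER_BYTES = 20    # IPv4 header without options
UDP_HEADER_BYTES = 8

def ip_fragments(udp_payload_bytes: int, mtu: int) -> int:
    """Number of IP fragments needed for one UDP datagram at the given MTU."""
    datagram = udp_payload_bytes + UDP_HEADER_BYTES
    # Every fragment except the last carries a multiple of 8 data bytes.
    per_fragment = (mtu - IP_HEADER_BYTES) // 8 * 8
    return math.ceil(datagram / per_fragment)

print(ip_fragments(8192, 1500))   # standard Ethernet MTU
print(ip_fragments(8192, 9000))   # jumbo frames
```

Under these assumptions an 8K block needs six fragments at MTU 1500 and fits into a single frame at MTU 9000.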
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~
Event               Waits     Time(s)   Avg wait (ms)   %Total Call Time   Wait Class
----------------    -------   -------   -------------   ----------------   ----------
log file sync       286,038   49,872    174             41.7               Commit
gc buffer busy      177,315   29,021    164             24.3               Cluster
gc cr block busy    110,348   5,703     52              4.8                Cluster
gc cr block lost    4,272     4,953     1159            4.1                Cluster
cr request retry    6,316     4,668     739             3.9                Other
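The average wait column is simply total wait time divided by the number of waits. The sketch below recomputes it from the figures in the report above; the variable and function names are illustrative.

```python
# (event, waits, total wait time in seconds) as shown in the report
EVENTS = [
    ("log file sync",    286_038, 49_872),
    ("gc buffer busy",   177_315, 29_021),
    ("gc cr block busy", 110_348,  5_703),
    ("gc cr block lost",   4_272,  4_953),
    ("cr request retry",   6_316,  4_668),
]

def avg_wait_ms(waits: int, time_s: float) -> float:
    """Average wait per event in milliseconds."""
    return time_s / waits * 1000

for name, waits, time_s in EVENTS:
    print(f"{name:<18} {avg_wait_ms(waits, time_s):8.0f} ms")
```

The lost-block event stands out: few waits, but each one costs more than a second on average.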
Finding a Problem with the Interconnect or IPC
Global Cache Lost block handling
Detection Time in 11g reduced
• 500ms ( around 5 secs in 10g )
• can be lowered if necessary
• robust ( no false positives )
• no extra overhead
cr request retry event related to lost blocks
• It is highly likely to appear when gc cr blocks are lost
Interconnect Statistics
Automatic Workload Repository (AWR )
Target     Avg Latency   Stddev     Avg Latency   Stddev
Instance   500B msg      500B msg   8K msg        8K msg
--------   -----------   --------   -----------   ------
1          .79           .65        1.04          1.06
2          .75           .57        .95           .78
3          .55           .59        .53           .59
4          1.59          3.16       1.46          1.82

Latency probes for different message sizes
Exact throughput measurements (not shown)
“Blocks Lost”: Solution
Fix interconnect NICs and switches
Tune IPC buffer sizes
CPU Saturation or Long Run Queues
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~
Event                        Waits       Time(s)   Avg wait (ms)   %Total Call Time   Wait Class
--------------------------   ---------   -------   -------------   ----------------   ----------
db file sequential read      1,312,840   21,590    16              21.8               User I/O
gc current block congested   275,004     21,054    77              21.3               Cluster
gc cr grant congested        177,044     13,495    76              13.6               Cluster
gc current block 2-way       1,192,113   9,931     8               10.0               Cluster
gc cr block congested        85,975      8,917     104             9.0                Cluster

“Congested”: LMS could not dequeue messages fast enough
High CPU Load: Solution
Run LMS at higher priority (default)
Start more LMS processes
• Never use more LMS processes than CPUs
Reduce the number of user processes
Find cause of high CPU consumption
Contention
Event                    Waits     Time (s)   AVG (ms)   % Call Time
----------------------   -------   --------   --------   -----------
gc cr block 2-way        317,062   5,767      18         19.0
gc current block 2-way   201,663   4,063      20         13.4
gc buffer busy           111,372   3,970      36         13.1
CPU time                           2,938                 9.7
gc cr block busy         40,688    1,670      41         5.5

Global Contention on Data
Serialization
Contention: Solution
Identify “hot” blocks in application
Reduce concurrency on hot blocks
High Latencies
Event                    Waits     Time (s)   AVG (ms)   % Call Time
----------------------   -------   --------   --------   -----------
gc cr block 2-way        317,062   5,767      18         19.0
gc current block 2-way   201,663   4,063      20         13.4
gc buffer busy           111,372   3,970      36         13.1
CPU time                           2,938                 9.7
gc cr block busy         40,688    1,670      41         5.5

Tackle latency first, then tackle busy events
Expected: To see 2-way, 3-way
High Latencies: Solution
Check network configuration
• Private
• Running at expected bit rate
Find cause of high CPU consumption
Health Check
Look for:
• Unexpected Events
  – gc cr block lost 1159 ms
• Unexpected “Hints”
  – Contention and Serialization: gc cr/current block busy 52 ms
  – Load and Scheduling: gc current block congested 14 ms
• Unexpected high avg
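The health check amounts to reading the suffix of each global cache wait event as a hint to the problem area. The sketch below encodes that mapping as described in this section; the function and dictionary names are illustrative, and the mapping is a simplification rather than an exhaustive diagnostic.

```python
# Keyword in the event name -> problem area named on the slides
HINTS = {
    "lost": "interconnect or switch problem",
    "busy": "contention and serialization",
    "congested": "load and scheduling",
}

def hint_for(event_name: str):
    """Return the problem area suggested by a wait event name, or None."""
    words = event_name.lower().split()
    for keyword, area in HINTS.items():
        if keyword in words:
            return area
    return None

for event in ("gc cr block lost", "gc buffer busy",
              "gc current block congested", "gc cr block 2-way"):
    print(f"{event:<28} -> {hint_for(event) or 'expected event, check latency'}")
```

Events without a hint keyword, such as the 2-way and 3-way transfers, are the expected ones; for those only the average latency needs a look.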
Gigabit Ethernet Definition
Max Bandwidth
• 1000 Mbits per sec = 125 MB per sec; 118 MB per sec once header and pause frames are excluded
• Equates to 85000 Clusterware/RAC messages or 14000 8k blocks
• RAC workload has a mix of short messages of 256 bytes and long messages of db_block_size
• For real life workload only 60-70% of the bandwidth can be sustained
• For RAC type workload 40 MB per sec per interface is optimal load
• For additional bandwidth more interfaces can be aggregated
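The 40 MB per sec rule of thumb above translates into a simple interface count. The sketch below is an assumption-laden illustration: the function name is invented, the required throughput is an input you would take from measurement, and redundancy (an extra standby interface) is not included.

```python
import math

OPTIMAL_MB_PER_INTERFACE = 40   # optimal RAC load per GbE interface, from the slide

def interfaces_needed(required_mb_per_sec: float) -> int:
    """Aggregated GbE interfaces needed to stay at or below the optimal load."""
    return max(1, math.ceil(required_mb_per_sec / OPTIMAL_MB_PER_INTERFACE))

for load in (25, 40, 100):
    print(f"{load} MB/sec -> {interfaces_needed(load)} interface(s)")
```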
Aggregation Active/Standby
[Diagram: 1U server with two NICs, ce2 and ce4; logical interface ce4:1 on ce4]
$ --> ifconfig -a
ce2: flags=69040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,STANDBY,INACTIVE> mtu 1500 index 3
        inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 8
        inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 8
        inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255
Aggregation Active/Active
[Diagram: 1U server with two NICs, ce2 and ce4; logical interface ce4:1 on ce4]
$ --> ifconfig -a
ce2: flags=69040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 3
        inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 8
        inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 8
        inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255
Aggregation Active/Standby
[Diagram: two 1U servers, each with NICs ce2, ce4, ce8 and ce10; logical interface ce4:1 on ce4]
$ --> ifconfig -a
ce10: flags=69040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,STANDBY,INACTIVE> mtu 1500 index 3
        inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 8
        inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 8
        inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255
Aggregation Solutions
Cisco Etherchannel based 802.3ad
AIX Etherchannel
HPUX Auto Port Aggregation
SUN Trunking, IPMP, GLD
Linux Bonding (only certain modes)
Windows NIC teaming
Aggregation Methods
Load balance/failover/load spreading
• spread on sends/serialize on receives
Active/Standby
Oracle Interconnect Requirement
• Both Send/Receive side load balancing
General Interconnect requirement Recommendations
• For OLTP workloads
  – Normally 1 Gbit Ethernet with redundancy (active/standby or load-balance) is sufficient
• For DW workloads
  – Multiple GigE aggregated, 10 GigE or Infiniband
Oracle RAC Cluster Interconnect network selection
Oracle Clusterware
• IP address associated with Private Hostname (provided during Install interview)
Oracle RAC Database
• Private Network specified during the Install interview
Jumbo Frames
• Non-IEEE standard
• Useful for NAS/iSCSI storage
• Network device inter-operability issues
• Configure with care and test rigorously
Excerpt from alert.log:
Maximum Tranmission Unit (mtu) of the ether
adapter is different on the node running instance 4, and this node. Ether adapters connecting the
cluster nodes must be configured with identical mtu on all the nodes, for Oracle. Please ensure the mtu attribute of the ether adapter on all nodes [and
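The alert.log message above is raised when interconnect MTUs differ between nodes. A pre-flight check along the following lines could catch that earlier; this is a sketch only, and the node names, the input dictionary and the function name are hypothetical. How the MTU values are collected from each node is left out.

```python
from collections import Counter

def mtu_mismatches(mtu_by_node: dict) -> list:
    """Return the nodes whose interconnect MTU differs from the most common value."""
    if not mtu_by_node:
        return []
    majority_mtu, _ = Counter(mtu_by_node.values()).most_common(1)[0]
    return sorted(node for node, mtu in mtu_by_node.items() if mtu != majority_mtu)

# Hypothetical four-node cluster where one adapter was left at the default MTU
cluster = {"node1": 9000, "node2": 9000, "node3": 1500, "node4": 9000}
print(mtu_mismatches(cluster))
```

The same comparison has to include the switch ports, since an MTU that matches on every node but not on the switch fails in the same way.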
UDP Socket Buffer (rx)
Default settings adequate for majority of customers
May need to increase allocated buffer size
• MTU size increases
• netstat reporting fragmentation and/or reassembly errors
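To see what receive buffer a UDP socket actually gets on a given host, the standard socket options can be queried. The sketch below uses Python's `socket` module; the function name is illustrative. The operating system may round, double or cap a requested size according to its own limits, so the returned value is what matters, not the requested one.

```python
import socket

def udp_rcvbuf_bytes(requested=None) -> int:
    """Return the receive buffer size of a fresh UDP socket.

    If 'requested' is given, ask the OS for that size first; the OS decides
    what is actually granted.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        if requested is not None:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

print("default rx buffer:", udp_rcvbuf_bytes())
print("after requesting 256 KB:", udp_rcvbuf_bytes(256 * 1024))
```

If the granted size stays well below the request, the system-wide maximum is the limiting factor and would need to be raised first.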
Cluster Interconnect NIC settings
• NIC driver dependent – DEFAULTS GENERALLY SATISFACTORY
• Changes can occur between OS versions
• Linux 2.4 => 2.6 kernels, flow control on e1000 drivers
• NAPI interrupt coalescence in 2.6
• Confirm flow control: rx=on, tx=off
• Confirm full bit rate (1000) for the NICs
• Confirm full duplex auto-negotiate
• Ensure NIC names/slots identical on all nodes
• Configure interconnect NICs on fastest PCI bus
• Ensure compatible switch settings
• 802.3ad on NICs = 802.3ad on switch ports
• MTU=9000 on NICs = MTU=9000 on switch ports
FAILURE TO CONFIGURE THE NICS AND SWITCHES CORRECTLY WILL RESULT IN SEVERE PERFORMANCE DEGRADATION AND NODE FENCING
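The checklist above can be turned into a small consistency check. The sketch below is illustrative only: the dictionary keys and the function name are assumptions, and gathering the actual values from the NIC driver and the switch is platform specific and not shown.

```python
# Values the checklist asks to confirm on each interconnect NIC
EXPECTED_NIC = {
    "speed_mbps": 1000,
    "duplex": "full",
    "flow_control_rx": "on",
    "flow_control_tx": "off",
}
# Settings that must be identical on the NIC and its switch port
MUST_MATCH_SWITCH = ("mtu", "aggregation")

def nic_findings(nic: dict, switch_port: dict) -> list:
    """Return a list of human-readable deviations from the checklist."""
    findings = []
    for key, want in EXPECTED_NIC.items():
        if nic.get(key) != want:
            findings.append(f"{key}: expected {want}, found {nic.get(key)}")
    for key in MUST_MATCH_SWITCH:
        if nic.get(key) != switch_port.get(key):
            findings.append(
                f"{key}: NIC has {nic.get(key)}, switch port has {switch_port.get(key)}")
    return findings

nic = {"speed_mbps": 1000, "duplex": "full", "flow_control_rx": "on",
       "flow_control_tx": "off", "mtu": 9000, "aggregation": "802.3ad"}
port = {"mtu": 1500, "aggregation": "802.3ad"}
print(nic_findings(nic, port))
```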
The Interconnect and VLANs
• Interconnect should be dedicated non-routable subnet mapped to a single dedicated, non-shared VLAN
• If VLANs are ‘trunked’ the interconnect VLAN traffic should not exceed the access switch layer
• Minimize the impact of Spanning Tree events
• Monitor the switch(es) for congestion
• Avoid QoS definitions that may negatively impact