(1)

Oracle Clusterware and Private Network Considerations

Much of this presentation is attributed to Michael Zoll and work done by the RAC Performance Development group

(2)

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions.

The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

(3)

Architectural Overview

RAC and Cache Fusion Performance

Infrastructure

Common Problems and Resolution

Aggregation and VLANs

(4)

Oracle Clusterware

[Diagram: each node runs the OS with the Oracle Clusterware daemons EVMD, CRSD, OPROCD, ONS and CSSD; CSSD runs in real-time priority. The nodes are reached through VIP1, VIP2 … VIPn on the public network, are linked by the cluster private high-speed network, and share storage that holds the OCR and voting disks on raw devices.]

(5)

Under the Covers

[Diagram: Node 1, Node 2 … Node n, each hosting one instance (Instance 1, Instance 2 … Instance n) with its own redo log files. Each instance's SGA holds the buffer cache, library cache, dictionary cache, log buffer and Global Resource Directory, and is served by the processes LMON, LMD0, LMS0, DIAG, VKTM, LGWR, DBW0, SMON and PMON. A “Runs in Real Time Priority” callout marks one of the processes. The nodes are linked by the cluster private high-speed network.]

(6)

Global Cache Service (GCS)

Manages coherent access to data in buffer caches of all instances in the cluster

Minimizes access time to data which is not in local cache

Access to data in global cache is faster than disk access

Implements fast direct memory access over high-speed interconnects for all data blocks and types

Uses an efficient and scalable messaging protocol

Never more than 3 hops

(7)

Cache Hierarchy: Data in Remote Cache

Local Cache Miss Datablock Requested

Datablock Returned Remote Cache Hit

(8)

Cache Hierarchy: Data On Disk

Local Cache Miss Datablock Requested

Grant Returned Remote Cache Miss

(9)

Cache Hierarchy: Read Mostly

Local Cache Miss

(10)

11.1

CPU optimizations for read-intensive operations

Read-only access: no messages, direct reads

Read-mostly access: message reductions, latency improvements

Significant gains

(11)

Performance of Cache Fusion

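Message: ~200 bytes

Block: e.g. 8K

[Diagram: the requester initiates the send and waits; LMS receives the message, processes the block and sends it; the requester receives the block.]

Message transfer: 200 bytes / (1 Gb/sec)

Block transfer: 8192 bytes / (1 Gb/sec)

Total access time: e.g. ~360 microseconds (UDP over GbE)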
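
As a rough check of the transfer terms above, the time spent on the wire can be computed directly. This is a minimal sketch that assumes an ideal 1 Gb/sec link with no protocol headers or framing overhead; the function name is illustrative and not part of the presentation.

```python
# Serialization (wire) time on an ideal link, ignoring protocol headers.
LINK_BITS_PER_SEC = 1_000_000_000  # 1 Gb/sec, as on the slide


def wire_time_us(payload_bytes: int, link_bps: int = LINK_BITS_PER_SEC) -> float:
    """Microseconds needed to put payload_bytes on the wire."""
    return payload_bytes * 8 / link_bps * 1_000_000


request_us = wire_time_us(200)   # ~200-byte request message
block_us = wire_time_us(8192)    # 8K data block
total_wire_us = request_us + block_us

print(f"request:    {request_us:.1f} us")
print(f"block:      {block_us:.1f} us")
print(f"wire total: {total_wire_us:.1f} us of ~360 us")
```

Under these assumptions the wire accounts for roughly 67 of the ~360 microseconds; the remainder is spent outside the wire, in the send, receive and block-processing steps.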

(12)

Fundamentals: Minimum Latency (*), UDP/GbE and RDS/IB

Roundtrip time (ms) by block size:

Block size   2K     4K     8K     16K
UDP/GE       0.30   0.31   0.36   0.46
RDS/IB       0.12   0.13   0.16   0.20

(*) Roundtrip; blocks are not “busy”, i.e. no log flush and no serialization (“buffer busy”)

AWR and Statspack reports show averages as if the values were normally distributed; the session wait history, which is included in Statspack in 10.2 and AWR in 11g, shows the actual quantiles.

The minimum values in this table are the optimal values for 2-way and 3-way block transfers, but they can also be taken as the expected values (i.e. 10 ms for a 2-way block would be very high).

(13)

[Diagram: the path of a message between two processes (FG/LMS*). Sender: socket layer (TX buffers), TCP/UDP, IP layer, interface layer (TX). An L2/L3 switch connects the nodes. Receiver: interface layer (RX), IP layer, TCP/UDP, socket layer (RX buffers).]

(14)

Infrastructure: Interconnect Bandwidth

Bandwidth requirements depend on several factors (e.g. buffer cache size, number of CPUs per node, access patterns) and cannot be predicted precisely for every application

Typical utilization approx. 10-30% in OLTP

10000-12000 8K blocks per sec to saturate 1 x Gb Ethernet (75-80% of theoretical bandwidth)

Generally, 1 Gb/sec is sufficient for performance and scalability in OLTP

DSS/DW systems should be designed with > 1 Gb/sec capacity

A sizing approach with rules of thumb is described in Project MegaGrid: Capacity Planning for Large
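
The saturation figure can be approximated with simple arithmetic. The sketch below is an illustration only: it assumes 1 Gb = 10^9 bits and counts block payload alone, with no per-packet headers or message traffic, so it lands near the upper end of the 10000-12000 range quoted above. The function name is hypothetical.

```python
# Blocks per second needed to fill a link at a given usable fraction.
def blocks_per_sec(link_mbit: float, usable_fraction: float, block_bytes: int) -> float:
    usable_bytes = link_mbit * 1_000_000 / 8 * usable_fraction
    return usable_bytes / block_bytes


low = blocks_per_sec(1000, 0.75, 8192)   # 75% of theoretical bandwidth
high = blocks_per_sec(1000, 0.80, 8192)  # 80% of theoretical bandwidth
print(f"{low:.0f} to {high:.0f} 8K blocks/sec")
```

The same function can be reused for other block sizes or link speeds when estimating whether a planned workload fits a single interface.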

(15)

Infrastructure: Private Interconnect

Network between the nodes of a RAC cluster MUST be private

Supported links: GbE, IB (IPoIB: 10.2)

Supported transport protocols: UDP, RDS (10.2.0.3)

Use multiple or dual-ported NICs for redundancy, and increase bandwidth with NIC bonding

Large (Jumbo) Frames for GbE recommended if the global cache workload requires it

global cache block shipping versus small lock

(16)

Network Packet Processing: Layers, Queues and Buffers

[Diagram: the path between two processes (FG/LMS*), split into USER and KERNEL space, with the queues and buffers at each layer. Send side: socket layer (TX buffers), TX socket queues, TCP/UDP, IP layer, interface layer. Switch (L2/L3): ingress, egress. Receive side: interface layer with hardware interrupts, RX queues and RX buffers; software interrupts; RX IP input queue; TCP/UDP; socket layer (RX buffers). Additional labels: backplane pressure, CPU, socket buffers, TX IP queues.]

(17)

Infrastructure: IPC configuration

Important Settings:

Negotiated top bit rate and full duplex mode

NIC ring buffers

Ethernet flow control settings

CPU(s) receiving network interrupts

Verify your setup:

CVU does checking

Load testing eliminates potential for problems

AWR and ADDM give estimations of link utilization

Buffer overflows, congested links and flow control can have severe consequences for performance

(18)

Infrastructure: Operating System

Block access latencies increase when CPU(s) are busy and run queues are long

Immediate LMS scheduling is critical for predictable block access latencies when CPU > 80% busy

Fewer and busier LMS processes may be more efficient

Monitor their CPU utilization

Caveat: 1 LMS can be good for runtime performance but may impact cluster reconfiguration and instance recovery time

The default is good for most requirements

Higher priority for LMS is default

(19)

Common Problems and Symptoms

“Lost Blocks”: Interconnect or Switch Problems

System load and scheduling

Contention

(20)

Misconfigured or Faulty Interconnect Can Cause:

Dropped packets/fragments

Buffer overflows

Packet reassembly failures or timeouts

Ethernet flow control kicks in

TX/RX errors

“Lost blocks” at the RDBMS level, responsible for 64% of escalations

(21)

“Lost Blocks”: NIC Receive Errors

db_block_size = 8K

ifconfig -a:

eth0  Link encap:Ethernet  HWaddr 00:0B:DB:4B:A2:04
      inet addr:130.35.25.110  Bcast:130.35.27.255  Mask:255.255.252.0
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95
      TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0
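
The receive errors in this output can also be picked out programmatically. The following is a minimal sketch, assuming the older Linux net-tools `ifconfig` layout shown on the slide and a single interface in the text; the helper names are illustrative and other `ifconfig` versions format these counters differently.

```python
import re

# Sample text copied from the slide (Linux net-tools ifconfig layout).
SAMPLE = """\
eth0 Link encap:Ethernet HWaddr 00:0B:DB:4B:A2:04
inet addr:130.35.25.110 Bcast:130.35.27.255 Mask:255.255.252.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95
TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0
"""

COUNTER_LINE = re.compile(r"^\s*(RX|TX) packets:(\d+)((?: \w+:\d+)+)", re.M)


def parse_counters(text: str) -> dict:
    """Return {'RX': {...}, 'TX': {...}} with integer counters."""
    result = {}
    for direction, packets, rest in COUNTER_LINE.findall(text):
        counters = {"packets": int(packets)}
        for name, value in re.findall(r"(\w+):(\d+)", rest):
            counters[name] = int(value)
        result[direction] = counters
    return result


def nonzero_problems(counters: dict) -> dict:
    """Keep only the error-type counters that are above zero."""
    return {k: v for k, v in counters.items() if k != "packets" and v > 0}


stats = parse_counters(SAMPLE)
print(nonzero_problems(stats["RX"]))  # {'errors': 135, 'frame': 95}
print(nonzero_problems(stats["TX"]))  # {}
```

Here only the receive side reports problems (errors and frame), which matches the slide's point that lost blocks can originate from NIC receive errors.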

(22)

“Lost Blocks”: IP Packet Reassembly Failures

netstat -s

Ip:

   84884742 total packets received

   …

1201 fragments dropped after timeout

   …
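
A small helper can pull the reassembly counter out of saved `netstat -s` text. This is a sketch under assumptions: the counter labels are taken exactly as they appear on the slide, and their wording varies between operating systems and versions. The function name is illustrative.

```python
import re

# Lines taken from the slide; the elided rows are left out.
SAMPLE = """\
Ip:
    84884742 total packets received
    1201 fragments dropped after timeout
"""


def ip_counter(text: str, label: str) -> int:
    """Return the integer printed before `label`, or 0 if the label is absent."""
    match = re.search(r"^\s*(\d+) " + re.escape(label) + r"\s*$", text, re.M)
    return int(match.group(1)) if match else 0


received = ip_counter(SAMPLE, "total packets received")
timed_out = ip_counter(SAMPLE, "fragments dropped after timeout")
print(f"{timed_out} fragment timeouts per {received} packets received")
```

These counters are cumulative, so comparing two samples taken some time apart shows whether the count is still growing.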

(23)

Finding a Problem with the Interconnect or IPC

Top 5 Timed Events

Event                 Waits  Time(s)  Avg wait (ms)  %Total Call Time  Wait Class
----------------  ---------  -------  -------------  ----------------  ----------
log file sync       286,038   49,872            174              41.7  Commit
gc buffer busy      177,315   29,021            164              24.3  Cluster
gc cr block busy    110,348    5,703             52               4.8  Cluster
gc cr block lost      4,272    4,953           1159               4.1  Cluster
cr request retry      6,316    4,668            739               3.9  Other
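
The average wait in this report is the total wait time divided by the number of waits. The sketch below recomputes it from the figures in the Top 5 Timed Events listing; the rounding to whole milliseconds is an assumption chosen to match the report.

```python
# (event, waits, total wait time in seconds) from the Top 5 Timed Events listing.
EVENTS = [
    ("log file sync", 286_038, 49_872),
    ("gc buffer busy", 177_315, 29_021),
    ("gc cr block busy", 110_348, 5_703),
    ("gc cr block lost", 4_272, 4_953),
    ("cr request retry", 6_316, 4_668),
]


def avg_wait_ms(waits: int, time_s: float) -> int:
    """Average wait per event in milliseconds, rounded to a whole number."""
    return round(time_s * 1000 / waits)


averages = {name: avg_wait_ms(waits, secs) for name, waits, secs in EVENTS}
for name, ms in averages.items():
    print(f"{name:<18} {ms:>5} ms")
```

The gc cr block lost event has few waits, but each one costs more than a second, compared with the 0.36 ms minimum roundtrip quoted earlier for an 8K block over UDP/GE.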

(24)

Global Cache Lost block handling

Detection time reduced in 11g

500 ms (around 5 secs in 10g)

Can be lowered if necessary

Robust (no false positives)

No extra overhead

The cr request retry event is related to lost blocks

It is highly likely to appear together with gc cr block lost

(25)

Interconnect Statistics

Automatic Workload Repository (AWR)

Target     500B msg     500B msg  8K msg       8K msg
Instance   Avg Latency  Stddev    Avg Latency  Stddev
--------   -----------  --------  -----------  ------
1           .79          .65      1.04         1.06
2           .75          .57       .95          .78
3           .55          .59       .53          .59
4          1.59         3.16      1.46         1.82

Latency probes for different message sizes

Exact throughput measurements (not shown)

(26)

“Blocks Lost”: Solution

Fix interconnect NICs and switches

Tune IPC buffer sizes

(27)

CPU Saturation or Long Run Queues

Top 5 Timed Events

Event                           Waits  Time(s)  Avg wait (ms)  %Total Call Time  Wait Class
--------------------------  ---------  -------  -------------  ----------------  ----------
db file sequential read     1,312,840   21,590             16              21.8  User I/O
gc current block congested    275,004   21,054             77              21.3  Cluster
gc cr grant congested         177,044   13,495             76              13.6  Cluster
gc current block 2-way      1,192,113    9,931              8              10.0  Cluster
gc cr block congested          85,975    8,917            104               9.0  Cluster

“Congested”: LMS could not dequeue messages fast enough

(28)

High CPU Load: Solution

Run LMS at higher priority (default)

Start more LMS processes

Never use more LMS processes than CPUs

Reduce the number of user processes

Find cause of high CPU consumption

(29)

Contention

Event                     Waits  Time (s)  AVG (ms)  % Call Time
----------------------  -------  --------  --------  -----------
gc cr block 2-way       317,062     5,767        18         19.0
gc current block 2-way  201,663     4,063        20         13.4
gc buffer busy          111,372     3,970        36         13.1
CPU time                            2,938                    9.7
gc cr block busy         40,688     1,670        41          5.5

Global Contention on Data

Serialization

(30)

Contention: Solution

Identify “hot” blocks in application

Reduce concurrency on hot blocks

(31)

High Latencies

Event                     Waits  Time (s)  AVG (ms)  % Call Time
----------------------  -------  --------  --------  -----------
gc cr block 2-way       317,062     5,767        18         19.0
gc current block 2-way  201,663     4,063        20         13.4
gc buffer busy          111,372     3,970        36         13.1
CPU time                            2,938                    9.7
gc cr block busy         40,688     1,670        41          5.5

Tackle latency first, then tackle busy events

Expected: To see 2-way, 3-way

(32)

High Latencies : Solution

Check network configuration

Private

Running at expected bit rate

Find cause of high CPU consumption

(33)

Health Check

Look for:

Unexpected Events

gc cr block lost 1159 ms

Unexpected “Hints”

Contention and Serialization

gc cr/current block busy 52 ms

Load and Scheduling

gc current block congested 14 ms

Unexpected high avg

(34)

Gigabit Ethernet Definition

Max Bandwidth

1000 Mbit/sec = 125 MB per sec; excluding header and pause frames, 118 MB per sec

Equates to 85000 Clusterware/RAC messages or 14000 8K blocks

RAC workload is a mix of short messages of 256 bytes and long messages of db_block_size

For real-life workloads only 60-70% of the bandwidth can be sustained

For RAC-type workloads, 40 MB per sec per interface is the optimal load

For additional bandwidth, more interfaces can be aggregated
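
The block figure and the per-interface guideline above lend themselves to a quick calculation. This is a rough sketch, assuming 1 MB = 10^6 bytes and payload-only traffic; the 100 MB per sec target is a hypothetical input and the function names are illustrative.

```python
import math

RAW_MB_PER_SEC = 1000 / 8   # 1000 Mbit/sec = 125 MB per sec
NET_MB_PER_SEC = 118        # after header and pause frames (slide figure)
OPTIMAL_MB_PER_IF = 40      # slide's optimal RAC load per interface


def blocks_per_sec(mb_per_sec: float, block_bytes: int = 8192) -> int:
    """Whole blocks per second that fit into mb_per_sec (1 MB = 10**6 bytes)."""
    return int(mb_per_sec * 1_000_000 // block_bytes)


def interfaces_needed(target_mb_per_sec: float) -> int:
    """Interfaces to aggregate so each stays at or below the optimal load."""
    return max(1, math.ceil(target_mb_per_sec / OPTIMAL_MB_PER_IF))


print(f"{NET_MB_PER_SEC / RAW_MB_PER_SEC:.1%} of raw bandwidth is usable")
print(blocks_per_sec(NET_MB_PER_SEC))  # 14404, close to the 14000 on the slide
print(interfaces_needed(100))          # hypothetical 100 MB per sec target: 3
```

The message figure of 85000 is not reproduced here because it cannot be derived from byte counts alone.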

(35)

Aggregation Active/Standby

[Diagram: a “1 U” unit with two NICs, ce2 and ce4, plus the logical interface ce4:1.]

$ --> ifconfig -a

ce2: flags=69040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,STANDBY,INACTIVE> mtu 1500 index 3
        inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 8
        inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 8
        inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255

(36)

Aggregation Active/Active

[Diagram: a “1 U” unit with two NICs, ce2 and ce4, plus the logical interface ce4:1.]

$ --> ifconfig -a

ce2: flags=69040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 3
        inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 8
        inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 8
        inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255

(37)

Aggregation Active/Standby

[Diagram: two “1 U” units, each with two NICs and the interfaces ce2, ce4, ce8, ce10 and ce4:1.]

$ --> ifconfig -a

ce10: flags=69040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,STANDBY,INACTIVE> mtu 1500 index 3
        inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 8
        inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
        groupname private
ce4:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 8
        inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255

(38)

Aggregation Solutions

Cisco Etherchannel based 802.3ad

AIX Etherchannel

HPUX Auto Port Aggregation

SUN Trunking, IPMP, GLD

Linux Bonding (only certain modes)

Windows NIC teaming

(39)

Aggregation Methods

Load balance/failover/load spreading

spread on sends/serialize on receives

Active/Standby

Oracle Interconnect Requirement

Both Send/Receive side load balancing

(40)

General Interconnect Recommendations

For OLTP workloads

Normally 1 Gbit Ethernet with redundancy (active/standby or load-balance) is sufficient

For DW workloads

Multiple GigE aggregated

10 GigE or InfiniBand

(41)

Oracle RAC Cluster Interconnect  network selection

Oracle Clusterware

IP address associated with Private Hostname (provided during Install interview)

Oracle RAC Database

Private Network specified during the Install interview

(42)

Jumbo Frames

Non-IEEE standard

Useful for NAS/iSCSI storage

Network device inter-operability issues

Configure with care and test rigorously

Excerpt from alert.log:

Maximum Tranmission Unit (mtu) of the ether adapter is different on the node running instance 4, and this node. Ether adapters connecting the cluster nodes must be configured with identical mtu on all the nodes, for Oracle. Please ensure the mtu attribute of the ether adapter on all nodes [and
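
The requirement in this message, identical MTU on every node, is easy to check once the values have been collected. The sketch below is illustrative only: how the MTU values are gathered is not covered, and the node names and values are hypothetical.

```python
from collections import Counter


def mtu_mismatches(mtu_by_node: dict) -> dict:
    """Return nodes whose MTU differs from the most common value."""
    if not mtu_by_node:
        return {}
    expected, _ = Counter(mtu_by_node.values()).most_common(1)[0]
    return {node: mtu for node, mtu in mtu_by_node.items() if mtu != expected}


# Hypothetical cluster: node names and values are illustrative only.
cluster = {"node1": 9000, "node2": 9000, "node3": 9000, "node4": 1500}
print(mtu_mismatches(cluster))  # {'node4': 1500}
```

The same comparison applies to the switch ports, which need the same MTU as the NICs attached to them.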

(43)

UDP Socket Buffer (rx)

Default settings adequate for majority of customers

May need to increase allocated buffer size when:

MTU size increases

netstat reports fragmentation and/or reassembly errors

(44)

Cluster Interconnect NIC settings

NIC driver dependent – DEFAULTS GENERALLY SATISFACTORY

Changes can occur between OS versions

Linux 2.4 => 2.6 kernels, flow control on e1000 drivers

NAPI interrupt coalescence in 2.6

Confirm flow control: rx=on, tx=off

Confirm full bit rate (1000) for the NICs

Confirm full duplex auto-negotiate

Ensure NIC names/slots identical on all nodes

Configure interconnect NICs on fastest PCI bus

Ensure compatible switch settings

802.3ad on NICs = 802.3ad on switch ports

MTU=9000 on NICs = MTU=9000 on switch ports

FAILURE TO CONFIGURE THE NICS AND SWITCHES CORRECTLY WILL RESULT IN SEVERE PERFORMANCE DEGRADATION AND NODE FENCING

(45)

The Interconnect and VLANs

Interconnect should be a dedicated, non-routable subnet mapped to a single dedicated, non-shared VLAN

If VLANs are ‘trunked’, the interconnect VLAN traffic should not exceed the access switch layer

Minimize the impact of Spanning Tree events

Monitor the switch(es) for congestion

Avoid QoS definitions that may negatively impact

(46)
(47)
(48)

Questions & Answers
