RAC Important Concepts


Oracle Real Application Clusters (RAC)

This documentation is a guide to installing, configuring and administering Oracle 10g Real Application Clusters (RAC). Some of the topics discussed here have already been covered in my Oracle topic.

This site has been compiled from reading the following books and from real-world experience. If you are new to Oracle RAC, I highly recommend that you purchase these books, as they contain far more information than this web site does, and of course the official Oracle web site contains all the documentation you will ever need.

Please feel free to email me any constructive criticism you have of the site; any additional knowledge, or corrections to mistakes that I have made, would be most welcome.

1. HA, Clustering and OPS
High Availability and Clustering
Clustering
Oracle RAC History
Oracle Parallel Server Architecture

2. RAC Architecture
RAC Architecture Introduction
RAC Components
Disk architecture
Oracle Clusterware
Oracle Kernel Components
RAC Background Processes

3. RAC Installation, Configuration and Storage
RAC Installation

4. RAC Administration and Management
RAC Parameters
Starting and Stopping Instances
Undo Management
Temporary Tablespace
Redologs
Flashback
SRVCTL command
Services
Cluster Ready Services (CRS)
Oracle Cluster Registry (OCR)
Voting Disk

5. RAC Backups and Recovery
Introduction
Backup Basics
Instance Recovery
Crash Recovery

6. RAC Performance
RAC Performance
Partitioning Workload
RAC Wait Events
Enqueue Tuning
AWR and RAC
Cluster Interconnect

7. Global Resource Directory (GRD)
GRD introduction
Cache Coherency
Resources and Enqueues
Global Enqueue Services (GES)
Global Locks
Messaging
Global Cache Services (GCS)

8. Cache Fusion
Introduction
Ping
Past Image Blocks (PI)
Cache Fusion I
Cache Fusion in Operation

9. RAC Troubleshooting
Troubleshooting
Lamport Algorithm
Disable/Enable Oracle RAC
Performance Issues
Debugging Node Eviction
Debugging CRS and GSD

10. Adding and Removing nodes
Adding and removing nodes
Pre-Install Checking
Install CRS
Installing Oracle DB Software
Create the Database Instance
Removing a Node

11. RAC Cheat sheet
Cheatsheet
Useful Views/Tables
Useful Parameters
Processes
General Administration
CRS Administration
Voting Disk


1. HA, Clustering and OPS

High Availability and Clustering

When you have critical systems that need to be online 24x7, you need a High Availability (HA) solution. You have to weigh the risk associated with downtime against the cost of the solution: HA solutions are not cheap and they are not easy to manage. They also need to be thoroughly tested, because a failover may not happen in the real world for months. I had a solution that ran for almost a year before a hardware failure caused a failover; this is when your testing beforehand comes into play.

As I said before, HA comes with a price, and there are a number of HA technologies

• Fault Tolerance - this technology protects you from hardware failures, for example redundant PSUs, etc.

• Disaster Recovery - this technology protects from operational issues such as a data center becoming unavailable

• Disaster Tolerance - this technology is used to prepare for the above two, and is the most important of the three technologies.

Every company should plan for unplanned outages. Planning costs virtually nothing, and knowing what to do in a DR situation is half the battle. In many companies people make excuses not to design a DR plan (it costs too much, we don't have the redundant hardware, etc.). You cannot make these assumptions until you have designed a DR plan: the plan will highlight the risks and the costs that go with each risk, and then you can decide what you can and cannot afford. There is no excuse not to create a DR plan.

In large corporations you will sometimes hear the phrase "five nines". This phrase refers to the availability of a system and the (approximate) downtime that is allowed. The table below shows the downtime permitted at each level of uptime, up to five nines.

% Uptime | % Downtime | Downtime per year | Downtime per week

98 | 2 | 7.3 days | 3 hours 22 minutes

99 | 1 | 3.65 days | 1 hour 41 minutes

99.8 | 0.2 | 17 hours 30 minutes | 20 minutes

99.9 | 0.1 | 8 hours 45 minutes | 10 minutes

99.99 | 0.01 | 52.5 minutes | 1 minute

99.999 (five nines) | 0.001 | 5.25 minutes | 6 seconds

To achieve five nines your system is only allowed 5.25 minutes of downtime per year, or 6 seconds per week; in some HA designs it may take 6 seconds just to fail over.
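The figures in the table can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name `allowed_downtime_seconds` is my own, and it assumes a 365-day year and a 7-day week.

```python
HOURS_PER_YEAR = 365 * 24   # assumption: a non-leap year
HOURS_PER_WEEK = 7 * 24


def allowed_downtime_seconds(uptime_percent, period_hours):
    """Return the downtime budget, in seconds, over a period of the given length."""
    downtime_fraction = (100.0 - uptime_percent) / 100.0
    return downtime_fraction * period_hours * 3600


for uptime in (98, 99, 99.8, 99.9, 99.99, 99.999):
    per_year = allowed_downtime_seconds(uptime, HOURS_PER_YEAR)
    per_week = allowed_downtime_seconds(uptime, HOURS_PER_WEEK)
    print(f"{uptime}% uptime: {per_year / 60:.2f} min/year, {per_week:.1f} s/week")
```

The table rounds these results, which is why five nines appears as 5.25 minutes per year and 6 seconds per week.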

When looking for a solution you should try to build redundancy into your plan; this is the first step towards an HA solution. For example

• Make sure computer cabinets have dual power

• Make sure servers have dual power supplies, dual network cards, and redundant hard disks that can be mirrored

• Make sure you use multipathing to the data disks, which are usually on a SAN or NAS

• Make sure that the server is connected to two different network switches

You are trying to eliminate as many Single Points of Failure (SPOFs) as you can without increasing the costs. Most hardware today has these redundancy features built in, but it's up to you to make use of them.

HA comes in four flavors

No-failover

This option usually relies on the redundancy that is already built in: failed disks and PSUs can be replaced online, but if a major piece of hardware were to fail then a system outage is unavoidable, and the system will remain down until it is fixed.

This solution can be perfectly acceptable in some environments, but at what price to the business? Even in today's market QA/DEV systems cost money when they are not running; I am sure that your developers are quite happy to take a paid day off while you fix the system.

Cluster

This is the jewel of the HA world. A cluster can be configured in a variety of flavors, from minimal downtime while services are moved to a good node, to virtually zero downtime.

However, a cluster solution does come with a heavy price tag: the hardware, configuration and maintenance of a cluster are expensive, but if your business loses vast amounts of money when your system is down, then it's worth it.

Cold failover

Many smaller companies use this solution. Basically you have an additional server ready to take over from any one of a number of servers if one were to fail. I have used this technique myself: I created a number of scripts that can turn a cold standby server into any one of a number of servers, if the original server is going to have a prolonged outage.

The problem with this solution is that there is going to be downtime, especially if it takes a long time to get the standby server up to the same point in time as the failed server.

The advantage of this solution is that one additional server can cover a number of servers, even if it is slightly underpowered compared to the original server, as long as it keeps the service running.

Hot failover

Many applications offer hot-standby servers. These servers run alongside the live system, and data is applied to the hot-standby server periodically to keep it up to date, so in a failover situation the server is almost ready to go.

The problem with this system is cost and manageability. Also, one server is usually dedicated to one application, so you may need many hot-standby servers.

The advantage is that downtime is kept to a minimum, but there will be some downtime: generally the time it takes to get the hot-standby server up to date, for example applying the last set of logs to a database.

Here is a summary table that shows the most common aspects of cold failover versus hot failover

Aspect | Cold Failover | Hot Failover

Scalability/number of nodes | Scalability is limited to the capacity of a single node | Nodes can be added on demand, so it scales out well; a high number of nodes is supported

User interruption | Required to a minimal extent; the failover operation can be scripted or automated to a certain extent | Not required, failover is automatic

Transparent failover of applications | Not possible | Transparent application failover is available where sessions can be transferred to another node without user interruption

Load balancing | Not possible, only one server will be used | Incoming load can be balanced between both nodes

Usage of resources | Only one server at a time, the other server is kept idle | Both servers are used

Failover time | More than minutes, as the other system must be cold started | Less than a minute, typically a few seconds

Clustering

I have discussed clustering in my Tomcat and JBoss topics, so I will only touch on the subject lightly here. A cluster is a group of two or more interconnected nodes that provide a service. The cluster provides a high level of fault tolerance: if a node within the cluster becomes unavailable, its services are moved/restored to another working node, so the end user should never know that a fault occurred.

Clusters can be set up to use a single node in the cluster or to load balance between the nodes, but the main objective is to keep the service running, which is why you pay top dollar for this. One advantage of a cluster is that it is very scalable, because nodes can be added or taken away (a node may need to be patched) without interrupting the service.


Clustering has come a long way, and there are now three types of clustering architecture

Shared nothing

Each node within the cluster is independent; they share nothing. An example of this may be web servers: you have a number of nodes within the cluster supplying the same web service. The content will be static, so there is no need to share disks, etc.

Shared disk only

Each node will be attached to, or have access to, the same set of disks. These disks contain the data that is required by the service. One node controls the application and the disks, and in the event that this node fails, the other node takes control of both the application and the data. This means that one node has to be on standby, sitting idle and waiting to take over if required to do so. A typical traditional Veritas Cluster or Sun Cluster would fit the bill here.

Shared everything

Again, all nodes will be attached to, or have access to, the same set of disks, but this time each node can read/write to the disks concurrently. Normally there will be a piece of software that controls the reading and writing to the disks, ensuring data integrity. To achieve this a cluster-wide filesystem is introduced, so that all nodes view the filesystem identically; the software then coordinates the sharing and updating of files, records and databases.

Oracle RAC and IBM HACMP would be good examples of this type of cluster.

Oracle RAC History

The first Oracle cluster database was released with Oracle 6 for the Digital VAX; this was the first cluster database on the market. With Oracle 6.2, Oracle Parallel Server (OPS) was born, which used Oracle's own DLM (Distributed Lock Manager). Oracle 7 used vendor-supplied clusterware, but this was complex to set up and manage. Oracle 8 introduced a general lock manager, and this was a step towards Oracle creating its own clusterware product. Oracle's lock manager is integrated with the Oracle code through an additional layer called OSD (Operating System Dependent); this was soon integrated within the kernel and became known as IDLM (Integrated Distributed Lock Manager) in later Oracle versions. Oracle Real Application Clusters 9i (Oracle RAC) used the same IDLM and relied on external clusterware software (Sun Cluster, Veritas Cluster, etc).

Oracle Parallel Server Architecture

An Oracle parallel database consists of two or more nodes that own Oracle instances and share a disk array. Each node has its own SGA and its own redo logs, but the data files and control files are shared by all instances. All data files and control files are concurrently read and written by all instances; redo log files, on the other hand, can be read by any instance but only written by the owning instance. Each instance has its own set of background processes.


The components of an OPS database are

• Cluster Manager - OS Vendor specific

• Distributed Lock Manager (DLM)

• Cluster Interconnect

• Shared Disk Array

The Cluster Group Services (CGS) has some OSD components (node monitor interface) and the rest is built into the kernel. CGS has a key repository used by the DLM for communication and network-related activities. This layer provides the following

• Internode messaging

• Group member consistency

• Cluster synchronization

• Process grouping, registration and deregistration

The DLM is an integral part of OPS and the RAC stack. In older versions the DLM API module had to rely on external OS routines to check the status of a lock; this was done using UNIX sockets and pipes. With the new IDLM the data is in the SGA of each instance and requires only a serialized lookup using latches and/or enqueues; it may also require global coordination, the algorithm for which was built directly into the Oracle kernel. The IDLM's job is to track every lock granted to a resource. The memory structures required by the DLM are allocated out of the shared pool. The DLM is designed so that it can survive node failures in all but one node of the cluster.

A user must acquire a lock before it can operate on any resource. Parallel Cache Management (PCM) coordinates and maintains the data blocks that exist within each instance's data buffer cache, so that data viewed or requested by users is never inconsistent or incoherent. PCM ensures that only one instance in a cluster can modify a block at any given time; other instances have to wait until the lock is released.

The DLM maintains information about all locks on a given resource. It nominates one node to manage all relevant lock information for a resource; this node is referred to as the master node, and lock mastering is distributed among all nodes. Using the IPC layer, the DLM shares the load of mastering resources, which means that a user can lock a resource on one node but actually end up communicating with processes on another node.

In OPS 8i Oracle introduced Cache Fusion Stage 1, which brought a new background process called the Block Server Process (BSP). The BSP's main role was to ship consistent read (CR) versions of blocks across instances in a read/write contention scenario; this shipping is performed over a high-speed interconnect. Cache Fusion Stage 2, in Oracle 9i and 10g, addresses some of the issues with Stage 1, in that both types of blocks (CR and CUR) can be transferred using the interconnect. Since 8i, the introduction of the GV$ views has meant that a DBA can view cluster-wide database and other statistics from any node/instance of the cluster.

The limitations of OPS are

• scalability is limited to the capacity of the node

• you cannot easily add additional nodes to an OPS

• OPS requires third-party clustering software, adding to the expense and complexity

• OPS requires RAW partitions and these can be difficult to manage

Oracle RAC addresses the limitations in OPS by extending Cache Fusion and dynamic lock mastering. Oracle 10g RAC also comes with its own integrated clusterware and storage management framework, removing all dependencies on a third-party clusterware product. The latest Oracle RAC offers

• Availability - can be configured to have no SPOF even when running on low spec'ed hardware

• Scalability - multiple servers in a cluster manage a single database transparently (scale out), which basically means adding an additional server is easy

• Reliability - with improved code and monitoring, RAC has become very reliable

• Affordability - you can use low cost hardware, however RAC is not going to

• Transparency - RAC looks and feels like a standard Oracle database to an application


RAC Architecture Introduction

Oracle Real Application Clusters allows multiple instances to access a single database, with the instances running on multiple nodes. In a standard Oracle configuration a database can only be mounted by one instance, but in a RAC environment many instances can access a single database.

Oracle's RAC is heavily dependent on an efficient, highly reliable, high-speed private network called the interconnect; when designing a RAC system, make sure that you get the best that you can afford.

The table below describes the differences between a standard Oracle database (single instance) and a RAC environment

Component | Single Instance Environment | RAC Environment

SGA | Instance has its own SGA | Each instance has its own SGA

Background processes | Instance has its own set of background processes | Each instance has its own set of background processes

Datafiles | Accessed by only one instance | Shared by all instances (shared storage)

Control Files | Accessed by only one instance | Shared by all instances (shared storage)

Online Redo Logfile | Dedicated for write/read to only one instance | Only one instance can write, but other instances can read during recovery and archiving. If an instance is shut down, log switches by other instances can force the idle instance's redo logs to be archived

Archived Redo Logfile | Dedicated to the instance | Private to the instance, but other instances will need access to all required archive logs during media recovery

Flash Recovery Log | Accessed by only one instance | Shared by all instances (shared storage)

Alert Log and Trace Files | Dedicated to the instance | Private to each instance; other instances never read or write to those files

ORACLE_HOME | Multiple instances on the same server accessing different databases can use the same executable files | Same as single instance, plus it can be placed on a shared file system, allowing a common ORACLE_HOME for all instances in a RAC environment

RAC Components

The major components of an Oracle RAC system are

• Shared disk system

• Oracle Clusterware

• Cluster Interconnects

• Oracle Kernel Components


Disk architecture

With today's SAN and NAS disk storage systems, sharing storage is fairly easy, and shared storage is required for a RAC environment. You can use the storage setups below

• SAN (Storage Area Network) - generally using fibre to connect to the SAN

• NAS (Network Attached Storage) - generally using a network to connect to the NAS using either NFS or iSCSI

• JBOD - direct attached storage, the old traditional way and still used by many companies as a cheap option

All of the above solutions can offer multipathing to reduce SPOFs within the RAC environment. There is no reason not to configure multipathing, as adding additional paths to the disk is cheap: most of the expense is paid out when configuring the first path, so an additional controller card and network/fibre cables are all that is needed.

The last thing to think about is how to set up the underlying disk structure, known as the RAID level. There are about 12 different RAID levels that I know of; here are the most common ones

RAID 0 (Striping)

Data is striped across a number of disks, which together give the appearance of one very large disk.

Advantages

Improved performance

Can create very large volumes

Disadvantages

Not highly available (if one disk fails, the volume fails)

RAID 1 (Mirroring)

A single disk is mirrored by another disk; if one disk fails the system is unaffected, as it can use the mirror.

Advantages

Improved performance

Highly available (if one disk fails the mirror takes over)

Disadvantages

Expensive (requires double the number of disks)

RAID 5

RAID stands for Redundant Array of Inexpensive Disks. With RAID 5 the data is striped with parity across 3 or more disks; if one of the disks fails, the data on the failed disk is reconstructed using the parity information.

Advantages

Improved performance (read only)

Not expensive

Disadvantages

Slow write operations (caused by having to create the parity information)
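To make the parity idea concrete, here is a minimal sketch of how a lost block can be rebuilt from the surviving blocks. It assumes simple XOR parity over equal-sized blocks and ignores the way real RAID 5 controllers rotate parity across the disks; the function name `xor_blocks` and the sample data are purely illustrative.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per data disk (hypothetical contents)
data_blocks = [b"ORCL", b"RAC1", b"RAC2"]

# The parity block is the XOR of all the data blocks
parity = xor_blocks(data_blocks)

# Simulate losing the second disk: XOR the survivors with the parity block
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

assert rebuilt == data_blocks[1]
```

The same calculation hints at why writes are slower: every write to a data block also means recalculating and rewriting the parity.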

There are many other RAID levels that can be used with a particular hardware environment, for example EMC storage uses RAID-S and HP storage uses Auto RAID, so check with the manufacturer for the solution that will give you the best performance and resilience.

Once you have your storage attached to the servers, you have three choices on how to set up the disks

• Raw Volumes - normally used for performance benefits, however they are hard to manage and back up

• Cluster FileSystem - used to hold all the Oracle datafiles; it can be used by Windows and Linux, but it's not widely used

• Automatic Storage Management (ASM) - Oracle's choice of storage management; it's a portable, dedicated and optimized cluster filesystem

I will only be discussing ASM, which I already have a topic on called Automatic Storage Management.

Oracle Clusterware

Oracle Clusterware software is designed to run Oracle in cluster mode. It can support up to 64 nodes, and it can even be used with a vendor cluster like Sun Cluster.

The Clusterware software allows nodes to communicate with each other and forms the cluster that makes the nodes work as a single logical server. The software is run by the Cluster Ready Services (CRS), using the Oracle Cluster Registry (OCR), which records and maintains the cluster and node membership information, and the voting disk, which acts as a tiebreaker during communication failures. Consistent heartbeat information travels across the interconnect to the voting disk when the cluster is running.

The CRS has four components

• OPROCd - Process Monitor Daemon

• CRSd - CRS daemon; if this daemon fails it is restarted automatically

• OCSSd - Oracle Cluster Synchronization Service Daemon (updates the registry); the failure of this daemon results in the node being rebooted to avoid data corruption

• EVMd - Event Management daemon

The OPROCd daemon provides the I/O fencing for the Oracle cluster; it uses the hangcheck timer or watchdog timer for cluster integrity. It is locked into memory and runs as a realtime process, and failure of this daemon results in the node being rebooted. Fencing is used to protect the data: if a node were to have problems, fencing presumes the worst and protects the data by restarting the node in question. It's better to be safe than sorry.

The CRSd process manages resources, such as starting and stopping the services and failover of the application resources; it also spawns separate processes to manage application resources. CRS manages the OCR and stores the current known state of the cluster; it requires a public, private and VIP interface in order to run. OCSSd provides synchronization services among nodes: it provides access to the node membership and enables basic cluster services, including cluster group services and locking. Failure of this daemon causes the node to be rebooted to avoid split-brain situations.

The functions below are covered by the OCSSd

• CSS provides basic Group Services support; it is a distributed group membership system that allows applications to coordinate activities to achieve a common result.

• Group services use vendor clusterware group services when they are available.

• Lock services provide the basic cluster-wide serialization locking functions, using a First In, First Out (FIFO) mechanism to manage locking

• Node services use the OCR to store data and update the information during reconfiguration; they also manage the OCR data, which is otherwise static.

The last component is the Event Management Logger, which runs the EVMd process. The daemon spawns a process called evmlogger and generates events when things happen. The evmlogger spawns new child processes on demand and scans the callout directory to invoke callouts. The death of the EVMd daemon will not halt the instance; the daemon is restarted automatically.

Quick recap

CRS Process | Functionality | Failure of the Process | Run As

OPROCd - Process Monitor | provides basic cluster integrity services | Node restart | root

EVMd - Event Management | spawns a child process event logger and generates callouts | Daemon automatically restarted, no node restart | oracle

OCSSd - Cluster Synchronization Services | basic node membership, group services, basic locking | Node restart | oracle

CRSd - Cluster Ready Services | resource monitoring, failover and node recovery | Daemon restarted automatically, no node restart | root

Cluster Ready Services (CRS) is a new component in 10g RAC; it is installed in a separate home directory called ORACLE_CRS_HOME. It is a mandatory component, but it can be used alongside a third-party cluster (Veritas, Sun Cluster). By default it manages the node membership functionality along with managing regular RAC-related resources and services.

RAC uses a membership scheme, thus any node wanting to join the cluster has to become a member. RAC can evict any member that it sees as a problem; its primary concern is protecting the data. You can add and remove nodes from the cluster, and the membership increases or decreases accordingly. When network problems occur, membership becomes the deciding factor on which part stays as the cluster and which nodes get evicted; a voting disk is used for this, which I will talk about later.

The resource management framework manages the resources of the cluster (disks, volumes), thus you can only have one resource management framework per resource. Multiple frameworks are not supported, as they can lead to undesirable effects.

The Oracle Cluster Ready Services (CRS) uses the registry to keep the cluster configuration; it should reside on shared storage and be accessible to all nodes within the cluster. This shared storage is known as the Oracle Cluster Registry (OCR) and it is a major part of the cluster. It is automatically backed up (every 4 hours) by the daemons, and you can also back it up manually. The OCSSd uses the OCR extensively and writes the changes to the registry.

The OCR keeps details of all resources and services; it stores name and value pairs of information, such as resources that are used to manage the resource equivalents by the CRS stack. Resources within the CRS stack are components that are managed by CRS and hold information on the good/bad state and the callout scripts. The OCR is also used to supply bootstrap information (ports, nodes, etc.); it is a binary file.

The OCR is loaded as a cache on each node. Each node updates its cache, but only one node, called the master, is allowed to write the cache to the OCR file. Enterprise Manager also uses the OCR cache. The OCR should be at least 100MB in size. The CRS daemon updates the OCR with the status of the nodes in the cluster during reconfigurations and failures.

The voting disk (or quorum disk) is shared by all nodes within the cluster. Information about the cluster is constantly being written to the disk; this is known as the heartbeat. If for any reason a node cannot access the voting disk it is immediately evicted from the cluster. This protects the cluster from split-brains (the Instance Membership Recovery algorithm, IMR, is used to detect and resolve split-brains), as the voting disk decides which part is really the cluster. The voting disk manages the cluster membership and arbitrates the cluster ownership during communication failures between nodes. Voting is often confused with quorum; they are similar but distinct. The definitions below detail what each means

Voting

A vote is usually a formal expression of opinion or will in response to a proposed decision

Quorum

A quorum is defined as the number, usually a majority, of members of a body that, when assembled, is legally competent to transact business

The only vote that counts is the quorum member vote; the quorum member vote defines the cluster. If a node or group of nodes cannot achieve a quorum, they should not start any services, because they risk conflicting with an established quorum.
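The quorum rule can be sketched as a simple majority check. This is a generic illustration of the idea and not how Oracle Clusterware actually implements it (the voting disk and IMR are involved in the real thing); the function names `has_quorum` and `surviving_partition` and the node names are made up for the example.

```python
def has_quorum(reachable_members, total_members):
    """A partition has quorum when it holds a strict majority of all members."""
    return reachable_members > total_members // 2


def surviving_partition(partitions, total_members):
    """Return the partition that may keep running services, or None on a tie."""
    for partition in partitions:
        if has_quorum(len(partition), total_members):
            return partition
    return None


# A three-node cluster splits into {n1, n2} and {n3}: the larger side keeps quorum
winner = surviving_partition([{"n1", "n2"}, {"n3"}], total_members=3)

# A four-node cluster splits evenly: neither side has a majority
tie = surviving_partition([{"n1", "n2"}, {"n3", "n4"}], total_members=4)
```

The even split is the case where a tiebreaker is needed, since neither half can claim a majority on its own.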

The voting disk has to reside on shared storage; it is a small file (20MB) that can be accessed by all nodes in the cluster. In Oracle 10g R1 you can have only one voting disk, but in R2 you can have up to 32 voting disks, allowing you to eliminate any SPOFs.

The original Virtual IP in Oracle was Transparent Application Failover (TAF); this had limitations and has now been replaced with cluster VIPs. The cluster VIPs will fail over to working nodes if a node should fail. These public IPs are configured in DNS so that users can access them. The cluster VIPs are different from the cluster interconnect IP addresses and are only used to access the database.

The cluster interconnect is used to synchronize the resources of the RAC cluster, and is also used to transfer some data from one instance to another. This interconnect should be private, highly available and fast with low latency; ideally it should be on a private network of at least 1Gb. Whatever hardware you are using, the NICs should use multipathing (Linux - bonding, Solaris - IPMP). You can use crossover cables in a QA/DEV environment, but this is not supported in a production environment; crossover cables also limit you to a two-node cluster.

Oracle Kernel Components

The kernel components relate to the background processes, buffer cache and shared pool; managing these resources without conflicts and corruptions requires special handling. In RAC, as more than one instance is accessing a resource, the instances require better coordination at the resource management level. Each node has its own set of buffers but is able to request and receive data blocks currently held in another instance's cache. The management of data sharing and exchange is done by the Global Cache Services (GCS).

All the resources in the cluster group form a central repository called the Global Resource Directory (GRD), which is distributed. Each instance masters some set of resources, and together all instances form the GRD. The resources are equally distributed among the nodes based on their weight. The GRD is managed by two services called Global Cache Services (GCS) and Global Enqueue Services (GES); together they form and manage the GRD. When a node leaves the cluster, the GRD portion of that instance needs to be redistributed to the surviving nodes; a similar action is performed when a new node joins.
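The idea of distributed mastering, and of redistributing a departed instance's share, can be sketched roughly as follows. This is a toy model only: it assumes a plain hash is used to pick the master and ignores the weighting mentioned above, and the names `assign_masters` and `remaster_after_leave` are hypothetical, not Oracle functions.

```python
import zlib


def assign_masters(resources, instances):
    """Map each resource name to a mastering instance using a stable hash."""
    ordered = sorted(instances)
    return {r: ordered[zlib.crc32(r.encode()) % len(ordered)] for r in resources}


def remaster_after_leave(masters, departed, survivors):
    """Reassign only the resources that were mastered by the departed instance."""
    ordered = sorted(survivors)
    new_masters = dict(masters)
    orphaned = sorted(r for r, owner in masters.items() if owner == departed)
    for i, resource in enumerate(orphaned):
        # Spread the orphaned resources round-robin over the surviving instances
        new_masters[resource] = ordered[i % len(ordered)]
    return new_masters


resources = [f"block-{i}" for i in range(20)]
masters = assign_masters(resources, ["inst1", "inst2", "inst3"])

# inst2 leaves the cluster; its portion of the directory moves to the survivors
after_leave = remaster_after_leave(masters, "inst2", ["inst1", "inst3"])
```

Resources mastered by the surviving instances stay where they are, so only the departed instance's portion of the directory has to move.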


RAC Background Processes

Each node has its own background processes and memory structures. There are additional processes, beyond the norm, to manage the shared resources; these additional processes maintain cache coherency across the nodes.

Cache coherency is the technique of keeping multiple copies of a buffer consistent between different Oracle instances on different nodes. Global cache management ensures that access to a master copy of a data block in one buffer cache is coordinated with the copy of the block in another buffer cache.

The sequence of an operation would go as below

1. When instance A needs a block of data to modify, it reads the block from disk, but before reading it must inform the GCS (DLM). GCS keeps track of the lock status of the data block by keeping an exclusive lock on it on behalf of instance A.

2. Now instance B wants to modify that same data block. It too must inform GCS; GCS will then request instance A to release the lock. Thus GCS ensures that instance B gets the latest version of the data block (including instance A's modifications) and then exclusively locks it on instance B's behalf.

3. At any one point in time, only one instance has the current copy of the block, thus keeping the integrity of the block.
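The three steps above can be sketched with a small toy model. It is only meant to show the hand-over of the exclusive lock and the current copy of a block; the class `GlobalCacheService`, its methods and the block names are invented for this illustration and do not correspond to real Oracle interfaces.

```python
class GlobalCacheService:
    """Toy stand-in for GCS: tracks who holds the exclusive lock on each block."""

    def __init__(self):
        self.exclusive_holder = {}  # block id -> instance holding the lock
        self.current_copy = {}      # block id -> latest contents of the block

    def request_exclusive(self, instance, block_id, disk):
        if self.exclusive_holder.get(block_id) is None:
            # Step 1: nobody holds the block yet, so it is read from disk
            self.current_copy[block_id] = disk[block_id]
        # Step 2: any previous holder gives up the lock, and the requester
        # receives the current copy, including the previous holder's changes
        self.exclusive_holder[block_id] = instance
        return self.current_copy[block_id]

    def modify(self, instance, block_id, new_value):
        # Step 3: only the instance holding the exclusive lock may change the block
        if self.exclusive_holder.get(block_id) != instance:
            raise PermissionError(f"{instance} does not hold the lock on {block_id}")
        self.current_copy[block_id] = new_value


disk = {"block-42": "v0"}
gcs = GlobalCacheService()

gcs.request_exclusive("A", "block-42", disk)   # instance A gets the block from disk
gcs.modify("A", "block-42", "v1")              # instance A modifies it

seen_by_b = gcs.request_exclusive("B", "block-42", disk)  # B receives A's version
```

After the last call instance B holds the exclusive lock, and a further `modify` attempt by instance A would raise an error until A requests the block again.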

GCS maintains data coherency and coordination by keeping track of the lock status of each block that can be read or written by any node in the RAC. GCS is an in-memory database that contains information about current locks on blocks and instances waiting to acquire locks. This is known as Parallel Cache Management (PCM). The Global Resource Manager (GRM) helps to coordinate and communicate the lock requests from Oracle processes between instances in the RAC. Each instance has a buffer cache in its SGA. To ensure that each RAC instance obtains the block that it needs to satisfy a query or transaction, RAC uses two services, the GCS and GES, which maintain records of the lock status of each data file and each cached block using the GRD.

So what is a resource? It is an identifiable entity: it has a name or a reference, and it can be an area in memory, a disk file or an abstract entity. A resource can be owned or locked in various states (exclusive or shared). Any shared resource is lockable; if it is not shared, no access conflict will occur.

A global resource is a resource that is visible to all the nodes within the cluster. Data buffer cache blocks are the most obvious and most heavily used global resource; transaction enqueues and database data structures are other examples. GCS handles data buffer cache blocks and GES handles all the non-data block resources.

All caches in the SGA are either global or local: dictionary and buffer caches are global, large and java pool buffer caches are local. Cache fusion is used to read a block from the buffer cache of another instance instead of getting the block from disk, thus cache fusion moves current copies of data blocks between instances (which is why you need a fast private network). GCS manages the block transfers between the instances.

Finally we get to the processes

Oracle RAC Daemons and Processes

LMSn - Lock Manager Server process (GCS)

This is the cache fusion part and the most active process. It handles the consistent copies of blocks that are transferred between instances. It receives requests from LMD to perform lock requests. It rolls back any uncommitted transactions. There can be up to ten LMS processes running, and they can be started dynamically if demand requires it.

They manage lock manager service requests for GCS resources and send them to a service queue to be handled by the LMSn process. They also handle global deadlock detection and monitor for lock conversion timeouts.

As a performance gain you can increase the priority of this process to make sure CPU starvation does not occur.

You can see the statistics of this daemon by looking at the view X$KJMSDP.

LMON - Lock Monitor Process (GES)

This process manages the GES, it maintains consistency of the GCS memory structure in case of process death. It is also responsible for cluster reconfiguration and lock reconfiguration (node joining or leaving), it checks for instance deaths and listens for local messaging.

A detailed log file is created that tracks any reconfigurations that have happened.

LMD - Lock Manager Daemon (GES)

This manages the enqueue manager service requests for the GCS. It also handles deadlock detection and remote resource requests from other instances.

You can see the statistics of this daemon by looking at the view X$KJMDDP.

LCK0 - Lock Process (GES)

Manages instance resource requests and cross-instance call operations for shared resources. It builds a list of invalid lock elements and validates lock elements during recovery.

DIAG - Diagnostic Daemon

This is a lightweight process, it uses the DIAG framework to monitor the health of the cluster. It captures information for later diagnosis in the event of failures. It will perform any necessary recovery if an operational hang is detected.


3. RAC Installation, Configuration and Storage

RAC Installation

I am not going to show you a step by step guide on how to install Oracle RAC, there are many documents on the internet that explain it better than I could. However I will point to the one I am fond of, which works very well if you want to build a cheap Oracle RAC environment to play around with. The instructions are simple and I have had no problems setting up, installing and configuring it.

To configure an Oracle RAC environment follow the instructions in the document Build your own Oracle RAC cluster on Oracle Enterprise Linux and ISCSI, there is also a newer version out using 11g. As I said the document is excellent. I used the hardware below and it cost me a little over £400 from eBay, a lot cheaper than an Oracle course. I did try to set up a RAC environment on VMware on my laptop (I do have an old laptop) but it did not work very well, hence why I took the route above.

Hardware and description

Instance nodes 1, 2 and 3: 3 x Compaq Evo D510 PCs (CPU 2.4GHz P4, RAM 2GB, HD 40GB)

Note: picked these up for £50 each, had to buy additional memory to max them out. The third node I use to add, remove and break to see what happens to the cluster, definitely worth getting a third node.

Openfiler server: Compaq Evo D510 PC (CPU 2.4GHz P4, RAM 2GB, HD 40GB, HD 250GB)

Note: bought the additional 250GB disk for iSCSI storage, more than enough for me.

Router/Switch: 2 x Netgear GS608 8 port Gigabit switches (one for the private RAC network, one for the iSCSI network (data))

Note: I could have connected it all to one switch and saved a bit of money.

any more) and TOE (TCP offload engine)

Network cables: cat5e

KVM switch: cheap one

Make sure you give yourself a couple of days to set up, install and configure the RAC. Take your time and make notes, I have now set it up and reinstalled it so many times that I can do it in a day.

Make use of that third node: don't install it with the original configuration, add it afterwards. Use this node to remove a node from the cluster and also to simulate node failures. This is the only way to learn, keep repeating certain situations until you fully understand how RAC works.


4. RAC Administration and Management

RAC Parameters

I am only going to talk about RAC administration, if you need Oracle administration then see my Oracle section.

It is recommended that the spfile (binary parameter file) is shared between all nodes within the cluster, but it is possible that each instance can have its own spfile. The parameters can be grouped into three categories

Unique parameters - these parameters are unique to each instance, examples would be instance_name, thread and undo_tablespace

Identical parameters - parameters in this category must be the same for each instance, examples would be db_name and control_files

Neither unique nor identical parameters - parameters that are not in either of the above categories, examples would be db_cache_size, large_pool_size, local_listener and gcs_server_processes

The main unique parameters that you should know about are

• instance_name - defines the name of the Oracle instance (default is the value of the oracle_sid variable)
• instance_number - a unique number for each instance, must be greater than 0 but smaller than the max_instance parameter
• thread - specifies the set of redolog files to be used by the instance
• undo_tablespace - specifies the name of the undo tablespace to be used by the instance
• rollback_segments - you should use Automatic Undo Management
• cluster_interconnects - use only if Oracle has trouble picking the correct interconnects

The identical parameters that you should know about are listed below, you can use the following query to view all of them

select name, isinstance_modifiable from v$parameter where isinstance_modifiable = 'FALSE' order by name;

• cluster_database - options are true or false, mounts the control file in either shared (cluster) or exclusive mode, use false in the below cases
o Converting from no archive log mode to archive log mode and vice versa
o Enabling the flashback database feature
o Performing a media recovery on a system table
o Maintenance of a node
• cluster_database_instances - specifies the number of instances that will be accessing the database (set to maximum # of nodes)
• dml_locks - specifies the number of DML locks for a particular instance (only change if you get ORA-00055 errors)
• gc_files_to_locks - specifies the number of global locks for a data file, changing this disables Cache Fusion
• max_commit_propagation_delay - influences the mechanism Oracle uses to synchronize the SCN among all instances
• instance_groups - specifies multiple parallel query execution groups and assigns the current instance to those groups
• parallel_instance_group - specifies the group of instances to be used for parallel query execution
• gcs_server_processes - specifies the number of lock manager server (LMS) background processes used by the instance for Cache Fusion
• remote_listener - registers the instance with listeners on remote nodes

syntax for parameter file

<instance_name>.<parameter_name>=<parameter_value>

inst1.db_cache_size = 1000000
*.undo_management=auto

example

alter system set db_2k_cache_size=10m scope=spfile sid='inst1';

Note: use the sid option to specify a particular instance
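To make the parameter file syntax concrete, here is a minimal sketch of how the entries combine for one instance: a `*.` entry applies to every instance, and an entry prefixed with an instance name overrides it for that instance only. The function name and the sample lines are made up for illustration, and this is not Oracle's own parser.

```python
def effective_parameters(pfile_lines, sid):
    """Return the parameter values that the instance named sid would use (illustrative only)."""
    defaults = {}   # values from "*.<parameter>" entries
    specific = {}   # values from "<sid>.<parameter>" entries
    for line in pfile_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, value = line.split("=", 1)
        prefix, name = key.strip().split(".", 1)
        if prefix == "*":
            defaults[name.lower()] = value.strip()
        elif prefix == sid:
            specific[name.lower()] = value.strip()
    # Instance-specific entries take precedence over the "*" entries.
    defaults.update(specific)
    return defaults


sample = [
    "*.undo_management=auto",
    "*.db_cache_size=500000",
    "inst1.db_cache_size = 1000000",
    "inst2.undo_tablespace=undo_tbs2",
]
inst1 = effective_parameters(sample, "inst1")
inst2 = effective_parameters(sample, "inst2")
```

Here inst1 picks up its own db_cache_size, while inst2 falls back to the `*` value and adds its own undo tablespace.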

Starting and Stopping Instances

The srvctl command is used to start and stop instances, you can also use sqlplus to start and stop an instance

start all instances

srvctl start database -d <database> -o <option>

Note: starts listeners if not already running, you can use the -o option to specify startup options, the options are force, open, mount and nomount

stop all instances

srvctl stop database -d <database> -o <option>

Note: the listeners are not stopped, you can use the -o option to specify shutdown options, the options are immediate, abort, normal and transactional

start/stop particular instance

srvctl [start|stop] instance -d <database> -i <instance>,<instance>

Undo Management

To recap on undo management you can see my undo section. Instances in a RAC do not share undo, they each have a dedicated undo tablespace. Using the undo_tablespace parameter each instance can point to its own undo tablespace

undo tablespace

instance1.undo_tablespace=undo_tbs1
instance2.undo_tablespace=undo_tbs2

With today's Oracle you should be using automatic undo management (AUM), again I have a detailed discussion on AUM in my undo section.

Temporary Tablespace

I have already discussed temporary tablespaces. In a RAC environment you should set up a temporary tablespace group, this group is then used by all instances of the RAC. Each instance creates a temporary segment in the temporary tablespace it is using. If an instance is running a large sort, temporary segments can be reclaimed from segments of other instances in that tablespace.

useful views

gv$sort_segment - explore current and maximum sort segment usage statistics (check columns freed_extents and free_requests, if they grow increase the tablespace size)

gv$tempseg_usage - explore temporary segment usage details such as name, SQL, etc

v$tempfile - identify the temporary datafiles being used for the temporary tablespace

Redologs

I have already discussed redologs. In a RAC environment every instance has its own set of redologs. Each instance has exclusive write access to its own redologs, but each instance can read the others' redologs, this is used for recovery. Redologs are located on the shared storage so that all instances can have access to each other's redologs. The process of changing the archive mode is a little different to standard Oracle

archive mode (RAC)

SQL> alter system set cluster_database=false scope=spfile sid='prod1';
srvctl stop database -d <database>
SQL> startup mount
SQL> alter database archivelog;
SQL> alter system set cluster_database=true scope=spfile sid='prod1';
SQL> shutdown;
srvctl start database -d prod

Flashback

Again I have already talked about flashback, there is no difference in a RAC environment apart from the setting up

flashback (RAC)

## Make sure that the database is running in archive log mode
SQL> archive log list

## Setup the flashback
SQL> alter system set cluster_database=false scope=spfile sid='prod1';
SQL> alter system set DB_RECOVERY_FILE_DEST_SIZE=200M scope=spfile;
SQL> alter system set DB_RECOVERY_FILE_DEST='/ocfs2/flashback' scope=spfile;
srvctl stop database -d prod
SQL> startup mount
SQL> alter database flashback on;
SQL> shutdown;
srvctl start database -d prod

SRVCTL command

We have already come across srvctl above, this command is called the server control utility. Its tasks can be divided into two categories

• Database configuration tasks
• Database instance control tasks


Oracle stores database configuration in a repository, the configuration is stored in the Oracle Cluster Registry (OCR) that was created when RAC was installed, it will be located on the shared storage. Srvctl uses CRS to communicate and perform startup and shutdown commands on other nodes.

I suggest that you look up the command but I will provide a few examples

display the registered databases

srvctl config database

status

srvctl status database -d <database>
srvctl status instance -d <database> -i <instance>
srvctl status nodeapps -n <node>
srvctl status service -d <database>
srvctl status asm -n <node>

stopping/starting

srvctl stop database -d <database>
srvctl stop instance -d <database> -i <instance>,<instance>
srvctl stop service -d <database> [-s <service>,<service>] [-i <instance>,<instance>]
srvctl stop nodeapps -n <node>
srvctl stop asm -n <node>

srvctl start database -d <database>
srvctl start instance -d <database> -i <instance>,<instance>
srvctl start service -d <database> -s <service>,<service> -i <instance>,<instance>
srvctl start nodeapps -n <node>
srvctl start asm -n <node>

adding/removing

srvctl add database -d <database> -o <oracle_home>
srvctl add instance -d <database> -i <instance> -n <node>
srvctl add service -d <database> -s <service> -r <preferred_list>
srvctl add nodeapps -n <node> -o <oracle_home> -A <name|ip>/network
srvctl add asm -n <node> -i <asm_instance> -o <oracle_home>

srvctl remove database -d <database>
srvctl remove instance -d <database> -i <instance>
srvctl remove service -d <database> -s <service>
srvctl remove nodeapps -n <node>


Services

Services are used to manage the workload in Oracle RAC, the important features of services are

• used to distribute the workload
• can be configured to provide high availability
• provide a transparent way to direct workload

The view v$services contains information about services that have been started on that instance, here is a list from a fresh RAC installation

The table above is described below

• Goal - allows you to define a service goal using service time, throughput or none
• Connect Time Load Balancing Goal - listeners and mid-tier servers contain current information about service performance
• Distributed Transaction Processing - used for distributed transactions
• AQ_HA_Notifications - information about nodes being up or down will be sent to mid-tier servers via the Advanced Queuing mechanism
• Preferred and Available Instances - the preferred instances for a service, available ones are the backup instances

You can administer services using the following tools

• DBCA

• EM (Enterprise Manager)

• DBMS_SERVICE

• Server Control (srvctl)

Two services are created when the database is first installed, these services are running all the time and cannot be disabled.

• sys$background - used by an instance's background processes only
• sys$users - when users connect to the database without specifying a service they use this service

add

srvctl add service -d D01 -s BATCH_SERVICE -r node1,node2 -a node3

Note: the options are described below

-d - the database
-s - the service
-r - the service will run on these nodes
-a - if nodes in the -r list are not running then run on this node

remove

srvctl remove service -d D01 -s BATCH_SERVICE

start

srvctl start service -d D01 -s BATCH_SERVICE

stop

srvctl stop service -d D01 -s BATCH_SERVICE

status

srvctl status service -d D01 -s BATCH_SERVICE
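The relationship between the -r and -a lists can be sketched as a small function. This is a simplified illustration only: the function name is made up, and the rule that each missing preferred node is replaced by the next running available node is an assumption for the sketch, not a description of the actual CRS placement logic.

```python
def service_placement(preferred, available, running):
    """Pick where a service runs, given the -r (preferred) and -a (available) lists."""
    # The service runs on every preferred node that is up.
    placed = [node for node in preferred if node in running]
    # For each preferred node that is down, fall back to a running available node.
    missing = len(preferred) - len(placed)
    spares = [node for node in available if node in running]
    return placed + spares[:missing]


all_up = service_placement(["node1", "node2"], ["node3"], {"node1", "node2", "node3"})
node2_down = service_placement(["node1", "node2"], ["node3"], {"node1", "node3"})
```

With all three nodes up the service stays on node1 and node2; when node2 is down, node3 takes its place.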

service (example)

## create the job class
BEGIN
  DBMS_SCHEDULER.create_job_class(
    job_class_name => 'BATCH_JOB_CLASS',
    service        => 'BATCH_SERVICE');
END;
/

## grant the privilege to execute the job class
grant execute on sys.batch_job_class to vallep;

## create a job associated with a job class
BEGIN
  DBMS_SCHEDULER.create_job(
    job_name        => 'my_user.batch_job_test',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN NULL; END;',  -- placeholder action, replace with your own PL/SQL
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=DAILY;',
    job_class       => 'SYS.BATCH_JOB_CLASS',
    end_date        => NULL,
    enabled         => TRUE,
    comments        => 'Test batch job to show RAC services');
END;
/

## assign a job class to an existing job
exec dbms_scheduler.set_attribute('MY_BATCH_JOB', 'JOB_CLASS', 'BATCH_JOB_CLASS');


Cluster Ready Services (CRS)

CRS is Oracle's clusterware software, you can use it with other third-party clusterware software, though this is not required (apart from HP Tru64).

CRS is started automatically when the server starts, you should only stop this service in the following situations

• Applying a patch set to $ORA_CRS_HOME

• O/S maintenance

• Debugging CRS problems

CRS Administration

starting

## Starting CRS using Oracle 10g R1
not possible

## Starting CRS using Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl start crs

stopping

## Stopping CRS using Oracle 10g R1
srvctl stop database -d <database>
srvctl stop asm -n <node>
srvctl stop nodeapps -n <node>
/etc/init.d/init.crs stop

## Stopping CRS using Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl stop crs

disabling/enabling

## stop CRS restarting after a reboot, basically permanent over reboots

## Oracle 10g R1
/etc/init.d/init.crs [disable|enable]

## Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl [disable|enable] crs

checking

$ORA_CRS_HOME/bin/crsctl check crs
$ORA_CRS_HOME/bin/crsctl check evmd
$ORA_CRS_HOME/bin/crsctl check cssd
$ORA_CRS_HOME/bin/crsctl check crsd
$ORA_CRS_HOME/bin/crsctl check install -wait 600

Resource Applications (CRS Utilities)

status

$ORA_CRS_HOME/bin/crs_stat
$ORA_CRS_HOME/bin/crs_stat -t
$ORA_CRS_HOME/bin/crs_stat -ls
$ORA_CRS_HOME/bin/crs_stat -p

Note: -t more readable display, -ls permission listing, -p parameters

create profile

$ORA_CRS_HOME/bin/crs_profile

register/unregister application

$ORA_CRS_HOME/bin/crs_register
$ORA_CRS_HOME/bin/crs_unregister

start/stop an application

$ORA_CRS_HOME/bin/crs_start
$ORA_CRS_HOME/bin/crs_stop

resource permissions

$ORA_CRS_HOME/bin/crs_getperm
$ORA_CRS_HOME/bin/crs_setperm

relocate a resource

$ORA_CRS_HOME/bin/crs_relocate

Nodes

member number/name

olsnodes -n

Note: the olsnodes command is located in $ORA_CRS_HOME/bin

local node name

olsnodes -l

activates logging

olsnodes -g

Oracle Interfaces

display

oifcfg getif

delete

oifcfg delif -global

set

oifcfg setif -global <interface name>/<subnet>:public
oifcfg setif -global <interface name>/<subnet>:cluster_interconnect

Global Services Daemon Control

starting

gsdctl start

stopping

gsdctl stop

status

gsdctl status

Cluster Configuration (clscfg is used during installation)

create a new configuration

clscfg -install

Note: the clscfg command is located in $ORA_CRS_HOME/bin

upgrade or downgrade an existing configuration

clscfg -upgrade
clscfg -downgrade

add or delete a node from the configuration

clscfg -add
clscfg -delete

create a special single-node configuration for ASM

clscfg -local

brief listing of terminology used in the other modes

clscfg -concepts

used for tracing

clscfg -trace

help

clscfg -h

Cluster Name Check

print cluster name

cemutlo -n

Note: in Oracle 9i the utility was called "cemutls", the command is located in $ORA_CRS_HOME/bin

print the clusterware version

cemutlo -w

Note: in Oracle 9i the utility was called "cemutls"

Node Scripts

Add Node

addnode.sh

Note: see adding and deleting nodes

Delete Node

deletenode.sh

Note: see adding and deleting nodes

Oracle Cluster Registry (OCR)

As you already know, the OCR is the registry that contains the following information

• Node list

• Node membership mapping

• Database instance, node and other mapping information

• Characteristics of any third-party applications controlled by CRS

The file location is specified during the installation. The file pointer indicating the OCR device location is the ocr.loc file, which can be in either of the following locations

• linux - /etc/oracle
• solaris - /var/opt/oracle

ocr.loc

ocrconfig_loc=/u02/oradata/racdb/OCRFile
ocrmirrorconfig_loc=/u02/oradata/racdb/OCRFile_mirror
local_only=FALSE

The OCR is important to the RAC environment and any problems must be actioned immediately, the OCR commands can be found in $ORA_CRS_HOME/bin

OCR Utilities

log file

$ORA_CRS_HOME/log/<hostname>/client/ocrconfig_<pid>.log

checking

ocrcheck

Note: will return the OCR version, total space allocated, space used, free space, location of each device and the result of the integrity check

dump contents

ocrdump

Note: by default it dumps the contents into a file named OCRDUMPFILE in the current directory

export/import

ocrconfig -export <file>
ocrconfig -import <file>

backup/restore

# show backups
ocrconfig -showbackup

# to change the location of the backup, you can even specify an ASM disk
ocrconfig -backuploc <path|+asm>

# perform a backup, will use the location specified by the -backuploc option
ocrconfig -manualbackup

# perform a restore
ocrconfig -restore <file>

# delete a backup
ocrconfig -delete <file>

Note: there are many more options so see the ocrconfig man page

add/remove/replace

## add/relocate the ocrmirror file to the specified location
ocrconfig -replace ocrmirror '/ocfs2/ocr2.dbf'

## relocate an existing OCR file
ocrconfig -replace ocr <file>

## remove the OCR or OCRMirror file
ocrconfig -replace ocr
ocrconfig -replace ocrmirror

Voting Disk

The voting disk, as I mentioned in the architecture section, is used to resolve membership issues in the event of a partitioned cluster, the voting disk protects data integrity.

querying

crsctl query css votedisk

adding

crsctl add css votedisk <file>

deleting

crsctl delete css votedisk <file>


Introduction

Backup and recovery is very similar to a single instance database. This article covers only the specific issues that surround RAC backups and recovery, I have already written an article on standard Oracle backups and recovery.

Backups can be different depending on the size of the company

• small company - may use tools such as tar, cpio, rsync
• medium/large company - Veritas Netbackup, RMAN
• Enterprise company - SAN mirroring with a backup option like Netbackup or RMAN

Oracle RAC can use all the above backup technologies, but Oracle prefers you to use RMAN, Oracle's own backup solution.

Backup Basics

Oracle backups can be taken hot or cold, a backup will comprise the following

• Datafiles

• Control Files

• Archive redolog files

• Parameter files (init.ora or SPFILE)

Databases have now grown to very large sizes, well over a terabyte in some cases. Tape backups are not used in these cases, sophisticated disk mirroring has taken their place. RMAN can be used in either a tape or disk solution, it can even work with third-party solutions such as Veritas Netbackup.

In an Oracle RAC environment it is critical to make sure that all archive redolog files are located on shared storage. This is required when trying to recover the database, as you need access to all archive redologs. RMAN can use parallelism when recovering, and the node that performs the recovery must have access to all archived redologs. However, during recovery only one node applies the archived logs, as in a standard single instance configuration.

Oracle RAC also supports Oracle Data Guard, thus you can have a primary database configured as a RAC and a standby database also configured as a RAC.

Instance Recovery

• Crash Recovery - means that all instances have failed, thus they all need to be recovered
• Instance Recovery - means that one or more instances have failed, these instances can then be recovered by the surviving instances

Redo information generated by an instance is called a thread of redo. All log files for that instance belong to this thread: an online redolog file belongs to a group and the group belongs to a thread. The associations between log files, groups and threads are stored in the control file. RAC databases have multiple threads of redo, each instance has one active thread. The threads are parallel timelines and together form a stream. A stream consists of all the threads of redo information ever recorded, the streams form the timeline of changes performed to the database.

Oracle records the changes made to a database, these are called change vectors. Each vector is a description of a single change, usually to a single block. A redo record contains one or more change vectors and is located by its Redo Byte Address (RBA), which points to a specific location in the redolog file (or thread). The RBA consists of three components

• log sequence number

• block number within the log

• byte number within the block
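The three components can be modelled as an ordered tuple. The sketch below is illustrative only: the class and field names are made up, and it simply shows that comparing RBAs component by component gives the order of redo records within one thread.

```python
from typing import NamedTuple


class RedoByteAddress(NamedTuple):
    """Illustrative RBA: (log sequence number, block within the log, byte within the block)."""
    log_sequence: int
    block_number: int
    byte_offset: int


a = RedoByteAddress(log_sequence=101, block_number=5, byte_offset=16)
b = RedoByteAddress(log_sequence=101, block_number=7, byte_offset=0)
c = RedoByteAddress(log_sequence=102, block_number=1, byte_offset=0)

# Tuples compare field by field, so sorting yields the order in which
# the records appear in the thread.
ordered = sorted([c, b, a])
```

A record in a later log sequence always sorts after one in an earlier sequence, whatever its block or byte number.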

Checkpoints work the same way in a RAC environment as in a single instance environment, and I have already discussed them. When a checkpoint needs to be triggered, Oracle looks for the thread checkpoint that has the lowest checkpoint SCN; all blocks in memory that contain changes made prior to this SCN, across all instances, must be written out to disk. I have discussed how to control recovery in my Oracle section and this applies to RAC as well.

Crash Recovery

Crash recovery is basically the same for a single instance and a RAC environment, I have a complete recovery section in my Oracle section, here is a note detailing the difference.

For a single instance the following is the recovery process

1. The on-disk block is the starting point for the recovery, Oracle will only consider the block on the disk so the recovery is simple. Crash recovery will automatically happen using the online redo logs that are current or active

2. The starting point is the last full checkpoint. The starting point is provided by the control file and compared against the same information in the data file headers, only the changes need to be applied

3. The block specified in the redolog is read into cache, if the block has the same timestamp as the redo record (SCN match) the redo is applied.


For a RAC instance the following is the recovery process

1. A foreground process in a surviving instance detects an "invalid block lock" condition when an attempt is made to read a block into the buffer cache. This is an indication that an instance has failed (died)

2. The foreground process sends a notification to the instance's system monitor (SMON), which begins to search for dead instances. SMON maintains a list of all the dead instances and invalid block locks. Once the recovery and cleanup has finished this list is updated.

3. The death of another instance is detected if the current instance is able to acquire that instance's redo thread locks, which are usually held by an open and active instance.

Oracle RAC uses a two-pass recovery because a data block could have been modified in any of the instances (dead or alive), so it needs to obtain the latest version of the dirty block. It uses the Past Image (PI) and the Block Written Record (BWR) to achieve this in a quick and timely fashion.

Block Written Record (BWR)

The cache aging and incremental checkpoint system writes a number of blocks to disk. When DBWn completes a data block write operation, it also adds a redo record that states the block has been written (data block address and SCN). DBWn can write block written records (BWRs) in batches, though in a lazy fashion. In RAC a BWR is written when an instance writes a block covered by a global resource or when it is told that the past image (PI) buffer it is holding is no longer necessary.

Past Image (PI)

This is what makes RAC cache fusion work, it eliminates the write/write contention problem that existed in the OPS database. A PI is a copy of a globally dirty block and is maintained in the database buffer cache. It can be created and saved when a dirty block is shipped across to another instance after setting the resource role to global. The GCS is responsible for informing an instance that its PI is no longer needed after another instance writes a newer (current) version of the same block. PIs are discarded when GCS posts all the holding instances that a new and consistent version of that particular block is now on disk.

I go into more details about PI's in my cache fusion section.

The first pass does not perform the actual recovery but merges and reads redo threads to create a hash table of the blocks that need recovery and that are not known to have been written back to the datafiles. The checkpoint SCN is needed as a starting point for the recovery, all modified blocks are added to the recovery set (an organized hash table). A block will not be recovered if its BWR version is greater than the latest PI in any of the buffer caches.

In the second pass SMON rereads the merged redo stream (by SCN) from all threads needing recovery. The redolog entries are then compared against the recovery set built in the first pass and any matches are applied to the in-memory buffers as in a single pass recovery. The buffer cache is flushed and the checkpoint SCN for each thread is updated upon successful completion.
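The two passes can be sketched with a heavily simplified model. Everything here is an assumption made for illustration: redo records are plain tuples of (SCN, kind, block, value), a "bwr" record simply means the block is known to be on disk up to that SCN, and past images, locking and the real redo format are ignored.

```python
import heapq


def merge_threads(threads):
    """Merge the per-instance redo threads into one stream ordered by SCN."""
    return list(heapq.merge(*threads, key=lambda record: record[0]))


def first_pass(stream):
    """Build the recovery set: blocks changed in redo and not known to be written to disk."""
    written_scn = {}
    needs_recovery = set()
    for scn, kind, block, _value in stream:
        if kind == "bwr":
            # A block written record: changes up to this SCN are already on disk.
            written_scn[block] = scn
            needs_recovery.discard(block)
        else:
            needs_recovery.add(block)
    return {block: written_scn.get(block, 0) for block in needs_recovery}


def second_pass(stream, recovery_set, disk):
    """Reread the stream and apply only the changes that match the recovery set."""
    buffers = {block: disk.get(block) for block in recovery_set}
    for scn, kind, block, value in stream:
        if kind == "change" and block in recovery_set and scn > recovery_set[block]:
            buffers[block] = value
    return buffers


thread1 = [(10, "change", 1, "a1"), (30, "change", 2, "b1"), (40, "bwr", 1, None)]
thread2 = [(20, "change", 1, "a2"), (50, "change", 3, "c1")]
disk = {1: "a2", 2: "b0", 3: "c0"}

stream = merge_threads([thread1, thread2])
recovery_set = first_pass(stream)
recovered = second_pass(stream, recovery_set, disk)
```

In this example block 1 is skipped because a block written record shows it is already on disk, so only blocks 2 and 3 are read and recovered.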

Cache Fusion Recovery

I have a detailed section on cache fusion, this section covers the recovery. Cache fusion recovery is only used in RAC environments, as additional steps are required, such as GRD reconfiguration, internode communication, etc. There are two types of recovery

• Crash Recovery - all instances have failed
• Instance Recovery - one instance has failed

In both cases the threads from failed instances need to be merged. In an instance recovery SMON will perform the recovery, whereas in a crash recovery a foreground process performs the recovery.

The main features (advantages) of cache fusion recovery are

• Recovery cost is proportional to the number of failures, not the total number of nodes

• It eliminates disk reads of blocks that are present in a surviving instance's cache

• It prunes the recovery set based on the global resource lock state

• The cluster is available after an initial log scan, even before recovery reads are complete

In cache fusion the starting point for recovery of a block is its most current PI version, this could be located on any of the surviving instances and multiple PI blocks of a particular buffer can exist.

Remastering is the term that describes the operation whereby a node attempting recovery tries to own or master the resource(s) that were once mastered by another instance prior to the failure. When one instance leaves the cluster, the GRD of that instance needs to be redistributed to the surviving nodes. RAC uses an algorithm called lazy remastering to remaster only a minimal number of resources during a reconfiguration. The entire Parallel Cache Management (PCM) lock space remains invalid while the DLM and SMON complete the below steps

1. The IDLM master node discards locks that are held by dead instances, the space reclaimed by this operation is used to remaster locks that are held by the surviving instances and for which a dead instance was the master

2. SMON issues a message saying that it has acquired the necessary buffer locks to perform recovery


Let's look at an example of what happens during a remastering, let's presume the following

• Instance A masters resources 1, 3, 5 and 7
• Instance B masters resources 2, 4, 6 and 8
• Instance C masters resources 9, 10, 11 and 12

Instance B is removed from the cluster. Only the resources from instance B are evenly remastered across the surviving nodes (no resources on instances A and C are affected), which reduces the amount of work the RAC has to perform. Likewise, when an instance joins a cluster only a minimum amount of resources are remastered to the new instance.
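The example can be sketched in a few lines. The function name is made up, and handing out the departed instance's resources round-robin is only an assumption used to show an even spread; the real lazy remastering algorithm is internal to Oracle.

```python
def remaster(masters, departed):
    """Redistribute only the departed instance's resources across the survivors."""
    survivors = sorted(name for name in masters if name != departed)
    # Resources already mastered by the survivors are left untouched.
    result = {name: list(masters[name]) for name in survivors}
    for i, resource in enumerate(masters[departed]):
        result[survivors[i % len(survivors)]].append(resource)
    return result


before = {"A": [1, 3, 5, 7], "B": [2, 4, 6, 8], "C": [9, 10, 11, 12]}
after = remaster(before, "B")
```

Instances A and C keep their original resources and each picks up two of the four resources that instance B used to master.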

[Figure: resource mastering before remastering and after remastering]

You can control the remastering process with a number of parameters

_lm_master_weight - controls which instance will hold or (re)master more resources than others

_gcs_resources - controls the number of resources an instance will master at a time

You can also force a dynamic remastering (DRM) of an object using oradebug

force dynamic remastering (DRM)

## Obtain the OBJECT_ID from the view below
SQL> select * from v$gcspfmaster_info;

## Determine who masters it
SQL> oradebug setmypid
SQL> oradebug lkdebug -a <OBJECT_ID>

## Now remaster the resource
SQL> oradebug setmypid
SQL> oradebug lkdebug -m pkey <OBJECT_ID>

The steps of a GRD reconfiguration are as follows

• Instance death is detected by the cluster manager

• Requests for PCM locks are frozen
• Enqueues are reconfigured and made available
• DLM recovery
• GCS (PCM lock) is remastered
• Pending writes and notifications are processed
• I Pass recovery
o The instance recovery (IR) lock is acquired by SMON
o The recovery set is prepared and built, memory space is allocated in the SMON PGA
o SMON acquires locks on buffers that need recovery
• II Pass recovery
o II pass recovery is initiated, database is partially available
o Blocks are made available as they are recovered
o The IR lock is released by SMON, recovery is then complete
o The system is available