The Failover Process - Configuring Mirroring

4.1 Configuring Mirroring

4.2.5 The Failover Process

Mirroring provides rapid, automatic, unattended failover. There are several events that could trigger a failover, such as:

• The Data Channel is closed by the primary failover member, which could occur if Caché on the primary member becomes unresponsive due to an application hang or error.

• An operator manually issues a takeover command.

• A takeover command is issued on the backup via the SYS.MirrorAPI.

There are many predefined rules that influence the behavior of a takeover. These rules are mostly configurable and are designed to provide a customized failover configuration that is appropriate to your deployment.

• The Failover Process — A System Perspective

• The Failover Process — An Application Perspective

• Failover (Takeover) Rules

• Failover Scenarios

4.2.5.1 The Failover Process — A System Perspective

Whenever possible, mirroring provides rapid, automatic, unattended failover. There are several events that could trigger a failover, such as:

• The backup does not hear from the primary within a required interval, which could occur in the case of network problems.

• An application or host problem causes Caché to become unresponsive on the primary.

• A takeover is initiated by an operator or script.

The backup system ensures it is fully up-to-date before marking itself as the new primary system. The default mirroring configuration prevents errors during takeover, such as split-brain syndrome — a condition whereby both systems concurrently run as active primaries — which could lead to logical database degradation and loss of integrity.

In addition, it is possible for an operator to temporarily bring the primary system down without causing a failover to occur. This mode can be useful, for example, in the event the primary system needs to be brought down for a very short period of time for maintenance. After bringing the primary system back up, the default behavior of automatic failover is restored.

4.2.5.2 The Failover Process — An Application Perspective

On a successful failover, the mirror VIP (if configured) is automatically bound to a local interface on the new primary. This allows external clients to reconnect to the same mirror VIP address as before, which greatly simplifies the management of external client programs because they do not need to be aware of multiple database systems and IP addresses. If, however, a mirror VIP is not configured, external clients will need to maintain knowledge of the two failover members and appropriately connect to the currently running primary.

In an ECP deployment, application servers view a failover as a server restart condition. By design, ECP application servers reestablish their connections to the new primary failover member and continue processing their in-progress workload; during the failover process, users connected to the application servers may experience a momentary pause before they are able to resume work. For this to occur, the failover between the two failover members must occur within the configured ECP recovery timeout. If, however, the failover takes longer than this timeout, ECP recovery is initiated (that is, open transactions are rolled back, locks are released, etc.), and new connections to the new primary system are established by the ECP application servers.
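The timing rule above can be summarized in a minimal sketch. The function name, the outcome labels, and the durations used in the usage line are illustrative assumptions; they are not Caché APIs or default values.

```python
from enum import Enum


class EcpOutcome(Enum):
    """Illustrative labels for the two outcomes described above."""
    RESUME = "connections reestablished; in-progress workload continues"
    RECOVERY = "open transactions rolled back, locks released; new connections"


def ecp_outcome(failover_seconds: float, recovery_timeout_seconds: float) -> EcpOutcome:
    """Classify what an ECP application server experiences after a failover.

    Both arguments are hypothetical inputs; the real timeout is whatever
    ECP recovery timeout is configured for the deployment.
    """
    if failover_seconds < 0 or recovery_timeout_seconds <= 0:
        raise ValueError("durations must be non-negative and the timeout positive")
    # Failover completed within the timeout: the application server resumes.
    if failover_seconds <= recovery_timeout_seconds:
        return EcpOutcome.RESUME
    # Failover took longer than the timeout: ECP recovery is initiated.
    return EcpOutcome.RECOVERY
```

For example, `ecp_outcome(90, 60)` returns `EcpOutcome.RECOVERY`, because the failover took longer than the assumed 60-second timeout.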

4.2.5.3 Failover (Takeover) Rules

The main goals of the backup failover member are to determine definitively that the:

• Primary failover member is down (either because there has been a failure or it has been forced down).

• Backup failover member has all of the journal data that is present in the databases on the primary failover member.

Once a failover is triggered, the backup attempts to automatically take over as the primary. The default mirror configuration attempts to balance the convenience of an automated takeover with the security of a failsafe takeover. This section discusses the logic, rules, and result of an automatic takeover.

A central element of the automated failover functionality is the ability of the backup to determine whether or not it is active (that is, caught up with the primary) when the primary failed, and to subsequently become caught-up if needed. The backup does this by attempting to validate the end of the last mirror journal file across the two systems to ensure that it has received the most recent updates from the primary.

If the backup is able to establish communication with the ISCAgent on the primary failover member, it not only validates the end of the last/active mirror journal file, but also does the following:

• If the primary is currently running, the backup aborts the automatic failover process and attempts to re-link with the primary.

• If the primary is currently hung, the backup requests the ISCAgent to force the primary down. This is done to minimize the possibility of split-brain syndrome, whereby two systems continue to run in the role of primary.

• If it is determined that the primary was more current than the backup at the time of the failure, then the backup asks the ISCAgent to send the additional data so that it can get fully caught up.
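The rules that apply when the backup can reach the ISCAgent on the primary can be sketched as a small decision function. This is an illustration only: the `AgentReport` type, its fields, the state names, and the step strings are hypothetical and do not correspond to any Caché API. Journal positions are modeled as plain integers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AgentReport:
    """Hypothetical summary of what the primary's ISCAgent reports."""
    primary_state: str        # assumed values: "running", "hung", or "down"
    primary_journal_end: int  # end of the primary's last mirror journal file


def plan_takeover(report: AgentReport, backup_journal_end: int) -> List[str]:
    """Return the ordered steps the backup would take, per the rules above."""
    if report.primary_state == "running":
        # A running primary means the automatic failover is abandoned.
        return ["abort automatic failover", "re-link with primary"]
    if report.primary_state not in ("hung", "down"):
        raise ValueError(f"unknown primary state: {report.primary_state!r}")

    steps: List[str] = []
    if report.primary_state == "hung":
        # Forcing the hung primary down guards against split-brain.
        steps.append("ask ISCAgent to force primary down")
    if report.primary_journal_end > backup_journal_end:
        # The primary was more current, so fetch the missing journal data.
        steps.append("ask ISCAgent to send additional journal data")
    steps.append("take over as primary")
    return steps
```

For instance, a hung primary whose journal is ahead of the backup yields all three steps, in order: force down, catch up, take over.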

4.2.5.4 Failover Scenarios

This section outlines several failover scenarios, along with the expected result of automatic failover in each scenario. The configuration options that impact failover are also considered as part of this discussion.

Note: All cases in which the primary failover member is brought down with the nofailover option are ignored because no failover would occur in those cases. Additionally, if the Agent Contact Required for Failover tuning parameter is configured as No, a user-defined external function must exist so that the status of the failed primary can be externally determined (for information about the Agent Contact Required for Failover tuning parameter, see Mirror Tunable Parameters in this chapter).

The following scenarios are described:

• Primary Failover Member Fails and Backup Failover Member is Running

• Primary Failover Member and ISCAgent Fails and Backup Failover Member is Running and Active

• Operator-initiated Failover

Primary Failover Member Fails and Backup Failover Member is Running

The following table shows the state of the components/systems for this scenario:


State      Component/System
DOWN       Caché on Primary
Running    ISCAgent on Primary
Running    Caché on Backup
N/A        ISCAgent on Backup

As shown in the following illustration, the backup successfully takes over as the primary regardless of the status of the Agent Contact Required for Failover tuning parameter. The backup ensures that it is fully caught-up prior to taking over as the primary.

Figure 4–5: Status of Systems

Primary Failover Member and ISCAgent Fails and Backup Failover Member is Running and Active

The following table shows the state of the components/systems for this scenario:

State      Component/System
DOWN       Caché on Primary
DOWN       ISCAgent on Primary
Running    Caché on Backup
N/A        ISCAgent on Backup

As shown in the following illustration, this scenario could occur in the event of a catastrophic system (host operating system or hardware) failure for the primary, or in the event of a network interruption between the primary and the backup. The success or failure of the failover depends on the following:

• Agent Contact Required for Failover tuning parameter set to No, the backup is active, and the $$IsOtherNodeDown^ZMIRROR() function returns True: in this situation, the backup successfully takes over as primary because it is not required to contact the ISCAgent on the failed primary, and it is able to determine that it is

• Agent Contact Required for Failover tuning parameter set to Yes, the backup status is either active or not active: in this situation, the backup aborts the attempt to take over as primary because it is not able to communicate with the ISCAgent on the primary failover member.

Note: This is the default configuration of the mirror — it is intended to prevent the backup failover member from taking over in the event of a network interruption between the two systems.

Figure 4–6: Status of Systems
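The two outcomes listed for this scenario can be expressed as a small decision function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, `other_node_down` stands in for the result of the user-defined `$$IsOtherNodeDown^ZMIRROR()` function, and combinations that this section does not describe return `None` rather than a guessed outcome.

```python
from typing import Optional


def takeover_when_agent_unreachable(
    agent_contact_required: bool,
    backup_active: bool,
    other_node_down: bool,
) -> Optional[bool]:
    """Outcome when the backup cannot reach the primary's ISCAgent.

    Returns True if the backup takes over, False if it aborts, and None
    for combinations this section does not describe.
    """
    if agent_contact_required:
        # Default configuration: no agent contact means no takeover,
        # whether or not the backup is active.
        return False
    if backup_active and other_node_down:
        # Agent contact is not required and the external check reports
        # the primary as down, so the backup takes over.
        return True
    return None  # not covered by the scenario description
```

Returning `None` for the undescribed cases keeps the sketch from asserting behavior the text does not state.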

Operator-initiated Failover

An operator has the ability to shut down Caché on the primary failover member in the following ways:

• Graceful shutdown

• Forced shutdown

• Graceful shutdown with the nofailover option, which can be passed as an argument to the platform-specific shutdown commands

Note: The recommended method for triggering a planned failover is to perform a graceful shutdown on the primary failover member.

If the Agent Contact Required for Failover tuning parameter is set to Yes (see Agent Contact Required for Failover in this chapter), you can determine whether or not the backup failover member was caught up when the primary member failed by checking the cconsole.log file for messages similar to the following:

• “ (mirror_name) Failed to contact agent on former primary, can't take over” — indicates the backup failover member was active when the primary member failed, but could not take over because it could not reach the ISCAgent on the primary.

• “ (mirror_name) Non-active backup is down” — indicates the backup failover member was not active when the primary member failed and is resetting.
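A log check like this is easy to script. The sketch below scans lines for the two message fragments quoted above. Because the text says only that messages are "similar to" these, the exact substrings, the labels, and the function name are assumptions and may need adjusting for a real log.

```python
from typing import Iterable, List, Tuple

# Fragments copied from the example messages above; actual log wording may
# differ, so these are assumed patterns rather than a guaranteed format.
MESSAGE_FRAGMENTS = {
    "backup_active_agent_unreachable":
        "Failed to contact agent on former primary, can't take over",
    "backup_not_active": "Non-active backup is down",
}


def classify_mirror_messages(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, label) for each line containing a known fragment."""
    hits: List[Tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        for label, fragment in MESSAGE_FRAGMENTS.items():
            if fragment in line:
                hits.append((number, label))
    return hits
```

To run it against a real file, pass an open file object, for example `classify_mirror_messages(open("cconsole.log", errors="replace"))`. The path here is a placeholder for wherever the log resides on your system.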

Graceful Shutdown

A graceful shutdown is the preferred method to trigger a failover during planned activities. To accomplish this, shut down the running primary failover member; the backup failover member takes over as primary when it detects that the primary failover member has stopped running.

Forced Shutdown

To trigger a failover by forcing a shutdown, on the running backup mirror member, select the Try to make this primary option from the Mirror Management main menu list of the ^MIRROR routine. In this case, the backup only succeeds if the ISCAgent process is running on the primary, and the backup member is able to communicate successfully with the agent process; this communication ensures the backup does not assume the role of primary while the old primary is still active and running.

Alternatively, as a last resort, you could attempt to force a failover by selecting the Force this node to become the primary option from the Mirror Management main menu list of the ^MIRROR routine on the running backup failover member. In this case, the backup tries to take over as the primary, regardless of whether it is able to connect to the old primary system, but does not take over if it fails to perform some basic required tasks (for example, reading a mirror journal file or mirror log file, etc.).

CAUTION: Forcing the backup failover member to become the primary may result in a loss of data; therefore, this option should be used only when you are absolutely certain that the old primary system is no longer running. Before using this option, InterSystems recommends that you contact the InterSystems Worldwide Response Center (WRC) for assistance.

Graceful Shutdown with nofailover Option

You can gracefully shut down the primary failover member without the backup member taking over as the primary by setting the nofailover option. In this case, a running backup failover member does not attempt to take over as primary member because the nofailover option is specified for the primary member. However, on the backup failover member, you can force the backup member to take over as the new primary failover member (while the old primary member continues to be offline) by selecting the Change No Failover State option from the Mirror Management main menu list of the ^MIRROR routine. This option clears the nofailover state, thus allowing the backup to take over as the new primary member. The state is cleared the next time the primary failover member is brought online.

In document Caché High Availability Guide (Page 65-69)