Solving HA nodes both going active/inactive at the same time

Last modified on 21 Apr, 2021. Revision 25

Problem description

When using a cOS Core High Availability (HA) cluster there may be situations where one of the following occurs:

  1. Master and Slave nodes are both active at the same time.
  2. Master and Slave nodes are both inactive at the same time.
  3. Nodes might also be changing their active/inactive roles frequently without an apparent reason.

This article will discuss possible ways to troubleshoot and solve these issues.

Note: The intentional HA Active-Active setup described in the cOS Core Admin Guide for load-sharing is another subject and is not connected with this article.

Scenario-1: Both cluster nodes active at the same time

This is the most likely issue that might be observed and it basically means that the Master node cannot find the Slave node and the Slave node cannot find the Master node. In order to explain this further, consider an example cluster that has 5 interfaces, as shown in the diagram below:

This is a standard cluster setup. Every physical interface is connected to various switches. For example, we have G1 from both the Master and Slave nodes connected to the same switch and a similar setup for all the other interfaces. The exception is the sync interface (G5) which is connected directly between the Master and Slave using a common TP or fibre cable, and nothing in between.

In order for the Master to find the Slave unit and vice-versa, the cluster sends out heartbeats on all physical interfaces. Heartbeats are the method for a cluster node to declare that it’s alive. We can think of them as a kind of ping packet constantly being sent and received.

Important: Heartbeats are not sent out on non-physical interfaces such as VLANs, IPsec tunnels, GRE tunnels, PPTP connections, Loopback interfaces etc. If VLANs are used on a switch, untagged VLAN0 must be configured on the switch to allow heartbeat packets to be sent between the Firewalls.

Each interface sends out heartbeats at a set interval and each cluster node then decides if its cluster peer is alive.

But at the same time this means we have some redundancy. In our example we have 5 interfaces in total but what if two of them are not used? They do not even have a link.

Would the cluster go active/active in this scenario? No, it would not. This is because the cluster nodes still send and receive heartbeats on the other interfaces.
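
The redundancy described above can be pictured with a minimal Python sketch. It is not how cOS Core is implemented; the function name, the timestamps and the timeout value are all assumptions made for illustration. The only point it shows is that the peer counts as alive as long as at least one interface has recently received a heartbeat.

```python
def peer_alive(last_seen, now, timeout):
    """Return True if at least one interface heard the peer recently.

    last_seen maps interface name -> time of the last received heartbeat,
    or None if nothing has been received (for example, no link).
    timeout is a hypothetical threshold, not a cOS Core setting.
    """
    return any(
        seen is not None and now - seen <= timeout
        for seen in last_seen.values()
    )


# Five interfaces as in the example; G3 and G4 have no link.
last_seen = {"G1": 99.8, "G2": 99.9, "G3": None, "G4": None, "G5": 99.95}
print(peer_alive(last_seen, now=100.0, timeout=1.0))
```

With every entry set to None, or with all timestamps older than the timeout, the same function returns False, which corresponds to the situation where a node can no longer see its peer.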


Question: What if all interfaces except the sync are working? Would they then go active/active?
Answer: No, they would not. The synchronization interface normally sends heartbeats at more than twice the rate of a normal interface, but the heartbeats on the remaining interfaces would still be more than enough to establish that a node is alive. However, there would be no state synchronization, nor any configuration synchronization, unless InControl is being used to manage the cluster.

Question: What if the reverse is true? All interfaces except sync are down. Would that cause an active/active situation?
Answer: No, it would not. As long as we have even one interface working that can send/receive heartbeats it would be enough for the cluster to see its peer and to know which node should be active or inactive. However, having only one interface able to send/receive heartbeats makes the cluster very sensitive, and the slightest network problem may cause the cluster to fail over and/or start going into an active/active state. It is recommended to have heartbeats enabled and working on as many interfaces as possible.

Question: If the Sync interface is down, how can the cluster determine which node should be the active one?
Answer: The heartbeat packets sent out on all interfaces also contain information about the number of connections the sender node has. The cluster nodes can then determine which one should be the active node by comparing their own connection count with that of their peer. The one with the most connections will be the active node.
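
The comparison described in the answer above can be sketched in a few lines of Python. This is only an illustration of the principle. The function name is hypothetical, and the tie-break in favour of the Master when both counts are equal is an assumption of this sketch, since the article does not say what happens in that case.

```python
def should_be_active(my_connections, peer_connections, i_am_master):
    """Decide whether this node should be the active one.

    The node with the most connections becomes active. When the counts
    are equal, this sketch ASSUMES the Master wins; the article does
    not describe the real tie-break.
    """
    if my_connections != peer_connections:
        return my_connections > peer_connections
    return i_am_master


# The Slave currently holds more connections, so it stays active.
print(should_be_active(1200, 35, i_am_master=False))
print(should_be_active(35, 1200, i_am_master=True))
```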

It was mentioned earlier that the synchronization interface sends more heartbeats than a normal interface. The principle of this can be illustrated in this simplified sequence:

  1. Heartbeat sent on G1
  2. Heartbeat sent on Sync
  3. Heartbeat sent on G2
  4. Heartbeat sent on Sync
  5. Heartbeat sent on G3
  6. Heartbeat sent on Sync

Therefore, even if only the Sync interface is able to send/receive heartbeats, it sends them at a higher rate that makes the cluster less likely to enter an active/active state than if only a normal non-sync interface had been the one used for heartbeats.
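
The simplified sequence above can be reproduced with a short Python generator. The function name and interface names are illustrative only, and the real scheduling inside cOS Core may differ. The sketch simply shows why the Sync interface ends up sending more heartbeats per round than any single normal interface.

```python
from collections import Counter


def heartbeat_order(normal_ifaces, sync_iface):
    """Yield one round of the simplified send order.

    Each normal interface is followed by the sync interface, as in the
    numbered sequence in the article.
    """
    for iface in normal_ifaces:
        yield iface
        yield sync_iface


one_round = list(heartbeat_order(["G1", "G2", "G3"], "Sync"))
print(one_round)           # order of heartbeats in one round
print(Counter(one_round))  # heartbeats sent per interface in that round
```

In this simplified model with three normal interfaces, Sync sends three heartbeats per round while G1, G2 and G3 send one each.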

Heartbeat characteristics

Troubleshooting Active/Active situations

The main problem when encountering an active/active situation is, as discussed above, the lack of heartbeats. But there may be situations where the problem is due to how the surrounding network topology is designed. For example, interfaces may actively scan for heartbeats but since heartbeats are specifically designed to not pass through network equipment like routers, they may be dropped before reaching listening interfaces.

In practice, this usually means that the HA cluster nodes are not receiving enough heartbeats from their cluster peer. If both nodes go active at the same time, it may be that not even one interface is able to both send and receive heartbeats.

There are five methods that can be used to deal with active/active situations:

Method-1: Disable Heartbeats

The first possible solution to this problem is to disable the sending of heartbeats on interfaces that are not in use or interfaces that we know cannot send/receive heartbeats.
The option to disable cluster heartbeats can be found under the WebUI advanced tab for each Ethernet interface in the configuration.

It is recommended to make a comment in the comment field on interfaces when heartbeats are disabled for future reference. It is also recommended that if/when the interfaces are used, the sending of heartbeats is activated again.

This is an optional setting since even a single interface can be enough to send/receive heartbeats. This is the best way to stop the cluster sending heartbeats on interfaces that are not actively used and to stop it generating packets on particular interfaces and networks.

Method-2: Disable interfaces that are not in use

The second possible solution is to disable the interface itself. Simply right-click the interface and select “disable interface”. This will stop the sending of heartbeats on the interface.

WARNING! Before disabling any interface you must make sure it is NOT the registration interface that the cOS Core license is bound to. If you disable the interface to which the license is bound, cOS Core will enter Lockdown mode! The easiest way to check which interface the license is bound to is to first use the CLI command “license” and then combine that with the CLI command “ifstat -maclist” to see a list of all interfaces and their MAC addresses.
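
When there are many interfaces, the comparison of MAC addresses can be done with a small helper such as the Python sketch below. It does not parse the output of either CLI command, since the exact output format is not covered here. It assumes the values have been copied by hand from the output of “license” and “ifstat -maclist”, and all names and addresses in the example are made up.

```python
def normalize_mac(mac):
    """Lower-case a MAC address and use ':' as the separator."""
    return mac.strip().replace("-", ":").lower()


def license_bound_interface(license_mac, iface_macs):
    """Return the name of the interface whose MAC matches the license MAC.

    iface_macs maps interface name -> MAC address, copied manually from
    the CLI output. Returns None if no interface matches.
    """
    wanted = normalize_mac(license_mac)
    for name, mac in iface_macs.items():
        if normalize_mac(mac) == wanted:
            return name
    return None


# Hypothetical values for illustration only.
ifaces = {"G1": "00-11-22-33-44-01", "G2": "00-11-22-33-44-02"}
print(license_bound_interface("00:11:22:33:44:02", ifaces))
```

Any interface name returned by the helper is one that must not be disabled.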

Method-3: In a virtual cluster use identical hardware settings

For a cOS Core HA cluster running in a virtual environment, such as VMware or KVM, it is recommended that the Master and Slave each have the same hardware settings (PCI-Bus, Slot & Port). This is because the Shared MAC address is calculated from the hardware settings and the cluster ID. If the hardware settings on the cluster nodes are different, the Shared MAC address becomes different as well, which could (depending on the scenario) cause the cluster nodes to think that the heartbeats and synchronization packets are sent from something other than their cluster peer.

Note: Even though it is recommended to use identical hardware settings on both cluster nodes, it is possible to run a virtual cluster using different hardware settings. This works because the shared MAC calculation is initially based on the hardware settings of the Master node. So even if the Slave is different, the shared MAC calculation should be the same on both cluster nodes. Clavister does not recommend this scenario because it is not actively tested and verified during the QA testing of new versions.

If the hardware settings are not the same, there is a chance of an active/active problem occurring. The logs would most likely also be filled with events such as “disallowed_on_sync_iface”.
The easiest way to compare the hardware settings between the cluster nodes is to download a Technical Support File (TSF) from both nodes using, for example, the WebUI and then compare their hardware sections.
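
If the two hardware sections are long, a text diff makes the differences easier to spot. The Python sketch below uses the standard difflib module and assumes the hardware section of each TSF has already been copied out as plain text. The sample lines are invented for illustration and do not reflect the real TSF layout.

```python
import difflib


def hardware_diff(master_section, slave_section):
    """Return only the lines that differ between two hardware sections.

    Lines prefixed with '-' exist only on the Master, lines prefixed
    with '+' exist only on the Slave. An empty list means no difference.
    """
    diff = list(difflib.unified_diff(
        master_section.splitlines(),
        slave_section.splitlines(),
        fromfile="master", tofile="slave", lineterm="", n=0,
    ))
    # Skip the two header lines and the '@@' hunk markers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Hypothetical section contents, not the real TSF format.
master = "G1 bus=0 slot=3 port=0\nG2 bus=0 slot=4 port=0"
slave = "G1 bus=0 slot=3 port=0\nG2 bus=0 slot=5 port=0"
for line in hardware_diff(master, slave):
    print(line)
```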

Method-4: Investigate high Diffie-Hellman (DH) group usage on IPsec tunnels

This is a problem that has been seen more and more since the introduction of DH groups 14-18 in cOS Core version 10.20.00. The problem with using DH group 15 and above is that the processing resources required to generate these very strong keys are high and can cause system stalls on less powerful hardware. DH group 18, for instance, is an 8192-bit key and requires powerful hardware in order to avoid system interruptions.

The effect of having high DH groups will be that the inactive node believes its peer is down and therefore goes active. To troubleshoot, examine the logs from before the active/active event to see if there was an IPsec tunnel negotiation prior to the event, or perform a configuration review and examine if DH groups of 15 or above are used by configured IPsec tunnels. A tunnel with many local and remote networks will also generate many DH key negotiations, and having many DH negotiations going on at the same time can make the situation even worse.

A possible way to mitigate this would be to make the inactive cluster node less sensitive in case its peer has been silent for some time. The setting “HA Failover Time” could be increased to a higher value to try to prevent the cluster from changing roles frequently.

But if this happens in the first place it may be an indication that the hardware is not powerful enough to handle the current IPsec loads with long key length DH groups. A hardware upgrade may need to be considered.
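
During a configuration review, the tunnels that use DH group 15 or above can be listed with a helper like the Python sketch below. The tunnel names and the input structure are hypothetical, and the groups have to be collected from the configuration by hand or by other means. The key sizes in the table are the standard MODP group sizes for groups 14 to 18.

```python
# Standard MODP key sizes in bits for DH groups 14-18.
MODP_BITS = {14: 2048, 15: 3072, 16: 4096, 17: 6144, 18: 8192}


def tunnels_with_high_dh(tunnels, threshold=15):
    """Return the tunnels that use any DH group at or above the threshold.

    tunnels maps a tunnel name -> list of configured DH group numbers.
    The result maps tunnel name -> sorted list of the offending groups.
    """
    result = {}
    for name, groups in tunnels.items():
        high = sorted(g for g in groups if g >= threshold)
        if high:
            result[name] = high
    return result


# Hypothetical tunnel names and proposals for illustration only.
tunnels = {"branch-office": [14], "partner-vpn": [14, 18], "dc-link": [16, 15]}
for name, groups in tunnels_with_high_dh(tunnels).items():
    sizes = ", ".join(f"group {g} ({MODP_BITS[g]} bit)" for g in groups)
    print(f"{name}: {sizes}")
```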

Method-5: (under VMware only) Change from Distributed Port Group to Standard Port Group

In an isolated case with a cOS Core HA cluster running under VMware, there were problems with configuration synchronization, cluster peers going active/active, as well as problems accessing the cOS Core WebUI. After changing from Distributed Port Group to Standard Port Group in VMware, all problems disappeared.

A final note on Active/Active situations:

It is important to highlight again that the most frequent reason for active/active situations occurring is lack of heartbeats. As long as heartbeats are unable to traverse from the Master to the Slave node and vice-versa, it is a potential source of problems. This should be investigated first since even one interface receiving heartbeats should be sufficient for HA nodes to see their peers.

Scenario-2: Both cluster nodes inactive at the same time

This scenario is very rare. It means that both cluster nodes have received heartbeat data from their peer indicating that it has more connections and should be the active node. The problem is that both cluster nodes believe the peer has more connections and so both cluster peers will stay inactive.

Troubleshooting Inactive/Inactive situations

The solution to this problem may not be obvious, since the cause can be an obscure network problem. Here is a list of what to check:

  1. If there is more than one Clavister HA Cluster configured in the network and they can see each other, verify that they are NOT using the same cluster ID. If the ID is duplicated there is a chance that heartbeats can be recognized by the wrong cluster. A common log seen during this kind of problem is "heartbeat_from_myself".
  2. Check the network in case there is some sort of port mirroring or other equipment that mirrors the packets sent by the cluster, meaning that the cluster sees its own heartbeats.
  3. Verify that the cluster is correctly configured. Check that nodes are correctly configured as Master and Slave.
  4. Make sure that the IP addresses configured on the interfaces are not the same for the Shared, Master_IP and Slave_IP. If all three are using the same IP address, you would encounter very strange problems and the logs might contain "heartbeat_from_myself" entries as well.
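
Checks 1 and 4 in the list above are simple enough to express as a short Python sketch. The cluster names, IDs and addresses below are invented, and the values would have to be gathered from each cluster's configuration. The sketch only demonstrates the two comparisons, namely that cluster IDs are unique and that the Shared, Master_IP and Slave_IP addresses differ.

```python
from collections import Counter


def duplicated_cluster_ids(cluster_ids):
    """Return the cluster IDs used by more than one cluster.

    cluster_ids maps a cluster name -> its configured cluster ID.
    """
    counts = Counter(cluster_ids.values())
    return sorted(cid for cid, n in counts.items() if n > 1)


def ha_addresses_distinct(shared, master_ip, slave_ip):
    """Return True if the three addresses on an interface all differ."""
    return len({shared, master_ip, slave_ip}) == 3


# Hypothetical clusters and addresses for illustration only.
print(duplicated_cluster_ids({"edge": 1, "datacenter": 1, "lab": 7}))
print(ha_addresses_distinct("10.0.0.1", "10.0.0.2", "10.0.0.3"))
print(ha_addresses_distinct("10.0.0.1", "10.0.0.1", "10.0.0.1"))
```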

For information about adjusting HA cluster settings to get the best performance and stability, particularly in larger network environments, please read through the additional Knowledge Base article at the following link:
