Adjusting advanced cluster settings on larger installations
Last modified on 13 Jan, 2021. Revision 5
Up to date for:
cOS Core 13.00.08
cOS Core 10.xx.xx
My High Availability cluster is not synchronizing properly. I have also seen incidents where the cluster changes role for no apparent reason.
Problems with synchronization and cluster role changes can have many causes, such as a hardware problem on the sync interface, a bad cable or an incorrect configuration. However, some of the Advanced Settings for High Availability may also need to be adjusted. For most installations these settings never need to be changed, but for larger installations it is recommended to modify them to accommodate large amounts of synchronization data and (depending on the scenario) to lessen the chance that the cluster performs a failover due to a lack of heartbeats from its peer.
The settings that we want to adjust are the following and can be found under System->High Availability->Advanced:
- Sync Buffer Size, default value: 4096
- Recommended value: 4096
This setting controls how much synchronization data (in KB) can be buffered before waiting for acknowledgement from the cluster peer. Today’s appliance models (E80 and above) have plenty of spare memory, so allocating 4 MB instead of 1 MB should be no problem, and a little extra buffer for synchronization does no harm.
Note: In older versions the default value was 1024. The value is not updated automatically on existing configurations. Only new configurations created from around 2017 onward use the new default value.
- Sync Packet Max Burst, default value: 100
- Recommended value: 100
This setting controls how many packets the active cluster peer can send to the inactive node in a single synchronization state burst. For larger installations (100+ users) it is highly recommended to increase this value if it is still at the old default. With the old default, the active node may be unable to synchronize data fast enough, meaning the inactive node may not be fully synchronized with the active node.
Note: In older versions the default value was 20. The value is not updated automatically on existing configurations. Only new configurations created from around 2017 onward use the new default value.
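Because existing configurations keep their old values, it can help to compare the values currently configured against the recommendations above. The following is a minimal Python sketch of that comparison, not a cOS Core tool: the dictionary keys `SyncBufferSize` and `SyncPacketMaxBurst` are hypothetical labels chosen for illustration, and the values would have to be read manually from System->High Availability->Advanced.

```python
# Values taken from this article. The key names are hypothetical labels,
# not actual cOS Core property names.
OLD_DEFAULTS = {"SyncBufferSize": 1024, "SyncPacketMaxBurst": 20}
RECOMMENDED = {"SyncBufferSize": 4096, "SyncPacketMaxBurst": 100}


def settings_to_raise(current):
    """Return {name: (current, recommended)} for values below the recommendation."""
    return {
        name: (current[name], target)
        for name, target in RECOMMENDED.items()
        if name in current and current[name] < target
    }


# Example: a configuration created before the defaults changed.
legacy = dict(OLD_DEFAULTS)
for name, (now, target) in settings_to_raise(legacy).items():
    print(f"{name}: {now} -> {target}")
```

A configuration that already uses the recommended values yields an empty result, so nothing is printed for it.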
- HA Failover Time, default value: 750 ms
- Recommended value: 1500-2500 ms
This setting controls how long the inactive node will “wait” before going active if it has not received sufficient heartbeats from its peer within this time. Simply put, if the inactive node has not “seen” the active node for 750 ms, it will go active.
Depending on the scenario, size and network structure, 750 ms can be a bit low. If the system encounters network packet bursts, the inactive node could declare the active node inactive and go active itself. The cluster could then enter an active/active state, after which the nodes start to negotiate which of them should be the active node. This in turn could cause disruptions in the network.
One way to make the cluster less sensitive to minor network “hiccups” is to increase this value.
Note: The higher the value, the longer it takes for the inactive node to take over if something happens to the active node. The value configured here has to be based on what is reasonably acceptable: is 1.5-2.5 seconds of total network outage acceptable if something happens to the active node? It is up to the administrator to decide.
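The trade-off can be illustrated with a deliberately simplified Python model. It assumes that the inactive node goes active as soon as the gap between two consecutive heartbeats exceeds the failover time. The real cOS Core logic counts “sufficient heartbeats” and is not described in this article, so the function and the arrival times below are hypothetical.

```python
def would_fail_over(heartbeat_times_ms, failover_time_ms):
    """Return True if any gap between consecutive heartbeats exceeds the failover time.

    Simplified model: a single over-long gap is treated as a failover trigger.
    """
    gaps = (
        later - earlier
        for earlier, later in zip(heartbeat_times_ms, heartbeat_times_ms[1:])
    )
    return any(gap > failover_time_ms for gap in gaps)


# Hypothetical heartbeat arrival times in ms, with a 900 ms gap
# caused by a short network packet burst.
arrivals = [0, 100, 200, 1100, 1200]

print(would_fail_over(arrivals, 750))   # True: 900 ms gap exceeds 750 ms
print(would_fail_over(arrivals, 1500))  # False: 900 ms gap is tolerated
```

In this model the same 900 ms gap triggers an unnecessary role change at 750 ms but is ridden out at 1500 ms, at the cost of a longer takeover time when the active node really is down.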