Clusters fail while nodes are still up

Document ID : KB000099632
Last Modified Date : 01/06/2018
Issue:
We have identified unexpected behavior in a CA SSO 12.7 bi-cluster
setup that leads to downtime. We see it as a deviation from the
intended functionality and would like it resolved with a patch for
the current or nearest future release. Here is the scenario.

The Setup: The CA SSO Policy Server (PS) infrastructure is set up as 2
clusters with 3 nodes each (1.1, 1.2, 1.3 and 2.1, 2.2, 2.3
respectively) and a failover threshold of 50%. The "enable failover"
feature between cluster nodes is turned off.

The bug: In a failover test we reached a reproducible state in which
the entire agent infrastructure was down while 2 out of 6 Policy
Servers were still up but idle. The test scenario was as follows:

Steps :

1. Nodes 1.1 and 1.2 are shut down. Result: All agents gradually
   fail over from Cluster 1, as expected, because its availability
   drops to 33% and Cluster 2 becomes preferred.
2. Node 2.2 is shut down. Result: All agents stay on Cluster 2
   because it is still at 66%.
3. Node 2.3 is shut down. Result: All agents are down and disconnected
   from both Cluster 1 and Cluster 2, each of which still has 33%
   capacity.

This is obviously a problem: 2 servers are still up but do nothing,
while the entire agent environment in both data centers is down. The
only way to work around the bug is to NOT use the failover threshold
at all, i.e. to set it to 33%, so that agents keep hammering the poor
Cluster 1 until it collapses, all the while Cluster 2 enjoys its
100% capacity. This has to be addressed.

Here is a sample to illustrate it:

We have 6 Policy Servers configured in 2 clusters as follows: 

  Cluster A : 1.1, 1.2, 1.3 
  Cluster B : 2.1, 2.2, 2.3 

The failover threshold is set to 50%, which means that a cluster is
considered down once at least 50% of its Policy Servers are
unavailable. The test proceeds as follows:

Steps :

1. Nodes 1.1 and 1.2 are shut down. Result: All agents gradually
   fail over from Cluster A, as expected, because its availability
   drops to 33% and Cluster B becomes active.
2. Node 2.2 is shut down. Result: All agents stay on Cluster B
   because it is still at 66%.
3. Node 2.3 is shut down. Result: All agents are down and disconnected
   from both Cluster A and Cluster B, each of which still has 33%
   capacity.
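
The arithmetic behind these three steps can be sketched in a few
lines of Python. This is only an illustrative model, not CA SSO code:
the function names are hypothetical, Cluster A holds nodes 1.x and
Cluster B holds nodes 2.x, and the rule that a cluster counts as
available while its percentage is at or above the threshold is an
assumption inferred from the results reported in the steps.

```python
# Minimal model of the failover test. Names and the comparison rule
# are assumptions for illustration, not CA SSO internals.

def availability_percent(up_nodes, all_nodes):
    """Whole-number percentage of a cluster's Policy Servers still up."""
    return 100 * len(up_nodes) // len(all_nodes)


def simulate(steps, threshold_percent=50):
    """Apply each shutdown step and record every cluster's status."""
    clusters = {
        "A": ["1.1", "1.2", "1.3"],
        "B": ["2.1", "2.2", "2.3"],
    }
    down = set()
    history = []
    for step in steps:
        down.update(step)
        status = {}
        for name, nodes in clusters.items():
            up = [n for n in nodes if n not in down]
            pct = availability_percent(up, nodes)
            # Assumption: a cluster counts as available while its
            # percentage is at or above the threshold.
            status[name] = (pct, pct >= threshold_percent)
        history.append(status)
    return history


# The three shutdown steps from the test scenario.
STEPS = [["1.1", "1.2"], ["2.2"], ["2.3"]]

if __name__ == "__main__":
    for number, status in enumerate(simulate(STEPS), start=1):
        for name, (pct, available) in status.items():
            state = "available" if available else "failed"
            print(f"step {number}: Cluster {name} at {pct}% -> {state}")
```

In this model both clusters sit at 33% after step 3, below the 50%
threshold, so neither is treated as available even though one node in
each is still running.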

We expect (as the documentation states) requests to keep going to
the available nodes in Cluster B.
Resolution:
The observed behavior is expected and working as designed. To keep a
cluster from being considered down while 1 Policy Server is still
up, set the failover threshold to less than 30%. Another option is to
define a third cluster containing all the Policy Servers, so that
the two available nodes can still be reached there.
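
As a rough aid for choosing a threshold, the helper below computes how
many Policy Servers must be running for a cluster to stay at or above
a given threshold. It is a sketch under the same assumption as the
behavior described in this article (a cluster fails once its available
percentage drops below the threshold); the function name is
hypothetical and is not part of any CA SSO tool.

```python
def min_servers_required(cluster_size, threshold_percent):
    """Smallest number of running Policy Servers whose share of the
    cluster is at or above threshold_percent.

    Uses integer arithmetic for ceil(threshold * size / 100) so that
    floating-point rounding cannot change the result.
    """
    return (threshold_percent * cluster_size + 99) // 100


if __name__ == "__main__":
    # Three-node cluster, as in the setup described in this article.
    for threshold in (50, 30):
        needed = min_servers_required(3, threshold)
        print(f"threshold {threshold}%: {needed} of 3 servers required")
```

Under this assumption a three-node cluster needs two running servers
at a 50% threshold, but only one at 30%, which is why lowering the
threshold keeps a cluster with a single surviving node in service.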