Unexpected High Availability Failovers

Document ID : KB000005248
Last Modified Date : 14/02/2018
Issue:

The CA Release Automation management servers (also known as NAC or datamanagement) have been installed and configured for high availability (see the Additional Information section below). Occasionally, an unexplained failover occurs.

Environment:
CA Release Automation 5.x - 6.x
Cause:

There are two conditions under which a CA Release Automation management server can fail over:

  1. The time on the servers is out of sync.
  2. The passive NAC received a login request.

These scenarios are described in more detail below:

Time Synchronization

It is extremely important for the date/time on the two management servers to be in sync with each other, down to the second, because:

  1. The active management server updates the database with its current timestamp. This is done every second.
  2. The passive management server checks the timestamp written by the active management server and compares it with its own local timestamp.
  3. If the passive management server determines that the active management server has not updated the database within the last 15 seconds, it becomes the active management server.
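
The comparison described in the steps above can be sketched as follows. This is only an illustration of why clock skew matters, not the product's actual code. The function name and the sample times are hypothetical. The 15000 ms threshold is taken from the log message shown in this article.

```python
from datetime import datetime, timedelta

# Threshold taken from the "more than 15000 ms" log message in this article.
ALIVENESS_TIMEOUT = timedelta(milliseconds=15000)


def passive_should_take_over(last_i_am_alive, passive_local_now,
                             timeout=ALIVENESS_TIMEOUT):
    """Hypothetical model of the passive node's check.

    last_i_am_alive   -- timestamp written by the active node (its own clock)
    passive_local_now -- current time according to the passive node's clock
    """
    return passive_local_now - last_i_am_alive > timeout


# The active node wrote its heartbeat at 16:06:51 according to its own clock.
heartbeat = datetime(2016, 10, 4, 16, 6, 51)

# The passive node performs its check one real second later.
true_check_time = heartbeat + timedelta(seconds=1)

# Clocks in sync: the heartbeat looks 1 second old, so no failover.
in_sync = passive_should_take_over(heartbeat, true_check_time)

# Passive clock running 20 seconds ahead: the same heartbeat looks
# 21 seconds old, so the passive node would try to become active.
skewed = passive_should_take_over(heartbeat,
                                  true_check_time + timedelta(seconds=20))

print(in_sync, skewed)
```

In this model, a healthy active server is still judged to be dead as soon as the passive server's clock runs more than about 15 seconds ahead of it.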

When this condition occurs you will find a message similar to the following in the nolio_dm_all.log file:

2016-10-04 16:07:06,788 [periodicTasksMasterMonitor-1] INFO  (com.nolio.platform.server.dataservices.services.ha.MasterNacService:287) - current master [MasterNac[id=<id_val>, nacNode=NacNode[id=<id_val>, hostname='<masterNacHostname>', ip='<masterNacIpAddress>'], lastIAmAlive=2016-10-04 16:06:51.0, firstIAmAlive=2016-10-04 16:01:07.0, upgradeState=null]] has not reported aliveness for more than 15000 ms.

 

Login Requests

If the passive management server receives a login request, it assumes that it must become the active management server. Login requests are typically handled by a front-end load balancer, which must be configured to send 100% of its traffic to the active node. The load balancer typically decides which management server it considers active based on HTTP GET requests. In an environment that does not have time synchronization problems, this is the failover method that one should typically see. However, a failover can also happen if someone mistakenly attempts to log in directly to the passive management server.

When this condition occurs you will find a message similar to the following in the nolio_dm_all.log file:

2016-10-04 16:01:07,316 [http-nio-8080-exec-2] INFO  (com.nolio.platform.server.dataservices.services.ha.MasterNacIdentifierInterceptor:129) - received new incoming request when I'm not master. Trying to become master before handling request...
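
To tell the two causes apart across a large log file, a small script can count each kind of message. The sketch below assumes the message text matches the two examples quoted in this article. The marker strings are copied from those examples. The function names and labels are hypothetical, and the wording of the messages may differ between product versions.

```python
# Marker substrings copied from the two example log messages in this article.
TIME_SYNC_MARKER = "has not reported aliveness for more than"
LOGIN_MARKER = "received new incoming request when I'm not master"


def classify_failover_line(line):
    """Return a label for a failover-related log line, or None."""
    if TIME_SYNC_MARKER in line:
        return "heartbeat-timeout"
    if LOGIN_MARKER in line:
        return "request-on-passive"
    return None


def summarize(lines):
    """Count failover-related messages by cause."""
    counts = {"heartbeat-timeout": 0, "request-on-passive": 0}
    for line in lines:
        cause = classify_failover_line(line)
        if cause is not None:
            counts[cause] += 1
    return counts


# Shortened versions of the example messages, plus one unrelated line.
sample = [
    "2016-10-04 16:07:06,788 [periodicTasksMasterMonitor-1] INFO  "
    "current master [...] has not reported aliveness for more than 15000 ms.",
    "2016-10-04 16:01:07,316 [http-nio-8080-exec-2] INFO  "
    "received new incoming request when I'm not master. "
    "Trying to become master before handling request...",
    "2016-10-04 16:01:08,000 [main] INFO  unrelated message",
]

print(summarize(sample))
```

To run it against a real file, pass an open file handle instead of the sample list, for example summarize(open("nolio_dm_all.log")), adjusting the path to the log location on the management server.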

Resolution:

Identify the cause for the failover based on the information provided (in the Cause section above) and review any information related to installation and configuration in the Additional Information section below. Adjust the settings and/or behavior to ensure reliable failover.

Note: Time synchronization problems may result in multiple failovers. This is indicated by time sync messages in the logs of the passive (becoming active) management server together with login request messages in the logs of the active (becoming passive) management server. It may happen when the passive (becoming active) management server has a date/time that is later than that of the active (becoming passive) management server. Because the load balancer does not see an error from its HTTP GET requests against the management server it thinks is active, it continues to forward all of its traffic to the newly passive management server, which then tries to become active again when it receives a request.

We have also seen servers configured to use NTP suddenly have their system time changed to an unexpected value, which can trigger a time sync failover. This may show up in the nolio_dm_all.log file as messages that are out of sequence.

Example:

Message 1: 2016-10-04 16:01:07,316

Message 2: 2016-10-04 15:26:57,320

Message 3: 2016-10-04 16:01:07,320

Notice how message 2 has a timestamp that is earlier than that of the previous message. This indicates that either someone manually changed the time on the server or NTP synchronization had a glitch.
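
A check for out-of-sequence timestamps can be automated. The sketch below assumes every log entry starts with a timestamp in the format shown in the example (yyyy-MM-dd HH:mm:ss,SSS). Lines without a leading timestamp, such as stack trace lines, are skipped. The function names and the tolerance parameter are hypothetical. The tolerance lets you ignore small reorderings if they turn out to be normal in your logs.

```python
from datetime import datetime, timedelta

# Timestamp layout used at the start of each example log line.
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23  # length of "2016-10-04 16:01:07,316"


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if absent."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def find_backward_jumps(lines, tolerance=timedelta(0)):
    """Return 1-based line numbers whose timestamp is earlier than the
    previous timestamped line by more than the given tolerance."""
    jumps = []
    previous = None
    for number, line in enumerate(lines, start=1):
        current = parse_timestamp(line)
        if current is None:
            continue  # continuation line without a timestamp
        if previous is not None and previous - current > tolerance:
            jumps.append(number)
        previous = current
    return jumps


# The three timestamps from the example above.
sample = [
    "2016-10-04 16:01:07,316 message 1",
    "2016-10-04 15:26:57,320 message 2",
    "2016-10-04 16:01:07,320 message 3",
]

print(find_backward_jumps(sample))
```

For the three example messages, the script reports line 2, the message whose timestamp moved backward.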

Additional Information:

More information related to setting up High Availability can be found here:

Install to Provide High Availability: https://docops.ca.com/ca-release-automation/6-3/en/installation/install-to-provide-high-availability/

Architecture and Implementation: https://communities.ca.com/docs/DOC-231165900

CA Release Automation HA Configuration: https://communities.ca.com/docs/DOC-231172193

CA-Release-Automation-Artifactory-HA-Best-PracticesV2.5: https://communities.ca.com/docs/DOC-231153988

Apply Patches to a High Availability Installation: https://docops.ca.com/ca-release-automation/6-3/en/installation/install-to-provide-high-availability/apply-patches-to-a-high-availability-installation 

Execution Server High Availability Installation and Scalability: https://docops.ca.com/ca-release-automation/6-3/en/installation/install-to-provide-high-availability/execution-server-high-availability-and-scalability 

Upgrade a High Availability Deployment: https://docops.ca.com/ca-release-automation/6-3/en/installation/install-to-provide-high-availability/upgrade-a-high-availability-deployment