Cluster Session Manager Out of Sync

Document ID : KB000123996
Last Modified Date : 08/01/2019
Introduction:
Occasionally a synchronization issue arises in which the Session Manager database in the primary site is reported as out of sync on both nodes, yet users can still log in to either node.

In this situation, the primary site nodes may enter into a “split-brain” scenario, where each node thinks it is the master and the other side’s Session Management database is inactive.

Since each node regards itself as master, and the other node’s database as inactive, credential management jobs will be allowed to run in both nodes with no synchronization to the other one.

Background:
This issue is usually caused by a temporary communication problem. A temporary suspension of the PAM VM instances, such as one caused by vMotion or by live snapshots, may be the root cause.
Environment:
CA PAM 3.2.2
Instructions:
Recommendations
We have the following recommendations for minimizing the risk of unsynchronized accounts going forward. These make sense even if the split-brain situations are thoroughly suppressed.
  1. Avoid any downtime of the PAM cluster VMs while the cluster is on, particularly for primary site nodes. This includes exempting them from vMotion or any other technology that may move a VM from one server to another. Snapshots should be taken only during maintenance windows with the cluster turned off.
  2. Integrate with a syslog server or Splunk server, if this has not been done already, and set up email notifications for PAM-CM-0457 and PAM-CM-0469 messages. When one of these is received, a PAM admin should check the cluster status, making sure to check it on both nodes.
  3. Before turning the cluster off and back on following an out-of-sync state, perform the following checks:
    1. If the two primary nodes showed a consistent database state while the cluster was on, e.g. both show the Node 2 database as active and the Node 1 database as inactive, make sure to move the node that had the active database to the top of the primary site nodes list. This is documented online; see https://docops.ca.com/ca-privileged-access-manager/3-2-2/EN/deploying/set-up-a-cluster/cluster-synchronization-promotion-and-recovery
    2. If the database state is not reported consistently by the primary site nodes, find the PAM-CM-0457 message in the session logs to determine when the problem started. On each node go to Credentials > Reports > Run and select the “Account Passwords Update Attempts” report. Run a report that covers the out-of-sync period. This will show you which target accounts were updated by which node. Pick the node with the most updates as master. Then compare the passwords on both nodes for the accounts that were updated by the other node. If they are not in sync, the password on the other node (the one that performed the update) is likely the correct one. In that case, save that password temporarily so you can set it in the cluster after it is turned back on. This is only necessary for accounts whose passwords are not updated by another account. For passwords that are updated by another account, just generate new passwords once the cluster is back. Note: If you check passwords while the cluster is stopped, you may have to unlock nodes to retrieve them. Make sure no scheduled jobs run during that time.
    3. Take a database backup using the Configuration > Database page on the primary site nodes, particularly on the second node, because its database will be overwritten on cluster startup. This makes it possible to check for any missed changes later.
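
The comparison described in check 2 can be sketched as a small script. This is only an illustration, not a PAM tool: it assumes the session log lines and the per-node report rows have already been exported by hand into plain Python lists. The function names, the node labels, and the input shapes are hypothetical and are not part of the product.

```python
# Hypothetical helper for check 2: locate the first PAM-CM-0457 message and
# pick the node with the most password updates as master.
# Input formats are assumptions; PAM does not provide these functions.

def first_alert(log_lines, code="PAM-CM-0457"):
    """Return the first log line that contains the message code, or None."""
    for line in log_lines:
        if code in line:
            return line
    return None


def choose_master(updates_by_node):
    """Pick the node with the most update entries as master.

    updates_by_node maps a node label to the list of account names taken from
    that node's "Account Passwords Update Attempts" report for the
    out-of-sync period (one entry per update).

    Returns (master, to_review), where to_review maps every other node to the
    sorted, de-duplicated accounts it updated. Those are the accounts whose
    passwords should be compared by hand. On a tie, the node listed first wins.
    """
    counts = {node: len(accounts) for node, accounts in updates_by_node.items()}
    master = max(counts, key=counts.get)
    to_review = {
        node: sorted(set(accounts))
        for node, accounts in updates_by_node.items()
        if node != master
    }
    return master, to_review
```

For example, if the report on the first node lists three updates and the report on the second node lists two, `choose_master({"node1": ["a", "b", "c"], "node2": ["d", "a"]})` selects `node1` as master and returns accounts `a` and `d` for manual password comparison on `node2`.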