Reinitialize replication in a multi-node cluster

Document ID : KB000044402
Last Modified Date : 31/08/2018
Show Technical Document Details
Introduction:

This procedure will explain how to reinitialize failed replication in a multi-node cluster during a maintenance window.

Background:

The CA API Gateway uses MySQL replication to provide database failover and availability should one Gateway appliance or database server become unavailable or degraded. MySQL replication will ensure that a duplicated copy of a database object is maintained in one or more locations. The Gateway uses a master-master implementation in a multi-node environment to ensure that database changes to one host are replicated to the other database host.

MySQL replication is capable of repairing itself if the other database node is unavailable. Replication can repair itself after a node in the cluster experiences a graceful shutdown, MySQL server shutdown, or a network outage. Replication will fail if a data consistency error is experienced. This article will prescribe how to reinitialize replication in a multi-node cluster.

Presentation

The following data may be visible when running the SHOW SLAVE STATUS query against the local MySQL database:
Slave_IO_Running: No 
Slave_SQL_Running: No 

The following log entries may be present in the Gateway log files:
2380 WARNING "Error accessing host/database 
2381 WARNING "Replication failing for host/database 

Environment:
CA API Management Gateway 9.x
Instructions:

Note: This process is recommended to be performed during a maintenance window where the Gateway cluster is completely unavailable

This process will result in a significant period of downtime if a large quantity of audits (typically over 1,000,000) are present in the local Gateway database. These audit records can be removed in advance of this procedure via the following articles:

  1. TEC0000001008 Deleting old audit events from the local Gateway database 
  2. TEC0000001123 Removing audit records from the Gateway database in a multi-node cluster without downtime )

 

Please contact CA Support if there are any concerns about the length and breadth of the maintenance window

The following procedure should be executed in order to reinitialize replication in a multi-node cluster:

  1. Back up the database on the PRIMARY node: mysqldump --all-databases > ~/all_databases_`date '+%Y%m%d_%T'`.sql
  2. Make sure that the last line of the newly created SQL file is "-- Dump completed <date-time>".
  3. Store the file for future use.
  4. Stop slave replication on both nodes: mysqladmin stop-slave 
  5. Reset the master configuration on both nodes: mysql -e "reset master" 
  6. Reset the slave configuration on both nodes: mysql -e "reset slave; reset slave all"
  7. Stop the Gateway service on both nodes: service ssg stop
  8. Execute the attached create_slave.sh script on the secondary node: /opt/SecureSpan/Appliance/bin/create_slave.sh 
  9. Provide the fully qualified domain name (FQDN) of the primary node
  10. Enter yes to the prompt "Do you want to clone a database from $MASTER (yes or no)?"
  11. Execute the create_slave.sh script on the primary node: /opt/SecureSpan/Appliance/bin/create_slave.sh 
  12. Provide the FQDN of the secondary node when prompted
  13. Enter no to the prompt "Do you want to clone a database from $MASTER (yes or no)?"
  14. Start the Gateway service on both nodes: service ssg start
  15. Query the status of replication on both nodes: mysql -e "show slave status\G"
NOTE: The "\G" at the end of the line ensures the output is human-readable. 
  1. Verify both nodes return the following lines: 
Slave_IO_Running: Yes 
Slave_SQL_Running: Yes 
Seconds_Behind_Master: 0 


Replication should be re-initialized at this point. The above output indicates that the master/slave relationship is functioning.