Data Repository cluster regularly shuts down unexpectedly.

Document ID : KB000008334
Last Modified Date : 14/02/2018
Issue:

The Data Repository (DR) database shows stability issues in a three-node cluster. The database is frequently found down with no known user interaction. After each outage, the database can be restarted without error.

Environment:
All supported CAPM releases
Cause:

A Vertica database cluster shuts itself down if the majority of its nodes are seen as down. In a standard three-node cluster, this means that if two of the three nodes are seen as down, Vertica shuts down the one node still known to be running in order to protect it.
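The rule described above can be sketched in a few lines of Python. This is only an illustration of the majority rule as this article states it, not Vertica's actual implementation; the function name `cluster_should_shut_down` is hypothetical.

```python
def cluster_should_shut_down(total_nodes: int, down_nodes: int) -> bool:
    """Return True when a majority of the cluster's nodes are seen as down.

    Hypothetical helper that models the rule as stated in this article;
    it is not part of Vertica.
    """
    if not 0 <= down_nodes <= total_nodes:
        raise ValueError("down_nodes must be between 0 and total_nodes")
    # A majority means strictly more than half of the nodes.
    return down_nodes > total_nodes / 2


# Walk through every possible state of a three-node cluster.
for down in range(4):
    print(f"3 nodes, {down} down -> shut down: {cluster_should_shut_down(3, down)}")
```

In a three-node cluster, a single node being down does not meet the condition, while two nodes being down does.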

In this case, messages in the vertica.log files on the nodes point to network disconnects between the nodes as the cause.

The following message appears in the node0001 vertica.log:

2017-09-30 01:48:59.174 Spread Client:0x93d8550 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0003 left the cluster 

This then aligns with this message in the node0003 vertica.log file: 

2017-09-30 01:48:59.173 Spread Client:0x832f9d0 [Comms] <INFO> NETWORK change with 2 VS sets 

Looking in the node0001 vertica.log we see similar messages for node0002:

2017-09-30 01:48:59.173 Spread Client:0x93d8550 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0002 left the cluster 

The node0002 log file, in turn, shows the same message seen in the node0003 log:

2017-09-30 01:48:59.172 Spread Client:0x9019170 [Comms] <INFO> NETWORK change with 2 VS sets

This leaves a Vertica system that sees node0001 as the lone remaining member of a three-node cluster. This violates K-safety and triggers the shutdown cycle.

2017-09-30 01:48:59.174 Spread Client:0x93d8550 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes 
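
When reviewing a large vertica.log, it may help to pull out only the messages discussed above. The following is a minimal Python sketch that assumes the log lines have the same layout as the excerpts quoted in this article; the regular expressions and the `summarize` function are illustrative and are not part of any Vertica or CAPM tooling.

```python
import re

# Timestamp layout as it appears in the excerpts quoted in this article.
TS = r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})"

# "nodeSetNotifier: node <name> left the cluster"
LEFT_RE = re.compile(
    r"^" + TS + r" .*nodeSetNotifier: node (?P<node>\S+) left the cluster"
)

# "Cluster partitioned: N total nodes, N up nodes, N down nodes"
PARTITION_RE = re.compile(
    r"^" + TS + r" .*Cluster partitioned: (?P<total>\d+) total nodes, "
    r"(?P<up>\d+) up nodes, (?P<down>\d+) down nodes"
)


def summarize(lines):
    """Collect node departures and partition events from log lines."""
    departures = []  # (timestamp, node name)
    partitions = []  # (timestamp, total, up, down)
    for line in lines:
        match = LEFT_RE.search(line)
        if match:
            departures.append((match["ts"], match["node"]))
            continue
        match = PARTITION_RE.search(line)
        if match:
            partitions.append(
                (match["ts"], int(match["total"]),
                 int(match["up"]), int(match["down"]))
            )
    return departures, partitions


# Sample input: the node0001 lines quoted in this article.
SAMPLE = [
    "2017-09-30 01:48:59.174 Spread Client:0x93d8550 [Comms] <INFO> "
    "nodeSetNotifier: node v_drdata_node0003 left the cluster",
    "2017-09-30 01:48:59.173 Spread Client:0x93d8550 [Comms] <INFO> "
    "nodeSetNotifier: node v_drdata_node0002 left the cluster",
    "2017-09-30 01:48:59.174 Spread Client:0x93d8550 [Recover] <INFO> "
    "Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes",
]

departures, partitions = summarize(SAMPLE)
for ts, node in departures:
    print(f"{ts}  {node} left the cluster")
for ts, total, up, down in partitions:
    print(f"{ts}  partitioned: {up} of {total} nodes up, {down} down")
```

To run it against a real log, pass an open file object to `summarize` in place of `SAMPLE`. The resulting timestamps can then be handed to the network team for correlation with their own monitoring.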

 

 

 

Resolution:

The internal network team will need to determine why the nodes are losing network connectivity to each other at these times. Once the cause is identified and the network problems between the cluster nodes are resolved, the outages should cease.