Vertica is down and when we restart it, it stops again in a few moments.

Document ID : KB000103269
Last Modified Date : 25/06/2018
Issue:
Vertica went down last night.
Each time we restart Vertica, it fails again within a few moments.
 
Environment:
CAPM 3.5
Vertica 8.1.0-4
3 node Vertica cluster on Linux

 
Cause:
In Vertica.log we see one node leave the cluster, then a second node leave the cluster, after which the database shuts down for K-safety.
In dmesg we see OS-level errors on the network interfaces that make up bond0.
Some sample lines:
 
[ 7303.350454] bond0: Removing slave eth0
[ 7303.350523] bond0: Releasing backup interface eth0
[ 7303.350526] bond0: the permanent HWaddr of eth0 - **:f2:e9:bd:9c:70 - is still in use by bond0 - set the HWaddr of eth0 to a different address to avoid conflicts
[ 7303.507399] bond0: Removing slave eth1
[ 7303.507457] bond0: Removing an active aggregator
[ 7303.507459] bond0: Releasing backup interface eth1
[ 7303.944889] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
[ 7303.972597] bond0: Adding slave eth0
[ 7303.972661] tg3 0000:16:00.0: irq 53 for MSI/MSI-X
[ 7303.972665] tg3 0000:16:00.0: irq 54 for MSI/MSI-X
[ 7303.972669] tg3 0000:16:00.0: irq 55 for MSI/MSI-X
[ 7303.972673] tg3 0000:16:00.0: irq 56 for MSI/MSI-X
[ 7303.972677] tg3 0000:16:00.0: irq 57 for MSI/MSI-X
[ 7304.087939] bond0: Enslaving eth0 as a backup interface with a down link
[ 7304.114156] bond0: Adding slave eth1
[ 7304.114213] tg3 0000:16:00.1: irq 58 for MSI/MSI-X
[ 7304.114217] tg3 0000:16:00.1: irq 59 for MSI/MSI-X
[ 7304.114221] tg3 0000:16:00.1: irq 60 for MSI/MSI-X
[ 7304.114225] tg3 0000:16:00.1: irq 61 for MSI/MSI-X
[ 7304.114229] tg3 0000:16:00.1: irq 62 for MSI/MSI-X
[ 7304.229236] bond0: Enslaving eth1 as a backup interface with a down link
[ 7304.234371] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
[ 7316.890718] tg3 0000:16:00.0 eth0: Link is up at 1000 Mbps, full duplex
[ 7316.890724] tg3 0000:16:00.0 eth0: Flow control is off for TX and off for RX
[ 7316.890727] tg3 0000:16:00.0 eth0: EEE is disabled
[ 7316.933455] bond0: link status definitely up for interface eth0, 1000 Mbps full duplex
[ 7316.933462] bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
[ 7316.933472] bond0: first active interface up!
[ 7316.933507] IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
[ 7317.042805] tg3 0000:16:00.1 eth1: Link is up at 1000 Mbps, full duplex
[ 7317.042811] tg3 0000:16:00.1 eth1: Flow control is off for TX and off for RX
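
To spot this pattern quickly on the other nodes, dmesg output can be filtered for the bonding messages seen in the sample lines above. The following is a minimal Python sketch, not a supported tool: the function name scan_dmesg, the event names, and the choice of which messages count as suspect are assumptions made for illustration, based only on the sample lines in this article.

```python
import re

# Bonding messages treated as suspect. This list is an assumption drawn
# from the sample dmesg lines in this article, not an exhaustive catalogue.
SUSPECT_PATTERNS = {
    "no_lacp_partner": re.compile(r"No 802\.3ad response from the link partner"),
    "slave_removed": re.compile(r"Removing slave (\S+)"),
    "down_link": re.compile(r"Enslaving (\S+) as a backup interface with a down link"),
}

# Matches lines of the form "[ 7303.350454] bond0: message text".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(\S+): (.*)$")


def scan_dmesg(text):
    """Return (timestamp, device, event_name) tuples for suspect bonding lines."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a "[time] device: message" line
        timestamp = float(match.group(1))
        device = match.group(2)
        message = match.group(3)
        for name, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(message):
                events.append((timestamp, device, name))
    return events
```

The text passed in could come from running dmesg on each Vertica node and saving the output to a file. A "no_lacp_partner" event on a bond configured for 802.3ad suggests that the switch side is not answering LACP, which is worth raising with the network team.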
 
Resolution:
The Cisco FabricPath switch and the server were not communicating correctly over LACP. The port channel (LACP) definition on the Cisco switch had to be removed and re-added to restore switch-to-server communication.
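
After the port channel is re-added, one way to confirm from the server side that LACP has negotiated is to look at the partner MAC address the Linux bonding driver reports in /proc/net/bonding/bond0. The sketch below assumes the bond runs in 802.3ad mode and that the status file contains "Partner Mac Address:" lines; an all-zero address is taken here to mean that no LACP partner has responded. The function names are hypothetical and the exact file layout can vary between kernel versions.

```python
import re

ZERO_MAC = "00:00:00:00:00:00"


def lacp_partner_macs(bonding_text):
    """Collect every 'Partner Mac Address' value found in the bonding status text."""
    return re.findall(r"Partner Mac Address:\s*([0-9A-Fa-f:]{17})", bonding_text)


def has_lacp_partner(bonding_text):
    """True if at least one reported partner MAC address is not all zeros."""
    macs = lacp_partner_macs(bonding_text)
    return bool(macs) and any(mac != ZERO_MAC for mac in macs)


def check_bond(path="/proc/net/bonding/bond0"):
    """Read the bonding status file (path is an assumption) and test for a partner."""
    with open(path) as handle:
        return has_lacp_partner(handle.read())
```

Calling check_bond() on each Vertica node before restarting the database gives a quick indication of whether the switch and server are exchanging LACP again. A False result means the restart is likely to fail the same way.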