Adaptive Load Balancing bond fails if one of the member links goes down and then recovers.

Document ID : KB000111976
Last Modified Date : 16/10/2018
Issue:
  • The PAM server is configured with 2 NICs (Network Interface Cards), referred to here as NIC_A and NIC_B, bonded in Adaptive Load Balancing mode.
  • When communication through NIC_A breaks down, traffic to and from the PAM server continues through NIC_B. This is the expected behavior.
  • However, when communication through the failed NIC_A recovers, the PAM server becomes unreachable over the network. The only ways to recover from this situation are to disable and re-enable the NIC that did not fail (NIC_B) or to reboot the entire PAM server.
Environment:
Physical Appliances running PAM Server 3.x or above.
Cause:
  • The network device Port Channel feature bundles individual links into a channel group to create a single logical link that provides the aggregate bandwidth of up to eight physical links. If a member port within a Port Channel fails, traffic previously carried over the failed link switches to the remaining member ports within the port channel.
  • The switch had the Port Channel feature enabled, so it assumed that it, too, was responsible for handling link disconnections.
  • The cause of the problem was therefore that both the PAM server and the switch were trying to manage the bonding, which resulted in a total link failure.
Resolution:
Disable the Port Channel feature on the network devices for the links that are connected to the PAM appliances.
Additional Information:
  • A similar situation occurs with bonds that have more than 2 NICs.
  • In this case, all the NICs in the bond except one have to fail, and the general failure to access the PAM server occurs once all of the failed NICs have recovered.
  • For instance, in a PAM server with 4 NICs (NIC_1, NIC_2, NIC_3 and NIC_4) bonded in Adaptive Load Balancing mode, NIC_1, NIC_2 and NIC_3 must all fail for the problem to occur. The PAM server communication failure happens when the last of the failed NICs recovers, not when only 1 or 2 of them have recovered.
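The condition described in this section can be expressed as a small check. The following Python sketch is illustrative only and is not part of the PAM product: the function name outage_expected and its parameters are hypothetical, and it simply encodes the rule stated above (all NICs but one have failed, and every failed NIC has since recovered) for a switch that still has Port Channel enabled.

```python
from typing import Iterable


def outage_expected(all_nics: Iterable[str],
                    failed: Iterable[str],
                    recovered: Iterable[str]) -> bool:
    """Return True if the loss of connectivity described in this article is expected.

    Hypothetical helper that only models the rule in this article:
    - all_nics:  every NIC in the Adaptive Load Balancing bond
    - failed:    NICs whose link went down
    - recovered: the failed NICs whose link has come back up
    """
    nics, down, back = set(all_nics), set(failed), set(recovered)
    if len(nics) < 2:
        raise ValueError("a bond needs at least 2 NICs")
    if not down <= nics or not back <= down:
        raise ValueError("failed must be part of the bond, recovered part of failed")

    # All NICs except one must have failed ...
    all_but_one_failed = len(down) == len(nics) - 1
    # ... and the last of the failed NICs must have recovered.
    every_failed_nic_recovered = back == down
    return all_but_one_failed and every_failed_nic_recovered


bond = ["NIC_1", "NIC_2", "NIC_3", "NIC_4"]
failed = ["NIC_1", "NIC_2", "NIC_3"]

print(outage_expected(bond, failed, ["NIC_1", "NIC_2"]))           # False
print(outage_expected(bond, failed, ["NIC_1", "NIC_2", "NIC_3"]))  # True
```

With the 4-NIC example above, the check stays False while only 1 or 2 of the failed NICs are back and becomes True once the third one recovers. The same call with ["NIC_A", "NIC_B"] as the bond and NIC_A as both failed and recovered reproduces the 2-NIC case from the Issue section.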