Windows Agent in cluster doesn't retrieve the second Automation Engine when the first is down

Document ID : KB000084433
Last Modified Date : 14/04/2018
Issue:
Error Message :
U02000042 Connection aborted. Error code '10053', error description: 'An established connection was aborted by the software in your host machine.'.

In an Active-Active Automation Engine (AE) cluster, agents are expected to switch from one Engine to the other when an Engine fails (OS issue, network issue, shutdown of the virtual machine, ...). When an Engine is shut down gracefully, agents automatically connect to the Communication Processes (CPs) of the other Engine, as expected.

However, when an Engine is shut down abruptly (for example, by powering off the virtual machine), the connected agents do not automatically fail over to the CPs running on the other active AE.

This occurs for all Windows agents in a clustered environment.

Investigation
  • Connect Agents to a given AE CP in an Active-Active cluster
  • Take the node offline either using sudden power off or a network disconnection
  • Agents connected to the now-offline instance are unaware of the failover
  • They take 15 to 18 minutes to detect the outage, after which they simply show as offline
  • As soon as the AE server is taken offline, it is no longer possible to communicate with the affected agents
Environment:
OS: All Windows
Cause:
Cause type:
Defect
Root Cause: The Agent does not use the KEEP_ALIVE variable correctly, so it does not reconnect directly to the second Automation Engine.
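
To illustrate the general idea behind a keep-alive timeout, the following is a minimal Python sketch of how a client might decide that its server is gone and move to another endpoint. It is not the Agent's actual implementation: the class name FailoverMonitor, the endpoint strings, and the 60-second timeout are all hypothetical, and the real semantics of KEEP_ALIVE may differ. The point is only that, when a server disappears without closing the connection, the client has to notice the silence itself and switch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailoverMonitor:
    """Hypothetical helper: rotates to the next endpoint once the server
    has been silent for longer than the keep-alive timeout."""
    endpoints: List[str]        # hypothetical CP addresses, one per AE node
    keep_alive_seconds: float   # hypothetical timeout, not the product's KEEP_ALIVE semantics
    current: int = 0            # index of the endpoint currently in use
    last_heard: float = 0.0     # timestamp of the last message from the server

    def heard_from_server(self, now: float) -> None:
        # Call whenever any message (or keep-alive reply) arrives.
        self.last_heard = now

    def check(self, now: float) -> Optional[str]:
        """Return the endpoint to reconnect to if the current one is
        considered dead, otherwise None."""
        if now - self.last_heard <= self.keep_alive_seconds:
            return None
        # Silence exceeded the timeout: move to the next endpoint
        # and restart the timer for it.
        self.current = (self.current + 1) % len(self.endpoints)
        self.last_heard = now
        return self.endpoints[self.current]
```

In this sketch, a caller would invoke check() periodically with the current time and open a new connection whenever it returns an endpoint. With a short timeout, an abrupt power-off of one node would be noticed within that timeout rather than after many minutes.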
Resolution:
Update to a fix version listed below or a newer version if available.

Fix Status: In Progress

Fix Version(s):
Automation Engine 12.2.0 - Planned release date: 2018-06-19
Automation Engine 12.1.1 - Available
Automation Engine 12.0.4 - Available
Additional Information:
Workaround :
Completely restart the Windows Agent.