Agents fail to re-connect after RMI failover

Document ID : KB000084622
Last Modified Date : 14/04/2018
Issue:
Error Message :
ErrorMsg: AwE-5103 network socket error (8/16/17 3:20 PM)
Details: 217fa8[TLS_DH_anon_WITH_AES_128_GCM_SHA256: Socket[addr=prod.test.org/10.200.3.40,port=60010,localport=61180]]
java.net.SocketException: Connection reset

In the following scenario, Agents can remain stuck in a SRVC_DOWN status after an RMI failover:
  1. The primary RMI goes down and the Agents fail over to the secondary RMI successfully.
  2. The primary RMI is started again; the Agents go down for a couple of seconds but reconnect.
  3. The secondary RMI is then killed; the Agents go down and do not reconnect.
The Agents remain in a SRVC_DOWN status. This happens even though the primary RMI is running and is configured as the primary RMI.

Environment:
OS Version: N/A
Cause:
Cause type: Defect
Root Cause: Defects in the Agent failover and reconnect handling. The fix makes the following changes:
  1. Debugging output is added and improved.
  2. The Network Checker skips sockets that are in Shutdown.
  3. Startup code is no longer called during failover, so that the AM does not end up with multiple read threads for the same rmiserver, and so that the AM loops checking for the master in the database.
  4. Removal of the old socket in reconnect() is fixed.
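
As a rough illustration of the reconnect-related items in the root cause (skipping sockets that are shut down, keeping a single read thread per rmiserver, and removing the old socket on reconnect), the following Java sketch models the intended behavior. It is not Applications Manager source code: the class FailoverSketch, the Connection type, and all method names are hypothetical, and real sockets and threads are replaced by simple flags so that the example runs without a network.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative model only. All names are hypothetical, not product code. */
final class FailoverSketch {

    /** Hypothetical stand-in for one socket to an RMI server. */
    static final class Connection {
        final String rmiServer;
        boolean shutdown;             // true once the socket has been shut down
        boolean readerRunning = true; // models the read thread attached to it

        Connection(String rmiServer) {
            this.rmiServer = rmiServer;
        }

        void shutdown() {
            shutdown = true;
            readerRunning = false;    // the reader stops with its socket
        }
    }

    // At most one current connection per RMI server name.
    private final Map<String, Connection> current = new HashMap<>();
    // Every connection ever opened, including retired ones.
    private final List<Connection> history = new ArrayList<>();

    /** Retire the old connection for this server, then open a new one. */
    synchronized Connection reconnect(String rmiServer) {
        Connection old = current.remove(rmiServer);
        if (old != null) {
            old.shutdown();           // prevents a second reader for the same server
        }
        Connection fresh = new Connection(rmiServer);
        current.put(rmiServer, fresh);
        history.add(fresh);
        return fresh;
    }

    /** Number of readers still running for the given server. */
    synchronized long activeReaders(String rmiServer) {
        return history.stream()
                .filter(c -> c.rmiServer.equals(rmiServer) && c.readerRunning)
                .count();
    }

    /** Connections a periodic network check would probe; shut-down ones are skipped. */
    synchronized List<Connection> connectionsToCheck() {
        List<Connection> result = new ArrayList<>();
        for (Connection c : history) {
            if (!c.shutdown) {
                result.add(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        FailoverSketch sketch = new FailoverSketch();
        sketch.reconnect("primary");    // initial connection
        sketch.reconnect("secondary");  // failover
        sketch.reconnect("primary");    // primary is back: old socket is retired
        System.out.println(sketch.activeReaders("primary"));     // prints 1
        System.out.println(sketch.connectionsToCheck().size());  // prints 2
    }
}
```

In this model, retiring the old connection inside reconnect() is what keeps the reader count for a server at one, and filtering on the shutdown flag keeps the periodic check from probing sockets that are already being torn down.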
Resolution:
Update to a fix version listed below or a newer version if available.

Fix Status: Released

Fix Version(s):
Applications Manager 9.2.1 – Available
Additional Information:
Workaround :
Restart the Automation Engine and all Remote Agents.