What is the mechanism used to detect that an agent has disconnected from the Enterprise Manager?

Document ID : KB000077335
Last Modified Date : 12/04/2018
Question:
What is the mechanism that Introscope uses to detect if an agent has disconnected?
Answer:
There are primarily two kinds of connections between agent and EM - Isengard socket communication and HTTP tunneling of Isengard.
In both cases, the agent initiates the connection and registers with the EM.

In the case of Isengard, because the connection is persistent, the EM detects the disconnection in two ways:
1. If the EM tries to write some data and the write fails, the EM marks the agent as gone away. This is immediate, but the agent still appears greyed out in the Investigator tree for 30 minutes in case it connects again.
2. If there is no read or write activity from either the agent or the EM side (typically, most of the activity is the agent sending data and the EM returning acknowledgements), the EM identifies that the channel has been idle and tries to ping it. This happens every 120 seconds, but in most cases a write fails first and the EM learns of the disrupted connection quickly.
So, the worst case scenario is 120 seconds, but in our experience, the detection is instantaneous.
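The two detection paths above can be modelled with a short sketch. This is only an illustration of the logic described, not Introscope's implementation: the class, its method names, and the injected `send` callable are hypothetical, and only the 120-second idle interval and the 30-minute greyed-out period come from this article.

```python
from dataclasses import dataclass
from typing import Callable, Optional

IDLE_PING_INTERVAL_S = 120       # idle interval stated in this article
GREYED_OUT_RETENTION_S = 30 * 60  # agent stays greyed out for 30 minutes


@dataclass
class AgentConnection:
    """Hypothetical model of how an EM could track one persistent agent socket."""
    send: Callable[[bytes], None]  # assumed to raise OSError when the write fails
    last_activity: float = 0.0
    connected: bool = True
    disconnected_at: Optional[float] = None

    def write(self, data: bytes, now: float) -> bool:
        """Path 1: a failed write marks the agent as gone away immediately."""
        try:
            self.send(data)
        except OSError:
            self.connected = False
            self.disconnected_at = now
            return False
        self.last_activity = now
        return True

    def check_idle(self, now: float) -> None:
        """Path 2: ping the channel once it has been idle for 120 seconds."""
        if self.connected and now - self.last_activity >= IDLE_PING_INTERVAL_S:
            self.write(b"ping", now)

    def shown_greyed_out(self, now: float) -> bool:
        """True while a disconnected agent would still be listed as greyed out."""
        if self.connected or self.disconnected_at is None:
            return False
        return now - self.disconnected_at < GREYED_OUT_RETENTION_S
```

With a `send` that always fails, an idle connection is only marked as disconnected once `check_idle` is called at or after the 120-second mark, which is the worst case described above.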

In the case of HTTP tunneling, the EM waits for an HTTP request with a payload from the agent and relies on the response payload to send any data back. Detection therefore relies on the inactive HTTP session timeout, which is 60 seconds (it was 300 seconds for versions before 9.x). So, if the agent fails to send requests, the EM finds out within 60 seconds.
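As a rough illustration of the session-timeout approach, the check amounts to comparing the time since each agent's last HTTP request against the timeout. The function names and the dictionary of last-request timestamps below are assumptions made for this sketch; only the 60-second and 300-second values come from this article.

```python
from typing import Dict, List


def session_timeout_s(major_version: int) -> int:
    """Inactive HTTP session timeout: 300 s before 9.x, 60 s from 9.x onward."""
    return 300 if major_version < 9 else 60


def timed_out_agents(last_request_at: Dict[str, float],
                     now: float,
                     timeout_s: int) -> List[str]:
    """Return the agents whose last HTTP request is at least timeout_s old."""
    return sorted(
        agent for agent, seen in last_request_at.items()
        if now - seen >= timeout_s
    )
```

For example, with a 60-second timeout, an agent last heard from at t=0 is reported at t=70, while an agent that sent a request at t=50 is not.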
Additional Information:
If the agent does not disconnect when you believe it should, it is likely that the connection is still there but the agent cannot collect data, for example because of a server hang. In that situation the agent connection thread can still be fine.
That is also why the connection state will not change until the agent is restarted, which ends the hang.
While relying on the agent connection state is a good idea, we suggest also adding alerts based on other metrics, such as responses per interval or CPU utilization, choosing metrics that are normally very active in the application.
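The idea behind such an alert can be sketched as follows: an agent that is still connected but whose normally busy metric has reported nothing for several consecutive intervals is probably hung. The function name, the `quiet_intervals` threshold of 4, and the list of samples are hypothetical choices for this sketch, not product settings.

```python
from typing import Sequence


def looks_hung(connected: bool,
               responses_per_interval: Sequence[int],
               quiet_intervals: int = 4) -> bool:
    """Flag a connected agent whose busy metric has been zero for the
    last quiet_intervals samples (threshold is an assumed example value)."""
    if not connected or quiet_intervals <= 0:
        return False
    if len(responses_per_interval) < quiet_intervals:
        return False  # not enough history to decide
    recent = responses_per_interval[-quiet_intervals:]
    return all(value == 0 for value in recent)
```

Choosing a metric that is normally very active keeps false positives low, because a run of zeros is then unusual rather than routine.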