Getting "agent down" alerts from Introscope after changing loadbalancing.xml settings.

Document ID : KB000030055
Last Modified Date : 14/02/2018

Description:

The loadbalancing.xml file was changed to reassign a few Agents so that they report to different Collectors. After this change, Introscope started sending alerts stating that the Agents were down.

 

Solution:

This behavior is by design; nothing was done incorrectly when changing loadbalancing.xml.

The Connection Status metric on the original Collector keeps alerting that the Agent is disconnected because APM keeps that metric alive. It does so because APM cannot tell whether the Agent was

1) taken down by someone on purpose,

2) undergoing a network hiccup,

3) running some operation that caused it to crash,

4) running some operation that caused it to disconnect, etc.

The metric is therefore kept alive in case the Agent only had, for example, a network hiccup and will reconnect. Because Agents do not have actual identifiers that say "I'm Agent #1", APM cannot tell that the Agent that just disconnected from Collector A is the same Agent as the one that just connected to Collector B, even though they have the same name.

Here is how the metrics work. Each Collector has its own Smartstor database. The Collectors know nothing about each other; as far as each one is concerned, it is the only Collector reporting to the MOM. Only the MOM knows about all the Collectors.

So if Agent A connects to Collector A and you point the GUI at Collector A, you will see the metrics for Agent A there.

Suppose Agent A moves to Collector B on a Tuesday. If you point the GUI at Collector B, you will see metrics for Agent A starting Tuesday only, and none for Monday, even though the Agent never shut down.

The MOM, however, collects the metrics from every Collector that Agent A was connected to, and so displays the whole history of the Agent. It does this each time you view a metric for a particular Agent or set of Agents.

The data is merged only when viewing metrics in the GUI. It is not merged in any Collector's Smartstor database, nor in the MOM's Smartstor database.
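The view-time merge described above can be sketched in Python. This is only an illustration of the idea, not how the MOM is implemented: the function name merge_agent_history, the data layout, and the sample values are all hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (timestamp, metric value); layout is hypothetical


def merge_agent_history(
    per_collector: Dict[str, List[Sample]]
) -> List[Tuple[int, float, str]]:
    """Build one time-ordered view of an Agent's samples across Collectors.

    The per-Collector stores are only read, never modified, mirroring the
    point that the merge happens at view time and not in any database.
    """
    merged = [
        (timestamp, value, collector)
        for collector, samples in per_collector.items()
        for timestamp, value in samples
    ]
    merged.sort(key=lambda row: row[0])
    return merged


# Hypothetical data: Agent A reported to Collector A, then moved to Collector B.
collector_stores = {
    "CollectorA": [(1, 10.0), (2, 12.0)],  # "Monday"
    "CollectorB": [(3, 11.0), (4, 9.0)],   # "Tuesday" onward
}

history = merge_agent_history(collector_stores)
for timestamp, value, collector in history:
    print(timestamp, value, collector)
```

Looking at collector_stores["CollectorB"] alone corresponds to pointing the GUI at Collector B: only the later samples are present, while the merged list covers the full history.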

Note that you are allowed to have more than one Agent of the same type with the same name, e.g. two WebLogic Agents named MyAgent. They would then appear as MyAgent and MyAgent%1 (the naming convention used to differentiate between the two MyAgents). This is why Agent A shows a disconnect and an actual connect at the same time.

See the attached screenshot. Notice that even though it is the same Agent, it is represented in two different colors because the EM thinks it is two separate Agents. In the reproduction of the issue, 1:23:15 was the last time the Agent showed that it was connected. At 1:24:15 the Agent started to show that it was disconnected. At 1:25:15 the Agent reappeared as connected to the other Collector, yet the other metric still shows as disconnected and will do so until the age-out time is met (30 minutes is the default). So if you go to the metric group and go back in time, it will show two metrics at that particular time: a disconnect and a new connection. The data is aged out from the view, but it still resides in the Smartstor database until it is aged out of all 3 tiers. See the Smartstor tier settings in the IntroscopeEnterpriseManager.properties file.

agent disconnect - reconnect.png 
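The timing in the reproduction can be worked through with a few lines of Python. This is a rough sketch that assumes the 30-minute age-out is counted from the first disconnected reading; the article does not state exactly when that clock starts, and the function name is hypothetical.

```python
from datetime import timedelta

AGE_OUT = timedelta(minutes=30)  # default age-out time per the article


def stale_metric_visible_until(first_disconnected: timedelta,
                               age_out: timedelta = AGE_OUT) -> timedelta:
    """Time of day at which the old 'disconnected' metric stops reporting.

    Assumption: the age-out period starts at the first disconnected reading.
    """
    return first_disconnected + age_out


# Times of day taken from the reproduction described above.
first_disconnected = timedelta(hours=1, minutes=24, seconds=15)
reconnected = timedelta(hours=1, minutes=25, seconds=15)

visible_until = stale_metric_visible_until(first_disconnected)
overlap = visible_until - reconnected

print(visible_until)  # 1:54:15
print(overlap)        # 0:29:00
```

Under that assumption, both the disconnected metric and the new connected metric would be visible together for roughly 29 minutes.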

So how can you gracefully change which Agents go to which Collectors via loadbalancing.xml?  There are a few ways to handle this type of alerting:

1. Create a metric group that looks at the Connection Status metric on all Collectors for a specific Agent. Then set an alert to trigger only if all the metric values are equal to 3. Use a delay of 2 minutes to give the Agent time to move from one Collector to another.

2. Create an alert downtime schedule for the Connection Status alerts. This would have to run for 30 minutes as that is how long it takes for a metric to age out (stop reporting).

3. Limit the amount of time for metric aging. However, this affects ALL metrics and can skew statistics if you have a network hiccup, a slow-performing Agent or Collector, etc. Please note this is NOT recommended unless you have a valid business case.
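The logic behind option 1 can be sketched in Python. This is not Introscope alert configuration, only an illustration of the rule "alert when every Collector reports 3 for the whole delay". The value 3 comes from the article; the value 1 for a connected Agent, the 60-second reporting interval, and all names are assumptions made for the example.

```python
from typing import Dict, List

DISCONNECTED = 3        # value the alert in option 1 checks for
CONNECTED = 1           # assumed value for a connected Agent (illustrative)
INTERVAL_SECONDS = 60   # assumed reporting interval (illustrative)
DELAY_SECONDS = 120     # 2-minute delay suggested in option 1
REQUIRED_INTERVALS = DELAY_SECONDS // INTERVAL_SECONDS


def agent_is_down(samples: List[Dict[str, int]], required_intervals: int) -> bool:
    """Return True only if every Collector reported DISCONNECTED for the
    last `required_intervals` intervals.

    `samples` holds one dict per interval, oldest first, mapping a Collector
    name to the Connection Status value it reports for the same Agent.
    """
    if len(samples) < required_intervals:
        return False
    recent = samples[-required_intervals:]
    return all(
        bool(statuses) and all(v == DISCONNECTED for v in statuses.values())
        for statuses in recent
    )


# Agent moved from Collector A to Collector B (hypothetical data).
move = [
    {"CollectorA": CONNECTED},
    {"CollectorA": DISCONNECTED},
    {"CollectorA": DISCONNECTED, "CollectorB": CONNECTED},
    {"CollectorA": DISCONNECTED, "CollectorB": CONNECTED},
]

# Agent really went down and did not reconnect anywhere (hypothetical data).
outage = [
    {"CollectorA": CONNECTED},
    {"CollectorA": DISCONNECTED},
    {"CollectorA": DISCONNECTED},
]

print(agent_is_down(move, REQUIRED_INTERVALS))    # False
print(agent_is_down(outage, REQUIRED_INTERVALS))  # True
print(agent_is_down(move[:2], 1))                 # True
```

The last call shows why the delay matters: evaluated with no delay right after the disconnect, the move looks identical to an outage.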