Unexpected Load Balancing Behavior after Collector Reconnection.

Document ID : KB000005833
Last Modified Date : 14/02/2018
Issue:

    After a collector restart, I do not see agents immediately connecting via load balancing to a recently re-connected collector. Why does this happen?

Environment:
This behavior can occur with any APM 10.0 environment regardless of the agents connecting to the environment.
Cause:

    Several factors determine how a MOM redistributes agents connected to a cluster via load balancing. The first is introscope.enterprisemanager.loadbalancing.interval in the MOM's IntroscopeEnterpriseManager.properties file, which sets how frequently the MOM attempts to move agents between Collectors (the default is 600 seconds).

After this load balancing interval has elapsed, the MOM calculates which Collector EMs should have agents redistributed (to lower the workload of overloaded Collector EMs). This calculation uses introscope.enterprisemanager.loadbalancing.threshold, which is also set in the MOM's IntroscopeEnterpriseManager.properties file (the default is 20,000). This threshold is the number of metrics by which a Collector's workload must differ from the weight-adjusted cluster average (metric average for all Collectors) before agents are redistributed between Collectors.
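For reference, the two settings described above would appear in the MOM's IntroscopeEnterpriseManager.properties file roughly as follows. This is a minimal fragment showing only the default values quoted in this article; the comments are illustrative and the rest of the file is omitted.

```properties
# How often (in seconds) the MOM attempts to rebalance agents between Collectors
introscope.enterprisemanager.loadbalancing.interval=600

# Metric-count difference from the weight-adjusted cluster average
# that must be exceeded before agents are redistributed
introscope.enterprisemanager.loadbalancing.threshold=20000
```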

During load balancing, the MOM uses this threshold to try to move agents from overloaded Collectors (those whose metric count is greater than the sum of the weight-adjusted cluster average and introscope.enterprisemanager.loadbalancing.threshold) to under-loaded Collectors (those whose metric count is less than the weight-adjusted cluster average). However, if no Collector's metric count exceeds the weight-adjusted cluster average plus the load balancing threshold, no agents will be moved.
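The comparison described above can be sketched in a few lines of Python. This is an illustration only, not the MOM's actual implementation: the Collector class, the weight field, and the formula used for the weight-adjusted average (each Collector's weighted share of the cluster's total metrics) are assumptions made for the example.

```python
from dataclasses import dataclass

# Default of introscope.enterprisemanager.loadbalancing.threshold
THRESHOLD = 20_000


@dataclass
class Collector:
    name: str
    metric_count: int
    weight: float = 1.0  # hypothetical relative capacity of this Collector


def weight_adjusted_average(collector, collectors):
    """Assumed formula: this Collector's weighted share of all cluster metrics."""
    total_metrics = sum(c.metric_count for c in collectors)
    total_weight = sum(c.weight for c in collectors)
    return total_metrics * collector.weight / total_weight


def classify(collectors, threshold=THRESHOLD):
    """Return (overloaded, under_loaded) Collector names."""
    overloaded, under_loaded = [], []
    for c in collectors:
        average = weight_adjusted_average(c, collectors)
        if c.metric_count > average + threshold:
            overloaded.append(c.name)
        elif c.metric_count < average:
            under_loaded.append(c.name)
    return overloaded, under_loaded


# Collector03 has just reconnected and carries no metrics yet.
unbalanced = [
    Collector("Collector01", 90_000),
    Collector("Collector02", 60_000),
    Collector("Collector03", 0),
]
print(classify(unbalanced))  # (['Collector01'], ['Collector03'])

# Here no Collector exceeds average + threshold, so nothing is overloaded
# and no agents would be moved, even though Collector03 is still empty.
within_threshold = [
    Collector("Collector01", 60_000),
    Collector("Collector02", 60_000),
    Collector("Collector03", 0),
]
print(classify(within_threshold))  # ([], ['Collector03'])
```

The second case shows why an empty, recently reconnected Collector does not by itself trigger redistribution: agents are moved only when at least one Collector is over the average by more than the threshold.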

Now let's assume that during load balancing, an agent is disconnected and needs to reconnect to the cluster. At this point, it would be logical to assume that the reconnecting agent will be load balanced to the recently reconnected Collector, as this Collector has the least agent load. Unfortunately, this is not true under all circumstances.

Specifically, here are some of the reasons why the new agent might not connect to the least loaded collector:

- Agents can be "latched" or specifically assigned to specific collectors directly, regardless of other load balancing rules.

- In loadbalancing.xml, agents can be specifically included in or excluded from specific Collectors. If a reconnecting agent is excluded from the under-loaded Collectors, it can only connect to Collectors that already carry a high load.

- In the case of a temporary agent disconnection, an agent is not officially removed from its Collector until a certain time period has passed (referred to as being "unmounted"). If the agent has not yet been unmounted from a Collector, it will prioritize reconnecting to that Collector, assuming the Collector is not in an overloaded state.

- The MOM EM keeps track of the Collectors an agent has already connected to (referred to as "history") and will prioritize reconnecting an agent to a Collector that the agent has already connected to, assuming that Collector is not overloaded.

In all cases, enabling DEBUG logging on the MOM EM causes additional load balancing messages to be logged, which explain why agents are connected to specific Collectors. For example, here is a situation in which Collector01 is favored over other available Collectors because the agent has a previous history with Collector01:


[DEBUG] [PO:main Mailman 8] [Manager.LoadBalancer] SuperDomain/Domain|<HostName>|<Process Name>|<Agent Name> has history in [Collector01.ca.com@5001]
[DEBUG] [PO:main Mailman 8] [Manager.LoadBalancer] Prefer Collector01.ca.com@5001 because it has history of the agent
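
When many agents reconnect at once, it can help to pull only these preference messages out of the MOM log. The following Python sketch is one way to do that. It assumes the messages follow the wording of the two sample lines above; the pattern and the function name are illustrative, and real log lines may carry a timestamp prefix, which is why the pattern is searched rather than anchored at the start of the line.

```python
import re

# Modelled on the sample "Prefer ... because ..." line shown above.
PREFER = re.compile(
    r"\[DEBUG\].*\[Manager\.LoadBalancer\] "
    r"Prefer (?P<collector>\S+) because (?P<reason>.+)"
)


def preferred_collectors(lines):
    """Return (collector, reason) pairs for each load balancer preference message."""
    results = []
    for line in lines:
        match = PREFER.search(line)
        if match:
            results.append((match.group("collector"), match.group("reason").strip()))
    return results


sample = [
    "[DEBUG] [PO:main Mailman 8] [Manager.LoadBalancer] "
    "SuperDomain/Domain|<HostName>|<Process Name>|<Agent Name> "
    "has history in [Collector01.ca.com@5001]",
    "[DEBUG] [PO:main Mailman 8] [Manager.LoadBalancer] "
    "Prefer Collector01.ca.com@5001 because it has history of the agent",
]
print(preferred_collectors(sample))
# [('Collector01.ca.com@5001', 'it has history of the agent')]
```

To run it against a real log, pass an open file object in place of the sample list, since the function only iterates over lines.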

Resolution:

    As described above, there are a number of valid reasons for which agents may not immediately connect to a reconnected Collector. Should you have any questions on this behavior, refer to https://docops.ca.com/ca-apm/10-5/en/administrating/configure-enterprise-manager/configure-mom-agent-load-balancing or contact CA Support for further assistance.