APM has slow response times when one Collector has network problems

Document ID : KB000095759
Last Modified Date : 10/06/2018
Issue:
When one of the Collectors (on-premise in this case) has network problems (the MOM log shows timeout errors for that Collector), the WebView Investigator becomes slow.
Even retrieving the “host” metric for the MOM is slow.

When that Collector is removed from the MOM's loadbalancing.xml definition file, the WebView Investigator responds immediately again.

No agents are currently connected to that APM cluster.
Environment:
APM 10.5 SP 2, APM 10.7
Cause:
A Collector does not need to contact the other Collectors; it only needs to resolve their addresses.

This is what happens:
- Each time the Loadbalancer configuration changes, the MOM sends the new LB info to all Collectors to keep them up to date.
- The Loadbalancer info may include lists of Collectors, depending on what is configured.
- Each Collector tries to resolve the addresses of the other Collectors in the new Loadbalancer info, in order to enumerate all addresses/hosts for each Collector; these are sent to Agents later.
- A thread dump shows that this resolution step runs on the "Dispatcher" thread, which can delay subsequent requests from the MOM and therefore impact cluster performance.
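
The effect of a blocking lookup on a single dispatcher thread can be illustrated with a small simulation. This is only a sketch and not APM code: the function names are hypothetical, the slow address resolution is simulated with a sleep, and a one-worker thread pool stands in for the "Dispatcher" thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_resolve(delay_s):
    # Hypothetical stand-in for an address lookup that stalls on a bad network.
    time.sleep(delay_s)
    return "resolved"


def quick_request():
    # Hypothetical stand-in for a cheap request, e.g. fetching one metric.
    # Returns the moment at which the request actually started running.
    return time.monotonic()


def measure_queue_delay(resolve_delay_s=0.2):
    # A single worker models one dispatcher thread that handles work in order.
    with ThreadPoolExecutor(max_workers=1, thread_name_prefix="Dispatcher") as dispatcher:
        submitted = time.monotonic()
        dispatcher.submit(slow_resolve, resolve_delay_s)
        started = dispatcher.submit(quick_request).result()
    # Time the quick request spent queued behind the slow lookup.
    return started - submitted


if __name__ == "__main__":
    waited = measure_queue_delay()
    print(f"quick request waited {waited:.2f}s behind the lookup")
```

The quick request does no slow work itself, yet it cannot start until the lookup ahead of it finishes, which matches the slow "host" metric described in the Issue section.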
 
Resolution:
A workaround is to ensure that the Collector addresses resolve locally: if DNS is having issues, add the Collector IPs to the local hosts file.
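
One way to confirm that each Collector address resolves quickly is to time the lookups from the affected host. The script below is a minimal sketch using only the Python standard library; `check_resolution` is a hypothetical helper, and the hostname in the example run is a placeholder to be replaced with the Collector hosts configured in your cluster.

```python
import socket
import time


def check_resolution(hosts, resolver=socket.getaddrinfo):
    """Return {host: (ok, seconds, detail)} for each hostname.

    detail is a sorted list of addresses on success, or the error text on failure.
    The resolver argument exists only so the lookup can be replaced in tests.
    """
    results = {}
    for host in hosts:
        start = time.monotonic()
        try:
            infos = resolver(host, None)
            # Each entry ends with a sockaddr tuple whose first item is the address.
            detail = sorted({info[4][0] for info in infos})
            ok = True
        except socket.gaierror as exc:
            detail = str(exc)
            ok = False
        results[host] = (ok, time.monotonic() - start, detail)
    return results


if __name__ == "__main__":
    # Placeholder host; replace with the Collector hostnames used in your cluster.
    for host, (ok, seconds, detail) in check_resolution(["localhost"]).items():
        status = "OK" if ok else "FAILED"
        print(f"{host}: {status} in {seconds:.3f}s -> {detail}")
```

A lookup that fails or takes several seconds points to the name-resolution problem described above. A hosts file entry has the form `IP-address hostname`, for example `10.0.0.12 collector2.example.com` (both values hypothetical).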