Seeing Performance and other Issues with Summary MOM solution.

Document ID : KB000006177
Last Modified Date : 14/02/2018
Issue:

 Seeing a variety of issues with Summary MOM:

  - Out of Memory (OOM) messages when adding a 4th sender

  - Load balancing issues: the number of agents and metrics differed across collectors

  - Dashboard is sluggish and draws slowly from right to left

  - General EM Cluster Performance issues

  - Seeing data gaps in dashboard graphs

  - Search from Investigator in live mode returns AgentNotFoundException.

  - Live graph not showing data.

  - Collector hangs after restart.

 

Environment:
Customer had issues with Summary MOM 3.1 and APM 10.1.
Cause:

- GC Heap and Load Balancer Settings. (General cluster performance issues.)  

- Sending queries that returned more metrics than supported. (Summary MOM Capacity issue.) 

- Sending 1 data point across 2 harvest durations. This left a gap in the first period, although the data eventually showed up. (Summary MOM timing issue.)

- Live query impact.
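
The timing cause can be illustrated with a minimal sketch. It assumes a 15-second harvest interval and uses made-up timestamps and values; the function names are hypothetical and are not part of Summary MOM. A data point that is measured in one harvest period but delivered during the next leaves that period empty at first, and the gap closes once the point arrives.

```python
HARVEST_SECONDS = 15  # assumed harvest interval; adjust to your environment


def period_index(timestamp_s, interval_s=HARVEST_SECONDS):
    """Return the index of the harvest period that contains timestamp_s."""
    return int(timestamp_s // interval_s)


def visible_periods(points, now_s, interval_s=HARVEST_SECONDS):
    """Map period index -> value for the points that have arrived by now_s.

    Each point is a tuple (measured_at_s, arrived_at_s, value).
    """
    seen = {}
    for measured_at, arrived_at, value in points:
        if arrived_at <= now_s:
            seen[period_index(measured_at, interval_s)] = value
    return seen


def gaps(points, now_s, interval_s=HARVEST_SECONDS):
    """List the closed harvest periods that show no data point at now_s."""
    seen = visible_periods(points, now_s, interval_s)
    closed = period_index(now_s, interval_s)  # periods 0..closed-1 are complete
    return [p for p in range(closed) if p not in seen]


# Illustrative data only: (measured_at_s, arrived_at_s, value)
points = [
    (5, 6, 100),    # measured and delivered inside period 0
    (20, 31, 110),  # measured in period 1, delivered during period 2
    (35, 36, 120),  # measured and delivered inside period 2
]

gaps_at_30 = gaps(points, now_s=30)  # period 1 has closed, its point is still in flight
gaps_at_45 = gaps(points, now_s=45)  # the late point has arrived and filled period 1
print(gaps_at_30, gaps_at_45)
```

In this sketch the gap is reported for period 1 at 30 seconds and is gone at 45 seconds, which matches a delay rather than lost data.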

Resolution:

What was done/proposed:

  - Optimize sender regexes across multiple sender collectors to send only needed metrics by adding an exclusion metric.

  - Data was sent at harvest time, so 1 data point was processed across 2 harvest durations and there was a delay before the metric appeared. This was therefore a timing issue, not a capacity issue.

 - Add more senders. Changed architecture to only have 2-3 senders per summary MOM (receiver). 

 - Updated Summary Engine code to be more efficient and provide more logging details helpful for debugging. 

 - Changed the EM JVM settings:

    - Use G1GC (-XX:+UseG1GC).

    - Removed the PermGen settings (PermGen was removed in Java 8, that is, JVM 1.8).

    - Removed -XX:+UseConcMarkSweepGC.

 - Upgraded Summary MOM from Release 3.1 to 3.4. (Multiple tested updates eventually became Release 3.4.)

 - Increase introscope.enterprisemanager.framework.receiver.queue.size

 - Proposed changing introscope.enterprisemanager.loadbalancing.staywithhistoricalcollector to always, but this was not done. The 3 possible values are always, notoverloaded, and rarely; the default is notoverloaded.

 - Applied APM 10.1 HF 25 to deal with issues when the EM is not responsive

 - Reduced load balancing metric threshold 

 - Proposed switching the Workstation to a 64-bit JVM, but this was not done.
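
As an illustration of the sender regex optimization listed above, the sketch below narrows a pattern with an exclusion written as a negative lookahead. The metric paths, the pattern, and the function name are hypothetical, and the actual Summary MOM sender configuration syntax is not shown in this article, so treat this only as a way to try out a candidate regex before using it.

```python
import re

# Hypothetical sender pattern: forward Average Response Time for Frontends and
# Backends, but exclude everything under a "Called Backends" sub-tree.
SENDER_REGEX = re.compile(
    r"^(?!.*\|Called Backends\|)"  # exclusion: skip nested called-backend metrics
    r".*\|(Frontends|Backends)\|.*:Average Response Time \(ms\)$"
)


def metrics_to_send(metric_paths):
    """Return only the metric paths that the sender pattern would forward."""
    return [path for path in metric_paths if SENDER_REGEX.match(path)]


# Made-up metric paths, for illustration only.
candidates = [
    "host1|Tomcat|Agent1|Frontends|Apps|Shop:Average Response Time (ms)",
    "host1|Tomcat|Agent1|Frontends|Apps|Shop|Called Backends|DB:Average Response Time (ms)",
    "host1|Tomcat|Agent1|Backends|DB:Responses Per Interval",
    "host1|Tomcat|Agent1|Backends|DB:Average Response Time (ms)",
]

selected = metrics_to_send(candidates)
for path in selected:
    print(path)
```

Counting how many paths a candidate pattern selects on a sample export is a quick way to check that a sender stays within the metric volume the receiver is expected to handle.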

 

 

Additional Information:

Also see these KDs for assistance

https://www.ca.com/us/services-support/ca-support/ca-support-online/knowledge-base-articles.tec1526667.html -- Tips for loadbalancing configuration when upgrading an Enterprise Manager Cluster

https://www.ca.com/us/services-support/ca-support/ca-support-online/knowledge-base-articles.TEC1749494.html  -- Unable to start Introscope Enterprise Manager and see the error message, "Could Not Create Java Virtual Machine."

It may be helpful to look at this Supportability metric if available: Custom Metric Host (Virtual)|Custom Metric Process (Virtual)|Custom Metric Agent (Virtual)|Enterprise Manager|JSON Metric Receiver|Queue Depths:pendingMetrics