Introscope Enterprise Manager Troubleshooting and Best Practices

Document ID : KB000093176
Last Modified Date : 08/06/2018
Introduction:
The following is a high-level list of techniques and suggestions to employ when troubleshooting common Introscope EM performance and configuration issues.
A) Common Messages
B) Checklist and Best Practices
C) What diagnostic files should I gather for CA Support?
Environment:
Any 10.x
Instructions:
A) Common Messages

Below is a list of the common ERROR, WARN, and exception messages that affect Introscope EM or cluster performance:

1. [WARN] [Harvest Engine Pooled Worker] [Manager.Agent]  [The EM has too many historical metrics reporting from Agents and will stop accepting new metrics from Agents.  Current count
 
Recommendation:
Open EM_HOME/config/apm-events-thresholds-config.xml on the EM/Collectors
Increase the introscope.enterprisemanager.metrics.historical.limit clamp
This is a hot property; there is no need to restart the EM
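For reference, the clamp entry in apm-events-thresholds-config.xml has this shape (the threshold value shown is illustrative; raise it to match your load):

<clamp id="introscope.enterprisemanager.metrics.historical.limit">
    <threshold value="1200000"/>
</clamp>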
 

2. [WARN] [PO:WatchedAgentPO Mailman 1] [Manager.Agent] The Agent <your.agent> is exceeding the per-agent metric clamp (current=5000, max=5000). New metrics will not be accepted
 
Recommendation:
If the problem is related to missing metrics in the Investigator and the agent is listed in the above WARN message, then:
Open EM_HOME/config/apm-events-thresholds-config.xml on the EM/Collectors
Increase the introscope.enterprisemanager.agent.metrics.limit clamp
This is a hot property; there is no need to restart the EM
 

3. [WARN] [Dispatcher 1] [Manager] Timed out adding to outgoing message queue. Limit of 3000 reached.
 
Recommendation:
- Open EM_HOME/config/IntroscopeEnterpriseManager.properties on the EM/Collectors
- Set transport.outgoingMessageQueueSize=8000
If it is already 8000, increase the value by 2000; adjust the value as required

- Add the below 2 properties in all the EMs (MOM and collectors)
transport.override.isengard.high.concurrency.pool.max.size=10
transport.override.isengard.high.concurrency.pool.min.size=10

For the MOM, size the pool with one thread per Collector, plus about one thread for every 10 Workstation connections, plus one thread for every 25 WebView user connections (as this is roughly equivalent to a single Workstation connection).
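For example, a MOM with 10 Collectors, 20 Workstation connections, and 50 WebView users would need roughly 10 + (20 / 10) + (50 / 25) = 14 threads, so setting both pool properties to 14 (or slightly higher) would be a reasonable starting point.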

A restart of the EMs is required for the changes to take effect.

NOTE: Increasing the outgoing message queue gives you a bigger buffer. Increasing the thread pool size gives you more worker threads to send outgoing messages. These adjustments are required when sending messages, usually between the Collectors and the MOM, becomes a performance bottleneck.


4. java.io.IOException: Too many open files
 
Recommendation: (UNIX only)
Make sure the maximum open file handle limit is at least 4096 on both the MOM and Collectors. You can check the current limit by running "ulimit -n" as the user who starts the EM processes. You might need to increase the maximum number of open files allowed for that user, as sketched below.
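A minimal check-and-raise sketch on Linux (the user name "introscope" is an assumption; substitute the account that starts the EM):

ulimit -n                      # current soft limit for open files (run as the EM user)
ulimit -Hn                     # current hard limit

# To raise the limits persistently, add lines like these to
# /etc/security/limits.conf and log in again:
introscope soft nofile 4096
introscope hard nofile 4096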
 

5. java.io.IOException: No space left on device
 
Recommendation:
Increase disk space as soon as possible to prevent corruption of the databases (SmartStor, traces, heuristics).
 

6. [ERROR] [Lucene Merge Thread #1] [Manager] Uncaught Exception in Enterprise Manager:  In thread Lucene Merge Thread #1 and the message is org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
 
Recommendation: 
This message indicates that the trace database is corrupted.
Stop the Introscope EM, delete the traces folder, and start the EM again (see the sketch below).

If you are unsure of the location, open the config/IntroscopeEnterpriseManager.properties, check the introscope.enterprisemanager.transactionevents.storage.dir property
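A minimal sketch on UNIX, assuming the default storage location EM_HOME/traces (verify the path with the property above before deleting anything):

cd $EM_HOME
bin/EMCtrl.sh stop             # or stop the EM the way you normally do
rm -rf traces
bin/EMCtrl.sh start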
 

7. "[Manager] Outgoing message queue is not moving"
"[Manager] Outgoing message queue is moving slowly"
"[WARN] [master clock] [Manager] EM load exceeds hardware capacity. Timeslice data is being aggregated into longer periods."
 
Recommendation:
This is due to a capacity or configuration issue; review the "Checklist and Best Practices" section below.
 

8. java.lang.OutOfMemoryError: GC overhead limit exceeded
 
Recommendation:
Open EM_HOME/Introscope_Enterprise_Manager.lax on the affected EM/Collector
Increase the EM heap size (-Xms/-Xmx) by 2 GB
 

9. [Manager.TransactionTracer] The Enterprise Manager cannot keep up with incoming event data
 
Recommendation:
Reduce the incoming trace rate; see point 9 in the "Checklist and Best Practices" section below.
 

10. [WARN] [master clock] [Manager.AppMap.Alert-Mapping] Processing of alerts is overloaded, ignoring <xxx> new alerts!
 
The above message indicates that many alerts are being automatically created and propagated to the AppMap.
Symptoms: overhead in memory, GC, harvest duration, and disk space due to the extra alert state changes.
 
Recommendation:
 
a) Try reducing the UVB metric clamp introscope.apmserver.uvb.clamp from the default 50000 to a very small number such as 5; this will reduce the metric handling for Differential Analysis alerts.
 
b) Reduce or stop the action notifications by:
- Excluding the specific frontend applications
- Excluding the specific frontend applications and then creating a separate differential control element specifically for that frontend (which allows fine-tuning of notifications).
- A simple way to reduce the number of notifications is to add actions to the danger list only.
  

11. [Manager] TransactionTrace arrival buffer full, discarding trace(s)
 
Team Center uses transaction trace data as the source for the Team Center Map by default. This causes the MOM to retrieve transaction trace data from the Agents, which can occasionally slow down the Collectors.
 
Recommendation:
Open EM_HOME/config/IntroscopeEnterpriseManager.properties
Locate introscope.enterprisemanager.transactiontrace.arrivalbuffer.capacity=
The default is 2500; you can try increasing the value to 5000. The impact will be on memory, so you might need to increase the EM heap size.
 

12. [Manager.Cluster] Collector clock is too far skewed from MOM. Collector clock is skewed from MOM clock by XXXX ms. The maximum allowed skew is 3,000 ms. Please change the system clock on the collector EM.
 
Recommendation:
Collector system clocks must be within 3 seconds of the MOM clock. Ensure the MOM and Collectors synchronize their system clocks with a time server (such as an NTP server); otherwise, the EMs will disconnect.
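A few common ways to verify clock synchronization on Linux (which one applies depends on the time service in use):

timedatectl status             # systemd hosts: look for "System clock synchronized: yes"
ntpq -p                        # ntpd: list peers and current offsets
chronyc tracking               # chrony: show offset from the reference clock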
 

13. [WARN] [Dispatcher 1] [Manager.IsengardObjectInputStream] Internal cache is corrupt. Cannot determine class type for Object 1. A prior class deserialization error may have corrupted the cache. 

Recommendation:
This is due to a capacity or configuration issue; review the "Checklist and Best Practices" section below.


14. [ERROR] [pool-1-thread-2] [Manager] failure to add data to baseline
java.lang.ClassCastException

 
Recommendation: 
This message indicates that the baseline database is corrupted.
Stop the Introscope EM, delete the data/variance.db file, and start the EM again.

If you are unsure of the location, open config/IntroscopeEnterpriseManager.properties and check the introscope.enterprisemanager.baseline.database property.
 
 

B) Checklist and Best Practices

Below is a list of common configuration problems that affect Introscope EM or cluster performance:

1. Heap size memory:
 
Set the initial heap size (-Xms) equal to the maximum heap size (-Xmx). Because no heap expansion or contraction occurs, this can yield significant performance gains in some situations.

- In a UNIX setup, update EM_HOME/Introscope_Enterprise_Manager.lax:
lax.nl.java.option.additional

- In a Windows setup, update EM_HOME/bin/EMService.conf:
wrapper.java.initmemory=
wrapper.java.maxmemory=
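For example, to pin an 8 GB heap (the size is illustrative; size to your environment):

- UNIX, EM_HOME/Introscope_Enterprise_Manager.lax:
lax.nl.java.option.additional=-Xms8192m -Xmx8192m ...

- Windows, EM_HOME/bin/EMService.conf (values are in MB):
wrapper.java.initmemory=8192
wrapper.java.maxmemory=8192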


2. Management Modules:
 
Management Modules should only be deployed on the MOM. Make sure the Collectors do not start with any Management Modules, to prevent unnecessary extra load (a command sketch follows the steps below).

- Stop the collector(s) only
- Rename EM_HOME/config/modules as modules_collector_backup
- Create an empty "modules" directory
- Start the collector
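On UNIX, the equivalent commands, run from EM_HOME on each Collector, would be:

mv config/modules config/modules_collector_backup
mkdir config/modules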
 

3. If you are using JVM 1.8, ensure that G1 GC is in use.
 
- Open the EM_HOME/bin/EMService.conf (windows) or EM_HOME/Introscope_Enterprise_Manager.lax (other platforms)
- Remove the following Java arguments if they are present: -XX:+UseConcMarkSweepGC and -XX:+UseParNewGC
- Add: -XX:+UseG1GC and -XX:MaxGCPauseMillis=200

For example:
- EM_HOME/Introscope_Enterprise_Manager.lax:
lax.nl.java.option.additional=-Xms20480m -Xmx20480m -Djava.awt.headless=true -Dmail.mime.charset=UTF-8 -Dorg.owasp.esapi.resources=./config/esapi -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xss512k

- EM_HOME/bin/EMService.conf:
wrapper.java.additional.7=-XX:+UseG1GC
wrapper.java.additional.8=-XX:MaxGCPauseMillis=200


4. Configure the EM correctly to run as a nohup process (UNIX only)

When running the EM on UNIX, you need to perform this step manually.
Open EM_HOME/Introscope_Enterprise_Manager.lax, locate the property lax.stdin.redirect, and update it as below:

Replace
lax.stdin.redirect=console

with:
lax.stdin.redirect=
 
see https://docops.ca.com/ca-apm/10-7/en/administrating/configure-enterprise-manager/start-or-stop-the-enterprise-manager/#StartorStoptheEnterpriseManager-RuntheEnterpriseManagerinnohupModeonUNIX
 
"Note: Only run the Enterprise Manager in nohup mode after you configure it. Otherwise, the Enterprise Manager might not start, or can start and consume excessive system resources."
 

5. Disable DEBUG logging 
 
By default logging is set to INFO in the IntroscopeEnterpriseManager.properties:
log4j.logger.Manager=INFO,console,logfile
 
If you have enabled DEBUG logging at some point, ensure it is disabled to prevent any impact on disk I/O.
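A quick way to check the current level:

grep "log4j.logger.Manager" EM_HOME/config/IntroscopeEnterpriseManager.properties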
 

6. OutOfMemory and slowness due to "heavy historical queries"
 
As a best practice, add the below two hidden properties to EM_HOME/config/IntroscopeEnterpriseManager.properties on all the EMs:
introscope.enterprisemanager.query.datapointlimit=100000
introscope.enterprisemanager.query.returneddatapointlimit=100000

You can verify this condition by opening the Metric Browser and expanding the branch:
 Custom Metric Host (virtual)
   - Custom Metric Process (virtual)
      - Custom Metric Agent (virtual)(collector_host@port)(SuperDomain)
         - Enterprise manager
            - Internal | Query
               - Data Points Retrieved From Disk Per Interval    
               - Data Points Returned Per Interval
 
- If "Data Points Retrieved From Disk Per Interval" regularly exceeds 100000:
Add the hidden property introscope.enterprisemanager.query.datapointlimit=100000 to the IntroscopeEnterpriseManager.properties

This property defines the limit for Enterprise Manager query data point retrieval: the maximum number of metric data points that a Collector or standalone Enterprise Manager can retrieve from SmartStor for a particular query.
It limits the impact of a query on disk I/O; the default value is 0 (unlimited).
 
- If "Data Points Returned Per Interval" regularly exceeds 100000:
Add the hidden property introscope.enterprisemanager.query.returneddatapointlimit=100000 to the IntroscopeEnterpriseManager.properties

This property defines the limit for Enterprise Manager query data point return: the maximum number of metric data points that a Collector or standalone Enterprise Manager can return for a particular query.
It limits the impact of a query on memory; the default value is 0 (unlimited).


7. You must provide a dedicated disk I/O path for SmartStor

Make sure the SmartStor database points to a dedicated disk controller and that introscope.enterprisemanager.smartstor.dedicatedcontroller=true, which allows the EM to fully utilize it. Failing to do this can reduce Collector metric capacity by up to 50 percent.

From https://docops.ca.com/ca-apm/10-7/en/ca-apm-sizing-and-performance/system-information-and-requirements/ca-apm-data-storage-requirements
 
"The dedicated controller property is set to false by default. You must provide a dedicated disk I/O path for SmartStor to set this property to true; it cannot be set to true when there is only a single disk for each Collector. When the dedicated controller property is set to false, the metric capacity can decrease up to 50 percent."

In a SAN storage environment, each SmartStor should map to a unique logical unit number (LUN) that represents a dedicated physical disk. With this configuration only, it is safe to set introscope.enterprisemanager.smartstor.dedicatedcontroller=true.

From https://docops.ca.com/ca-apm/10-7/en/files/415269782/465939451/2/1524053302595/CA+APM+Reference+Architecture+2018+0411+6.pdf, page 28:

"Specific RAID volumes or LUN should be created with dedicated disks/spindles for the CA software to avoid disk I/O contention from other applications which may be sharing the same RAID or storage array. The greater the volume of disks/spindles allocated to the RAID volume or LUN will provide greater IO distribution and will maximize read/write times for the processes."
 
If you are using a virtual environment, refer to:
https://docops.ca.com/ca-apm/10-7/en/ca-apm-sizing-and-performance/capacity-planning-and-server-deployment-options#CapacityPlanningandServerDeploymentOptions-VMWareRequirementsandRecommendations


8. Huge SmartStor metadata

If you have increased the default EM clamps in EM_HOME/config/apm-events-thresholds-config.xml to support your current load (for example, introscope.enterprisemanager.metrics.historical.limit and introscope.enterprisemanager.agent.connection.limit), keep in mind that this change will have an impact on EM performance. As a best practice, always:

a) Use the SmartStor tools periodically to remove any unwanted metrics
https://comm.support.ca.com/kb/how-do-i-cleanup-smartstor-data/kb000056953
https://docops.ca.com/ca-apm/10-7/en/administrating/configure-enterprise-manager/configure-and-manage-smartstor-data/#ConfigureandManageSmartStorData-UsetheSmartStorCommand-LineTools
 
b) By default, data is stored for one year; however, you can adjust this retention, for example by reducing introscope.enterprisemanager.smartstor.tier3.age. Changing tier3 will help for some time; reducing the value gradually also helps prevent performance issues (high SmartStor Duration)
https://docops.ca.com/ca-apm/10-7/en/administrating/configure-enterprise-manager/configure-and-manage-smartstor-data
 
c) Split the load by adding an additional Collector
 
d) If you are using a pre-10.5 release, consider upgrading to take advantage of the new SmartStor metadata model. The metadata handling has been enhanced to solve the following main issues:
- The EM consumes a large amount of memory when there are too many inactive metrics.
- The EM consumes too much CPU when there are too many inactive metrics.
- There is high I/O from rewriting the metrics.metadata file.
It also minimizes or eliminates the need for a cleanup process.

The new metadata model can handle more than 30 million historical metrics (vs. the old 1.2 million) with no impact on CPU, memory, or I/O.
 

9. Too many incoming traces: 
A large number of transaction traces on your system will impact EM/Collector performance.
 
Recommendation:
Check EM_HOME/logs/perflog.txt (a quick way to locate the relevant field is sketched at the end of this point); if "Performance.Transactions.Num.Traces" is higher than 1 million or increasing rapidly, try to reduce or limit the number of traces:
 
a) Reduce data retention by half (default value is 14 days)
Open the EM_HOME/config/IntroscopeEnterpriseManager.properties, set introscope.enterprisemanager.transactionevents.storage.max.data.age=7
 
b) Clamp the transaction traces sent by agents to the EM Collectors
Open the EM_HOME/config/apm-events-thresholds-config.xml
Reduce introscope.enterprisemanager.agent.trace.limit, for example from 1000 to 50.
This clamp limits the number of transaction events per agent that the Enterprise Manager processes per interval.
 
c) Increase the Differential Analysis auto-tracing trigger threshold
Open the EM_HOME/config/apm-events-thresholds-config.xml
Add introscope.enterprisemanager.baseline.tracetrigger.variance.intensity.trigger=30 (default 15)
This is a hot property. It reduces the traces auto-generated by Differential Analysis by forcing it to wait for more spikes before triggering evidence or transaction traces to capture diagnostics about the variance.
 
The above suggestions will help you reduce the number of traces; however, you should investigate why, and from which agent, the high volume of traces is coming.
Open the Metric Browser,  expand the branch: Custom Metric Agent | Agents | <host> | <process> | <agentname>:Transaction Tracing Events Per Interval
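A quick way to locate the trace counter for the perflog check in the first step above (the column layout is explained in the perflog KB linked under "Additional Information"):

grep -n "Performance.Transactions.Num.Traces" EM_HOME/logs/perflog.txt | head -1

Then inspect that column in the most recent rows of the file.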


10. Unable to connect to WebView/Workstation, or connectivity is very slow

The problem could be due to Outgoing Delivery threads getting stuck on NIO writes; as a best practice, disable Java NIO.

Open the EM_HOME/config/IntroscopeEnterpriseManager.properties, add the below hidden property:

transport.enable.nio=false

You need to restart the Enterprise Manager(s)
Apply this change in all the EMs (MOM and collectors)

NOTE: Disabling NIO switches back to the classic socket operations and the polling architecture; there is no loss of functionality.
The main difference between Java IO and Java NIO is that IO is stream-oriented with no caching, while NIO is buffer-oriented and uses caching to read data, which gives it additional flexibility. That flexibility brings overheads of its own, such as verification before data processing and the risk of buffer overwrites. Once the data is read, it makes no difference how you handle it, so using IO instead of NIO should make no difference beyond these known JVM-side issues.
 
How to verify this condition?
On the MOM, take 2 to 3 thread dumps spaced 15 seconds apart by running: kill -3 <MOM-PID> (a loop sketch follows below)
Verify the condition as indicated in below KB:
https://comm.support.ca.com/kb/_MOM-Performance-issues-Agent-disconnections-unable-to-login-to-the-workstation/KB000045655
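A small loop for taking the dumps (substitute the real MOM process id; the JVM prints each dump to its console output on SIGQUIT, typically captured in the EM console/nohup log):

for i in 1 2 3; do kill -3 <MOM-PID>; sleep 15; done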


11. Check if the cluster is unbalanced
 
If you see a discrepancy in metric counts across the Collectors (for example, some Collectors have 200K metrics and others 20K),
keep in mind that the MOM load-balancing mechanism only reacts when a Collector is overloaded, not underloaded.
 
Suggestions:
 
a) Reduce introscope.enterprisemanager.loadbalancing.threshold from the default 20000 to 10000 so Collector load is more even across the cluster.
There is no need to restart the EM as this is a hot property, but you have to wait up to 10 minutes for it to take effect (introscope.enterprisemanager.loadbalancing.interval=600)
 
b) Update the loadbalancing.xml to explicitly allocate the agents to the appropriate Collectors (see the illustrative snippet at the end of this point).
 
c) You can disable the default load-balancing mechanism during MOM startup only, to prevent agents from moving across all the Collectors, which causes unnecessary duplication of metrics in every Collector's metadata.
See https://comm.support.ca.com/kb/tip-for-loadbalancing-configuration-when-upgrading-an-enterprise-manager-cluster/kb000031358

- Stop the cluster (MOM and collectors)

- Open the MOM_HOME/config/IntroscopeEnterpriseManager.properties
- Add the below hidden property: introscope.enterprisemanager.loadbalancing.staywithhistoricalcollector=always

This forces the agents to connect to the appropriate Collectors. Once all of your agents are reconnected, you must change the property value to notoverloaded to restore the default load-balancing mechanism.
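An illustrative loadbalancing.xml entry for suggestion b) (the name, regular expression, host, and port are placeholders; the agent specifier is a regular expression matching the full "host|process|agent" name):

<agent-collector name="payments-agents-to-collector1">
    <agent-specifier>.*\|.*\|PaymentsAgent.*</agent-specifier>
    <include>
        <collector host="collector1.example.com" port="5001"/>
    </include>
</agent-collector>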


12. Check if EM metric clamps have been reached.

If a metric clamp has been reached, subsequent incoming metrics will be ignored and lost.
 
To check the EM clamps: open the Metric Browser and expand the branch
 
 Custom Metric Host (virtual)
   - Custom Metric Process (virtual)
      - Custom Metric Agent (virtual)(collector_host@port)(SuperDomain)
         - Enterprise manager
            - Connections
 
Look at the values for:
 
  - "EM Historical Metric Clamped"
  - "EM Live Metric Clamped"
 
The above metrics should all be 0.
 
To check the Agent clamp: expand the branch
 
 Custom Metric Host (virtual)
   - Custom Metric Process (virtual)
      - Custom Metric Agent (virtual)(collector_host@port)(SuperDomain)
         - Agents
            - Host
               - Process
                   - <AgentName>
 
Look at the value of the "is Clamped" metric; it should be 0.

Recommendation:
-Open the EM_HOME/config/apm-events-thresholds-config.xml
-Increase the below clamps as needed:

introscope.enterprisemanager.metrics.historical.limit
introscope.enterprisemanager.metrics.live.limit

These are hot properties; there is no need to restart the EM

See https://docops.ca.com/ca-apm/10-7/en/using/apm-metrics/cluster-supportability-metrics#ClusterSupportabilityMetrics-CollectorMetrics


13. Clear the OSGi cache:

- Stop the Introscope Enterprise Manager
- Go to the <EM_HOME>/product/enterprisemanager/configuration folder and delete all folders except the config.ini file
- Go to the <EM_HOME>/product/webview/configuration folder and delete all folders except the config.ini file
- Go to <EM_HOME>/work and remove all files
- Start the Introscope Enterprise Manager

You need to apply this recommendation in all the Introscope EMs (MOM and collectors)
https://comm.support.ca.com/kb/how-to-clear-the-introscope-osgi-cache/KB000100299
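A minimal sketch of the cleanup on UNIX, run from EM_HOME with the EM stopped (the find commands keep config.ini and delete everything else in each configuration folder):

cd product/enterprisemanager/configuration
find . -mindepth 1 -maxdepth 1 ! -name config.ini -exec rm -rf {} +
cd ../../webview/configuration
find . -mindepth 1 -maxdepth 1 ! -name config.ini -exec rm -rf {} +
cd ../../..
rm -rf work/*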
 

14. Start the MOM and Collectors in the correct order

It is recommended to start the Collectors before the MOM to avoid overloading the MOM; however, there is also a risk of overloading the Collectors that started first if the Agents were not restarted. As a best practice, start all the Collectors at the same time and then start the MOM.

 

C) What diagnostic files should I gather for CA Support?
 
Gather the following information from all the Introscope Enterprise Manager instances (MOM and Collectors) and open a support case:
 
1. EM_HOME/logs
2. EM_HOME/config
3. EM_HOME/install/*.log
4. Hardware specs of the servers and a general overview of the implementation indicating where the collectors and MOM are
5. If the EM hangs, collect a series of 10 thread dumps at 10-second intervals while the problem occurs, by running:
kill -3 <EM-PID>
6. If the EM crashed because of an OutOfMemory condition, add the following JVM arguments to automatically create a heap dump the next time the problem occurs:
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<your_target_directory>
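For example, appended to the existing options in Introscope_Enterprise_Manager.lax (the dump path is illustrative; make sure the target directory exists and has enough free space for a full heap dump):

lax.nl.java.option.additional=... -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/apm/dumps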
 
Additional Information:
Docops:
https://docops.ca.com/ca-apm/10-7/en/ca-apm-sizing-and-performance/enterprise-manager-and-cluster-sizing
https://docops.ca.com/ca-apm/10-7/en/troubleshooting/enterprise-manager-and-collectors-troubleshooting
https://docops.ca.com/ca-apm/10-7/en/troubleshooting/sizing-and-performance-troubleshooting

White Paper:
https://docops.ca.com/ca-apm/10-7/en/files/415269782/465939451/2/1524053302595/CA+APM+Reference+Architecture+2018+0411+6.pdf

Troubleshooting:
https://comm.support.ca.com/kb/_What-do-the-fields-in-Perflogtxt-mean/KB000031758