Introscope Enterprise Manager Troubleshooting and Best Practices

Document ID : KB000093176
Last Modified Date : 03/09/2018
Introduction:
The following is a high-level list of techniques and suggestions to employ when troubleshooting common Introscope EM performance and configuration issues.

A) Common Messages
B) Checklist and Best Practices
C) What diagnostic files should I gather for CA Support?
Environment:
Any 10.x
Instructions:
A) Common Messages

Below is a list of the common ERROR, WARN, and exception messages that affect Introscope EM or cluster performance:

1. EM Clamps being reached:

Configuration file = EM_HOME/config/apm-events-thresholds-config.xml
All of the properties below are hot properties, so there is no need to restart the EM.
Make sure to apply the change on all EMs (MOM and collectors).

[WARN] [Harvest Engine Pooled Worker] [Manager.Agent]  [The EM has too many historical metrics reporting from Agents and will stop accepting new metrics from Agents.  Current count

Recommendation:
increase introscope.enterprisemanager.metrics.historical.limit

[INFO] [PO:client_main Mailman 1] [Manager] Collector <collector-name>@<port> reported Clamp hit for MaxAgentConnections limit.

Recommendation:
increase introscope.enterprisemanager.agent.connection.limit

[Manager.Agent] The EM has too many live metrics reporting from Agents  and will stop accepting new metrics from Agents.[ Current count = ..

Recommendation:
increase introscope.enterprisemanager.metrics.live.limit

[WARN] [PO:WatchedAgentPO Mailman 1] [Manager.Agent] The Agent <your.agent> is exceeding the per-agent metric clamp (current=5000, max=5000). New metrics will not be accepted
 
Recommendation: increase introscope.enterprisemanager.agent.metrics.limit
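
All of the clamps above are defined in EM_HOME/config/apm-events-thresholds-config.xml. As an illustration only (the id shown is real, but the value and description are examples and the exact file layout can vary slightly by release), a clamp entry looks roughly like this:

<clamp id="introscope.enterprisemanager.agent.metrics.limit">
   <description>Maximum number of metrics accepted per agent</description>
   <threshold value="10000"/>
</clamp>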
 

2. Tune Client Message Queues

[WARN] [Dispatcher 1] [Manager] Timed out adding to outgoing message queue. Limit of <#> reached.
 
Recommendation:
- Open config/IntroscopeEnterpriseManager.properties on the EM/collectors
- Set transport.outgoingMessageQueueSize=8000
If it is already 8000, increase the value in increments of 2000, adjusting as required.

- Add the below 2 properties in all the EMs (MOM and collectors)
transport.override.isengard.high.concurrency.pool.max.size=10
transport.override.isengard.high.concurrency.pool.min.size=10

For the MOM, the thread pool should have roughly one thread per Collector, plus one thread for every 10 Workstation connections, plus one for every 25 WebView user connections (25 WebView users are roughly equivalent to a single Workstation connection).
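
For example (illustrative numbers only): a MOM with 10 Collectors, 20 Workstation connections, and 50 WebView users would need roughly 10 + 2 + 2 = 14 threads:

transport.override.isengard.high.concurrency.pool.max.size=14
transport.override.isengard.high.concurrency.pool.min.size=14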

A restart of the EMs is required for the changes to take effect.

NOTE: Increasing the outgoing message queue gives you a bigger buffer. Increasing the thread pool size gives you more worker threads to send outgoing messages. These adjustments are important when sending messages, usually between Collectors and the MOM, becomes a performance bottleneck.


3. Operating system issues:

java.io.IOException: Too many open files
 
Recommendation: (UNIX only)
Make sure the maximum number of open files is at least 4096 on both the MOM and Collectors. You can check the current setting with "ulimit -a".
You can increase the current setting, for example:

ulimit -n 16384
or
ulimit -n unlimited
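
To make the limit persistent across sessions on most Linux distributions, you can also add entries to /etc/security/limits.conf for the user that runs the EM (the user name below is only an example):

apmuser  soft  nofile  16384
apmuser  hard  nofile  16384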
 

java.io.IOException: No space left on device
 
Recommendation:
Increase disk space as soon as possible to prevent database corruption (SmartStor, traces, heuristics).
 

4. Traces database:

[ERROR] [Lucene Merge Thread #1] [Manager] Uncaught Exception in Enterprise Manager:  In thread Lucene Merge Thread #1 and the message is org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
 
Recommendation:
This message indicates that the traces database is corrupted.
Stop the Introscope EM, delete the /traces folder, and start the EM again.

If you are unsure of the location, open config/IntroscopeEnterpriseManager.properties and check the introscope.enterprisemanager.transactionevents.storage.dir property.
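
A minimal UNIX sketch of the steps above, assuming the default control script and traces location (adjust paths to your installation):

EM_HOME/bin/EMCtrl.sh stop
rm -rf EM_HOME/traces
EM_HOME/bin/EMCtrl.sh start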
 

5. Cluster capacity or configuration issue:

[Manager] Outgoing message queue is not moving
[Manager] Outgoing message queue is moving slowly
[WARN] [master clock] [Manager] EM load exceeds hardware capacity. Timeslice data is being aggregated into longer periods.

 
Recommendations:
Find out whether the messages are due to a recent change made in the MOM and Collectors to accommodate additional load:

a) Check whether the number of agents and/or live metrics has increased.
You can determine this by reviewing EM_HOME/logs/perflog.txt (the Performance.Agent.NumberOfAgents and Performance.Agent.NumberOfMetrics columns) on the MOM and Collectors.
https://comm.support.ca.com/kb/_What-do-the-fields-in-Perflogtxt-mean/KB000031758

b) Check whether any of the below clamps in EM_HOME/config/apm-events-thresholds-config.xml have recently been increased:
introscope.enterprisemanager.agent.metrics.limit
introscope.enterprisemanager.agent.connection.limit 
introscope.enterprisemanager.metrics.historical.limit
 
If that is the case, restore the previous values. These are hot properties, so there is no need to restart the Introscope EM.

c) Check each of the "Best practices" covered in the next section
 

6. java.lang.OutOfMemoryError: GC overhead limit exceeded
 
Recommendation:
Open Introscope_Enterprise_Manager.lax on the EM/collectors.
Increase the EM heap size by 2 GB.
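
For example (illustrative values), if the EM currently runs with an 8 GB heap, the updated .lax line could look like this, keeping your other JVM arguments unchanged:

lax.nl.java.option.additional=-Xms10240m -Xmx10240m <existing arguments unchanged>

If the EM runs as a Windows service, adjust wrapper.java.initmemory and wrapper.java.maxmemory in EM_HOME/bin/EMService.conf instead.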
 

7. [Manager.TransactionTracer] The Enterprise Manager cannot keep up with incoming event data
 
Recommendation:
Reduce the incoming trace rate; see point 10 in the "Checklist and Best Practices" section below.
 

8. [WARN] [master clock] [Manager.AppMap.Alert-Mapping] Processing of alerts is overloaded, ignoring <xxx> new alerts!
 
The above message indicates that a large number of alerts are automatically created and propagated to the AppMap.
Symptoms: overhead in memory, GC, harvest duration, and disk space due to the extra alert state changes.
 
Recommendation:
 
a) Try reducing the UVB metric clamp introscope.apmserver.uvb.clamp from the default of 50000 to a very small number such as 5; this will reduce metric handling for Differential Analysis alerts.
 
b) Reduce or stop the action notifications by:
- Excluding the specific frontend applications
- Excluding the specific frontend applications and then creating a separate differential control element specifically for that frontend (which allows fine-tuning of notifications).
- A simple way to reduce the number of notifications is to add actions to the danger list only.
  

9. [Manager] TransactionTrace arrival buffer full, discarding trace(s)
 
By default, Team Center uses transaction trace data as the source for the Team Center map. This causes the MOM to retrieve transaction trace data from agents, which can occasionally slow down the Collectors.
 
Recommendation:
Open EM_HOME/config/IntroscopeEnterpriseManager.properties.
Locate introscope.enterprisemanager.transactiontrace.arrivalbuffer.capacity=
The default is 2500; you can try increasing the value to 5000. The impact will be on memory, so you might also need to increase the EM heap size.
 

10. [Manager.Cluster] Collector clock is too far skewed from MOM. Collector clock is skewed from MOM clock by XXXX ms. The maximum allowed skew is 3,000 ms. Please change the system clock on the collector EM.
 
Recommendation:
Collector system clocks must be within 3 seconds of the MOM clock setting. Ensure the MOM and Collectors synchronize their system clocks with a time server such as an NTP server; otherwise the EMs will disconnect.
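
For example, on Linux you can verify time synchronization with the following commands (availability depends on the distribution and the time service in use):

timedatectl status
ntpq -p             (if ntpd is used)
chronyc tracking    (if chronyd is used)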
 

11. [WARN] [Dispatcher 1] [Manager.IsengardObjectInputStream] Internal cache is corrupt. Cannot determine class type for Object 1. A prior class deserialization error may have corrupted the cache. 

This is due to a capacity or configuration issue; review section B) Checklist and Best Practices below.


12. Baseline database

[ERROR] [pool-1-thread-2] [Manager] failure to add data to baseline
java.lang.ClassCastException

 
Recommendation:
This message indicates that the baseline database is corrupted.
Stop the Introscope EM, delete the /data/variance.db file, and start the EM again.

If you are unsure of the location, open IntroscopeEnterpriseManager.properties and check the introscope.enterprisemanager.baseline.database property.
 
 

B) Checklist and Best Practices

Below is a list of common configuration problems that affect Introscope EM or cluster performance:

1. Heap size memory:
 
Make sure to set the initial heap size (-Xms) equal to the maximum heap size (-Xmx). Because no heap expansion or contraction occurs, this can result in significant performance gains in some situations.

- On UNIX, update EM_HOME/Introscope_Enterprise_Manager.lax:
lax.nl.java.option.additional

- On Windows, update EM_HOME/bin/EMService.conf:
wrapper.java.initmemory=
wrapper.java.maxmemory=
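
For example, with an illustrative 8 GB heap (the wrapper values are specified in MB):

- EM_HOME/Introscope_Enterprise_Manager.lax:
lax.nl.java.option.additional=-Xms8192m -Xmx8192m <other arguments unchanged>

- EM_HOME/bin/EMService.conf:
wrapper.java.initmemory=8192
wrapper.java.maxmemory=8192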


2. Management Modules:
 
Management Modules should only be deployed on the MOM. Make sure the Collectors do not start with any Management Modules, to prevent unnecessary extra load.

- Stop the collector(s) only
- Rename EM_HOME/config/modules as modules_collector_backup
- Create an empty "modules" directory
- Start the collector
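
A UNIX sketch of the same steps, assuming the default config location:

cd EM_HOME/config
mv modules modules_collector_backup
mkdir modules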
 

3. If you are using JVM 1.8, ensure that G1 GC is in use.
 
- Open the EM_HOME/bin/EMService.conf (windows) or EM_HOME/Introscope_Enterprise_Manager.lax (other platforms)
- Remove the following java arguments if they are present : -XX:+UseConcMarkSweepGC and -XX:+UseParNewGC
- Add : -XX:+UseG1GC and -XX:MaxGCPauseMillis=200

For example:
- EM_HOME/Introscope_Enterprise_Manager.lax:
lax.nl.java.option.additional=-Xms20480m -Xmx20480m -Djava.awt.headless=true -Dmail.mime.charset=UTF-8 -Dorg.owasp.esapi.resources=./config/esapi -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xss512k

- bin/EMService.conf:
wrapper.java.additional.7=-XX:+UseG1GC
wrapper.java.additional.8=-XX:MaxGCPauseMillis=200


4. Correctly configure the EM to run as a nohup process (UNIX only)

When running the EM on UNIX, you also need to perform this step manually.
Open EM_HOME/Introscope_Enterprise_Manager.lax, locate the lax.stdin.redirect property, and update it as below:

Replace
lax.stdin.redirect=console

with:
lax.stdin.redirect=
 
see https://docops.ca.com/ca-apm/10-7/en/administrating/configure-enterprise-manager/start-or-stop-the-enterprise-manager/#StartorStoptheEnterpriseManager-RuntheEnterpriseManagerinnohupModeonUNIX
 
"Note: Only run the Enterprise Manager in nohup mode after you configure it. Otherwise, the Enterprise Manager might not start, or can start and consume excessive system resources."
 

5. Disable DEBUG logging 
 
By default logging is set to INFO in the IntroscopeEnterpriseManager.properties:
log4j.logger.Manager=INFO,console,logfile
 
If you have enabled DEBUG logging at some point, ensure it is disabled to prevent any impact on disk I/O.
 

6. OutOfMemory and slowness due to "heavy historical queries"
 
As a best practice, add the below two hidden properties to EM_HOME/config/IntroscopeEnterpriseManager.properties on all EMs:
introscope.enterprisemanager.query.datapointlimit=100000
introscope.enterprisemanager.query.returneddatapointlimit=100000

You can verify this condition by opening the Metric Browser and expanding the branch:
 Custom Metric Host (virtual)
   - Custom Metric Process (virtual)
      - Custom Metric Agent (virtual)(collector_host@port)(SuperDomain)
         - Enterprise manager
            - Internal | Query
               - Data Points Retrieved From Disk Per Interval    
               - Data Points Returned Per Interval
 
- If Data Points Retrieved From Disk Per Interval > 100000:
Add the hidden property introscope.enterprisemanager.query.datapointlimit=100000 to IntroscopeEnterpriseManager.properties.

This property defines the limit for Enterprise Manager query data point retrieval. It defines the maximum number of metric data points that a Collector or standalone Enterprise Manager can retrieve from SmartStor for a particular query.
This property limits the impact of a query on disk I/O. The default value for this property is 0 (unlimited).

- If Data Points Returned Per Interval > 100000:
Add the hidden property introscope.enterprisemanager.query.returneddatapointlimit=100000 to IntroscopeEnterpriseManager.properties.

This property defines the limit for Enterprise Manager query data point return. It defines the maximum number of metric data points that a Collector or standalone Enterprise Manager can return for a particular query.
This property limits the impact of a query on memory. The default value for this property is 0 (unlimited).

7. Apply latest HOTFIXES 

10.7 hotfixes - https://comm.support.ca.com/kb/apm-10-7-hotfixes/KB000105898

8. You must provide a dedicated disk I/O path for SmartStor

Make sure the SmartStor database points to a dedicated disk controller and that introscope.enterprisemanager.smartstor.dedicatedcontroller=true, which allows the EM to fully utilize this setting. Failing to do this can reduce Collector performance by up to 50%.
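
An illustrative property combination, assuming /smartstor is a mount point backed by its own physical disk or dedicated LUN (the path is an example only):

introscope.enterprisemanager.smartstor.directory=/smartstor
introscope.enterprisemanager.smartstor.dedicatedcontroller=true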

From https://docops.ca.com/ca-apm/10-7/en/ca-apm-sizing-and-performance/system-information-and-requirements/ca-apm-data-storage-requirements
 
"The dedicated controller property is set to false by default. You must provide a dedicated disk I/O path for SmartStor to set this property to true; it cannot be set to true when there is only a single disk for each Collector. When the dedicated controller property is set to false, the metric capacity can decrease up to 50 percent."

In a SAN storage environment, each SmartStor should map to a unique logical unit number (LUN) that represents a dedicated physical disk. With this configuration only, it is safe to set introscope.enterprisemanager.smartstor.dedicatedcontroller=true.

From https://docops.ca.com/ca-apm/10-7/en/files/415269782/465939451/2/1524053302595/CA+APM+Reference+Architecture+2018+0411+6.pdf, page 28:

"Specific RAID volumes or LUN should be created with dedicated disks/spindles for the CA software to avoid disk I/O contention from other applications which may be sharing the same RAID or storage array. The greater the volume of disks/spindles allocated to the RAID volume or LUN will provide greater IO distribution and will maximize read/write times for the processes."
 
If you are using a virtual environment, refer to:
https://docops.ca.com/ca-apm/10-7/en/ca-apm-sizing-and-performance/capacity-planning-and-server-deployment-options#CapacityPlanningandServerDeploymentOptions-VMWareRequirementsandRecommendations


9. Huge Smartstor metadata

If you have increased the default EM clamps in EM_HOME/config/apm-events-thresholds-config.xml to support your current load (for example, increased introscope.enterprisemanager.metrics.historical.limit and introscope.enterprisemanager.agent.connection.limit),
keep in mind that this change will have an impact on EM performance. As a best practice, always:

a) Use the SmartStor tools to periodically remove any unwanted metrics:
https://comm.support.ca.com/kb/how-do-i-cleanup-smartstor-data/kb000056953
https://docops.ca.com/ca-apm/10-7/en/administrating/configure-enterprise-manager/configure-and-manage-smartstor-data/#ConfigureandManageSmartStorData-UsetheSmartStorCommand-LineTools
 
b) By default, data is stored for 1 year; however, you can adjust this retention, for example by reducing introscope.enterprisemanager.smartstor.tier3.age (see the example after this list). Changing tier 3 will help for some time, and reducing the value gradually helps prevent performance issues (high SmartStor Duration).
https://docops.ca.com/ca-apm/10-7/en/administrating/configure-enterprise-manager/configure-and-manage-smartstor-data
 
c) Split the load by adding an additional Collector.
 
d) If you are using a pre-10.5 release, consider upgrading to take advantage of the new SmartStor metadata model. The metadata handling has been enhanced to solve the following main issues:
- The EM consumes a large amount of memory when there are too many inactive metrics.
- The EM consumes too much CPU when there are too many inactive metrics.
- There is high I/O from rewriting the metrics.metadata file.
- The need for a cleanup process is minimized or eliminated.

The new metadata model can handle more than 30 million historical metrics versus the old 1.2 million, with no impact on CPU, memory, or I/O.
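
As an illustration of point b) above, reducing tier 3 retention to roughly six months (the value is in days; verify your current tier settings before changing them):

introscope.enterprisemanager.smartstor.tier3.age=180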
 

10. Too many incoming traces: 
A large number of transaction traces on your system will impact EM/Collector performance.
 
Recommendation:
Check EM_HOME/logs/perflog.txt. If "Performance.Transactions.Num.Traces" is higher than 1 million or increasing rapidly, try to reduce or limit the number of traces:
 
a) Reduce data retention by half (default value is 14 days)
Open the EM_HOME/config/IntroscopeEnterpriseManager.properties, set introscope.enterprisemanager.transactionevents.storage.max.data.age=7
 
b) Clamp the transaction Traces sent by agent to EM Collectors
Open the EM_HOME/config/apm-events-thresholds-config.xml
Reduce introscope.enterprisemanager.agent.trace.limit from 1000 to 50.
This clamp limits the number of transaction events per agent that the Enterprise Manager processes per interval.
 
c) Increase the Differential Analysis auto tracing triggering threshold
Open the EM_HOME/config/apm-events-thresholds-config.xml
 
The above suggestions will help you reduce the number of traces; however, you should also investigate which agent is causing the high volume of traces and why.
Open the Metric Browser and expand the branch: Custom Metric Agent | Agents | <host> | <process> | <agentname>:Transaction Tracing Events Per Interval


11. Unable to connect to Webview/Workstation or connectivity is very slow

The problem could be due to Outgoing Delivery threads getting stuck on NIO writing; as a best practice, disable Java NIO.

Open the EM_HOME/config/IntroscopeEnterpriseManager.properties, add the below hidden property:

transport.enable.nio=false

You need to restart the Enterprise Manager(s)
Apply this change in all the EMs (MOM and collectors)

NOTE: Disabling NIO switches back to the previous classic socket operations (the polling architecture); there is no loss of functionality.
The main difference between Java IO and Java NIO is that IO is stream oriented with no caching, while NIO is buffer oriented, uses caching to read data, and has additional flexibility due to the buffering. Apart from that flexibility, buffering adds the overhead of verification before data processing and the risk of overwriting. Once the data is read, it makes no difference what you do with it or how you handle it, so using IO instead of NIO should not make any difference beyond these known JVM-side issues.
 
How to verify this condition?
Go to the MOM, take 2 to 3 thread dumps spaced 15 seconds apart by running: kill -3 <MOM-PID>
Verify the condition as indicated in below KB:
https://comm.support.ca.com/kb/_MOM-Performance-issues-Agent-disconnections-unable-to-login-to-the-workstation/KB000045655


12. Check if the cluster is unbalanced
 
If you see a discrepancy in metrics across the Collectors (for example, some Collectors have 200K metrics and others 20K),
keep in mind that the MOM load balancing mechanism only reacts when a Collector is overloaded, not when it is underloaded.
 
Suggestions:
 
a) Reduce introscope.enterprisemanager.loadbalancing.threshold from 20000 to 10000 so the Collector load is more even across the cluster.
There is no need to restart the EM as this is a hot property, but you have to wait up to 10 minutes for it to take effect (introscope.enterprisemanager.loadbalancing.interval=600).
 
b) Update the loadbalancing.xml to explicitly allocate the agents to the appropriate Collectors (see the example at the end of this section).
 
c) You can try disabling the default load balancing mechanism during MOM startup only, to prevent agents from moving across all the Collectors, which causes unnecessary duplication of metrics in every Collector's metadata.
See https://comm.support.ca.com/kb/tip-for-loadbalancing-configuration-when-upgrading-an-enterprise-manager-cluster/kb000031358

- Stop the cluster (MOM and collectors)

- Open the MOM_HOME/config/IntroscopeEnterpriseManager.properties
- Add the below hidden property: introscope.enterprisemanager.loadbalancing.staywithhistoricalcollector=always

This will force the agents to connect to the appropriate Collectors. Once all of your agents are reconnected, you must change the property value back to notoverloaded to restore the default load balancing mechanism.
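
As referenced in point b) above, a minimal illustrative loadbalancing.xml entry that pins agents matching a regular expression to a specific Collector (the rule name, regex, host, and port are examples only):

<agent-collector name="Pin WebSphere agents to collector01">
   <agent-specifier>.*\|WebSphere\|.*</agent-specifier>
   <include>
      <collector host="collector01.example.com" port="5001"/>
   </include>
</agent-collector>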


13. Check if EM metric clamps have been reached.

If a metric clamp has been reached, new incoming metrics will be ignored and lost.
 
To check the EM clamps, open the Metric Browser and expand the branch:
 
 Custom Metric Host (virtual)
   - Custom Metric Process (virtual)
      - Custom Metric Agent (virtual)(collector_host@port)(SuperDomain)
         - Enterprise manager
            - Connections
 
Look at the values for:
 
  - "EM Historical Metric Clamped"
  - "EM Live Metric Clamped"
 
The above metrics should all be 0.
 
To check the agent clamp, expand the branch:
 
 Custom Metric Host (virtual)
   - Custom Metric Process (virtual)
      - Custom Metric Agent (virtual)(collector_host@port)(SuperDomain)
         - Agents
            - Host
               - Process
                   - <AgentName>
 
Look at the value of the "is Clamped" metric; it should be 0.

Recommendation:
-Open the EM_HOME/config/apm-events-thresholds-config.xml
-Increase the below clamps as needed:

introscope.enterprisemanager.metrics.historical.limit
introscope.enterprisemanager.metrics.live.limit

These are hot properties; there is no need to restart the EM.

See https://docops.ca.com/ca-apm/10-7/en/using/apm-metrics/cluster-supportability-metrics#ClusterSupportabilityMetrics-CollectorMetrics


14. Clear the OSGI cache:

- Stop the Introscope Enterprise Manager
- Go to the <EM_HOME>/product/enterprisemanager/configuration folder and delete all folders except config.ini
- Go to the <EM_HOME>/product/webview/configuration folder and delete all folders except config.ini
- Go to <EM_HOME>/work and remove all files
- Start the Introscope Enterprise Manager

You need to apply this recommendation in all the Introscope EMs (MOM and collectors)
https://comm.support.ca.com/kb/how-to-clear-the-introscope-osgi-cache/KB000100299
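
A UNIX sketch of steps 2 to 4 above (double-check each directory before deleting; the same pattern applies to the webview configuration folder):

cd <EM_HOME>/product/enterprisemanager/configuration
find . -mindepth 1 -maxdepth 1 -type d -exec rm -rf {} +
rm -rf <EM_HOME>/work/*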
 

15. Load balancing optimizations: the MOM hangs due to a large number of agents connecting at the same time, multiple connections and disconnections, or multiple clamps being reached during startup or rebalance.

Suggestions:

a) Start the MOM and Collectors in the correct order. It is always recommended to start the Collectors before the MOM to avoid overloading the MOM, but there is also a risk of overloading the Collectors that were started first if the agents were not restarted. As a best practice, start all Collectors at the same time and then the MOM.

b) Configure the MOM to start up with introscope.enterprisemanager.loadbalancing.staywithhistoricalcollector=always. This will force the agents to connect to the appropriate Collectors; once all of your agents are reconnected, you can change the property value back to notoverloaded to restore the default load balancing behavior. See https://comm.support.ca.com/kb/Tip-for-loadbalancing-configuration-when-upgrading-an-Enterprise-Manager-cluster/KB000031358

c) Whenever possible, manually update the loadbalancing.xml to explicitly allocate the agents to the appropriate Collectors. This solves the issue of agents moving across all the Collectors, which causes unnecessary duplication of metrics in every Collector's metadata.

d) If required, you can configure the load balancing rebalance to take place every 30 minutes in the MOM properties (by default it is set to 10 minutes: introscope.enterprisemanager.loadbalancing.interval=600), as shown in the example below.
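
For example (point d above), to rebalance every 30 minutes instead of every 10, set the interval in seconds in the MOM's IntroscopeEnterpriseManager.properties:

introscope.enterprisemanager.loadbalancing.interval=1800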

 

C) What to collect if the problem persists?
 
Gather the following information from all Introscope Enterprise Manager instances (MOM and Collectors) and open a support case:
 
1. EM_HOME/logs
2. EM_HOME/config
3. EM_HOME/install/*.log
4. Hardware specs of the servers and a general overview of the implementation indicating where the collectors and MOM are
5. If the EM hangs, collect a series of 10 thread dumps at 10-second intervals when the problem occurs (see the loop example after this list) by running:
kill -3 <EM-PID>
6. If the EM crashed because of an OutOfMemory situation, add the following JVM arguments to automatically create a heap dump the next time the problem occurs:
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<your_target_directory>
7. If the EM is running on UNIX, the output of:
lsof -p <EM-pid> | wc -l
df -h
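
For point 5 above, an illustrative UNIX loop that takes 10 thread dumps spaced 10 seconds apart (replace <EM-PID> with the actual process id):

for i in $(seq 1 10); do kill -3 <EM-PID>; sleep 10; done
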
Additional Information:
Docops:
https://docops.ca.com/ca-apm/10-7/en/administrating/properties-files-reference
https://docops.ca.com/ca-apm/10-7/en/ca-apm-sizing-and-performance/enterprise-manager-and-cluster-sizing
https://docops.ca.com/ca-apm/10-7/en/troubleshooting/enterprise-manager-and-collectors-troubleshooting
https://docops.ca.com/ca-apm/10-7/en/troubleshooting/sizing-and-performance-troubleshooting

White Paper:
https://docops.ca.com/ca-apm/10-7/en/files/415269782/465939451/2/1524053302595/CA+APM+Reference+Architecture+2018+0411+6.pdf