Introscope Enterprise Manager Troubleshooting

Document ID : KB000093176
Last Modified Date : 27/04/2018
Introduction:
Introscope Enterprise Manager troubleshooting: checklist and common issues
Environment:
Any 10.x
Instructions:
Common Messages

Below is a list of common ERROR, WARN, and exception messages that affect Introscope EM or cluster performance:

1. [WARN] [Harvest Engine Pooled Worker] [Manager.Agent]  [The EM has too many historical metrics reporting from Agents and will stop accepting new metrics from Agents.  Current count
 
Recommendation:
Open the EM|collectors/config/apm-events-thresholds-config.xml
Increase the introscope.enterprisemanager.metrics.historical.limit clamp
This is a hot property; there is no need to restart the EM
 
2. [WARN] [PO:WatchedAgentPO Mailman 1] [Manager.Agent] The Agent <your.agent> is exceeding the per-agent metric clamp (current=5000, max=5000). New metrics will not be accepted
 
Recommendation:
If the problem is related to missing metrics in the Investigator and the agent is listed in the WARN message above, then:
Open the EM|collectors/config/apm-events-thresholds-config.xml
Increase the introscope.enterprisemanager.agent.metrics.limit
This is a hot property; there is no need to restart the EM
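For reference, both clamps in messages 1 and 2 are edited as <clamp> entries in apm-events-thresholds-config.xml. A minimal sketch, assuming the standard 10.x layout of that file (the threshold values below are purely illustrative, not recommendations; check the defaults shipped with your release):

<clamps>
   <clamp id="introscope.enterprisemanager.metrics.historical.limit">
      <threshold value="2400000"/>  <!-- EM-wide historical metric clamp -->
   </clamp>
   <clamp id="introscope.enterprisemanager.agent.metrics.limit">
      <threshold value="10000"/>  <!-- per-agent metric clamp -->
   </clamp>
</clamps>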
 
3. [WARN] [Dispatcher 1] [Manager] Timed out adding to outgoing message queue. Limit of 3000 reached.
 
Recommendation:
Open the EM|collectors/config/IntroscopeEnterpriseManager.properties
Increase transport.outgoingMessageQueueSize=8000 (if already set to 8000, increase the value by 2000)
 
set:
transport.override.isengard.high.concurrency.pool.min.size=10
transport.override.isengard.high.concurrency.pool.max.size=10
 
A restart of the EMs is required for the changes to take effect.
Increasing the outgoing message queue gives you a bigger buffer. Increasing the thread pool size gives you more worker threads to send outgoing messages. These important adjustments are required when sending messages, usually between the Collectors and the MOM, becomes a performance bottleneck.
 
4. java.io.IOException: Too many open files
 
Recommendation: (UNIX only)
Make sure the maximum open file handle limit is at least 4096 on both the MOM and the Collectors. You can check the current limit by running "ulimit -n" as the user who starts the EM processes. You might need to increase the maximum number of open files allowed for that user.
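A minimal sketch of checking and raising the limit on Linux, assuming PAM limits are in effect and "emuser" is a hypothetical account that starts the EM:

su - emuser -c "ulimit -n"     (check the current soft limit)

Then add the following to /etc/security/limits.conf and log in again:
emuser  soft  nofile  4096
emuser  hard  nofile  4096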
 
5. java.io.IOException: No space left on device
 
Recommendation:
Increase disk space as soon as possible
 
6. [ERROR] [Lucene Merge Thread #1] [Manager] Uncaught Exception in Enterprise Manager:  In thread Lucene Merge Thread #1 and the message is org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
 
Recommendation: 
This message indicates that the trace database is corrupted
Stop the Introscope EM, delete the traces folder (EM_HOME/traces by default), and start the EM again
 
7. 
"[Manager] Outgoing message queue is not moving"
"[Manager] Outgoing message queue is moving slowly"
"[WARN] [master clock] [Manager] EM load exceeds hardware capacity. Timeslice data is being aggregated into longer periods."
 
Recommendation:
This is due to a capacity or configuration issue; review the "Checklist" section below
 
8. java.lang.OutOfMemoryError: GC overhead limit exceeded
 
Recommendation:
Open the EM|collectors/Introscope_Enterprise_Manager.lax
Increase the memory heap size by 2 GB
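For example, if the file currently starts the EM with a 4 GB heap, a sketch of the edited line (illustrative sizes; keep your existing GC and other options on the same line):

lax.nl.java.option.additional=-Xms6g -Xmx6g <existing options unchanged>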
 
9. [Manager.TransactionTracer] The Enterprise Manager cannot keep up with incoming event data
 
Recommendation:
Reduce the incoming trace rate; see the "Checklist" section below, point 10
 
10. [WARN] [master clock] [Manager.AppMap.Alert-Mapping] Processing of alerts is overloaded, ignoring <xxx> new alerts!
 
The above message indicates that a large number of alerts are automatically created and propagated to the AppMap
Symptoms: overhead in memory, GC, harvest duration, and disk space due to the extra alert state changes
 
Recommendation:
 
a) Try reducing the uvb metric clamp introscope.apmserver.uvb.clamp from the default of 50000 to a very small number such as 5; this will reduce the metric handling for Differential Analysis alerts.
 
b) Reduce or stop the action notifications by:
- Excluding the specific frontend applications
- Excluding the specific frontend applications and then creating a separate differential control element specifically for that frontend (which allows fine-tuning of notifications).
- Adding actions to the danger list only (a simple way to reduce the number of notifications).
 
c) Try temporarily disabling the AppMap alert mapping on the MOM (empty the teamcenter-status-mapping.properties file)
 
11. [Manager] TransactionTrace arrival buffer full, discarding trace(s)
 
By default, Team Center uses transaction trace data as the source for the Team Center map. This causes the MOM to retrieve transaction trace data from the Agents, which can occasionally slow down the Collectors
 
Recommendation:
Open the EM|collectors/config/IntroscopeEnterpriseManager.properties
Set introscope.enterprisemanager.transactiontrace.arrivalbuffer.capacity=5000
The default is 2500. The impact of a larger buffer is on memory, so you might need to increase the EM heap size
 
12. [Manager.Cluster] Collector clock is too far skewed from MOM. Collector clock is skewed from MOM clock by XXXX ms. The maximum allowed skew is 3,000 ms. Please change the system clock on the collector EM.
 
Recommendation:
Collector system clocks must be within 3 seconds of the MOM clock. Ensure that the MOM and Collectors synchronize their system clocks with a time server, such as an NTP server; otherwise the EMs will disconnect
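One way to verify synchronization on Linux (timedatectl assumes a systemd host; ntpq assumes classic ntpd; use whichever matches your environment):

timedatectl status   (look for "System clock synchronized: yes")
ntpq -p              (check peer offsets, reported in milliseconds)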
 
13. [WARN] [Dispatcher 1] [Manager.IsengardObjectInputStream] Internal cache is corrupt. Cannot determine class type for Object 1. A prior class deserialization error may have corrupted the cache. 

This is due to a capacity or configuration issue; review the "Checklist" section below

 

Checklist:

 
Below is a list of common configuration problems that affect Introscope EM or cluster performance:

1. Heap memory size:
 
Make sure you have allocated enough memory; look at the EM_HOME/logs/perflog.txt.
Make sure to set the initial heap size (-Xms) equal to the maximum heap size (-Xmx) in Introscope_Enterprise_Manager.lax (UNIX) or EMService.conf (Windows).
Since no heap expansion or contraction occurs, this can result in significant performance gains in some situations.

For more information about the perflog.txt, see https://comm.support.ca.com/kb/_What-do-the-fields-in-Perflogtxt-mean/KB000031758
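On Windows, EMService.conf uses Java Service Wrapper syntax; a sketch assuming the standard wrapper memory properties (values in MB, purely illustrative), with initial and maximum sizes matched:

wrapper.java.initmemory=8192
wrapper.java.maxmemory=8192

On UNIX, set -Xms and -Xmx to the same value on the lax.nl.java.option.additional line, as in the example under message 8 above.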
 
2. Management Modules:
 
Management Modules should only be deployed on the MOM. Make sure the Collectors do not start with any Management Modules, to prevent unnecessary extra load
 
3. Switch Concurrent GC to G1GC
 
If you are using JVM 1.8, the G1 collector is recommended for applications that require large heaps (around 6 GB or larger) and have limited GC latency requirements.
Open the EM_HOME/Introscope_Enterprise_Manager.lax and locate the lax.nl.java.option.additional property
 
Replace:
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC  -XX:CMSInitiatingOccupancyFraction=50
 
With:
-XX:+UseG1GC -XX:MaxGCPauseMillis=200
 
4. Running the EM as a nohup process
 
Recommendation: (UNIX only)
Make sure the "lax.stdin.redirect" property in Introscope_Enterprise_Manager.lax has been unset.
 
From https://docops.ca.com/ca-apm/10-7/en/administrating/configure-enterprise-manager/start-or-stop-the-enterprise-manager/#StartorStoptheEnterpriseManager-RuntheEnterpriseManagerinnohupModeonUNIX
 
"Note: Only run the Enterprise Manager in nohup mode after you configure it. Otherwise, the Enterprise Manager might not start, or can start and consume excessive system resources."
 
5. EM logging:
 
By default logging is set to INFO in the IntroscopeEnterpriseManager.properties:
log4j.logger.Manager=INFO, console
 
If you have enabled DEBUG logging at some point, make sure it is disabled to prevent any impact on disk I/O.
 
6. Check if any APM clamps have been reached:
 
a) Open the Workstation and use the "Status console" to quickly check the health of the cluster and the status of the most common clamps and events.
https://docops.ca.com/ca-apm/10-7/en/using/apm-workstation/triage-with-the-workstation/the-apm-status-console
 
b) From the Investigator:
 
b.1 - Check the EM clamps: expand the branch
 
 Custom Metric Host (virtual)
   - Custom Metric Process (virtual)
      - Custom Metric Agent (virtual)(collector_host@port)(SuperDomain)
         - Enterprise manager
            - Connections
 
Look at the values for:
 
  - "EM Historical Metric Clamped"
  - "EM Live Metric Clamped"
 
The above metrics should all be 0.
 
b.2 - Check the Agent clamp: expand the branch
 
 Custom Metric Host (virtual)
   - Custom Metric Process (virtual)
      - Custom Metric Agent (virtual)(collector_host@port)(SuperDomain)
         - Agents
            - Host
               - Process
                   - <AgentName>
 
Look at the value of the "is Clamped" metric; it should be 0.
 
7. Check if the crash/OOM/slowness is related to heavy historical queries
 
Open the Investigator, expand the branch:
 
 Custom Metric Host (virtual)
   - Custom Metric Process (virtual)
      - Custom Metric Agent (virtual)(collector_host@port)(SuperDomain)
         - Enterprise manager
            - Internal | Query
               - Data Points Retrieved From Disk Per Interval    
               - Data Points Returned Per Interval
 
- If "Data Points Retrieved From Disk Per Interval" regularly exceeds 100K, add the hidden property introscope.enterprisemanager.query.datapointlimit=100000 to the IntroscopeEnterpriseManager.properties
introscope.enterprisemanager.query.datapointlimit defines the limit for Enterprise Manager query data point retrieval: the maximum number of metric data points that a Collector or standalone Enterprise Manager can retrieve from SmartStor for a particular query. This property limits the impact of a query on disk I/O.
The default value for this property is 0 (unlimited)
 
- If "Data Points Returned Per Interval" regularly exceeds 100K, add the hidden property introscope.enterprisemanager.query.returneddatapointlimit=100000 to the IntroscopeEnterpriseManager.properties
introscope.enterprisemanager.query.returneddatapointlimit defines the limit for Enterprise Manager query data point return: the maximum number of metric data points that a Collector or standalone Enterprise Manager can return for a particular query. This property limits the impact of a query on memory.
The default value for this property is 0 (unlimited)
 
NOTE: The following supportability metrics indicate when a clamp has been hit:
- "Queries Exceeding Max Data Points Read From Disk Limit Per Interval" (a value of 1 means introscope.enterprisemanager.query.datapointlimit has been reached)
- "Queries Exceeding Max Data Points Returned Limit Per Interval" (a value of 1 means introscope.enterprisemanager.query.returneddatapointlimit has been reached)
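Both hidden properties go into IntroscopeEnterpriseManager.properties; for example:

introscope.enterprisemanager.query.datapointlimit=100000
introscope.enterprisemanager.query.returneddatapointlimit=100000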
 
8. Make sure the SmartStor db is pointing to a dedicated disk controller and that introscope.enterprisemanager.smartstor.dedicatedcontroller=true, which allows the EM to fully utilize the dedicated disk. Failing to do this can reduce Collector performance by up to 50%
 
https://docops.ca.com/ca-apm/10-7/en/ca-apm-sizing-and-performance/system-information-and-requirements/ca-apm-data-storage-requirements
 
"The dedicated controller property is set to false by default. You must provide a dedicated disk I/O path for SmartStor to set this property to true; it cannot be set to true when there is only a single disk for each Collector.
When the dedicated controller property is set to false, the metric capacity can decrease up to 50 percent."
 
NOTE: If you are using a virtual environment, you might need to use a SAN server before setting the dedicatedcontroller property to true, see:
https://docops.ca.com/ca-apm/10-7/en/ca-apm-sizing-and-performance/capacity-planning-and-server-deployment-options#CapacityPlanningandServerDeploymentOptions-VMWareRequirementsandRecommendations
https://docops.ca.com/ca-apm/10-7/en/ca-apm-sizing-and-performance/system-information-and-requirements/ca-apm-data-storage-requirements#CAAPMDataStorageRequirements-SmartStorRequirements
 
9. Huge SmartStor metadata
 
a) Use the SmartStor tools to periodically remove any unwanted metrics (see:
https://comm.support.ca.com/kb/how-do-i-cleanup-smartstor-data/kb000056953
https://docops.ca.com/ca-apm/10-7/en/administrating/configure-enterprise-manager/configure-and-manage-smartstor-data/#ConfigureandManageSmartStorData-UsetheSmartStorCommand-LineTools )
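A rough sketch of a cleanup run. The remove_metrics subcommand and the -src/-dest flags follow the linked documentation, but the invocation and paths here are placeholders, so verify the exact wrapper script and syntax for your release; the EM must be stopped first:

SmartStorTools remove_metrics -metrics "MyUnwantedMetrics\|.*" -src EM_HOME/data -dest EM_HOME/data.cleaned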
 
b) Reduce introscope.enterprisemanager.smartstor.tier3.age. Changing tier3 only helps for some time, so reduce the tier3 age bit by bit. Some customers tried to reduce the age from 6 months to 1 month in a single step, which resulted in performance issues (high SmartStor Duration); reducing the value gradually helps prevent this type of situation.
 
c) Split the load by adding an additional Collector
 
d) (Recommended) Upgrade to 10.5 or a later release to take advantage of the new SmartStor metadata model. In 10.5, SmartStor metadata was redesigned to solve these main issues:
- The EM consumes a large amount of memory when there are too many inactive metrics.
- The EM consumes too much CPU when there are too many inactive metrics.
- There is high I/O from rewriting the metrics.metadata file.
- The need for a cleanup process is minimized or eliminated.
 
The new metadata model can handle more than 30M historical metrics (versus 1.2M with the old model) with no impact on CPU, memory, or I/O
 
10. Too many incoming traces:
A large number of transaction traces on your system will impact EM and Collector performance
 
Recommendation:
 
Check the Collectors' perflog; if "Performance.Transactions.Num.Traces" is higher than 500K or increasing rapidly, try to reduce or limit the number of traces as follows:
 
a) By default, traces are kept for 14 days; you can try to lower this value by half:
introscope.enterprisemanager.transactionevents.storage.max.data.age=7
 
b) Clamp the transaction traces sent by Agents to the EM Collectors
Open the EM_HOME/config/apm-events-thresholds-config.xml
Reduce introscope.enterprisemanager.agent.trace.limit, for example from 1000 to 50.
This clamp limits the number of transaction events per agent that the Enterprise Manager processes per interval
 
c) Increase the Differential Analysis auto-tracing trigger threshold
Open the EM_HOME/config/apm-events-thresholds-config.xml
Add introscope.enterprisemanager.baseline.tracetrigger.variance.intensity.trigger=30 (default 15)
This is a hot property. It reduces the traces auto-generated by Differential Analysis by forcing it to wait for more spikes before triggering evidence or transaction traces to capture diagnostics about the variance.
 
The above suggestions will help you reduce the number of traces; however, you should also investigate which agent is generating the high volume of traces and why.
Open the Investigator and expand the branch:
Custom Metric Agent | Agents | <host> | <process> | <agentname>:Transaction Tracing Events Per Interval
 
11. If the MOM and Collectors are not located in the same subnet as the Agents, you might experience continuous EM/Agent disconnections.
 
Recommendation:
For high-latency Agent-to-EM connections (for example, transatlantic links), it is recommended to use HTTP tunneling.
 
12. Load balancing causing the EM to hang
This can be due to a large number of Agents connecting at the same time, affecting memory and resources.
 
Recommendation:
a) Update the loadbalancing.xml to explicitly allocate the Agents to the appropriate Collectors (see the sketch below). This solves the issue of Agents moving across all the Collectors, which causes unnecessary duplication of metrics in every Collector's metadata.
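A minimal sketch of an explicit assignment in loadbalancing.xml, assuming the nested host/port element form used by 10.x; the agent-specifier regex, collector host, and port are placeholders for your environment (check the schema comments at the top of your loadbalancing.xml):

<agent-collector name="Example assignment">
   <agent-specifier>.*\|.*\|MyAgent.*</agent-specifier>
   <include>
      <collector>
         <host>collector01.example.com</host>
         <port>5001</port>
      </collector>
   </include>
</agent-collector>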
 
b) Configure the MOM to start up with introscope.enterprisemanager.loadbalancing.staywithhistoricalcollector=always; this forces the Agents to connect to the appropriate Collectors. Once all of your Agents are reconnected, change the property value to notoverloaded to restore the default load balancing behavior, see https://comm.support.ca.com/kb/tip-for-loadbalancing-configuration-when-upgrading-an-enterprise-manager-cluster/kb000031358
 
 
 
What to collect if the problem persists
 
Gather the following information from all the Introscope Enterprise Manager instances (MOM and Collectors) and open a support case:
 
1. EM_HOME/logs
2. EM_HOME/config
3. EM_HOME/install/*.log
4. Hardware specs of the servers and a general overview of the implementation indicating where the collectors and MOM are
5. If pre-10.5: screenshot of the "Custom Metric Host (Virtual) | Custom Metric Process (Virtual) | Custom Metric Agent (Virtual) | Enterprise Manager | Data Store | Smartstor | Metadata | Metrics with Data" supportability metric from all Collectors.
6. If the EM hangs, collect a series of 10 thread dumps from the MOM (and Collectors) at 5-second intervals when the problem occurs. Use "kill -3 <PID>", where <PID> is the process ID of the MOM Java process; the thread dumps will be written to the em.log file.
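For example, a small shell loop that takes the 10 dumps at 5-second intervals (replace <PID> with the EM Java process ID):

for i in 1 2 3 4 5 6 7 8 9 10; do kill -3 <PID>; sleep 5; done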
Additional Information:
https://docops.ca.com/ca-apm/10-7/en/ca-apm-sizing-and-performance/enterprise-manager-and-cluster-sizing
https://docops.ca.com/ca-apm/10-7/en/troubleshooting/enterprise-manager-and-collectors-troubleshooting
https://docops.ca.com/ca-apm/10-7/en/troubleshooting/sizing-and-performance-troubleshooting