This article looks at common Enterprise Manager (EM) performance problems and how to troubleshoot them.
Memory Problems
Memory problems typically manifest as a poorly performing EM or sometimes as OutOfMemory (OOM) errors. Check the following for memory-related problems:
Load Within Sizing Guidelines
From the perflog, check that the following values are within the sizing guidelines for your environment. For example, the EM cannot handle more than 500k live metrics or 400 agents.
Also check the following:
The total number of applications reporting to the EM does not exceed 1500 (on average, about five applications per agent).
There are less than 15k metrics per agent.
There are fewer than 500 event inserts (including traces, errors, and What's Interesting events from the Application Overview) into the traces database per timeslice.
There are no more than 20 connected Workstations.
There is a maximum of 5000 metrics from aggregate agents.
Too much memory (JVM heap space) can also be a problem. On a 32-bit JVM, do not use more than 1.5 GB of heap. When allocating the total amount of memory on the machine, remember to leave enough room for the operating system.
In the EM lax file, check whether the GC flags are set too high or too low. Too many flags, added in an effort to micromanage memory, can actually lead to memory problems. On the other hand, check whether you are missing flags that are beneficial on a multiprocessor system, such as UseConcMarkSweepGC and UseParNewGC.
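As an illustration, a conservative heap and GC setup in the EM lax file might look like the fragment below. The `lax.nl.java.option.additional` property name and the specific values are typical for LAX-based launchers but should be verified against your own lax file and EM version before making changes:

```
# JVM options passed to the EM at startup (verify the property name in your lax file).
# On a multiprocessor machine the concurrent collectors below are generally worth enabling.
lax.nl.java.option.additional=-Xms1024m -Xmx1024m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -verbose:gc
```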
At a steady load within the sizing guideline limits, a FreeMemory value that keeps decreasing may indicate a memory leak. You can also check SmartStor data for indications of memory leaks. If GC does not reclaim space, this indicates a memory leak, and CA APM Support can assist you in analyzing it.
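As a rough way to automate the FreeMemory check, the sketch below parses a perflog and applies a simple monotonic-decrease heuristic. It assumes the perflog is comma-separated with a header row containing a FreeMemory column; both the delimiter and the column name are assumptions to verify against your EM version.

```python
# Sketch: detect a steadily decreasing FreeMemory column in a perflog.
# Assumption: the perflog is comma-separated with a "FreeMemory" header column.
import csv

def free_memory_trend(perflog_path):
    """Return the list of FreeMemory samples from the perflog."""
    with open(perflog_path, newline="") as f:
        reader = csv.DictReader(f)
        return [float(row["FreeMemory"]) for row in reader]

def looks_like_leak(samples, window=10):
    """Heuristic: a leak is suspected if every sample in the most recent
    window is strictly lower than the one before it (monotonic decrease)."""
    recent = samples[-window:]
    return len(recent) >= 2 and all(b < a for a, b in zip(recent, recent[1:]))
```

A monotonic decrease over a steady load is only a hint, not proof of a leak; confirm with GC logs or SmartStor data before escalating.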
CPU Problems
1. A good indicator of an EM running on a machine with high CPU usage is high values for HarvestDuration, as seen in the perflog.
2. EM | Connections | Metrics Queued will start to increase if harvest is slow.
3. A high percentage of time spent in GC can also lead to high CPU usage; this value can be seen in the third column of the perflog. High GC time can be the result of a metric load that is high relative to the memory allocated, a spike in load that creates demand for additional memory, or poorly configured GC flags.
4. One way to determine CPU usage of the EM is to create a summing calculator using the CPU usage of all the EM threads as seen under the supportability node.
5. If you have the customer's SmartStor data, clicking the top level of the Custom Metric Agent | Enterprise Manager | Internal | Threads node gives a tabular view of all the EM threads, where you can see the CPU usage of each thread and determine which thread is busiest.
6. The Windows perfmon can be used to determine the box's CPU load and on *nix systems, commands like "top" can be used.
7. The application overview grid is quite CPU intensive and can be turned off, if not needed, by setting the property introscope.enterprisemanager.applicationoverview=false.
8. CPU-related problems may also be due to poor network performance, resulting in the EM having to deal with accumulated metrics from previous timeslices. Look at the values of MetricDataPending (which should ideally be 0; a consistently large value that is a significant percentage of the metric load can indicate a problem) and MetricDataRate (which should be around the same value as the total metric count) for possible clues.
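The MetricDataPending check above can be expressed as a small heuristic. The 10% threshold below is an illustrative assumption, not product guidance:

```python
# Sketch: flag a metric backlog from perflog-style MetricDataPending samples.
# MetricDataPending should ideally be 0; a sustained value that is a large
# fraction of the total metric load suggests the EM is falling behind.

def backlog_suspected(pending_samples, total_metrics, fraction=0.10):
    """True if every recent sample of MetricDataPending exceeds the
    given fraction of the total live metric count."""
    if not pending_samples:
        return False
    return all(p > fraction * total_metrics for p in pending_samples)
```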
Disk and SmartStor Problems
1. Top-of-the-hour problems are typically due to SmartStor spooling problems, most likely caused by a lack of disk file cache or insufficient physical memory left for the OS after the JVM has been allocated its share. We recommend at least 3 GB of physical memory for the machine, preferably 4 GB. We also recommend at least a 2-CPU machine for the EM, as uniprocessor machines perform very poorly.
2. The total number of metrics that have metadata entries, including historical metrics (not just live metrics), is another factor in the EM's general performance and in SmartStor performance. This can be found in the "Enterprise Manager | Data Store | SmartStor | Metadata | Metrics with Data" metric. Ideally this should not exceed 500k. The metadata can be cleaned to remove unneeded metrics as follows:
a) Print out the current metric list using:
java -Xms512m -Xmx512m -cp "IntroscopeServices.jar;EnterpriseManager.jar" com.wily.introscope.server.enterprise.entity.fsdb.MetadataFile metrics.metadata -dump | sort > metrics.metadata.dump.sort
b) Next, use SmartStorTools for tasks such as:
i) Pruning (or removing dead metrics):
D:\Introscope72\lib>java -Xms1024m -Xmx1024m -cp SmartStorTools.jar;EnterpriseManager.jar;IntroscopeServices.jar;IntroscopeClient.jar Prune -src d:\Introscope72\data -backup d:\Introscope72\data\metrics.metadata.bkup
ii) Removing select unwanted metrics like say socket metrics:
D:\Introscope72\lib>java -Xms1024m -Xmx1024m -cp SmartStorTools.jar;EnterpriseManager.jar;IntroscopeServices.jar;IntroscopeClient.jar RemoveMetrics -src d:\Introscope72\data -dest c:\metrics_removed -metrics .*Sockets.*
3. SmartStor duration should typically be under 7.5 seconds, as seen in the perflog; anything above that indicates a poorly performing disk.
4. The Windows perfmon can be used to get a measure of the disk subsystem's performance while on *nix systems, iostat can be used.
5. Smartstor and persistent collections should not be combined.
Network Problems
1. One sign of a poorly performing network is if either the MetricDataRate in the perflog or the "Enterprise Manager | Connections | Number of Metrics Handled" value seen in the Investigator is consistently less than the total metric count.
2. Poor network performance can lead to an overloaded EM and poor EM performance. In a cluster, the ping time on the MOM is a clue to whether there are poor network times between the MOM and the Collectors; high ping times could also be due to overloaded Collectors that are unable to respond to the ping request.
3. Network problems can also be due to improper auto-negotiate settings on the EM's NIC. Ideally, it should be set to 1000 Mbps, full duplex.
Overloaded Enterprise Manager
The log messages below are symptoms of an overloaded Enterprise Manager.
[WARN] [Manager.Clock] Timeslice processing delayed due to system activity. Combining data from timeslices x to y
[WARN] [Manager.TransactionTracer] The Enterprise Manager cannot keep up with incoming event data. Some of the incoming events will be dropped.
[WARN] [Manager.SmartStor] Cannot keep up with data persistence - dropping data from timeslice x:y:z
[VERBOSE] [Manager] com.wily.introscope.spec.server.beans.baseliningengine.BaseliningException: Time series data received out of order. (Please note that the logging level of this message will be changed to DEBUG in the future)
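To spot these symptoms mechanically, a small sketch can scan an EM log for the messages above, matching on the stable phrases from each warning (the log file path and exact wording should be verified against your EM version):

```python
# Sketch: count overload symptoms in an EM log file.
# The patterns mirror the warning messages listed above.
OVERLOAD_PATTERNS = [
    "Timeslice processing delayed due to system activity",
    "cannot keep up with incoming event data",
    "Cannot keep up with data persistence",
    "Time series data received out of order",
]

def count_overload_symptoms(log_lines):
    """Return a dict mapping each known pattern to its occurrence count."""
    counts = {p: 0 for p in OVERLOAD_PATTERNS}
    for line in log_lines:
        for pattern in OVERLOAD_PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts
```

A steadily growing count of any of these messages, correlated with the perflog checks above, is a strong sign the EM load exceeds its sizing.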