The EM collects Metrics about itself and reports them to a file called perflog.txt. This file rolls over to another file called perflog.txt.previous when the file size limit configured in the IntroscopeEnterpriseManager.properties file is reached.
Some fields differ between the MOM and Collector perflogs.
All performance Metrics are for 15-second time periods.
Regardless of the format of the perflog.txt file, it includes fields and the Metric values for those fields. The fields are in the following order in the file, and their definitions are as follows:
The total memory available to the EM. This value can grow if the JVM heap grows. If initial heap (-Xms) and max heap (-Xmx) are equal, this number should not change much over time since the maximum heap will be allocated immediately at startup instead of being acquired by the JVM as needed.
The total free memory available to the EM. If this number gets too low, this may indicate a problem with running out of memory/heap, and may require increasing the Xmx JVM parameter. Look for occurrences of free memory dropping to a 2-digit number or less on the Collector. If you see this, then increase the heap size available to the JVM. Add memory to the server if necessary. However, if you already have sufficient JVM memory allocated, then proceed by further investigating the rest of the columns. It is unusual to see this problem on a MOM, but it can happen.
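The free-memory check described above can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and it assumes the free-memory values have already been read from the perflog into a list of integers.

```python
def low_free_memory_intervals(free_memory_samples, max_digits=2):
    """Return the indexes of samples whose value has max_digits digits or fewer.

    Assumption: samples are non-negative integers in the unit the perflog uses.
    """
    limit = 10 ** max_digits  # 100 for a two-digit threshold
    return [i for i, value in enumerate(free_memory_samples) if value < limit]


# Hypothetical samples, one per 15-second interval.
samples = [512, 340, 95, 120, 8]
print(low_free_memory_intervals(samples))  # [2, 4]
```

Any index returned here marks an interval worth comparing against the other columns before deciding to raise the heap size.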
An estimate of the percentage of time the JVM has spent on garbage collection (GC) since the last instance. If this number gets too high (what counts as too high depends on the settings and load), the GC settings may be at fault and the GC JVM startup flags may need tuning, because the JVM could be spending too much time on garbage collection.
The number of Workstations currently connected to the EM.
The EM goes through a "harvest" cycle every 15 seconds, during which it aggregates the 15-second interval metrics in preparation for writing them to the SmartStor database. All data available from the Agents is distributed to consumers in the EM at this time. This metric records how long the harvest cycle takes, in milliseconds. It is generally a good indicator of whether the EM is keeping up with the current workload. A value over 3000 probably indicates an overloaded EM. See https://communities.ca.com/docs/DOC-231167867 for next steps.
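To put the 3000 ms guideline in context, the following sketch summarises a series of harvest durations and expresses the worst one as a share of the 15-second cycle. The function and key names are hypothetical, and the durations are assumed to be already extracted from the perflog as integers in milliseconds.

```python
INTERVAL_MS = 15_000  # one 15-second harvest cycle
OVERLOAD_MS = 3_000   # guideline quoted in the text


def harvest_report(durations_ms):
    """Summarise harvest durations against the 3000 ms guideline."""
    if not durations_ms:
        raise ValueError("at least one harvest duration is required")
    worst = max(durations_ms)
    return {
        "max_ms": worst,
        "share_of_interval": worst / INTERVAL_MS,
        "intervals_over_limit": sum(1 for d in durations_ms if d > OVERLOAD_MS),
    }


# Hypothetical durations for four consecutive intervals.
report = harvest_report([420, 610, 3500, 900])
print(report["max_ms"], report["intervals_over_limit"])  # 3500 1
```

A harvest of 3000 ms already uses one fifth of the 15-second cycle, which gives a feel for why values above it are treated as a warning sign.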
SmartStor writes all Metric data every 15 seconds (after the harvest). This Metric tracks how long it takes SmartStor to write data. A value over 3000 is probably an indicator of an overloaded EM. Inconsistent values indicate contention for disk-related resources. Consistently high values indicate inadequate disk-write bandwidth for the metric load being handled. CA recommends using a separate disk on a dedicated controller to store SmartStor data. Check the location of the SmartStor /data directory to ensure it is not on the same disk as the Enterprise Manager itself, and check IntroscopeEnterpriseManager.properties to verify that
when Smartstor data is on a separate, dedicated disk.
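One way to tell "inconsistent" SmartStor write durations from "consistently high" ones is to look at the spread of the samples as well as their level. The sketch below is an illustration only: the function name, the labels, and the `spread_ratio` cut-off of 0.5 are assumptions, not values taken from the product.

```python
from statistics import mean, pstdev


def classify_write_durations(durations_ms, high_ms=3000, spread_ratio=0.5):
    """Label a series of SmartStor write durations (milliseconds).

    "consistently high": every sample exceeds high_ms.
    "inconsistent": standard deviation is large relative to the mean
    (spread_ratio is an assumed cut-off).
    """
    if not durations_ms:
        raise ValueError("at least one duration is required")
    if min(durations_ms) > high_ms:
        return "consistently high"
    avg = mean(durations_ms)
    spread = pstdev(durations_ms) / avg if avg else 0.0
    if spread > spread_ratio:
        return "inconsistent"
    return "normal"


print(classify_write_durations([3200, 3400, 3100]))      # consistently high
print(classify_write_durations([200, 2500, 300, 2800]))  # inconsistent
print(classify_write_durations([400, 420, 410]))         # normal
```

Following the text, the first label points toward insufficient disk-write bandwidth and the second toward contention for disk-related resources.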
The number of Agent throttling tickets. A ticket is handed out to each new Agent that connects. When an Agent gets a ticket, it sends new Metrics for registration. When that is finished, the ticket is returned for use by another Agent. There is one ticket per CPU available to the EM process. If this value stays at zero for long periods of time (other than at EM startup), there may be a problem with Agent throttling and the EM should be restarted. This metric can drop to zero (0) when an EM with many Agents restarts, because all of those Agents rush to connect to the EM at the same time. Agents without a ticket must wait for one to become available, so it can take some time before users see all the metrics appear in the Workstation.
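To judge whether the ticket count has stayed at zero "for long periods", it helps to measure the longest run of consecutive zero samples. The sketch below is a minimal illustration; the function name is hypothetical and the ticket counts are assumed to be already parsed, one per 15-second interval.

```python
INTERVAL_SECONDS = 15  # each perflog sample covers 15 seconds


def longest_zero_run(ticket_counts):
    """Return the length of the longest run of consecutive zero samples."""
    longest = current = 0
    for count in ticket_counts:
        current = current + 1 if count == 0 else 0
        longest = max(longest, current)
    return longest


# Hypothetical samples from an EM with four CPUs.
samples = [4, 0, 0, 0, 2, 0, 4]
run = longest_zero_run(samples)
print(run, run * INTERVAL_SECONDS)  # 3 45
```

What counts as a long run is a judgment call; a run just after EM startup is expected and can be ignored.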
The number of Metric data values queued in memory and waiting to be processed by the EM. If this value grows above two times the total number of Metrics, the EM is not keeping up with incoming data.
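The "two times the total number of Metrics" rule is a simple ratio test. A minimal sketch follows, with hypothetical function names and made-up numbers:

```python
def queue_backlog_ratio(queued_values, total_metrics):
    """Return queued data values as a multiple of the total number of Metrics."""
    if total_metrics <= 0:
        raise ValueError("total_metrics must be positive")
    return queued_values / total_metrics


def is_falling_behind(queued_values, total_metrics, factor=2):
    """True when the queue exceeds `factor` times the total number of Metrics."""
    return queue_backlog_ratio(queued_values, total_metrics) > factor


print(is_falling_behind(450_000, 200_000))  # True  (ratio 2.25)
print(is_falling_behind(100_000, 200_000))  # False (ratio 0.5)
```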
The number of Agents currently reporting data to the EM. By default, there is one Agent per Domain even when no Agents are connected. The default maximum is 400. To change it, set introscope.enterprisemanager.agent.connection.limit in the EM properties file (pre-9.1), or edit <EM home>/config/apm-events-thresholds-config.xml (9.1 and above).
By default, all Agents should be configured to point to the MOM. The MOM assigns Agents to Collectors automatically and enforces load balancing across the cluster at 15-minute intervals.
If you have Agents configured to talk to specific Collectors instead of the MOM, this could negatively affect load balancing.
If you have manually configured loadbalancing.xml to force some Agents to specific Collectors, this could negatively affect load balancing.
If you have chosen to update loadbalancing.xml manually (to specify TIM Collectors, for instance), be sure to copy the file to the MOM and to every Collector in the cluster. This file must be identical across the cluster.
If you have TIMs reporting to your cluster, you must designate one or more Collectors as TIM Collectors. TIM Collectors should have no Agents reporting to them.
A cluster consists of one MOM and a maximum of 10 Collectors of all types, including TIM Collectors. Adding more than 10 Collectors to a cluster can negatively impact the performance of the MOM.
The total number of metrics reporting into the EM. Refer to the APM Performance and Sizing Guide for maximum supported metrics.
The rate of incoming data. When the system is running properly, the value should fluctuate around Performance.Agent.NumberOfMetrics, and the two fields should always be equal or very close to each other. If it drops below Performance.Agent.NumberOfMetrics for up to 60 seconds, the EM is not keeping up with the flow of incoming data. If Performance.Agent.MetricDataRate is much higher than Performance.Agent.NumberOfMetrics, it is a clear indication that the EM cannot cope with the number of metrics coming in. At this point, Performance.Harvest.HarvestDuration should be very high as well.
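The comparison between the data rate and the number of metrics can be sketched as a ratio with a tolerance band. The function name, the labels, and the 10% tolerance are assumptions chosen for illustration; the text itself only says the two values should be "equal or very close".

```python
def compare_rate_to_metrics(metric_data_rate, number_of_metrics, tolerance=0.10):
    """Compare Performance.Agent.MetricDataRate with Performance.Agent.NumberOfMetrics.

    Returns "close", "below" or "above". The tolerance is an assumed value.
    """
    if number_of_metrics <= 0:
        raise ValueError("number_of_metrics must be positive")
    ratio = metric_data_rate / number_of_metrics
    if ratio < 1 - tolerance:
        return "below"
    if ratio > 1 + tolerance:
        return "above"
    return "close"


print(compare_rate_to_metrics(100_000, 100_000))  # close
print(compare_rate_to_metrics(80_000, 100_000))   # below
print(compare_rate_to_metrics(150_000, 100_000))  # above
```

Either "below" or "above" is a cue to look at Performance.Harvest.HarvestDuration for the same intervals.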
The number of Metric Groups defined for all Management Modules in the EM. This gives an indication of how long new Metric registration will take, since all new Metrics are evaluated against every Metric Group in the system. Only the MOM should have Management Modules in the <EM Home>/config folder. Collectors should not have any Management Modules as they take up resources. If you have too many Metric Groups (in the thousands), consider reviewing them to see if any are not needed because they take up resources.
The Enterprise Manager attempts to insert all incoming events into a Transaction Trace insert queue. This column displays the average number of events in the queue during the previous time slice. The EM may choose to throw away a trace if the EM trace queue is at capacity.
The Number of Traces in the Insert Queue indicates whether the Enterprise Manager is keeping up with Transaction Trace processing. If the Transaction Trace insert queue is full when a new event comes in, the event is dropped. You can view the Transactions:Number of Dropped Per Interval metric to see the number of Transaction Traces that the Enterprise Manager could not handle during the interval and were dropped.
The number of traces persisted to disk. This metric is not the number of traces persisted during a 15-second period; it is the total number of stored traces as of that 15-second period. The number normally keeps increasing until a trace purging operation is performed.
The maximum allowed for any one Collector is 500,000. If the number of traces coming in exceeds this number, then consider disabling socket, file, and network I/O traces on all agents to reduce the metric load. To find out which tracer types are reporting the most traces, it is recommended to disable each type one at a time, then examine the perflog again for improvement.
To disable traces, check to see which PBL file you are using in [AGENT_HOME]/wily/core/config/IntroscopeAgent.profile by checking the directives property:
Here, we are using websphere-typical.pbl.
Checking in websphere-typical.pbl, we see that toggles-typical.pbd is called.
Edit toggles-typical.pbd and comment out the TurnOn directives for socket, file, and network I/O traces as shown:
# Network Configuration
# NOTE: Only one of SocketTracing and ManagedSocketTracing should be 'on'. ManagedSocketTracing is provided to
# enable pre 9.0 socket tracing.
# File System Configuration
# TurnOn: FileSystemTracing
# NIO Socket Tracer Group
A restart of the monitored application will be required.
See TEC1498147 for more details.
The Enterprise Manager attempts to insert all incoming events into a Transaction Trace Insert Queue. The Number of Traces in the Insert Queue supportability metric displays the average number of events in the queue during the previous time slice.
The amount of memory that is used in the Enterprise Manager waiting for outbound historical data to be sent over the network.
The number of Metric data queries.
The Collector metrics received per interval metric is the sum of Collector metric data points that the MOM received during each 15-second time period. The data points come from these sources:
- Metric subscriptions on behalf of Management Modules (for example, dashboards, calculators, alerts)
- Queries that clients generate (for example, Workstation and CLW queries)
- Queries for metrics that built-in alerts and calculators generate (for example, alerts and calculators that support the application triage map)
This is an indicator of the cluster query load and of the network bandwidth consumed by communication between the Collectors and the MOM. Some variation is expected. Large spikes indicate heavy spontaneous query activity. The Collector Metrics Received Per Interval value approximates the number of metrics that calculators are processing.
Performance.MOM.NumberofCollectorMetrics (MOM only)
The Number of Collector Metrics metric shows the total number of metrics currently being tracked in a cluster. This metric is the sum of the values of the Enterprise Manager | Connections | Number of Metrics supportability metric across all the Collectors in the cluster.
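Since this value is described as a sum across Collectors, it can be cross-checked against the individual Collector values. A minimal sketch follows, with hypothetical Collector names and made-up counts:

```python
def number_of_collector_metrics(metrics_per_collector):
    """Sum the per-Collector "Number of Metrics" values for the cluster."""
    return sum(metrics_per_collector.values())


# Hypothetical "Enterprise Manager | Connections | Number of Metrics" readings.
per_collector = {
    "collector01": 120_000,
    "collector02": 95_500,
    "collector03": 143_250,
}
print(number_of_collector_metrics(per_collector))  # 358750
```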
This field represents the count of internal metric groups configured for Triage Map Contributing alerts. Each triage map contributing alert creates an internal metric group called MapEntityMetricGroup, so this value also gives the number of triage map contributing alerts defined in the system.
A supportability metric that tracks the time the Persistent Writer takes to write in a cycle. It was added to monitor the performance of the Heuristics subsystem:
Data Store|Heuristics:Total Perst Insertion Duration Per Interval
The aggregate time, in ms, it takes the system to record transaction component information, send it to the Enterprise Manager, and store it in the APM database, during the last harvest period. This metric applies only to Collectors.
The aggregate time, in ms, it takes the system to record transaction segment information, send it to the Enterprise Manager, and store it in the database, during the last harvest period. This metric applies only to Collectors.
Queries Per Interval
This metric shows the number of queries for metric data that were received during the previous time slice. The balance of metric writes versus metric queries determines your SmartStor disk configuration requirements.
This metric shows the average query duration during the previous time slice.
The aggregate number of components recorded by the agent and sent to the Collector in the last harvest period. This metric implies the proportional amount of time devoted to APM database update or insert during the last harvest period.
The aggregate number of segments recorded by the agent and sent to the Collector in the last harvest period. This metric implies the proportional amount of time devoted to APM database update or insert during the last harvest period. However, it does not add time to the harvest period itself, but only indicates the load on the APM database and the system as a whole.