Understanding Heuristics baselines in Introscope and APM CE (CEM)

Document ID : KB000019516
Last Modified Date : 14/02/2018

Description:

Heuristic metrics report status and determine the color of the Traffic Lights in the Introscope Application Overview tab. They are calculated, aggregated metrics and do not represent any single metric reported by an Agent. Heuristics are often used to report overall status in a Dashboard and may be used to generate Alerts.

Heuristic metrics are integers which take the following values in Traffic Lights in Application Overviews:

  • 0 = White/Grey - No data is available to calculate a heuristic metric.

  • 1 = Green - The application is functioning within 2 standard deviations of what is expected.

  • 2 = Yellow - Caution. The application is between 2 and 4 standard deviations away from what is expected.

  • 3 = Red - Danger. The application is greater than 4 standard deviations away from what is expected.

Solution:

Heuristics are calculated using the Baselines Database in Introscope. Both Introscope and APM CE (CEM) employ baseline algorithms to monitor applications.

Introscope determines the color of an alert indicator for an application in the Overview tab by evaluating current metrics against the baseline accumulated for those metrics. The Introscope baseline algorithm determines the next expected value for a metric as well as the normally expected deviation from that value. If the actual value deviates from the expected value by more than 2 standard deviations, the Traffic Light turns Yellow, indicating caution. If it deviates by more than 4 standard deviations, the Traffic Light turns Red, indicating danger. Alerts can be set on these heuristic values.
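The thresholds described above can be sketched in a few lines of Python. This is only an illustration of the 2 and 4 deviation rule, not the Enterprise Manager's actual code. The function name, the parameter names, and the handling of a zero deviation are assumptions made for the example.

```python
def heuristic_value(actual, expected, expected_deviation):
    """Map a metric reading to a heuristic value (0-3) using the 2/4 deviation rule."""
    # 0 = White/Grey: no data is available to calculate a heuristic.
    if actual is None or expected is None or expected_deviation is None:
        return 0

    distance = abs(actual - expected)

    # Assumption for this sketch only: with no expected deviation at all,
    # any difference from the expected value is treated as Red.
    if expected_deviation <= 0:
        return 1 if distance == 0 else 3

    deviations = distance / expected_deviation
    if deviations > 4:
        return 3  # Red: more than 4 deviations from expected
    if deviations > 2:
        return 2  # Yellow: more than 2 and up to 4 deviations
    return 1      # Green: within 2 deviations


# Hypothetical lookup from heuristic value to Traffic Light color.
COLORS = {0: "grey", 1: "green", 2: "yellow", 3: "red"}
```

For example, with an expected response time of 100 ms and an expected deviation of 10 ms, a reading of 125 ms is 2.5 deviations away and maps to Yellow.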

The APM Baselines database is a set of data about an application which is used for comparison. Internally, the APM baseliner evaluates the slope of the time series and determines the expected value of the slope. Recent data is given more weight than older data. The Baseliner learns over time, and understands the difference between an application's behavior at 3:00pm on a weekday versus 3:00pm on a Sunday. The Baselines database contains the most common, expected range of values for each metric in your system, and uses these values to determine whether significant abnormalities are occurring.

The power of heuristics lies in their ability to determine expected values for an application given the day of the week, time of day, and other factors, and so to produce a value that is meaningful for your applications. By setting alerts based on heuristics rather than on fixed threshold values that you supply, you let APM determine when an application is in trouble. What is acceptable for your application on a Sunday afternoon may not be acceptable on Thursday at 9:00am. The APM baseliner knows the difference, and over time it accumulates enough data to make accurate determinations about what is and is not expected behavior for your applications.

You have the option to turn off Heuristics

The Application Heuristics section of IntroscopeEnterpriseManager.properties contains the following property:

introscope.enterprisemanager.application.overview.baselines

To improve Enterprise Manager performance, you can turn off heuristics by setting this property to false.
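As a configuration fragment, the setting described above would look like the following. Only the property name and the false value come from this article. The comment lines are added for illustration.

```properties
# Application Heuristics section of IntroscopeEnterpriseManager.properties
# false turns heuristics (and the baseliner) off; true turns them back on.
introscope.enterprisemanager.application.overview.baselines=false
```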

If you do this, however, the Enterprise Manager will turn off its assessment of JVM, application, and backend health. It also turns off the baseliner, which means that the heuristic metrics in the Workstation will be inactive and report no values. The metrics will appear gray until the property is set back to true, at which point they will be active again. Heuristics make limited and short-term demands on the disk subsystem while providing many Alerting benefits, so weigh these benefits before deciding to turn Heuristics off.

Each heuristic has a set of baselines that it uses to determine what normal behavior is for the following Key Performance Indicators:

Frontends and Backends:

  • Response time

  • Errors

  • Stalls

JVM and App Server:

  • CPU (only when the platform monitor is available)

  • Number of pending requests (WebLogic only)

  • Number of available JDBC connections

  • Number of reporting agents (cluster agent only (teaser))

APM baselining is based on Holt's Linear Exponential Smoothing algorithm.

For every timeslice, the baseliner tries to predict what the value and the expected deviation for a metric should be. It then compares the actual value to the prediction.

  • If the actual is more than 2 expected deviations from normal, we call it abnormal.

  • If the actual is more than 4 expected deviations from normal, we call it very abnormal.
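To make the prediction step more concrete, here is a minimal Python sketch of Holt's linear exponential smoothing. It is not the baseliner's implementation. The class name, the smoothing factors (alpha, beta, gamma), and the choice to track the expected deviation as an exponentially smoothed absolute forecast error are all assumptions made for illustration.

```python
class HoltBaseline:
    """Illustrative Holt's linear exponential smoothing with a deviation estimate."""

    def __init__(self, alpha=0.3, beta=0.1, gamma=0.1):
        self.alpha = alpha      # weight of the newest value in the level (assumed value)
        self.beta = beta        # weight of the newest slope in the trend (assumed value)
        self.gamma = gamma      # weight of the newest error in the deviation (assumed value)
        self.level = None       # smoothed value of the metric
        self.trend = 0.0        # smoothed slope per timeslice
        self.deviation = 0.0    # smoothed absolute forecast error

    def predict(self):
        """Return (expected value, expected deviation) for the next timeslice."""
        if self.level is None:
            return None, None   # no data yet
        return self.level + self.trend, self.deviation

    def update(self, actual):
        """Fold one observed timeslice value into the baseline."""
        if self.level is None:
            self.level = float(actual)
            return
        forecast = self.level + self.trend
        error = abs(actual - forecast)
        # Recent errors count more than older ones.
        self.deviation = self.gamma * error + (1 - self.gamma) * self.deviation
        # Standard Holt updates for level and trend.
        new_level = self.alpha * actual + (1 - self.alpha) * forecast
        self.trend = self.beta * (new_level - self.level) + (1 - self.beta) * self.trend
        self.level = new_level


baseline = HoltBaseline()
for value in [120, 118, 125, 122, 119, 121]:
    baseline.update(value)
expected, deviation = baseline.predict()
```

Because each update blends the newest value with the previous estimate, older observations fade out geometrically. This matches the statement that recent data is weighted more heavily than older data.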

In the first minute of a baseliner's life, it will show green, but it is actually learning. Baselines start predicting normal almost immediately, and become smarter over time. Recent data is weighted more heavily than older data. This means that if an application's performance gets better, the baseliner learns the new response times in a reasonable timeframe.

When enough data has been collected (normally a day), the Application Overview compares the expected value and expected deviation in the current 30-minute period to the values in the same 30-minute period on previous days. When more than a week of data has been collected, the Application Overview compares the expected value and expected deviation in the current 30-minute period to the same 30-minute period in previous weeks. Weekdays are compared against weekdays, and weekends are compared against weekends.
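One way to picture the 30-minute comparison is as a lookup key built from the timestamp. The Python sketch below is a simplification under stated assumptions. The function name and the key layout are hypothetical, and the real baseliner may group data more finely, for example by individual day of the week.

```python
from datetime import datetime


def baseline_slot(ts):
    """Return (day_class, slot_index) identifying the 30-minute period of ts."""
    # Monday is 0 and Sunday is 6, so 5 and 6 are the weekend.
    day_class = "weekend" if ts.weekday() >= 5 else "weekday"
    # 48 half-hour slots per day, numbered 0 to 47.
    slot_index = ts.hour * 2 + ts.minute // 30
    return day_class, slot_index
```

Two timestamps that produce the same key would be compared against the same historical baseline. Under this scheme, 3:10pm on a Wednesday and 3:10pm on a Sunday produce different keys.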

Each heuristic outputs its current state (green, yellow, red) as a metric under the Heuristics node in the Frontends section of the Investigator Metric Tree.

APM CE (CEM) calculates a defect specification baseline based on 28 days of historical data for behavioral defects. Once 28 days of data have been collected for defects, you have the option of changing the defect specification by resetting the baseline. For example, the slow time defect specification has a default value of 5 seconds. If you gather actual transaction data and then set the baseline for the slow time defect, it changes from the default of 5 seconds to the suggested defect specification value (for example, 7.2 seconds).