Understanding the Three Rule Condition Types in Live Exceptions

Document ID : KB000030145
Last Modified Date : 14/02/2018

Time Over Threshold

This is the most basic condition type you can set. It is a straightforward value that the polled data point must exceed in order to raise an alarm.

EXAMPLE: To raise an alarm if Bandwidth Utilization exceeds 90%, your rule might look like this:

TOT01.png
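The static comparison described above can be sketched in a few lines of Python. This is only an illustration of the comparison, not how Live Exceptions is implemented. The function name and the sample values are hypothetical, and the sketch assumes "exceed" means strictly greater than.

```python
def over_threshold(polled_values, threshold=90.0):
    """Return the polled values that strictly exceed the static threshold."""
    return [value for value in polled_values if value > threshold]

# Hypothetical Bandwidth Utilization samples, in percent.
samples = [72.0, 88.5, 90.0, 91.2, 95.7]
breaches = over_threshold(samples)
print(breaches)
```

A value of exactly 90.0 is not reported here because the comparison is strict.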

 

Deviation From Mean

When dealing with standard deviations, what you are actually looking at is the percentage of polled data points that fall within a certain range on a bell curve. For instance, if you set a rule condition to be "above the mean by one standard deviation," this means that 84.1% of the data points on the bell curve are below the upper limit of one standard deviation and 15.9% are above it. Live Exceptions will alarm if the polled value falls in that upper 15.9%. The same applies to "below the mean," except that 84.1% of the values are above the lower limit of one standard deviation and 15.9% are below it.

If you only want Live Exceptions to alarm when the polled value falls outside one standard deviation of the mean, it should alarm if the polled value falls into the 15.9% above one standard deviation or the 15.9% below. Live Exceptions should not alarm if the polled value falls into the 68.2% in the middle of the bell curve.

Given this information, there are two sets of possible values to enter into a profile rule, depending on whether you are creating a rule condition that is above or below the mean, or a condition that is outside the mean. These values are as follows:

For above the mean or below the mean: 

1 standard deviation = 84.1 
2 standard deviations = 97.5 
3 standard deviations = 99.87 

For outside the mean: 

1 standard deviation = 68.2 
2 standard deviations = 95.0 
3 standard deviations = 99.74 
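To show where these percentages come from, the sketch below computes them from the standard normal distribution with Python's statistics.NormalDist. The helper names are hypothetical. The exact figures differ slightly from the rounded values listed above: two standard deviations works out to roughly 97.7 and 95.4, while the listed 97.5 and 95.0 correspond to about 1.96 standard deviations. The sketch is an explanation only. The values listed above are the ones this article gives for profile rules.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def one_sided_percent(k):
    """Percent of a normal distribution lying below mean + k standard deviations."""
    return 100.0 * STANDARD_NORMAL.cdf(k)

def two_sided_percent(k):
    """Percent of a normal distribution lying within k standard deviations of the mean."""
    return 100.0 * (STANDARD_NORMAL.cdf(k) - STANDARD_NORMAL.cdf(-k))

# Print the "above/below the mean" and "outside the mean" figures side by side.
for k in (1, 2, 3):
    print(k, round(one_sided_percent(k), 2), round(two_sided_percent(k), 2))
```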

Compare to Baseline: Deviation from Normal calculates the mean and standard deviation from baseline values that are calculated once per day (or dynamically; see below). There are three possible options for selecting a baseline.

  • Entire Baseline: If this is used, the polled data point is compared to the baseline mean and standard deviation calculated for every data point collected over the last X weeks (maximum 6 weeks).
    This is generally most useful for partition utilization comparisons where the utilization of a partition is fairly consistent and changes slowly over time.
  • Same Hour and Day: The polled data point is compared to the baseline mean and standard deviation calculated for the same hour and day of the week for the last X weeks (maximum 6 weeks, minimum 3 weeks). For example, a data point gathered at 1:20am on Monday morning will be compared to the baseline mean and standard deviation calculated for 1:00am to 1:59am for X weeks.
    This is generally the most useful baseline for deviation comparisons on Bandwidth and CPU Utilization, as the expected values normally are very different when looking at business hours versus off-hours.
  • Short-Term Baseline: The polled data point is compared to the baseline mean and standard deviation calculated on a moving window of time between 30 minutes (minimum) and 1440 minutes (maximum; this is 24 hours). Standard deviation is recalculated at each poll, using previous polls as determined by the baseline window. This option is not used often and can be resource intensive.
    Use the Short-Term Baseline when you need to capture spikes.
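
As an illustration of the Same Hour and Day option, the following Python sketch selects past samples that share the polled timestamp's weekday and hour and computes their mean and standard deviation. It is a sketch under stated assumptions, not the eHealth implementation: the (timestamp, value) history format, the function name, and the sample data are all hypothetical, and the sample (n-1) standard deviation is an assumption.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev

def same_hour_and_day_baseline(history, polled_at, weeks=3):
    """Mean and sample standard deviation of past values that share the
    polled timestamp's weekday and hour, looking back `weeks` weeks.

    `history` is a hypothetical list of (datetime, value) pairs.
    """
    # Align the window to the start of the polled hour so that each prior
    # week contributes its full hour and the current hour is excluded.
    hour_start = polled_at.replace(minute=0, second=0, microsecond=0)
    window_start = hour_start - timedelta(weeks=weeks)
    matching = [
        value
        for stamp, value in history
        if window_start <= stamp < hour_start
        and stamp.weekday() == polled_at.weekday()
        and stamp.hour == polled_at.hour
    ]
    return mean(matching), stdev(matching)

# Hypothetical data: a poll at 1:20am on a Monday and a small history.
polled_at = datetime(2018, 2, 12, 1, 20)
history = [
    (polled_at - timedelta(weeks=1, minutes=10), 40),  # same hour, 1 week ago
    (polled_at - timedelta(weeks=2, minutes=10), 50),  # same hour, 2 weeks ago
    (polled_at - timedelta(weeks=3, minutes=10), 60),  # same hour, 3 weeks ago
    (polled_at - timedelta(weeks=4, minutes=10), 10),  # too old, ignored
    (polled_at - timedelta(weeks=1, hours=5), 95),     # different hour, ignored
    (polled_at - timedelta(days=6, minutes=10), 99),   # different weekday, ignored
]
baseline_mean, baseline_sd = same_hour_and_day_baseline(history, polled_at)
print(baseline_mean, baseline_sd)
```

Only the three samples from the same weekday and hour feed the baseline, which is why business-hours and off-hours traffic do not distort each other.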

EXAMPLE #1: Raise an alarm if Bandwidth Utilization is greater than 1 standard deviation from the mean. Use a 3-week baseline for same hour and day.

 DFM01.png

 

EXAMPLE #2: Raise an alarm if Bandwidth Utilization is greater than 1 standard deviation from the mean AND if the polled data value is greater than 90%. Use a 3-week baseline for same hour and day.

DFM02.png
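
The combined condition in Example #2 can be sketched as a simple boolean test. This is a hedged illustration with hypothetical names and values. It reads "greater than 1 standard deviation from the mean" as above the mean, which is an assumption.

```python
def combined_alarm(polled_value, baseline_mean, baseline_sd,
                   deviations=1, floor=90.0):
    """True when the value is more than `deviations` standard deviations
    above the baseline mean AND also above the static floor."""
    above_deviation = polled_value > baseline_mean + deviations * baseline_sd
    above_floor = polled_value > floor
    return above_deviation and above_floor

# Hypothetical baseline: mean 60%, standard deviation 10%.
print(combined_alarm(93.0, 60.0, 10.0))  # deviates and is over 90%
print(combined_alarm(85.0, 60.0, 10.0))  # deviates but stays under 90%
```

Adding the static floor suppresses alarms for values that are unusual for that hour but still harmless in absolute terms.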

 

Time Over Dynamic Threshold

This is a tougher concept to understand, and also makes use of mean and standard deviation. In a sense it combines a static threshold value with deviation from normal.

The mean is the average value of all the data points that have been polled in a particular time frame (usually 24 hours). Standard deviation measures how widely those values are spread around the mean: a small standard deviation means the values cluster close to the mean, and a large one means they vary widely. The standard deviation is calculated once each day.

You also set the size of the potential range in the Percentile from fields. Live Exceptions calculates the potential range based on the percentile that you choose and the historical data for the element. If you specify a high percentile, the potential range is wider because the sample value is larger. If you specify a smaller percentile, the potential range is narrower because the sample value is smaller.


The standard deviation is calculated continually (dynamically) as new data points arrive, and eHealth keeps track of the daily variations in usage. For Bandwidth Utilization I would suggest using the 100th percentile, that is, all potential data points.

This is the most confusing concept, because the value you enter into the potential range defines how standard deviation is calculated. For example, if I used the 100th percentile, each day my standard deviation is calculated using every data point collected. If I used the 90th percentile, my standard deviation is calculated only on the data points that fall within a narrower range.

Let’s use a small sampling of only ten data points for an example:

50, 74, 98, 73, 83, 70, 56, 61, 77, 75

The mean of these points is 71.7. If I calculate off of the 100th percentile of data (i.e., all data points), the standard deviation equals 13.7. If I calculate off of the 90th percentile, I am excluding the highest value in the data set, which is 98 (I have 10 data points and I am looking at the bottom 90%). My standard deviation would now be 10.8. As you can see, a smaller percentile results in a smaller range.
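
The figures above can be reproduced with a short Python sketch. Two assumptions are worth flagging: the sample (n-1) standard deviation is used because it matches the 13.7 and 10.8 quoted here, and the percentile is applied by simply keeping the lowest values, which follows this article's description rather than any documented Live Exceptions algorithm. The helper name trimmed_stdev is hypothetical.

```python
from statistics import mean, stdev

points = [50, 74, 98, 73, 83, 70, 56, 61, 77, 75]

def trimmed_stdev(values, percentile):
    """Sample standard deviation of the lowest `percentile` percent of values."""
    keep = round(len(values) * percentile / 100)
    return stdev(sorted(values)[:keep])

print(round(mean(points), 1))                # 71.7
print(round(trimmed_stdev(points, 100), 1))  # 13.7
print(round(trimmed_stdev(points, 90), 1))   # 10.8 (98 is excluded)
```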

Time over dynamic threshold rules can generate an alarm if the polled data value + 1 standard deviation exceeds a threshold you assign. For example, let's say your rule is measuring Bandwidth Utilization, and you want to throw an alarm if utilization exceeds 90%. Since very high utilization can affect your network, you choose a large percentile sample, like the 95th percentile. Your rule would be the following: 

Above 95 percentile from 90 percent. With this rule, an alarm is raised if the actual polled value of bandwidth utilization plus the standard deviation (calculated against the bottom 95% of collected data points) exceeds 90% for some specified time range.

Let's say today the standard deviation is equal to 5.5, and our first polled data point is 81%. Our calculation is 81 + 5.5 = 86.5%. This does not breach the threshold of 90%, so no alarm is raised. The next data point is 89%. Our calculation is now 89 + 5.5 = 94.5%. This breaches our threshold of 90%, so an alarm is raised.
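
The same arithmetic as a minimal Python sketch, with a hypothetical function name. It covers only the comparison for a single data point and leaves out the time-range part of the rule.

```python
def dynamic_threshold_breached(polled_value, standard_deviation, threshold=90.0):
    """True when polled value + standard deviation exceeds the threshold."""
    return polled_value + standard_deviation > threshold

todays_sd = 5.5  # standard deviation from the example above
for polled in (81.0, 89.0):
    print(polled, polled + todays_sd, dynamic_threshold_breached(polled, todays_sd))
```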

EXAMPLE: Raise an alarm if Bandwidth Utilization of the data point(s) + (standard deviation based on the 95th percentile of data) exceeds 90%.

 

TODT01.png