How is Uptime Calculated by the ASM API and How does it Affect Uptime Display in PSP Pages?

Document ID : KB000030682
Last Modified Date : 14/02/2018

Question:  

How is Uptime Calculated by the ASM API and how does it affect Uptime Display in PSP Pages?

 

Answer:

The answer depends on the API call being used. Uptime metrics are calculated for both the rule_stats and rule_psp API calls. For the rule_stats API call (https://api.cloudmonitor.ca.com/1.6/rule_stats?doc), uptime is calculated as a straight percentage. For a given interval, the calculation is ((checks - checks_errors) / checks) * 100.
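As an illustration, the straight percentage can be sketched in Python. The function name straight_uptime is hypothetical; only the field names checks and checks_errors come from the formula above, and the guard against a zero check count is an assumption about how an empty interval might be handled.

```python
def straight_uptime(checks: int, checks_errors: int) -> float:
    """Return uptime as a straight percentage for one interval.

    Implements ((checks - checks_errors) / checks) * 100.
    """
    if checks <= 0:
        # Assumption: an interval with no checks has no defined uptime.
        raise ValueError("checks must be a positive number")
    return (checks - checks_errors) / checks * 100


# Example: 200 checks in the interval, 5 of which failed.
print(f"{straight_uptime(200, 5):.2f}%")
```

Every check in the interval counts equally here, no matter when it ran.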

The rule_psp API call (https://api.cloudmonitor.ca.com/1.6/rule_psp?doc) does not return a straight average; it uses a weighted average based on an exponential moving average (https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average). The ASM API implementation gives greater weight to more recent monitor results, with the weighting based on how frequently the monitor runs. This method is applied to both the daily and current (approximately the last 20 minutes) uptime calculations.

Specifically, there is a "weighting coefficient", alpha, with a value between 0 and 1. The value of alpha depends on the monitoring interval and determines how fast older results become insignificant. At each interval, the last moving average and the new value are used to calculate a new average as follows:

new_average = alpha * new_value + (1 - alpha) * last_average
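A minimal Python sketch of this single update step follows. The function name update_average is hypothetical, and the alpha value of 0.4 is only an example; the actual alpha values ASM uses for each monitoring interval are not given here.

```python
def update_average(last_average: float, new_value: float, alpha: float) -> float:
    """Apply one exponential-moving-average step.

    new_average = alpha * new_value + (1 - alpha) * last_average
    """
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * new_value + (1 - alpha) * last_average


# Example: the previous average was 100, the latest check failed (value 0),
# and alpha is 0.4 (an illustrative value only).
print(f"{update_average(100.0, 0.0, 0.4):.2f}")
```

A larger alpha makes the average react faster to the newest result; a smaller alpha keeps more of the history.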

 

Let's consider the following example: a 5-minute monitor with a corresponding alpha coefficient of 0.4. The variable t represents time in minutes, new_val is the value reported for that interval, and cur_avg is the moving average at that time:

The service is working and the new value returned is 100:

t = 0:

cur_avg = 100.0 (initial state)

 

t = 5:

new_val = 100

cur_avg = 0.4 * 100.0 + (1.0 - 0.4) * 100.0 = 40.0 + 60.0 = 100.0

 

t = 10:

new_val = 100

cur_avg = 0.4 * 100.0 + (1.0 - 0.4) * 100.0 = 40.0 + 60.0 = 100.0

 

Now the service starts failing and the new value reported each interval is 0:

t = 15:

new_val = 0

cur_avg = 0.4 * 0 + (1.0 - 0.4) * 100.0 = 0.0 + 60.0 = 60.0

 

t = 20:

new_val = 0

cur_avg = 0.4 * 0 + (1.0 - 0.4) * 60.0 = 0.0 + 36.0 = 36.0

 

t = 25:

new_val = 0

cur_avg = 0.4 * 0 + (1.0 - 0.4) * 36.0 = 0.0 + 21.6 = 21.6

 

t = 30:

new_val = 0

cur_avg = 0.4 * 0 + (1.0 - 0.4) * 21.6 = 0.0 + 12.96 = 12.96

 

Then the service starts working again:

t = 35:

new_val = 100

cur_avg = 0.4 * 100.0 + (1.0 - 0.4) * 12.96 = 40.0 + 7.78 = 47.78

 

t = 40:

new_val = 100

cur_avg = 0.4 * 100.0 + (1.0 - 0.4) * 47.78 = 40.0 + 28.67 = 68.67

 

t = 45:

new_val = 100

cur_avg = 0.4 * 100.0 + (1.0 - 0.4) * 68.67 = 40.0 + 41.20 = 81.20

 

t = 50:

new_val = 100

cur_avg = 0.4 * 100.0 + (1.0 - 0.4) * 81.20 = 40.0 + 48.72 = 88.72

 

t = 55:

new_val = 100

cur_avg = 0.4 * 100.0 + (1.0 - 0.4) * 88.72 = 40.0 + 53.23 = 93.23

 

t = 60:

new_val = 100

cur_avg = 0.4 * 100.0 + (1.0 - 0.4) * 93.23 = 40.0 + 55.94 = 95.94
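
The walkthrough above can be reproduced with a short Python sketch. The function name moving_averages and the list of reported values are illustrative; they encode the example's sequence (two successful checks, four failures, six successful checks) and its assumed initial average of 100. The sketch keeps full precision between steps, whereas the walkthrough rounds each step to two decimals; the printed values agree with it at two decimal places.

```python
ALPHA = 0.4          # example coefficient for a 5-minute monitor
INTERVAL_MINUTES = 5

# Values reported at t = 5, 10, ..., 60 in the example above.
reported = [100, 100, 0, 0, 0, 0, 100, 100, 100, 100, 100, 100]


def moving_averages(values, alpha, initial=100.0):
    """Return the moving average at t = 0 and after each reported value."""
    averages = [initial]
    for value in values:
        averages.append(alpha * value + (1 - alpha) * averages[-1])
    return averages


series = moving_averages(reported, ALPHA)
for step, average in enumerate(series):
    print(f"t = {step * INTERVAL_MINUTES:2d}: cur_avg = {average:.2f}")
```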

 

How does this affect PSP display?

As can be seen in this example, the availability drops from 100% (green) to 60% (red) in a single interval, providing immediate insight into a production outage on the PSP page. In addition, the availability of the service climbs rapidly again once the service becomes available. Please be aware, though, that the color-coded alert status indicators (green, orange, and red) used in the PSP will take a number of intervals to return to the orange (95%) and green (98%) thresholds. In the example, the average only passes 95% again at t = 60, six intervals after the service recovered.
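
To estimate how long the indicator takes to recover, the update step can be repeated until a threshold is reached. The sketch below is an illustration only: the function name intervals_to_reach is hypothetical, and it assumes that a color applies as soon as the average is at or above its threshold, which may not match the exact PSP rule. It starts from the example's low point of 12.96 with alpha set to 0.4.

```python
def intervals_to_reach(start_average, threshold, alpha, new_value=100.0):
    """Count successful intervals needed for the average to reach threshold."""
    if not 0 < alpha <= 1 or threshold > new_value:
        # The average could never reach the threshold in these cases.
        raise ValueError("threshold is unreachable with these parameters")
    average = start_average
    intervals = 0
    while average < threshold:
        average = alpha * new_value + (1 - alpha) * average
        intervals += 1
    return intervals


low_point = 12.96  # moving average at t = 30 in the example
print("intervals to 95%:", intervals_to_reach(low_point, 95.0, 0.4))
print("intervals to 98%:", intervals_to_reach(low_point, 98.0, 0.4))
```

Under these assumptions the 95% threshold is reached after 6 successful intervals (30 minutes for a 5-minute monitor) and the 98% threshold after 8 (40 minutes).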

 

Why are the Uptime Calculations Different between the rule_stats and rule_psp API Calls?

The straight average used by the rule_stats call provides an accurate long-term calculation of a website's uptime; however, it is not sensitive to recent changes in website performance. The rule_psp call uses a weighted average instead of a straight average because the uptime values it returns are used in the Public Status Pages (PSPs) created by ASM. From a design perspective, a PSP should accurately reflect the current status of a website. In ASM this means weighting more recent monitor uptime results more heavily, so that application and PSP users have an accurate reflection of your website's current status.