So, let's take a look at the next query:
select h.job_id as job_id, j.name as job_name, c.name category_name,?
dateadd(hh,run_time/10000,dateadd(mi,run_time%10000/100,dateadd(ss,run_time%100, convert(datetime,convert(nvarchar(8),run_date),112)))) as rundate, datediff(minute, dateadd(hh,run_time/10000, dateadd(mi,run_time%10000/100, dateadd(ss,run_time%100, convert(datetime,convert(nvarchar(8),run_date),112)))), getdate()) as elapsed_time
from msdb.dbo.sysjobhistory h INNER JOIN msdb.dbo.sysjobs j ON h.job_id = j.job_id
inner join msdb.dbo.syscategories c on j.category_id = c.category_id
where c.category_class = 1 and h.run_status <> 1 and h.step_id = 0 and
dateadd(hh,run_time/10000, dateadd(mi,run_time%10000/100, dateadd(ss,run_time%100, convert(datetime,convert(nvarchar(8),run_date),112)))) > dateadd(ss,-60*60*36,getdate())
order by elapsed_time
This is the query used by the checkpoint and the one we'll be using in our example to see values such as:
- job_id, job_name, category_name, rundate and elapsed_time.
Above all, "elapsed_time" is the most important one here.
By definition, elapsed_time is equal to ?Current time - Job started time
When we set a threshold in our agent_job_failure checkpoint, what we are doing in reality is set a value (in minutes) which will generate an alarm for every job failure whose elapsed_time is lower than the threshold.
In other words, let's say we set a threshold of 5 (5 minutes).
We create a job failure that will be running for 5 minutes (it will fail once per minute) as follows:
1) We create a job failure and set a schedule that goes from 5.00 p.m to 5.05 p.m with an interval of 1 minute. So, we'll be having a job failing 5 times within 5 minutes.
2) Let's say we set our "check_interval" parameter in our agent_job_failure checkpoint to 1 (it will check the checkpoint every 1 minute).
3) Let's write down the times at which our job is failing:
Let's assume our current time is 5.06 p.m.
At 5.00, job failed, elapsed_time (current time - time job failed) = 6?
At 5.01, job failed, elapsed_time = 5
At 5.02, job failed, elapsed_time = 4
At 5.03, job failed, elapsed_time = 3
At 5.04, job failed, elapsed_time = 2
At 5.05, job failed, elapsed_time = 1
With this scenario and our checkpoint's check interval equal to 1, the first job will be cleared at 5.06.
Why? The same way that in order to have an alarm for a job failure the elapsed_time has to be lower than the threshold, for a job failure to have a clear alarm, the elapsed_time must be higher than the threshold.
So, at 5.06, there's just one elapsed_time higher (6) than the threshold, which is 5, and this is the time when the first job that failed will be cleared.
At 5.07, we'll have, elapsed_time from the first job failure equal to 7, from the second, equal to 6, from the third, equal to 5, and so on.
So, at 5.07, we'll just have 4 alarms remaining.
At 5.08, we'll have 3 alarms remaining and eventually, at 5.11, we should have no alarms remaining in our Alarm Window since all the job that failed have their elapsed_time higher than the threshold (5 in this example)