Job Running on Load Balanced Agent Receives False CHASE Alarm

Document ID : KB000094323
Last Modified Date : 03/05/2018
A job is defined to run on a machine name that points to a load balancer, such as an F5. The load balancer can route the job to one of several physical agent machines on a round-robin basis. When the job is in RUNNING status and the chase command is executed to check on running jobs, a CHASE alarm is raised for the job even though it is confirmed as still running on the agent. If the chase command is run with the -E argument, chase sets the job to FAILURE status.
This limitation is documented in the product reference for the chase command:


The chase command does not always work in an environment with agents installed in a load-balancing cluster. In that case, the chase command works only when the cluster name is configured to point to the same node where the job ran. For example, suppose that you have installed agents in a 3-node cluster that is named CLUSTERNODE that is configured to point to either NODE1, NODE2, or NODE3. Also, suppose that you create a machine definition that specifies CLUSTERNODE as the node_name attribute value. If the cluster name, CLUSTERNODE, points to NODE1 when you start the job and you configure it to point to NODE2 after the job completes, then all chase requests are directed at NODE2. The chase command does not verify jobs running on NODE1 until you configure the cluster name to point back to NODE1.

This same limitation exists for KILLJOB events (including those generated by term_run_time) and job log retrieval from WCC or the 'autosyslog -J' command.
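
The limitation described above can be illustrated with a small Python model. This is only a sketch and not product code: the agents dictionary and the route, start_job, and chase functions are hypothetical stand-ins for the load balancer and the agents, and the model assumes strict round-robin routing.

```python
from itertools import cycle

# Hypothetical model: each agent only knows about the jobs running locally.
agents = {"NODE1": set(), "NODE2": set(), "NODE3": set()}

# The load balancer hands out nodes in turn (assumed round-robin).
rotation = cycle(agents)

def route():
    """Return the node that the load balancer hostname resolves to next."""
    return next(rotation)

def start_job(job):
    """Start a job through the load balancer and return the node it landed on."""
    node = route()
    agents[node].add(job)
    return node

def chase(job):
    """Ask whichever node the load balancer picks whether the job is running."""
    node = route()
    return node, job in agents[node]

ran_on = start_job("JOB_A")    # the job starts on the first node in rotation
asked, found = chase("JOB_A")  # the chase request is routed to the next node
print(ran_on, asked, found)
```

In this model the job runs on one node while the chase request is answered by another, which has no record of the job. That mismatch is the condition under which a false CHASE alarm is raised.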

There is no workaround or solution that allows you to continue using a load balancer hostname in job definitions.
Instead, use a WAAE virtual machine that contains the real machines that are currently behind the load balancer. The scheduler then picks the machine on which the job runs, based on the MachineMethod configured for the instance. Once it picks a real machine, that real machine name is stored and used by subsequent chase commands, KILLJOB events, and job log retrieval.

In 11.3.6 SP7, a new MachineMethod option called "roundrobin" was introduced. With this machine method, the Scheduler distributes jobs to the machines within a virtual machine on a round-robin basis rather than checking which machine has the most availability. SP7 also adds an option to override the instance-wide MachineMethod setting within any virtual machine's definition. This allows you to use the machine method best suited to each virtual machine rather than forcing the same method on all of them.
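
The difference the virtual machine approach makes can be sketched in the same style. Again, this is an illustrative Python model under stated assumptions, not the Scheduler's implementation: the VirtualMachine class, the run_machine table, and the VIRT1 name are hypothetical, and the selection logic only imitates a round-robin method.

```python
from itertools import cycle

class VirtualMachine:
    """Hypothetical model of a virtual machine that lists real machines."""

    def __init__(self, name, real_machines):
        self.name = name
        self._rotation = cycle(real_machines)

    def pick(self):
        """Choose the next real machine in round-robin order."""
        return next(self._rotation)

# Each agent only knows about the jobs running locally.
agents = {"NODE1": set(), "NODE2": set(), "NODE3": set()}

# Assumed bookkeeping: the real machine chosen for each job is recorded.
run_machine = {}

def start_job(job, vm):
    """Pick a real machine, start the job there, and remember the choice."""
    node = vm.pick()
    agents[node].add(job)
    run_machine[job] = node
    return node

def chase(job):
    """Check the job on the real machine that was recorded at start time."""
    node = run_machine[job]
    return node, job in agents[node]

vm = VirtualMachine("VIRT1", ["NODE1", "NODE2", "NODE3"])
first = start_job("JOB_A", vm)
second = start_job("JOB_B", vm)
print(first, second, chase("JOB_B"))
```

Because the real machine name is recorded when the job starts, the later check goes to the node that actually holds the job, regardless of how many other jobs were distributed in between.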