How do I set up heartbeat checking between two CA Automation Point computers?

Document ID : KB000027743
Last Modified Date : 14/02/2018


CA Automation Point customers want to know that all of their CA Automation Point machines are running properly. They also want to be able to "hot swap" machines in case of a hardware error. You can use a combination of REXX programs, PPQs, and rules to "heartbeat check" between two CA Automation Point machines.




This article applies to CA Automation Point r11.4.x and r11.5.x.



To start, you need to configure program-to-program queues (PPQs) on both CA Automation Point machines. PPQs are an inter-process communications tool. They are small data repositories that can be accessed via TCP/IP between Automation Point machines. In this article, we discuss how to use PPQs to pass an "I'm Alive" message between two Automation Point computers.

To configure PPQs, on each of the two Automation Point machines, do the following:

  1. From the Configuration Manager dialog, go to Expert Interface -> Infrastructure -> Program to Program Queues. The Program to Program Queues dialog displays.

  2. On the Program to Program Queues dialog, check the Enable Use of PPQs box and make sure that TCP/IP is included under Configured Network Transports.

  3. Under TCP/IP Settings, enter the TCP/IP host name or IP address of the remote CA Automation Point machine with which you want to communicate.

The PPQ Service starts when you close Configuration Manager.

Once PPQs are configured, you need to write a REXX program that attempts to create the queue shared between the Automation Point computers. The shared queue is named after the Automation Point machine.

In this example, we configure the REXX program so that it starts as soon as CA Automation Point starts on each of the Automation Point computers. The first computer to start creates the shared queue. The REXX program would look like this:

/* Hbeat_start.rexx */
/* First we initialize our REXX variables to 0, then try to create */
/* the shared PPQ queue. If the create fails, we send a message    */
/* to the AP message window, which can be automated by rules       */
remotemachinename_status = 0
address GLV "putp remotemachinename_status"
remotemachinename_failure = 0
address GLV "putp remotemachinename_failure"
Address PPQ "create queue(machinename) share(yes)"
If rc <> 0 then
  address axc "wtxc ' PPQ create failed. Please check network connectivity '"

Now we can set up the rules file to perform the heartbeat checks. In this example, assume that a time rule runs a REXX program at a regular interval.

The rule would look like this:


PPQs can be manipulated directly from rules, but REXX programs are much more flexible. The REXX program first writes a heartbeat item to the local machine's queue, then reads the remote machine's queue, and sets two variables, remotemachinename_status and remotemachinename_failure. Replace machinename with the local Automation Point machine name and remotemachinename with the remote Automation Point machine name. If the queue is read successfully, remotemachinename_status is set to 1 and remotemachinename_failure is reset to 0. Otherwise, remotemachinename_status is set to 0 and remotemachinename_failure, which counts consecutive failed reads, is increased by 1. If remotemachinename_failure becomes greater than 3, the program sends a message to the Automation Point messages window.

The REXX program would look like this:

/* We write our heartbeat message to the proper queue. If we do not */
/* get a 0 return code, we do a wtxc, which can have a rule written */
/* against it to do a notification. The user should change          */
/* machinename to the name of the local AP machine and              */
/* remotemachinename to the name of the remote AP machine           */
Address PPQ "write queue(machinename) item(heartbeat)"
If rc <> 0 then address axc "wtxc ' PPQ write failed. Please check network connectivity '"
/* checkit and resolve are routines that are not shown in this article */
call checkit
call resolve
/* Now we look to see if we have received a heartbeat from */
/* the remote AP machine                                   */
Address PPQ "read queue(remotemachinename) prefix(item)"
If rc = 0 then do
    remotemachinename_status = 1
    address GLV "putp remotemachinename_status"
    remotemachinename_failure = 0
    address GLV "putp remotemachinename_failure"
end
else do
    remotemachinename_status = 0
    address GLV "putp remotemachinename_status"
    address GLV "get remotemachinename_failure"
    /* Increase the failure count by 1 */
    nu_fail = remotemachinename_failure + 1
    remotemachinename_failure = nu_fail
    address GLV "putp remotemachinename_failure"
end
/* If the failure count gets above 3 we need to do something */
/* so we send a message to the AP message window             */
Address GLV "get remotemachinename_failure"
If remotemachinename_failure > 3 then
    address axc "wtxc 'Heartbeat failure on remote AP'"
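
The status and failure-count logic can also be checked outside Automation Point. The following is a minimal sketch in Python, not Automation Point code: it only simulates the two variables and the "greater than 3" check. The names HeartbeatState, record_read, and FAILURE_THRESHOLD are hypothetical and exist only for this illustration. The sequence of read results is made up.

```python
from dataclasses import dataclass

# Assumption: the threshold of 3 is taken from the article's "> 3" check.
FAILURE_THRESHOLD = 3


@dataclass
class HeartbeatState:
    """Stands in for remotemachinename_status and remotemachinename_failure."""
    status: int = 0
    failure: int = 0


def record_read(state: HeartbeatState, read_ok: bool) -> bool:
    """Update the state after one attempt to read the remote queue.

    Returns True when the failure count is above the threshold, which is
    the point at which the REXX program sends its message.
    """
    if read_ok:
        # Successful read: mark the remote machine alive, reset the counter.
        state.status = 1
        state.failure = 0
    else:
        # Failed read: mark the remote machine down, count the miss.
        state.status = 0
        state.failure += 1
    return state.failure > FAILURE_THRESHOLD


if __name__ == "__main__":
    state = HeartbeatState()
    # Hypothetical read results: one success followed by five misses.
    for ok in [True, False, False, False, False, False]:
        alert = record_read(state, ok)
        print(ok, state.status, state.failure, alert)
```

With this logic, the alert condition first becomes true on the fourth consecutive failed read. It stays true on every later failed read until a read succeeds, so the REXX program above sends its message on each run during an extended outage.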

Additional Information:

This simple example can easily be expanded to fit your particular needs.