How to monitor remote LPARs using CA OPS/MVS Event Management and Automation.

Document ID : KB000008980
Last Modified Date : 14/02/2018
Issue:

We had a situation where two LPARs stopped and we did not catch it.  On a central LPAR, we monitor both messages:

  • OPS3440O MSF SYSTEM XXXX HAS NOT RESPONDED TO A PING
  • OPS3504O SYSTEM ID XXXX IS NOW INACTIVE 

If we get 5 OPS3440O messages within 10 minutes, we alert the operators. This time, we only got 2 OPS3440O messages, and no alerting was done. 

If we get an OPS3504O INACTIVE message and the remote OPS/MVS was not stopped in a controlled way, we alert the operator and create a ticket. This message did not occur.

From the OPSLOG on the central LPAR:
03DEC 19:02:16 OPS3440O MSF SYSTEM ABC1 HAS NOT RESPONDED TO A PING           
03DEC 19:02:16 OPS3440O MSF SYSTEM XYZ1 HAS NOT RESPONDED TO A PING           
03DEC 19:04:16 OPS3440O MSF SYSTEM ABC1 HAS NOT RESPONDED TO A PING           
03DEC 19:04:16 OPS3440O MSF SYSTEM XYZ1 HAS NOT RESPONDED TO A PING           
03DEC 19:04:22 OPS3541O APPC SEND FUNCTION FAILED FOR ABC1 - RPL6RC = X'0048000
03DEC 19:04:22 OPS3540O APPC SEND DATA FUNCTION FAILED FOR ABC1 - RC=001A, RESO
03DEC 19:04:22 OPS3541O APPC SEND FUNCTION FAILED FOR XYZ1 - RPL6RC = X'0048000
03DEC 19:04:22 OPS5006O MFSNTP subtask terminating                            
03DEC 19:04:22 OPS3540O APPC SEND DATA FUNCTION FAILED FOR XYZ1 - RC=001A, RESO
03DEC 19:04:22 OPS5006O MFSNTP subtask terminating                            
03DEC 19:04:36 OPS3541O APPC RECEIVE FUNCTION FAILED FOR XYZ1 - RPL6RC = X'004C
03DEC 19:04:36 OPS3540O APPC RECEIVE FUNCTION FAILED FOR XYZ1 - RC=001B, RESOUR
03DEC 19:04:36 OPS5006O MFRCTP subtask terminating     
03DEC 19:05:19 OPS3541O APPC RECEIVE FUNCTION FAILED FOR ABC1 - RPL6RC = X'004C
03DEC 19:05:19 OPS3540O APPC RECEIVE FUNCTION FAILED FOR ABC1 - RC=001B, RESOUR          
03DEC 19:05:19 OPS5006O MFRCTP subtask terminating            

We did not get the OPS3504O SYSTEM ID ABC1 IS NOW INACTIVE or OPS3504O SYSTEM ID XYZ1 IS NOW INACTIVE messages.


Do you have a suggestion on how to monitor LPARs using MSF?

Environment:
CA OPS/MVS 12.3, CA CCI MSF, CA OPS/MVS MSF
Cause:

The remote systems cannot reply to the pings because of a network problem, a VTAM problem, or a major system error.

Resolution:

Write a rule to take action on these messages:

OPS3541O APPC SEND FUNCTION FAILED FOR PCS2 - RPL6RC = X'00480000', RPLFDBK = X'000B00'

or

OPS3540O APPC SEND DATA FUNCTION FAILED FOR PCS2 - RC=001A, RESOURCE FAILURE NO RETRY

>  When OPS3541O shows "SEND FUNCTION FAILED" with RPL6RC = X'0048...', you can assume the MSFID is dead.
>  OPS3540O can also be used when it shows "RESOURCE FAILURE NO RETRY".
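The matching logic for these two messages can be sketched outside OPS/MVS as follows. This is only an illustration in Python, not an OPS/MVS rule: the function name dead_msf_system and both regular expressions are assumptions made for this example, and in practice the same tests would be coded in a message rule for OPS3541O and OPS3540O.

```python
import re

# OPS3541O: APPC SEND FUNCTION FAILED with an RPL6RC that begins X'0048
SEND_FAILED = re.compile(
    r"OPS3541O APPC SEND FUNCTION FAILED FOR (?P<system>\S+) - RPL6RC = X'0048"
)

# OPS3540O: an APPC function failure that reports RESOURCE FAILURE NO RETRY
NO_RETRY = re.compile(
    r"OPS3540O APPC .*? FUNCTION FAILED FOR (?P<system>\S+) - .*RESOURCE FAILURE NO RETRY"
)


def dead_msf_system(message):
    """Return the MSF system ID if the message suggests a dead link, else None."""
    for pattern in (SEND_FAILED, NO_RETRY):
        match = pattern.search(message)
        if match:
            return match.group("system")
    return None


if __name__ == "__main__":
    sample = ("OPS3541O APPC SEND FUNCTION FAILED FOR PCS2 - "
              "RPL6RC = X'00480000', RPLFDBK = X'000B00'")
    print(dead_msf_system(sample))
```

Ping timeouts (OPS3440O) and APPC RECEIVE failures are deliberately not matched here, because the text above names only the SEND failure with RPL6RC = X'0048...' and the "RESOURCE FAILURE NO RETRY" case as reliable indicators.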

Additional Information:

OPSYSPLX Function

Understanding the MSF

Here are some other possible solutions:

>  You can register OPS with Automatic Restart Management (ARM) to have it restart automatically after a failure.

>  If you have the CA Automation Point software, you can use it to monitor failure messages from OPS and restart the product.

>  We have also seen customers run a second, scaled-down copy of CA OPS that only manages the production copy and restarts it after a failure.

>  If the remote system is in the sysplex, you can add an OPSYSPLX call to your logic to verify that the system did not stop.

systems = OPSYSPLX('i','s') - https://docops.ca.com/ca-opsmvs-123-EN/reference-information/command-and-function-reference/ops-rexx-built-in-functions/opsysplx-function

This call returns the status in word 11 and the last status date in word 5. If the status is not ACTIVE, you may have a problem, but you would need to evaluate further based on the actual value.

You can also add the system name to the call, for example OPSYSPLX('i','s',SYSA), to look only at the system you are interested in. Again, the system must be within a sysplex for this to work.
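The word positions described above can be checked with logic along these lines. This is a minimal Python sketch of the parsing only, assuming that each system record is a blank-delimited string; the function names and the sample record are hypothetical, and the sample is a placeholder, not real OPSYSPLX output. In an OPS/REXX program the same extraction would use the WORD built-in function.

```python
def system_status(record):
    """Return (status, last_status_date) from one system record.

    Assumes, as stated in the text, that blank-delimited word 11 is the
    status and word 5 is the last status date.
    """
    words = record.split()
    if len(words) < 11:
        raise ValueError("expected at least 11 words, got %d" % len(words))
    return words[10], words[4]  # word 11 and word 5, counted from 1


def needs_attention(record):
    """True when the status word is anything other than ACTIVE."""
    status, _ = system_status(record)
    return status != "ACTIVE"


if __name__ == "__main__":
    # Placeholder record: only the positions of DATE5 and ACTIVE matter.
    placeholder = "w1 w2 w3 w4 DATE5 w6 w7 w8 w9 w10 ACTIVE"
    print(system_status(placeholder), needs_attention(placeholder))
```

A status other than ACTIVE is treated only as a reason to look closer, since the actual value still has to be evaluated.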

If the system froze, the same thing would occur on any release of CA OPS/MVS, because no error is detected and VTAM does nothing (it is frozen). If all systems other than the monitoring system are in a sysplex, they could use TOD rules to check the status of the systems within their sysplexes and notify the monitoring system if something appears wrong.