Unexpected Alarm Generation Failure - Root Cause Analysis

Document ID : KB000121217
Last Modified Date : 26/11/2018
Introduction:
We recently experienced a UIM outage. Existing alarms in the Infrastructure Management alarm console stopped being updated (i.e., the Time Received data was not updated when polling cycles were expected to occur), and new alarms were not created.
Environment:
- UIM 8.5.1
Instructions:
The options for addressing a scenario where alarms are no longer being generated due to an issue within the UIM environment include:

1. Deploy a second instance of UIM to monitor the first one, using the dirscan probe to check whether the nas/AE queue files are growing or changing, or
2. Use a subscription service such as Runscope or AWS Lambda to monitor UIM health externally, or
3. Follow the attached KB article titled 'Best Practices for Monitoring CA UIM (self-health monitoring)' and use logmon to monitor the nas and AE log files for fatal errors and the string "max restarts" (at loglevel 1).
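
The checks behind options 1 and 3 can be illustrated with a minimal Python sketch. It is not a UIM, dirscan, or logmon API; in practice these checks are configured in the probes themselves. The function names, the sample file names, and the decision to treat "size increased in every interval" as "growing" are assumptions made for illustration only.

```python
import re
from typing import Dict, Iterable, List

# Patterns taken from option 3: fatal errors and the string "max restarts".
# Case-insensitive matching is an assumption, not documented probe behaviour.
ALERT_PATTERNS = [
    re.compile(r"fatal", re.IGNORECASE),
    re.compile(r"max restarts", re.IGNORECASE),
]


def find_alert_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that match any of the alert patterns."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in ALERT_PATTERNS)
    ]


def growing_files(samples: List[Dict[str, int]]) -> List[str]:
    """Return names of files whose size increased in every sampling interval.

    `samples` is a hypothetical list of {file_name: size_in_bytes} snapshots,
    ordered oldest first. Files missing from any snapshot are ignored.
    """
    if len(samples) < 2:
        return []  # at least two snapshots are needed to see a trend
    common = set(samples[0])
    for snapshot in samples[1:]:
        common &= set(snapshot)
    return sorted(
        name
        for name in common
        if all(earlier[name] < later[name]
               for earlier, later in zip(samples, samples[1:]))
    )
```

For example, find_alert_lines could be given the lines of a nas log file, and growing_files a series of queue-directory snapshots taken one polling interval apart. Whether steady growth indicates a problem is left to the operator to judge.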
 
Additional Information:

The following link references hubmon (a custom probe) as well as a solution that uses MS SQL Server to monitor the NAS_TRANSACTION_LOG.

https://communities.ca.com/message/242017385-nas-and-alarm-enrichment-queue-monitoring