Best Practices for Monitoring CA UIM (self-health monitoring)

Document ID : KB000009640
Last Modified Date : 14/02/2018

Background:

Customers often request guidance on UIM self-health monitoring. The following list suggests probes and checks that can be used to monitor UIM itself.

Environment:
- UIM, any version

Instructions:

Availability
 
net_connect:
Monitor up/down status using both ICMP (port 0) pings and connection checks against specific service ports, e.g., NimBUS port 48000 for the robot and 48002 for the hub.
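
For spot checks outside of net_connect, the port check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the ports are the ones named in this article and may differ per site, and a successful TCP connection shows only that something is listening, not that the probe is healthy.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_uim_ports(host, ports=None):
    """Check a set of labeled ports on one host and return {label: is_open}."""
    # 48000 (robot) and 48002 (hub) are taken from the text above.
    ports = ports or {"robot": 48000, "hub": 48002}
    return {label: port_is_open(host, port) for label, port in ports.items()}
```

Calling check_uim_ports with the name of a hub machine returns a dictionary such as {"robot": True, "hub": False}, which a wrapper script could turn into an alert.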
 
System Resources

cdm or rsp probe:

Monitor CPU, Disk, Memory, I/O
 
processes:
Monitor CPU and memory usage for selected processes, e.g., hub.exe and nas.exe.
 
Probe Status

logmon:
Monitor the logs of core probes for max-restart entries. Core probes include:
 
hub, controller, and the UMP probes (dap, dashboard_engine, and wasp)
 
Use logmon to parse the probe logs for "Max restarts reached for probe" and for any hub, nas, or data_engine errors.
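
To verify by hand what a logmon profile would catch, the log check described above can be approximated with a short Python sketch. The search string is the one quoted in this article; the function names are hypothetical, and the log file path is left to the caller because it depends on the installation.

```python
import re
from pathlib import Path

# Search string quoted in this article.
MAX_RESTARTS = re.compile(r"Max restarts reached for probe")


def find_matches(lines, pattern=MAX_RESTARTS):
    """Return (line_number, text) for every line that matches pattern."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


def scan_log(path, pattern=MAX_RESTARTS):
    """Scan one log file, replacing any bytes that cannot be decoded."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return find_matches(handle, pattern)
```

Passing a different compiled pattern, for example one that matches "exception", reuses the same scan for other probe logs.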

dirscan:
Use the dirscan probe locally on each hub to monitor the size of the queue (q) files, and alarm when a file is larger than <size_of_file>.
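
The size check that dirscan performs can be illustrated with a minimal Python sketch. The function name, the directory argument, and the file name pattern are hypothetical; the threshold corresponds to the <size_of_file> placeholder above and must be chosen per environment.

```python
from pathlib import Path


def oversized_files(directory, max_bytes, pattern="*"):
    """Return (path, size) for regular files larger than max_bytes, largest first."""
    hits = []
    for path in Path(directory).glob(pattern):
        if path.is_file():
            size = path.stat().st_size
            if size > max_bytes:
                hits.append((path, size))
    # Largest files first, so the worst offender is reported at the top.
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

An empty result means no file in the directory exceeds the threshold.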

Optionally, deploy a remote nas and emailgtw on one of your remote hubs to send an email when a queue alarm is generated.

In the setup/hub section, set the hub and controller loglevel to at least 3 and the logsize to at least 8000 so that more detail is captured if a queue problem occurs again.

discovery_server:
- use the processes probe to monitor java.exe, matching on the command line associated with discovery_server
- use logmon to monitor the discovery_server log for "exception"
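
Because several probes run as java.exe, the process has to be identified by its command line rather than by its name. The matching logic can be sketched as follows. The function name is hypothetical, and the sketch assumes the probe name appears somewhere in the Java command-line arguments, which should be confirmed on the target system.

```python
def find_probe_processes(processes, probe_name, executable="java"):
    """Return PIDs whose command line runs executable and mentions probe_name.

    processes is an iterable of (pid, argv) pairs, where argv is a list of strings.
    """
    matches = []
    for pid, argv in processes:
        if not argv:
            continue
        # Reduce "C:\...\java.exe" or "/usr/bin/java" to the bare file name.
        exe = argv[0].replace("\\", "/").rsplit("/", 1)[-1].lower()
        if exe.startswith(executable) and any(probe_name in arg for arg in argv[1:]):
            matches.append(pid)
    return matches
```

On a live system the (pid, argv) pairs could come from any process listing, for example the third-party psutil package. An empty result would indicate that the probe's Java process is not running.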

Data

data_engine:
Monitor the data_engine log for errors and exceptions, and alarm on them.
Use the database probe that matches the type of database in use, e.g., sqlserver, oracle, or mysql.

cdm:
Monitor size of database files

Alerting

emailgtw:
- use processes to monitor the emailgtw.exe process
- use logmon to look for each of these errors in the log:

   "error on session"
   "failed to start"
   "FAILED to connect session"
 
Network errors

interface_traffic:
Monitor key interfaces for discards and errors, e.g., on hub and tunnel machines.
 
Services/Events
 
ntservices / ntevl:
- use these probes to monitor Windows services or event logs, e.g., the Application event log

Application crashes
 
Windows:
Monitor application crashes via ntevl.
 
Linux/Unix:
Use dirscan to monitor for the presence of core files.
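
A quick manual equivalent of the core file check can be sketched in Python. The function name is hypothetical, and the name patterns "core" and "core.*" are assumptions: the actual core file name and location depend on how the operating system is configured.

```python
import fnmatch
import os


def find_core_files(root, patterns=("core", "core.*")):
    """Walk root and return sorted paths of regular files matching a core-file pattern."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                full_path = os.path.join(dirpath, name)
                if os.path.isfile(full_path):
                    found.append(full_path)
    return sorted(found)
```

Pointing the function at the UIM installation directory lists any core files left behind by crashed probes.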
 
UMP performance
 
Instrument JMX on wasp by adding the following startup arguments to the Extra Java VM arguments:

-Dcom.sun.management.jmxremote.port=27000 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false

QoS can then be gathered via the jvm_monitor probe or a third-party application such as VisualVM:

http://visualvm.java.net/
http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/jmx_connections.html

Gateways

spectrumgtw:
- use the processes probe to monitor java.exe, matching on the command line associated with spectrumgtw
- use logmon to monitor spectrumgtw.log for "exception"