Over the weekend, with maintenance schedules in place, UIM was effectively unusable. The load average on the primary Linux server was very high, with CPU usage above 300%. The data_engine and qos_processor queues grew to more than 10 million messages and did not recover.
Deleting all of the maintenance schedules did not help: the maintenance_mode probe continued to use more than 300% CPU.
An error similar to the following was seen in the maintenance_mode probe log file:
Apr 24 11:13:46:233 WARN [attach_socket, com.nimsoft.monitor.probe.MaintenanceModeProbe] Failure registering /<UIM domain>/<primary hub name>/<primary hub robot name>/ems .(2) communication error, Error when trying to send on session (S) com.nimsoft.nimbus.NimServerSession(Socket[addr=/10.205.224.54,port=51450,localport=48046]): Broken pipe
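To check how often this warning occurs, the log can be scanned with a short script. The following is a minimal sketch, not a vendor tool: the pattern is derived only from the sample line above, so the wording in other releases may differ, and the function name and the log file path passed on the command line are hypothetical.

```python
import re
import sys

# Pattern derived from the sample log line above (an assumption: other
# probe versions may format the timestamp or message differently).
REGISTER_FAILURE = re.compile(
    r"^(?P<timestamp>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}:\d{3}) WARN "
    r".*Failure registering (?P<address>\S+)"
)


def find_registration_failures(lines):
    """Return (timestamp, address) pairs for each registration failure line."""
    failures = []
    for line in lines:
        match = REGISTER_FAILURE.search(line)
        if match:
            failures.append((match.group("timestamp"), match.group("address")))
    return failures


if __name__ == "__main__":
    # The log file location is supplied by the caller; no default path is assumed.
    if len(sys.argv) != 2:
        sys.exit("usage: python find_failures.py <path to maintenance_mode log>")
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for timestamp, address in find_registration_failures(log_file):
            print(f"{timestamp}  failed to register {address}")
```

A steady stream of matches against the ems address suggests that the registration problem described here is present.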
UIM Server 8.51
maintenance_mode: 8.40 (GA version) and 8.52 HF1
The exact cause could not be determined. The issue appears to be related to the maintenance_mode probe's inability to register with the ems probe.
Because the maintenance_mode probe appears to have trouble registering with the ems probe, the following steps were found to resolve the issue:
1. Deactivate the following probes in this order:
2. Make backup copies of the ems and trellis probe directories.
3. Delete the ems probe and, once it is removed, delete its probe directory and contents.
4. Delete the trellis probe and, once it is removed, delete its probe directory and contents.
5. Deploy the following set of probes in this order:
- trellis: 2.01
- alarm_routing_service: 9.00
- nas_api_service: 8.51
- ems: 9.01
These are the GA versions of the probes deployed with UIM 8.51. If you have been using a later GA release of ems, deploy that version instead.
6. Activate the 3 probes in reverse order: