Symptoms began as an inability to discover pingable or SNMP-manageable devices in a DC.
Then all DCs showed a state of Unknown.
Then all DCs were unable to discover new devices.
All supported CA Performance Management releases
Reviewing the AMQ queues for the DCs at DA_Host:8161 (the default login is admin:admin), we see large responses from the DA to the DCs that are waiting. This is shown by the large numbers in the left column. They indicate backed-up responses to requests the DCs made to the DA, which the DA is unable to deliver to the DCs.
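The same check can be scripted instead of read off the admin page. The following is a minimal sketch, not part of the product: it assumes the queue statistics are available as XML in the shape the ActiveMQ web console typically serves from its XML view (commonly /admin/xml/queues.jsp on port 8161; whether that view is exposed on a given DA is an assumption). The queue names, the sample numbers, and the threshold of 1000 are hypothetical.

```python
import xml.etree.ElementTree as ET


def backed_up_queues(queues_xml, threshold=1000):
    """Return (name, pending, consumers) for queues with more than `threshold` pending messages.

    `queues_xml` is assumed to look like the ActiveMQ console's queue XML view:
    <queues><queue name="..."><stats size="..." consumerCount="..."/></queue></queues>
    """
    root = ET.fromstring(queues_xml)
    flagged = []
    for queue in root.findall("queue"):
        stats = queue.find("stats")
        if stats is None:
            continue  # skip entries without statistics
        pending = int(stats.get("size", "0"))
        consumers = int(stats.get("consumerCount", "0"))
        if pending > threshold:
            flagged.append((queue.get("name"), pending, consumers))
    # Largest backlog first, so the worst queue is at the top.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical sample data; real queue names and counts will differ.
SAMPLE = """<queues>
  <queue name="example.dc.1">
    <stats size="12" consumerCount="1" enqueueCount="500" dequeueCount="488"/>
  </queue>
  <queue name="example.dc.2">
    <stats size="48210" consumerCount="1" enqueueCount="90000" dequeueCount="41790"/>
  </queue>
</queues>"""

if __name__ == "__main__":
    for name, pending, consumers in backed_up_queues(SAMPLE):
        print(f"{name}: {pending} pending, {consumers} consumer(s)")
```

A queue that keeps a large pending count while still having a consumer attached matches the situation described above: the DA has responses ready, but the DC is not draining them.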
When the AMQ broker is in a bad state, the activemq.log file often rolls over frequently. That is the case here as well.
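A quick way to quantify "rolling over frequently" is to count how many rolled log files were written in a recent window. This is a rough sketch under stated assumptions: rolled files are assumed to be named activemq.log.1, activemq.log.2, and so on (the usual pattern for a size-based rolling appender), and the log directory path and one-hour window are placeholders to adjust for the actual installation.

```python
import glob
import os
import time


def recent_rollovers(log_dir, window_seconds=3600, now=None):
    """Return rolled activemq.log.* files last modified within the window.

    Assumes rolled files sit next to activemq.log and are named
    activemq.log.<suffix>; the live activemq.log itself is not counted.
    """
    now = time.time() if now is None else now
    pattern = os.path.join(log_dir, "activemq.log.*")
    return sorted(
        path
        for path in glob.glob(pattern)
        if now - os.path.getmtime(path) <= window_seconds
    )


if __name__ == "__main__":
    # Hypothetical location; substitute the broker's real log directory.
    rolled = recent_rollovers("/path/to/activemq/data")
    print(f"{len(rolled)} rolled log file(s) in the last hour")
```

Several rollovers inside a single hour would be consistent with the behavior noted above, though what counts as too many depends on the configured log size and normal load.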
Further debug information is available at DA:8581/debug/dispatcher/container...
DC #1 was stopped, and DC #2 had its AMQ queue purged to free things up, because it was causing issues for other DCs. Once that was done, most of the DCs went Green and showed a valid status again.
DC #2 was unable to delete devices that had been removed. It was continually trying to get information about them from the DA. This overwhelmed a queue shared between the DA and the DCs. Once the queue was cleared via the AMQ admin page, most of the other DCs became available again.
- There was a ulimit issue on the DA last week. It caused problems making and fulfilling requests in the system.
- Device deletions were taking place at the same time. The devices were removed from the DA and DR.
- The DCs still knew about the devices because they had not received the delete requests, a result of the earlier DA ulimit problem.
- The DCs were constantly re-requesting the deleted devices' SNMP Profile and config. The DA was responding that no SNMP Profile was found, but the DCs were too busy to receive and process the response.