Best Practice for dealing with excessive SNMP and ICMP communication lost alarms in CA Spectrum

Document ID : KB000021760
Last Modified Date : 14/02/2018

Description:

This best practice knowledge base document focuses on the Management Agent Lost and Device Has Stopped Responding to Polls alarms.

The following alarms are generated in Spectrum OneClick more often than the device agent actually goes down:

MANAGEMENT AGENT LOST (alarm id 0x10701)
DEVICE HAS STOPPED RESPONDING TO POLLS (alarm id 0x10009)

The following events are seen:

0x10d35
{d "%w- %d %m-, %Y - %T"} - Device {m} of type {t} has stopped responding to polls and/or external requests. An alarm will be generated. (event [{e}])

0x10daa
{d "%w- %d %m-, %Y - %T"} - Device {m} of type {t} is no longer responding to primary management requests (e.g. SNMP), but appears to be responsive to other communication protocol (e.g. ICMP). This condition has persisted for an extended amount of time. An alarm will be generated. (event [{e}])

If the agent on the device is not actually going down as often as these alarms are generated, the communication timeout values in Spectrum may not be high enough.

Solution:

Out of the box, Spectrum is configured to send an SNMP get to the device at each polling interval. The default polling interval is typically 300 seconds. When this interval is reached, Spectrum sends a poll. If no response is received, Spectrum uses the values of the DCM Timeout and DCM Retry Count attributes to determine how long to wait and how many more requests to send. The defaults are a timeout of 3000 ms (3 seconds) and 3 retries. If the device still has not responded within this timeframe, an ICMP ping request is sent. If the ping is answered, a Management Agent Lost alarm is generated. If the ping is not answered, a Device Has Stopped Responding to Polls alarm is generated.
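The decision described above can be sketched in a few lines of Python. This is only an illustration of the logic as stated in this document, not Spectrum code: the names Alarm and classify_poll are hypothetical, and the two boolean inputs stand in for the outcome of the SNMP attempts and the follow-up ICMP ping.

```python
from enum import Enum


class Alarm(Enum):
    """Possible outcomes of one polling cycle (names are illustrative)."""
    NONE = "no alarm"
    MANAGEMENT_AGENT_LOST = "MANAGEMENT AGENT LOST (0x10701)"
    DEVICE_STOPPED_RESPONDING = "DEVICE HAS STOPPED RESPONDING TO POLLS (0x10009)"


def classify_poll(snmp_answered: bool, icmp_answered: bool) -> Alarm:
    """Return the alarm this document describes for one polling cycle.

    snmp_answered: True if any SNMP request (the poll or a retry) was
                   answered within the DCM timeout window.
    icmp_answered: True if the follow-up ICMP ping was answered; it is
                   only consulted when SNMP failed.
    """
    if snmp_answered:
        return Alarm.NONE
    if icmp_answered:
        # Device is reachable, but its SNMP agent is not responding.
        return Alarm.MANAGEMENT_AGENT_LOST
    # Neither SNMP nor ICMP got a reply.
    return Alarm.DEVICE_STOPPED_RESPONDING
```

For example, classify_poll(False, True) corresponds to a device that answers ping but not SNMP, which is the Management Agent Lost case.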

There are times when 3000 ms and 3 retries are not long enough for the agent to respond. In this case, a general recommendation is to increase the timeout to 10000 ms and leave the retries at 3, for a total of 30 seconds. This can be done in the Information view of the device, in the CA Spectrum Modeling Information area:

Increase the DCM Timeout (ms) attribute to 10000 (10 seconds)
Leave the DCM Retry Count at 3
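As a quick check of the arithmetic, the sketch below multiplies the timeout by the retry count, which is how this document arrives at 30 seconds. The function name snmp_wait_window_seconds is hypothetical, and whether Spectrum also waits one timeout period for the initial request before the first retry is not stated here, so treat the result as an approximation.

```python
def snmp_wait_window_seconds(timeout_ms: int, retry_count: int) -> float:
    """Approximate SNMP wait window, computed as timeout x retries.

    Assumption: this follows the document's own arithmetic and does not
    add an extra timeout period for the initial request.
    """
    return timeout_ms * retry_count / 1000


default_window = snmp_wait_window_seconds(3000, 3)        # 9.0 seconds
recommended_window = snmp_wait_window_seconds(10000, 3)   # 30.0 seconds
print(f"default: {default_window} s, recommended: {recommended_window} s")
```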

You can also modify these attributes on multiple models using the Attribute Editor and selecting the DCM attributes under the SNMP Communication folder.

These numbers are a general recommendation for Spectrum models that generate an unusual number of communication lost alarms. The changes may not be needed "across the board". To measure exact response times, use a sniffer trace.

To determine which models are generating an excessive number of these events, run the <SPECROOT>/PerfCollector9 script from the command line. It runs a MySQL query against the DDM database to obtain the event counts for 0x10d35 and 0x10daa, along with the corresponding model handles. The counts are listed in the following files:

<SPECROOT>/Performance/<machinename>_0x10d35_summary.txt
<SPECROOT>/Performance/<machinename>_0x10daa_summary.txt
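Once the per-model counts are in hand, picking the candidates for a higher DCM timeout is a simple filter and sort. The sketch below assumes the counts have already been read into a dictionary keyed by model handle. The layout of the summary files is not described in this document, so no parsing is attempted, and the model handles, counts, and threshold shown are made up for illustration.

```python
def models_over_threshold(event_counts: dict[str, int], threshold: int) -> list[tuple[str, int]]:
    """Return (model_handle, count) pairs at or above threshold, highest first."""
    flagged = [(handle, count) for handle, count in event_counts.items()
               if count >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical counts of event 0x10daa per model handle (illustrative only).
sample_counts = {"0x100123": 42, "0x100456": 3, "0x100789": 117}

for handle, count in models_over_threshold(sample_counts, threshold=10):
    print(f"{handle}: {count} events")
```

The threshold is a judgment call. A model that logs these events far more often than its neighbors is the one worth checking before raising timeouts.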