Data Aggregator (DA) lost contact with the Data Collector (DC)

Document ID : KB000032486
Last Modified Date : 14/02/2018

Issue:

The Data Collector lost contact with the Data Aggregator because the Data Collector partition ran out of space.

The following event was seen on the Data Aggregator:

A ACCOUNTING LogEvent has been DCStatus on DataAggregator:10.95.xxx.xx with the IP address of 10.95.xxx.xx.

State: CLOSED
Severity: UNKNOWN
Occurred On: Fri Sep 11 11:43:02 EDT 2015
Item Description: Linux txanunxlipcp002.itnet.local 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64
Event Description: Data Collector pink-charlie.net:77156654-7599-4fc4-8f57-51e7f48a3aa1 has lost contact with the Data Aggregator. The last heartbeat was received 58 seconds ago. Please refer to the Data Aggregator and Data Collector logs for more details.
Event ID: 6666065

Environment:

CA Performance Manager 2.5.0 on Red Hat Linux 2.6
 

Details:

The following was seen in the /opt/CA/IMDataAggregator_1_CA_RE\CA\IMDataAggregator\apache-karaf-2.3.0\data\log\karaf.log:
 


INFO | er-thread-199257 | 2015-09-11 11:42:53,479 | nagedDeviceResourceDiscoveryImpl | nagedDeviceResourceDiscoveryImpl 539 | .im.aggregator.discovery | | MF {http://im.ca.com/normalizer}NormalizedAddressInfo on device 100073 is currently being discovered. Bailed out.

INFO | er-thread-199262 | 2015-09-11 11:42:55,523 | nagedDeviceResourceDiscoveryImpl | nagedDeviceResourceDiscoveryImpl 539 | .im.aggregator.discovery | | MF {http://im.ca.com/normalizer}NormalizedAddressInfo on device 99207 is currently being discovered. Bailed out.

WARN | atTimer-thread-3 | 2015-09-11 11:43:02,796 | DCHeartBeatLog | r.controller.DCMHeartbeatManager 131 | ore.collector.interfaces | | No response has been received from DC 623 in timeframe 58913 (ms):

DC Contact Lost, previous responses=3506754463, current responses=3506754463

WARN | atTimer-thread-3 | 2015-09-11 11:43:02,797 | DCHeartBeatLog | r.controller.DCMHeartbeatManager 147 | ore.collector.interfaces | | No response has been received from txanunxlipcp006.goldlnk.rootlnka.net:77156654-7599-4fc4-8f57-51e7f48a3aa1 in timeframe 58913 ms: DC Contact Lost. Previous responses=3506754463, current responses=3506754463. Time since last heartbeat check: 10000 ms

INFO | atTimer-thread-3 | 2015-09-11 11:43:02,797 | IPDomainStatusManager | oller.impl.IPDomainStatusManager 71 | ager.core.collector.impl | | Set IP domain 2 status to NO_DC_RUNNING

INFO | ory-thread-20593 | 2015-09-11 11:43:02,797 | DeviceContactStatusImpl | aultmgmt.DeviceContactStatusImpl 341 | .im.aggregator.discovery | | DC 623 status changed from RUNNING to CONTACT_LOST

ERROR | atTimer-thread-3 | 2015-09-11 11:43:02,797 | DCHeartBeatLog | impl.DCMContactStatusManagerImpl 116 | ager.core.collector.impl | | Lost contact to DC pink-charlie.net:77156654-7599-4fc4-8f57-51e7f48a3aa1. State changed from RUNNING to CONTACT_LOST. The last heartbeat was received 58913 ms ago

No further related entries were logged until the Data Collector was stopped and restarted, at which point contact was re-established:

 INFO  | atTimer-thread-3 | 2015-09-11 13:27:32,797 | DCHeartBeatLog | impl.DCMContactStatusManagerImpl  126 | ager.core.collector.impl |       | Contact established to DC txanunxlipcp006.goldlnk.rootlnka.net:77156654-7599-4fc4-8f57-51e7f48a3aa1. State changed from CONTACT_LOST to RUNNING.  The last heartbeat was received 2438 ms ago
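The length of the outage can be read off the two "State changed" entries above. A minimal sketch of doing this automatically is shown here, assuming karaf.log keeps the pipe-delimited layout seen in the excerpt (level | thread | timestamp | ...). The function names dc_state_changes and outage_windows are hypothetical and are not part of CA Performance Manager.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt: LEVEL | thread | timestamp | ... | message
STATE_CHANGE = re.compile(
    r"^\s*(?P<level>[A-Z]+)\s*\|[^|]*\|\s*"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s*\|"
    r".*State changed from (?P<old>[A-Z_]+) to (?P<new>[A-Z_]+)"
)


def dc_state_changes(lines):
    """Yield (timestamp, old_state, new_state) for each 'State changed' entry."""
    for line in lines:
        m = STATE_CHANGE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            yield ts, m.group("old"), m.group("new")


def outage_windows(lines):
    """Pair each change to CONTACT_LOST with the next change back to RUNNING.

    Returns a list of (lost_at, restored_at, duration) tuples.
    """
    windows = []
    lost_at = None
    for ts, _old, new in dc_state_changes(lines):
        if new == "CONTACT_LOST":
            lost_at = ts
        elif new == "RUNNING" and lost_at is not None:
            windows.append((lost_at, ts, ts - lost_at))
            lost_at = None
    return windows


if __name__ == "__main__":
    import sys

    # Usage: python dc_outages.py /path/to/karaf.log
    with open(sys.argv[1], errors="replace") as fh:
        for lost, restored, duration in outage_windows(fh):
            print(f"lost {lost}  restored {restored}  duration {duration}")
```

For the two entries in this article (11:43:02,797 and 13:27:32,797) this gives a single outage window of 1 hour 44 minutes 30 seconds.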

 

Cause determined:

In /var/adm/messages on the Data Collector host (on Red Hat systems the system log is normally /var/log/messages):

Sep 11 11:42:30 pink-charlie abrt[29632]: Write error: No space left on device
Sep 11 11:42:31 pink-charlie abrt[29632]: Error writing '/var/spool/abrt/ccpp-2015-09-11-11:42:08-3140.new/coredump'
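Note that the write error was logged at 11:42:30, about half a minute before the Data Aggregator reported the lost contact at 11:43:02. To look for the same symptom in another system log, a small filter along these lines may help. It is only a sketch: it assumes the traditional syslog line format shown above, and find_disk_full is a hypothetical helper name.

```python
import re

# Traditional syslog layout, as in the excerpt: "Mon DD HH:MM:SS host proc[pid]: message"
SYSLOG_LINE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<proc>[^\[:\s]+)(?:\[(?P<pid>\d+)\])?: (?P<msg>.*)$"
)


def find_disk_full(lines, needle="No space left on device"):
    """Return (timestamp, host, process, message) for lines reporting a full disk."""
    hits = []
    for line in lines:
        m = SYSLOG_LINE.match(line.rstrip("\n"))
        if m and needle in m.group("msg"):
            hits.append((m.group("ts"), m.group("host"), m.group("proc"), m.group("msg")))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_disk_full.py /path/to/messages
    with open(sys.argv[1], errors="replace") as fh:
        for ts, host, proc, msg in find_disk_full(fh):
            print(f"{ts} {host} {proc}: {msg}")
```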

Resolution:

Add space to the affected partition on the Data Collector host so that there is enough room for core dumps (in this case abrt was writing them under /var/spool/abrt).
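After adding space, the free room on the partition can be checked with a short script. The following is a sketch only: partition_headroom is a hypothetical helper, and the 2 GiB threshold is an illustrative assumption, not a documented requirement. Size it to the core dumps you expect.

```python
import shutil


def partition_headroom(path, required_bytes):
    """Return (free_bytes, ok) for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free, usage.free >= required_bytes


if __name__ == "__main__":
    # /var/spool/abrt is where the core dump write failed in this incident.
    # The 2 GiB threshold below is an assumed, illustrative value.
    path = "/var/spool/abrt"
    required = 2 * 1024**3
    free, ok = partition_headroom(path, required)
    status = "OK" if ok else "LOW"
    print(f"{path}: {free / 1024**3:.1f} GiB free ({status})")
```

Running a check like this periodically on the Data Collector host could flag a filling partition before the heartbeat is lost.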