Archive Manager consumed all the memory and swap, causing the SpectroSERVER to shut down

Document ID : KB000007587
Last Modified Date : 14/02/2018
Issue:

We had a production outage with our MLS SpectroSERVER. The SpectroSERVER shut down because the Archive Manager consumed all the available memory and swap space.

This affected the other 7 SpectroSERVERs. I had to fail over all 8 SpectroSERVERs and recycle Spectrum to recover from the issue.

Cause:

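The stack pulled from the core file contained the following information. The top frames show abort being reached through Cs_new_handler and Cs_out_of_memory, which indicates that a memory allocation failed while event data was being copied for client notification: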

0000000100c846f1 libc.so.1`_lwp_kill+8(6, 0, 1001af388, ffffffffffffffff, ffffffff77c3e000, 0) 
0000000100c847a1 libc.so.1`abort+0x118(1, 1d8, ffffffff77aca32c, 1f1efc, 0, 0) 
0000000100c84881 libPort.so.1`__1cQCs_out_of_memory6F_v_+0x3c(ffffffff7cd06918, 88, ffffffff7cd05bf8, 0, ffffffff78100200, 0) 
0000000100c84931 libPort.so.1`__1cOCs_new_handler6F_v_+0x3c(ffffffff7cd068a0, 40, ffffffff7cd05bf8, 0, ffffffff77c3e000, 10019bac0) 
0000000100c849e1 libCrun.so.1`__1c2n6FL_pv_+0x44(2b, 1, 0, ffffffff7d79ff20, 0, ffffffff77e0f4a0) 
0000000100c84a91 libVPapi.so.1`__1cHattrdup6FpkvnKCsAttrDescMCsAttrType_e__pv_+0x30(1629a7860, 1100c84c41, 1629a7860, ffffffff77ad17dc, ffffffff77c3e000, 2b)
0000000100c84b51 libVPapi.so.1`__1cJCsVarData2t5B6Mr0_v_+0x38(100b108c0, 1f326a9e0, 11, 1, 100b108c0, ffffffff77e0f4a0) 
0000000100c84c11 libVPapi.so.1`__1cNCsVarDataListEcopy6Mrk0_nHCsErrorJCsError_e__+0x2c(6e3843510, 1629aae60, 1f326a9e0, 0, 100b07670, ffffffff77e0f4a0)
0000000100c84cc1 libVPapi.so.1`__1cNCsVarDataList2G6Mrk0_r0_+0x1c(6e3843510, 1629aae60, 100b07670, 6e3843448, 0, 6e3843530) 
0000000100c84d71 libVPapi.so.1`__1cOCsEventMessageEcopy6Mrk0i_v_+0x124(6e3843360, 1629aacb0, 0, 0, ffffffff77c3e000, ffffffff7d79ff20) 
0000000100c84e51 libVPapi.so.1`__1cOCsEventMessage2t5B6Mrk0i_v_+0x90(6e3843360, 1629aacb0, 1001a2868, 6e3843530, ffffffff7d79ff20, 0) 
0000000100c84f01 libVPapi.so.1`__1cLCsEventListEcopy6Mrk0_v_+0x6c(100b0bd80, ec0, c00, 7, ffffffff7d79ff20, 2400) 
0000000100c84fb1 libVPapi.so.1`__1cLCsEventList2t5B6Mrk0_v_+0x1c(100b0bd80, 259c1e1d0, 1000, 100b0bd88, ffffffff7d79ff20, ffffffff77e0f4a0)
0000000100c85061 libVPapi.so.1`__1cQCsScrollResponseEcopy6kM_pnOCsVnmParmBlock__+0x40(259c1e1b0, 1000, 1dab68, ffffffff7d79ff20, ffffffff77c3e000, 100b0bd60)
0000000100c85111 libVPapi.so.1`__1cICsVnmMsg2t5B6Mrk0_v_+0x54(100b07640, 1f325f510, 0, 0, 100b07640, ffffffff77e0f4a0) 
0000000100c851c1 libgserv.so.1`__1cQCsNMgrClientListGnotify6MpnICsVnmMsg__nHCsErrorJCsError_e__+0xb8(100beb5d8, 1f325f510, 1014937b0, 101541310, 101541318, ffffffff7ed1fdf0)
0000000100c85271 __1cRCsEventAttachListGnotify6MrknOCsEventMessage__nHCsErrorJCsError_e__+0x138(100beb5d8, 1001b0, 1629aac98, 259c1e1b0, 100000, 100e57c80)
0000000100c85321 __1cMCsLogManagerNprocess_event6MrnOCsEventMessage_rknNCsLscpeHandle__nHCsErrorJCsError_e__+0x5c(100beb5c0, 1f326c620, 1f326c830, b1d2a, b1d29, 1001a3078)
0000000100c853d1 __1cRCsEventLogManagerNprocess_event6MrnOCsEventMessage_rknNCsLscpeHandle__nHCsErrorJCsError_e__+0xc(100beb5c0, 1f326c620, 1f326c830, 1f325f5a0, 10003e71c, 1001a3078)
0000000100c85481 __1cMCsLogManagerJqueue_log6M_v_+0x6c(100beb5c0, 1f326c5a0, 1001af000, 1001afde0, 1001af, 100000) 
0000000100c85531 __1cMCsLogManagerUqueue_thread_wrapper6Fp0_v_+4(100beb5c0, 100c0a680, 0, 0, 1e, 0) 
0000000100c855e1 libmoot.so.1`__1cGThreadUcall_thread_function6M_v_+8(100c3f340, 1e, 0, 100040ef8, 0, 0) 
0000000100c85691 libmoot.so.1`moot_thread_start+0x30(1e, 70, 0, ffffffff7d917328, ffffffff78100200, 0) 
0000000100c85741 libc.so.1`__makecontext_v2+0x108(0, 0, 0, 0, 0, 0) 

Resolution:

Sustaining Engineering developed the Spectrum_10.01.01.D229 debug patch, which adds debugging code to help understand the memory growth in ArchMgr.

With the patch applied, the Archive Manager logs the following details to the ARCHMGR.OUT file at a regular interval:

Jul 25 10:38:37 : ArchMgr performance statistics:
                        Total   Change  Rate(/sec)
Seconds counted:        1261    61
Queued messages:        3201    3012    49.4    (now 0, avg. 467, max 2296)
Dequeued messages:      3201    3012    49.4
Notifications:          3201    3012    49.4
Inserted messages:      2982    2914    47.8
Failed Inserts:         0       0       0.0
Requests:               164     41      0.7     (now 0, avg. 1, max 1)
        (avg. dur. 0.0 sec, max dur. 0.0 sec)
Responses:              164     41      0.7
No responses:           0       0       0.0
Sent messages:          0       0       0.0
Attachments:            0       0       0.0     (now 0, avg. 0, max 0)
Detachments:            0       0       0.0
Memory Usage: 30.7MB real / 134.8MB virtual
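
To follow memory growth over time, the "Memory Usage" line of each statistics block can be pulled out of ARCHMGR.OUT. The following is a minimal Python sketch, not part of the patch. It assumes the header and memory lines always look like the sample above and that the figures are always reported in MB; the function name memory_samples is made up for illustration.

```python
import re

# Header line, e.g. "Jul 25 10:38:37 : ArchMgr performance statistics:"
HEADER_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) : ArchMgr performance statistics:"
)
# Memory line, e.g. "Memory Usage: 30.7MB real / 134.8MB virtual"
# Assumption: the unit is always MB, as in the sample output.
MEMORY_RE = re.compile(
    r"Memory Usage:\s*([\d.]+)MB real\s*/\s*([\d.]+)MB virtual"
)


def memory_samples(lines):
    """Yield (timestamp, real_mb, virtual_mb) for each statistics block."""
    timestamp = None
    for line in lines:
        header = HEADER_RE.match(line.strip())
        if header:
            # Remember the timestamp until the matching memory line appears.
            timestamp = header.group(1)
            continue
        memory = MEMORY_RE.search(line)
        if memory:
            yield timestamp, float(memory.group(1)), float(memory.group(2))


SAMPLE = """\
Jul 25 10:38:37 : ArchMgr performance statistics:
Seconds counted:        1261    61
Memory Usage: 30.7MB real / 134.8MB virtual
"""

for stamp, real_mb, virtual_mb in memory_samples(SAMPLE.splitlines()):
    print(stamp, real_mb, virtual_mb)
```

Against a real log, the lines of ARCHMGR.OUT would be passed in place of SAMPLE. A steadily rising real or virtual figure across blocks is the growth pattern the debug output is meant to expose.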

 

The debug output is disabled by default. To enable it, add the statement report_performance=TRUE to the .configrc file. By default the performance data is written to ARCHMGR.OUT every 10 minutes (the interval is stated as 3600 seconds, which would be 60 minutes, so confirm the effective default from the timestamps in ARCHMGR.OUT). To change the interval, add report_perf_interval=<number_seconds> to the .configrc file.
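
The .configrc change can be made by hand in any text editor. As an illustration only, the Python sketch below shows the two statements being added to the file contents. It assumes .configrc holds one key=value statement per line; the function name enable_perf_reporting and the interval of 600 seconds are hypothetical choices, not values taken from the patch.

```python
def enable_perf_reporting(config_text, interval_seconds=None):
    """Return config_text with report_performance=TRUE set.

    If interval_seconds is given, report_perf_interval is set as well.
    Assumption: the file holds one key=value statement per line.
    """
    wanted = {"report_performance": "TRUE"}
    if interval_seconds is not None:
        wanted["report_perf_interval"] = str(interval_seconds)

    kept = []
    for line in config_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key in wanted:
            continue  # drop any older value; the new one is appended below
        kept.append(line)

    kept.extend(f"{key}={value}" for key, value in wanted.items())
    return "\n".join(kept) + "\n"


# Example with a made-up existing setting and a 600 second interval.
print(enable_perf_reporting("some_existing_setting=1\n", interval_seconds=600))
```

Existing statements are left untouched, and rerunning the function does not duplicate the two entries.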

 

The patch also ships a second ArchMgr binary, ArchMgr_memc, which provides a further option for gathering memory debug data. Use it only at the request of Sustaining Engineering. To use ArchMgr_memc, first make a backup copy of the regular ArchMgr binary. Then make a copy of ArchMgr_memc and name it ArchMgr. Finally, run 'touch checkMemory' in the $SPECROOT/SS/DDM directory before starting Archive Manager.
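
The three steps above (back up, copy, create the checkMemory file) can be sketched in Python as follows. This is an illustration under stated assumptions, not a supported procedure: the directory holding the ArchMgr binaries is passed in as bin_dir because this article does not say where they live, the backup name ArchMgr.orig is made up, and the function name switch_to_memc is hypothetical. Only the location of checkMemory ($SPECROOT/SS/DDM) comes from the text.

```python
import shutil
from pathlib import Path


def switch_to_memc(bin_dir, ddm_dir):
    """Swap in ArchMgr_memc and create the checkMemory marker file.

    bin_dir: directory containing ArchMgr and ArchMgr_memc (assumption:
             supplied by the caller, not stated in this article).
    ddm_dir: the $SPECROOT/SS/DDM directory.
    Returns the path of the backup copy of the regular binary.
    """
    bin_path = Path(bin_dir)
    regular = bin_path / "ArchMgr"
    backup = bin_path / "ArchMgr.orig"  # hypothetical backup name
    memc = bin_path / "ArchMgr_memc"

    # Step 1: back up the regular binary, but never overwrite an
    # existing backup (it may already hold the original ArchMgr).
    if not backup.exists():
        shutil.copy2(regular, backup)

    # Step 2: copy ArchMgr_memc over ArchMgr.
    shutil.copy2(memc, regular)

    # Step 3: equivalent of running 'touch checkMemory' in the DDM directory.
    (Path(ddm_dir) / "checkMemory").touch()
    return backup
```

Run this only while Archive Manager is stopped, since the text calls for the checkMemory file to exist before Archive Manager starts. To return to the regular binary afterwards, copy the backup back over ArchMgr.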

 

Provide a copy of the ARCHMGR.OUT file to Sustaining Engineering for review.