How can I determine that multiwrite replication has failed?

Document ID : KB000055337
Last Modified Date : 14/02/2018

Method 1. Log File Detection

Several eTrust Directory logs can indicate that a multiwrite issue has occurred.

When any of these errors is encountered, the directory administrator or operator must act immediately. The presence of these messages in the various log files indicates that a multiwrite master has detected a communications problem with a multiwrite peer DSA.

The issue should be rectified as soon as possible to prevent the multiwrite queue from filling up.

Each log file is discussed below, together with the exact eTrust Directory 8.1 error messages.

The Alarm Log

The alarm log contains critical errors. This log is 'always on' which means it's not controlled by a logging command within the DSA configuration. When there is a multiwrite replication error, the following will be recorded:

            20061207.113023.310 MW: DSA 'ServerB' cannot be contacted
            20061207.113125.018 MW: DSA 'ServerB' cannot be contacted
            20061207.113227.207 MW: DSA 'ServerB' cannot be contacted

This error indicates that for some reason, the remote multiwrite DSA cannot be contacted.

If any multiwrite DSA has been down for a long time, the multiwrite queue may start to fill up. If a queue reaches 60% full, the following message will appear in the alarm log:

            20061207.124041.646 MW: Buffer (ServerB) greater than 60% full

After the 60% mark, for each 10% increase in multiwrite queue size, there is an additional message:

            20061207.124057.343 MW: Buffer (ServerB) greater than 70% full
            20061207.124057.980 MW: Buffer (ServerB) greater than 80% full
            20061207.124058.537 MW: Buffer (ServerB) greater than 90% full
            20061207.124059.144 MW: Buffer (ServerB) greater than 100% full

Once the multiwrite queue reaches 100% full, the DSA will suspend the queuing operations for the offline remote DSA and remove the queue completely. When the DSA does this, the following message will be written to the alarm log:

            20061207.124059.154 MW: Operation disabled for DSA 'ServerB' 

At this point, the databases of the master multiwrite DSA and the remote multiwrite peer DSA must be re-synchronized manually.

If the following error appears in a multiwrite DSA's alarm log, the master multiwrite DSA was shut down while it still had updates in its multiwrite queue.

            20061207.122225.992 DXserver stopping but multiwrite queues exist
            20061207.122226.654 Shutting down DXserver

Shutting down a multiwrite DSA while updates are pending in its multiwrite queue means that the remote DSA the queue was held for is now out of sync with the master multiwrite DSA. The master and remote DSAs must be re-synchronized manually to rectify the situation.
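
The alarm log messages listed above are regular enough to be detected automatically. The following is a minimal Python sketch, not part of eTrust Directory, that classifies alarm log lines using patterns copied from the excerpts in this article. The function names and event labels ('unreachable', 'buffer', 'disabled', 'stopped_with_queue') are hypothetical, and the sketch assumes each line starts with a timestamp in the format shown above.

```python
import re
from datetime import datetime

# Message patterns copied from the alarm log excerpts quoted in this article.
ALARM_PATTERNS = [
    ("unreachable", re.compile(r"MW: DSA '(?P<dsa>[^']+)' cannot be contacted")),
    ("buffer", re.compile(r"MW: Buffer \((?P<dsa>[^)]+)\) greater than (?P<pct>\d+)% full")),
    ("disabled", re.compile(r"MW: Operation disabled for DSA '(?P<dsa>[^']+)'")),
    ("stopped_with_queue", re.compile(r"DXserver stopping but multiwrite queues exist")),
]


def parse_alarm_line(line):
    """Return a dict for a multiwrite alarm line, or None for any other line."""
    parts = line.strip().split(None, 1)
    if len(parts) != 2:
        return None
    stamp, message = parts
    try:
        # Timestamp layout assumed from the samples: YYYYMMDD.HHMMSS.mmm
        when = datetime.strptime(stamp, "%Y%m%d.%H%M%S.%f")
    except ValueError:
        return None
    for kind, pattern in ALARM_PATTERNS:
        match = pattern.search(message)
        if match:
            event = {"time": when, "kind": kind, **match.groupdict()}
            if "pct" in event:
                event["pct"] = int(event["pct"])
            return event
    return None


def scan_alarm_log(lines):
    """Return the multiwrite events found in an iterable of log lines."""
    return [e for e in map(parse_alarm_line, lines) if e is not None]
```

A monitoring script could pass the lines of the alarm log file to scan_alarm_log() and raise an alert for any 'disabled' or 'stopped_with_queue' event, since both require manual re-synchronization.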

The Stats Log

As the multiwrite queue increases in size, the statistics log will show the following:

      20061207.111733.141 STATS : Assocs 1 NilCredit 0 NoTicks 0 Queue 0+0 MWQ 0/0 Active 0 Ops 0 Entries 0
      20061207.113133.030 STATS : Assocs 1 NilCredit 0 NoTicks 1 Queue 0+0 MWQ 1/1 Active 0 Ops 0 Entries 0
      20061207.113433.067 STATS : Assocs 1 NilCredit 0 NoTicks 0 Queue 0+0 MWQ 3/3 Active 0 Ops 0 Entries 0

The 'MWQ' field of the stats log contains the multiwrite queue information. In the example above, the multiwrite queue statistics of '3/3' can be read as:

First column : Unsent multiwrite requests
Second column : Currently queued multiwrite requests
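
As an illustration of how the 'MWQ' field could be tracked over time, the following Python sketch extracts the two values from STATS lines and reports whether the queued count has risen. It is only a sketch: the function names are hypothetical, and the line layout is assumed to match the excerpt above.

```python
import re

# Matches the "MWQ a/b" field of a STATS line, as in the excerpt above.
MWQ_RE = re.compile(r"\bSTATS\s*:.*\bMWQ (\d+)/(\d+)")


def parse_mwq(line):
    """Return (unsent, queued) as ints, or None if the line has no MWQ field."""
    match = MWQ_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def queue_is_growing(lines):
    """True if the queued count rose between the first and last STATS sample."""
    samples = [s for s in map(parse_mwq, lines) if s is not None]
    return len(samples) >= 2 and samples[-1][1] > samples[0][1]
```

Applied to the three sample lines above, parse_mwq() yields (0, 0), (1, 1) and (3, 3), so queue_is_growing() returns True. A steadily rising value is an early warning that a peer DSA is not receiving updates.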

The Warn Log

The warn log contains the following errors:

      20061207.120021.375 ERROR : DSA ServerB needs attention
      20061207.120123.555 WARN : Remote DSA 'ServerB' aborted
      20061207.120225.736 WARN : Remote DSA 'ServerB' aborted

The Trace Log

With the trace command "set trace=all;" in the DSA's logging configuration file, trace output such as the following suggests a multiwrite DSA connectivity issue:

      ! Connecting to TCP aaa.bbb.ccc.ddd:{port}
      ! remoteSendRequest: bind sent to DSA 'ServerB'
      ! MW queued: 000/013
      ! doLocalResponse: op->multiWrite=TRUE
      ! System error: recv: A request to send or receive data was disallowed because the socket is not connected and (when sending on a datagram socket using a sendto call) no address was supplied. (0x2749)
      ! comms_recv: Error
      ! Call closed: notify 4c2630
      ! PABTind
      ! ----------RemoteEvent (001/000)----------20061207.113023.310
      ! > > (Remote) <- #1 [ServerB] DSP ABORT-REQ
      > invoke-id = 0 credit = 24
      > ? 20061207.113023.310 WARN : Remote DSA 'ServerB' aborted
      ! Adding DSA ServerB to fast queue
      ! RemoteRetryAssoc
      ! RemoteMWFailed * 20061207.113023.310 MW: DSA 'ServerB' cannot be contacted

The very last message in the trace log is the same as the message seen in the alarm log. The interpretation is that the multiwrite master DSA attempts to contact 'ServerB' in order to chain a multiwrite operation to it. However, the remote DSA is unavailable, so the multiwrite master queues the update transaction.

Method 2. SNMP Polling

Each DSA has its own set of SNMP metrics that can be polled using any SNMP-compliant application. The User Datagram Protocol (UDP) port used for this polling can be found in the DSA's knowledge configuration file.

An example is:

snmp-port = 19389

The SNMP community string is 'public'.

When multiwrite replication is functioning correctly, the SNMP metrics, as viewed from the multiwrite master, will look like the following output:

      dxRemoteDsaName.1 : ServerB
      dxMWQueueLength.1 : 0
      dxMWStatus.1 : 1 (ok)
      dxMWPendingRemote.1 : 0
      dxMWConfirmedLocal.1 : 0

In the above example, the remote multiwrite peer is known as "ServerB".

When a multiwrite DSA is offline, the preferred master detects this and several of the SNMP metrics change. See below for an example:

      dxRemoteDsaName.1 : ServerB
      dxMWQueueLength.1 : 1
      dxMWStatus.1 : 2 (failed)
      dxMWPendingRemote.1 : 0
      dxMWConfirmedLocal.1 : 1

As the multiwrite queue for the remote DSA increases in size, the metrics look like the following:

      dxRemoteDsaName.1 : ServerB
      dxMWQueueLength.1 : 3
      dxMWStatus.1 : 9 (failed-sent)
      dxMWPendingRemote.1 : 1
      dxMWConfirmedLocal.1 : 3

In the above example output, it can be seen that there are 3 updates in the multiwrite queue, with one 'pending' transfer to the remote multiwrite DSA. The 'confirmedLocal' SNMP value of '3' indicates that 3 updates have been committed locally. Also note that the multiwrite status has changed from '2' to '9'; this indicates a change in the status of the multiwrite connection, although both '2' and '9' indicate a multiwrite failure.
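
To show how polled values might be evaluated by a script, the following Python sketch parses metric output in the textual layout shown above and lists the peers whose status is not 'ok'. The layout ("name.index : value") depends on the SNMP tool in use, so the parser is an assumption tied to these examples, and the function names are hypothetical. Only the status values 1 (ok), 2 (failed) and 9 (failed-sent) are documented in this article; treating every value other than 1 as a problem is a cautious assumption.

```python
import re

# One metric per line, e.g. "dxMWStatus.1 : 9 (failed-sent)".
LINE_RE = re.compile(r"^\s*(?P<name>\w+)\.(?P<index>\d+)\s*:\s*(?P<value>.*?)\s*$")


def parse_metrics(text):
    """Group metrics by peer index; numeric values become ints."""
    rows = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue
        value = match.group("value")
        first = value.split(" ", 1)[0]
        parsed = int(first) if first.isdigit() else value
        rows.setdefault(int(match.group("index")), {})[match.group("name")] = parsed
    return rows


def peers_needing_attention(rows):
    """Names of peers whose dxMWStatus is anything other than 1 (ok)."""
    return [
        row.get("dxRemoteDsaName", "?")
        for row in rows.values()
        if row.get("dxMWStatus") != 1
    ]
```

For the last example above, parse_metrics() returns a queue length of 3 and a status of 9 for index 1, and peers_needing_attention() returns a list containing 'ServerB'.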

Method 3. SNMP Traps

SNMP traps can also be used to determine if there are multiwrite problems.

To configure SNMP traps for your multiwrite DSA, include the following commands in the DSA's logging configuration file:

      set snmp-log= udp {hostname} port {port #};
      set op-error-trap=true;

The DSA will then send SNMP traps to the host and port that has been specified.

An example of a related SNMP trap is:

      Trap (6, 0) from aaa.bbb.ccc.ddd (uptime 188.00 sec)
      - Alarm - sysName SERVERA
      sysDescr CA eTrust DXserver
      sysLocation MW: DSA 'ServerB' cannot be contacted

As with all of the DSA's SNMP traps, the message string is stored in the SNMP value 'sysLocation'.

When the multiwrite queue reaches 60% full, the DSA will issue the following SNMP trap:

      Trap (6, 0) from aaa.bbb.ccc.ddd (uptime 501.00 sec)
      - Alarm - sysName SERVERA
      sysDescr CA eTrust DXserver
      sysLocation MW: Buffer (ServerB) greater than 60% full

It will then send a corresponding trap for every 10% increase above 60%.

Once the multiwrite queue reaches 100%, the DSA will suspend its multiwrite functions and delete the full queue. At that time it will send the two SNMP traps displayed below:

      Trap (6, 0) from aaa.bbb.ccc.ddd (uptime 8037.00 sec)
      - Alarm - sysName SERVERA
      sysDescr CA eTrust DXserver
      sysLocation MW: Buffer (ServerB) greater than 100% full
      Trap (6, 0) from aaa.bbb.ccc.ddd (uptime 8037.00 sec)
      - Alarm - sysName SERVERA
      sysDescr CA eTrust DXserver
      sysLocation MW: Operation disabled for DSA 'ServerB'
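
Because the alarm text travels in 'sysLocation', a trap receiver's text output can be reduced to the underlying messages. The following Python sketch assumes the receiver prints traps in the layout shown above, starting each trap with "Trap (" and placing 'sysLocation' last. That layout is specific to the tool that produced these examples, and the function name is hypothetical.

```python
import re

# One trap runs from "Trap (" up to the next "Trap (" or the end of the text.
TRAP_RE = re.compile(r"Trap \(.*?(?=Trap \(|\Z)", re.DOTALL)


def trap_messages(text):
    """Return the sysLocation message of each trap found in the text."""
    messages = []
    for match in TRAP_RE.finditer(text):
        # Collapse line wraps so the message can be read as one string.
        flat = " ".join(match.group(0).split())
        found = re.search(r"sysLocation (.*)$", flat)
        if found:
            messages.append(found.group(1))
    return messages
```

For the two traps shown above, trap_messages() returns the strings "MW: Buffer (ServerB) greater than 100% full" and "MW: Operation disabled for DSA 'ServerB'", which are the same messages that appear in the alarm log.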

Method 4. DXconsole Monitoring

A DXconsole session can be used to monitor the multiwrite status directly.

To view the multiwrite queue status, issue the following command whilst connected to a multiwrite DSA via the DXconsole:

      get dsp; 

For a multiwrite system performing normally the output will read:

      local-prefix =
      local-dsa = SERVERA
      multi-chaining = TRUE
      always-chain-down = FALSE
      multi-write-disp-recovery = FALSE
      multi-write-disp-queue = FALSE
      wait-for-multiwrite = FALSE
      multi-write-queue = 10
      multi-write-credit = 0
      ServerB(): OK, total 0, waiting remote 0, confirmed local 0

This shows that ServerA's multiwrite queue for ServerB is functioning normally and has nothing queued.

For a multiwrite DSA that is currently queuing operations for a remote DSA that is offline, the output will read:

      dsa> get dsp;
      local-prefix =
      local-dsa = SERVERA
      multi-chaining = TRUE
      always-chain-down = FALSE
      multi-write-disp-recovery = FALSE
      multi-write-disp-queue = FALSE
      wait-for-multiwrite = FALSE
      multi-write-queue = 10
      multi-write-credit = 0
      ServerB(): **MW-FAILED**, total 1, waiting remote 0, confirmed local 1

Output for a failed & purged multiwrite queue will read:

      dsa> get dsp;
      local-prefix =
      local-dsa = SERVERA
      multi-chaining = TRUE
      always-chain-down = FALSE
      multi-write-disp-recovery = FALSE
      multi-write-disp-queue = FALSE
      wait-for-multiwrite = FALSE
      multi-write-queue = 10
      multi-write-credit = 0
      ServerB(): **QUEUE-PURGED-OUT-OF-ORDER**, total 0, waiting remote 0, confirmed local 0

This shows that ServerA's multiwrite queue for ServerB has been purged due to the queue filling up.
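
If the output of 'get dsp;' is captured to text, the per-peer status line can be checked by a script. The following Python sketch parses status lines in the layout shown in the three examples above. It is an illustration only: the function name and dictionary keys are hypothetical, and 'OK', 'MW-FAILED' and 'QUEUE-PURGED-OUT-OF-ORDER' are the only states taken from this article.

```python
import re

# Matches e.g. "ServerB(): **MW-FAILED**, total 1, waiting remote 0, confirmed local 1"
PEER_RE = re.compile(
    r"^\s*(?P<dsa>\S+?)\(\):\s*(?P<state>[^,]+),\s*total (?P<total>\d+),"
    r"\s*waiting remote (?P<waiting>\d+),\s*confirmed local (?P<confirmed>\d+)\s*$"
)


def parse_peer_status(output):
    """Return one dict per peer status line found in 'get dsp;' output."""
    peers = []
    for line in output.splitlines():
        match = PEER_RE.match(line)
        if match is None:
            continue
        peers.append({
            "dsa": match.group("dsa"),
            # Strip the asterisks that surround non-OK states.
            "state": match.group("state").strip("* "),
            "total": int(match.group("total")),
            "waiting_remote": int(match.group("waiting")),
            "confirmed_local": int(match.group("confirmed")),
        })
    return peers
```

Any peer whose state is not 'OK' warrants attention; a state of 'QUEUE-PURGED-OUT-OF-ORDER' means the queue has already been discarded and manual re-synchronization is needed.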