I need to recover a multi-write peer database that has been down for some time. How do I re-sync it with its multi-write peers?

Document ID : KB000055398
Last Modified Date : 14/02/2018

Description:

Scenario

This page gives an example of how to recover a peer that has been down for so long that its multi-write queue on the master has overflowed.

Slight variations of this procedure also apply to a peer on a machine that needs to be rebuilt (the extra step is a re-install) or to bringing a new peer into the replication set (the extra steps are adding the new peer to the knowledge file and re-initializing the servers).

Topology

  1. Democorp (preferred master)

  2. Democorp2

  3. Democorp3

  4. Democorp4

Background for scenario

Democorp4 has been offline for quite some time. Democorp (the preferred master) has been queuing updates for Democorp4, but the queue filled completely, so Democorp has had to mark Democorp4 as OFFLINE and has subsequently deleted its Democorp4 multi-write queue. Democorp4's database therefore has to be recovered manually. The following details the steps to recover Democorp4 and to ensure that all updates made while the recovery is in progress are also applied to it, so that Democorp4's database ends up synchronized with the rest of the multi-write peers.

Indications that there is a problem

Output from Democorp's (preferred master) alarm log reads:

20060816.104325 MW: Buffer (DEMOCORP4) greater than 60% full
20060816.104325 MW: Buffer (DEMOCORP4) greater than 70% full
20060816.104325 MW: Buffer (DEMOCORP4) greater than 80% full
20060816.104326 MW: Buffer (DEMOCORP4) greater than 90% full
20060816.104326 MW: Buffer (DEMOCORP4) greater than 100% full
20060816.104326 MW: Operation disabled for DSA 'DEMOCORP4'

This indicates that Democorp4 is now out of sync with the rest of the multi-write set and needs to be recovered manually.
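
When several peers are being monitored, the alarm log can be scanned for these messages automatically. The following Python sketch is an illustration only and relies on assumptions not stated in this article: the alarm log has been captured to a plain text file whose path is passed on the command line, and the message format matches the excerpt above.

import re
import sys

# Patterns modelled on the alarm log excerpt above (assumed format).
BUFFER_RE = re.compile(r"MW: Buffer \((?P<peer>[^)]+)\) greater than (?P<pct>\d+)% full")
DISABLED_RE = re.compile(r"MW: Operation disabled for DSA '(?P<peer>[^']+)'")

def scan_alarm_log(path):
    """Return the worst buffer level seen per peer and the set of disabled peers."""
    worst = {}        # peer -> highest buffer percentage reported
    disabled = set()  # peers for which multi-write has been disabled
    with open(path) as log:
        for line in log:
            m = BUFFER_RE.search(line)
            if m:
                peer, pct = m.group("peer"), int(m.group("pct"))
                worst[peer] = max(worst.get(peer, 0), pct)
            m = DISABLED_RE.search(line)
            if m:
                disabled.add(m.group("peer"))
    return worst, disabled

if __name__ == "__main__":
    worst, disabled = scan_alarm_log(sys.argv[1])
    for peer, pct in sorted(worst.items()):
        status = "DISABLED - manual recovery required" if peer in disabled else "still queuing"
        print("%s: buffer peaked at %d%% (%s)" % (peer, pct, status))

A peer reported as disabled by this check is in the same position as Democorp4 in this scenario and needs to be recovered with the procedure below.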

Recovery Process

It is assumed that the failed peer (Democorp4) is shut down.

  1. Init the preferred master

    Issue dxserver init on Democorp to reset the queue status.
    dxserver init democorp
    Prior to init, the queues read
    DEMOCORP2(): OK, total 0, waiting remote 0, confirmed local 0
    DEMOCORP3(): OK, total 0, waiting remote 0, confirmed local 0
    DEMOCORP4(): **QUEUE-PURGED-OUT-OF-ORDER**, total 0, waiting remote 0, confirmed local 0
    Post init, the queues read
    DEMOCORP2(): OK, total 0, waiting remote 0, confirmed local 0
    DEMOCORP3(): OK, total 0, waiting remote 0, confirmed local 0
    DEMOCORP4(): RECOVERING, total 0, waiting remote 0, confirmed local 0
    Initializing the preferred master before shutting down the good peer used for the dump (Democorp3) ensures that all subsequent updates chained by Democorp are captured for Democorp4 when it is brought back online.

    Note that enough time must be left between the init and the shutdown of the good peer for any updates that were outstanding before the init to be processed on that peer. The queue state reported by get dsp; can be checked at each stage; a small sketch for parsing captured console output is included after this procedure.

  2. Shutdown, dump and restart a good peer

    1. Shutdown Democorp3.
      dxserver stop democorp3
      At this point, updates will be queued for Democorp3 as well as Democorp4, as can be seen from the Democorp DSA's console:
      dsa>get dsp;
      DEMOCORP2(): OK, total 0, waiting remote 0, confirmed local 0
      DEMOCORP3(): **MW-FAILED**, total 1, waiting remote 0, confirmed local 1
      DEMOCORP4(): **MW-FAILED**, total 1, waiting remote 0, confirmed local 1
    2. Dump Democorp3 data with operational attributes
      dxdumpdb -O democorp3 -f data.ldif
    3. Restart Democorp3
      dxserver start democorp3
      Democorp3 should quickly get back in sync, as can be seen from the Democorp DSA's console:
      dsa>get dsp; 
      DEMOCORP2(): OK, total 0, waiting remote 0, confirmed local 0
      DEMOCORP3(): OK, total 0, waiting remote 0, confirmed local 0
      DEMOCORP4(): **MW-FAILED**, total 1, waiting remote 0, confirmed local 1
  3. Load and restart the failed peer

    1. Sort the data

      ldifsort data.ldif data-sorted.ldif
    2. Load the data
      dxloaddb -p <c "AU"><o "Democorp"> -a 15 -n 1277 data-sorted.ldif Democorp4
    3. Start Democorp4

      After the appropriate retry time, Democorp synchronizes the outstanding multi-write queue contents with Democorp4. Note that during the resynchronization a small number of errors may be reported due to the replay of operations that were already applied on the good peer before it was shut down and dumped. These can safely be ignored.
      dsa>get dsp;
      DEMOCORP2(): OK, total 0, waiting remote 0, confirmed local 0
      DEMOCORP3(): OK, total 0, waiting remote 0, confirmed local 0
      DEMOCORP4(): OK, total 0, waiting remote 0, confirmed local 0
  4. Check that the databases are in sync

    The database statistics should then be compared across all four DSAs to ensure that they are in sync. A high-level comparison can be obtained using dxstatdb. Example statistics are displayed below as a reference; a sketch that automates this comparison is included after the example output.

    dxstatdb democorp

    Statistics:

    Number of attributes types =      17
    Number of entries = 1293
    Number of node entries = 101
    Number of leaf entries = 1192
    Number of alias entries = 0
    Number of level 1 entries = 15
    Number of level 2 entries = 90
    Number of level 3 entries = 1188
    Number of level 4+ entries = 0
    Number of values = 12208
    Number of blob (>2K) values = 1

    dxstatdb democorp2

    Statistics:

    Number of attributes types =      17
    Number of entries = 1293
    Number of node entries = 101
    Number of leaf entries = 1192
    Number of alias entries = 0
    Number of level 1 entries = 15
    Number of level 2 entries = 90
    Number of level 3 entries = 1188
    Number of level 4+ entries = 0
    Number of values = 12208
    Number of blob (>2K) values = 1

    dxstatdb democorp3

    Statistics:

    Number of attributes types =      17
    Number of entries = 1293
    Number of node entries = 101
    Number of leaf entries = 1192
    Number of alias entries = 0
    Number of level 1 entries = 15
    Number of level 2 entries = 90
    Number of level 3 entries = 1188
    Number of level 4+ entries = 0
    Number of values = 12208
    Number of blob (>2K) values = 1

    dxstatdb democorp4

    Statistics:

    Number of attributes types =      17
    Number of entries = 1293
    Number of node entries = 101
    Number of leaf entries = 1192
    Number of alias entries = 0
    Number of level 1 entries = 15
    Number of level 2 entries = 90
    Number of level 3 entries = 1188
    Number of level 4+ entries = 0
    Number of values = 12208
    Number of blob (>2K) values = 1
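
Throughout the procedure above, progress is tracked via the get dsp; output on the preferred master's console. If that output is captured to a file, it can be checked programmatically rather than read by eye. The Python sketch below is illustrative only; it assumes the captured line format matches the examples in steps 1 to 3.

import re
import sys

# Line format taken from the "get dsp;" examples above (assumed to be stable).
DSP_RE = re.compile(
    r"^(?P<peer>\S+)\(\): (?P<state>[^,]+), total (?P<total>\d+), "
    r"waiting remote (?P<remote>\d+), confirmed local (?P<local>\d+)"
)

def parse_dsp(lines):
    """Yield (peer, state, queued total) for each multi-write peer line."""
    for line in lines:
        m = DSP_RE.match(line.strip())
        if m:
            yield m.group("peer"), m.group("state").strip("*"), int(m.group("total"))

if __name__ == "__main__":
    # Usage: check_dsp.py captured-console-output.txt (file name is illustrative)
    with open(sys.argv[1]) as captured:
        for peer, state, total in parse_dsp(captured):
            note = "" if state == "OK" and total == 0 else "  <-- needs attention"
            print("%-12s %-30s queued=%d%s" % (peer, state, total, note))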
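
The statistics check in step 4 can be automated in the same way. The sketch below assumes the output of dxstatdb for each DSA has been saved to its own text file (the file names in the usage comment are chosen here for illustration) and that the "Number of ... = N" line format matches the examples above.

import re
import sys

STAT_RE = re.compile(r"^\s*(Number of .+?)\s*=\s*(\d+)\s*$")

def load_stats(path):
    """Return {statistic name: value} parsed from one saved dxstatdb output."""
    stats = {}
    with open(path) as f:
        for line in f:
            m = STAT_RE.match(line)
            if m:
                stats[m.group(1)] = int(m.group(2))
    return stats

if __name__ == "__main__":
    # Usage: compare_stats.py democorp.stats democorp2.stats democorp3.stats democorp4.stats
    baseline_file = sys.argv[1]
    baseline = load_stats(baseline_file)
    in_sync = True
    for path in sys.argv[2:]:
        for name, value in load_stats(path).items():
            if baseline.get(name) != value:
                in_sync = False
                print("MISMATCH %s: %s=%s %s=%s" % (name, baseline_file, baseline.get(name), path, value))
    print("Databases appear to be in sync" if in_sync else "Databases differ - investigate before relying on the recovered peer")

Identical figures across all four files, as in the example output above, indicate that the recovery has completed successfully.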

Conclusions:

Following the above steps will ensure that:

  • Democorp4 is completely resynchronized with the other three DSAs

  • Democorp4 is back in the multi-write set, and

  • All updates are actively being chained by Democorp to all three multi-write DSAs.