Sizing Multi-write Queues
How big should the multi-write queue be to handle a day's outage?
If a multi-write DSA receives 50,000 updates per day then a formula to determine the queue size for a 24 hour period would be:
50,000 x 2 (safety factor) = 100,000
The command would be:
set multi-write-queue = 100000;
- 50,000 updates per day is roughly one per second if most of the traffic arrives within 12 hours, with far fewer updates over the other 12 hours.
- 50,000 updates is quite a lot for a single namespace. In a distributed environment of, say, 10 namespaces, the busiest namespace may only receive, say, 10,000 updates per day.
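The sizing arithmetic above can be sketched as a small helper. This is only an illustration: the function name and the outage_hours parameter are hypothetical, and the only values taken from the text are the 2x safety factor and the form of the "set multi-write-queue" command.

```python
import math


def multiwrite_queue_size(updates_per_day: int,
                          outage_hours: float = 24.0,
                          safety_factor: float = 2.0) -> int:
    """Estimate a queue size that covers an outage of `outage_hours`.

    Assumes updates are spread evenly across the day, which is a
    simplification; the safety factor absorbs busier periods.
    """
    updates_during_outage = updates_per_day * (outage_hours / 24.0)
    return math.ceil(updates_during_outage * safety_factor)


size = multiwrite_queue_size(50_000)
# Print the console command in the form shown in the text.
print(f"set multi-write-queue = {size};")  # set multi-write-queue = 100000;
```

For the busiest namespace in the distributed example (10,000 updates per day), the same helper suggests a queue size of 20,000.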
Is this a total queue size?
No. Each DSA has a multi-write queue and the queue size parameter applies to each.
If the limit is reached for one queue, the DSA that the queue is held for (currently marked MW-FAILED) has all of its queued operations discarded and is marked QUEUE-PURGED-OUT-OF-ORDER.
In the case of a multi-server failure that involves the preferred master, recovery procedures should be invoked.
How much memory does a multi-write queue require?
Each modify takes a minimum of 4 KB to store on a multi-write (MW) queue, so a queue size of 100,000 would require at least 400 MB of memory. Large modify requests may take more than 4 KB, but most typical requests fit within this size.
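As a rough check of the memory figure, the estimate can be written as a short calculation. The helper name and the queues parameter are hypothetical; the parameter is there only because the queue size applies to each queue separately. Decimal units (1 MB = 1,000 KB) are assumed, to match the arithmetic in the text.

```python
def min_queue_memory_mb(queue_size: int,
                        kb_per_update: int = 4,
                        queues: int = 1) -> float:
    """Lower bound on memory (in MB) for `queues` full multi-write queues.

    Uses the 4 KB minimum per modify quoted in the text; real usage is
    higher when modify requests are large.
    """
    return queue_size * kb_per_update * queues / 1000


print(min_queue_memory_mb(100_000))            # 400.0
print(min_queue_memory_mb(100_000, queues=3))  # 1200.0
```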
How fast is recovery?
Recovery is as fast as the I/O allows on the peer.
However, there is a point at which recovery procedures would be much quicker than trying to catch up with a large number of updates. As a rule of thumb, if the number of updates (e.g. 200,000) is a fair proportion of the number of entries (e.g. 1,000,000), then it would be better to invoke recovery procedures.
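The rule of thumb can be expressed as a simple ratio check. The text does not define "a fair proportion", so the 20% threshold below is an assumption taken from the example figures (200,000 updates against 1,000,000 entries), and the function name is hypothetical.

```python
def prefer_recovery(queued_updates: int,
                    total_entries: int,
                    threshold: float = 0.2) -> bool:
    """Return True when invoking recovery procedures is likely to be
    quicker than replaying the queued updates.

    `threshold` is an assumed cut-off, not a documented product value.
    """
    if total_entries <= 0:
        raise ValueError("total_entries must be positive")
    return queued_updates / total_entries >= threshold


print(prefer_recovery(200_000, 1_000_000))  # True
print(prefer_recovery(10_000, 1_000_000))   # False
```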
Monitoring Multi-write Queues
DXmanager 8.1 is the best way to visualize the queues. It is a very simple way of showing whether the maximum queue size is insufficient for the outages being incurred. Note that DXmanager is available as part of the Enterprise version of the product and may not be available if you are using a product that embeds eTrust Directory. If you have any queries, please contact your CA account representative. The following queue monitoring options are all described in the eTrust Directory Administrator Guide:
- Stats trace - prints queue information once per minute
- Console "get dsp;" - prints queue information at that moment
- Alarms - are raised at 60%, 70%, 80%, 90% and 100% of queue size
- Shutdown - "set wait-for-multiwrite = true;" will stop a DSA from shutting down while there are outstanding MW queues
- SNMP - can retrieve at any time an overall queue statistic (dxStatsQueue) as well as per-server counters (dxMWQueueLength, etc.)
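To illustrate how the alarm levels relate to a queue length obtained from any of the sources above, the following sketch lists which thresholds a given queue has reached. The function name is hypothetical, and it assumes an alarm is raised as soon as the queue length reaches the stated percentage of the queue size.

```python
ALARM_THRESHOLDS_PCT = (60, 70, 80, 90, 100)  # levels listed in the text


def alarms_reached(queue_length: int, queue_size: int) -> list[int]:
    """Return the alarm percentages that `queue_length` has reached."""
    if queue_size <= 0:
        raise ValueError("queue_size must be positive")
    # Integer comparison avoids floating-point rounding at the boundaries.
    return [pct for pct in ALARM_THRESHOLDS_PCT
            if queue_length * 100 >= pct * queue_size]


print(alarms_reached(75_000, 100_000))  # [60, 70]
```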
Multi-write and Slow Links
When should multi-write-groups be used?
For any situation where there are slow links. Typically multi-write groups would be organized into regions.
- Within a region it is assumed that the links are good and so usual "write-through" replication can occur.
- Between regions, it is assumed that links are poor and so multi-write occurs in a "write-behind" mode.
Are multi-write groups a type of cascaded replication?
Yes, multi-write groups introduce an extra step in the replication.
- Within a group (as with normal multi-write), servers are meshed, so a write to one server results in a write to all of its peers.
- Between multi-write groups, there is only a single write to the peer group; the DSA in the peer group that receives it then distributes the change within its own group.
This is best explained by way of example. Assume that there are two groups of three DSAs:
- group-A: A1, A2, A3
- group-B: B1, B2, B3
A write to A1 would result in three types of replication:
- "write-through" replication A1 -> A2 and A1 -> A3.
- "write-behind" replication A1 -> B1.
- When B1 receives the write, B1 will then (write-through) replicate to B2 and B3.
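The example above can be modelled with a short sketch that lists the replication steps for a write arriving at any DSA. This is a simplified illustration and not how the product is implemented: the function name is hypothetical, and it assumes the first DSA listed in each remote group is the one that receives the write-behind update.

```python
def replication_plan(origin: str,
                     groups: dict[str, list[str]]) -> list[tuple[str, str, str]]:
    """Return (source, target, mode) steps for a write received by `origin`."""
    # Find the group that contains the DSA that received the write.
    home = next(name for name, members in groups.items() if origin in members)
    plan = []
    # Within the home group: meshed, write-through to every peer.
    for peer in groups[home]:
        if peer != origin:
            plan.append((origin, peer, "write-through"))
    # To each other group: one write-behind, then write-through inside it.
    for name, members in groups.items():
        if name == home:
            continue
        entry = members[0]  # assumed receiving DSA for the remote group
        plan.append((origin, entry, "write-behind"))
        for peer in members[1:]:
            plan.append((entry, peer, "write-through"))
    return plan


groups = {"group-A": ["A1", "A2", "A3"], "group-B": ["B1", "B2", "B3"]}
for step in replication_plan("A1", groups):
    print(step)
```

Run against the two groups from the example, this yields A1 -> A2 and A1 -> A3 (write-through), A1 -> B1 (write-behind), then B1 -> B2 and B1 -> B3 (write-through).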
How does load-sharing work with multi-write groups?
Load-sharing should only be configured within groups, because the DSAs in a group are guaranteed to be in sync. Load-sharing across groups does not make sense because the cross-group links are slow.
How does fail-over work with multi-write groups?
When forwarding queries between groups, the first available DSA in the group is used. In the above example, if B1 is not available, the query is forwarded to B2, and so on.
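The "first available DSA" rule can be sketched as follows. The function name is hypothetical, and how availability is actually detected is outside the scope of this illustration, so it is passed in as a plain set of DSA names.

```python
from typing import Optional


def forward_target(group_members: list[str],
                   available: set[str]) -> Optional[str]:
    """Return the first DSA in configured order that is available,
    or None if the whole group is unreachable."""
    for dsa in group_members:
        if dsa in available:
            return dsa
    return None


# B1 is down, so the query goes to B2.
print(forward_target(["B1", "B2", "B3"], {"B2", "B3"}))  # B2
```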
In what situations should I configure 'multi-write-async'?
None. This flag has been deprecated in favor of multi-write groups. If there is only a single peer at the other end of a high-latency link, then simply define that peer in its own multi-write group.
Multi-write and Security
I have set "min-auth" to none, and multi-write replication is now broken, what's wrong?
A customer reported that replication broke immediately after they set the client-side authentication setting "min-auth" to none.
The following message appeared in the warn log:
remoteGetNewAssoc: No compatible link type
Investigation found that while the clients were now able to connect to the directory using anonymous connections, the DSAs were configured to allow only "clear-password" binds, so the anonymous MW traffic was not being chained.
The resolution is to add 'anonymous' as an auth-level in the knowledge file of each DSA. This aligns the DSA and client-side authentication levels and allows multi-write traffic.