What sets this issue apart is that there are no harvest duration spikes, GC spikes, or any other obvious performance problems on the MOM or Collector. The cluster simply starts to misbehave without warning. Once that happens, restarting the MOM may help immediately, but the issue can come back very quickly. Most of the time it seems to occur in an ETC setup with multiple MOMs, and one or more of those MOMs can be affected.
Taking thread dumps is the only way to tell if this is the case.
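
As a rough sketch of how such dumps might be collected, the Python script below calls the JDK's `jstack` tool several times against the MOM's Java process and saves each dump to a timestamped file, so that consecutive dumps can be compared. It assumes `jstack` is on the PATH and that the script runs as the same OS user as the MOM process. The function names, file naming pattern, dump count, and interval are illustrative assumptions, not values taken from the product documentation.

```python
import subprocess
import time
from datetime import datetime
from pathlib import Path


def jstack_command(pid):
    """Build the jstack invocation for one dump (-l adds lock information)."""
    return ["jstack", "-l", str(pid)]


def run_jstack(pid):
    """Run jstack against the given PID and return the dump text.

    Raises subprocess.CalledProcessError if jstack exits with an error.
    """
    result = subprocess.run(
        jstack_command(pid), capture_output=True, text=True, check=True
    )
    return result.stdout


def capture_dumps(pid, count=3, interval_s=10, out_dir=".",
                  dump=run_jstack, sleep=time.sleep):
    """Take `count` thread dumps, `interval_s` seconds apart.

    `dump` and `sleep` are injectable so the loop can be exercised
    without a live JVM. Returns the list of files written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        # Hypothetical naming scheme: PID, timestamp, and sequence number.
        path = out / f"mom-threaddump-{pid}-{stamp}-{i + 1}.txt"
        path.write_text(dump(pid))
        paths.append(path)
        if i < count - 1:
            sleep(interval_s)  # wait before the next dump, not after the last
    return paths
```

For example, `capture_dumps(12345, count=3, interval_s=10, out_dir="dumps")` would write three dumps for a hypothetical MOM process with PID 12345. Taking several dumps a few seconds apart, rather than a single one, makes it easier to see which threads stay stuck in the same place.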