The issue documented above has been resolved in the following release(s):
eHealth 6.2.2 Milestone 1
eHealth Service Packs are available for download at: http://support.ca.com. Please review the README file for each Service Pack prior to installing. CA recommends that users always keep eHealth current by installing the latest Service Pack available.
Trap server log warnings when queue is full
Problem Ticket: PRD00046584
"Warning nhiTrapServerCmu Pgm nhiTrapServerCmu: The Trap Server has dropped 4679 incoming traps since Thursday, 02/18/2010 04:40:06 PM. The Live Exceptions server may be experiencing performance issues or eHealth may be receiving an unusually high number of traps."
"Warning nhiTrapServerCmu Pgm nhiTrapServerCmu: The Trap Server output queue directory is full. The Live Exceptions server may be experiencing performance issues or eHealth may be receiving an unusually high number of traps."
These trap server warning messages are new as of eHealth 6.2.2. The trap server maintains a queue of files to be processed by the liveEx engine. If liveEx falls behind for some reason, this queue can fill up and the trap server will start dropping incoming traps. This behavior is documented, but in the past the trap server never logged any warnings when it started dropping traps. The new warning messages highlight that the trap server's outbound file queue has filled up and that incoming traps are therefore being dropped. They are written to the system.log and so are visible in OCE.
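The drop behavior can be sketched as a bounded queue. This is an illustrative model only, not eHealth's actual implementation; the class and method names are hypothetical:

```python
from collections import deque

class TrapQueue:
    """Illustrative bounded queue: incoming traps are enqueued for a
    downstream consumer (liveEx in the eHealth case); once the queue
    is full, new traps are dropped and a drop counter is incremented.
    All names here are hypothetical."""

    def __init__(self, max_size):
        self.queue = deque()
        self.max_size = max_size
        self.dropped = 0  # count of traps dropped since the queue filled

    def enqueue(self, trap):
        if len(self.queue) >= self.max_size:
            # consumer has fallen behind; drop the incoming trap
            self.dropped += 1
            return False
        self.queue.append(trap)
        return True

    def dequeue(self):
        # consumer side: process the oldest queued trap, if any
        return self.queue.popleft() if self.queue else None
```

If the consumer drains the queue faster than traps arrive, `dropped` stays at zero; the warnings in the messages above correspond to the case where it does not.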
The first message above will always be of the form "Trap Server has dropped <nnnn> traps since <timestamp>".
To avoid flooding the system.log with messages, the trap server throttles these new warnings so that no more than two warnings are logged in any 10-minute period.
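A rolling-window throttle of this kind can be sketched as follows. This is a minimal illustration of the documented limit (two warnings per 10 minutes), not eHealth's actual code; the class name and interface are assumptions:

```python
import time

class WarningThrottle:
    """Illustrative throttle: allow at most `max_per_window` warnings
    in any rolling window of `window_secs` seconds. Defaults match the
    documented limit of two warnings per 10 minutes."""

    def __init__(self, max_per_window=2, window_secs=600):
        self.max_per_window = max_per_window
        self.window_secs = window_secs
        self.timestamps = []  # times of recently logged warnings

    def allow(self, now=None):
        """Return True if a warning may be logged now, recording it."""
        now = time.time() if now is None else now
        # discard timestamps that have aged out of the window
        self.timestamps = [t for t in self.timestamps
                           if now - t < self.window_secs]
        if len(self.timestamps) < self.max_per_window:
            self.timestamps.append(now)
            return True
        return False
```

With the defaults, a burst of warnings yields two log entries, and further attempts are suppressed until enough of the window has elapsed.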
Typically, this situation - in which liveEx is unable to keep up with the trap server - is due to one of two issues:
1) eHealth is being hit by an unusual trap storm, and traps are arriving at a higher rate than it can accommodate
2) a systemic performance issue (disk and/or database IO) is causing liveEx to take a long time to process input and update alarms in the database.
Systemic performance problems are what we have typically seen in the field. Often liveEx also requires unusually long amounts of time to process poller input, so these sorts of systemic IO problems usually result in a total slowdown of all the liveEx server instances.
(Legacy KB ID CNC TS34670 )