Production becomes slow and the node can even crash

Document ID : KB000086804
Last Modified Date : 14/04/2018
Issue:
Error Message :
non specific

Patch level detected: Dollar Universe 6.5.01
Product Version: Dollar Universe 6.5.01

Description: The node crashes or becomes slow, and the file reorganization fails because of File System full events.
The crashes recur even though the number of executions per day is relatively small.
Environment:
OS: All
Cause:
Cause type:
Configuration
Root Cause: In DUAS v6, a job in Event Wait is not purged. A job is purged according to the retention criteria only once it has reached a final status (Completed, Aborted, Time Overrun, etc.). When jobs in Event Wait are not treated by Operators, they eventually stop being displayed in the list of executions in UVC, but they remain active and pile up. Because these jobs are never purged, the u_fmhs60 data file grows too large.
Resolution:
The instability (regular crashes and freezes) is caused by a huge number of jobs in Event Wait on the node. These jobs are defunct because the event they are waiting for will never arrive.
You can count the jobs in Event Wait with the following command:
$UNI_DIR_EXEC/uxlst ctl status=W | wc -l

If the number of jobs in Event Wait is significant, for instance as large as or larger than the total number of executions in a week, the situation is unsustainable and will eventually lead to crashes.
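The comparison described above can be sketched as a small shell helper. This is only an illustration: the function name check_event_wait_backlog and the weekly execution total (5000 in the usage comment) are assumptions, not part of Dollar Universe, and the weekly total has to come from your own figures.

```shell
# Hypothetical helper (not a Dollar Universe command): compares the number
# of jobs in Event Wait with a weekly execution total supplied by the caller.
check_event_wait_backlog() {
    waiting=$1
    weekly_total=$2
    if [ "$waiting" -ge "$weekly_total" ]; then
        echo "CRITICAL: $waiting jobs in Event Wait (weekly executions: $weekly_total)"
        return 1
    fi
    echo "OK: $waiting jobs in Event Wait (weekly executions: $weekly_total)"
}

# Example usage on the node, after loading the environment.
# The $(( ... )) wrapper strips any padding that wc -l may add.
#   waiting=$(( $($UNI_DIR_EXEC/uxlst ctl status=W | wc -l) ))
#   check_event_wait_backlog "$waiting" 5000
```

The helper returns a non-zero status when the backlog reaches the weekly total, so it could be called from a monitoring script.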
IN THE SHORT TERM we recommend removing these jobs.
ATTENTION: the following command is EXTREMELY slow and can take up to several hours to execute. Start it, be patient, and let it finish; a long run time is not a cause for concern.

Load the environment (. ./unienv.ksh), then run:
$UNI_DIR_EXEC/uxpur ctl status=W mu=* upr=* before="(mm/dd/yyyy,0000)"
where mm/dd/yyyy is, for instance, the last day of last month. This removes all jobs in Event Wait before that date.
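As an illustration of building that date, here is a minimal sketch that computes the last day of the previous month in mm/dd/yyyy format using only POSIX shell arithmetic. The function name last_day_prev_month is hypothetical, and the uxpur line in the usage comment simply repeats the command above with the computed date.

```shell
# Hypothetical helper: prints the last day of the month preceding the
# given year and month, formatted as mm/dd/yyyy.
last_day_prev_month() {
    year=$1
    month=${2#0}            # drop a leading zero so 08 and 09 are not read as octal
    month=$((month - 1))
    if [ "$month" -eq 0 ]; then
        month=12
        year=$((year - 1))
    fi
    case $month in
        4|6|9|11) day=30 ;;
        2)
            # Gregorian leap-year rule
            if [ $((year % 4)) -eq 0 ] && { [ $((year % 100)) -ne 0 ] || [ $((year % 400)) -eq 0 ]; }; then
                day=29
            else
                day=28
            fi ;;
        *) day=31 ;;
    esac
    printf '%02d/%02d/%04d\n' "$month" "$day" "$year"
}

# Example usage, after loading the environment:
#   cutoff=$(last_day_prev_month "$(date +%Y)" "$(date +%m)")
#   $UNI_DIR_EXEC/uxpur ctl status=W mu=* upr=* before="($cutoff,0000)"
```

Check the printed date before passing it to the purge, since the purge itself cannot be undone from this script.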

IN THE LONG TERM you will have to remove ALL jobs in Event Wait that are not treated by your Operators, or instruct your Operators to conscientiously treat every Event Wait that is overdue and set it to Aborted. Note that jobs in Event Wait that are not in the header of a Session will never go to Time Overrun! Jobs that are Aborted will be purged.

Fix Status: No Fix

Additional Information:
Workaround :
N/A