DUAS: Jobs abort with no Job Log "fork failed errno (12)"

Document ID : KB000117072
Last Modified Date : 08/10/2018
Issue:
When a heavy batch load is submitted within a short period of time, some Jobs (submitted either by Dollar Universe or via uxsubjob) go to status "Aborted" without any Job Log.
At the same time, the following error appears in universe.log of the node where the Physical queue is defined:
|ERROR|X|DQM|pid=p.t| u_dqm_lanc_batch | fork failed errno (12) Cannot allocate memory

If the job is submitted from a remote Logical Queue, messages such as the following may appear in universe.log on the Logical Batch Queue node:
|ERROR|X|DQM|pid=p.t| u_dqm_end_job_dist | dqm_lecture_directe_job returns 3 [XXX]
|ERROR|X|DQM|pid=p.t| owls_dqm_job_end   | u_dqm_end_job_dist returns 3


Additionally, if the "ulimit -s" value is greater than the total amount of RAM, Dollar Universe server processes may fail to start, with the following errors in universe.log:
|ERROR|X|IO |pid=p.t| o_spawn_thread | Thread error: pthread_create returns 1 (errno 11: Resource temporarily unavailable)
|FATAL|X|IO |pid=p.t| u_io_srv_main      | Thread error: o_spawn_thread_no_attr returns 1

 
Environment:
Unix / Linux
Cause:
An excessive default stack size (ulimit -s) is defined in /etc/security/limits.conf. This limit applies to every process forked on the system, so when many processes are forked within a short period of time the system runs out of address space.

Example:
* soft stack 4194304

By default, this parameter is set to 10240 KB or 8192 KB on Linux, whereas in this example it had been increased to a much larger value (4194304 KB = 4294967296 bytes = 4 GB).
This can be observed in universe.log during startup:
|INFO |X|DQM|pid=p.t| display_rlimit | RLIMIT_STACK: current=4294967296 max=4294967296
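
Both indicators described above, the stack entry in limits.conf and the RLIMIT_STACK line in universe.log, can be checked with a short script. The following is a minimal sketch, not a tool shipped with Dollar Universe: the function names are illustrative, and the location of universe.log is passed as an argument because it depends on the installation.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of Dollar Universe.

# Print active (non-comment) limits.conf entries whose item field is "stack".
# limits.conf fields are: <domain> <type> <item> <value>
list_stack_limits() {
    awk '$1 !~ /^#/ && $3 == "stack"' "$@"
}

# Print the most recent RLIMIT_STACK "current" value found in a universe.log,
# converted from bytes to KB so it can be compared with "ulimit -s".
# Prints nothing (and returns non-zero) if no numeric value is found.
last_rlimit_stack_kb() {
    bytes=$(grep 'RLIMIT_STACK:' "$1" |
            sed -n 's/.*current=\([0-9][0-9]*\).*/\1/p' |
            tail -n 1)
    [ -n "$bytes" ] && echo $(( bytes / 1024 ))
}

# Example: scan the system-wide limits file if it is readable.
if [ -r /etc/security/limits.conf ]; then
    list_stack_limits /etc/security/limits.conf
fi
```

For the log line shown above, `last_rlimit_stack_kb` would report 4194304, which matches the limits.conf example. On systems that also use /etc/security/limits.d/, those files can be passed to `list_stack_limits` in the same way.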
 
Resolution:
CA Automic recommends leaving the default OS values for the Stack Size limit (ulimit -s) for the users launching Dollar Universe processes or submitting jobs with Dollar Universe.
On Linux, these values (in KB) are:
For Red Hat 5 and 6: 10240
For Red Hat 7 and SUSE Linux: 8192

To avoid this kind of system issue, apply either solution A or solution B.

A) If there is no particular reason (such as a prerequisite of another application) to have a huge soft limit for the stack size, comment out or remove the offending line in /etc/security/limits.conf, or modify it to a lower value (around 10240).

B) If you do not know why this limit has been set in /etc/security/limits.conf, or you prefer not to modify this value, add the following line to the $HOME/.profile of every user launching jobs with Dollar Universe or starting Dollar Universe processes. Note that this soft limit can be slightly increased if required by another application launched by the same system user.
ulimit -s 10240
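
In bash, `ulimit -s` without `-S` or `-H` sets both the soft and the hard limit, so a user who later needs a larger stack may be unable to raise it again. As a hedged alternative to the line above, the following sketch lowers only the soft limit, and only when the current value is higher than the target. The variable name `DESIRED_STACK_KB` is hypothetical and not a Dollar Universe setting.

```shell
# Hypothetical $HOME/.profile fragment (a variant of the line above).
# DESIRED_STACK_KB is an illustrative name, not a Dollar Universe setting.
DESIRED_STACK_KB=10240

current_kb=$(ulimit -S -s)
if [ "$current_kb" = "unlimited" ] || [ "$current_kb" -gt "$DESIRED_STACK_KB" ]; then
    # -S changes only the soft limit; the hard limit is left untouched.
    ulimit -S -s "$DESIRED_STACK_KB"
fi
```

Leaving the hard limit unchanged keeps the option of raising the soft limit later in the same session if another application requires it.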

After applying one of the two proposed modifications: stop Dollar Universe, log out of the system and log in again, verify the new "ulimit -s" value in the shell, and start Dollar Universe again.
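
The verification step can be scripted so that it is easy to repeat for each affected user. This is a minimal sketch under the assumption that 10240 KB is the acceptable maximum; the function name `stack_limit_ok` is illustrative, and the threshold can be overridden with a second argument.

```shell
# Return success if a soft stack limit (in KB, as printed by "ulimit -s")
# is at or below the allowed maximum. "unlimited" always fails the check.
stack_limit_ok() {
    limit_kb=$1
    max_kb=${2:-10240}
    [ "$limit_kb" != "unlimited" ] && [ "$limit_kb" -le "$max_kb" ]
}

# Check the soft limit of the current shell before restarting Dollar Universe.
if stack_limit_ok "$(ulimit -S -s)"; then
    echo "soft stack limit OK: $(ulimit -S -s) KB"
else
    echo "soft stack limit still too high: $(ulimit -S -s)" >&2
fi
```

Run it in a fresh login shell of the user that starts Dollar Universe, since the limit is inherited by every process started from that shell.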