The customer believed that all processes were getting stuck. While the problem was occurring, I had him run a process that contained only a Start and a Finish step. That process completed successfully, so we were able to determine that only processes with GEL scripts were getting stuck.
He also told me that he had to restart the bg service two or three times before processes came all the way back up.
Any PPM environment that uses processes with gel scripts
The customer was using the util:sleep tag in his GEL scripts.
You should avoid using util:sleep in GEL scripts. This tag puts the thread executing the GEL script to sleep. There are 15 GEL threads allocated per Process Engine instance. If all 15 are sleeping, other GEL steps cannot execute and processes appear to hang. Even when only some of the GEL threads are sleeping, the process engines can slow down noticeably.
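The thread starvation described above can be illustrated with a small simulation. This is only a sketch in Python, not Process Engine code: the pool size of 15 comes from the text above, while the function names and the sleep duration are made up for the illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

GEL_THREADS = 15  # threads per Process Engine instance, as stated above


def sleeping_step(seconds):
    """Stand-in for a GEL step that calls util:sleep: it holds its thread."""
    time.sleep(seconds)


def quick_step(submitted_at):
    """Stand-in for an ordinary GEL step: returns how long it waited to start."""
    return time.monotonic() - submitted_at


def queue_delay(sleepers, sleep_seconds=0.3):
    """Return the start delay a quick step sees behind `sleepers` sleeping steps."""
    with ThreadPoolExecutor(max_workers=GEL_THREADS) as pool:
        for _ in range(sleepers):
            pool.submit(sleeping_step, sleep_seconds)
        # The quick step can only start once a pool thread is free.
        return pool.submit(quick_step, time.monotonic()).result()
```

With 14 sleepers one thread is still free and the quick step starts almost immediately. With 15 sleepers it has to wait until one of them wakes up, which is the "hang" users see when the sleep intervals are long or repeated.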
Long Term Solution: Rewrite the process GEL scripts so that they do not use util:sleep.
Short Term Solution: Restart the bg services. Usually one restart suffices, but sometimes multiple restarts are required.
The drawback of restarting the bg service is that the problem is likely to recur. The more GEL script processes you run that use this tag, the more likely it is to recur.
Instead of using util:sleep, a better approach is to move the check to a post condition. Assuming your GEL script is trying to monitor step completion in some other process instance, you could have that process set a flag or value on an object. That change can trigger an event that the post condition in your monitoring process detects before it moves on.
That may mean breaking your existing process and GEL script into multiple pieces rather than one bigger process.
That may also mean your process logic needs to be rethought to use a custom object, or a custom attribute on some other object, that can be used in a post condition in your process.
Because the post condition pipeline can handle and iterate through many post conditions without clogging up its bandwidth, changing your processes to use post conditions instead of util:sleep will help ensure that this problem does not occur again.
You could have 300+ process instances waiting on a post condition without affecting process engine performance, whereas even a few GEL scripts stuck sleeping or polling can have an adverse effect on system throughput and behavior.
STEPS TO TROUBLESHOOT:
1. Run a process with a Start and Finish step and nothing else. There should be no actions or GEL scripts in these steps. If this process does not complete, util:sleep is not the cause of the problem.
2. Find out if restarting the bg (even if it takes more than one try) fixes the problem for a while. If so, this is an indicator that we might be on the right track.
3. Take a Java thread dump of the bg service while the problem is occurring. Make sure you select the actual bg service and not the wrapper when choosing the Java process. See the attached document for instructions on how to create a thread dump. Save the output to a text file.
4. Open the text file and search for the word "custom". You will see 15 custom threads listed. If you see the word sleeping under all or most of the custom threads, the process engine lockup is caused by use of the util:sleep tag.
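The manual search in step 4 can also be scripted. The following is a minimal Python sketch, assuming the dump is in the usual HotSpot text layout where each thread entry starts with a quoted thread name and entries are separated by blank lines. The function name and the file name in the usage note are hypothetical; the words "custom" and "sleeping" are the ones named in step 4.

```python
import re


def summarize_custom_threads(dump_text, marker="custom"):
    """Return (custom thread count, how many of those are sleeping)."""
    total = 0
    sleeping = 0
    # Thread entries are separated by blank lines.
    for block in re.split(r"\n\s*\n", dump_text):
        lines = block.strip().splitlines()
        # A thread entry begins with the quoted thread name.
        if not lines or not lines[0].startswith('"'):
            continue
        if marker.lower() not in lines[0].lower():
            continue
        total += 1
        if "sleeping" in block.lower():
            sleeping += 1
    return total, sleeping
```

For example, after saving the dump as bg_threaddump.txt, you could call summarize_custom_threads(open("bg_threaddump.txt").read()). A result close to (15, 15) would point to util:sleep as the cause of the lockup.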