Root cause investigation for Automation Engine outage / freeze / unavailability
This article helps to determine the root cause of Automation Engine outages that appear as a freeze or unavailability of the Engine, or at least to narrow down the possible root causes. It also helps to prepare the information required when a ticket is opened with product support.
Symptoms and characteristics
The following symptoms or characteristics are mandatory for the Automation Engine outages covered in this article:
a. Server processes (CP, WP, JWP) are up and running on operating system level.
That means no process has crashed or terminated unexpectedly on the application server(s).
b. Some or all server processes no longer write messages to their log files, or they write the same message or block of messages over and over again.
c. Processing of tasks has stopped completely or almost completely.
That means no more jobs, workflows, etc. are activated, generated or continued.
The following symptoms or characteristics may occur as well, but they are not mandatory for the situations covered by this article:
- Users are unable to log on to the system.
They get an error message, a timeout or no response at all from the Engine.
- Agents are disconnecting and may reconnect with or without success.
- High CPU usage of one (or more) server processes (CP, WP, JWP).
- High memory consumption of one (or more) server processes (CP, WP, JWP).
Evaluate if the Server processes (CP, WP, JWP) are still up and running on operating system level
This corresponds to the first mandatory symptom (a.) described above.
Note: In a multi-node setup (Server processes running on two or more application servers), this check must be carried out on every node.
- If the Service Manager is used, the Service Manager Dialog provides a quick way to check this. Its log file is also a good place to see what happened to the processes in the past.
(!) The Service Manager log files are required for a support ticket.
- Otherwise, whatever in-house monitoring mechanism is in place can be used. Operating system commands or tools work as well.
For instance, “Task Manager” and “Event Viewer” on Windows platforms or the “ps” command on UNIX/Linux.
(!) For a support ticket, describe how the check was carried out and provide the results.
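On UNIX/Linux, the check with “ps” could be scripted roughly as follows. This is only a minimal sketch: the function names are made up for illustration, and the name patterns for the CP, WP and JWP processes are assumptions that must be adapted to the executable names used in the actual installation.

```python
import re

# ASSUMPTION: hypothetical name patterns for the server processes.
# Replace them with the executable names used in your installation.
DEFAULT_PATTERNS = {
    "CP": r"ucsrvcp",
    "WP": r"ucsrvwp",
    "JWP": r"ucsrvjp",
}


def count_server_processes(ps_output, patterns=None):
    """Count the lines of a process listing that match each pattern."""
    if patterns is None:
        patterns = DEFAULT_PATTERNS
    counts = {kind: 0 for kind in patterns}
    for line in ps_output.splitlines():
        for kind, pattern in patterns.items():
            if re.search(pattern, line, re.IGNORECASE):
                counts[kind] += 1
    return counts


def missing_process_types(counts):
    """Return the process types for which no running process was found."""
    return sorted(kind for kind, n in counts.items() if n == 0)
```

The text passed in as ps_output would be the output of a command such as “ps -ef”, collected on every node. The counts still have to be compared with the number of processes that are expected to run on that node.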
Check if the Server processes (CP, WP, JWP) are still issuing messages into their log files
This corresponds to the second mandatory symptom (b.) described above.
Check whether some or all server processes stopped logging at a specific timestamp, or whether the same message or block of messages is logged over and over again.
Any file viewer or editor can be used for this. Within Automic the preferred tool is “RSView”, which can also be found in the Automation Engine image delivery in the folder “Tools\no_supp”.
(!) The log files of all Server processes (CP, WP, JWP) from all nodes are required for a support ticket.
Note: If the Server processes are distributed across two or more application servers, this has to be done on each of them.
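Both log checks (no new entries since a certain time, or the same block of messages repeating) can also be automated. The following is a minimal sketch, not an Automic tool: all function names are made up, the five-minute threshold is an arbitrary choice, and the timestamp prefix matched by the regular expression is an assumption about the log line layout that has to be verified against the real log files.

```python
import re
from datetime import timedelta

# ASSUMPTION: log lines start with a timestamp like "20240101/120000.123 - ".
# Adjust the pattern to the layout of your log files.
TIMESTAMP_PREFIX = re.compile(r"^\d{8}/\d{6}\.\d{3} - ")


def strip_timestamp(line):
    """Remove the assumed timestamp prefix so repeated messages compare equal."""
    return TIMESTAMP_PREFIX.sub("", line.rstrip("\n"))


def is_log_stalled(last_entry_time, now, max_silence=timedelta(minutes=5)):
    """True if the newest log entry is older than the allowed silence."""
    return now - last_entry_time > max_silence


def repeating_tail_block(messages, max_block=50, min_repeats=3):
    """Return the size of the shortest block of messages that is repeated at
    least min_repeats times at the end of the log, or 0 if there is none."""
    n = len(messages)
    for size in range(1, max_block + 1):
        if size * min_repeats > n:
            break
        block = messages[n - size:]
        if all(messages[n - (k + 1) * size:n - k * size] == block
               for k in range(min_repeats)):
            return size
    return 0
```

The messages passed to repeating_tail_block should already have their timestamps removed with strip_timestamp, otherwise identical messages never compare equal.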
Verify if the processing of tasks has stopped completely or almost completely
This corresponds to the third mandatory symptom (c.) described above.
If logging on to the system via the User Interface is still possible, check in the Process Monitoring perspective (Activity Window) whether processing of tasks has stopped completely or almost completely.
If symptom b. has already been detected, processing is most likely malfunctioning.
Determine if there are any locks on the database
If at least the three “mandatory” symptoms have been confirmed, the outage is often caused by a lock in the database. It is therefore very important to find out which kind of lock exists and which database session is the top locker, that is, which session holds the lock.
At this stage of the investigation the database administrator (DBA) should be contacted. The DBA knows how to determine this quickly and properly.
The following is an example of how this can be done on Oracle databases.
The example is based on the reference https://www.oraclerecipes.com/monitoring/find-blocking-sessions
Basically, all three recipes return the same result in different formats:
-- Recipe #1 - find blocking sessions with v$session
SELECT s.blocking_session, s.sid, s.serial#, s.seconds_in_wait
FROM v$session s
WHERE s.blocking_session IS NOT NULL;
-- Recipe #2 - find blocking sessions using v$lock
SELECT l1.sid || ' is blocking ' || l2.sid blocking_sessions
FROM v$lock l1, v$lock l2
WHERE l1.block = 1 AND
l2.request > 0 AND
l1.id1 = l2.id1 AND
l1.id2 = l2.id2;
-- Recipe #3 - blocking sessions with all available information
SELECT s1.username || '@' || s1.machine
|| ' ( SID=' || s1.sid || ' ) is blocking '
|| s2.username || '@' || s2.machine || ' ( SID=' || s2.sid || ' ) ' AS blocking_status
FROM v$lock l1, v$session s1, v$lock l2, v$session s2
WHERE s1.sid=l1.sid AND s2.sid=l2.sid
AND l1.BLOCK=1 AND l2.request > 0
AND l1.id1 = l2.id1
AND l1.id2 = l2.id2;
The following query can be used to map the SID to the process ID:
SELECT process, machine, s.osuser, s.program, sid
FROM v$process p, v$session s
WHERE p.addr = s.paddr;
On Microsoft SQL Server, the “Activity – All Blocking Transactions” report can be used.
This is based on the reference https://support.microsoft.com/en-au/help/224453/inf-understanding-and-resolving-sql-server-blocking-problems
Note: This investigation may only be possible during the outage, while the database lock persists.
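When several sessions block each other in a chain, the session of interest is the one at the head of the chain (the top locker). The following sketch shows one way to derive it from pairs of SID and blocking SID, such as the values returned by the v$session query on Oracle. The function name and the input layout are assumptions made for illustration, and the sketch assumes a single database instance in which a SID identifies a session uniquely.

```python
def find_root_blockers(rows):
    """Map each top blocking session to the sessions waiting behind it.

    rows: iterable of (sid, blocking_sid) pairs; blocking_sid is None for
    sessions that are not blocked.
    ASSUMPTION: a single instance, so a SID identifies a session uniquely.
    """
    blocked_by = {sid: blocker for sid, blocker in rows if blocker is not None}
    roots = {}
    for sid in blocked_by:
        seen = {sid}
        current = blocked_by[sid]
        # Follow the chain until a session is reached that is not blocked
        # itself, or until the chain loops back (a cycle, i.e. a deadlock).
        while current in blocked_by and current not in seen:
            seen.add(current)
            current = blocked_by[current]
        if current not in blocked_by:
            roots.setdefault(current, []).append(sid)
    return {root: sorted(waiters) for root, waiters in roots.items()}
```

For example, if session 30 waits for session 20 and session 20 waits for session 10, the function reports session 10 as the top locker with sessions 20 and 30 waiting behind it. Sessions that only block each other in a cycle are not reported.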
Check the process which is related to the blocking session
Once the process that blocks the system (the one holding the top blocking / locking session) has been identified, it is useful to find out what is going on with that process before killing the session or the process.
It is most likely a server process (CP, WP, JWP), but it can be another application too.
At this stage of the investigation the administrator responsible for the server where the process runs should be contacted. The admin knows how to determine this quickly and properly.
Some examples of how this can be done:
- On Windows servers, tools like “Task Manager”, “Resource Monitor” or “Event Viewer” can be used to look up CPU usage, memory consumption, file I/O, network traffic or any system error messages of the process. It is also possible to debug the process.
- On UNIX / Linux servers similar tools might be available; at the very least, commands like “top” or “truss” (“strace” on Linux) can be used.
(!) Everything determined at this stage of the investigation is very helpful when a support ticket is opened. It can help to find the real root cause and hopefully also a permanent solution.
Resolution of the incident
Once all data needed for a detailed root cause investigation has been saved, the first attempts to restore the service can be made. Which action to take depends on the result of the investigation steps above. Possible actions are: stopping the process that holds the lock; killing the blocking session; restarting the database; restarting the application; restarting the application server; etc.
Further debugging in case server processes caused the lock
If an Automation Engine Server process (CP, WP, JWP) caused the lock, it might be necessary to create a trace of that Server process. The trace needs to be started just before the situation occurs.
That means it is most likely necessary that the issue can be reproduced on purpose, occurs at a known frequency, or persists.
If the issue occurs randomly, it will be hard to generate the required traces because of the unpredictability. However, without these trace files a root cause analysis is not feasible.
The initial trace level should be TCP/IP=2 and database=4.
(!) The Automation Engine log and trace files are required to open a support ticket.