Production down recovery: UC4 is very slow after starting our backup server

Document ID : KB000089612
Last Modified Date : 14/04/2018
Show Technical Document Details
Issue:
Production down recovery: UC4 is very slow after starting our backup server
Resolution:

Symptoms

We need to start Automic on our backup server since we have maintenance today on our primary server.

When we try to start the WPs they run into deadlock situations, we are experiancing the following problems:

  • Login into Automic is very slow 
  • Jobs are stuck and not processed

Investigation

Data needed for analysis:

  • Automation Engine logfiles from the time the error occurred. (Default location in \AutomationEngine\Temp)
  • A "DB = 4" trace activated on the Working Process logfiles running for a couple of min. while the error occurs

You can find the following problem in the trace files of the Working Process:

The following information can be found in the working process log and tracefiles:

Primary Work Process:

U0003524 ===> Time critical DB call!       OPC: 'INSR' time: '3:680.511.742'
U0003525 ===> 'INSERT INTO MQWP (MQWP_System, MQWP_CAddr, MQWP_CSRName, MQWP_CAcv, MQWP_BAcv, MQWP_FAddr, MQWP_BTable, MQWP_SchedTime, MQWP_Status, MQWP_Priority, MQWP_Len, MQWP_Msg) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)'
U0003434 Server routine  'UCMAIN_R/C_CALE' required '0' minutes and '3,735' seconds for processing.

Work Process 4:

          U0003524 ===> Time critical DB call!       OPC: 'SLCT' time: '7:985.660.752'
          U0003525 ===> 'SELECT * FROM OH, OKB WHERE OH_DeleteFlag  = 0 AND OKB_ErrMsgNr = 0 AND OH_Idnr = OKB_OH_Idnr AND OH_Client = ? AND OKB_CTYPE IN ('G','R') AND OKB_ValidTo < ?'
          U0003434 Server routine  'UCMAIN_R/CALECHK' required '0' minutes and '8,000' seconds for processing.
          SQL_Error    Database-Handles DB-HENV: 3361978 DB-HDBC: 3361a20
          U0003591 DB error info: Opc: 'SQLExecDirect' Return code: 'ERROR'
          U0003592 Status: '40001' Native error: '1205' Msg: 'Transaction (Process ID 59) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.'

The WP1 is trying to insert information concerning a schedule into the system tables, this insert runs into a time critical database call which means that the rows that the wp1 is trying to access are locked for the time that the insert takes. The timecritical DB call lasts for '3,735' seconds, at the same time the WP4 is trying to access the same rows for the same schedule and runs into a deadlock because the rows are still locked from the insert that the WP1 is stuck on.

Cause

The cause of the deadlock is the insert statement from the PWP into the MQ Tables (System Tables) this insert is taking '3,735' seconds to respond back from the Database to Automic.

The System Tables have grown to a very large size with a large unused space, this usually happens when the system gets stuck , alot of tasks remain in the System Tables and can not be processes.

Now when the system gets back to processing the table entries get worked off but the Database keeps a large unused space.


Resolution

  • Stop all server processes 
  • Perform a cold start of the system
  • The cold start cleans up the System Tables and brings them back down to size
  • After the cold start the system is back up an running