Autoping Failures During Peak Schedule Times

Document ID : KB000100377
Last Modified Date : 06/06/2018
Issue:
During peak schedule times within an instance, autopings fail intermittently but succeed when retried. During slower periods, these failures are not observed. The autoping failure messages look like the following...

CAUAJM_I_50023 AutoPinging Machine [Machine1]
CAUAJM_W_10496 Agent on [Machine1] has not responded in a timely fashion. Try again later. [CA WAAE Autoping]
CAUAJM_E_50281 AutoPing from the Scheduler WAS NOT SUCCESSFUL.

CAUAJM_I_50282 AutoPing from the Application Server WAS SUCCESSFUL.

CAUAJM_E_50026 ERROR: AutoPing WAS NOT SUCCESSFUL.
Environment:
WAAE 11.3.6 Base, SP1, and SP2
O/S: Any
Database: Any
Cause:
The autoping error, along with the other factors in this scenario, provides a clue to the root cause. Notice that two results are posted. The first is the autoping result from the Scheduler, and the second is the result from the Application Server. In the error above, only the autoping from the Scheduler failed, while the autoping from the Application Server was successful...

CAUAJM_E_50281 AutoPing from the Scheduler WAS NOT SUCCESSFUL.

CAUAJM_I_50282 AutoPing from the Application Server WAS SUCCESSFUL.
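To check how often this pattern appears, the autoping output can be scanned for attempts where the Scheduler result failed and the Application Server result succeeded. The following is a minimal Python sketch, not a product utility. It relies only on the message codes shown in the excerpt above; the function name and the assumption that the messages for one autoping attempt appear together, not interleaved with another attempt, are illustrative.

```python
import re

# Message codes copied from the autoping output shown above.
PING_START = re.compile(r"CAUAJM_I_50023 AutoPinging Machine \[(?P<machine>[^\]]+)\]")
SCHED_FAIL = "CAUAJM_E_50281"   # AutoPing from the Scheduler WAS NOT SUCCESSFUL
APPSRV_OK = "CAUAJM_I_50282"    # AutoPing from the Application Server WAS SUCCESSFUL


def scheduler_only_failures(lines):
    """Return the machines whose autoping failed from the Scheduler
    but succeeded from the Application Server.

    Assumption: lines belonging to one autoping attempt follow its
    CAUAJM_I_50023 line and are not interleaved with another attempt.
    """
    attempts = []
    current = None  # the autoping attempt currently being read
    for line in lines:
        match = PING_START.search(line)
        if match:
            current = {
                "machine": match.group("machine"),
                "sched_fail": False,
                "appsrv_ok": False,
            }
            attempts.append(current)
        elif current is not None:
            if SCHED_FAIL in line:
                current["sched_fail"] = True
            elif APPSRV_OK in line:
                current["appsrv_ok"] = True
    return [a["machine"] for a in attempts if a["sched_fail"] and a["appsrv_ok"]]


# Sample input: the messages from the excerpt above.
sample = [
    "CAUAJM_I_50023 AutoPinging Machine [Machine1]",
    "CAUAJM_W_10496 Agent on [Machine1] has not responded in a timely fashion. Try again later. [CA WAAE Autoping]",
    "CAUAJM_E_50281 AutoPing from the Scheduler WAS NOT SUCCESSFUL.",
    "CAUAJM_I_50282 AutoPing from the Application Server WAS SUCCESSFUL.",
    "CAUAJM_E_50026 ERROR: AutoPing WAS NOT SUCCESSFUL.",
]
print(scheduler_only_failures(sample))
```

A machine that appears in the result repeatedly, and only during busy scheduling windows, matches the scenario described in this article.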

Looking at all the factors for this particular scenario...

1. Problem is intermittent and only occurs during peak scheduling times
2. The failure is only for the Scheduler
3. The product version is pre-11.3.6 SP3

...the root cause is likely the communication architecture between the Scheduler and the Agents in these earlier releases. Prior to SP3, the Scheduler had a single thread that handled all communication between it and the agents. During peak scheduling times, when the Scheduler is making many agent connections to handle job submissions, this single-threaded architecture would sometimes have trouble keeping up with the traffic. As a result, autopings from the Scheduler would intermittently time out during these peak times.
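The effect of a single communication thread can be illustrated with a simple queueing model. The Python sketch below is a toy model only and does not represent the Scheduler's actual implementation: the function name, the request count, the per-request service time, and the timeout value are all hypothetical numbers chosen for illustration. It computes how long an autoping waits when it is queued behind a burst of job submissions, served first-in-first-out by a given number of threads.

```python
import heapq


def ping_wait(pending_requests, service_time, workers):
    """Seconds an autoping waits before a thread picks it up, when it is
    queued behind `pending_requests` agent requests served in FIFO order.

    Toy model: every request takes the same `service_time` seconds.
    """
    # Each heap entry is the time at which one thread becomes free.
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for _ in range(pending_requests):
        start = heapq.heappop(free_at)          # earliest available thread
        heapq.heappush(free_at, start + service_time)
    return free_at[0]                           # earliest time the ping can start


# Hypothetical peak load: 40 queued requests of 0.5 s each,
# with an illustrative 10 s autoping timeout.
TIMEOUT = 10.0
for threads in (1, 8):
    wait = ping_wait(pending_requests=40, service_time=0.5, workers=threads)
    status = "times out" if wait > TIMEOUT else "answers in time"
    print(f"{threads} thread(s): ping waits {wait:.1f} s and {status}")
```

Under these assumed numbers, the ping waits 20.0 seconds with one thread and 2.5 seconds with eight threads. With an empty queue the wait is zero in both cases, which is consistent with the failures appearing only during peak periods.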
 
Resolution:
In 11.3.6 SP3, this architecture was enhanced so that multiple threads now handle the Scheduler/Agent communication. This allows the Scheduler to handle more concurrent agent communication requests, which reduces the possibility of timeouts. To resolve this type of issue, upgrade the instance to 11.3.6 SP3 or a later service pack that has this improved architecture.