All Agents, including the Master's local Agent, go to a Service Down status daily or weekly at around the same time of day. Only the Automation Engine remains in a Running status.
Further investigation shows that the issue occurs while the SYSTEM Process Flow is running, and that simply restarting the RMI process or Windows Service returns all Agents to a Running status.
This can occur if the Applications Manager database tables that are managed by the SYSTEM Process Flow contain an unusually large number of records.
The SO_PRINT_LOG and SO_JOB_HISTORY tables are generally the most likely to accumulate excess data if they are not properly maintained.
SYSTEM scripts query a number of tables using one of the 7 available Master Socket Manager (MSM) threads within the RMI Java process.
The MSM threads are responsible for processing requests from the local Agent and remote Agents, such as Subvar resolution and Condition evaluation.
Because of the large amount of data being queried, the MSM threads cannot process Agent requests while they wait for the queries to return, so the Agents time out and go to a Service Down status.
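The effect described above can be illustrated with a small, simplified model. The sketch below is not Applications Manager code: it uses a generic Python thread pool of 7 workers as a stand-in for the MSM threads, and the function names, the query duration, and the timeout value are all hypothetical. It only shows why a quick request times out once every worker is blocked on a slow query.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

POOL_SIZE = 7          # number of MSM threads stated in the article
AGENT_TIMEOUT = 0.2    # hypothetical Agent timeout, in seconds
SLOW_QUERY_TIME = 1.0  # hypothetical duration of a query on an oversized table


def slow_query():
    """Stand-in for a SYSTEM script query against a very large table."""
    time.sleep(SLOW_QUERY_TIME)
    return "query done"


def agent_request():
    """Stand-in for a quick Agent request such as Subvar resolution."""
    return "resolved"


def simulate(slow_queries):
    """Report the Agent status seen while `slow_queries` long queries are running."""
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        # Occupy worker threads with long-running queries.
        for _ in range(slow_queries):
            pool.submit(slow_query)
        # The Agent request waits in the queue until a worker is free.
        future = pool.submit(agent_request)
        try:
            future.result(timeout=AGENT_TIMEOUT)
            return "Running"
        except FutureTimeout:
            return "Service Down"


if __name__ == "__main__":
    print(simulate(6))  # one worker is still free
    print(simulate(7))  # every worker is blocked on a slow query
```

In this model the Agent request itself is trivial; it fails only because no worker is free to pick it up before the timeout expires, which is why shrinking the tables (and therefore the query time) addresses the cause rather than the symptom.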
To temporarily resolve the issue, restart the RMI process or Windows Service.
To permanently resolve the issue:
Ask your Database Administrator to review the SO_PRINT_LOG and SO_JOB_HISTORY tables.
For the SO_PRINT_LOG table, delete all records older than 7 days, which is the default retention period for Jobs.
For the SO_JOB_HISTORY table, delete all records older than 60 days, or older than the value set for the HISTORY_PURGE Job.
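As a rough illustration of the cleanup logic, the sketch below deletes rows older than a retention window. It runs against an in-memory SQLite database purely as a stand-in; the date column name (`logged_at`), the simplified table layout, and the helper names are assumptions and do not reflect the real Applications Manager schema. Your Database Administrator should confirm the actual date column and back up the tables before running any delete against the production database.

```python
import sqlite3
from datetime import datetime, timedelta

# Retention values taken from the article; adjust to the site's settings.
RETENTION_DAYS = {"SO_PRINT_LOG": 7, "SO_JOB_HISTORY": 60}


def purge_old_rows(conn, table, date_column, now):
    """Delete rows older than the table's retention window; return the count."""
    if table not in RETENTION_DAYS:
        raise ValueError(f"unexpected table: {table}")
    cutoff = (now - timedelta(days=RETENTION_DAYS[table])).isoformat()
    # Identifiers cannot be bound as parameters, so the table name is checked
    # against the allow-list above; date_column is a hypothetical name
    # supplied by the caller, never by end-user input.
    cur = conn.execute(
        f"DELETE FROM {table} WHERE {date_column} < ?", (cutoff,)
    )
    conn.commit()
    return cur.rowcount


def demo():
    """Build a toy SO_PRINT_LOG table and purge it (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    now = datetime(2024, 1, 31)
    conn.execute("CREATE TABLE SO_PRINT_LOG (id INTEGER, logged_at TEXT)")
    ages_in_days = [1, 10, 30]  # one recent row, two past the 7-day window
    conn.executemany(
        "INSERT INTO SO_PRINT_LOG VALUES (?, ?)",
        [(i, (now - timedelta(days=age)).isoformat())
         for i, age in enumerate(ages_in_days, start=1)],
    )
    deleted = purge_old_rows(conn, "SO_PRINT_LOG", "logged_at", now)
    remaining = conn.execute("SELECT COUNT(*) FROM SO_PRINT_LOG").fetchone()[0]
    conn.close()
    return deleted, remaining


if __name__ == "__main__":
    print(demo())
```

On a table that has grown very large, it is usually safer to delete in smaller batches during a quiet period than in one statement, so that the purge itself does not hold long-running locks.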