How does CA Service Desk Manager (SDM) behave within a network?

Document ID : KB000019517
Last Modified Date : 14/02/2018

Description:

This document covers low-level details of how the primary Service Desk Manager server communicates with secondary Service Desk Manager servers, and how network disruption can impact that communication.

Solution:

A Service Desk Manager installation consists of a primary server and any number of secondary servers. Each server runs multiple SDM daemon processes. The main SDM communication manager process, called Slump (sslump_nxd), runs on the primary server only.

As noted, the Service Desk Manager application is spread across multiple processes that can be distributed across multiple servers. When each process starts, it connects to a single well-known process, the "slump" process. Since all processes connect to slump, the slump process knows how to route messages to every other connected process. By default, all communication between processes is routed through the slump process. To improve scalability and performance, some processes have the ability to create a "fast-channel" between each other; again, because slump is connected to each process, it helps both processes set up that communication. Each connection, whether via slump or a fast-channel, uses a TCP port. An SDM process such as a webengine will have multiple ports open to other processes (domsrvr, bpvirtdb_srvr, etc.) as well as one to slump.

All processes connect to slump initially and stay connected to slump. Every process connected to slump will always have a TCP port open somewhere above 2100, though not necessarily close to 2100. By default, TCP port 2100 is slump's listener port. If your installation is configured to use slump communication only, each Service Desk process uses its existing slump connection to communicate with other processes; slump handles the inter-process communication over the TCP ports it already has open to each process. If the NX_SLUMP_FIXED_SOCKETS variable is set (in the NX.env file), slump is forced to open ports as close to 2100 as possible. This is required in firewall environments, because a firewall needs a known range of ports to keep open so that Service Desk usage is not affected negatively.
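The fixed-sockets behavior amounts to "bind to the lowest free port at or above 2100". A minimal sketch of that idea in Python (illustrative only, not SDM's actual implementation; the function name is invented):

```python
import socket

def bind_near(base_port: int = 2100, max_tries: int = 100) -> socket.socket:
    """Bind a listener to the lowest free port at or above base_port,
    so a firewall only needs to keep a narrow, predictable range open."""
    for offset in range(max_tries):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", base_port + offset))
            s.listen(5)
            return s
        except OSError:
            s.close()  # port already in use; try the next one up
    raise RuntimeError(f"no free port in {base_port}..{base_port + max_tries - 1}")

listener = bind_near(2100)
```

With this policy, a firewall rule covering roughly 2100-2200 suffices, instead of the full ephemeral port range.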

If Service Desk is configured to use fast-channel connections only, a Service Desk process will ask slump to open a fast-channel connection to the other process. The sequence is: process A requests a fast-channel connection to process B from slump; slump notifies process B to open a port for process A; process B reports the new port information back to slump; and slump passes it to process A. From that point, processes A and B communicate directly with each other, without slump server involvement. Ports created this way are random.
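The handshake described above can be illustrated with plain sockets (a simplified sketch; the real slump protocol is proprietary, and the payload here is invented):

```python
import socket
import threading

def process_b_open_channel() -> int:
    """Process B's side: open a listener on an OS-assigned (random) port
    and report the port number back, as it would to slump."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # 0 = let the OS pick a random port
    srv.listen(1)

    def serve_once():
        conn, _ = srv.accept()
        conn.sendall(b"hello from B")  # direct traffic, no broker in the path
        conn.close()
        srv.close()

    threading.Thread(target=serve_once, daemon=True).start()
    return srv.getsockname()[1]

# "Slump" brokers the exchange: B opens a port, slump relays it to A.
fast_channel_port = process_b_open_channel()

# Process A now connects straight to B; slump is out of the data path.
a = socket.create_connection(("127.0.0.1", fast_channel_port))
data = a.recv(64)
a.close()
```

Note that the port B opens is OS-assigned, which matches the article's point that fast-channel ports are random.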

By default, Service Desk is configured to use fast-channel connections; this behavior is controlled by the NX_NOFASTCHAN variable. Changing Service Desk to use slump-only communication may cause a performance hit on the primary server, where the slump server is running. Even when fast-channel is enabled, not all SDM processes utilize it; some still rely on slump to send messages back and forth between processes.
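As a sketch, deciding the communication mode from configuration could look like the following. The variable names come from this article, but the KEY=VALUE format shown and the meaning of the values are assumptions for illustration; check your own NX.env for the actual syntax and semantics:

```python
# Hypothetical NX.env fragment; format and value semantics are assumed.
nx_env_text = """
NX_NOFASTCHAN=0
NX_SLUMP_FIXED_SOCKETS=1
"""

def parse_nx_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, val = line.partition("=")
            values[key.strip()] = val.strip()
    return values

env = parse_nx_env(nx_env_text)
# Assumption: NX_NOFASTCHAN unset or "0" leaves fast-channel enabled.
fast_channel_enabled = env.get("NX_NOFASTCHAN", "0") == "0"
```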

CA Support has recommended that customers use TCPView (a freeware Sysinternals program) to monitor Service Desk processes and see which ports they have opened to which processes, if that information is needed. There is currently no verbose logging within the product to show which port numbers are in use; hence the recommendation to use the freeware TCPView tool.

The TCP ports opened are otherwise random, but are kept as close to (and above) 2100 as possible when the fixed-sockets variable is set.
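Beyond TCPView, a quick scripted probe of the slump listener can confirm basic reachability. This is a generic TCP check, not an SDM tool:

```python
import socket

def slump_reachable(host: str, port: int = 2100, timeout: float = 2.0) -> bool:
    """Return True if something is accepting TCP connections on the
    given port (by default slump's 2100 listener)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demonstration against a throwaway local listener.
demo = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
demo.bind(("127.0.0.1", 0))
demo.listen(1)
reachable = slump_reachable("127.0.0.1", demo.getsockname()[1])
demo.close()
```

In practice you would point this at the primary server's hostname and port 2100 from a secondary.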

SDM Reconnection Attempt Logic:

When Service Desk Manager processes communicate with each other over a fast-channel connection without slump involvement, and a disconnection occurs between the two processes, the process that first reports the event will NOT shut down, but will attempt to reconnect. Reconnection happens quickly, within a second or so. There is no wait or back-off logic before a reconnection attempt, and reconnection attempts continue indefinitely.
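This fast-channel behavior amounts to a retry loop with no give-up limit, roughly like the sketch below (timing and function name are illustrative only):

```python
import socket
import time

def reconnect_forever(host: str, port: int, delay: float = 1.0) -> socket.socket:
    """Retry the connection indefinitely, roughly once per second,
    without ever shutting the process down."""
    while True:
        try:
            return socket.create_connection((host, port), timeout=delay)
        except OSError:
            time.sleep(delay)  # short pause, then try again; no attempt limit

# Demonstration: reconnect to a peer that is already listening again.
peer = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
peer.bind(("127.0.0.1", 0))
peer.listen(1)
conn = reconnect_forever("127.0.0.1", peer.getsockname()[1])
```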

When a Service Desk Manager process needs to communicate with the Slump process for message routing and a disconnection occurs, the process that first reports the event WILL shut down. In this case, the perception is that Slump on the primary is unreachable. This is considered a severe event, so processes are requested to shut down. Slump is the communication manager, and any perception that it is unavailable warrants shutting processes down and entering a hibernation state. This allows the IT team to investigate the status of the primary and take manual steps if automatic recovery does not take place.

In this situation, multiple processes on a secondary server will eventually report the same event, and all main SDM daemons are told to shut down. The SDM proctor process on the secondary will in most cases also hit the problem, restart, and then hibernate, waiting for the connection with slump to be re-established. The proctor is the only process that hibernates on a secondary server; it retries indefinitely to reach slump until the connection succeeds. Once the proctor is communicating with slump, the daemon manager on the primary will notice that processes are not running on the secondary and request a restart. Those processes then connect to slump. If startup fails for some reason, the process is restarted up to ten times before giving up. The pdm_d_refresh command can be used to reset the max restart limit and allow further restart attempts.
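The restart-limit behavior above can be modeled as a small counter that pdm_d_refresh resets. This is a schematic model of the stated rules, not SDM code:

```python
MAX_RESTARTS = 10  # per-process restart ceiling described above

class RestartTracker:
    """Track restart attempts for one daemon; give up at the limit until refreshed."""

    def __init__(self):
        self.attempts = 0
        self.gave_up = False

    def try_start(self, start_fn) -> bool:
        if self.gave_up:
            return False          # no more attempts until a refresh
        self.attempts += 1
        if start_fn():
            self.attempts = 0     # a successful start clears the counter
            return True
        if self.attempts >= MAX_RESTARTS:
            self.gave_up = True   # limit reached; stop trying
        return False

    def refresh(self):
        """Analogue of pdm_d_refresh: reset the limit so restarts resume."""
        self.attempts = 0
        self.gave_up = False

tracker = RestartTracker()
for _ in range(MAX_RESTARTS):
    tracker.try_start(lambda: False)         # ten failed starts exhaust the limit
stuck = not tracker.try_start(lambda: True)  # even a good start is refused now
tracker.refresh()
recovered = tracker.try_start(lambda: True)  # refresh allows attempts again
```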

How SDM detects a disconnection from slump:

The processes determine that the connection to the slump server is not available when they attempt to read or write data on the socket. A process usually registers a callback function with the slump layer to be notified when the connection is broken. This function is called when the disconnection happens so the process can take action. For example, the webengine terminates as a result of the disconnection once the callback function is invoked.
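The callback pattern can be sketched as follows (the class and method names here are invented for illustration; they are not SDM's internal API):

```python
class SlumpLayer:
    """Minimal model of the slump communication layer's disconnect callback."""

    def __init__(self):
        self._on_disconnect = None

    def register_disconnect_callback(self, fn):
        """A process registers a handler to be told when the connection breaks."""
        self._on_disconnect = fn

    def _read_socket(self):
        # Simulate the OS surfacing a reset on an established connection.
        raise ConnectionResetError(10054, "Connection reset by peer")

    def pump(self):
        try:
            self._read_socket()
        except ConnectionResetError:
            if self._on_disconnect:
                self._on_disconnect()  # let the process decide to terminate

events = []
layer = SlumpLayer()
# A webengine-like process would terminate here; we just record the call.
layer.register_disconnect_callback(lambda: events.append("terminate"))
layer.pump()
```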

The socket layer (OS network layer) usually returns a socket error code (for example, WSAECONNRESET (10054)) to the process's read/write calls when it attempts to read or write data on an already established socket connection. The Service Desk Manager process logs this error in its standard log file:

09/22 14:28:26.47 SDCBCBSVMSS02 domsrvr:21 3868 INFORMATION socket_port.c 1582 Error: WSAECONNRESET (10054) reading from SOCKET_PORT(0x024F9958) description = TCP/IP port_name = Slump Port status = 0 ip address = 156.79.203.11 compression = 1 extra_flags = 0 file descriptor = 288 write pending = 0 handler type = DATA read count = 2434167 write count = 1683967 socket = 0

Here is the description of the error code:

WSAECONNRESET 10054 - Connection reset by peer. An existing connection was forcibly closed by the remote host. This normally results if the peer application on the remote host is suddenly stopped, the host is rebooted, the host or remote network interface is disabled, or the remote host uses a hard close (see setsockopt for more information on the SO_LINGER option on the remote socket). This error may also result if a connection was broken due to keep-alive activity detecting a failure while one or more operations are in progress. Operations that were in progress fail with WSAENETRESET. Subsequent operations fail with WSAECONNRESET.
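This error path can be reproduced with plain sockets: forcing a hard close (SO_LINGER with a zero timeout) makes the peer send a TCP RST, and the next read or write raises Python's ConnectionResetError (WSAECONNRESET/10054 on Windows, ECONNRESET on POSIX). A local demonstration:

```python
import socket
import struct
import time

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

client = socket.create_connection(srv.getsockname())
peer, _ = srv.accept()

# SO_LINGER with (onoff=1, linger=0) turns close() into a hard close (RST).
peer.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
peer.close()
time.sleep(0.2)  # let the RST arrive before we touch the socket

got_reset = False
try:
    client.sendall(b"x")   # writing to the reset connection...
    client.recv(16)        # ...or reading from it raises the error
except ConnectionResetError:
    got_reset = True

client.close()
srv.close()
```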

Disconnection Example Scenarios and Their Outcomes:

Scenario 1: Webengine on the secondary loses its connection to the Slump process on the primary. As a result, a slump connection error message is reported on the secondary.

Message: "EXIT webengine.c 1152 Slump died detected"

What happens:

  • The webengine process on the secondary will shut down.
  • The secondary proctor process will wait for a request from the primary's pdm_d_mgr (daemon manager) process to restart the secondary's shut-down daemon processes; in this case, the webengine process.
  • The primary pdm_d_mgr process will notice that some daemons on the secondary are not running and will ping the secondary proctor process to start them; in this case, the webengine process.
  • The secondary webengine process is started and its initialization code is executed; during startup, the process opens a new connection to Slump on the primary.
  • Opening the new connection is one of the first steps of webengine's startup, so the connection is re-established fairly quickly, within seconds (depending on the speed of the machine).
  • If for some reason webengine cannot establish the connection, it will restart itself up to 10 times before giving up. Ten is the max restart limit per process.
  • If the max restart limit is reached, the only way to get the process running again is to restart the entire set of services or to execute the pdm_d_refresh command.

Scenario 2: Domsrvr on the secondary loses its connection to the Slump process on the primary. As a result, a slump connection error message is reported on the secondary.

Message: "EXIT api.c 1643 Exiting because slump server terminated!"

What happens:

  • The domsrvr process on the secondary will shut down. pdm_d_mgr on the primary server will also terminate the dependent process, pdm_rpc. The webengine process will not restart; it is only reinitialized.
  • The secondary proctor process will wait for a request from the primary's pdm_d_mgr (daemon manager) process to restart the secondary's shut-down daemon processes; in this case, the domsrvr and pdm_rpc processes.
  • The primary pdm_d_mgr process will notice that some daemons on the secondary are not running and will ping the secondary proctor process to start them; in this case, the domsrvr and pdm_rpc processes.
  • The secondary domsrvr process is started and its initialization code is executed; during startup, the process opens a new connection to Slump on the primary.
  • Once the domsrvr process is started and logged in to slump, the primary pdm_d_mgr will request that pdm_rpc be started; pdm_rpc will initialize and, during startup, open a new connection to slump on the primary server.
  • Opening the new connection is one of the first steps of domsrvr's startup, so the connection is re-established fairly quickly, within seconds (depending on the speed of the machine).
  • If for some reason domsrvr cannot establish the connection, it will restart itself up to 10 times before giving up. Ten is the max restart limit per process.
  • If the max restart limit is reached, the only way to get the process running again is to restart the entire set of services or to execute the pdm_d_refresh command.

Scenario 3: Proctor on the secondary loses its connection to the Slump process on the primary. As a result, a slump connection error message is reported on the secondary.

Message: "SIGNIFICANT api.c 1631 Slump server terminated!"
Message: "SIGNIFICANT agent_os_if.c 824 Proctor hibernating"

What happens:

  • The proctor process on the secondary will lose its connection to the slump server on the primary, which leads it to stop all the started processes on that server.
  • Once the proctor has terminated all the dependent processes, it goes into a "hibernate" state and will attempt indefinitely to connect to slump, with an attempt made roughly every 10 seconds (observable in the logs).
  • Once connected to slump, the secondary proctor process will wait for a request from the primary's pdm_d_mgr (daemon manager) process to restart the secondary's shut-down daemon processes; in this case, all the processes on that server.
  • The primary pdm_d_mgr process will notice that the daemons on the secondary are not running and will ping the secondary proctor process to start them; in this case, all the processes on that server.
  • The secondary proctor will start all the processes, which execute the full initialization procedure, the same as when they were first started.
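The proctor's hibernate-and-retry behavior described above can be sketched as a loop with a 10-second interval (the interval comes from the log observations; the function is illustrative, with the sleep call injectable so the sketch can be exercised without real waiting):

```python
import time

def hibernate_until_connected(connect_fn, interval=10.0, sleep=time.sleep):
    """Proctor-style hibernation: keep retrying the slump connection every
    `interval` seconds, indefinitely, until connect_fn returns a connection."""
    attempts = 0
    while True:
        attempts += 1
        conn = connect_fn()
        if conn is not None:
            return conn, attempts
        sleep(interval)  # wait out the interval, then try again

# Demonstration with a fake connect that succeeds on the third attempt.
state = {"calls": 0}
def fake_connect():
    state["calls"] += 1
    return "slump-session" if state["calls"] >= 3 else None

waits = []
conn, attempts = hibernate_until_connected(fake_connect, interval=10.0,
                                           sleep=waits.append)
```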