This document covers the reconnection logic used by CA Service Desk Manager between primary and secondary servers as well as some common problem scenarios.
This document does not cover the "Advanced Availability" features of 12.9.
How does CA Service Desk Manager (SDM) behave within a network:
The Service Desk installation consists of a Primary server and any number of Secondary servers. Each server contains multiple SDM daemon processes. The
main SDM communication manager process, called Slump (sslump_nxd), runs on the primary server only.
As noted, the Service Desk application is spread across multiple processes that can be distributed across multiple servers. Each process (when it starts)
connects to a single known process which we call our "slump" process. Since all processes connect to slump, the slump process is instrumental in knowing
how to route messages to other connected processes. In its basic form, all communication between processes is routed through the slump process. To provide
improved scalability and performance, some processes can create a "fast-channel" between each other. Again, since slump is connected to
each process, it helps both processes set up the communication. Each connection, whether via slump or fast-channel, uses a TCP port. A process such as
a webengine will have multiple ports opened to other processes (domsrvr, bpvirtdb_srvr, etc.) as well as one to slump.
All processes connect to slump initially and stay connected to slump. Each process connected to slump always has a TCP port open with a number above
2100, though not necessarily close to 2100. By default, TCP port 2100 is slump's listener port. If your installation is configured to use slump
communication only, each Service Desk process uses its existing slump connection to communicate with other processes, and slump handles communication
between processes over the TCP ports it already has open to each of them. If the NX_SLUMP_FIXED_SOCKETS variable is set, the slump server is forced to open
ports as close as possible to 2100. This is required in firewalled environments, because the firewall must be configured with a known range of ports to keep
open so that Service Desk usage is not negatively affected.
If Service Desk is configured to use fast-channel connections, a Service Desk process asks slump to set up a fast-channel connection to the
other process. The sequence is: process A asks slump for a fast-channel connection to process B; slump notifies process B to open a port for
process A; process B reports the new port information to slump; slump passes it to process A. From then on, processes A and B communicate directly with
each other without slump server involvement. Ports created this way are random.
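The fast-channel setup sequence described above can be illustrated with a minimal Python sketch. This is not SDM's actual protocol or code: the Broker and Process classes, their method names, and the use of localhost sockets are hypothetical stand-ins chosen only to show the order of the steps. The broker plays the role of slump, and binding to port 0 asks the operating system for a free port, which mirrors the "random port" behavior.

```python
import socket
import threading


class Broker:
    """Hypothetical stand-in for slump: it only brokers the channel setup."""

    def __init__(self):
        self.processes = {}

    def register(self, name, process):
        self.processes[name] = process

    def request_fast_channel(self, target):
        # Notify the target to open a port, then pass the port back to the caller.
        return self.processes[target].open_listener()


class Process:
    """Hypothetical stand-in for an SDM daemon process."""

    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.received = []
        broker.register(name, self)

    def open_listener(self):
        # Port 0 lets the OS choose a free port for the direct channel.
        server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        server.bind(("127.0.0.1", 0))
        server.listen(1)
        threading.Thread(target=self._serve, args=(server,), daemon=True).start()
        return server.getsockname()[1]

    def _serve(self, server):
        conn, _ = server.accept()
        with conn, server:
            data = conn.recv(1024)
            self.received.append(data)
            conn.sendall(b"ack:" + data)

    def send_direct(self, target, payload):
        # Ask the broker for a channel, then talk to the target directly.
        port = self.broker.request_fast_channel(target)
        with socket.create_connection(("127.0.0.1", port), timeout=5) as sock:
            sock.sendall(payload)
            return sock.recv(1024)


def demo():
    broker = Broker()
    a = Process("A", broker)
    b = Process("B", broker)
    reply = a.send_direct("B", b"hello")
    return reply, b.received
```

After the broker has handed over the port number, the payload travels between A and B without passing through the broker, which is the point of the fast-channel design.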
By default, Service Desk is configured to use fast-channel connections. This is controlled by the NX_NOFASTCHAN variable. Changing Service Desk to use
slump-only communication would cause a performance hit on the primary server, where the slump server is running. Even when fast-channel is enabled, not all
SDM processes utilize it; those that do not still rely on slump to send messages back and forth between processes.
In support, whenever there are problems with connectivity, we have recommended that customers use Tcpview to monitor Service Desk processes and the ports they have opened.
There is currently no verbose logging within the product to show which port numbers are being used, hence the recommendation to use the Tcpview tool.
The TCP ports opened are random unless the fixed sockets variable (NX_SLUMP_FIXED_SOCKETS) is set, in which case they are opened at 2100 or above and as close to 2100 as possible.
It is ideal to have both Fixed Sockets and Fast Channel enabled.
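As a complement to Tcpview, a quick reachability check of slump's listener port from a secondary server can help confirm that a firewall is not blocking it. The following is a minimal sketch using only the Python standard library. The function name check_listener and the timeout value are assumptions made for illustration, and the check only completes a TCP handshake; it does not speak the slump protocol.

```python
import socket
import time

SLUMP_LISTENER_PORT = 2100  # default slump listener port, as stated above


def check_listener(host, port=SLUMP_LISTENER_PORT, timeout=3.0):
    """Return (reachable, elapsed_seconds) for a single TCP connection attempt."""
    start = time.monotonic()
    try:
        # create_connection raises OSError on refusal, timeout, or routing failure.
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:
        reachable = False
    return reachable, time.monotonic() - start
```

For example, check_listener("primary-server") could be run from a secondary server, where "primary-server" is a placeholder for the real host name of the primary.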
SDM Reconnection Attempt Logic:
When two Service Desk processes communicate with each other over a fast-channel connection, without slump involvement, and that connection is lost,
the process that first reports the event will NOT shut down but will attempt to reconnect. Reconnection is attempted quickly, within a
few seconds. There is no code to tell the process to wait. Reconnection attempts continue indefinitely.
When a Service Desk process needs to communicate with the Slump process for message routing and that connection is lost, the process that first
reports the event WILL shut down. In this case, there is a perception that Slump on the primary is unreachable. This is considered a severe event, and as
such we request our processes to shut down. Slump is our communication manager, and any perception that it is unavailable warrants shutting down
processes and entering a hibernation state. This allows the IT team to investigate the status of the primary and take manual steps if automatic recovery
does not take place. In this case, multiple processes on a secondary server will eventually report the same event, and all main daemons are told to shut
down. In most cases the Proctor on the secondary will also hit the problem, restart, and then hibernate while waiting for the connection with slump to be
re-established. The Proctor process is the only process that hibernates on a secondary server. It retries indefinitely to reach slump until it succeeds. Once
proctor is communicating with slump, the daemon manager on the primary notices the processes that are not running on the secondary and requests a restart.
Those processes then connect to slump. If for some reason startup fails, the process is restarted again, up to ten times, before giving up.
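The restart limit described above can be sketched in a few lines of Python. This is an illustrative model only, not the daemon manager's real implementation: the class DaemonManagerSketch, its method names, and the start_process callable are hypothetical. It also assumes that the count is only cleared by an explicit refresh, which matches the role pdm_d_refresh plays but is not confirmed in detail by this document.

```python
MAX_RESTARTS = 10  # per-process limit stated in the text


class DaemonManagerSketch:
    """Hypothetical model of a per-process restart counter."""

    def __init__(self, start_process):
        # start_process(name) is assumed to return True when startup succeeds.
        self.start_process = start_process
        self.restart_counts = {}

    def ensure_running(self, name):
        """Try to start the process until it succeeds or the limit is reached."""
        while self.restart_counts.get(name, 0) < MAX_RESTARTS:
            self.restart_counts[name] = self.restart_counts.get(name, 0) + 1
            if self.start_process(name):
                return True
        # Limit reached: no further attempts until the counts are reset.
        return False

    def refresh(self):
        """Reset all counts (assumed analogue of running pdm_d_refresh)."""
        self.restart_counts.clear()
```

Note that there is no delay between attempts in this sketch, which mirrors the statement that starting processes do not wait before connecting.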
How a disconnection from slump is detected by SDM:
The processes determine that the connection to the slump server is not available when they attempt to read or write data to the socket. A process usually
registers a callback function with the slump layer so that it is notified if the connection is broken. This function is called when a disconnection
happens so that the process can take action. For example, the webengine terminates as a result of the disconnection once the callback function is invoked.
The socket layer (OS network layer) usually returns a socket error code (for example, WSAECONNRESET (10054)) to the process's read/write calls when it
attempts to read or write data on an already established socket connection. The Service Desk process logs this error in the standard logs:
09/22 14:28:26.47 SERVER domsrvr:21 3868 INFORMATION socket_port.c 1582 Error: WSAECONNRESET (10054) reading from
SOCKET_PORT(0x024F9958) description = TCP/IP port_name = Slump Port status = 0 ip address = 18.104.22.168 compression = 1 extra_flags = 0 file
descriptor = 288 write pending = 0 handler type = DATA read count = 2434167 write count = 1683967 socket = 0
Here is the description of the error code:
Connection reset by peer.
An existing connection was forcibly closed by the remote host. This normally results if the peer application on the remote host is suddenly
stopped, the host is rebooted, the host or remote network interface is disabled, or the remote host uses a hard close (see setsockopt for more information on the
SO_LINGER option on the remote socket). This error may also result if a connection was broken due to keep-alive activity detecting a
failure while one or more operations are in progress. Operations that were in progress fail with WSAENETRESET. Subsequent operations fail
with WSAECONNRESET.
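The detection pattern described in this section, where a broken connection is noticed on a read and a registered callback is then invoked, can be sketched in Python as follows. The class name SlumpLinkSketch and its interface are hypothetical and do not represent the SDM slump layer. In Python, a reset by the peer surfaces as ConnectionResetError, and an orderly close by the peer surfaces as an empty read.

```python
import socket


class SlumpLinkSketch:
    """Hypothetical wrapper that reports a broken connection via a callback."""

    def __init__(self, sock, on_disconnect):
        self.sock = sock
        self.on_disconnect = on_disconnect  # called with a short reason string

    def read(self, size=4096):
        """Return received bytes, or None after notifying the callback."""
        try:
            data = self.sock.recv(size)
        except (ConnectionResetError, ConnectionAbortedError, BrokenPipeError) as exc:
            # Connection reset or aborted: the OS reported a socket error.
            self.on_disconnect(f"socket error: {exc!r}")
            return None
        if data == b"":
            # An empty read means the peer closed the connection.
            self.on_disconnect("peer closed the connection")
            return None
        return data
```

A process that treats the loss as severe, as the webengine does, would exit from inside the callback; a process using a fast-channel would instead start its reconnection attempts there.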
"BPServer:: init: couldn't logon to slump!" messages:
If multiple connection attempts to the primary server fail, then the message above will be printed to the stdlogs on the secondary server.
In this scenario, the webengine on the secondary loses its connection to the Slump process on the primary. As such, a slump connection error message is
reported in the secondary SDM log file.
Message: "EXIT webengine.c 1152 Slump died detected"
The webengine process on the secondary will shut down. Any perceived problem communicating with the primary Slump process is considered severe, and as
such we request the SDM process to shut down and restart itself.
The secondary proctor process will wait for a request from the primary's pdm_d_mgr (daemon manager) process to restart the secondary's shut-down daemon
processes, in this case the webengine process.
The primary pdm_d_mgr process will notice that some daemons on the secondary are not running and will ping the secondary proctor process to start them,
in this case the webengine process.
The secondary webengine process is started and its initialization code is executed. During startup, the process opens a new
connection to Slump on the primary.
The new connection is made by the webengine as one of the first steps of startup, so it is established within
seconds, depending on how quickly the daemon manager notices that the webengine is down and requests proctor to restart it. If no further disruptions occur
between the daemon manager and the secondary, or between proctor and the primary, the whole sequence completes within seconds, again depending on the speed
of the servers and the network.
If for some reason the starting webengine cannot establish a connection or cannot log on to slump, it will be restarted and will try to establish
the connection up to 10 times before it gives up. Ten is our max restart limit per process.
If the max restart limit is reached, the only way to get the process running again is to restart the entire SDM services or to run pdm_d_refresh from the command line.
In the case below, the webengine restarted 10 times failing on each instance with "BPServer::init: couldn't logon to slump!" error message.
10/07 00:16:56.98 SERVER pdm_d_mgr 3548 ERROR daemon_obj.c 1781 Max restarts attempted for _web_eng_SERVER2 You may reset the count by running
pdm_d_refresh from the command line.
The connection was established with slump as other SDM processes were working fine.
Support has seen that the most common causes of this are VMware snapshots and backups. If not configured correctly, the backups/snapshots may take up
all of the bandwidth on the wire and cause connection drops for other applications.
After running pdm_d_refresh, the processes should restart if connectivity is re-established:
10/07 00:23:10.55 SERVER1 proctor_SERVER2 4888 SIGNIFICANT pdm_process.c 545 Process Started
(4960):D:/PROGRA~1/CA/SERVIC~1/bin/webengine -q -d domsrvr:21 -S web:SERVER2:1 -c
D:/PROGRA~1/CA/SERVIC~1/bopcfg/www/SERVER2-web1.cfg -r rpc_srvr:SERVER2
If connectivity still could not be established, then we would see messages like the following:
10/07 00:16:46.11 SERVER2 proctor_SERVER2 4888 SIGNIFICANT pdm_process.c 545 Process Started
(3636):D:/PROGRA~1/CA/SERVIC~1/bin/webengine -q -d domsrvr:21 -S web: SERVER2:1 -c
D:/PROGRA~1/CA/SERVIC~1/bopcfg/www/SDCBCBSVMSS02-web1.cfg -r rpc_srvr:SERVER2
10/07 00:16:46.57 SERVER2 web-engine 3636 EXIT bpserver.c 246 BPServer::init: couldn't logon to slump!
10/07 00:16:54.69 SERVER2 proctor_SERVER2 4888 SIGNIFICANT pdm_process.c
545 Process Started (3380):D:/PROGRA~1/CA/SERVIC~1/bin/webengine -q -d domsrvr:21
-S web:SERVER2 :1 -c D:/PROGRA~1/CA/SERVIC~1/bopcfg/www/SERVER2-web1.cfg -r rpc_srvr:SERVER2
10/07 00:16:55.15 SERVER2 web-engine 3380 EXIT bpserver.c 246 BPServer::init: couldn't logon to slump!
In the example above, 10 restart/connection attempts failed within 10 seconds (the duration depends on how fast the servers are). There is no code during
process startup to wait before establishing a connection. The process is starting and needs to establish a connection to proceed. A starting process has no
knowledge of what happened previously, such as the initial disconnection error. That is why a restart mechanism is in place, for up to ten attempts.
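When reviewing stdlogs for this pattern, it can help to count the failed logon attempts and measure how quickly they occurred. The sketch below parses log lines in the layout shown in the excerpts above. The regular expression, the function names, and the assumed year (the log timestamps carry no year) are illustrative choices, and the two sample lines are copied from the example in this document.

```python
import re
from datetime import datetime

# Layout inferred from the excerpts: date time host process pid level source line message
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<proc>\S+)\s+(?P<pid>\d+)\s+(?P<level>[A-Z]+)\s+"
    r"(?P<src>\S+)\s+(?P<line>\d+)\s+(?P<msg>.*)$"
)

SAMPLE = [
    "10/07 00:16:46.57 SERVER2 web-engine 3636 EXIT bpserver.c 246 BPServer::init: couldn't logon to slump!",
    "10/07 00:16:55.15 SERVER2 web-engine 3380 EXIT bpserver.c 246 BPServer::init: couldn't logon to slump!",
]


def logon_failures(lines, assumed_year=2000):
    """Return (timestamp, host, pid) for each 'couldn't logon to slump' line."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and "couldn't logon to slump" in match.group("msg"):
            # The year is an assumption; it is only needed to build a datetime.
            when = datetime.strptime(
                f"{assumed_year}/{match.group('ts')}", "%Y/%m/%d %H:%M:%S.%f"
            )
            hits.append((when, match.group("host"), int(match.group("pid"))))
    return hits


def failure_window_seconds(hits):
    """Seconds between the first and last failure (0.0 for fewer than two)."""
    if len(hits) < 2:
        return 0.0
    times = [hit[0] for hit in hits]
    return (max(times) - min(times)).total_seconds()
```

A short window with many failures from different process IDs points to repeated immediate restarts rather than one long-running process.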
Troubleshooting and Tools used to troubleshoot network disconnections between Primary and Secondary servers:
- Run Wireshark, or its command-line version "Tshark", with round-robin capture logging.
- Speak with your Firewall admin about any reported issues or anything that can be found in their logs. If this is a new problem, have there been any recent changes?
- Speak with your Network team about any reported issues in that time frame.
- Use a Network Monitoring tool to do regular ping tests between the primary and secondary servers to see if there is packet loss with the pings when
there is a problem in Service Desk.
- Are backups running on the same NIC as the application or on a backup NIC?
- When (if ever) are VMware snapshots performed? If this is a new problem, have there been any recent changes?
- Is there a WAN between the Primary and Secondary server? High latency may cause performance and stability problems in Service Desk.
Note: Service Desk 12.9 has a new feature called "Advanced Availability" which replaces the Primary / Secondary Server architecture and provides for
much greater availability.