The short version (executive summary):
Agent Gateway is a CA product formerly known as Secure Proxy Server (SPS); for the purposes of this article I'll refer to it as Agent Gateway, or Ag for short.
When one back-end server for Agent Gateway goes down, all of the connection pool entries and worker threads become clogged with transactions for the one down (or very slow) "bad" back-end server. Requests destined for the other "good" working back-end servers are also held up and do not get processed. The effect is that the one "bad" back-end server tends to drive the whole Agent Gateway offline, leaving it unable to process any requests.
The solution is to give the back-end connection pool a fixed maximum size that is smaller than the number of available worker threads, and to give a quick timeout to any request trying to obtain a back-end connection once that pool is full. Then when one back-end server fails, it still hogs worker threads, but only up to the connection pool limit, and importantly it leaves the remaining threads free to handle requests to the non-hung back-end servers.
The settings to change in server.conf are:
http_connection_pool_max_size="100" # (cap the back-end connection pool at 100)
http_connection_pool_wait_timeout="200" # (wait at most 200ms for a free connection once the pool is full)
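To see why the cap must sit below the worker thread count, here is a quick back-of-envelope sketch in Python; the worker thread total of 150 is an illustrative assumption, not a value from this article:

    # Why http_connection_pool_max_size must be below the worker thread count.
    worker_threads = 150   # assumed total worker threads (illustrative)
    pool_max = 100         # http_connection_pool_max_size
    wait_timeout_ms = 200  # http_connection_pool_wait_timeout

    # A dead back-end can tie up at most pool_max threads at a time...
    hogged = min(worker_threads, pool_max)
    # ...leaving the rest free to serve the healthy back-ends.
    free_for_good_backends = worker_threads - hogged
    print(free_for_good_backends)  # 50
    # Requests beyond the cap fail fast after wait_timeout_ms instead of queueing.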
The longer version:
1. Background - how back-end response time drives the pool and thread sizes
Because Agent Gateway acts as a proxy, it holds onto the connection from the front end (client) while it sends the request to the back-end server to be processed. For a working back-end system the response time is generally fairly quick, and the number of connections/worker threads needed to maintain throughput is minimal.
So for example, as in the diagram below:
If we have a requirement of 100 requests/sec, and the back-end response time is 200ms, then we need enough connection/thread bandwidth to handle about 20 requests in parallel (100 requests/sec x 0.2 sec = 20 in flight). So we will need: 20 httpd worker threads in apache; 20 connections via mod_jk from apache to tomcat; 20 worker threads in tomcat; and 20 connections to the back-end server. Note that the SPS is mostly idle: for most of the elapsed 200ms the threads are simply waiting for the response from the back-end server.
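That calculation is just Little's law (in-flight requests = arrival rate x response time); a minimal sketch of the arithmetic above:

    # In-flight requests = arrival rate x response time (Little's law).
    rate_per_sec = 100    # required throughput
    response_sec = 0.2    # healthy back-end response time (200ms)
    in_flight = rate_per_sec * response_sec
    print(in_flight)      # 20.0 -> ~20 threads/connections needed at each stage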
However, if the back-end server's performance slows down so that each transaction takes 2sec rather than 200ms, then the number of in-progress transactions that Ag holds open at one time increases: 100 trans/sec x 2sec = 200 open transactions. So now each component needs a bandwidth of 200: a 200-entry connection pool, a 200-thread pool, and so on. Obviously, if the back-end server gets even slower, the pool/thread size requirements continue to increase.
Ultimately, if we reach the stage where the back-end server is down, then for the original 100 requests/sec load (with the default settings of a 60sec timeout and a 3x retry) each transaction takes 180sec before it sends its failure response back to the client.
Under those conditions the connection/thread pool size we would need is 100 x 180 = 18,000 at every stage: 18,000 httpd threads; 18,000 connections from clients to httpd, from apache via mod_jk to tomcat, for tomcat threads, and from tomcat to the back-end.
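The same formula covers the degraded cases too; a short sketch using the timeout-and-retry figures quoted above:

    # Same formula as before: open transactions = rate x time each is held open.
    rate = 100  # requests/sec
    for label, held_open_sec in [("slow back-end (2sec)", 2.0),
                                 ("down back-end (60sec timeout x 3 retries)", 180.0)]:
        print(label, int(rate * held_open_sec))  # 200, then 18000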
Obviously, well before that we will have hit some limit - probably a 150 thread pool size in apache, or 600 or 1000, depending on what you have it set to. But the important thing is that when the back-end server is down, the SPS is flooded with waiting requests, and we can't realistically (or meaningfully) hold open all of those requests for all of those retries.
But this is not the problem we are solving - this is the background to the problem.
2. The Problem - one bad back-end can stop all activity
Now, generally an Agent Gateway server has multiple back-end servers. And if one back-end server goes down then, as we've seen, it leaves a heavy footprint on the internal Ag infrastructure, blocking up all the pipes and stopping access to all back-end servers, not just the non-working one.
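A toy illustration of that head-of-line blocking (this is a generic thread pool sketch, not Ag internals; the pool size and delays are assumptions):

    # One hung back-end starves a shared worker pool, blocking healthy traffic.
    from concurrent.futures import ThreadPoolExecutor
    import time

    def call_backend(name, delay_sec):
        time.sleep(delay_sec)  # stands in for waiting on the back-end response
        return name

    pool = ThreadPoolExecutor(max_workers=20)  # the shared worker threads
    # 20 requests to the hung back-end arrive first and occupy every worker...
    hung = [pool.submit(call_backend, "bad", 180) for _ in range(20)]
    # ...so a request for a healthy back-end just queues behind them.
    good = pool.submit(call_backend, "good", 0.2)
    time.sleep(1)
    print(good.done())  # False - the "good" request never got a worker
    # (the script itself lingers at exit until the hung calls finish)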
3. Connection Pool Size
Here is the pattern of connection/thread pool sizes that is best for throughput if the Agent Gateway has only one back-end server. The design is that at each stage the next pipe's bandwidth is slightly larger than the previous one's. With that model all incoming requests will be forwarded on to the next stage, and ultimately to the back-end server, so there is no internal bottleneck within Ag.
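A sketch of that graduated sizing; the numbers are illustrative assumptions only, chosen to show the shape (each stage slightly wider than the one feeding it):

    # Each pipe is slightly wider than the one before it, so nothing queues inside Ag.
    stages = [
        ("httpd worker threads", 200),
        ("mod_jk connections, apache -> tomcat", 205),
        ("tomcat worker threads", 210),
        ("connections to the back-end server", 215),
    ]
    for (prev, p_size), (nxt, n_size) in zip(stages, stages[1:]):
        assert p_size < n_size, f"'{nxt}' must be sized above '{prev}'"
    print("no internal bottleneck: every stage is wider than the one before")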
The reality for Ag is that in normal operating conditions connection pool usage is low - often only 5, 10 or 20 active requests, depending on the type of transaction - and the pool sizes and thread counts only climb to values of 100 or more when there are delays or problems with the back-end servers.