AppLogic - What is the meaning of the "HA check failed: There are not enough available resources to restart components running on 2 servers [srv2,srv3]" messages in the controller?

Document ID : KB000020999
Last Modified Date : 14/02/2018

Description:

Messages of the form "HA check failed: There are not enough available resources to restart components running on 2 servers [srvX,srvY, srvZ…]" are sometimes observed in the controller logs. They indicate that, if any one of the listed servers goes down for whatever reason, the resources available on the remaining grid servers are not sufficient to restart all the components that were running on it.

Solution:

One of the main features of AppLogic is Application High Availability (HA). This means that if one of the grid servers goes down, the remaining servers must be able to take over its role and restart the components formerly running on that server.

Although the resources of a grid are often described as a total amount of CPU, Memory and Bandwidth, each node contributes a specific amount to that global figure, and each component has its own requirements. Therefore, the ability to restart components when a server goes down is constrained by:

  • The amount of resources available on each server node, that is, CPU, Memory and Bandwidth
  • The amount of resources required by each component. A component needs to fit in its entirety on a single node (e.g. it is not possible to allocate CPU on one node and Bandwidth on another for the same component)
  • The scheduling algorithm. If a server goes down, AppLogic will try to allocate its applications to the remaining servers using a pack scheduling algorithm (the servers are ordered by their available resources and those with fewer resources are filled first), leaving the controller as the last server on which appliances are started
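
The second constraint is the one most often overlooked, so a minimal Python sketch may help. It is not AppLogic code: the `Resources` class, the node names and all the figures are hypothetical and chosen only to show that a component can fit in the grid-wide totals and still fit on no single node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu: float  # CPU units
    mem: int    # MB
    bw: int     # Mbps

    def fits_in(self, free: "Resources") -> bool:
        # A component must fit entirely on one node: all three
        # dimensions have to be satisfied by the same server.
        return self.cpu <= free.cpu and self.mem <= free.mem and self.bw <= free.bw


# Hypothetical free capacity per node (not taken from a real grid).
free_by_node = {
    "nodeA": Resources(cpu=2.0, mem=1024, bw=100),
    "nodeB": Resources(cpu=0.5, mem=8192, bw=1000),
}
component = Resources(cpu=1.0, mem=4096, bw=200)

# Grid-wide totals suggest there is room for the component...
total = Resources(
    cpu=sum(r.cpu for r in free_by_node.values()),
    mem=sum(r.mem for r in free_by_node.values()),
    bw=sum(r.bw for r in free_by_node.values()),
)
fits_globally = component.fits_in(total)

# ...but no single node can host it.
candidates = [name for name, free in free_by_node.items() if component.fits_in(free)]

print(fits_globally, candidates)
```

Here nodeA lacks memory and bandwidth while nodeB lacks CPU, so the list of candidate nodes is empty even though the summed free resources exceed what the component needs.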

As a result, HA may not be possible even though enough resources are available globally, and even though individual nodes still have free resources.

Let's consider an example. Suppose a grid has the following distribution of resources, where each pair of figures is used/available:

 server srv1 : role primary, state up(enabled), 4.25/3.65 cpu, 12797/10142 MB mem, 801/1199 Mbps bw
 server srv2 : role secondary, state up(enabled), 7.00/1.00 cpu, 21468/2239 MB mem, 1800/200 Mbps bw
 server srv3 : role secondary, state up(enabled), 6.00/2.00 cpu, 12270/11437 MB mem, 911/1089 Mbps bw
 server srv4 : role none, state up(enabled), 7.95/0.05 cpu, 22164/1543 MB mem, 1651/349 Mbps bw
 server srv5 : role none, state up(enabled), 8.00/0.00 cpu, 16384/7323 MB mem, 1411/589 Mbps bw
 server srv6 : role none, state up(enabled), 6.75/1.25 cpu, 17372/6335 MB mem, 1350/650 Mbps bw
 server srv7 : role none, state up(enabled), 6.00/2.00 cpu, 20592/3115 MB mem, 1780/220 Mbps bw

And the list of applications running on srv2 is the following:

 1 AP1: running, 0.50 cpu, 1536 MB, 500 Mbps
 2 AP2: running, 1.00 cpu, 6144 MB, 500 Mbps
 3 AP3: running, 0.25 cpu, 750 MB, 100 Mbps
 4 AP4: running, 3.00 cpu, 6144 MB, 300 Mbps
 5 AP5: running, 2.00 cpu, 6144 MB, 300 Mbps
 6 AP6: running, 0.25 cpu, 750 MB, 100 Mbps

In this particular case the components on srv2 require 7.00 CPU, 21468 MB and 1800 Mbps in total, so in theory there are enough resources globally in the grid to accommodate them. However, the message

HA check failed: There are not enough available resources to restart components running on 1 servers [srv2]

will be logged in the controller (more servers may have the same problem; only srv2 is shown here for explanatory purposes).
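
The statement that the grid has enough free resources globally can be checked with some quick arithmetic. The following Python sketch simply sums the figures from the two listings above; the variable names are illustrative, and the free values are the second figure of each pair for every server other than srv2.

```python
# (cpu, MB, Mbps) required by each application running on srv2.
apps = {
    "AP1": (0.50, 1536, 500),
    "AP2": (1.00, 6144, 500),
    "AP3": (0.25, 750, 100),
    "AP4": (3.00, 6144, 300),
    "AP5": (2.00, 6144, 300),
    "AP6": (0.25, 750, 100),
}

# Free (cpu, MB, Mbps) on every server except srv2.
free = {
    "srv1": (3.65, 10142, 1199),
    "srv3": (2.00, 11437, 1089),
    "srv4": (0.05, 1543, 349),
    "srv5": (0.00, 7323, 589),
    "srv6": (1.25, 6335, 650),
    "srv7": (2.00, 3115, 220),
}

# Column-wise sums: total requirement vs. total free capacity.
required = tuple(sum(col) for col in zip(*apps.values()))
available = tuple(sum(col) for col in zip(*free.values()))

enough_globally = all(r <= a for r, a in zip(required, available))
print(required, available, enough_globally)
```

The requirement adds up to 7.00 CPU, 21468 MB and 1800 Mbps, against roughly 8.95 CPU, 39895 MB and 4096 Mbps free on the other servers, so the global comparison passes. It says nothing, however, about whether each component fits on an individual server.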

In this case, if srv2 fails, AppLogic will try to allocate its components to the servers that can host them, starting with the one with the least resources available: srv6, then srv7, then srv3 and finally the controller, srv1. So, in this case:

  • Servers srv4 and srv5 cannot be used to restart any application, as they have virtually no free CPU (0.05 and 0.00 respectively)

  • AP1 and AP3 would be started on srv6. After this srv6 would still have 0.50 CPU, 4049 MB Mem and 50 Mbps bandwidth free, but this is not enough for any of the remaining components

  • AP6 would be started on srv7. After this srv7 would still have 1.75 CPU, 2365 MB Mem and 120 Mbps bandwidth free, but it would not be able to accommodate any more applications from srv2, as it has too little bandwidth left

  • AP2 would be started on srv3, after which srv3 would have 1.00 CPU, 5293 MB Mem and 589 Mbps bandwidth free. No more applications could be started on any of these servers, since AP4 and AP5 need 3.00 and 2.00 CPU respectively

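  • AP4 and AP5 still need to be restarted, but the only server left is the controller itself, srv1, and it can accommodate either one or the other, but not both (together they need 5.00 CPU and 12288 MB, while srv1 has 3.65 CPU and 10142 MB free)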
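
The walkthrough above can be reproduced with a short simulation. This is only a sketch under stated assumptions, not the actual AppLogic scheduler: it assumes servers are ordered by free CPU, then memory, then bandwidth (least first) with the controller last, and that each server in turn is filled with every pending application that still fits. The real ordering and tie-breaking rules are not documented here.

```python
# Free (cpu, MB, Mbps) on the surviving servers, from the listing above.
free = {
    "srv1": [3.65, 10142, 1199],  # controller
    "srv3": [2.00, 11437, 1089],
    "srv4": [0.05, 1543, 349],
    "srv5": [0.00, 7323, 589],
    "srv6": [1.25, 6335, 650],
    "srv7": [2.00, 3115, 220],
}

# (cpu, MB, Mbps) required by each application that was running on srv2.
apps = {
    "AP1": (0.50, 1536, 500),
    "AP2": (1.00, 6144, 500),
    "AP3": (0.25, 750, 100),
    "AP4": (3.00, 6144, 300),
    "AP5": (2.00, 6144, 300),
    "AP6": (0.25, 750, 100),
}

CONTROLLER = "srv1"

# ASSUMPTION: least free resources first (cpu, mem, bw), controller last.
order = sorted(free, key=lambda s: (s == CONTROLLER, *free[s]))

pending = list(apps)   # applications still waiting for a server
placement = {}         # application -> server chosen for it

for srv in order:
    for name in list(pending):
        need = apps[name]
        # The application must fit entirely on this one server.
        if all(n <= f for n, f in zip(need, free[srv])):
            free[srv] = [f - n for f, n in zip(free[srv], need)]
            placement[name] = srv
            pending.remove(name)

print("placed:  ", placement)
print("unplaced:", pending)
```

Under these assumptions the sketch places AP1 and AP3 on srv6, AP6 on srv7, AP2 on srv3 and AP4 on srv1, and leaves AP5 without a server, which matches the outcome described in the list above.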

Hence, in this example, HA cannot be ensured if srv2 goes down. In general, it is recommended to keep at least one node in the grid with almost no applications running, so that it can accommodate components if one or several of the other nodes go down. Grids should be provisioned with enough resources that they are not running at the limit of their capacity.