We have deployed 'nutanix_monitor' for one of our customers.
We are getting the alert below quite often:
QOS_RESOURCE_RESPONSE_TIME = 11323.0 on <SERVERNAME> has crossed the critical static threshold of >10000.0
The alert stays active for only a very short time and then clears.
Can you please explain what this metric is and how it is measured?
Is it a ping response or a response from the entire cluster?
UIM 9.02 and earlier
nutanix_monitor 1.53 and earlier
No, this is not a ping request.
The probe sends a POST request with a JSON payload and waits for the response.
The probe sets the start time and then the end time is marked when we get a response back from the API service.
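As a rough illustration of that measurement (not the probe's actual code), the timing can be sketched as below. The names timed_call, crosses_critical and send_request are hypothetical, and treating the metric value as milliseconds is an assumption; only the threshold of 10000.0 comes from the alert text above.

```python
import time

# Critical static threshold taken from the alert text.
# ASSUMPTION: the metric is reported in milliseconds.
CRITICAL_THRESHOLD = 10000.0


def timed_call(send_request):
    """Run send_request() and return (response, elapsed milliseconds).

    send_request is a hypothetical callable standing in for the POST
    request with a JSON payload that the probe issues.
    """
    start = time.monotonic()      # start time set before the request
    response = send_request()     # wait for the API service to answer
    elapsed_ms = (time.monotonic() - start) * 1000.0  # end time minus start
    return response, elapsed_ms


def crosses_critical(elapsed_ms, threshold=CRITICAL_THRESHOLD):
    """True when the measured time is strictly above the threshold."""
    return elapsed_ms > threshold
```

With this sketch, a value such as 11323.0 crosses the threshold, while a later, faster response does not, which would match an alert that appears briefly and then clears.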
In one section of code where this check is done, there is the following note from the dev team.
/* We are seeing a lot of failures with the Nutanix API, even with Pagination. Upon contacting their
developer support people they suggested no more than 5000 VMs could be reliably queried at one and to
definitely use pagination. But we found that even then we would encounter a high percentage of failures.
A periodic failure would be acceptable, so we developed the following algorithm. Upon the first failure,
we'd sleep for a second and reissue. This seems to catch about 50% of the initial failures. But when that fails,
we then wait for TEN seconds and then retry. We've seen this work in about 40% of the remaining situations.
For those times it fails, we give up and exit. That seems like the best thing we can hope for right now. */
So it would appear that the Nutanix API itself has known performance issues, and the probe code does what it can to compensate for them. However, it cannot completely overcome them.
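The retry behaviour described in the dev team's note can be sketched roughly as follows. This is a minimal sketch, not the probe's actual implementation: query_with_retries, send_request and the sleep parameter are hypothetical names, and only the one-second and ten-second waits come from the note.

```python
import time

# Waits taken from the dev note: one second after the first failure,
# ten seconds after the second. Everything else here is illustrative.
RETRY_DELAYS_S = (1, 10)


def query_with_retries(send_request, delays=RETRY_DELAYS_S, sleep=time.sleep):
    """Call send_request(), retrying after each delay in `delays`.

    send_request is a hypothetical callable that raises an exception
    when the API call fails. After the last retry fails, the exception
    is re-raised, which corresponds to "we give up and exit".
    """
    for attempt in range(len(delays) + 1):
        try:
            return send_request()
        except Exception:
            if attempt == len(delays):
                raise              # no retries left: give up
            sleep(delays[attempt])  # wait 1 s, then 10 s, before retrying
```

Taken at face value, the success rates quoted in the note mean that roughly 30% of initial failures (0.5 x 0.6) would still end with the probe giving up.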