High CPU Utilization On Linux NAC

Document ID : KB000122394
Last Modified Date : 04/12/2018
Introduction:
The NAC is performing very poorly. Browsing the ROC takes a long time (more than a minute) to move from one screen to another. After logging in to the NAC server, top shows that the java process used by the management server (nac) is consuming the majority of the CPU.
Question:
What information is needed to investigate high CPU utilization on management servers installed on Linux?
Environment:
CA Release Automation Management Server on Linux
Answer:
  1. A screenshot of top showing the java process used by the management server.
  2. Repeat the following steps 3-4 times, waiting about 30 seconds between repetitions, to get a representative sample of what is using the CPU.
    • Trigger a thread dump of the RA process with this command: kill -3 `cat /opt/LISAReleaseAutomationServer/catalina.pid`

    • Run the command: top -H
      • Important: top refreshes periodically. The goal is to capture a screenshot of the top -H output while it shows the threads using the most CPU (java threads, if the management server is the cause). Because some screenshot utilities take time to position, press the 'd' key (without the apostrophes) to pause the display when you see a good set of sample data. The 'd' key prompts for the refresh interval in seconds, and the display usually does not change while top waits for a response. After taking the screenshot, enter a number to resume, then continue with the next iteration of kill -3 and top -H.
  3. The management server's logs folder, collected only after finishing #1 and all iterations of #2. This matters because the 'kill -3' output is written to files within the logs folder; gathering the folder too soon will omit the data needed to investigate.
  4. The contents of /proc/cpuinfo
  5. The contents of /proc/meminfo
  6. ps -ef > ps.out
  7. netstat -aonp > netstat.out
  8. lsof -p `cat /path/to/RAInstallDir/catalina.pid` > lsof.out
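
The repeated kill -3 and top -H captures in step 2 can also be scripted. The following is a minimal sketch, not an official collection tool: the function name capture_samples, the output directory, and the file naming are assumptions made for illustration. It uses top in batch mode (-b -n 1) to write a text snapshot instead of a screenshot, so check with support whether text output is acceptable in place of screenshots.

```shell
#!/bin/sh
# Sketch only: capture_samples and its arguments are hypothetical names,
# not part of the product. Adjust the pid file path to your installation.

capture_samples() {
    pid_file=$1        # e.g. /opt/LISAReleaseAutomationServer/catalina.pid
    out_dir=$2         # directory that receives the top -H snapshots
    rounds=${3:-4}     # the article recommends 3-4 repetitions
    pause=${4:-30}     # seconds to wait between repetitions

    pid=$(cat "$pid_file") || return 1
    mkdir -p "$out_dir" || return 1

    i=1
    while [ "$i" -le "$rounds" ]; do
        # Signal 3 (SIGQUIT) asks the JVM to write a thread dump to its own
        # log output; the java process keeps running.
        kill -3 "$pid" || return 1

        # One batch-mode iteration of the per-thread view for that process.
        top -H -b -n 1 -p "$pid" > "$out_dir/top-H.$i.out"

        if [ "$i" -lt "$rounds" ]; then
            sleep "$pause"
        fi
        i=$((i + 1))
    done
}
```

A call such as capture_samples /opt/LISAReleaseAutomationServer/catalina.pid /tmp/ra-cpu 4 30 would take four samples 30 seconds apart. The thread dumps themselves still land in the logs folder, so that folder must be collected afterwards as described in step 3.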

NOTE: It is recommended not to capture/send this data when:
  • The mgmt server has been running for less than 20-30 minutes since its services were stopped/started.
  • The mgmt server is still recovering jobs after a stop/start. After a restart it may spend resources trying to recover/resume jobs that were started before it was stopped. Jobs should always be stopped before stopping services; when that cannot be done, stop/cancel the jobs once the services are up and running. If this scenario applies, give the mgmt server another 10-20 minutes after the jobs have been successfully stopped. If CPU utilization is still high after that wait, proceed with the data collection.

If the issue does not recur once the two conditions in the note above have passed, wait and collect the data when the real problem occurs, rather than gathering, sending, and reviewing data for these expected conditions.
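
The first condition in the note can be checked before collecting anything. The sketch below is an illustration under stated assumptions: the function name uptime_ok and the 1800-second default (30 minutes, the upper end of the 20-30 minute window) are choices made here, and it relies on the etimes output field of the procps-ng ps command, which older or minimal ps implementations may not provide.

```shell
#!/bin/sh
# Sketch only: uptime_ok is a hypothetical helper, not part of the product.
# Returns 0 if the process has been running at least min_seconds,
# 1 if it has not, and 2 if the elapsed time could not be determined.

uptime_ok() {
    pid_file=$1               # e.g. /opt/LISAReleaseAutomationServer/catalina.pid
    min_seconds=${2:-1800}    # assumed default: 30 minutes

    pid=$(cat "$pid_file") || return 2

    # etimes prints the elapsed time since the process started, in seconds.
    elapsed=$(ps -o etimes= -p "$pid" | tr -d ' ')
    [ -n "$elapsed" ] || return 2

    if [ "$elapsed" -ge "$min_seconds" ]; then
        return 0
    fi
    return 1
}
```

For example, running uptime_ok /opt/LISAReleaseAutomationServer/catalina.pid && echo "ok to collect" prints the message only when the java process is old enough. It does not cover the second condition: whether recovered jobs have been stopped still has to be confirmed separately.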