Data Repository node consumes excessive CPU resources

Document ID : KB000004615
Last Modified Date : 14/02/2018
Issue:

A Data Repository node is consuming excessive CPU resources

Environment:
All CA Performance Manager releases; Observed in the GA r2.8 release
Cause:

A "zombie" process left over from before the upgrade is interfering with the problem node, which is node 1 in a 3-node cluster. Strictly speaking it is an orphaned process that is still running rather than a defunct one, because it is consuming 98% of the CPU resources on that Vertica node.

As a result, the database cannot be started or used.

The details of the offending process are:

  • Owned by dradmin user
  • Running for over 10 days, since 10/23/16
  • Parent PID is 1, which typically means the process is no longer attached to the session that started it: its original parent has exited and it has been adopted by init
  • The ps entry for the errant process, taken from the screenshots provided by the reporting customer:
    • dradmin  17995 1  92 Oct23 ?    10-05:19:27 /opt/vertica/bin/dialog --backtitle Vertica Analytic Database 7.0.2-5 Administration Tools --aspect 15 --help-button --menu Main Menu 16 60 9 1 View Database Cluster State 2 Connect to Database 3 Start Database 4 Stop Database 5 Restart Vertica on Host 6 Configuration Menu 7 Advanced Menu 8 Help Using the Administration Tools E Exit
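A process matching these criteria can be picked out of a process listing with a short filter. The following is a minimal sketch, not part of the original analysis: the function name find_orphaned_dialog is hypothetical, and it assumes the listing uses the standard ps columns user, pid, ppid, pcpu, etime and args, in that order.

```shell
# Hypothetical helper: read a ps listing on stdin and print the PID, %CPU and
# elapsed time of every /opt/vertica/bin/dialog process that is owned by the
# given user (default: dradmin) and whose parent PID is 1.
# Assumed column order: user pid ppid pcpu etime args
find_orphaned_dialog() {
    owner="${1:-dradmin}"
    awk -v owner="$owner" '
        $1 == owner && $3 == 1 && $6 == "/opt/vertica/bin/dialog" {
            print $2, $4, $5
        }'
}

# Live usage on a node (not executed here):
#   ps -eo user,pid,ppid,pcpu,etime,args | find_orphaned_dialog dradmin
```

Any line printed is a candidate for the errant process. A dialog process whose parent is a running adminTools session is filtered out because its parent PID is not 1.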

Note that this system had been upgraded 10 days earlier, from CAPM release 2.4.1 to release 2.8.

This /opt/vertica/bin/dialog process is started when the dradmin user launches the /opt/vertica/bin/adminTools UI. Under normal circumstances, when adminTools has been started properly, the process list looks something like the following.

  • The root user's CLI session (PID 23644) launches PID 24571 to switch to the dradmin user:
    • root     24571 23644  0 09:19 pts/2    00:00:00 su dradmin
  • The su process (PID 24571) starts a bash shell for dradmin under PID 24580:
    • dradmin  24580 24571  0 09:19 pts/2    00:00:00 bash
  • The dradmin bash shell (PID 24580) launches adminTools under PID 24659:
    • dradmin  24659 24580  0 09:19 pts/2    00:00:00 /opt/vertica/oss/python/bin/python ./adminTools
  • Once the adminTools UI appears in the terminal, adminTools (PID 24659) launches the dialog command under PID 25807:
    • dradmin  25807 24659  0 09:20 pts/2    00:00:00 /opt/vertica/bin/dialog --backtitle Vertica Analytic Database 7.1.2-6 Administration Tools --aspect 15 --help-button --menu Main Menu 16 60 9 1 View Database Cluster State 2 Connect to Database 3 Start Database 4 Stop Database 5 Restart Vertica on Host 6 Configuration Menu 7 Advanced Menu 8 Help Using the Administration Tools E Exit
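The parent chain described above can be traced mechanically from a saved process listing. This is a rough sketch only: the function name ancestry and the snapshot file are hypothetical, and the listing is assumed to have the columns pid, ppid, user and args, in that order.

```shell
# Hypothetical helper: ancestry PID SNAPSHOT_FILE
# Prints the snapshot line for PID, then for its parent, and so on, until a
# parent is reached that does not appear in the snapshot.
# Assumed column order in the snapshot: pid ppid user args
ancestry() {
    awk -v start="$1" '
        { ppid[$1] = $2; line[$1] = $0 }
        END {
            pid = start
            # Follow parent links; the seen[] guard stops on any loop.
            while ((pid in line) && !(pid in seen)) {
                print line[pid]
                seen[pid] = 1
                pid = ppid[pid]
            }
        }' "$2"
}

# Live usage on a node (not executed here):
#   ps -eo pid,ppid,user,args > /tmp/ps.snapshot
#   ancestry 25807 /tmp/ps.snapshot
```

For a healthy session the output walks from dialog through adminTools, bash and su. For the errant process the chain jumps straight from dialog to PID 1.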

Another key clue is that the errant process showed an older release of Vertica than was actually installed: it reported release 7.0.2-5 when the installed release was 7.1.2-6.
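The release string can be pulled out of the dialog command line so that it can be compared with the installed release. A minimal sketch, assuming the backtitle text has the form seen in the process entries above; the function name dialog_version is hypothetical.

```shell
# Hypothetical helper: read ps lines on stdin and print the Vertica release
# embedded in the dialog backtitle, e.g. "7.0.2-5". Lines without a backtitle
# of the assumed form produce no output.
dialog_version() {
    sed -n 's/.*Vertica Analytic Database \([0-9][0-9.]*-[0-9][0-9]*\) Administration Tools.*/\1/p'
}

# Live usage on a node (not executed here):
#   ps -eo args | dialog_version
```

A reported release that differs from the installed one (7.1.2-6 in this case) suggests the dialog process predates the upgrade.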

Resolution:

To resolve this:

  1. Identify the problem zombie process and its PID
  2. Run "kill -9 <PID>" to stop the zombie process
  3. Launch adminTools
  4. Choose the "Restart Vertica on Host" menu option for the problem host
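Steps 1 and 2 can be wrapped in a small guard so that kill -9 is only sent to a PID whose command line looks like the expected process. This is a sketch under stated assumptions, not a vendor-supplied script: the function name kill_stale_process is hypothetical, the command-line check is an added safety measure not described in the steps above, and reading /proc/<PID>/cmdline assumes a Linux host.

```shell
# Hypothetical helper: kill_stale_process PID [PATTERN]
# Sends SIGKILL to PID only if its command line contains PATTERN
# (default: /opt/vertica/bin/dialog).
# Returns 1 if the PID does not exist, 2 if the command line does not match.
kill_stale_process() {
    pid="$1"
    pattern="${2:-/opt/vertica/bin/dialog}"

    # Assumption: Linux /proc is available.
    [ -r "/proc/$pid/cmdline" ] || { echo "PID $pid not found" >&2; return 1; }

    # Arguments in cmdline are NUL-separated; turn them into spaces.
    cmd=$(tr '\0' ' ' 2>/dev/null < "/proc/$pid/cmdline")

    case "$cmd" in
        *"$pattern"*) ;;
        *) echo "PID $pid does not match '$pattern'; not killing" >&2
           return 2 ;;
    esac

    kill -9 "$pid"
}

# Usage with the PID from this case (not executed here):
#   kill_stale_process 17995
```

Steps 3 and 4 remain interactive: launch adminTools as dradmin and restart Vertica on the problem host from the menu.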
Additional Information:

This was an unusual situation. It is unlikely that other users will run into it, but the information is published here for the user community in case they do.

The problem was odd because launching adminTools showed the current release, 7.1.2-6, which is correct, while the errant process still showed 7.0.2-5.

It is also unclear why the process was not killed during the upgrade, given that its start date of 10/23 is the same day the upgrade was run.

An educated guess is that someone logged in to node 1 and launched the adminTools UI to stop the database in preparation for the upgrade, then closed the terminal window without first exiting adminTools properly. That left the process running and prevented the node from starting properly after the upgrade.