How to replace disks that house the Data Repository Vertica database data

Document ID : KB000032500
Last Modified Date : 14/02/2018

Summary:

There are times when disk space problems show up on mission-critical systems such as the CA Performance Manager (CAPM) Data Repository (DR) Vertica database host systems.

For example, a disk's size may need to be increased, or a disk type change may be required. In these instances there are specific steps that must be followed to ensure success. Skipping these steps increases the chances of ending up with a corrupted or broken database system.

To minimize these risks, follow these instructions when changing or otherwise replacing the disks.

 

Instructions:

Two options with slightly different instructions are presented here to select from.

The first option:

  • Allows replacement of the disks at any given time
  • Retains access to the CAPM UI for user reports and data analysis
  • Carries a risk of data corruption if something goes wrong during the process

The second option:

  • Offers a much higher likelihood of completing the process with no corruption to the DB or its data
  • Requires making the change after hours, when the system is at minimal or zero use
  • Requires shutting down the CAPM server services to ensure users cannot access the DB during the disk replacement
  • Can be performed in one night of downtime, or over several nights, replacing one disk each night

The following best practices apply to both sets of instructions:

  • Replace the disks one node at a time; working on two or more nodes at once risks bringing the entire DB down
  • If one has not been run recently, run a DB backup before beginning the disk change cycles. This helps to ensure a recovery option is available should something go wrong.
  • If there is a cron job configured for scheduled DB backups, disable it to ensure it doesn't launch during the disk replacement process cycles for the node(s).
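As a sketch of the last best practice, the scheduled-backup cron job can be disabled by commenting out its crontab entry for the duration of the disk work. The script path and schedule shown below are placeholders for illustration only, not values from any actual system:

```shell
# Hypothetical crontab entry for the scheduled DB backup; the script
# path and schedule are placeholders.
#
#   Before the disk work:   0 2 * * * /home/dradmin/vertica_backup.sh
#   During the disk work:   # 0 2 * * * /home/dradmin/vertica_backup.sh
#
# Edit the dradmin user's crontab to comment the line out:
crontab -e
```

Re-enable the entry (remove the leading `#`) only after the last node's disks have been replaced and a fresh backup has completed.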

 

Option 1

Perform the disk replacement at any given time. This option retains access to the CAPM UI for user reports and data analysis, but because the system remains in use it carries a risk of data corruption if something fails during the disk change. To execute this option, follow these instructions:

  1. Stop the node having disks replaced. To do so log into the DR primary node as the dradmin or equivalent user. Launch the adminTools UI; select option 7 for the "Advanced Menu"; select option 2 to "Stop Vertica on Host"; select the host having disks replaced and choose to stop Vertica on that host.
  2. Perform the copy of the data from the old to the new disks and attach them to the system properly so they are accessible.
  3. Update the Ancient History Mark (AHM) in the database. Open a vsql session on one of the running nodes as the dradmin or equivalent user, either with the ./vsql command from /opt/vertica/bin or via the adminTools UI by selecting option 2, "Connect to Database". When prompted, enter the password used to access the DB; this is the same password used to stop/start the DB via the adminTools UI. At the vsql prompt, run the statement "select make_ahm_now();".
  4. Restart the node that had disks replaced. Note that after restarting the node there will be a period of recovery for the DB. Depending upon the load on the Vertica environment, as well as the length of time it is down, this could take some time to get back to a current state. Watch for the UP state in the DB status view in the adminTools UI (option 1 in main menu) for the fixed node. When it changes to an UP status, the system can be used normally again.
  5. When that is done and the node shows as UP, run a backup so a safe recovery option is available after the first node disks are replaced.
  6. Repeat the process for the other remaining nodes that require disk changes as needed.
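For reference, the adminTools menu steps above also have command-line equivalents. The sketch below is illustrative only: the database name (mydb), host address, and mount points are placeholders, and the exact tool options can vary by Vertica version, so verify them with `admintools --help` on the system before relying on them:

```shell
# Run as the dradmin (or equivalent) user on the DR primary node.
# Placeholders: mydb = database name, 10.0.0.12 = node being worked on.

# Step 1: stop Vertica on the host whose disks are being replaced
# (menu path: Advanced Menu -> Stop Vertica on Host):
/opt/vertica/bin/admintools -t stop_host -s 10.0.0.12

# Step 2: copy the data from the old disk to the new one, preserving
# ownership and permissions (mount points are placeholders):
rsync -a /old_mount/verticadata/ /new_mount/verticadata/

# Step 3: advance the Ancient History Mark from a node that is still up:
/opt/vertica/bin/vsql -d mydb -c "SELECT MAKE_AHM_NOW();"

# Steps 4-5: restart Vertica on the repaired node, then watch the
# cluster view until the node reports UP before running the backup:
/opt/vertica/bin/admintools -t restart_node -s 10.0.0.12 -d mydb
/opt/vertica/bin/admintools -t view_cluster -d mydb
```

These commands touch a live database and cannot be tested outside a Vertica cluster; treat them as a reference for the menu-driven steps, not a drop-in script.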

 

Option 2

This alternative process increases the likelihood of completing the change with no corruption to the DB or its data. It is recommended that the change be performed at night or during off-peak hours, when there is little to no user activity, and that the CAPM services be shut down to ensure no access to the DB while the disks are changed. This option may require change order submissions and scheduling for server downtime.

  1. Stop the four CAPM services. Use Knowledge Base Article ID TEC1382101 for instructions.
  2. Stop the node having disks replaced. To do so log into the DR primary node as the dradmin or equivalent user. Launch the adminTools UI; select option 7 for the "Advanced Menu"; select option 2 to "Stop Vertica on Host"; select the host having disks replaced and choose to stop Vertica on that host.
  3. Perform the copy of the data from the old to the new disks and attach them to the system properly so they are accessible.
  4. Update the Ancient History Mark (AHM) in the database. Open a vsql session on one of the running nodes as the dradmin or equivalent user, either with the ./vsql command from /opt/vertica/bin or via the adminTools UI by selecting option 2, "Connect to Database". When prompted, enter the password used to access the DB; this is the same password used to stop/start the DB via the adminTools UI. At the vsql prompt, run the statement "select make_ahm_now();".
  5. Restart the node that had disks replaced. Note that after restarting the node there will be a period of recovery for the DB. Depending upon the load on the Vertica environment, as well as the length of time it is down, this could take some time to get back to a current state. Watch for the UP state in the DB status view in the adminTools UI (option 1 in main menu) for the fixed node. When it changes to an UP status, the system can be used normally again.
  6. When that is done and the node shows as UP, run a backup so a safe recovery option is available after the first node disks are replaced.
  7. If no further work on other nodes is scheduled, the CAPM services can now be restarted to allow user access to the system again.
  8. Repeat the process for the other remaining nodes that require disk changes as needed.
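The backups called for in both options are typically run with Vertica's vbr utility against a configuration file (for example, `/opt/vertica/bin/vbr -t backup -c backup.ini`). The snippet below is a hypothetical minimal configuration, not taken from any particular environment; the snapshot name, database name, user, node names, hosts, and backup paths are all placeholders that must match the actual cluster:

```ini
[Misc]
; Placeholder snapshot name for the disk-change backups
snapshotName = dr_disk_change_backup
restorePointLimit = 3

[Database]
dbName = mydb
dbUser = dradmin

[Transmission]

[Mapping]
; node name = backup host : backup directory (all placeholders)
v_mydb_node0001 = 10.0.0.11:/backup/vertica
v_mydb_node0002 = 10.0.0.12:/backup/vertica
v_mydb_node0003 = 10.0.0.13:/backup/vertica
```

Consult the Vertica backup and restore documentation for the full set of vbr configuration parameters before using a file like this in production.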