[MTP] Assess and Recover XFS File System Corruption

Document ID : KB000010130
Last Modified Date : 14/02/2018
Introduction:

When the appliance restarts, it displays a kernel panic preceded by an XFS call stack similar to the following:

RIP [<ffffffff883cf607>] :xfs:xfs_error_report+0xf/0x58 

RSP <ffff81028c817c28> 

CR2: 0000000000000118 

<0> Kernel panic - not syncing: Fatal exception
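
If the console output was captured (for example, from a serial console or a saved log), the signature above can be searched for. The following is a minimal sketch: the helper name has_xfs_panic and the log file path are hypothetical and are not part of the appliance software.

```shell
# Return 0 if the given log file contains both an XFS error report frame
# and a kernel panic line, matching the signature shown above.
# The log file path is an assumption; pass whatever capture is available.
has_xfs_panic() {
  log_file="$1"
  grep -q 'xfs_error_report' "$log_file" &&
    grep -q 'Kernel panic' "$log_file"
}

# Usage (hypothetical path):
#   has_xfs_panic /path/to/console.log && echo "XFS panic signature found"
```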

Background:

The CA Multi-Port Monitor appliance uses the high-performance Linux XFS file system on two partitions:

/dev/sda4 mounted on /nqxfs

Hosts the Vertica metrics database.

/dev/sdb1 mounted on /data

Hosts the CA Multi-Port Monitor packet capture storage.

XFS file system corruption typically occurs when the appliance experiences a power outage or hardware hang.

The Linux kernel panic is most likely to occur on the /nqxfs partition shortly after the appliance restarts, when the Vertica metrics database starts.
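
To confirm which XFS partitions are present on a running appliance, the kernel mounts table can be filtered by file system type. This is a minimal sketch, assuming the standard /proc/mounts layout (device, mount point, type, options); list_xfs_mounts is a hypothetical helper name.

```shell
# List XFS file systems from a mounts table (default: /proc/mounts).
# Prints "<device> <mount point>" for each XFS entry.
list_xfs_mounts() {
  mounts_file="${1:-/proc/mounts}"
  awk '$3 == "xfs" { print $1, $2 }' "$mounts_file"
}

# Usage:
#   list_xfs_mounts
```

On an appliance laid out as described above, the output would be expected to include /dev/sda4 /nqxfs and /dev/sdb1 /data.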

Environment:

CentOS 5.X/6.X

Instructions:

Repair XFS File System Corruption

Repair a damaged or corrupt XFS file system by running the xfs_repair command on the affected partition. What you do after the repair depends on the partition:

  • /data partition: no further action is required.
  • /nqxfs partition: recreate the Vertica metrics database, which is hosted on the partition.

Estimated time to complete XFS repair: 30-60 minutes

Follow these steps:

1. If the Multi-Port Monitor terminal displays a kernel panic and system halt message and is unresponsive, shut down the appliance by holding down the Power button for several seconds. Otherwise, shut down the appliance normally.

2. Press the Power button to start the appliance.

3. After the BIOS scans complete, the initial CentOS boot screen appears. Press any key before the countdown reaches zero to enter the boot menu.

4. The default boot kernel is already selected. Press a to modify the kernel boot parameters.

5. The cursor is at the end of the line of kernel parameters. Add the parameter single to the end of the line and press Enter.

[Image: Single.png, the kernel parameter line with single appended]

6. When the kernel finishes booting, a command prompt is displayed. There is no login prompt because the system is running in single user mode.

Note: In single user mode, the appliance can only be accessed from the terminal display. 
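
To confirm from the prompt that the system really is in single user mode, the output of the runlevel command can be inspected. This is a minimal sketch, assuming a SysV-init or Upstart system such as CentOS 5.X/6.X, where runlevel prints the previous and current runlevel; is_single_user is a hypothetical helper name, and the exact string reported may vary.

```shell
# Return 0 if the given `runlevel` output indicates single user mode.
# `runlevel` prints "<previous> <current>", for example "N S".
is_single_user() {
  current="${1##* }"   # keep only the text after the last space
  case "$current" in
    1|S|s) return 0 ;;
    *)     return 1 ;;
  esac
}

# Usage:
#   is_single_user "$(runlevel)" && echo "single user mode"
```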

7. Unmount the affected partition and run xfs_repair on its block device.

To repair the /nqxfs partition:

umount /nqxfs

xfs_repair /dev/sda4

To repair the /data partition:

umount /data

xfs_repair /dev/sdb1
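
The mount point to device mapping in this step can be captured in a small helper that prints the commands for one partition without running them. This is a sketch only: the function names are hypothetical, and the xfs_repair -n line (no-modify mode, which only reports what would be repaired) is an optional check-only pass that is not part of the original procedure.

```shell
# Map a mount point named in this article to its block device.
device_for_mount() {
  case "$1" in
    /nqxfs) echo /dev/sda4 ;;
    /data)  echo /dev/sdb1 ;;
    *) echo "unknown mount point: $1" >&2; return 1 ;;
  esac
}

# Print the commands for one partition without running them.
# The "xfs_repair -n" line is an optional check-only pass (an addition,
# not part of the original procedure).
print_repair_commands() {
  mount_point="$1"
  device=$(device_for_mount "$mount_point") || return 1
  echo "umount $mount_point"
  echo "xfs_repair -n $device"
  echo "xfs_repair $device"
}

# Usage:
#   print_repair_commands /nqxfs
```

Printing rather than executing keeps the destructive step under the operator's control; the printed lines can be reviewed and then typed at the single user prompt.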

8. In either case, a successful repair produces text output similar to the following:

Phase 1 - find and verify superblock...
Phase 2 - zero log...
        - scan file system freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        ...
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - clear lost+found (if it exists) ...
        - clearing existing "lost+found" inode
        - deleting existing "lost+found" entry
        - check for inodes claiming duplicate blocks...
        - agno = 0
imap claims in-use inode 242000 is free, correcting imap
        - agno = 1
        - agno = 2
        ...
Phase 5 - rebuild AG headers and trees...
        - reset superblock counters...
Phase 6 - check inode connectivity...
        - ensuring existence of lost+found directory
        - traversing file system starting at / ...
        - traversal finished ...
        - traversing all unattached subtrees ...
        - traversals finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 242000, moving to lost+found
Phase 7 - verify and correct link counts...
Done
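
If the xfs_repair output was saved to a file, a transcript like the one above can be checked mechanically. This is a minimal sketch, assuming the output was captured with something like xfs_repair /dev/sda4 2>&1 | tee /root/xfs_repair.log (the log path is an assumption); both function names are hypothetical.

```shell
# Return 0 if a saved xfs_repair transcript reached Phase 7 and its last
# non-empty line is "done" (matched case-insensitively).
repair_completed() {
  log_file="$1"
  grep -q '^Phase 7' "$log_file" || return 1
  last_line=$(grep -v '^[[:space:]]*$' "$log_file" | tail -n 1)
  printf '%s\n' "$last_line" | grep -qi '^[[:space:]]*done[[:space:]]*$'
}

# Count the inodes that xfs_repair reported moving to lost+found.
count_disconnected_inodes() {
  grep -c 'disconnected inode [0-9]*, moving to lost+found' "$1"
}

# Usage (hypothetical path):
#   repair_completed /root/xfs_repair.log && echo "repair ran to completion"
#   count_disconnected_inodes /root/xfs_repair.log
```

A non-zero count means files were reattached under the lost+found directory of the repaired partition and may need to be reviewed.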

9. Enter reboot to leave single user mode and restart the appliance.

10. Assess whether the XFS repair has returned the partition to normal operations.
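
One simple signal for this assessment is whether the repaired partition is mounted read-write after the reboot. The following sketch reads the mounts table and reports the state of a mount point. It assumes the standard /proc/mounts layout, in which the option field begins with rw or ro; mount_state is a hypothetical helper name. This check is an illustration only and does not replace verifying that the Vertica metrics database or packet capture storage works as expected.

```shell
# Print "rw", "ro", or "missing" for a mount point, reading a mounts table
# (default: /proc/mounts). Field 2 is the mount point, field 4 the options.
mount_state() {
  mount_point="$1"
  mounts_file="${2:-/proc/mounts}"
  awk -v mp="$mount_point" '
    $2 == mp { split($4, opts, ","); state = opts[1] }
    END { print (state == "" ? "missing" : state) }
  ' "$mounts_file"
}

# Usage:
#   mount_state /nqxfs
#   mount_state /data
```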