
Command needed to recover a raid array after some volumes are temporarily unavailable #120

Open
pdcastro opened this issue Jun 13, 2023 · 2 comments


The problem

Users may permanently lose access to all the data in their LVM RAID (dm-raid) array when a subset of the disks (physical volumes) becomes temporarily unavailable, for example when that subset resides in a powered USB docking station that is accidentally unplugged from power, when cables are unreliable, or when a discrete disk controller fails.

This may happen even if those disks do not suffer any data or metadata corruption.

When the temporary unavailability is resolved (e.g. the USB docking station is powered up again), with or without rebooting the Linux machine, LVM commands such as pvs can list the previously missing volumes, but dm-raid still permanently reports those volumes as failed, even though they are in good working order and still contain intact RAID data and metadata. Attempts to activate logical volumes in the RAID array may then fail with “input/output” errors. If the temporarily unavailable subset is large enough, it may not even be possible to re-synchronize by replacing the reportedly failed disks with new ones.

In addition, LVM RAID (dm-raid) log messages such as “md/raid:mdX: not enough operational devices (3/4 failed)” may lead some users to believe that the devices in question have actually failed, potentially causing the unnecessary disposal of hardware that is in good working order.

An advanced procedure to recover the RAID array by using a binary editor (hexedit) to edit metadata bit fields on block devices (e.g. "/dev/sda4") is partially described in some internet forums and further documented below. However, this procedure risks actual data corruption and may be beyond the reach of some users.

Real-world scenarios (how to reproduce the problem)

A web search reveals user reports of this issue as far back as 2014 (Ref1), but the issue may have been around for longer. It is infrequent but the consequences are severe when it happens, and it has happened enough times to leave online traces (Ref2, Ref3, and this GitHub issue itself, among others). Most cases appear to involve RAID level 5 configurations, presumably because RAID 5 is (or used to be) the most popular RAID configuration.

I personally came across an LVM RAID (dm-raid) level 4 array consisting of 4 physical volumes: 1 SSD (RAID parity) internal to the Linux machine and 3 HDDs (RAID data) in an external, powered USB docking station. The docking station was suddenly unplugged from power by accident, while the Linux machine and the internal SSD parity volume remained powered. Some time later, power was restored to the docking station. However, even after rebooting the Linux machine, and although the pvs command could list all the physical volumes, the logical volumes could no longer be activated: the system logs indicated that the 3 external HDDs had failed, and the LVM command-line tools produced errors such as:

root@asus:~# vgchange -ay vg1
  device-mapper: reload ioctl on  (253:8) failed: Input/output error
  device-mapper: reload ioctl on  (253:8) failed: Input/output error
  device-mapper: reload ioctl on  (253:8) failed: Input/output error
  0 logical volume(s) in volume group "vg1" now active

root@asus:~# lvchange -ay --activationmode partial -vvvv vg1/lv1
  ...
  21:48:07.649271 lvchange[2978] device_mapper/libdm-deptree.c:2445  Found raid target v1.15.1.
  21:48:07.649316 lvchange[2978] device_mapper/ioctl/libdm-iface.c:1876  dm table   (253:8) [ opencount flush ]   [16384] (*1)
  21:48:07.649346 lvchange[2978] device_mapper/ioctl/libdm-iface.c:1876  dm reload   (253:8) [ noopencount flush ]   [16384]   (*1)
  21:48:07.697211 lvchange[2978] device_mapper/ioctl/libdm-iface.c:1926  device-mapper: reload ioctl on  (253:8) failed: Input/output error
  ...

Relevant system logs can be found at the end of this issue description.

The cause

During the temporary unavailability of a subset of the physical volumes, LVM RAID (dm-raid) may “mark” those physical volumes as having failed in the dm-raid superblock metadata (dm-raid.c#L1921-L1935) of the physical volumes that remained available.

As far as my understanding goes, that metadata is different from the higher-level “LVM raid” metadata that gets backed up with the vgcfgbackup and vgcfgrestore commands and is typically stored in the /etc/lvm/backup/ or /etc/lvm/archive/ folders on the Linux machine. Therefore, vgcfgrestore does not help to recover the RAID array under these circumstances. While attempting to recover my own RAID array, I even tried removing physical volumes with "pvremove --force --force" and re-adding them with "pvcreate --uuid ... --restorefile ...", hoping that this would reset the dm-raid superblock metadata regarding failed volumes, but it did not.

Manual recovery with hexedit

I came across some old forum threads and archived mailing lists (e.g. Ref1, Ref2, Ref3) suggesting that some dm-raid superblock metadata bit fields could be manually cleared with a binary editor like hexedit in order to tell LVM RAID (dm-raid) that all physical volumes were just fine. However, the steps shared in those posts were too specific to their particular incidents and/or applied to old versions of the LVM tools or dm-raid that did not match some of my own observations, so I decided to share here some more generic and up-to-date steps that could potentially help more users.

In source code, the dm-raid superblock metadata in question can be found at dm-raid.c#L1921-L1935. Note the #define DM_RAID_MAGIC 0x64526D44 line. Stored in little-endian byte order, the 4 magic bytes appear on disk as 44 6D 52 64 (hexadecimal), which correspond to the 4 ASCII characters DmRd ("dm-raid"). The dm-raid superblock metadata starts with these 4 bytes (32 bits), followed by several other fields:

struct dm_raid_superblock {
	__le32 magic;  /* "DmRd" */
	__le32 compat_features;
	__le32 num_devices;
	__le32 array_position;
	__le64 events;
	__le64 failed_devices;
    ...
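
For illustration, here is a minimal Python sketch (my own, not part of the kernel sources) of how these first fields could be unpacked from a 32-byte buffer, assuming the little-endian layout shown above with no padding between the fields:

import struct

# First 32 bytes of struct dm_raid_superblock (all little-endian, no padding):
#   magic, compat_features, num_devices, array_position  -> __le32
#   events, failed_devices                                -> __le64
SUPERBLOCK_HEAD = struct.Struct("<4sIIIQQ")

def parse_superblock_head(buf: bytes) -> dict:
    magic, compat, num_devices, array_pos, events, failed = SUPERBLOCK_HEAD.unpack_from(buf)
    if magic != b"DmRd":
        raise ValueError("not a dm-raid superblock")
    return {"compat_features": compat, "num_devices": num_devices,
            "array_position": array_pos, "events": events,
            "failed_devices": failed}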

With a binary editor such as hexedit ("apt-get install hexedit"), you'll find these fields as byte sequences. For example:

 magic       | features    | num_devices | array_pos   | events                  | failed_devices
-------------|-------------|-------------|-------------|-------------------------|-------------------------
 44 6D 52 64 | 01 00 00 00 | 04 00 00 00 | 00 00 00 00 | 2D 02 00 00 00 00 00 00 | 0E 00 00 00 00 00 00 00

The values above are in hexadecimal. We are interested in the failed_devices field. In the example above, its value is 0E, which in binary is 1110. The three 1 bits in the binary sequence 1110 indicate dm-raid's belief that three physical volumes have failed (one bit per physical volume). In this case, the fix is to edit 0E and replace it with 00, indicating that no physical volumes failed.
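As a quick check of that interpretation, the set bits can also be decoded programmatically. A tiny Python sketch, assuming bit n of failed_devices corresponds to the device at array position n (the 0x0E value is just the example above):

# failed_devices is a bit field: a set bit n means dm-raid considers the device
# at array position n to have failed. 0x0E == 0b1110 -> positions 1, 2 and 3.
failed_devices = 0x0E  # value from the example table above
failed_positions = [n for n in range(64) if failed_devices >> n & 1]
print(failed_positions)  # prints [1, 2, 3]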

Each physical volume may have multiple instances of this superblock metadata — in my RAID 4 scenario, the volume group contained three logical volumes and I found three instances of the superblock metadata in each physical volume. Note that the superblock metadata may be different in each physical volume because some physical volumes were unavailable at the time when dm-raid updated the failed_devices field. Indeed, the physical volumes that remained always available are the ones expected to contain superblock metadata indicating a non-zero number of failed devices. To be sure, find and check all superblock metadata instances and clear the failed_devices field in each of them as needed.

Where exactly to find this metadata for editing? In Ref1, tabigel was somehow able to activate a logical volume and then found the metadata at the block device “file” "/dev/mapper/<vgname>-<lvname>_rmeta_<n>". In my case, I could not activate any logical volumes, not even with the "--activationmode partial" argument to the "lvchange -ay" or "vgchange -ay" commands, and the relevant volumes were not available under "/dev/mapper/". So I decided to open the raw physical volume block device (e.g. a GPT partition such as "/dev/sda4") with hexedit, as if it were a file, and then search for the magic DmRd sequence.

A practical issue was that the physical volumes were fairly large (500GB), so searching the whole volume for the DmRd magic string was time-consuming and, more importantly, produced many false positives (which I could tell because the remaining superblock fields made no sense). So I narrowed the search with the help of the "pvs -v --segments" and "pvdisplay" commands:

root@asus:~# pvs -v --segments
  PV         VG  Fmt  Attr PSize    PFree    Start SSize LV             Start Type   PE Ranges
  /dev/sda3      lvm2 ---  <232.88g <232.88g     0     0                    0 free
  /dev/sda4  vg1 lvm2 a--  <465.76g       0      0     1 [lv1_rmeta_0]      0 linear /dev/sda4:0-0
  /dev/sda4  vg1 lvm2 a--  <465.76g       0      1 39744 [lv1_rimage_0]     0 linear /dev/sda4:1-39744
  /dev/sda4  vg1 lvm2 a--  <465.76g       0  39745     1 [lv2_rmeta_0]      0 linear /dev/sda4:39745-39745
  /dev/sda4  vg1 lvm2 a--  <465.76g       0  39746 39744 [lv2_rimage_0]     0 linear /dev/sda4:39746-79489
  /dev/sda4  vg1 lvm2 a--  <465.76g       0  79490     1 [lv3_rmeta_0]      0 linear /dev/sda4:79490-79490
  /dev/sda4  vg1 lvm2 a--  <465.76g       0  79491 39743 [lv3_rimage_0]     0 linear /dev/sda4:79491-119233
  /dev/sdb1  vg1 lvm2 a--  <465.76g       0      0     1 [lv1_rmeta_2]      0 linear /dev/sdb1:0-0
  /dev/sdb1  vg1 lvm2 a--  <465.76g       0      1 39744 [lv1_rimage_2]     0 linear /dev/sdb1:1-39744
  /dev/sdb1  vg1 lvm2 a--  <465.76g       0  39745     1 [lv2_rmeta_2]      0 linear /dev/sdb1:39745-39745
  /dev/sdb1  vg1 lvm2 a--  <465.76g       0  39746 39744 [lv2_rimage_2]     0 linear /dev/sdb1:39746-79489
  /dev/sdb1  vg1 lvm2 a--  <465.76g       0  79490     1 [lv3_rmeta_2]      0 linear /dev/sdb1:79490-79490
  /dev/sdb1  vg1 lvm2 a--  <465.76g       0  79491 39743 [lv3_rimage_2]     0 linear /dev/sdb1:79491-119233
  /dev/sdc1  vg1 lvm2 a--  <465.76g       0      0     1 [lv1_rmeta_1]      0 linear /dev/sdc1:0-0
  /dev/sdc1  vg1 lvm2 a--  <465.76g       0      1 39744 [lv1_rimage_1]     0 linear /dev/sdc1:1-39744
  /dev/sdc1  vg1 lvm2 a--  <465.76g       0  39745     1 [lv2_rmeta_1]      0 linear /dev/sdc1:39745-39745
  /dev/sdc1  vg1 lvm2 a--  <465.76g       0  39746 39744 [lv2_rimage_1]     0 linear /dev/sdc1:39746-79489
  /dev/sdc1  vg1 lvm2 a--  <465.76g       0  79490     1 [lv3_rmeta_1]      0 linear /dev/sdc1:79490-79490
  /dev/sdc1  vg1 lvm2 a--  <465.76g       0  79491 39743 [lv3_rimage_1]     0 linear /dev/sdc1:79491-119233
  /dev/sdc2      lvm2 ---   232.87g  232.87g     0     0                    0 free
  /dev/sdd1  vg1 lvm2 a--  <465.76g       0      0     1 [lv1_rmeta_3]      0 linear /dev/sdd1:0-0
  /dev/sdd1  vg1 lvm2 a--  <465.76g       0      1 39744 [lv1_rimage_3]     0 linear /dev/sdd1:1-39744
  /dev/sdd1  vg1 lvm2 a--  <465.76g       0  39745     1 [lv2_rmeta_3]      0 linear /dev/sdd1:39745-39745
  /dev/sdd1  vg1 lvm2 a--  <465.76g       0  39746 39744 [lv2_rimage_3]     0 linear /dev/sdd1:39746-79489
  /dev/sdd1  vg1 lvm2 a--  <465.76g       0  79490     1 [lv3_rmeta_3]      0 linear /dev/sdd1:79490-79490
  /dev/sdd1  vg1 lvm2 a--  <465.76g       0  79491 39743 [lv3_rimage_3]     0 linear /dev/sdd1:79491-119233

root@asus:~# pvdisplay
  --- Physical volume ---
  PV Name               /dev/sda4
  ...
  PE Size               4.00 MiB
  ...

In the output of the pvs command above, we are interested in the rows where the LV column contains "_rmeta_" in the LV name. These are sub LVs (also known as meta LVs) that hold RAID metadata, including the superblock metadata we are after. The "Start" and "PE Ranges" (Physical Extent Ranges) columns show the physical extent number where the metadata starts, for example 39745 for the lv2_rmeta_0 sub LV. The output of the pvdisplay command shows that the Physical Extent (PE) Size is 4 MiB (4,194,304 bytes). In this case, the byte offset is thus 39745 * 4194304 = 166702612480. Jump to this byte offset in the binary editor (assuming you have opened a block device such as "/dev/sda4" as if it were a file), then search for the DmRd magic string starting from there. It should be found straight away, because the rmeta sub LV is not that large. I believe it is enough to search once per rmeta sub LV, per physical volume.
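To make the arithmetic and the search concrete, here is a small read-only Python sketch based on the example output above. The device path, PE size and rmeta start extents are taken from my own pvs/pvdisplay output and must be replaced with yours; the sketch only reads and prints, it does not modify anything:

import struct

PE_SIZE = 4 * 1024 * 1024  # 4 MiB, the "PE Size" reported by pvdisplay

# (device, starting physical extent of an rmeta sub LV), from `pvs -v --segments`
RMETA_SEGMENTS = [
    ("/dev/sda4", 0),      # lv1_rmeta_0
    ("/dev/sda4", 39745),  # lv2_rmeta_0
    ("/dev/sda4", 79490),  # lv3_rmeta_0
]

for device, pe_start in RMETA_SEGMENTS:
    offset = pe_start * PE_SIZE              # byte offset where the rmeta sub LV starts
    with open(device, "rb") as dev:          # read-only
        dev.seek(offset)
        data = dev.read(PE_SIZE)             # one extent is enough to contain the superblock
    pos = data.find(b"DmRd")
    if pos < 0:
        print(f"{device} @ {offset}: DmRd magic not found")
        continue
    magic, compat, num_devs, array_pos, events, failed = struct.unpack_from("<4sIIIQQ", data, pos)
    print(f"{device} @ {offset + pos}: num_devices={num_devs} array_position={array_pos} "
          f"events={events} failed_devices={failed:#x}")

The absolute offset printed for each superblock is the position to jump to with Ctrl+G in hexedit.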

In summary, for this example, using hexedit:

  • Run "hexedit /dev/sda4" to open the physical volume block device as if it were a file.
  • For each rmeta sub LV, calculate its byte offset position (example above).
  • Hit ‘Ctrl+G’ or ‘F4’ to jump to the first calculated byte offset. Hint: delete the '0x' prompt prefix to enter a decimal number.
  • Hit ‘Tab’ to toggle from hex mode to ASCII mode.
  • Hit ‘Ctrl-S’ to search, type DmRd and Enter. It should be found within one second or so.
  • Hit ‘Tab’ to toggle back to hex mode.
  • Use the arrow keys to navigate to the failed_devices field shown earlier in a sample table.
  • Type 00 to overwrite the non-zero failed_devices byte (e.g. replace 0E with 00), indicating that no devices failed.
  • Repeat the steps above for additional rmeta sub LVs.
  • When done, hit ‘Ctrl-X’ to save and exit. To exit without saving, it’s ‘Ctrl-C’.
  • Repeat all the steps above for additional physical volumes.

Note that if you make a mistake, you may corrupt or destroy data or metadata and it may be practically impossible to undo it.

Suggested solution

It can be argued that dm-raid should not mark a physical volume as having failed just because it is temporarily unavailable, thus preventing this issue from arising in the first place. Even if this is easier said than done, it may still be worth pursuing.

On the other hand, even if such improvements were made, there would still be a legacy of older dm-raid implementations and there could still be corner cases that required manual correction. Also, if we set the bar too high and the task is too demanding, while relying on volunteer developers, it may never get done. “Perfect is the enemy of good.”

In this GitHub issue, I advocate starting with the implementation of an LVM command that clears the “failed devices” bit fields of dm-raid superblock metadata, saving users from having to resort to the hexedit steps listed above. I have not given much thought to what this command should be, but to kickstart a discussion, I might suggest:

root@hostname:~# vgchange --resetfaileddevices vg1

Such a command would reset the “failed devices” bit field of the dm-raid superblock metadata in every rmeta sub LV of all physical volumes of the given volume group.

Another possibility might be to add an option to the dmsetup command. As I understand it, it operates at a lower level of abstraction. It might be a case of implementing both, with vgchange invoking dmsetup multiple times behind the scenes, perhaps as a library.
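
To make the intent more concrete, here is a rough Python sketch of the kind of logic such a command might wrap. It is only a sketch: the helper functions named in the trailing comments are hypothetical (no such LVM or dmsetup interface exists today), the failed_devices offset assumes the superblock layout shown earlier, and writing to raw block devices like this can destroy data if the offset is wrong.

import struct

FAILED_DEVICES_OFFSET = 24  # byte offset of failed_devices within struct dm_raid_superblock

def clear_failed_devices(device: str, superblock_offset: int) -> None:
    """Zero the failed_devices bit field of one dm-raid superblock (sketch only)."""
    with open(device, "r+b") as dev:
        dev.seek(superblock_offset)
        if dev.read(4) != b"DmRd":            # refuse to touch anything without the magic
            raise ValueError(f"no dm-raid superblock at {device}:{superblock_offset}")
        dev.seek(superblock_offset + FAILED_DEVICES_OFFSET)
        dev.write(struct.pack("<Q", 0))       # failed_devices := 0 (no devices marked failed)

# Hypothetical outline of `vgchange --resetfaileddevices vg1`:
#   for pv in physical_volumes_of("vg1"):               # hypothetical, e.g. from `pvs -v --segments`
#       for offset in rmeta_superblock_offsets(pv):     # hypothetical, located as in the earlier sketch
#           clear_failed_devices(pv, offset)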

Logs

Sample journalctl logs (including dmesg) while a RAID volume subset (3 out of 4 disks) was unavailable (USB docking station suddenly powered off by accident), before the Linux machine was rebooted:

Jun 02 21:44:03 asus kernel: device-mapper: raid: Loading target version 1.15.1
...
Jun 06 13:36:27 asus kernel: md/raid:mdX: Disk failure on dm-25, disabling device.
Jun 06 13:36:27 asus kernel: md/raid:mdX: Operation continuing on 3 devices.
Jun 06 13:36:27 asus kernel: md/raid:mdX: Disk failure on dm-23, disabling device.
Jun 06 13:36:27 asus kernel: EXT4-fs warning (device dm-26): htree_dirblock_to_tree:1072: inode #2: lblock 0: comm ls: error -5 reading directory block
Jun 06 13:36:27 asus kernel: md/raid:mdX: Cannot continue operation (2/4 failed).
Jun 06 13:36:27 asus kernel: md/raid:mdX: Disk failure on dm-21, disabling device.
Jun 06 13:36:27 asus kernel: md/raid:mdX: Cannot continue operation (3/4 failed).
Jun 06 13:36:27 asus lvm[672]: WARNING: Device #1 of raid4 array, vg1-lv3, has failed.
Jun 06 13:36:27 asus lvm[672]: WARNING: Device #2 of raid4 array, vg1-lv3, has failed.
Jun 06 13:36:27 asus lvm[672]: WARNING: Device #3 of raid4 array, vg1-lv3, has failed.
Jun 06 13:36:27 asus lvm[672]: WARNING: waiting for resynchronization to finish before initiating repair on RAID device vg1-lv3.
Jun 06 13:36:27 asus lvm[672]: Use 'lvconvert --repair vg1/lv3' to replace failed device.
Jun 06 13:36:33 asus kernel: Buffer I/O error on dev dm-26, logical block 60850176, lost sync page write
Jun 06 13:36:33 asus kernel: JBD2: Error -5 detected when updating journal superblock for dm-26-8.
Jun 06 13:36:33 asus kernel: Aborting journal on device dm-26-8.
Jun 06 13:36:33 asus kernel: Buffer I/O error on dev dm-26, logical block 60850176, lost sync page write
Jun 06 13:36:33 asus kernel: JBD2: Error -5 detected when updating journal superblock for dm-26-8.
Jun 06 13:36:36 asus kernel: md/raid:mdX: Disk failure on dm-7, disabling device.
Jun 06 13:36:36 asus kernel: md/raid:mdX: Operation continuing on 3 devices.
Jun 06 13:36:36 asus kernel: md/raid:mdX: Disk failure on dm-5, disabling device.
Jun 06 13:36:36 asus kernel: EXT4-fs warning (device dm-8): htree_dirblock_to_tree:1072: inode #2: lblock 0: comm ls: error -5 reading directory block
Jun 06 13:36:36 asus kernel: md/raid:mdX: Cannot continue operation (2/4 failed).
Jun 06 13:36:36 asus kernel: md/raid:mdX: Disk failure on dm-3, disabling device.
Jun 06 13:36:36 asus kernel: md/raid:mdX: Cannot continue operation (3/4 failed).
Jun 06 13:36:36 asus lvm[672]: WARNING: Device #1 of raid4 array, vg1-lv1, has failed.
Jun 06 13:36:36 asus lvm[672]: WARNING: Device #2 of raid4 array, vg1-lv1, has failed.
Jun 06 13:36:36 asus lvm[672]: WARNING: Device #3 of raid4 array, vg1-lv1, has failed.
Jun 06 13:36:36 asus lvm[672]: WARNING: waiting for resynchronization to finish before initiating repair on RAID device vg1-lv1.
Jun 06 13:36:37 asus lvm[672]: Use 'lvconvert --repair vg1/lv1' to replace failed device.
Jun 06 13:36:43 asus kernel: Buffer I/O error on dev dm-8, logical block 60850176, lost sync page write
Jun 06 13:36:43 asus kernel: JBD2: Error -5 detected when updating journal superblock for dm-8-8.
Jun 06 13:36:43 asus kernel: Aborting journal on device dm-8-8.
Jun 06 13:36:43 asus kernel: Buffer I/O error on dev dm-8, logical block 60850176, lost sync page write
Jun 06 13:36:43 asus kernel: JBD2: Error -5 detected when updating journal superblock for dm-8-8.
Jun 06 13:37:37 asus lvm[672]: WARNING: Device #1 of raid4 array, vg1-lv2, has failed.
Jun 06 13:37:37 asus lvm[672]: WARNING: Device #2 of raid4 array, vg1-lv2, has failed.
Jun 06 13:37:37 asus lvm[672]: WARNING: Device #3 of raid4 array, vg1-lv2, has failed.
Jun 06 13:37:37 asus lvm[672]: WARNING: waiting for resynchronization to finish before initiating repair on RAID device vg1-lv2.
Jun 06 13:37:37 asus kernel: md: super_written gets error=-5
Jun 06 13:37:37 asus kernel: md/raid:mdX: Disk failure on dm-12, disabling device.
Jun 06 13:37:37 asus kernel: md/raid:mdX: Operation continuing on 3 devices.
Jun 06 13:37:37 asus kernel: md: super_written gets error=-5
Jun 06 13:37:37 asus kernel: md/raid:mdX: Disk failure on dm-14, disabling device.
Jun 06 13:37:37 asus kernel: md/raid:mdX: Cannot continue operation (2/4 failed).
Jun 06 13:37:37 asus kernel: md: super_written gets error=-5
Jun 06 13:37:37 asus kernel: md/raid:mdX: Disk failure on dm-16, disabling device.
Jun 06 13:37:37 asus kernel: md/raid:mdX: Cannot continue operation (3/4 failed).
Jun 06 13:37:37 asus kernel: Aborting journal on device dm-17-8.
Jun 06 13:37:37 asus kernel: Buffer I/O error on dev dm-17, logical block 60850176, lost sync page write
Jun 06 13:37:37 asus kernel: JBD2: Error -5 detected when updating journal superblock for dm-17-8.
Jun 06 13:37:37 asus lvm[672]: Use 'lvconvert --repair vg1/lv2' to replace failed device.

Sample journalctl logs (including dmesg) after all RAID volumes were made available again and the Linux machine was rebooted:

Jun 06 13:42:06 asus kernel: device-mapper: raid: Loading target version 1.15.1
Jun 06 13:42:06 asus kernel: md/raid:mdX: device dm-1 operational as raid disk 0
Jun 06 13:42:06 asus kernel: md/raid:mdX: not enough operational devices (3/4 failed)
Jun 06 13:42:06 asus kernel: md/raid:mdX: failed to run raid set.
Jun 06 13:42:06 asus kernel: md: pers->run() failed ...
Jun 06 13:42:06 asus kernel: device-mapper: table: 253:8: raid: Failed to run raid array
Jun 06 13:42:06 asus kernel: device-mapper: ioctl: error adding target to table
Jun 06 13:42:06 asus kernel: md/raid:mdX: device dm-1 operational as raid disk 0
Jun 06 13:42:06 asus kernel: md/raid:mdX: not enough operational devices (3/4 failed)
Jun 06 13:42:06 asus kernel: md/raid:mdX: failed to run raid set.
Jun 06 13:42:06 asus kernel: md: pers->run() failed ...
Jun 06 13:42:06 asus kernel: device-mapper: table: 253:8: raid: Failed to run raid array
Jun 06 13:42:06 asus kernel: device-mapper: ioctl: error adding target to table
Jun 06 13:42:06 asus kernel: md/raid:mdX: device dm-1 operational as raid disk 0
Jun 06 13:42:06 asus kernel: md/raid:mdX: not enough operational devices (3/4 failed)
Jun 06 13:42:06 asus kernel: md/raid:mdX: failed to run raid set.
Jun 06 13:42:06 asus kernel: md: pers->run() failed ...
Jun 06 13:42:06 asus kernel: device-mapper: table: 253:8: raid: Failed to run raid array
Jun 06 13:42:06 asus kernel: device-mapper: ioctl: error adding target to table

jrwren commented May 28, 2024

I'm having this issue, but slightly differently: it isn't an rmeta sub LV that is returning the failed ioctl, it is the LV device <vgname>-<lvname>. This means I don't know where to hexedit. Any recommendation for how I can hunt this down?

https://gist.github.com/jrwren/e5fd91fe937923584d916b710dfc0f56

@zkabelac
Contributor

Doing a manual 'edit' of the RAID metadata is certainly the wrong idea.
Lvm2 needs to handle a proper recovery for this situation.

Ideally, someone would prepare an lvm2 test suite example for the above case (or at least provide a sequence of lvm2 metadata archives leading to the invalid state).
