lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <9952f532-2554-44bf-b906-4880b2e88e3a@o2.pl>
Date: Sat, 20 Jul 2024 16:47:32 +0200
From: Mateusz Jończyk <mat.jonczyk@...pl>
To: Yu Kuai <yukuai3@...wei.com>, linux-raid@...r.kernel.org,
 linux-kernel@...r.kernel.org
Cc: Song Liu <song@...nel.org>, Paul Luse <paul.e.luse@...ux.intel.com>
Subject: Filesystem corruption when adding a new device (delayed-resync,
 write-mostly)

Hello,

In my laptop, I used to have two RAID1 arrays on top of NVMe and SATA SSD
drives: /dev/md0 for /boot (not partitioned), /dev/md1 for remaining data (LUKS
+ LVM + ext4). For performance, I have marked the RAID component device for
/dev/md1 on the SATA SSD drive write-mostly, which "means that the 'md' driver
will avoid reading from these devices if at all possible" (man mdadm).

Recently, the NVMe drive started having problems (PCI AER errors and the
controller disappearing), so I removed it from the arrays and wiped it.
However, I have reseated the drive in the M.2 socket and this apparently fixed
it (verified with tests).

    $ cat /proc/mdstat
    Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
    md1 : active raid1 sdb5[1](W)
          471727104 blocks super 1.2 [2/1] [_U]
          bitmap: 4/4 pages [16KB], 65536KB chunk

    md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
          3142656 blocks super 1.2 [2/2] [UU]
          bitmap: 0/1 pages [0KB], 65536KB chunk

    md0 : active raid1 sdb4[3]
          2094080 blocks super 1.2 [2/1] [_U]
         
    unused devices: <none>

(md2 was used just for testing, ignore it).

Today, I have tried to add the drive back to the arrays by using a script that
executed in quick succession:

    mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
    mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3

This was on Linux 6.10.0, patched with my previous patch:

    https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@o2.pl/

(which fixed a regression in the kernel and allows it to start /dev/md1 with a
single drive in write-mostly mode).
In the background, I was running "rdiff-backup --compare" that was comparing
data between my array contents and a backup attached via USB.

This, however resulted in mayhem - I was unable to start any program with an
input-output error, etc. I used SysRQ + C to save a kernel log:

    [ 4940.891490] md: could not open device unknown-block(259,1).
    [ 4940.891498] md: md_import_device returned -16
    [ 4940.920440] md: could not open device unknown-block(259,1).
    [ 4940.920449] md: md_import_device returned -16
    [ 4940.934462] md: recovery of RAID array md0
    [ 4941.003041] EXT4-fs warning (device dm-4): ext4_dirblock_csum_verify:405: inode #16676515: comm rdiff-backup: No space for directory leaf checksum. Please run e2fsck -D.
    [ 4941.003053] EXT4-fs error (device dm-4): htree_dirblock_to_tree:1082: inode #16676515: comm rdiff-backup: Directory block failed checksum
    [ 4941.003061] Aborting journal on device dm-4-8.
    [ 4941.066131] md: delaying recovery of md1 until md0 has finished (they share one or more physical units)
    [ 4941.092374] EXT4-fs (dm-4): Remounting filesystem read-only
    [ 4942.301499] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
    [ 4942.301508] EXT4-fs error (device dm-2): __ext4_find_entry:1693: inode #156120: comm gnome-terminal-: checksumming directory block 0
    [ 4942.301515] Aborting journal on device dm-2-8.
    [ 4942.332393] EXT4-fs (dm-2): Remounting filesystem read-only
    [ 4942.333839] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
    [ 4942.765422] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
    [ 4942.766884] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
    [ 4947.364466] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
    [ 4947.366435] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
    [ 4947.849698] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
    [ 4947.851860] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
    [ 4949.226094] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
    [ 4949.227982] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
    [ 4949.696095] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
    [ 4949.697468] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
    [ 4950.105158] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm pool-gnome-shel: No space for directory leaf checksum. Please run e2fsck -D.
    [ 4950.105645] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm pool-gnome-shel: No space for directory leaf checksum. Please run e2fsck -D.
    [ 4951.559184] md: md0: recovery done.
    [ 4951.562154] md: recovery of RAID array md1
    [ 4952.636764] EXT4-fs warning: 6 callbacks suppressed

The interesting fact is that both dm-4 and dm-2 are on /dev/md1, which was
modified second, and should not have been touched (or read from) before the
resync of /dev/md0 was complete.

I have restarted the laptop on the same kernel, and this apparently
made things worse (perhaps fsck at boot corrupted my root filesystem).
All in all, other filesystems were relatively unaffected, but the root
filesystem (for / ) I had to recover from backups.

Greetings,

Mateusz

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ