Message-ID: <ce95e64c-1a67-4a92-984a-c1eab0894857@o2.pl>
Date: Mon, 22 Jul 2024 07:39:42 +0200
From: Mateusz Jończyk <mat.jonczyk@...pl>
To: Yu Kuai <yukuai3@...wei.com>, linux-raid@...r.kernel.org,
 linux-kernel@...r.kernel.org
Cc: Song Liu <song@...nel.org>, Paul Luse <paul.e.luse@...ux.intel.com>
Subject: Re: Filesystem corruption when adding a new device (delayed-resync,
 write-mostly)

On 20.07.2024 at 16:47, Mateusz Jończyk wrote:
> Hello,
>
> In my laptop, I used to have two RAID1 arrays on top of NVMe and SATA SSD
> drives: /dev/md0 for /boot (not partitioned), /dev/md1 for remaining data (LUKS
> + LVM + ext4). For performance, I have marked the RAID component device for
> /dev/md1 on the SATA SSD drive write-mostly, which "means that the 'md' driver
> will avoid reading from these devices if at all possible" (man mdadm).
>
> Recently, the NVMe drive started having problems (PCI AER errors and the
> controller disappearing), so I removed it from the arrays and wiped it.
> However, I have reseated the drive in the M.2 socket and this apparently fixed
> it (verified with tests).
>
>     $ cat /proc/mdstat
>     Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
>     md1 : active raid1 sdb5[1](W)
>           471727104 blocks super 1.2 [2/1] [_U]
>           bitmap: 4/4 pages [16KB], 65536KB chunk
>
>     md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
>           3142656 blocks super 1.2 [2/2] [UU]
>           bitmap: 0/1 pages [0KB], 65536KB chunk
>
>     md0 : active raid1 sdb4[3]
>           2094080 blocks super 1.2 [2/1] [_U]
>          
>     unused devices: <none>
>
> (md2 was used just for testing, ignore it).
>
> Today, I have tried to add the drive back to the arrays by using a script that
> executed in quick succession:
>
>     mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
>     mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3
>
> This was on Linux 6.10.0, patched with my previous patch:
>
>     https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@o2.pl/
>
> (which fixed a regression in the kernel and allows it to start /dev/md1 with a
> single drive in write-mostly mode).
> In the background, I was running "rdiff-backup --compare" that was comparing
> data between my array contents and a backup attached via USB.
>
> This, however, resulted in mayhem: I was unable to start any program (each
> attempt failed with an input/output error), etc. I used SysRq + C to save a
> kernel log:
>
Hello,

It is possible that my second SSD has some problems and that the high read
activity during the RAID resync triggered them. Reads from that drive are now
very slow (between 10 and 30 MB/s), which suggests that something is wrong.
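
For anyone who wants to check a similar symptom, here is a minimal sketch of how
sequential read throughput could be measured. It is not the method used for the
numbers above; the function name and its parameters are hypothetical, and the
result can be inflated by the page cache if the same data was read recently.

```python
import time


def read_throughput(path, total_bytes=256 * 1024 * 1024, chunk_size=1024 * 1024):
    """Read up to total_bytes sequentially from path.

    Returns (bytes_read, rate), where rate is in MB/s (1 MB = 10**6 bytes).
    Hypothetical helper for illustration; path may be a regular file or,
    with sufficient privileges, a block device.
    """
    bytes_read = 0
    start = time.perf_counter()
    # buffering=0 avoids Python-level buffering; the kernel page cache
    # can still serve the data, so treat the result as an upper bound.
    with open(path, "rb", buffering=0) as f:
        while bytes_read < total_bytes:
            chunk = f.read(min(chunk_size, total_bytes - bytes_read))
            if not chunk:  # end of file or device reached
                break
            bytes_read += len(chunk)
    elapsed = time.perf_counter() - start
    if bytes_read == 0 or elapsed <= 0:
        return bytes_read, 0.0
    return bytes_read, bytes_read / elapsed / 1e6
```

Calling it on the suspect drive (for example `read_throughput("/dev/sdb")`, run
as root, with the device name taken from the mdstat output quoted above) and
comparing against a known-good drive would show whether the slowdown is
specific to one device.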

Greetings,

Mateusz