Message-ID: <018dd7bf-e7f6-3561-a522-5dea143947eb@huaweicloud.com>
Date: Wed, 31 Jul 2024 09:10:59 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Mateusz Jończyk <mat.jonczyk@...pl>,
 Paul E Luse <paul.e.luse@...ux.intel.com>
Cc: linux-raid@...r.kernel.org, linux-kernel@...r.kernel.org,
 Song Liu <song@...nel.org>, regressions@...ts.linux.dev,
 Mariusz Tkaczyk <mariusz.tkaczyk@...ux.intel.com>,
 "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [REGRESSION] Filesystem corruption when adding a new RAID device
 (delayed-resync, write-mostly)

Hi,

On 2024/07/31 4:35, Mateusz Jończyk wrote:
> On 28.07.2024 at 12:30, Mateusz Jończyk wrote:
>> On 25.07.2024 at 16:27, Paul E Luse wrote:
>>> On Thu, 25 Jul 2024 09:15:40 +0200
>>> Mateusz Jończyk <mat.jonczyk@...pl> wrote:
>>>
>>>> On 24 July 2024 at 23:19:06 CEST, Paul E Luse
>>>> <paul.e.luse@...ux.intel.com> wrote:
>>>>> On Wed, 24 Jul 2024 22:35:49 +0200
>>>>> Mateusz Jończyk <mat.jonczyk@...pl> wrote:
>>>>>
>>>>>> On 22.07.2024 at 07:39, Mateusz Jończyk wrote:
>>>>>>> On 20.07.2024 at 16:47, Mateusz Jończyk wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> In my laptop, I used to have two RAID1 arrays on top of NVMe and
>>>>>>>> SATA SSD drives: /dev/md0 for /boot (not partitioned), /dev/md1
>>>>>>>> for remaining data (LUKS
>>>>>>>> + LVM + ext4). For performance, I have marked the RAID component
>>>>>>>> device for /dev/md1 on the SATA SSD drive write-mostly, which
>>>>>>>> "means that the 'md' driver will avoid reading from these
>>>>>>>> devices if at all possible" (man mdadm).
>>>>>>>>
>>>>>>>> Recently, the NVMe drive started having problems (PCI AER errors
>>>>>>>> and the controller disappearing), so I removed it from the
>>>>>>>> arrays and wiped it. However, I have reseated the drive in the
>>>>>>>> M.2 socket and this apparently fixed it (verified with tests).
>>>>>>>>
>>>>>>>>      $ cat /proc/mdstat
>>>>>>>>      Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
>>>>>>>>      md1 : active raid1 sdb5[1](W)
>>>>>>>>            471727104 blocks super 1.2 [2/1] [_U]
>>>>>>>>            bitmap: 4/4 pages [16KB], 65536KB chunk
>>>>>>>>
>>>>>>>>      md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
>>>>>>>>            3142656 blocks super 1.2 [2/2] [UU]
>>>>>>>>            bitmap: 0/1 pages [0KB], 65536KB chunk
>>>>>>>>
>>>>>>>>      md0 : active raid1 sdb4[3]
>>>>>>>>            2094080 blocks super 1.2 [2/1] [_U]
>>>>>>>>           
>>>>>>>>      unused devices: <none>
>>>>>>>>
>>>>>>>> (md2 was used just for testing, ignore it).
>>>>>>>>
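(Side note on the write-mostly flag mentioned above: a minimal sketch of how a member can be marked write-mostly and how the flag is reported, using placeholder array and device names rather than the exact setup from this report. Write-mostly members carry a "(W)" suffix in /proc/mdstat, as in the output above.)

    # add a member and mark it write-mostly in one step
    mdadm /dev/md1 --add --write-mostly /dev/sdb5

    # or toggle the flag on an existing member via sysfs
    echo writemostly > /sys/block/md1/md/dev-sdb5/state     # set
    echo -writemostly > /sys/block/md1/md/dev-sdb5/state    # clear

    # verify: write-mostly members show "(W)" after the device name
    grep -A 2 '^md1' /proc/mdstat
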
>>>>>>>> Today, I tried to add the drive back to the arrays using a
>>>>>>>> script that ran the following commands in quick succession:
>>>>>>>>
>>>>>>>>      mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
>>>>>>>>      mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3
>>>>>>>>
>>>>>>>> This was on Linux 6.10.0, patched with my previous patch:
>>>>>>>>
>>>>>>>>      https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@o2.pl/
>>>>>>>>
>>>>>>>> (which fixed a regression in the kernel and allows it to start
>>>>>>>> /dev/md1 with a single drive in write-mostly mode).
>>>>>>>> In the background, I was running "rdiff-backup --compare",
>>>>>>>> which was comparing the array contents against a backup
>>>>>>>> attached via USB.
>>>>>>>>
>>>>>>>> This, however, resulted in mayhem - I was unable to start any
>>>>>>>> program (every attempt failed with an input/output error), etc.
>>>>>>>> I used SysRq + C to save a kernel log:
>>>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Unfortunately, a hardware failure does not seem to be the cause.
>>>>>>
>>>>>> I did test it again on 6.10, twice, and in both cases I got
>>>>>> filesystem corruption (but not as severe).
>>>>>>
>>>>>> On Linux 6.1.96 it seems to be working well (also did two tries).
>>>>>>
>>>>>> Please note: in my tests, I was using a RAID component device with
>>>>>> a write-mostly bit set. This setup does not work on 6.9+ out of the
>>>>>> box and requires the following patch:
>>>>>>
>>>>>> commit 36a5c03f23271 ("md/raid1: set max_sectors during early
>>>>>> return from choose_slow_rdev()")
>>>>>>
>>>>>> that is in master now.
>>>>>>
>>>>>> It is also heading into stable, which I'm going to interrupt.
>> Hello,
>>
>> With much effort (it is challenging to reproduce reliably), I think I have nailed the issue down to the read_balance refactoring series in 6.9:
> [snip]
>> After analyzing the code, I noticed that the following check, which was present in the
>> old read_balance(), is not present in equivalent form in the new code:
>>
>>                  if (!test_bit(In_sync, &rdev->flags) &&
>>                      rdev->recovery_offset < this_sector + sectors)
>>                          continue;
>>
>> (in choose_slow_rdev() and choose_first_rdev() and possibly other functions)
>>
>> Its absence would cause the kernel to read from the device being synced to
>> before it is ready.
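To make the window concrete: while a member is being resynced, only the region below its recovery offset is valid on that member, and the check quoted above made the old read_balance() skip such a member whenever the requested range extended past that offset. A rough way to watch that offset advance from user space, assuming a placeholder /dev/md4 array and member name (the exact sysfs layout may differ between kernel versions):

    # per-member recovery offset (sectors known to be good) and overall
    # resync progress, sampled once per second while a resync runs
    while true; do
        cat /sys/block/md4/md/dev-nvme0n1p5/recovery_start 2>/dev/null
        cat /sys/block/md4/md/sync_completed
        grep -A 3 '^md4' /proc/mdstat
        sleep 1
    done
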
> 
> Hello,
> 
> I think I have made a reliable (and safe) reproducer for this bug:
> 
> Prerequisite: create an array on top of two devices, each at least 1 GB in size:
> 
> mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/nvme0n1p5 --write-mostly /dev/sdb8
> The script:
> -------------------------------8<------------------------
> 
> #!/bin/bash
> 
> mdadm /dev/md4 --fail /dev/nvme0n1p5
> sleep 1
> mdadm /dev/md4 --remove failed
> sleep 1
> 
> # fill with random data
> shred -n1 -v /dev/md4
> # fill with zeros
> shred -n0 -zv /dev/nvme0n1p5
> 
> sha256sum /dev/md4
> 
> echo 1 > /proc/sys/vm/drop_caches
> 
> date
> 
> # calculate a shasum while the array is being synced
> ( sha256sum /dev/md4; date ) &
> mdadm /dev/md4 --add --readwrite /dev/nvme0n1p5
> date
> 
> -------------------------------8<------------------------
> 
> The two shasums should be equal, but they were different in my tests on affected kernels.
> 
> Also, in my tests with the script, the problems did not happen *without* a write-mostly device in the array.
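
For anyone who wants to try the reproducer without risking real disks, a rough sketch of the same setup on top of loop devices; the file paths, sizes and the /dev/md4 name are placeholders, and the reproducer script would need its device names substituted accordingly:

    # back the array with two ~1.1 GB files on loop devices
    truncate -s 1100M /tmp/fast.img /tmp/slow.img
    FAST=$(losetup --find --show /tmp/fast.img)
    SLOW=$(losetup --find --show /tmp/slow.img)

    # same layout as above: one normal member, one write-mostly member
    mdadm --create /dev/md4 --level=1 --raid-devices=2 "$FAST" --write-mostly "$SLOW"

    # ... run the reproducer script against /dev/md4, failing and re-adding "$FAST" ...

    # teardown
    mdadm --stop /dev/md4
    losetup -d "$FAST" "$SLOW"
    rm /tmp/fast.img /tmp/slow.img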

Thanks for the test,

Can you send a new version of the patch, and also contribute this test to mdadm?
Kuai

> 
> Greetings,
> 
> Mateusz
> 
> 

