Date: Tue, 20 Feb 2024 11:09:06 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Benjamin Marzinski <bmarzins@...hat.com>,
 Yu Kuai <yukuai1@...weicloud.com>
Cc: Song Liu <song@...nel.org>, mpatocka@...hat.com, heinzm@...hat.com,
 xni@...hat.com, blazej.kucman@...ux.intel.com, agk@...hat.com,
 snitzer@...nel.org, dm-devel@...ts.linux.dev, jbrassow@....redhat.com,
 neilb@...e.de, shli@...com, akpm@...l.org, linux-kernel@...r.kernel.org,
 linux-raid@...r.kernel.org, yi.zhang@...wei.com, yangerkun@...wei.com,
 "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH v5 00/14] dm-raid/md/raid: fix v6.7 regressions

Hi,

On 2024/02/20 0:05, Benjamin Marzinski wrote:
> On Sun, Feb 18, 2024 at 09:24:31AM +0800, Yu Kuai wrote:
>> Hi,
>>
>> On 2024/02/16 13:46, Benjamin Marzinski wrote:
>>> On Thu, Feb 15, 2024 at 02:24:34PM -0800, Song Liu wrote:
>>>> On Thu, Feb 1, 2024 at 1:30 AM Yu Kuai <yukuai1@...weicloud.com> wrote:
>>>>>
>>>> [...]
>>>>>
>>>>> [1] https://lore.kernel.org/all/CALTww29QO5kzmN6Vd+jT=-8W5F52tJjHKSgrfUc1Z1ZAeRKHHA@mail.gmail.com/
>>>>>
>>>>> Yu Kuai (14):
>>>>>     md: don't ignore suspended array in md_check_recovery()
>>>>>     md: don't ignore read-only array in md_check_recovery()
>>>>>     md: make sure md_do_sync() will set MD_RECOVERY_DONE
>>>>>     md: don't register sync_thread for reshape directly
>>>>>     md: don't suspend the array for interrupted reshape
>>>>>     md: fix missing release of 'active_io' for flush
>>>>
>>>> Applied 1/14-5/14 to md-6.8 branch (6/14 was applied earlier).
>>>>
>>>> Thanks,
>>>> Song
>>>
>>> I'm still seeing new failures that I can't reproduce in the 6.6 kernel,
>>> specifically:
>>>
>>> lvconvert-raid-reshape-stripes-load-reload.sh
>>> lvconvert-repair-raid.sh
>>>
>>> with lvconvert-raid-reshape-stripes-load-reload.sh Patch 12/14
>>> ("md/raid456: fix a deadlock for dm-raid456 while io concurrent with
>>> reshape") is changing a hang to a corruption. The issues is that we
>>> can't simply fail IO that crosses the reshape position. I assume that
>>> the correct thing to do is have dm-raid reissue it after the suspend,
>>> when the reshape can make progress again. Perhaps something like this,
>>> only less naive (although this patch does make the test pass for me).
>>> Heinz, any thoughts on this? Otherwise, I'll look into this a little
>>> more and post an RFC patch.
>>
>> Does the corruption look like the one below?
> 
> There isn't a kernel stack trace.  The test
> lvconvert-raid-reshape-stripes-load-reload.sh does some IO to a
> filesystem on top of a raid device, and then starts a reshape, and
> repeatedly suspends the device. After all that, it runs fsck to see if
> the filesystem is clean, and on runs where I see "dm-raid456: io failed
> across reshape position while reshape can't make progress" I see
> filesystem errors:
> 
> ------------------------------------------------------------------
> [ 0:25.219] fsck from util-linux 2.39.2
> [ 0:25.224] e2fsck 1.47.0 (5-Feb-2023)
> [ 0:25.232] Warning: skipping journal recovery because doing a read-only
> filesystem check.
> [ 0:25.233] Pass 1: Checking inodes, blocks, and sizes
> [ 0:25.233] Pass 2: Checking directory structure
> [ 0:25.234] Pass 3: Checking directory connectivity
> [ 0:25.234] Pass 4: Checking reference counts
> [ 0:25.234] Pass 5: Checking group summary information
> [ 0:25.234] Feature orphan_present is set but orphan file is clean.
> [ 0:25.235] Clear? no
> [ 0:25.235]
> [ 0:25.235]
> [ 0:25.235] /tmp/LVMTEST35943.Iuo9Ro5tCY/dev/mapper/LVMTEST35943vg-LV1:
> ********** WARNING: Filesystem still has errors **********
> [ 0:25.235]
> [ 0:25.235] /tmp/LVMTEST35943.Iuo9Ro5tCY/dev/mapper/LVMTEST35943vg-LV1:
> 13/2560 files (0.0% non-contiguous), 5973/10240 blocks
> ------------------------------------------------------------------
> 
> OK, corruption is too strong a word. Let's just call it a filesystem
> that got a write error and is now in an unclean state according to
> fsck. I'm pretty sure that this is recoverable.

Yes, I thought this could be acceptable because everything should be good
again once the reshape continues.

> 
>> [12504.959682] BUG bio-296 (Not tainted): Object already free
>> [12504.960239]
>> -----------------------------------------------------------------------------
>> [12504.960239]
>> [12504.961209] Allocated in mempool_alloc+0xe8/0x270 age=30 cpu=1 pid=203288
>> [12504.961905]  kmem_cache_alloc+0x36a/0x3b0
>> [12504.962324]  mempool_alloc+0xe8/0x270
>> [12504.962712]  bio_alloc_bioset+0x3b5/0x920
>> [12504.963129]  bio_alloc_clone+0x3e/0x160
>> [12504.963533]  alloc_io+0x3d/0x1f0
>> [12504.963876]  dm_submit_bio+0x12f/0xa30
>> [12504.964267]  __submit_bio+0x9c/0xe0
>> [12504.964639]  submit_bio_noacct_nocheck+0x25a/0x570
>> [12504.965136]  submit_bio_wait+0xc2/0x160
>> [12504.965535]  blkdev_issue_zeroout+0x19b/0x2e0
>> [12504.965991]  ext4_init_inode_table+0x246/0x560
>> [12504.966462]  ext4_lazyinit_thread+0x750/0xbe0
>> [12504.966922]  kthread+0x1b4/0x1f0
>>
>> I assume that this is a dm problem and I'm still trying to debug it.
>> Can you explain more why IO that crosses the reshape position can't
>> fail directly?
> 
> Maybe I'm missing something here, but if the filesystem is trying to
> write out data to the device, and we fail that IO, why would that not
> cause problems, whatever we call it?

And the root cause is the logic in raid456:

Reshape reconstructs data, and data across the reshape position can't be
reached until the reshape makes progress.

The point is that before c467e97f079f, data could be corrupted silently;
c467e97f079f fixes that by no longer submitting IO across the reshape
position directly.

I'm not sure yet how to completely fix this. We could let the IO wait for
the reshape to make progress (in the upper layer; waiting in raid456 will
deadlock) instead of failing it directly. However, continuing the reshape
relies on the user, so this way the IO may wait forever.
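
A rough sketch of that "wait in the upper layer" idea, purely
illustrative: dm-raid could park bios that cross the reshape position on
a per-target list and resubmit them once MD reports that the reshape
position has advanced. The struct and helper names below are made up for
the sketch (only bio_list, the spinlock helpers and submit_bio_noacct()
are real kernel interfaces), and it deliberately ignores the open problem
above, namely that the reshape may never continue if the user does not
resume it.

=========================================================
/* Illustrative sketch only, not a tested patch. */
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/spinlock.h>

struct raid_set_stub {			/* stand-in for struct raid_set */
	spinlock_t	deferred_lock;
	struct bio_list	deferred_bios;	/* bios waiting for reshape progress */
};

/* Called from the map path when a bio crosses the reshape position. */
static void defer_bio_across_reshape(struct raid_set_stub *rs, struct bio *bio)
{
	spin_lock_irq(&rs->deferred_lock);
	bio_list_add(&rs->deferred_bios, bio);
	spin_unlock_irq(&rs->deferred_lock);
}

/* Called once the reshape position has advanced past the parked bios. */
static void resubmit_deferred_bios(struct raid_set_stub *rs)
{
	struct bio_list bios;
	struct bio *bio;

	bio_list_init(&bios);

	spin_lock_irq(&rs->deferred_lock);
	bio_list_merge(&bios, &rs->deferred_bios);
	bio_list_init(&rs->deferred_bios);
	spin_unlock_irq(&rs->deferred_lock);

	while ((bio = bio_list_pop(&bios)))
		submit_bio_noacct(bio);
}
=========================================================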

Thanks,
Kuai
> 
> [ 0:18.792] 3,6342,47220996156,-;dm-raid456: io failed across reshape
> position while reshape can't make progress
> [ 0:18.792] 3,6343,47220996182,-;Aborting journal on device dm-39-8.
> [ 0:18.792] 3,6344,47221411730,-;dm-raid456: io failed across reshape
> position while reshape can't make progress
> [ 0:18.792] 3,6345,47221411746,-;Buffer I/O error on dev dm-39, logical
> block 740, lost sync page write
> [ 0:18.792] 3,6346,47221416194,-;JBD2: I/O error when updating journal
> superblock for dm-39-8.
> 
> Does this test not fail for you? Or does it simply also fail in the 6.6
> kernel?

Yes, this test failed as well. And it also fails in 6.6.
> 
> -Ben
>   
>> Thanks,
>> Kuai
>>
>>>
>>> =========================================================
>>> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
>>> index ed8c28952b14..ff481d494b04 100644
>>> --- a/drivers/md/dm-raid.c
>>> +++ b/drivers/md/dm-raid.c
>>> @@ -3345,6 +3345,14 @@ static int raid_map(struct dm_target *ti, struct bio *bio)
>>>    	return DM_MAPIO_SUBMITTED;
>>>    }
>>> +static int raid_end_io(struct dm_target *ti, struct bio *bio,
>>> +		       blk_status_t *error)
>>> +{
>>> +	if (*error != BLK_STS_IOERR || !dm_noflush_suspending(ti))
>>> +		return DM_ENDIO_DONE;
>>> +	return DM_ENDIO_REQUEUE;
>>> +}
>>>
>>> +
>>>    /* Return sync state string for @state */
>>>    enum sync_state { st_frozen, st_reshape, st_resync, st_check, st_repair, st_recover, st_idle };
>>>    static const char *sync_str(enum sync_state state)
>>> @@ -4100,6 +4108,7 @@ static struct target_type raid_target = {
>>>    	.ctr = raid_ctr,
>>>    	.dtr = raid_dtr,
>>>    	.map = raid_map,
>>> +	.end_io = raid_end_io,
>>>    	.status = raid_status,
>>>    	.message = raid_message,
>>>    	.iterate_devices = raid_iterate_devices,
>>> =========================================================
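
For readability, the same end_io idea restated as a standalone, commented
sketch. dm_noflush_suspending() and the DM_ENDIO_* return codes are real
device-mapper interfaces; the description of how dm core handles the
requeue afterwards is a best-effort reading and worth double-checking.

=========================================================
/*
 * Sketch restating the hook above: only BLK_STS_IOERR completions seen
 * while the target is in a noflush suspend are requeued; dm core is then
 * expected to hold the bio and reissue it after resume, i.e. once the
 * reshape can make progress again.
 */
static int raid_end_io(struct dm_target *ti, struct bio *bio,
		       blk_status_t *error)
{
	/* Successful IO, or an error other than BLK_STS_IOERR: complete as-is. */
	if (*error != BLK_STS_IOERR)
		return DM_ENDIO_DONE;

	/* Not suspending: nothing would retry this bio later, so let it fail. */
	if (!dm_noflush_suspending(ti))
		return DM_ENDIO_DONE;

	/* Noflush suspend in progress: ask dm core to requeue the bio. */
	return DM_ENDIO_REQUEUE;
}
=========================================================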
>>>>
>>>>
>>>>>     md: export helpers to stop sync_thread
>>>>>     md: export helper md_is_rdwr()
>>>>>     dm-raid: really frozen sync_thread during suspend
>>>>>     md/dm-raid: don't call md_reap_sync_thread() directly
>>>>>     dm-raid: add a new helper prepare_suspend() in md_personality
>>>>>     md/raid456: fix a deadlock for dm-raid456 while io concurrent with
>>>>>       reshape
>>>>>     dm-raid: fix lockdep waring in "pers->hot_add_disk"
>>>>>     dm-raid: remove mddev_suspend/resume()
>>>>>
>>>>>    drivers/md/dm-raid.c |  78 +++++++++++++++++++--------
>>>>>    drivers/md/md.c      | 126 +++++++++++++++++++++++++++++--------------
>>>>>    drivers/md/md.h      |  16 ++++++
>>>>>    drivers/md/raid10.c  |  16 +-----
>>>>>    drivers/md/raid5.c   |  61 +++++++++++----------
>>>>>    5 files changed, 192 insertions(+), 105 deletions(-)
>>>>>
>>>>> --
>>>>> 2.39.2
>>>>>
>>>>>
>>>
>>> .
>>>
> 
> .
> 

