Message-ID: <098e65e7-53fb-4bf1-b973-2bda425139ae@demonlair.co.uk>
Date: Wed, 23 Oct 2024 12:46:27 +0100
From: Geoff Back <geoff@...onlair.co.uk>
To: John Garry <john.g.garry@...cle.com>, Yu Kuai <yukuai1@...weicloud.com>,
 axboe@...nel.dk, hch@....de
Cc: linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
 linux-raid@...r.kernel.org, martin.petersen@...cle.com,
 "yangerkun@...wei.com" <yangerkun@...wei.com>,
 "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH RFC 5/6] md/raid1: Handle bio_split() errors


On 23/10/2024 12:16, John Garry wrote:
> On 23/09/2024 10:38, Yu Kuai wrote:
>>>>>> We need a new branch in read_balance() to choose an rdev with a
>>>>>> full copy.
>>>>> Sure, I do realize that the mirroring personalities need more
>>>>> sophisticated error handling changes (than what I presented).
>>>>>
>>>>> However, in raid1_read_request() we do the read_balance() and then 
>>>>> the bio_split() attempt. So what are you suggesting we do for the 
>>>>> bio_split() error? Is it to retry without the bio_split()?
>>>>>
>>>>> To me, bio_split() should not fail. If it does, it is likely ENOMEM
>>>>> or some other bug being exposed, so I am not sure that retrying while
>>>>> skipping bio_split() is the right approach (if that is what you are
>>>>> suggesting).
>>>> bio_split_to_limits() is already called from md_submit_bio(), so here
>>>> the bio should only be split because of badblocks or resync. We have to
>>>> return an error for resync; however, for badblocks, we can still try to
>>>> find an rdev without badblocks so that bio_split() is not needed. And we
>>>> need to retry and inform read_balance() to skip rdevs with badblocks in
>>>> this case.
>>>>
>>>> This can only happen if the full copy exists only on slow disks. This
>>>> really is a corner case, and it is not related to your new error path
>>>> for atomic writes. I don't mind this version for now; it's just
>>>> something I noticed given that bio_split() can fail.
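
For illustration, a rough and untested sketch of that "retry and skip
rdevs with badblocks" idea in raid1_read_request();
read_balance_skip_badblocks() is a hypothetical helper that does not
exist in the current code:

	rdisk = read_balance(conf, r1_bio, &max_sectors);
	if (rdisk >= 0 && max_sectors < bio_sectors(bio)) {
		/*
		 * The chosen rdev would force a split because of badblocks.
		 * A full copy may still exist on another (possibly slow)
		 * rdev, so retry the selection while skipping rdevs whose
		 * copy of this range has badblocks.
		 */
		rdisk = read_balance_skip_badblocks(conf, r1_bio, &max_sectors);
		if (rdisk < 0 || max_sectors < bio_sectors(bio)) {
			/* No rdev holds a full copy of the range. */
			bio_io_error(bio);
			return;
		}
	}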
> Hi Kuai,
>
> I am just coming back to this topic now.
>
> Previously, I was saying that we should error and end the bio if we need
> to split for an atomic write due to BBs. Continued below...
>
>>> Are you saying that some improvement needs to be made to the current 
>>> code for badblocks handling, like initially try to skip bio_split()?
>>>
>>> Apart from that, what about the change in raid10_write_request(), 
>>> w.r.t error handling?
>>>
>>> There, if bio_split() fails, I think that we need to do some tidy-up,
>>> i.e. undo the increases in rdev->nr_pending made when looping over
>>> conf->copies.
>>>
>>> BTW, feel free to comment in patch 6/6 for that.
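
For what it's worth, an untested sketch of that nr_pending tidy-up in
raid10_write_request(), assuming bio_split() returns an ERR_PTR on
failure as in this series (replacement devices and the rest of the
error propagation omitted for brevity):

	if (r10_bio->sectors < bio_sectors(bio)) {
		struct bio *split = bio_split(bio, r10_bio->sectors,
					      GFP_NOIO, &conf->bio_split);

		if (IS_ERR(split)) {
			int i;

			/* Drop the references taken in the conf->copies loop. */
			for (i = 0; i < conf->copies; i++) {
				int d = r10_bio->devs[i].devnum;

				if (r10_bio->devs[i].bio)
					rdev_dec_pending(conf->mirrors[d].rdev,
							 mddev);
			}
			bio->bi_status = errno_to_blk_status(PTR_ERR(split));
			bio_endio(bio);
			return;
		}
		/* ... continue with the split bio as before ... */
	}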
>> Yes, raid1/raid10 writes are the same. If you want to enable atomic
>> writes for raid1/raid10, you must add a new branch to handle badblocks
>> now; otherwise, as long as one copy contains any badblocks, the atomic
>> write will fail, while theoretically I think it can work.
> Can you please expand on what you mean by this last sentence, "I think
> it can work"?
>
> Indeed, IMO, the chance of encountering a device with BBs that also
> supports atomic writes is low, so there is no need to try to make it
> work (if it were possible) - I think that we should just report EIO.
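
A minimal, untested sketch of that in the write path, assuming the
atomic write reaches the personality flagged with REQ_ATOMIC:

	/*
	 * An atomic write must not be split; if badblocks would force a
	 * split, fail it with EIO instead of looking for a full copy.
	 */
	if ((bio->bi_opf & REQ_ATOMIC) && max_sectors < bio_sectors(bio)) {
		bio_io_error(bio);	/* completes the bio with BLK_STS_IOERR */
		return;
	}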
>
> Thanks,
> John
>
>
Hi all,

Looking at this from a different angle: what does the bad blocks system
actually gain us in modern environments?  All the physical storage
devices I can think of (including all HDDs and SSDs, NVMe or otherwise)
have internal mechanisms for remapping faulty blocks, so unrecoverable
blocks don't become visible at the Linux kernel level until after the
physical device has exhausted its internal supply of replacement blocks.
At that point the device is already catastrophically failing and, in the
case of SSDs, will likely have already transitioned to a read-only
state.  Using bad-blocks at the kernel level to map around additional
faulty blocks then does not seem to me to have any benefit, and such a
device is unlikely to remain even marginally usable for any useful
length of time anyway.

It seems to me that the bad-blocks capability is a legacy of the distant
past, when HDDs did not do internal block remapping and the kernel could
therefore usefully keep a disk in service by mapping out individual
blocks in software.  If this is the case, and there isn't some other way
in which bad-blocks is still beneficial, might it be better to drop it
altogether rather than implementing complex code to work around its
effects?

Of course I'm happy to be corrected if there's still a real benefit to
having it; just because I can't see one doesn't mean there isn't one.

Regards,
Geoff.



