[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <887a09bc-3c98-4bd1-aa31-0732fc633315@wdc.com>
Date: Mon, 23 Sep 2024 14:41:20 +0000
From: Johannes Thumshirn <Johannes.Thumshirn@....com>
To: Qu Wenruo <quwenruo.btrfs@....com>, Johannes Thumshirn <jth@...nel.org>,
Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>, David Sterba
<dsterba@...e.com>, "open list:BTRFS FILE SYSTEM"
<linux-btrfs@...r.kernel.org>, open list <linux-kernel@...r.kernel.org>
CC: WenRuo Qu <wqu@...e.com>, Naohiro Aota <Naohiro.Aota@....com>
Subject: Re: [PATCH] btrfs: also add stripe entries for NOCOW writes
On 23.09.24 10:54, Qu Wenruo wrote:
>
>
> 在 2024/9/23 17:45, Johannes Thumshirn 写道:
>> On 23.09.24 09:56, Qu Wenruo wrote:
>>>>> 在 2024/9/23 16:15, Johannes Thumshirn 写道:
>>>>>> From: Johannes Thumshirn <johannes.thumshirn@....com>
>>>>>>
>>>>>> NOCOW writes do not generate stripe_extent entries in the RAID stripe
>>>>>> tree, as the RAID stripe-tree feature initially was designed with a
>>>>>> zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
>>>>>> writes. But the RAID stripe-tree feature is independent from the zoned
>>>>>> feature, so we must also allow NOCOW writes for zoned filesystems.
>>>>>>
>>>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@....com>
>>>>>
>>>>> Sorry I'm going to repeat myself again, I still believe if we insert an
>>>>> RST entry at falloc() time, it will be more consistent with the non-RST
>>>>> code.
>>>>>
>>>>> Yes, I known preallocated space will not need any read nor search RST
>>>>> entry, and we just fill the page cache with zero at read time.
>>>>>
>>>>> But the point of proper (not just dummy) RST entry for the whole
>>>>> preallocated space is, we do not need to touch the RST entry anymore for
>>>>> NOCOW/PREALLOCATED write at all.
>>>>>
>>>>> This makes the RST NOCOW/PREALLOC writes behavior to align with the
>>>>> non-RST code, which doesn't update any extent item, but only modify the
>>>>> file extent for PREALLOC writes.
>>>>
>>>> Please re-read the patch. This is not a dummy RST entry but a real RST
>>>> entry for NOCOW writes.
>>>>
>>> I know, but my point is, if the RST entry for preallocated range is
>>> already a regular one, you won't even need to insert/update the RST tree
>>> at all.
>>>
>>> Just like we do not need to update the extent tree for
>>> NOCOW/PREALLOCATED writes.
>>
>> But as long as there is no data on disk there is no point of having a
>> logical to N-disk physical mapping. We haven't even called
>> btrfs_map_block() before, so where do we know which disks will get the
>> blocks at which address? Look at this example:
>>
>> Fallocate [0, 1M]
>> virtme-scsi:/mnt # xfs_io -fc "falloc 0 1M" -c sync test
> [...]
>>
>>
>> [1] we preallocate the data for [0, 1M] @ 298844160
>> [2] we have the actual written data for [64k, 128k] @ 298844160
>>
>> What should I do to insert the RST entry there as we get:
>
> Do the needed write map and insert the RST entries, just as if we're
> doing a write, but without any actual IO.
>
>
> Currently the handling of RST is not consistent with non-RST, thus
> that's the reason causing problems with scrub on preallocated extents.
>
> I known preallocated range won't trigger any read thus it makes no sense
> to do the proper RST setup.
> But let's also take the example of non-RST scrub:
>
> Do we spend our time checking if a data extent is preallocated or not?
> No, we do not. We just read the content, and check against its csum.
> For preallocated extents, it just has no csum.
>
> You can argue that this is wasting IO, but I can also counter-argue that
> we need to make sure the read on that device range is fine, even if we
> know we will not really read the content (but that's only for now).
>
>
> Furthermore, from the logical aspect, at least to me, non-RST case is
> just a special case where logical address is 1:1 mapped already.
>
> This also means, even for preallocated extents, they should have RST
> entries.
>
>
> Finally, I do not think it's a good idea to insert RST entries for NOCOW.
> If a file is set NOCOW, it means we'll doing a lot of overwrite for it.
> Then why waste our time updating the RST entries again and again?
>
> Isn't such behavior going to cause more write amplification? Meanwhile
> for non-RST cases, NOCOW should cause the least amount of write
> amplification.
The whole idea behind the RST was to write the RST entries _after_ the
data has been persisted to disk. Otherwise we're back at the write hole
problem. See for example this imaginary sequence:
Preallocate a range. This will then also preallocate the RST entries
with the mapping as you describe. Write to it and while you write you
have a powerloss. The copy/stripe to disk 1 is correctly written but
disk 2 didn't report back before the power loss happened. After we have
power again, a read to disk 2 comes in, as we have a RST entry, the read
will be directed to the broken entry and garbage is returned. And this
is the good case, as we can repair it.
If it was an overwrite of a block and the same happens, we have a RST
entry pointing to a good and a bad copy.
Once we're adding the RST entries after both writes succeed the problem
isn't there. So for preallocated extents it is even harmful to add a RST
entry.
Powered by blists - more mailing lists