lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <887a09bc-3c98-4bd1-aa31-0732fc633315@wdc.com>
Date: Mon, 23 Sep 2024 14:41:20 +0000
From: Johannes Thumshirn <Johannes.Thumshirn@....com>
To: Qu Wenruo <quwenruo.btrfs@....com>, Johannes Thumshirn <jth@...nel.org>,
	Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>, David Sterba
	<dsterba@...e.com>, "open list:BTRFS FILE SYSTEM"
	<linux-btrfs@...r.kernel.org>, open list <linux-kernel@...r.kernel.org>
CC: WenRuo Qu <wqu@...e.com>, Naohiro Aota <Naohiro.Aota@....com>
Subject: Re: [PATCH] btrfs: also add stripe entries for NOCOW writes

On 23.09.24 10:54, Qu Wenruo wrote:
> 
> 
> 在 2024/9/23 17:45, Johannes Thumshirn 写道:
>> On 23.09.24 09:56, Qu Wenruo wrote:
>>>>> 在 2024/9/23 16:15, Johannes Thumshirn 写道:
>>>>>> From: Johannes Thumshirn <johannes.thumshirn@....com>
>>>>>>
>>>>>> NOCOW writes do not generate stripe_extent entries in the RAID stripe
>>>>>> tree, as the RAID stripe-tree feature initially was designed with a
>>>>>> zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
>>>>>> writes. But the RAID stripe-tree feature is independent from the zoned
>>>>>> feature, so we must also allow NOCOW writes for zoned filesystems.
>>>>>>
>>>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@....com>
>>>>>
>>>>> Sorry I'm going to repeat myself again, I still believe if we insert an
>>>>> RST entry at falloc() time, it will be more consistent with the non-RST
>>>>> code.
>>>>>
>>>>> Yes, I known preallocated space will not need any read nor search RST
>>>>> entry, and we just fill the page cache with zero at read time.
>>>>>
>>>>> But the point of proper (not just dummy) RST entry for the whole
>>>>> preallocated space is, we do not need to touch the RST entry anymore for
>>>>> NOCOW/PREALLOCATED write at all.
>>>>>
>>>>> This makes the RST NOCOW/PREALLOC writes behavior to align with the
>>>>> non-RST code, which doesn't update any extent item, but only modify the
>>>>> file extent for PREALLOC writes.
>>>>
>>>> Please re-read the patch. This is not a dummy RST entry but a real RST
>>>> entry for NOCOW writes.
>>>>
>>> I know, but my point is, if the RST entry for preallocated range is
>>> already a regular one, you won't even need to insert/update the RST tree
>>> at all.
>>>
>>> Just like we do not need to update the extent tree for
>>> NOCOW/PREALLOCATED writes.
>>
>> But as long as there is no data on disk there is no point of having a
>> logical to N-disk physical mapping. We haven't even called
>> btrfs_map_block() before, so where do we know which disks will get the
>> blocks at which address? Look at this example:
>>
>> Fallocate [0, 1M]
>> virtme-scsi:/mnt # xfs_io -fc "falloc 0 1M" -c sync test
> [...]
>>
>>
>> [1] we preallocate the data for [0, 1M] @ 298844160
>> [2] we have the actual written data for [64k, 128k] @ 298844160
>>
>> What should I do to insert the RST entry there as we get:
> 
> Do the needed write map and insert the RST entries, just as if we're
> doing a write, but without any actual IO.
> 
> 
> Currently the handling of RST is not consistent with non-RST, thus
> that's the reason causing problems with scrub on preallocated extents.
> 
> I known preallocated range won't trigger any read thus it makes no sense
> to do the proper RST setup.
> But let's also take the example of non-RST scrub:
> 
> Do we spend our time checking if a data extent is preallocated or not?
> No, we do not. We just read the content, and check against its csum.
> For preallocated extents, it just has no csum.
> 
> You can argue that this is wasting IO, but I can also counter-argue that
> we need to make sure the read on that device range is fine, even if we
> know we will not really read the content (but that's only for now).
> 
> 
> Furthermore, from the logical aspect, at least to me, non-RST case is
> just a special case where logical address is 1:1 mapped already.
> 
> This also means, even for preallocated extents, they should have RST
> entries.
> 
> 
> Finally, I do not think it's a good idea to insert RST entries for NOCOW.
> If a file is set NOCOW, it means we'll doing a lot of overwrite for it.
> Then why waste our time updating the RST entries again and again?
> 
> Isn't such behavior going to cause more write amplification? Meanwhile
> for non-RST cases, NOCOW should cause the least amount of write
> amplification.

The whole idea behind the RST was to write the RST entries _after_ the 
data has been persisted to disk. Otherwise we're back at the write hole 
problem. See for example this imaginary sequence:

Preallocate a range. This will then also preallocate the RST entries 
with the mapping as you describe. Write to it and while you write you 
have a powerloss. The copy/stripe to disk 1 is correctly written but 
disk 2 didn't report back before the power loss happened. After we have 
power again, a read to disk 2 comes in, as we have a RST entry, the read 
will be directed to the broken entry and garbage is returned. And this 
is the good case, as we can repair it.
If it was an overwrite of a block and the same happens, we have a RST 
entry pointing to a good and a bad copy.

Once we're adding the RST entries after both writes succeed the problem 
isn't there. So for preallocated extents it is even harmful to add a RST 
entry.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ