linux-kernel - Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <92cec08a-7fe7-1cc7-7a39-7f3d5fbc087b@huaweicloud.com>
Date: Thu, 23 May 2024 10:00:02 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Dave Chinner <david@...morbit.com>
Cc: linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-ext4@...r.kernel.org, djwong@...nel.org,
 hch@...radead.org, brauner@...nel.org, chandanbabu@...nel.org, jack@...e.cz,
 yi.zhang@...wei.com, chengzhihao1@...wei.com, yukuai3@...wei.com
Subject: Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range

On 2024/5/23 9:11, Dave Chinner wrote:
> On Wed, May 22, 2024 at 09:57:13AM +0800, Zhang Yi wrote:
>> On 2024/5/21 10:38, Dave Chinner wrote:
>>> We can do all this with a single writeback operation if we are a
>>> little bit smarter about the order of operations we perform and we
>>> are a little bit smarter in iomap about zeroing dirty pages in the
>>> page cache:
>>>
>>> 	1. change iomap_zero_range() to do the right thing with
>>> 	dirty unwritten and cow extents (the patch I've been working
>>> 	on).
>>>
>>> 	2. pass the range to be zeroed into iomap_truncate_page()
>>> 	(the fundamental change being made here).
>>>
>>> 	3. zero the required range *through the page cache*
>>> 	(iomap_zero_range() already does this).
>>>
>>> 	4. write back the XFS inode from ip->i_disk_size to the end
>>> 	of the range zeroed by iomap_truncate_page()
>>> 	(xfs_setattr_size() already does this).
>>>
>>> 	5. i_size_write(newsize);
>>>
>>> 	6. invalidate_inode_pages2_range(newsize, -1) to trash all
>>> 	the page cache beyond the new EOF without doing any zeroing
>>> 	as we've already done all the zeroing needed to the page
>>> 	cache through iomap_truncate_page().
>>>
>>>
>>> The patch I'm working on for step 1 is below. It still needs to be
>>> extended to handle the cow case, but I'm unclear on how to exercise
>>> that case so I haven't written the code to do it. The rest of it is
>>> just rearranging the code that we already use just to get the order
>>> of operations right. The only notable change in behaviour is using
>>> invalidate_inode_pages2_range() instead of truncate_pagecache(),
>>> because we don't want the EOF page to be dirtied again once we've
>>> already written zeroes to disk....
>>>
>>
>> Indeed, this sounds like the best solution. Since Darrick recommended
>> that we could fix the stale data exposure on realtime inode issue by
>> convert the tail extent to unwritten, I suppose we could do this after
>> fixing the problem.
> 
> We also need to fix the truncate issue for the upcoming forced
> alignment feature (for atomic writes), and in that case we are
> required to write zeroes to the entire tail extent. i.e. forced
> alignment does not allow partial unwritten extent conversion of
> the EOF extent.
> 

Yes, right. I noticed that feature also needs to fix.

> Hence I think we want to fix the problem by zeroing the entire EOF
> extent first, then optimise the large rtextsize case to use
> unwritten extents if that tail zeroing proves to be a performance
> issue.
> 
> I say "if" because the large rtextsize case will still need to write
> zeroes for the fsb that spans EOF. Adding conversion of the rest of
> the extent to unwritten may well be more expensive (in terms of both
> CPU and IO requirements for the transactional metadata updates) than
> just submitting a slightly larger IO containing real zeroes and
> leaving it as a written extent....
> 

Yeah, if the rtextsize if not large (in most cases), I'm pretty sure
that writing zeros would better. If the rtextsize is large enough, I
think it deserves a performance test.

Thanks,
Yi.