Message-ID: <bf1758b0-5a0f-5c9f-ecdf-e3031fdd9678@redhat.com>
Date: Sun, 10 Apr 2022 09:35:48 +0800
From: Xiubo Li <xiubli@...hat.com>
To: Jeff Layton <jlayton@...nel.org>,
Luís Henriques <lhenriques@...e.de>,
Ilya Dryomov <idryomov@...il.com>
Cc: ceph-devel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4] ceph: invalidate pages when doing direct/sync writes
On 4/8/22 8:04 PM, Jeff Layton wrote:
> On Fri, 2022-04-08 at 10:47 +0800, Xiubo Li wrote:
>> On 4/7/22 11:15 PM, Luís Henriques wrote:
>>> When doing a direct/sync write, we need to invalidate the page cache in
>>> the range being written to. If we don't do this, the cache will include
>>> invalid data as we just did a write that avoided the page cache.
>>>
>>> Signed-off-by: Luís Henriques <lhenriques@...e.de>
>>> ---
>>> fs/ceph/file.c | 19 ++++++++++++++-----
>>> 1 file changed, 14 insertions(+), 5 deletions(-)
>>>
>>> Changes since v3:
>>> - Dropped initial call to invalidate_inode_pages2_range()
>>> - Added extra comment to document invalidation
>>>
>>> Changes since v2:
>>> - Invalidation needs to be done after a write
>>>
>>> Changes since v1:
>>> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
>>> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>>>
>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index 5072570c2203..97f764b2fbdd 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -1606,11 +1606,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>  		return ret;
>>>  
>>>  	ceph_fscache_invalidate(inode, false);
>>> -	ret = invalidate_inode_pages2_range(inode->i_mapping,
>>> -					    pos >> PAGE_SHIFT,
>>> -					    (pos + count - 1) >> PAGE_SHIFT);
>>> -	if (ret < 0)
>>> -		dout("invalidate_inode_pages2_range returned %d\n", ret);
>>>  
>>>  	while ((len = iov_iter_count(from)) > 0) {
>>>  		size_t left;
>>> @@ -1938,6 +1933,20 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>  			break;
>>>  		}
>>>  		ceph_clear_error_write(ci);
>>> +
>>> +		/*
>>> +		 * we need to invalidate the page cache here, otherwise the
>>> +		 * cache will include invalid data in direct/sync writes.
>>> +		 */
>>> +		ret = invalidate_inode_pages2_range(
>>> +				inode->i_mapping,
>>> +				pos >> PAGE_SHIFT,
>>> +				(pos + len - 1) >> PAGE_SHIFT);
>>> +		if (ret < 0) {
>>> +			dout("invalidate_inode_pages2_range returned %d\n",
>>> +			     ret);
>>> +			ret = 0;
>> For this, IMO it's not safe. If we just ignore it, the pagecache will
>> still have invalid data.
>>
> That data is not invalid. It's dirty data from a buffered write that
> raced with the DIO/sync write we're handling here. i.e. another write
> came in while we were already processing the DIO write, but after the
> point where we called filemap_write_and_wait.
>
> When two write() calls to the same data are racing like that, the
> outcome is undefined. We can't be certain which one will prevail as the
> kernel could handle them in either order.
Okay, I think you are right.
-- Xiubo
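
Just so we're on the same page, this is roughly the ordering I now
understand the patch to give us (a simplified sketch of my reading,
not the actual fs/ceph/file.c code; the helper name is made up):
invalidate the written range only after the data has gone out to the
OSDs, and treat a failure as a racing buffered write whose ordering
is undefined anyway:

#include <linux/fs.h>
#include <linux/pagemap.h>

/* sketch only: drop pagecache pages covering [pos, pos + len) after
 * the sync/direct write has completed */
static void sync_write_invalidate(struct inode *inode, loff_t pos, size_t len)
{
	int ret;

	ret = invalidate_inode_pages2_range(inode->i_mapping,
					    pos >> PAGE_SHIFT,
					    (pos + len - 1) >> PAGE_SHIFT);
	if (ret < 0)	/* e.g. -EBUSY: a buffered write raced with us */
		pr_debug("invalidate_inode_pages2_range returned %d\n", ret);
}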
>
> The good news with Ceph/RADOS is that you shouldn't end up with a torn
> write, unless the write happens to span multiple objects. Not much we
> can do about that though.
>
>> I think what 'ceph_direct_read_write()' does is more correct: it
>> makes sure all the dirty pages are written back from the pagecache by
>> calling 'invalidate_inode_pages2_range()' without blocking, and later
>> does the blocking invalidation with 'truncate_inode_pages_range()' if
>> some pages could not be unmapped by 'invalidate_inode_pages2_range()'
>> (i.e. it returned EBUSY).
>>
> I'm not convinced this is any better, and it's attempting to impose a
> deterministic outcome on a situation that is non-deterministic by
> nature.
>
>> That way we can always be sure that the pagecache has no invalid data
>> after the write finishes. I think the reason it uses the truncate
>> helper here is that it's safe, and there shouldn't be any buffered
>> write happening during DIO?
>>
>> But from my understanding 'ceph_direct_read_write()' is still buggy.
>> What if a page fault happens just after 'truncate_inode_pages_range()'?
>> Can that happen? Should we leave it to user space to guarantee this
>> with file locks?
>>
>> Thoughts?
> Again, we can't really predict what the outcome of two racing writes to
> the same area will do, so I don't see that there is a problem.
>
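
(For reference, the two-step pattern in 'ceph_direct_read_write()' I
was referring to above looks roughly like this. This is only a
simplified sketch of my reading of it, not the real code, and the
helper name is invented: try the non-blocking invalidate first, and
fall back to the blocking truncate helper if pages are still busy, so
no stale pages survive the DIO write.)

#include <linux/fs.h>
#include <linux/pagemap.h>

/* sketch only: two-step pagecache drop for a DIO write of [pos, pos + len) */
static void dio_write_drop_pagecache(struct inode *inode, loff_t pos, size_t len)
{
	struct address_space *mapping = inode->i_mapping;
	int ret;

	/* non-blocking attempt, works on page indices */
	ret = invalidate_inode_pages2_range(mapping,
					    pos >> PAGE_SHIFT,
					    (pos + len - 1) >> PAGE_SHIFT);
	if (ret == -EBUSY)
		/* blocking fallback; note this one takes byte offsets */
		truncate_inode_pages_range(mapping, pos,
					   PAGE_ALIGN(pos + len) - 1);
}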