Message-ID: <822cca41-f700-3cba-e2c8-d1fbe5a934b1@redhat.com>
Date:   Fri, 8 Apr 2022 03:24:09 +0800
From:   Xiubo Li <xiubli@...hat.com>
To:     Jeff Layton <jlayton@...nel.org>,
        Luís Henriques <lhenriques@...e.de>,
        Ilya Dryomov <idryomov@...il.com>
Cc:     ceph-devel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4] ceph: invalidate pages when doing direct/sync writes


On 4/8/22 3:16 AM, Jeff Layton wrote:
> On Fri, 2022-04-08 at 03:03 +0800, Xiubo Li wrote:
>> On 4/7/22 11:15 PM, Luís Henriques wrote:
>>> When doing a direct/sync write, we need to invalidate the page cache in
>>> the range being written to.  If we don't do this, the cache will include
>>> invalid data as we just did a write that avoided the page cache.
>>>
>>> Signed-off-by: Luís Henriques <lhenriques@...e.de>
>>> ---
>>>    fs/ceph/file.c | 19 ++++++++++++++-----
>>>    1 file changed, 14 insertions(+), 5 deletions(-)
>>>
>>> Changes since v3:
>>> - Dropped initial call to invalidate_inode_pages2_range()
>>> - Added extra comment to document invalidation
>>>
>>> Changes since v2:
>>> - Invalidation needs to be done after a write
>>>
>>> Changes since v1:
>>> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
>>> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>>>
>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index 5072570c2203..97f764b2fbdd 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -1606,11 +1606,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>    		return ret;
>>>    
>>>    	ceph_fscache_invalidate(inode, false);
>>> -	ret = invalidate_inode_pages2_range(inode->i_mapping,
>>> -					    pos >> PAGE_SHIFT,
>>> -					    (pos + count - 1) >> PAGE_SHIFT);
>>> -	if (ret < 0)
>>> -		dout("invalidate_inode_pages2_range returned %d\n", ret);
>>>    
>>>    	while ((len = iov_iter_count(from)) > 0) {
>>>    		size_t left;
>>> @@ -1938,6 +1933,20 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>    			break;
>>>    		}
>>>    		ceph_clear_error_write(ci);
>>> +
>>> +		/*
>>> +		 * we need to invalidate the page cache here, otherwise the
>>> +		 * cache will include invalid data in direct/sync writes.
>>> +		 */
>>> +		ret = invalidate_inode_pages2_range(
>> IMO we'd better use truncate_inode_pages_range() after the write. The
>> above means it may write the dirty pagecache back, which would overwrite
>> and corrupt the disk data we just wrote.
>>
> I disagree. We call filemap_write_and_wait_range at the start of this,
> so any data that was dirty when we called write() will be written back
> before the sync write.
>
> If we truncate the range, then we'll potentially lose writes that came
> in after write was issued but before truncate_inode_pages_range. I think
> we'd rather let what we just wrote be clobbered in this situation than
> lose a write altogether.
>
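For reference, a condensed sketch of that ordering. This is not the real
ceph_sync_write(): do_osd_write() is a made-up placeholder for the synchronous
OSD I/O, the invalidation is shown once over the whole range rather than per
chunk, and error handling is trimmed.

static ssize_t sync_write_sketch(struct kiocb *iocb, struct iov_iter *from,
                                 loff_t pos, size_t count)
{
        struct inode *inode = file_inode(iocb->ki_filp);
        ssize_t ret;

        /*
         * 1) Flush anything already dirty in the target range up front,
         *    so data that was dirty before this syscall reaches the OSDs
         *    ahead of the sync write.
         */
        ret = filemap_write_and_wait_range(inode->i_mapping, pos,
                                           pos + count - 1);
        if (ret < 0)
                return ret;

        /* 2) Do the write that bypasses the page cache. */
        ret = do_osd_write(iocb, from, pos, count);     /* placeholder */
        if (ret < 0)
                return ret;

        /*
         * 3) Invalidate (rather than truncate) the written range: clean
         *    pages are simply dropped, while pages dirtied by a racing
         *    buffered or mmap write get written back first instead of
         *    being discarded. At worst they clobber what was just
         *    written, but a competing write is never silently lost.
         */
        ret = invalidate_inode_pages2_range(inode->i_mapping,
                                            pos >> PAGE_SHIFT,
                                            (pos + count - 1) >> PAGE_SHIFT);
        if (ret < 0)
                dout("invalidate_inode_pages2_range returned %zd\n", ret);

        return count;
}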
> All of this is somewhat academic though. If you're mixing buffered and
> direct writes like this without some sort of locking, then you're just
> asking for trouble. The aim here is "sane behavior to the best of our
> ability", but we can't expect it to always be sane when people do insane
> things. ;)

Just like in the case Luís hit: before the new data is written, the pages get
mapped into the page cache while reading the source buffer in
copy_from_user(). So once the write is done, the page cache is caching the
stale contents.
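
For illustration, here is a minimal user-space sketch of that kind of mix (it
also matches the O_DIRECT-plus-mmap point further down). The path is made up,
error handling is trimmed, and this is not the actual reproducer.

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* Hypothetical path on a cephfs mount. */
        const char *path = "/mnt/cephfs/testfile";
        long psize = sysconf(_SC_PAGESIZE);
        int fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0644);
        char *buf;

        /* O_DIRECT needs an aligned buffer and aligned size/offset. */
        posix_memalign((void **)&buf, psize, psize);
        memset(buf, 'A', psize);
        pwrite(fd, buf, psize, 0);              /* initial contents: 'A's */

        /*
         * Fault the page into the page cache through a shared mapping,
         * the same thing copy_from_user() ends up doing when the source
         * of the write is a mapping of the file itself.
         */
        char *map = mmap(NULL, psize, PROT_READ, MAP_SHARED, fd, 0);
        printf("before DIO write: %c\n", map[0]);       /* 'A', now cached */

        /* The O_DIRECT write goes around the page cache... */
        memset(buf, 'B', psize);
        pwrite(fd, buf, psize, 0);

        /*
         * ...so without the invalidation this patch adds, the cached page
         * can still show the old 'A' here instead of 'B'.
         */
        printf("after DIO write:  %c\n", map[0]);

        munmap(map, psize);
        free(buf);
        close(fd);
        return 0;
}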

-- Xiubo

>> Though it seems impossible that these pagecache pages would be marked
>> dirty, isn't this call misleading?
>>
> Not impossible at all. You can open a file O_DIRECT and then mmap the fd
> for PROT_WRITE (or just open the file a second time and do it).
>
> We definitely recommend against mixing buffered and direct I/O, but
> nothing really prevents someone from doing it. If the user is properly
> using file locking, then there's really no reason it shouldn't work.
>
>>> +				inode->i_mapping,
>>> +				pos >> PAGE_SHIFT,
>>> +				(pos + len - 1) >> PAGE_SHIFT);
>>> +		if (ret < 0) {
>>> +			dout("invalidate_inode_pages2_range returned %d\n",
>>> +			     ret);
>>> +			ret = 0;
>>> +		}
>>>    		pos += len;
>>>    		written += len;
>>>    		dout("sync_write written %d\n", written);
>>>
