Message-ID: <bf1758b0-5a0f-5c9f-ecdf-e3031fdd9678@redhat.com>
Date: Sun, 10 Apr 2022 09:35:48 +0800
From: Xiubo Li <xiubli@...hat.com>
To: Jeff Layton <jlayton@...nel.org>,
Luís Henriques <lhenriques@...e.de>,
Ilya Dryomov <idryomov@...il.com>
Cc: ceph-devel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4] ceph: invalidate pages when doing direct/sync writes
On 4/8/22 8:04 PM, Jeff Layton wrote:
> On Fri, 2022-04-08 at 10:47 +0800, Xiubo Li wrote:
>> On 4/7/22 11:15 PM, Luís Henriques wrote:
>>> When doing a direct/sync write, we need to invalidate the page cache in
>>> the range being written to. If we don't do this, the cache will include
>>> invalid data as we just did a write that avoided the page cache.
>>>
>>> Signed-off-by: Luís Henriques <lhenriques@...e.de>
>>> ---
>>> fs/ceph/file.c | 19 ++++++++++++++-----
>>> 1 file changed, 14 insertions(+), 5 deletions(-)
>>>
>>> Changes since v3:
>>> - Dropped initial call to invalidate_inode_pages2_range()
>>> - Added extra comment to document invalidation
>>>
>>> Changes since v2:
>>> - Invalidation needs to be done after a write
>>>
>>> Changes since v1:
>>> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
>>> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>>>
>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index 5072570c2203..97f764b2fbdd 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -1606,11 +1606,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>  		return ret;
>>>  
>>>  	ceph_fscache_invalidate(inode, false);
>>> -	ret = invalidate_inode_pages2_range(inode->i_mapping,
>>> -					    pos >> PAGE_SHIFT,
>>> -					    (pos + count - 1) >> PAGE_SHIFT);
>>> -	if (ret < 0)
>>> -		dout("invalidate_inode_pages2_range returned %d\n", ret);
>>>  
>>>  	while ((len = iov_iter_count(from)) > 0) {
>>>  		size_t left;
>>> @@ -1938,6 +1933,20 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>  			break;
>>>  		}
>>>  		ceph_clear_error_write(ci);
>>> +
>>> +		/*
>>> +		 * we need to invalidate the page cache here, otherwise the
>>> +		 * cache will include invalid data in direct/sync writes.
>>> +		 */
>>> +		ret = invalidate_inode_pages2_range(
>>> +				inode->i_mapping,
>>> +				pos >> PAGE_SHIFT,
>>> +				(pos + len - 1) >> PAGE_SHIFT);
>>> +		if (ret < 0) {
>>> +			dout("invalidate_inode_pages2_range returned %d\n",
>>> +			     ret);
>>> +			ret = 0;
>> For this, IMO it's not safe. If we just ignore it, the pagecache will
>> still have invalid data.
>>
> That data is not invalid. It's dirty data from a buffered write that
> raced with the DIO/sync write we're handling here. i.e. another write
> came in while we were already processing the DIO write, but after the
> point where we called filemap_write_and_wait.
>
> When two write() calls to the same data are racing like that, the
> outcome is undefined. We can't be certain which one will prevail as the
> kernel could handle them in either order.
Okay, I think you are right.
-- Xiubo
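
Just so we're on the same page, this is roughly the ordering I now
understand the patch to give us (a simplified sketch of my reading,
not the actual fs/ceph/file.c code; the helper name is made up):
invalidate the written range only after the data has gone out to the
OSDs, and treat a failure as a racing buffered write whose ordering
is undefined anyway:

#include <linux/fs.h>
#include <linux/pagemap.h>

/* sketch only: drop pagecache pages covering [pos, pos + len) after
 * the sync/direct write has completed */
static void sync_write_invalidate(struct inode *inode, loff_t pos, size_t len)
{
	int ret;

	ret = invalidate_inode_pages2_range(inode->i_mapping,
					    pos >> PAGE_SHIFT,
					    (pos + len - 1) >> PAGE_SHIFT);
	if (ret < 0)	/* e.g. -EBUSY: a buffered write raced with us */
		pr_debug("invalidate_inode_pages2_range returned %d\n", ret);
}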
>
> The good news with Ceph/RADOS is that you shouldn't end up with a torn
> write, unless the write happens to span multiple objects. Not much we
> can do about that though.
>
>> I think what 'ceph_direct_read_write()' does is more correct: it
>> makes sure all the dirty pages are written back from the pagecache by
>> calling 'invalidate_inode_pages2_range()' without blocking, and later
>> does the blocking invalidation with 'truncate_inode_pages_range()' if
>> some pages could not be unmapped by 'invalidate_inode_pages2_range()'
>> (i.e. it returned EBUSY).
>>
> I'm not convinced this is any better, and it's attempting to impose a
> deterministic outcome on a situation that is non-deterministic by
> nature.
>
>> That way we can always be sure that the pagecache has no invalid data
>> after the write finishes. I think the reason it uses the truncate
>> helper here is that it's safe, and there shouldn't be any buffered
>> write happening during DIO?
>>
>> But from my understanding 'ceph_direct_read_write()' is still buggy.
>> What if a page fault happens just after 'truncate_inode_pages_range()'?
>> Can that happen? Should we leave it to user space to guarantee this
>> with file locks?
>>
>> Thoughts?
> Again, we can't really predict what the outcome of two racing writes to
> the same area will do, so I don't see that there is a problem.
>
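
(For reference, the two-step pattern in 'ceph_direct_read_write()' I
was referring to above looks roughly like this. This is only a
simplified sketch of my reading of it, not the real code, and the
helper name is invented: try the non-blocking invalidate first, and
fall back to the blocking truncate helper if pages are still busy, so
no stale pages survive the DIO write.)

#include <linux/fs.h>
#include <linux/pagemap.h>

/* sketch only: two-step pagecache drop for a DIO write of [pos, pos + len) */
static void dio_write_drop_pagecache(struct inode *inode, loff_t pos, size_t len)
{
	struct address_space *mapping = inode->i_mapping;
	int ret;

	/* non-blocking attempt, works on page indices */
	ret = invalidate_inode_pages2_range(mapping,
					    pos >> PAGE_SHIFT,
					    (pos + len - 1) >> PAGE_SHIFT);
	if (ret == -EBUSY)
		/* blocking fallback; note this one takes byte offsets */
		truncate_inode_pages_range(mapping, pos,
					   PAGE_ALIGN(pos + len) - 1);
}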