[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20191203115445.6F802AE059@d06av26.portsmouth.uk.ibm.com>
Date: Tue, 3 Dec 2019 17:24:44 +0530
From: Ritesh Harjani <riteshh@...ux.ibm.com>
To: Jan Kara <jack@...e.cz>
Cc: tytso@....edu, linux-ext4@...r.kernel.org,
linux-fsdevel@...r.kernel.org, mbobrowski@...browski.org
Subject: Re: [RFCv3 4/4] ext4: Move to shared iolock even without
dioread_nolock mount opt
Hello Jan,
I have compiled something based on our discussion.
Could you please share your thoughts on below.
On 11/29/19 10:48 PM, Jan Kara wrote:
> Hello Ritesh!
>
> On Tue 26-11-19 16:21:15, Ritesh Harjani wrote:
>> On 11/20/19 8:02 PM, Jan Kara wrote:
>>> On Wed 20-11-19 10:30:24, Ritesh Harjani wrote:
>>>> We were using shared locking only in case of dioread_nolock
>>>> mount option in case of DIO overwrites. This mount condition
>>>> is not needed anymore with current code, since:-
>>>>
>>>> 1. No race between buffered writes & DIO overwrites.
>>>> Since buffIO writes takes exclusive locks & DIO overwrites
>>>> will take share locking. Also DIO path will make sure
>>>> to flush and wait for any dirty page cache data.
>>>>
>>>> 2. No race between buffered reads & DIO overwrites, since there
>>>> is no block allocation that is possible with DIO overwrites.
>>>> So no stale data exposure should happen. Same is the case
>>>> between DIO reads & DIO overwrites.
>>>>
>>>> 3. Also other paths like truncate is protected,
>>>> since we wait there for any DIO in flight to be over.
>>>>
>>>> 4. In case of buffIO writes followed by DIO reads:
>>>> Since here also we take exclusive locks in ext4_write_begin/end().
>>>> There is no risk of exposing any stale data in this case.
>>>> Since after ext4_write_end, iomap_dio_rw() will wait to flush &
>>>> wait for any dirty page cache data.
>>>>
>>>> Signed-off-by: Ritesh Harjani <riteshh@...ux.ibm.com>
>>>
>>> There's one more case to consider here as I mentioned in [1]. There can be
>>
>> Yes, I should have mentioned about this in cover letter and about my
>> thoughts on that.
>> I was of the opinion that since the race is already existing
>> and it may not be caused due to this patch, so we should handle that in
>> incremental fashion and as a separate patch series after this one.
>> Let me know your thoughts on above.
>
> Yes, I'm fine with that.
Sure thanks, will do that.
>
>> Also, I wanted to have some more discussions on this race before
>> making the changes.
>> But nevertheless, it's the right time to discuss those changes here.
>>
>>> mmap write instantiating dirty page and then someone starting writeback
>>> against that page while DIO read is running still theoretically leading to
>>> stale data exposure. Now this patch does not have influence on that race
>>> but:
>>
>> Yes, agreed.
>>
>>>
>>> 1) We need to close the race mentioned above. Maybe we could do that by
>>> proactively allocating unwritten blocks for a page being faulted when there
>>> is direct IO running against the file - the one who fills holes through
>>> mmap write while direct IO is running on the file deserves to suffer the
>>> performance penalty...
>>
>> I was giving this a thought. So even if we try to penalize mmap
>> write as you mentioned above, what I am not sure about it, is that, how can
>> we reliably detect that the DIO is in progress?
>>
>> Say even if we try to check for atomic_read(&inode->i_dio_count) in mmap
>> ext4_page_mkwrite path, it cannot be reliable unless there is some sort of a
>> lock protection, no?
>> Because after the check the DIO can still snoop in, right?
>
> Yes, doing this reliably will need some code tweaking. Also thinking about
> this in detail, doing a reliable check in ext4_page_mkwrite() is
> somewhat difficult so it will be probably less error-prone to deal with the
> race in the writeback path.
hmm. But if we don't do in ext4_page_mkwrite, then I am afraid on
how to handle nodelalloc scenario. Where we will directly go and
allocate block via ext4_get_block() in ext4_page_mkwrite(),
as explained below.
I guess we may need some tweaking at both places.
>
> My preferred way of dealing with this would be to move inode_dio_begin()
> call in iomap_dio_rw() a bit earlier before page cache invalidation and add
> there smp_mb_after_atomic() (so that e.g. nrpages checks cannot get
> reordered before the increment). Then the check on i_dio_count in
> ext4_writepages() will be reliable if we do it after gathering and locking
> pages for writeback (i.e., in mpage_map_and_submit_extent()) - either we
> see i_dio_count elevated and use the safe (but slower) writeback using
> unwritten extents, or we see don't and then we are sure DIO will not start
> until writeback of the pages we have locked has finished because of
> filemap_write_and_wait() call in iomap_dio_rw().
>
>
Thanks for explaining this in detail. I guess I understand this part now
Earlier my understanding towards mapping->nrpages was not complete.
AFAIU, with your above suggestion the race won't happen for delalloc
cases. But what if it is a nodelalloc mount option?
Say with above changes i.e. after tweaking iomap_dio_rw() code as you
mentioned above. Below race could still happen, right?
iomap_dio_rw()
filemap_write_and_wait_range()
inode_dio_begin()
smp_mb__after_atomic()
invalidate_inode_pages2_range()
ext4_page_mkwrite()
block_page_mkwrite()
lock_page()
ext4_get_block()
ext4_map_blocks()
//this will return IOMAP_MAPPED entry
submit_bio()
// this goes and reads the block
// with stale data allocated,
// by ext4_page_mkwrite()
Now, I am assuming that ext4_get_block() via ext4_page_mkwrite() path
may try to create the block for hole then and there itself.
And if submit_bio() from DIO path is serviced late i.e. after
ext4_get_block() has already allocated block there, then this may expose
stale data. Thoughts?
So to avoid both such races in delalloc & in nodelalloc path,
we should add the checks at both ext4_writepages() & also at
ext4_page_mkwrite().
For ext4_page_mkwrite(), why don't we just change the "get_block"
function pointer which is passed to block_page_mkwrite()
as below. This should solve our race since
ext4_dio_check_get_block() will be only called with lock_page()
held. And also with inode_dio_begin() now moved up before
invalidate_inode_pages2_range(), we could be sure
about DIO is currently running or not in ext4_page_mkwrite() path.
Does this looks correct to you?
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 381813205f99..74c33d03592c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -806,6 +806,19 @@ int ext4_get_block_unwritten(struct inode *inode,
sector_t iblock,
EXT4_GET_BLOCKS_IO_CREATE_EXT);
}
+int ext4_dio_check_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh, int create)
+{
+ get_block_t *get_block;
+
+ if (!atomic_read(&inode->i_dio_count))
+ get_block = ext4_get_block;
+ else
+ get_block = ext4_get_block_unwritten;
+
+ return get_block(inode, iblock, bh, create);
+}
+
/* Maximum number of blocks we map for direct IO at once. */
#define DIO_MAX_BLOCKS 4096
@@ -2332,7 +2345,8 @@ static int mpage_map_one_extent(handle_t *handle,
struct mpage_da_data *mpd)
struct inode *inode = mpd->inode;
struct ext4_map_blocks *map = &mpd->map;
int get_blocks_flags;
- int err, dioread_nolock;
+ int err;
+ bool dio_in_progress = atomic_read(&inode->i_dio_count);
trace_ext4_da_write_pages_extent(inode, map);
/*
@@ -2353,8 +2367,14 @@ static int mpage_map_one_extent(handle_t *handle,
struct mpage_da_data *mpd)
get_blocks_flags = EXT4_GET_BLOCKS_CREATE |
EXT4_GET_BLOCKS_METADATA_NOFAIL |
EXT4_GET_BLOCKS_IO_SUBMIT;
- dioread_nolock = ext4_should_dioread_nolock(inode);
- if (dioread_nolock)
+
+ /*
+ * There could be race between DIO read & ext4_page_mkwrite
+ * where in delalloc case, we may go and try to allocate the
+ * block here but if DIO read is in progress then it may expose
+ * stale data, hence use unwritten blocks for allocation
+ * when DIO is in progress.
+ */
+ if (dio_in_progress)
get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
if (map->m_flags & (1 << BH_Delay))
get_blocks_flags |= EXT4_GET_BLOCKS_DELALLOC_RESERVE;
@@ -2362,7 +2382,7 @@ static int mpage_map_one_extent(handle_t *handle,
struct mpage_da_data *mpd)
err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
if (err < 0)
return err;
- if (dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN)) {
+ if (dio_in_progress && (map->m_flags & EXT4_MAP_UNWRITTEN)) {
if (!mpd->io_submit.io_end->handle &&
ext4_handle_valid(handle)) {
mpd->io_submit.io_end->handle = handle->h_rsv_handle;
@@ -5906,10 +5926,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
}
unlock_page(page);
/* OK, we need to fill the hole... */
- if (ext4_should_dioread_nolock(inode))
- get_block = ext4_get_block_unwritten;
- else
- get_block = ext4_get_block;
+ get_block = ext4_dio_check_get_block;
retry_alloc:
handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
ext4_writepage_trans_blocks(inode));
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 2f88d64c2a4d..09d0601e5ecb 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -465,6 +465,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
if (ret)
goto out_free_dio;
+ inode_dio_begin(inode);
+ smp_mb__after_atomic();
/*
* Try to invalidate cache pages for the range we're direct
* writing. If this invalidation fails, tough, the write will
@@ -484,8 +486,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
goto out_free_dio;
}
- inode_dio_begin(inode);
-
blk_start_plug(&plug);
do {
ret = iomap_apply(inode, pos, count, flags, ops, dio,
-ritesh
Powered by blists - more mailing lists