Message-Id: <20190917090016.266CB520A1@d06av21.portsmouth.uk.ibm.com>
Date: Tue, 17 Sep 2019 14:30:15 +0530
From: Ritesh Harjani <riteshh@...ux.ibm.com>
To: Matthew Bobrowski <mbobrowski@...browski.org>,
Christoph Hellwig <hch@...radead.org>, tytso@....edu,
jack@...e.cz, adilger.kernel@...ger.ca
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
david@...morbit.com, darrick.wong@...cle.com
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap
infrastructure
Hello,
On 9/17/19 4:07 AM, Matthew Bobrowski wrote:
> On Mon, Sep 16, 2019 at 05:12:48AM -0700, Christoph Hellwig wrote:
>> On Thu, Sep 12, 2019 at 09:04:46PM +1000, Matthew Bobrowski wrote:
>>> @@ -213,12 +214,16 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>>> struct inode *inode = file_inode(iocb->ki_filp);
>>> ssize_t ret;
>>>
>>> + if (unlikely(IS_IMMUTABLE(inode)))
>>> + return -EPERM;
>>> +
>>> ret = generic_write_checks(iocb, from);
>>> if (ret <= 0)
>>> return ret;
>>>
>>> - if (unlikely(IS_IMMUTABLE(inode)))
>>> - return -EPERM;
>>> + ret = file_modified(iocb->ki_filp);
>>> + if (ret)
>>> + return 0;
>>>
>>> /*
>>> * If we have encountered a bitmap-format file, the size limit
>>
>> Independent of the error return issue you probably want to split
>> modifying ext4_write_checks into a separate preparation patch.
>
> Providing that there's no objections to introducing a possible performance
> change with this separate preparation patch (overhead of calling
> file_remove_privs/file_update_time twice), then I have no issues in doing so.
>
>>> +/*
>>> + * For a write that extends the inode size, ext4_dio_write_iter() will
>>> + * wait for the write to complete. Consequently, operations performed
>>> + * within this function are still covered by the inode_lock(). On
>>> + * success, this function returns 0.
>>> + */
>>> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
>>> + unsigned int flags)
>>> +{
>>> + int ret;
>>> + loff_t offset = iocb->ki_pos;
>>> + struct inode *inode = file_inode(iocb->ki_filp);
>>> +
>>> + if (error) {
>>> + ret = ext4_handle_failed_inode_extension(inode, offset + size);
>>> + return ret ? ret : error;
>>> + }
>>
>> Just a personal opinion, but I find the use of the ternary operator
>> here a little weird.
>>
>> A plain old:
>>
>> ret = ext4_handle_failed_inode_extension(inode, offset + size);
>> if (ret)
>> return ret;
>> return error;
>>
>> flow much easier.
>
> Agree, much cleaner.
>
>>> + if (!inode_trylock(inode)) {
>>> + if (iocb->ki_flags & IOCB_NOWAIT)
>>> + return -EAGAIN;
>>> + inode_lock(inode);
>>> + }
>>> +
>>> + if (!ext4_dio_checks(inode)) {
>>> + inode_unlock(inode);
>>> + /*
>>> + * Fallback to buffered IO if the operation on the
>>> + * inode is not supported by direct IO.
>>> + */
>>> + return ext4_buffered_write_iter(iocb, from);
>>
>> I think you want to lift the locking into the caller of this function
>> so that you don't have to unlock and relock for the buffered write
>> fallback.
>
> I don't exactly know what you really mean by "lift the locking into the caller
> of this function". I'm interpreting that as moving the inode_unlock()
> operation into ext4_buffered_write_iter(), but I can't see how that would be
> any different from doing it directly here? Wouldn't this also run the risk of
> the locks becoming unbalanced as we'd need to add checks around whether the
> resource is being contended? Maybe I'm misunderstanding something here...
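One possible reading of "lift the locking into the caller", sketched as
a small userspace C model rather than kernel code. Everything here
(struct lock_model_inode, the model_* helpers, MODEL_NOWAIT, the 1/2
return markers) is hypothetical and exists only for illustration. The
assumption is that the top-level write entry point takes the lock once
and that both the buffered and the direct IO helpers run with it
already held, so the fallback needs no unlock/relock pair:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for an inode; this is not the kernel structure. */
struct lock_model_inode {
	bool locked;        /* models the inode_lock() state */
	bool dio_supported; /* models the result of ext4_dio_checks() */
	int lock_count;     /* number of times the lock has been taken */
};

#define MODEL_NOWAIT 0x1 /* models IOCB_NOWAIT */

static bool model_trylock(struct lock_model_inode *inode)
{
	if (inode->locked)
		return false;
	inode->locked = true;
	inode->lock_count++;
	return true;
}

static void model_unlock(struct lock_model_inode *inode)
{
	assert(inode->locked);
	inode->locked = false;
}

/* Both helpers expect the caller to hold the lock and leave it held. */
static int model_buffered_write_locked(struct lock_model_inode *inode)
{
	assert(inode->locked);
	return 1; /* arbitrary marker: buffered path was taken */
}

static int model_dio_write_locked(struct lock_model_inode *inode)
{
	assert(inode->locked);
	return 2; /* arbitrary marker: direct IO path was taken */
}

/* The caller owns the lock: one acquire and one release per call. */
static int model_write_iter(struct lock_model_inode *inode, int flags)
{
	int ret;

	if (!model_trylock(inode)) {
		if (flags & MODEL_NOWAIT)
			return -EAGAIN;
		/* The kernel would sleep in inode_lock() here; not modelled. */
		return -EBUSY;
	}

	if (!inode->dio_supported)
		ret = model_buffered_write_locked(inode); /* no unlock/relock */
	else
		ret = model_dio_write_locked(inode);

	model_unlock(inode);
	return ret;
}
```

With this shape the lock stays balanced because a single function both
acquires and releases it, and the fallback decision is made while the
lock is held. Whether this matches what Christoph had in mind is an
assumption on my part.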
>
>>> + if (offset + count > i_size_read(inode) ||
>>> + offset + count > EXT4_I(inode)->i_disksize) {
>>> + ext4_update_i_disksize(inode, inode->i_size);
>>> + extend = true;
>>
>> Doesn't the ext4_update_i_disksize need to be under an open journal
>> handle?
>
> After all, it is a metadata update, which should go through an open journal
> handle.
Hmm, it looks like there is a race here, but I am not sure it is due
only to updating i_disksize without an open journal handle.

Consider a delayed-allocation buffered write to a file. In that case we
first update only inode->i_size, and i_disksize is updated at writeback
time (i.e. during block allocation). If ext4_dio_write_iter() is then
called, offset + len > i_disksize holds, so we call
ext4_update_i_disksize().

Now suppose writeback fails for some reason and the system crashes
during the DIO write, after the blocks have been allocated. After
reboot we may have an inconsistent inode, since we did not add the
inode to the orphan list before updating i_disksize, and journal replay
may not succeed.
1. Can the above actually happen? I have not yet been able to work out
the race/inconsistency completely.
2. Can you please explain in which other cases it is necessary to call
ext4_update_i_disksize() in the DIO write path?
3. When can i_disksize be out of sync with i_size during DIO writes?
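
To make the scenario above concrete, here is a minimal userspace model
of the two sizes. It is a sketch under stated assumptions, not kernel
code: struct size_model and the model_* functions are hypothetical, and
the only behaviour modelled is what is described in this mail (a
delayed-allocation write moves i_size first, writeback moves i_disksize
later) plus the extension check quoted from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; the field names mirror the ones discussed above. */
struct size_model {
	long long i_size;     /* in-memory file size */
	long long i_disksize; /* size recorded in the on-disk inode */
};

/* Delayed-allocation buffered write: only i_size moves at write time. */
static void model_delalloc_write(struct size_model *inode,
				 long long offset, long long count)
{
	if (offset + count > inode->i_size)
		inode->i_size = offset + count;
}

/* Writeback: blocks get allocated and i_disksize catches up with i_size. */
static void model_writeback(struct size_model *inode)
{
	if (inode->i_size > inode->i_disksize)
		inode->i_disksize = inode->i_size;
}

/* Same condition as the extension check in the quoted patch hunk. */
static bool model_dio_extends(const struct size_model *inode,
			      long long offset, long long count)
{
	return offset + count > inode->i_size ||
	       offset + count > inode->i_disksize;
}
```

In this model, a DIO write that lies entirely within i_size is still
treated as extending if it arrives between model_delalloc_write() and
model_writeback(), because i_disksize has not caught up yet. That is the
window question 3 is asking about.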
-ritesh