Date:   Tue, 17 Sep 2019 14:30:15 +0530
From:   Ritesh Harjani <riteshh@...ux.ibm.com>
To:     Matthew Bobrowski <mbobrowski@...browski.org>,
        Christoph Hellwig <hch@...radead.org>, tytso@....edu,
        jack@...e.cz, adilger.kernel@...ger.ca
Cc:     linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        david@...morbit.com, darrick.wong@...cle.com
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap
 infrastructure

Hello,

On 9/17/19 4:07 AM, Matthew Bobrowski wrote:
> On Mon, Sep 16, 2019 at 05:12:48AM -0700, Christoph Hellwig wrote:
>> On Thu, Sep 12, 2019 at 09:04:46PM +1000, Matthew Bobrowski wrote:
>>> @@ -213,12 +214,16 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>>>   	struct inode *inode = file_inode(iocb->ki_filp);
>>>   	ssize_t ret;
>>>   
>>> +	if (unlikely(IS_IMMUTABLE(inode)))
>>> +		return -EPERM;
>>> +
>>>   	ret = generic_write_checks(iocb, from);
>>>   	if (ret <= 0)
>>>   		return ret;
>>>   
>>> -	if (unlikely(IS_IMMUTABLE(inode)))
>>> -		return -EPERM;
>>> +	ret = file_modified(iocb->ki_filp);
>>> +	if (ret)
>>> +		return 0;
>>>   
>>>   	/*
>>>   	 * If we have encountered a bitmap-format file, the size limit
>>
>> Independent of the error return issue you probably want to split
>> modifying ext4_write_checks into a separate preparation patch.
> 
> Providing that there's no objections to introducing a possible performance
> change with this separate preparation patch (overhead of calling
> file_remove_privs/file_update_time twice), then I have no issues in doing so.
> 
>>> +/*
>>> + * For a write that extends the inode size, ext4_dio_write_iter() will
>>> + * wait for the write to complete. Consequently, operations performed
>>> + * within this function are still covered by the inode_lock(). On
>>> + * success, this function returns 0.
>>> + */
>>> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
>>> +				 unsigned int flags)
>>> +{
>>> +	int ret;
>>> +	loff_t offset = iocb->ki_pos;
>>> +	struct inode *inode = file_inode(iocb->ki_filp);
>>> +
>>> +	if (error) {
>>> +		ret = ext4_handle_failed_inode_extension(inode, offset + size);
>>> +		return ret ? ret : error;
>>> +	}
>>
>> Just a personal opinion, but I find the use of the ternary operator
>> here a little weird.
>>
>> A plain old:
>>
>> 	ret = ext4_handle_failed_inode_extension(inode, offset + size);
>> 	if (ret)
>> 		return ret;
>> 	return error;
>>
>> flow much easier.
> 
> Agree, much cleaner.
> 
>>> +	if (!inode_trylock(inode)) {
>>> +		if (iocb->ki_flags & IOCB_NOWAIT)
>>> +			return -EAGAIN;
>>> +		inode_lock(inode);
>>> +	}
>>> +
>>> +	if (!ext4_dio_checks(inode)) {
>>> +		inode_unlock(inode);
>>> +		/*
>>> +		 * Fallback to buffered IO if the operation on the
>>> +		 * inode is not supported by direct IO.
>>> +		 */
>>> +		return ext4_buffered_write_iter(iocb, from);
>>
>> I think you want to lift the locking into the caller of this function
>> so that you don't have to unlock and relock for the buffered write
>> fallback.
> 
> I don't exactly know what you really mean by "lift the locking into the caller
> of this function". I'm interpreting that as moving the inode_unlock()
> operation into ext4_buffered_write_iter(), but I can't see how that would be
> any different from doing it directly here? Wouldn't this also run the risk of
> the locks becoming unbalanced as we'd need to add checks around whether the
> resource is being contended? Maybe I'm misunderstanding something here...
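
As I read Christoph's suggestion, the caller would take and drop the
inode lock, and both the direct IO path and the buffered fallback would
run with it already held. Below is only a toy userspace sketch of that
shape, not kernel code. Every name in it (toy_lock, toy_write_iter, and
so on) is made up for illustration, and the IOCB_NOWAIT handling is left
out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model, NOT kernel code; every name here is hypothetical. */
struct toy_lock {
	bool held;
	int acquisitions;	/* how many times the lock was taken */
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
	l->acquisitions++;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Both helpers expect the caller to already hold the lock. */
static int toy_dio_write_locked(struct toy_lock *l)
{
	assert(l->held);
	return 1;		/* stands in for "direct IO path taken" */
}

static int toy_buffered_write_locked(struct toy_lock *l)
{
	assert(l->held);
	return 2;		/* stands in for "buffered fallback taken" */
}

/* The caller owns lock and unlock, so the fallback needs no unlock/relock. */
static int toy_write_iter(struct toy_lock *l, bool dio_supported)
{
	int ret;

	toy_lock_acquire(l);
	if (dio_supported)
		ret = toy_dio_write_locked(l);
	else
		ret = toy_buffered_write_locked(l);
	toy_lock_release(l);
	return ret;
}
```

With this shape the lock is taken exactly once per write whichever path
is chosen, and the lock/unlock pair stays balanced in one place.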
> 
>>> +	if (offset + count > i_size_read(inode) ||
>>> +	    offset + count > EXT4_I(inode)->i_disksize) {
>>> +		ext4_update_i_disksize(inode, inode->i_size);
>>> +		extend = true;
>>
>> Doesn't the ext4_update_i_disksize need to be under an open journal
>> handle?
> 
> After all, it is a metadata update, which should go through an open journal
> handle.

Hmm, there does seem to be a race here, but I am not sure it is only
due to not updating i_disksize under an open journal handle.


Consider a delayed-allocation buffered write to a file. In that case we
first update only inode->i_size; i_disksize is updated later, at
writeback time (i.e. during block allocation). If ext4_dio_write_iter()
is then called while offset + len > i_disksize, we call
ext4_update_i_disksize().

Now suppose writeback fails for some reason and the system crashes
during the DIO write, after the blocks are allocated. Then on reboot we
may have an inconsistent inode, since we did not add the inode to the
orphan list before we updated i_disksize, and journal replay may not
succeed.
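
To make the sequence concrete, here is a toy userspace model of it. This
is only a sketch, not kernel code: struct toy_inode and the toy_*
helpers are hypothetical, and toy_dio_write_prepare() only mirrors the
size check quoted from the patch, assuming ext4_update_i_disksize() only
ever increases i_disksize.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model, NOT kernel code; every name here is hypothetical. */
struct toy_inode {
	long long i_size;	/* in-memory size, moved by buffered writes */
	long long i_disksize;	/* size that is recorded on disk */
	bool on_orphan_list;	/* never set below, which is the concern */
};

/* Delalloc buffered write: only i_size moves; i_disksize waits for writeback. */
static void toy_buffered_write(struct toy_inode *inode, long long pos,
			       long long len)
{
	if (pos + len > inode->i_size)
		inode->i_size = pos + len;
}

/* Mirrors the check in the patch; returns true if treated as extending. */
static bool toy_dio_write_prepare(struct toy_inode *inode, long long pos,
				  long long len)
{
	if (pos + len > inode->i_size || pos + len > inode->i_disksize) {
		/* stands in for ext4_update_i_disksize(inode, inode->i_size) */
		if (inode->i_size > inode->i_disksize)
			inode->i_disksize = inode->i_size;
		return true;
	}
	return false;
}

/* Runs the sequence described above and returns the resulting i_disksize. */
static long long toy_demo(void)
{
	struct toy_inode inode = { 0, 0, false };
	bool extend;

	/* 8 KiB buffered write, not yet written back. */
	toy_buffered_write(&inode, 0, 8192);
	assert(inode.i_size == 8192 && inode.i_disksize == 0);

	/* 4 KiB DIO write that lies entirely inside i_size. */
	extend = toy_dio_write_prepare(&inode, 0, 4096);
	assert(extend);
	assert(!inode.on_orphan_list);

	return inode.i_disksize;
}
```

In this model the DIO write stays within i_size, yet it is treated as an
extending write and i_disksize jumps to the full i_size while the
buffered data is still not written back and the inode is not on the
orphan list.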

1. Can the above actually happen? I am still not able to fully work out
    the race/inconsistency.
2. Can you please help explain in which other cases it is necessary to
    call ext4_update_i_disksize() in the DIO write path?
3. When will i_disksize be out of sync with i_size during DIO writes?


-ritesh
