lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <88ce2dea-1ddf-7cb1-7f8f-6964b7d17ce5@linux.alibaba.com>
Date:   Wed, 20 Sep 2023 15:36:53 +0800
From:   Gao Xiang <hsiangkao@...ux.alibaba.com>
To:     Dave Chinner <david@...morbit.com>
Cc:     Jan Kara <jack@...e.cz>, linux-ext4@...r.kernel.org,
        Theodore Ts'o <tytso@....edu>,
        Matthew Bobrowski <mbobrowski@...browski.org>,
        Christoph Hellwig <hch@....de>,
        Joseph Qi <joseph.qi@...ux.alibaba.com>,
        "Darrick J. Wong" <djwong@...nel.org>
Subject: Re: [bug report] ext4 misses final i_size meta sync under O_DIRECT |
 O_SYNC semantics after iomap DIO conversion



On 2023/9/20 15:29, Dave Chinner wrote:
> On Tue, Sep 19, 2023 at 09:47:34PM +0800, Gao Xiang wrote:
>>
>> (sorry... add Darrick here...)
>>
>> Hi Jan,
>>
>> On 2023/9/19 20:05, Jan Kara wrote:
>>> Hello!
>>>
>>> On Tue 19-09-23 14:00:04, Gao Xiang wrote:
>>>> Our consumer reports a behavior change between pre-iomap and iomap
>>>> direct io conversion:
>>>>
>>>> If the system crashes after an appending write to a file open with
>>>> O_DIRECT | O_SYNC flag set, file i_size won't be updated even if
>>>> O_SYNC was marked before.
>>>>
>>>> It can be reproduced by a test program in the attachment with
>>>> gcc -o repro repro.c && ./repro testfile && echo c > /proc/sysrq-trigger
>>>>
>>>> After some analysis, we found that before iomap direct I/O conversion,
>>>> the timing was roughly (taking Linux 3.10 codebase as an example):
>>>>
>>>> 	..
>>>> 	- ext4_file_dio_write
>>>> 	  - __generic_file_aio_write
>>>> 	      ..
>>>> 	    - ext4_direct_IO  # generic_file_direct_write
>>>> 	      - ext4_ext_direct_IO
>>>> 	        - ext4_ind_direct_IO  # final_size > inode->i_size
>>>> 	          - ..
>>>> 	          - ret = blockdev_direct_IO()
>>>> 	          - i_size_write(inode, end) # orphan && ret > 0 &&
>>>> 	                                   # end > inode->i_size
>>>> 	          - ext4_mark_inode_dirty()
>>>> 	          - ...
>>>> 	  - generic_write_sync  # handling O_SYNC
>>>>
>>>> So the dirty inode meta will be committed into journal immediately
>>>> if O_SYNC is set.  However, After commit 569342dc2485 ("ext4: move
>>>> inode extension/truncate code out from ->iomap_end() callback"),
>>>> the new behavior seems as below:
>>>>
>>>> 	..
>>>> 	- ext4_dio_write_iter
>>>> 	  - ext4_dio_write_checks  # extend = 1
>>>> 	  - iomap_dio_rw
>>>> 	      - __iomap_dio_rw
>>>> 	      - iomap_dio_complete
>>>> 	        - generic_write_sync
>>>> 	  - ext4_handle_inode_extension  # extend = 1
>>>>
>>>> So that i_size will be recorded only after generic_write_sync() is
>>>> called.  So O_SYNC won't flush the update i_size to the disk.
>>>
>>> Indeed, that looks like a bug. Thanks for report!
>>
>> Thanks for the confirmation!
>>
>>>
>>>> On the other side, after a quick look of XFS side, it will record
>>>> i_size changes in xfs_dio_write_end_io() so it seems that it doesn't
>>>> have this problem.
>>>
>>> Yes, I'm a bit hazy on the details but I think we've decided to call
>>> ext4_handle_inode_extension() directly from ext4_dio_write_iter() because
>>> from ext4_dio_write_end_io() it was difficult to test in a race-free way
>>> whether extending i_size (and i_disksize) is needed or not (we don't
>>> necessarily hold i_rwsem there). I'll think how we could fix the problem
>>> you've reported.
> 
> Given that ext4 can run extent conversion in IO completion, it can
> run file extension in IO completion. Yes, that might require
> additional synchronisation of file size updates to co-ordinate
> submission and completion size checks. XFS just uses a spinlock for
> this....
> 
>> Yes, another concern is O_DSYNC, I'm quite not sure if the behavior
>> is changed too.
> 
> For O_DSYNC, the file size change needs to be covered by the
> call to generic_write_sync() as well. O_DSYNC should be thought of
> as being essentially the same as O_SYNC except for minor details.
> 
>> I had a rough feeling that currently iomap DIO behaviors on these are
>> too strict and might not fit in each specific fs detailed
>> implementation, tho.
> 
> 
> In what way?  iomap implements exactly the data integrity semantics
> that are required for O_DSYNC and O_SYNC writes, and it requires
> filesystem end_io method to finalize all metadata modifications
> needed for data integrity purposes.
> 
> Keep in mind that iomap is designed around the requirements async IO
> (AIO and io_uring) place on individual IOs: there is no waiting
> context to "finish" the IO before userspace is signalled that it is
> complete. Hence everything related to data integrity needs to be
> done by the filesystem in ->end_io before the iomap completion runs
> generic_write_sync() and signals IO completion....

Yes, I understand iomap is well-suited in this model.  Anyway, it's
somewhat out of scope on my side.

Thanks,
Gao Xiang

> 
> Cheers,
> 
> Dave.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ