linux-kernel - Re: [PATCH RFC 1/2] iomap: Add a IOMAP_DIO_MAY_INLINE

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <8263025e-15e6-fbc7-2826-f6f9ac3d9043@huawei.com>
Date: Mon, 11 Mar 2024 15:55:30 +0800
From: Zhihao Cheng <chengzhihao1@...wei.com>
To: Dave Chinner <david@...morbit.com>
CC: <brauner@...nel.org>, <djwong@...nel.org>, <jack@...e.cz>,
	<tytso@....edu>, <linux-xfs@...r.kernel.org>,
	<linux-fsdevel@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<linux-ext4@...r.kernel.org>, <yi.zhang@...wei.com>
Subject: Re: [PATCH RFC 1/2] iomap: Add a IOMAP_DIO_MAY_INLINE_COMP flag

在 2024/3/1 18:02, Zhihao Cheng 写道:
> 在 2024/3/1 8:40, Dave Chinner 写道:
> 
Hi Dave. Friendly ping.
> Hi Dave, thanks for your detailed and nice suggestions, I have a few 
> questions below.
>> On Thu, Feb 29, 2024 at 07:38:48PM +0800, Zhihao Cheng wrote:
>>> It will be more efficient to execute quick endio process(eg. non-sync
>>> overwriting case) under irq process rather than starting a worker to
>>> do it.
>>> Add a flag to control DIO to be finished inline(under irq context), 
>>> which
>>> can be used for non-sync overwriting case.
>>> Besides, skip invalidating pages if DIO is finished inline, which will
>>> keep the same logic with dio_bio_end_aio in non-sync overwriting case.
>>>
>>> Signed-off-by: Zhihao Cheng <chengzhihao1@...wei.com>
>>
>> A nice idea, but I don't think an ext4 specific API flag is the
>> right way to go about enabling this. The iomap dio code knows if
>> the write is pure overwrite already - we have the IOMAP_F_DIRTY flag
>> for that, and we combine this with IOMAP_DIO_WRITE_THROUGH to do the
>> pure overwrite FUA optimisations.
>>
>> That is:
>>
>>         /*
>>                   * Use a FUA write if we need datasync semantics, 
>> this is a pure
>>                   * data IO that doesn't require any metadata updates 
>> (including
>>                   * after IO completion such as unwritten extent 
>> conversion) and
>>                   * the underlying device either supports FUA or 
>> doesn't have
>>                   * a volatile write cache. This allows us to avoid 
>> cache flushes
>>                   * on IO completion. If we can't use writethrough and 
>> need to
>>                   * sync, disable in-task completions as dio 
>> completion will
>>                   * need to call generic_write_sync() which will do a 
>> blocking
>>                   * fsync / cache flush call.
>>                   */
>>                  if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
>>                      (dio->flags & IOMAP_DIO_WRITE_THROUGH) &&
>>                      (bdev_fua(iomap->bdev) || 
>> !bdev_write_cache(iomap->bdev)))
>>                          use_fua = true;
>>
>> Hence if we want to optimise pure overwrites that have no data sync
>> requirements, we already have the detection and triggers in place to
>> do this. We just need to change the way we set up the IO flags to
>> allow write-through (i.e. non-blocking IO completions) to use inline
>> completions.
>>
>> In __iomap_dio_rw():
>>
>> +    /* Always try to complete inline. */
>> +    dio->flags |= IOMAP_DIO_INLINE_COMP;
>>     if (iov_iter_rw(iter) == READ) {
>> -               /* reads can always complete inline */
>> -               dio->flags |= IOMAP_DIO_INLINE_COMP;
>> ....
>>
>>     } else {
>> +        /* Always try write-through semantics. If we can't
>> +         * use writethough, it will be disabled along with
>> +         * IOMAP_DIO_INLINE_COMP before dio completion is run
>> +         * so it can be deferred to a task completion context
>> +         * appropriately.
>> +         */
>> +               dio->flags |= IOMAP_DIO_WRITE | IOMAP_DIO_WRITE_THROUGH;
> 
> There is a behavior change here, if we set IOMAP_DIO_WRITE_THROUGH 
> unconditionally, non-datasync IO will be set with REQ_FUA, which means 
> that device will flush writecache for each IO, will it affect the 
> performance in non-sync dio case?
>>         iomi.flags |= IOMAP_WRITE;
>> -               dio->flags |= IOMAP_DIO_WRITE;
>> .....
>>         /* for data sync or sync, we need sync completion processing */
>>                  if (iocb_is_dsync(iocb)) {
>>                          dio->flags |= IOMAP_DIO_NEED_SYNC;
>>
>> -                      /*
>> -                       * For datasync only writes, we optimistically 
>> try using
>> -                       * WRITE_THROUGH for this IO. This flag 
>> requires either
>> -                       * FUA writes through the device's write cache, 
>> or a
>> -                       * normal write to a device without a volatile 
>> write
>> -                       * cache. For the former, Any non-FUA write 
>> that occurs
>> -                       * will clear this flag, hence we know before 
>> completion
>> -                       * whether a cache flush is necessary.
>> -                       */
>> -                       if (!(iocb->ki_flags & IOCB_SYNC))
>> -                               dio->flags |= IOMAP_DIO_WRITE_THROUGH;
>> +            * For sync writes we know we are going to need
>> +            * blocking completion processing, so turn off
>> +            * writethrough now.
>> +            */
>>             if (iocb->ki_flags & IOCB_SYNC) {
>>                 dio->flags &= ~(IOMAP_DIO_WRITE_THROUGH |
>>                         IOMAP_DIO_INLINE_COMP);
>>             }
>>                  }
>>
> 
> [...]
>>
>> However, this does mean that any spinlock taken in the ->end_io()
>> callbacks now needs to be irq safe. e.g. in xfs_dio_write_end_io()
>> the spinlock protection around inode size updates will need to use
>> an irq safe locking, as will the locking in the DIO submission path
>> that it serialises against in xfs_file_write_checks(). That probably
>> is best implemented as a separate spinlock.
>>
>> There will also be other filesystems that need to set IOMAP_F_DIRTY
>> unconditionally (e.g. zonefs) because they always take blocking
>> locks in their ->end_io callbacks and so must always run in task
>> context...
> Should we add a new flag(eg. IOMAP_F_ENDIO_IRQ ?) to indicate that the 
> endio cannot be done under irq? Because I think IOMAP_F_DIRTY means that 
> the metadata needs to be written, if we add a new semantics(endio must 
> be done in defered work) for this flag, the code will looks a little 
> complicated.
> 
> 
> .