linux-kernel - Re: [PATCH] xfs: Remove i_rwsem lock in buffered read

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <90e5f9b4-eb1d-4d63-ba22-4a1f564b2ccf@163.com>
Date: Thu, 9 Jan 2025 20:10:27 +0800
From: Chi Zhiling <chizhiling@....com>
To: Amir Goldstein <amir73il@...il.com>
Cc: John Garry <john.g.garry@...cle.com>, Dave Chinner <david@...morbit.com>,
 djwong@...nel.org, cem@...nel.org, linux-xfs@...r.kernel.org,
 linux-kernel@...r.kernel.org, Chi Zhiling <chizhiling@...inos.cn>
Subject: Re: [PATCH] xfs: Remove i_rwsem lock in buffered read

On 2025/1/9 18:25, Amir Goldstein wrote:
>>> One more thing I should mention.
>>> You do not need to wait for atomic large writes patches to land.
>>> There is nothing stopping you from implementing the suggested
>>> solution based on the xfs code already in master (v6.13-rc1),
>>> which has support for the RWF_ATOMIC flag for writes.
> 
> Only I missed the fact that there is not yet a plan to support
> atomic buffered writes :-/

I think it's necessary to support atomic buffered writes.

> 
>>>
>>> It just means that the API will not be usable for applications that
>>> want to do IO larger than block size, but concurrent read/write
>>                                 ^
>> To be precise, this is the page size, not the block size, right?
>>
> 
> fs block size:
> 
>          if (iocb->ki_flags & IOCB_ATOMIC) {
>                  /*
>                   * Currently only atomic writing of a single FS block is
>                   * supported. It would be possible to atomic write smaller than
>                   * a FS block, but there is no requirement to support this.
>                   * Note that iomap also does not support this yet.
>                   */
>                  if (ocount != ip->i_mount->m_sb.sb_blocksize)
>                          return -EINVAL;
>                  ret = generic_atomic_write_valid(iocb, from);
>                  if (ret)
>                          return ret;
>          }

Uh, okay, maybe I didn't understand it very accurately. I thought
you were talking about buffered IO. I was quite curious as to why
you mentioned block size in the context of buffered IO.

> 
>>> performance of 4K IO could be improved already.
>>
>> Great, which means that IO operations aligned within a single page
>> can be executed concurrently, because the folio lock already
>> provides atomicity guarantees.
>>
>> If the write does not exceed the boundary of a page, we can
>> downgrade the iolock to XFS_IOLOCK_SHARED. It seems to be safe
>> and will not change the current behavior.
>>
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -454,6 +454,11 @@ xfs_file_write_checks(
>>           if (error)
>>                   return error;
>>
>> +       if ( iocb->ki_pos >> PAGE_SHIFT == (iocb->ki_pos + count) >>
>> PAGE_SHIFT) {
>> +               *iolock = XFS_IOLOCK_SHARED;
>> +       }
>> +
>>           /*
>>            * For changing security info in file_remove_privs() we need
>> i_rwsem
>>            * exclusively.
>>
> 
> I think that may be possible, but you should do it in the buffered write
> code as the patch below.
> xfs_file_write_checks() is called from code paths like
> xfs_file_dio_write_unaligned() where you should not demote to shared lock.

Wow, thank you for the reminder. This is the prototype of the patch.
I might need to consider more scenarios and conduct testing before
sending the patch.


> 
> 
>>>
>>> It's possible that all you need to do is:
>>>
>>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>>> index c488ae26b23d0..2542f15496488 100644
>>> --- a/fs/xfs/xfs_file.c
>>> +++ b/fs/xfs/xfs_file.c
>>> @@ -777,9 +777,10 @@ xfs_file_buffered_write(
>>>           ssize_t                 ret;
>>>           bool                    cleared_space = false;
>>>           unsigned int            iolock;
>>> +       bool                    atomic_write = iocb->ki_flags & IOCB_ATOMIC;
>>>
>>>    write_retry:
>>> -       iolock = XFS_IOLOCK_EXCL;
>>> +       iolock = atomic_write ? XFS_IOLOCK_SHARED : XFS_IOLOCK_EXCL;
>>>           ret = xfs_ilock_iocb(iocb, iolock);
>>> --
>>>
>>> xfs_file_write_checks() afterwards already takes care of promoting
>>> XFS_IOLOCK_SHARED to XFS_IOLOCK_EXCL for extending writes.
>>
>> Yeah, for writes that exceed the PAGE boundary, we can also promote
>> the lock to XFS_IOLOCK_EXCL. Otherwise, I am concerned that it may
>> lead to old data being retained in the file.
>>
>> For example, two processes writing four pages of data to the same area.
>>
>> process A           process B
>> --------------------------------
>> write AA--
>> <sleep>
>>                       new write BBBB
>> write --AA
>>
>> The final data is BBAA.
>>
> 
> What is the use case for which you are trying to fix performance?
> Is it a use case with single block IO? if not then it does not help to implement
> a partial solution for single block size IO.

I want to improve the UnixBench score for XFS. UnixBench uses buffered
IO for testing, including both single-threaded and multi-threaded tests.
Additionally, the IO size tested in UnixBench is below 4K, so this will
be very helpful.


Thanks,
Chi Zhiling