linux-kernel - Re: [PATCH] xfs: Remove i_rwsem lock in buffered read

Message-ID: <3d657be2-3cca-49b5-b967-5f5740d86c6e@163.com>
Date: Fri, 24 Jan 2025 15:57:43 +0800
From: Chi Zhiling <chizhiling@....com>
To: Dave Chinner <david@...morbit.com>, Christoph Hellwig <hch@...radead.org>
Cc: Brian Foster <bfoster@...hat.com>, "Darrick J. Wong" <djwong@...nel.org>,
 Amir Goldstein <amir73il@...il.com>, cem@...nel.org,
 linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
 Chi Zhiling <chizhiling@...inos.cn>, John Garry <john.g.garry@...cle.com>
Subject: Re: [PATCH] xfs: Remove i_rwsem lock in buffered read

On 2025/1/16 05:41, Dave Chinner wrote:
> On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote:
>> On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote:
>>> Sorry if this is out of left field as I haven't followed the discussion
>>> closely, but I presumed one of the reasons Darrick and Christoph raised
>>> the idea of using the folio batch thing I'm playing around with on zero
>>> range for buffered writes would be to acquire and lock all targeted
>>> folios up front. If so, would that help with what you're trying to
>>> achieve here? (If not, nothing to see here, move along.. ;).
>>
>> I mostly thought about acquiring, as locking doesn't really have much
>> batching effects.  That being said, now that you got the idea in my mind
>> here's my early morning brainfart on it:
>>
>> Let's ignore DIRECT I/O for the first step.  In that case lookup /
>> allocation and locking all folios for write before copying data will
>> remove the need for i_rwsem in the read and write path.  In a way that
>> sounds perfect, and given that btrfs already does that (although in a
>> very convoluted way) we know it's possible.
> 
> Yes, this seems like a sane, general approach to allowing concurrent
> buffered writes (and reads).
> 
>> But direct I/O throws a big monkey wrench here as already mentioned by
>> others.  Now one interesting thing some file systems have done is
>> to serialize buffered against direct I/O, either by waiting for one
>> to finish, or by simply forcing buffered I/O when direct I/O would
>> conflict.
> 
> Right. We really don't want to downgrade to buffered IO if we can
> help it, though.
> 
>> It's easy to detect outstanding direct I/O using i_dio_count
>> so buffered I/O could wait for that, and downgrading to buffered I/O
>> (potentially using the new uncached mode from Jens) if there are any
>> pages on the mapping after the invalidation also sounds pretty doable.
> 
> It's much harder to sanely serialise DIO against buffered writes
> this way, because i_dio_count only forms a submission barrier in
> conjunction with the i_rwsem being held exclusively. e.g. ongoing
> DIO would result in the buffered write being indefinitely delayed.
> 
> I think the model and method that bcachefs uses is probably the best
> way to move forward - the "two-state exclusive shared" lock which it
> uses to do buffered vs direct exclusion is a simple, easy way to
> handle this problem. The same-state shared locking fast path is a
> single atomic cmpxchg operation, so it has negligible extra overhead
> compared to using a rwsem in the shared DIO fast path.
> 
> The lock also has non-owner semantics, so DIO can take it during
> submission and then drop it during IO completion. This solves the
> problem we currently use the i_rwsem and
> inode_dio_{start,end/wait}() to solve (i.e. create a DIO submission
> barrier and waiting for all existing DIO to drain).
> 
> IOWs, a two-state shared lock provides the mechanism to allow DIO
> to be done without holding the i_rwsem at all, as well as being able
> to elide two atomic operations per DIO to track in-flight DIOs.
> 
> We'd get this whilst maintaining buffered/DIO coherency without
> adding any new overhead to the DIO path, and allow concurrent
> buffered reads and writes that have their atomicity defined by the
> batched folio locking strategy that Brian is working on...
> 
> This only leaves DIO coherency issues with mmap() based IO as an
> issue, but that's a problem for a different day...

While trying to implement these features, I think we could take the
exclusive lock for RWF_APPEND writes and the shared lock for all other
writes.

The reason is that extending writes can also run concurrently as long
as the regions being written do not overlap, so there is no need to
determine in advance whether a write extends the file.

An exclusive lock is still needed for append writes because we don't
want EOF to change while the append write is in progress.

The code would look like this:
+       if (iocb->ki_flags & IOCB_APPEND)
+               iolock = XFS_IOLOCK_EXCL;
+       else
+               iolock = XFS_IOLOCK_SHARED;


If we instead used exclusive locks for all extending writes, we would
need to check whether a write extends the file before acquiring the
lock, but the result of that check can become stale if no lock is held.

So for an extending write we might need to follow this process:

1. Get read lock.
2. Check if it is an extending write.
3. Release read lock.
4. Get write lock.
5. Do the write operation.


Thanks,
Chi Zhiling