linux-kernel - Re: [PATCH] xfs: Remove i_rwsem lock in buffered read

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <fcab0612-2211-40be-a837-e8969d795617@163.com>
Date: Sat, 18 Jan 2025 00:12:34 +0800
From: Chi Zhiling <chizhiling@....com>
To: Dave Chinner <david@...morbit.com>, Christoph Hellwig <hch@...radead.org>
Cc: Brian Foster <bfoster@...hat.com>, "Darrick J. Wong" <djwong@...nel.org>,
 Amir Goldstein <amir73il@...il.com>, cem@...nel.org,
 linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
 Chi Zhiling <chizhiling@...inos.cn>, John Garry <john.g.garry@...cle.com>
Subject: Re: [PATCH] xfs: Remove i_rwsem lock in buffered read

On 2025/1/16 05:41, Dave Chinner wrote:
> On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote:
>> On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote:
>>> Sorry if this is out of left field as I haven't followed the discussion
>>> closely, but I presumed one of the reasons Darrick and Christoph raised
>>> the idea of using the folio batch thing I'm playing around with on zero
>>> range for buffered writes would be to acquire and lock all targeted
>>> folios up front. If so, would that help with what you're trying to
>>> achieve here? (If not, nothing to see here, move along.. ;).
>>
>> I mostly thought about acquiring, as locking doesn't really have much
>> batching effects.  That being said, no that you got the idea in my mind
>> here's my early morning brainfart on it:
>>
>> Let's ignore DIRECT I/O for the first step.  In that case lookup /
>> allocation and locking all folios for write before copying data will
>> remove the need for i_rwsem in the read and write path.  In a way that
>> sounds perfect, and given that btrfs already does that (although in a
>> very convoluted way) we know it's possible.
> 
> Yes, this seems like a sane, general approach to allowing concurrent
> buffered writes (and reads).
> 
>> But direct I/O throws a big monkey wrench here as already mentioned by
>> others.  Now one interesting thing some file systems have done is
>> to serialize buffered against direct I/O, either by waiting for one
>> to finish, or by simply forcing buffered I/O when direct I/O would
>> conflict.
> 
> Right. We really don't want to downgrade to buffered IO if we can
> help it, though.
> 
>> It's easy to detect outstanding direct I/O using i_dio_count
>> so buffered I/O could wait for that, and downgrading to buffered I/O
>> (potentially using the new uncached mode from Jens) if there are any
>> pages on the mapping after the invalidation also sounds pretty doable.
> 
> It's much harder to sanely serialise DIO against buffered writes
> this way, because i_dio_count only forms a submission barrier in
> conjunction with the i_rwsem being held exclusively. e.g. ongoing
> DIO would result in the buffered write being indefinitely delayed.
> 
> I think the model and method that bcachefs uses is probably the best
> way to move forward - the "two-state exclusive shared" lock which it
> uses to do buffered vs direct exclusion is a simple, easy way to
> handle this problem. The same-state shared locking fast path is a
> single atomic cmpxchg operation, so it has neglible extra overhead
> compared to using a rwsem in the shared DIO fast path.
> 
> The lock also has non-owner semantics, so DIO can take it during
> submission and then drop it during IO completion. This solves the
> problem we currently use the i_rwsem and
> inode_dio_{start,end/wait}() to solve (i.e. create a DIO submission
> barrier and waiting for all existing DIO to drain).
> 
> IOWs, a two-state shared lock provides the mechanism to allow DIO
> to be done without holding the i_rwsem at all, as well as being able
> to elide two atomic operations per DIO to track in-flight DIOs.
> 
> We'd get this whilst maintaining buffered/DIO coherency without
> adding any new overhead to the DIO path, and allow concurrent
> buffered reads and writes that have their atomicity defined by the
> batched folio locking strategy that Brian is working on...
> 
> This only leaves DIO coherency issues with mmap() based IO as an
> issue, but that's a problem for a different day...
> 
>> I don't really have time to turn this hand waving into, but maybe we
>> should think if it's worthwhile or if I'm missing something important.
> 
> If people are OK with XFS moving to exclusive buffered or DIO
> submission model, then I can find some time to work on the
> converting the IO path locking to use a two-state shared lock in
> preparation for the batched folio stuff that will allow concurrent
> buffered writes...
> 

I really think it's time for someone to submit the basic patch, and then
we can continue with the discussion and testing. :)

BTW, I support the solution with folio batch and two-state shared lock.


Thanks,
Chi Zhiling