Message-ID: <CAOQ4uxjRi9nagj4JVXMFoz0MXP_2YA=bgvoiDqStiHpFpK+tsQ@mail.gmail.com>
Date: Fri, 17 Jan 2025 14:27:46 +0100
From: Amir Goldstein <amir73il@...il.com>
To: Dave Chinner <david@...morbit.com>
Cc: Christoph Hellwig <hch@...radead.org>, Brian Foster <bfoster@...hat.com>,
"Darrick J. Wong" <djwong@...nel.org>, Chi Zhiling <chizhiling@....com>, cem@...nel.org,
linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
Chi Zhiling <chizhiling@...inos.cn>, John Garry <john.g.garry@...cle.com>
Subject: Re: [PATCH] xfs: Remove i_rwsem lock in buffered read

On Wed, Jan 15, 2025 at 10:41 PM Dave Chinner <david@...morbit.com> wrote:
>
> On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote:
> > > Sorry if this is out of left field as I haven't followed the discussion
> > > closely, but I presumed one of the reasons Darrick and Christoph raised
> > > the idea of using the folio batch thing I'm playing around with on zero
> > > range for buffered writes would be to acquire and lock all targeted
> > > folios up front. If so, would that help with what you're trying to
> > > achieve here? (If not, nothing to see here, move along.. ;).
> >
> > I mostly thought about acquiring, as locking doesn't really have much
> > of a batching effect. That being said, now that you've got the idea
> > in my mind, here's my early morning brainfart on it:
> >
> > Let's ignore DIRECT I/O for the first step. In that case lookup /
> > allocation and locking all folios for write before copying data will
> > remove the need for i_rwsem in the read and write path. In a way that
> > sounds perfect, and given that btrfs already does that (although in a
> > very convoluted way) we know it's possible.
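
To spell out my understanding of the above, the buffered write path
would look roughly like the sketch below. The two batch helpers are
made-up names; only the folio_*() calls are existing API:

    struct folio *batch[NR_WRITE_BATCH];    /* made-up bound */
    int i, n;

    /*
     * Look up or create every folio the write will touch, in
     * ascending index order, and lock them all before copying a
     * single byte.
     */
    n = grab_folios_for_write(mapping, pos, len, batch);  /* invented */
    for (i = 0; i < n; i++)
            folio_lock(batch[i]);

    /* Copy the data in; a racing read sees old or new, never a mix. */
    copy_iter_to_folios(iter, batch, n);                  /* invented */

    for (i = 0; i < n; i++) {
            folio_mark_dirty(batch[i]);
            folio_unlock(batch[i]);
            folio_put(batch[i]);
    }
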
>
> Yes, this seems like a sane, general approach to allowing concurrent
> buffered writes (and reads).
>
> > But direct I/O throws a big monkey wrench here, as already mentioned
> > by others. Now, one interesting thing some file systems have done is
> > to serialize buffered against direct I/O, either by waiting for one
> > to finish, or by simply forcing buffered I/O when direct I/O would
> > conflict.
>
> Right. We really don't want to downgrade to buffered IO if we can
> help it, though.
>
> > It's easy to detect outstanding direct I/O using i_dio_count, so
> > buffered I/O could wait for that; and falling back to buffered I/O
> > (potentially using the new uncached mode from Jens) if there are any
> > pages left on the mapping after the invalidation also sounds pretty
> > doable.
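
For reference, the i_dio_count plumbing already exists; a sketch of
what Christoph describes (note Dave's point below that, without a
submission barrier, the check is racy against new DIO submitters):

    /* DIO already brackets itself with: */
    inode_dio_begin(inode);         /* i_dio_count++ at submission */
    inode_dio_end(inode);           /* i_dio_count-- at completion */

    /* ...so the buffered path could do: */
    if (atomic_read(&inode->i_dio_count))
            inode_dio_wait(inode);  /* sleeps until i_dio_count == 0 */
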
>
> It's much harder to sanely serialise DIO against buffered writes
> this way, because i_dio_count only forms a submission barrier in
> conjunction with the i_rwsem being held exclusively. e.g. ongoing
> DIO would result in the buffered write being indefinitely delayed.

Isn't this already the case today with EX vs. SH iolock?

I guess the answer depends on whether or not i_rwsem starves existing
writers in the face of a steady stream of new readers. If my memory
serves me right, this exact behavior of i_rwsem is what sparked the
original regression report when xfs moved from i_iolock to i_rwsem [1].

[1] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/
>
> I think the model and method that bcachefs uses is probably the best
> way to move forward - the "two-state exclusive shared" lock which it
> uses to do buffered vs. direct exclusion is a simple, easy way to
> handle this problem. The same-state shared locking fast path is a
> single atomic cmpxchg operation, so it has negligible extra overhead
> compared to using a rwsem in the shared DIO fast path.
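
For readers who have not looked at bcachefs, the idea boils down to
something like the toy model below (not the actual bcachefs code, all
names made up): v > 0 means "held shared by v buffered users", v < 0
means "held shared by -v DIO users", and the two classes exclude each
other. Same-state acquisition is the single cmpxchg Dave mentions:

    struct two_state_lock {
            atomic_long_t           v;
            wait_queue_head_t       wait;
    };

    static bool two_state_trylock(struct two_state_lock *l, bool dio)
    {
            long old = atomic_long_read(&l->v);

            do {
                    if (dio ? old > 0 : old < 0)
                            return false;   /* other class holds it */
            } while (!atomic_long_try_cmpxchg(&l->v, &old,
                                              old + (dio ? -1 : 1)));
            return true;
    }

    static void two_state_unlock(struct two_state_lock *l, bool dio)
    {
            /*
             * No owner tracking, so this may run in a different
             * context than the lock - e.g. from DIO completion.
             */
            atomic_long_add(dio ? 1 : -1, &l->v);
            wake_up(&l->wait);
    }

The sleeping slow path (wait on l->wait until the other class drains)
is left out, but that is the whole trick.
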
>
> The lock also has non-owner semantics, so DIO can take it during
> submission and then drop it during IO completion. This solves the
> problem we currently use the i_rwsem and
> inode_dio_{begin,end,wait}() to solve (i.e. create a DIO submission
> barrier and waiting for all existing DIO to drain).
>
> IOWs, a two-state shared lock provides the mechanism to allow DIO
> to be done without holding the i_rwsem at all, as well as eliding
> the two atomic operations per DIO that we currently use to track
> in-flight DIOs.
>
> We'd get this whilst maintaining buffered/DIO coherency without
> adding any new overhead to the DIO path, and allow concurrent
> buffered reads and writes that have their atomicity defined by the
> batched folio locking strategy that Brian is working on...
>
> This only leaves DIO coherency issues with mmap() based IO as an
> issue, but that's a problem for a different day...
>
> > I don't really have time to turn this hand waving into code, but
> > maybe we should think about whether it's worthwhile, or if I'm
> > missing something important.
>
> If people are OK with XFS moving to an exclusive buffered-or-DIO
> submission model, then I can find some time to work on converting
> the IO path locking to use a two-state shared lock in preparation
> for the batched folio stuff that will allow concurrent buffered
> writes...

I won't object to getting the best of all worlds, but I have to say
upfront that this sounds a bit like premature optimization, for a
workload (mixed buffered/DIO writes) that I don't think anybody runs
in practice, so nobody should care how it performs.
Am I wrong?

For all practical purposes, we could maintain a counter in the inode
not of submitted DIOs, but of open file descriptors with O_DIRECT.
The first O_DIRECT open could serialize against in-flight buffered
writes holding a shared iolock, and from then on buffered writes
could take the iolock SH vs. EX depending on folio state and on
i_dio_open_count.
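
A hand-wavy sketch of the buffered write side of that (i_dio_open_count
is a made-up field; XFS_I() and the iolock calls are real):

    struct xfs_inode *ip = XFS_I(inode);
    uint mode;

    /*
     * Buffered writers share the iolock among themselves unless
     * somebody has the file open O_DIRECT.
     */
    if (atomic_read(&ip->i_dio_open_count))
            mode = XFS_IOLOCK_EXCL;         /* serialize vs. DIO */
    else
            mode = XFS_IOLOCK_SHARED;       /* concurrent buffered writes */

    xfs_ilock(ip, mode);
    /* ... do the write, then xfs_iunlock(ip, mode) ... */

with ->open() bumping the counter for O_DIRECT opens (the first one
draining buffered writers that hold the iolock shared) and ->release()
dropping it. Racy as written - the counter would need rechecking under
the lock - but it shows the shape of it.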

I would personally prefer a simple solution that is good enough and
that has a better chance of getting the development, review and
testing resources needed to bring it over the finish line.

Thanks,
Amir.