Message-ID: <CAOQ4uxizxUg+EXar2GzDWgp+reRZ_0wc+DAaSAGmUZ1VXOjjLw@mail.gmail.com>
Date: Sat, 18 Jan 2025 14:03:41 +0100
From: Amir Goldstein <amir73il@...il.com>
To: Dave Chinner <david@...morbit.com>
Cc: Christoph Hellwig <hch@...radead.org>, Brian Foster <bfoster@...hat.com>, 
	"Darrick J. Wong" <djwong@...nel.org>, Chi Zhiling <chizhiling@....com>, cem@...nel.org, 
	linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Chi Zhiling <chizhiling@...inos.cn>, John Garry <john.g.garry@...cle.com>
Subject: Re: [PATCH] xfs: Remove i_rwsem lock in buffered read

On Fri, Jan 17, 2025 at 11:19 PM Dave Chinner <david@...morbit.com> wrote:
>
> On Fri, Jan 17, 2025 at 02:27:46PM +0100, Amir Goldstein wrote:
> > On Wed, Jan 15, 2025 at 10:41 PM Dave Chinner <david@...morbit.com> wrote:
> > >
> > > On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote:
> > > > On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote:
> > > > > Sorry if this is out of left field as I haven't followed the discussion
> > > > > closely, but I presumed one of the reasons Darrick and Christoph raised
> > > > > the idea of using the folio batch thing I'm playing around with on zero
> > > > > range for buffered writes would be to acquire and lock all targeted
> > > > > folios up front. If so, would that help with what you're trying to
> > > > > achieve here? (If not, nothing to see here, move along.. ;).
> > > >
> > > > I mostly thought about acquiring, as locking doesn't really have much
> > > > of a batching effect.  That being said, now that you've got the idea in
> > > > my mind, here's my early morning brainfart on it:
> > > >
> > > > Let's ignore DIRECT I/O for the first step.  In that case lookup /
> > > > allocation and locking all folios for write before copying data will
> > > > remove the need for i_rwsem in the read and write path.  In a way that
> > > > sounds perfect, and given that btrfs already does that (although in a
> > > > very convoluted way) we know it's possible.
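
For reference, locking all target folios up front might look roughly like
the sketch below. This is a hypothetical helper, not the btrfs code; it
assumes the __filemap_get_folio()/folio_batch API of recent kernels, and
it omits the error unwinding and partial-batch fallback a real
implementation would need.

/*
 * Hypothetical sketch: grab and lock every folio covering a buffered
 * write up front, so the copy itself needs no i_rwsem protection.
 * Error unwinding (unlocking/putting folios already in the batch)
 * is left to the caller for brevity.
 */
static int lock_write_range(struct address_space *mapping,
			    pgoff_t first, pgoff_t last,
			    struct folio_batch *fbatch)
{
	pgoff_t index = first;

	folio_batch_init(fbatch);
	while (index <= last) {
		struct folio *folio;

		/* Create the folio if missing; return it locked. */
		folio = __filemap_get_folio(mapping, index,
					    FGP_LOCK | FGP_CREAT,
					    mapping_gfp_mask(mapping));
		if (IS_ERR(folio))
			return PTR_ERR(folio);
		if (!folio_batch_add(fbatch, folio))
			return -E2BIG;	/* batch full: flush and retry */
		index = folio_next_index(folio);
	}
	return 0;
}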
> > >
> > > Yes, this seems like a sane, general approach to allowing concurrent
> > > buffered writes (and reads).
> > >
> > > > But direct I/O throws a big monkey wrench here as already mentioned by
> > > > others.  Now one interesting thing some file systems have done is
> > > > to serialize buffered against direct I/O, either by waiting for one
> > > > to finish, or by simply forcing buffered I/O when direct I/O would
> > > > conflict.
> > >
> > > Right. We really don't want to downgrade to buffered IO if we can
> > > help it, though.
> > >
> > > > It's easy to detect outstanding direct I/O using i_dio_count
> > > > so buffered I/O could wait for that, and downgrading to buffered I/O
> > > > (potentially using the new uncached mode from Jens) if there are any
> > > > pages on the mapping after the invalidation also sounds pretty doable.
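
The checks Christoph names are cheap to express. A hedged sketch follows;
the policy helper is hypothetical (both fields exist in mainline), and as
Dave notes below, testing i_dio_count like this is racy without further
serialisation:

/*
 * Hypothetical policy helper: would a direct write conflict with
 * ongoing I/O, suggesting a wait or a fallback to the buffered path?
 */
static bool dio_would_conflict(struct inode *inode)
{
	/* Outstanding async direct I/O in flight? */
	if (atomic_read(&inode->i_dio_count))
		return true;
	/* Folios survived the invalidation attempt? */
	return inode->i_mapping->nrpages != 0;
}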
> > >
> > > It's much harder to sanely serialise DIO against buffered writes
> > > this way, because i_dio_count only forms a submission barrier in
> > > conjunction with the i_rwsem being held exclusively. e.g. ongoing
> > > DIO would result in the buffered write being indefinitely delayed.
> >
> > Isn't this already the case today with EX vs. SH iolock?
>
> No. We do not hold the i_rwsem across async DIO read or write, so
> we can have DIO in flight whilst a buffered write grabs and holds
> the i_rwsem exclusive. This is the problem that i_dio_count solves
> for truncate, etc. i.e. the only way to actually ensure there is no
> DIO in flight is to grab the i_rwsem exclusive and then call
> inode_dio_wait() to ensure all async DIO in flight completes before
> continuing.
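
In code, this is the existing kernel idiom for a full DIO barrier, shown
here as a minimal sketch of the truncate-style pattern Dave describes:

/*
 * Exclusive i_rwsem stops new submissions; inode_dio_wait() drains
 * the async DIO already accounted in i_dio_count.
 */
static void drain_all_dio(struct inode *inode)
{
	inode_lock(inode);	/* i_rwsem exclusive */
	inode_dio_wait(inode);	/* wait for i_dio_count to reach zero */
	/* ... truncate or similar runs safely here ... */
	inode_unlock(inode);
}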
>
> > I guess the answer depends on whether or not i_rwsem
> > starves existing writers in the face of ongoing new readers.
>
> It does not. If the lock is held for read and there is a pending
> write, new readers are queued behind the pending write waiters
> (see rwsem_down_read_slowpath(), which is triggered if there are
> pending waiters on an attempt to read lock).
>
> > > > I don't really have time to turn this hand waving into code, but maybe
> > > > we should think about whether it's worthwhile or if I'm missing
> > > > something important.
> > >
> > > If people are OK with XFS moving to exclusive buffered or DIO
> > > submission model, then I can find some time to work on
> > > converting the IO path locking to use a two-state shared lock in
> > > preparation for the batched folio stuff that will allow concurrent
> > > buffered writes...
> >
> > I won't object to getting the best of all worlds, but I have to say,
> > upfront, this sounds a bit like premature optimization, and for
> > a workload (mixed buffered/dio write)
>
> Say what? The dio/buffered exclusion change provides a different
> data coherency and correctness model that is required to then
> optimise the workload you care about - mixed buffered read/write.
>
> This change doesn't optimise either DIO or buffered IO, nor is a
> shared-exclusive BIO/DIO lock going to improve performance of
> such mixed IO workloads in any way.  IOWs, it's pretty hard to call
> a locking model change like this "optimisation", let alone call it
> "premature".
>

I was questioning the need to optimize mixed buffered read/write
*while the file is also open for O_DIRECT*.
I thought that the solution would be easier if we could take Christoph's
"Let's ignore DIRECT I/O for the first step" and make it testable.
Then, if the inode is open O_DIRECT, there is no change to locking.

> > that I don't think anybody
> > does in practice and nobody should care how it performs.
> > Am I wrong?
>
> Yes and no. Mixed buffered/direct IO workloads are more common than
> you think. e.g. many backup programs use direct IO and so inherently
> mix DIO with buffered IO for the majority of files the backup
> program touches.

It still feels like we can forgo the performance of mixed buffered
read/write *while the backup program reads the database file*, should
a backup program ever really read a large database file...

> However, all we care about in this case is data
> coherency, not performance, and this change should improve data
> coherency between DIO and buffered IO...
>

I had not understood that you were proposing an improvement over the
existing state of affairs. Yeah, I certainly have no objections to fixing
things that are wrong.

> > For all practical purposes, we could maintain a counter in the inode,
> > not for submitted DIO, but for files opened O_DIRECT.
> >
> > The first open O_DIRECT could serialize with in-flight
> > buffered writes holding a shared iolock and then buffered writes
> > could take SH vs. EX iolock depending on folio state and on
> > i_dio_open_count.
>
> I don't see how this can be made to work w.r.t. sane data coherency
> behaviour. e.g. how does this model serialise a new DIO write that
> is submitted after a share-locked buffered write has just started
> and passed all the "i_dio_count == 0" checks that enable it to use
> shared locking? i.e. we now have potentially overlapping buffered
> writes and DIO writes being done concurrently because the buffered
> write may not have instantiated folios in the page cache yet....
>

A shared-locked buffered write can only start when there is no
O_DIRECT file open on the inode.

The first O_DIRECT open (the one that increments i_dio_open_count
from zero) needs to take the exclusive iolock to wait for all
in-flight buffered writes, then release the iolock.

After that, any DIO submitted via an O_DIRECT fd is safe against
in-flight shared-locked buffered writes.
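
A sketch of that scheme, where i_dio_open_count is the hypothetical
per-inode counter proposed above; lock names follow XFS convention, and
the decrement on the release side is omitted:

/*
 * On the first ->open() with O_DIRECT: drain shared-locked buffered
 * writers once, so subsequent DIO can rely on the counter.
 */
static void dio_open_barrier(struct xfs_inode *ip)
{
	if (atomic_inc_return(&ip->i_dio_open_count) == 1) {
		xfs_ilock(ip, XFS_IOLOCK_EXCL);	/* waits out SH holders */
		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
	}
}

/*
 * In the buffered write path: shared iolock only while no O_DIRECT
 * fd exists.  A buffered writer racing with the first open is caught
 * by the exclusive lock above, which must wait for any shared holder
 * that slipped in.
 */
static unsigned int buffered_write_lock_mode(struct xfs_inode *ip)
{
	return atomic_read(&ip->i_dio_open_count) ?
			XFS_IOLOCK_EXCL : XFS_IOLOCK_SHARED;
}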

> > I would personally prefer a simple solution that is good enough
> > and has a higher likelihood of attracting the development, review
> > and testing resources that are needed to bring it to the finish line.
>
> Have you looked at how simple the bcachefs buffered/dio exclusion
> implementation is? The lock mechanism itself is about 50 lines of
> code, and there are only 4 or 5 places where the lock is actually
> needed. It doesn't get much simpler than that.
>
> And, quite frankly, the fact the bcachefs solution also covers AIO
> DIO in flight (which i_rwsem based locking does not!) means it is a
> more robust solution than trying to rely on racy i_dio_count hacks
> and folio residency in the page cache...
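
For readers who haven't looked: the core of such a two-state lock can
indeed be tiny. A condensed, illustrative sketch follows; the names and
details are mine, not bcachefs's exact API. Any number of holders of one
type may hold it at once, with mutual exclusion between the two types:

struct two_state_shared_lock {
	atomic_long_t		v;	/* >0: buffered holders, <0: DIO */
	wait_queue_head_t	wait;
};

static void ts_lock(struct two_state_shared_lock *lock, bool dio)
{
	long dir = dio ? -1 : 1;

	/*
	 * Proceed once the lock is free or already held in our state,
	 * and we win the race to bump the holder count.
	 */
	wait_event(lock->wait, ({
		long v = atomic_long_read(&lock->v);

		v * dir >= 0 &&
			atomic_long_cmpxchg(&lock->v, v, v + dir) == v;
	}));
}

static void ts_unlock(struct two_state_shared_lock *lock, bool dio)
{
	/* Undo the increment our state applied in ts_lock(). */
	atomic_long_add(dio ? 1 : -1, &lock->v);
	wake_up_all(&lock->wait);
}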

I will be happy to see this improvement.

Thanks,
Amir.
