Message-ID: <aRmHRk7FGD4nCT0s@dread.disaster.area>
Date: Sun, 16 Nov 2025 19:11:50 +1100
From: Dave Chinner <david@...morbit.com>
To: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
Cc: Ritesh Harjani <ritesh.list@...il.com>, Christoph Hellwig <hch@....de>,
	Christian Brauner <brauner@...nel.org>, djwong@...nel.org,
	john.g.garry@...cle.com, tytso@....edu, willy@...radead.org,
	dchinner@...hat.com, linux-xfs@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-ext4@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, linux-mm@...ck.org, jack@...e.cz,
	nilay@...ux.ibm.com, martin.petersen@...cle.com,
	rostedt@...dmis.org, axboe@...nel.dk, linux-block@...r.kernel.org,
	linux-trace-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/8] xfs: single block atomic writes for buffered IO

On Fri, Nov 14, 2025 at 02:50:25PM +0530, Ojaswin Mujoo wrote:
> On Thu, Nov 13, 2025 at 09:32:11PM +1100, Dave Chinner wrote:
> > On Thu, Nov 13, 2025 at 11:12:49AM +0530, Ritesh Harjani wrote:
> > > Christoph Hellwig <hch@....de> writes:
> > > 
> > > > On Thu, Nov 13, 2025 at 08:56:56AM +1100, Dave Chinner wrote:
> > > >> On Wed, Nov 12, 2025 at 04:36:03PM +0530, Ojaswin Mujoo wrote:
> > > >> > This patch adds support to perform single block RWF_ATOMIC writes for
> > > >> > iomap xfs buffered IO. This builds upon the initial RFC shared by John
> > > >> > Garry last year [1]. Most of the details are present in the respective 
> > > >> > commit messages but I'd mention some of the design points below:
> > > >> 
> > > >> What is the use case for this functionality? i.e. what is the
> > > >> reason for adding all this complexity?
> > > >
> > > > Seconded.  The atomic code has a lot of complexity, and further mixing
> > > > it with buffered I/O makes this even worse.  We'd need a really important
> > > > use case to even consider it.
> > > 
> > > I agree this should have been in the cover letter itself. 
> > > 
> > > I believe the reason for adding this functionality was also discussed
> > > at LSFMM...
> > > 
> > > E.g. https://lwn.net/Articles/974578/ goes in depth and talks about
> > > Postgres folks looking for this, since PostgreSQL uses buffered
> > > I/O for its database writes.
> > 
> > Pointing at a discussion about how "this application has some ideas
> > on how it can maybe use it someday in the future" isn't a
> > particularly good justification. This still sounds more like a
> > research project than something a production system needs right now.
> 
> Hi Dave, Christoph,
> 
> There were some discussions around use cases for buffered atomic writes
> in the previous LSFMM covered by LWN here [1]. AFAIK, there are 
> databases that recommend/prefer buffered IO over direct IO. As mentioned
> in the article, MongoDB is one that supports both but recommends
> buffered IO. Further, many DBs support both direct IO and buffered IO
> well and it may not be fair to force them to stick to direct IO to get
> the benefits of atomic writes.
> 
> [1] https://lwn.net/Articles/1016015/

You are quoting a discussion about atomic writes that was
held without any XFS developers present. Given how XFS has driven
atomic write functionality so far, XFS developers might have some
..... opinions about how buffered atomic writes should work in
XFS...

Indeed, go back to the 2024 buffered atomic IO LSFMM discussion,
where there were XFS developers present. That's the discussion that
Ritesh referenced, so you should be aware of it.

https://lwn.net/Articles/974578/

Back then I talked about how atomic writes made no sense as
-writeback IO- given the massive window for anything else to modify
the data in the page cache. There is no guarantee that what the
application wrote in the syscall is what gets written to disk with
writeback IO. i.e. anything that can access the page cache can
"tear" application data that is staged as "atomic data" for later
writeback.

IOWs, the concept of atomic writes for writeback IO makes almost no
sense at all - dirty data at rest in the page cache is not protected
against 3rd party access or modification. The "atomic data IO"
semantics can only exist in the submitting IO context where
exclusive access to the user data can be guaranteed.

IMO, the only semantics that make sense for buffered atomic
writes through the page cache are those of write-through IO. The
"atomic" context is tied directly to the user data provided at IO
submission, and so the submitted IO must guarantee that exactly
that data is written to disk in that IO.

IOWs, we have to guarantee exclusive access between the data copy-in
and the pages being marked for writeback. The mapping needs to be
marked as using stable pages to prevent anyone else changing the
cached data whilst it has an atomic IO pending on it.

That means folios covering atomic IO ranges do not sit in the page
cache in a dirty state - they *must* immediately transition to the
writeback state before the folio is unlocked so that *nothing else
can modify them* before the physical REQ_ATOMIC IO is submitted and
completed.

If we've got the folios marked as writeback, we can pack them
immediately into a bio and submit the IO (e.g. via the iomap DIO
code). There is no need to involve the buffered IO writeback path
here; we've already got the folios at hand and in the right state
for IO. Once the IO is done, we end writeback on them and they
remain clean in the page cache for anyone else to access and
modify...
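Roughly, something like this (sketch only - atomic_write_get_folio()
and atomic_write_submit_bio() are placeholder names, not existing
kernel functions, and all the error handling and iomap plumbing is
elided):

	static int buffered_atomic_write_through(struct kiocb *iocb,
						 struct iov_iter *from)
	{
		/* Placeholder: get the locked folio covering the write. */
		struct folio *folio = atomic_write_get_folio(iocb);

		/* Copy the user data in while we hold the folio lock... */
		copy_folio_from_iter(folio, 0, iov_iter_count(from), from);

		/*
		 * ...and transition straight to writeback before unlocking
		 * so nothing else can modify the data before the
		 * REQ_ATOMIC bio completes. The folio never sits dirty
		 * "at rest" in the page cache.
		 */
		folio_start_writeback(folio);
		folio_unlock(folio);

		/* Placeholder: pack the folio into a REQ_ATOMIC bio and
		 * submit it synchronously, e.g. via the iomap DIO code. */
		atomic_write_submit_bio(iocb, folio);

		/* IO done: folio is clean, anyone can access it again. */
		folio_end_writeback(folio);
		return 0;
	}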

This gives us the same physical IO semantics for buffered and direct
atomic IO, and it allows the same software fallbacks for larger IO
to be used as well.

> > Why didn't you use the existing COW buffered write IO path to
> > implement atomic semantics for buffered writes? The XFS
> > functionality is already all there, and it doesn't require any
> > changes to the page cache or iomap to support...
> 
> This patch set focuses on HW accelerated single block atomic writes with
> buffered IO, to get some early reviews on the core design.

What hardware acceleration? Hardware atomic writes do not make
IO faster; they only change IO failure semantics in certain corner
cases. Making buffered writeback IO use REQ_ATOMIC does not change
the failure semantics of buffered writeback from the point of view
of an application; the application still has no idea just how much
data or which files lost data when the system crashes.

Further, writeback does not retain application write ordering, so
the application also has no control over the order that structured
data is updated on physical media.  Hence if the application needs
specific IO ordering for crash recovery (e.g. to avoid using a WAL)
it cannot use background buffered writeback for atomic writes
because that does not guarantee ordering.

What happens when you do two atomic buffered writes to the same file
range? The second one hits the page cache, so now the crash recovery
semantics are no longer "old or new", they're "some random older
version or new". If the application rewrites a range frequently
enough, on-disk updates could skip dozens of versions between "old"
and "new", whilst other ranges of the file move one version at a time.
The application has -zero control- of this behaviour because it is
background writeback that determines when something gets written to
disk, not the application.

IOWs, the only way to guarantee single-version "old or new" atomic
buffered overwrites for any given write would be to force flushing
of the data post-write() completion.  That means either O_DSYNC,
fdatasync() or sync_file_range(). And this turns the atomic writes
into -write-through- IO, not writeback IO...
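i.e. from the application side, something like this (sketch only -
it assumes the buffered RWF_ATOMIC support proposed in this series;
current kernels only accept RWF_ATOMIC with O_DIRECT):

	#define _GNU_SOURCE
	#include <sys/types.h>
	#include <sys/uio.h>

	#ifndef RWF_ATOMIC
	#define RWF_ATOMIC	0x00000040	/* uapi value as of 6.11 */
	#endif

	/* Single-block old-or-new buffered overwrite, write-through. */
	static ssize_t atomic_overwrite_block(int fd, void *buf,
					      size_t len, off_t off)
	{
		struct iovec iov = { .iov_base = buf, .iov_len = len };

		/*
		 * RWF_DSYNC makes this write-through IO: the data is on
		 * stable storage before the syscall returns, so crash
		 * recovery sees exactly "old" or "new", never some
		 * skipped-ahead version.
		 */
		return pwritev2(fd, &iov, 1, off, RWF_ATOMIC | RWF_DSYNC);
	}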

> Just like we did for direct IO atomic writes, the software fallback with
> COW and multi block support can be added eventually.

If the reason for this functionality is "maybe someone
can use it in future", then you're not implementing this
functionality to optimise an existing workload. It's a research
project looking for a user.

Work with the database engineers to build a buffered atomic write
based engine that implements atomic writes with RWF_DSYNC.
Make it work, and optimise it to be competitive with existing
database engines, and then show how much faster it is using
RWF_ATOMIC buffered writes.

Alternatively - write an algorithm that assumes the filesystem is
using COW for overwrites, and optimise the data integrity algorithm
based on this knowledge. e.g. use always-cow mode on XFS, or just
optimise for normal bcachefs or btrfs buffered writes. Use O_DSYNC
when completion-to-submission ordering is required. Now you have
an application algorithm that is optimised for old-or-new behaviour,
and that can then be accelerated on overwrite-in-place capable
filesystems by using a direct-to-hw REQ_ATOMIC overwrite to provide
old-or-new semantics instead of using COW.
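e.g. building on the snippet above (sketch only - EOPNOTSUPP is what
current kernels return for unsupported RWF_ATOMIC, though a real
application should probe support via the statx() atomic write
geometry fields first):

	#include <errno.h>

	/*
	 * Old-or-new overwrite optimised for COW filesystems, with a
	 * direct-to-hw REQ_ATOMIC fast path where supported.
	 */
	static ssize_t old_or_new_overwrite(int fd, void *buf,
					    size_t len, off_t off)
	{
		struct iovec iov = { .iov_base = buf, .iov_len = len };
		ssize_t ret;

		/* Fast path: hardware atomic overwrite-in-place. */
		ret = pwritev2(fd, &iov, 1, off, RWF_ATOMIC | RWF_DSYNC);
		if (ret >= 0 || errno != EOPNOTSUPP)
			return ret;

		/*
		 * Fallback: rely on the filesystem using COW for
		 * overwrites (always-cow XFS, btrfs, bcachefs) to give
		 * old-or-new data semantics; RWF_DSYNC provides the
		 * completion-to-submission ordering the recovery
		 * algorithm needs.
		 */
		return pwritev2(fd, &iov, 1, off, RWF_DSYNC);
	}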

Yes, there are corner cases - partial writeback, fragmented files,
etc - where data will be a mix of old and new when using COW without
RWF_DSYNC.  Those are the cases that RWF_ATOMIC needs to mitigate,
but we don't need whacky page cache and writeback stuff to implement
RWF_ATOMIC semantics in COW-capable filesystems.

i.e. enhance the applications to take advantage of native COW
old-or-new data semantics for buffered writes, then we can look at
direct-to-hw fast paths to optimise those algorithms.

Trying to go direct-to-hw first without having any clue of how
applications are going to use such functionality is backwards.
Design the application-level code that needs highly performant
old-or-new buffered write guarantees, then we can optimise the data
paths for it...

-Dave.
-- 
Dave Chinner
david@...morbit.com
