Message-ID: <aR8GObWa1mtbbtts@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Thu, 20 Nov 2025 17:44:49 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Dave Chinner <david@...morbit.com>
Cc: Ritesh Harjani <ritesh.list@...il.com>, Christoph Hellwig <hch@....de>,
        Christian Brauner <brauner@...nel.org>, djwong@...nel.org,
        john.g.garry@...cle.com, tytso@....edu, willy@...radead.org,
        dchinner@...hat.com, linux-xfs@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-ext4@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-mm@...ck.org, jack@...e.cz,
        nilay@...ux.ibm.com, martin.petersen@...cle.com, rostedt@...dmis.org,
        axboe@...nel.dk, linux-block@...r.kernel.org,
        linux-trace-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/8] xfs: single block atomic writes for buffered IO

On Sun, Nov 16, 2025 at 07:11:50PM +1100, Dave Chinner wrote:
> On Fri, Nov 14, 2025 at 02:50:25PM +0530, Ojaswin Mujoo wrote:
> > On Thu, Nov 13, 2025 at 09:32:11PM +1100, Dave Chinner wrote:
> > > On Thu, Nov 13, 2025 at 11:12:49AM +0530, Ritesh Harjani wrote:
> > > > Christoph Hellwig <hch@....de> writes:
> > > > 
> > > > > On Thu, Nov 13, 2025 at 08:56:56AM +1100, Dave Chinner wrote:
> > > > >> On Wed, Nov 12, 2025 at 04:36:03PM +0530, Ojaswin Mujoo wrote:
> > > > >> > This patch adds support to perform single block RWF_ATOMIC writes for
> > > > >> > iomap xfs buffered IO. This builds upon the initial RFC shared by John
> > > > >> > Garry last year [1]. Most of the details are present in the respective 
> > > > >> > commit messages but I'd mention some of the design points below:
> > > > >> 
> > > > >> What is the use case for this functionality? i.e. what is the
> > > > >> reason for adding all this complexity?
> > > > >
> > > > > Seconded.  The atomic code has a lot of complexity, and further mixing
> > > > > it with buffered I/O makes this even worse.  We'd need a really important
> > > > > use case to even consider it.
> > > > 
> > > > I agree this should have been in the cover letter itself. 
> > > > 
> > > > I believe the reason for adding this functionality was also discussed
> > > > at LSFMM...
> > > > 
> > > > For example, https://lwn.net/Articles/974578/ goes in depth and talks
> > > > about Postgres folks looking for this, since PostgreSQL uses buffered
> > > > I/O for its database writes.
> > > 
> > > Pointing at a discussion about how "this application has some ideas
> > > on how it can maybe use it someday in the future" isn't a
> > > particularly good justification. This still sounds more like a
> > > research project than something a production system needs right now.
> > 
> > Hi Dave, Christoph,
> > 
> > There were some discussions around use cases for buffered atomic writes
> > in the previous LSFMM covered by LWN here [1]. AFAIK, there are 
> > databases that recommend/prefer buffered IO over direct IO. As mentioned
> > in the article, MongoDB is one that supports both but recommends
> > buffered IO. Further, many DBs support both direct IO and buffered IO
> > well and it may not be fair to force them to stick to direct IO to get
> > the benefits of atomic writes.
> > 
> > [1] https://lwn.net/Articles/1016015/
> 
> You are quoting a discussion about atomic writes that was
> held without any XFS developers present. Given how XFS has driven
> atomic write functionality so far, XFS developers might have some
> ..... opinions about how buffered atomic writes should be done in XFS...
> 
> Indeed, go back to the 2024 buffered atomic IO LSFMM discussion,
> where there were XFS developers present. That's the discussion that
> Ritesh referenced, so you should be aware of it.
> 
> https://lwn.net/Articles/974578/
> 
> Back then I talked about how atomic writes made no sense as
> -writeback IO- given the massive window for anything else to modify
> the data in the page cache. There is no guarantee that what the
> application wrote in the syscall is what gets written to disk with
> writeback IO. i.e. anything that can access the page cache can
> "tear" application data that is staged as "atomic data" for later
> writeback.
> 
> IOWs, the concept of atomic writes for writeback IO makes almost no
> sense at all - dirty data at rest in the page cache is not protected
> against 3rd party access or modification. The "atomic data IO"
> semantics can only exist in the submitting IO context where
> exclusive access to the user data can be guaranteed.
> 
> IMO, the only semantics that make sense for buffered atomic
> writes through the page cache are write-through IO semantics.
> The "atomic" context is tied directly to the user data provided
> at IO submission, and so the submitted IO must guarantee that
> exactly that data is written to disk in that IO.
> 
> IOWs, we have to guarantee exclusive access between the data copy-in
> and the pages being marked for writeback. The mapping needs to be
> marked as using stable pages to prevent anyone else changing the
> cached data whilst it has an atomic IO pending on it.
> 
> That means folios covering atomic IO ranges do not sit in the page
> cache in a dirty state - they *must* immediately transition to the
> writeback state before the folio is unlocked so that *nothing else
> can modify them* before the physical REQ_ATOMIC IO is submitted and
> completed.
> 
> If we've got the folios marked as writeback, we can pack them
> immediately into a bio and submit the IO (e.g. via the iomap DIO
> code). There is no need to involve the buffered IO writeback path
> here; we've already got the folios at hand and in the right state
> for IO. Once the IO is done, we end writeback on them and they
> remain clean in the page cache for anyone else to access and
> modify...

Hi Dave,

I believe the essence of your comment is that the data in the page
cache can be modified between write time and writeback time, and hence
it makes sense to have write-through-only semantics for RWF_ATOMIC
buffered IO.

However, as per various discussions around this on the mailing list, it
is my understanding that protecting against tearing caused by an
application modifying a data range that was previously written
atomically is something that falls outside the scope of RWF_ATOMIC.

As John pointed out in [1], even with DIO, RWF_ATOMIC writes can be torn
if the application issues parallel overlapping writes. The only thing we
guarantee is that the data doesn't tear when the actual IO happens, and
from there it is userspace's responsibility to not change the data until
the IO completes [2] (rough sketch of that contract below). I believe
userspace changing data between write time and writeback time falls in
the same category.


[1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/
[2] https://lore.kernel.org/fstests/20250729144526.GB2672049@frogsfrogsfrogs/
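
To make that concrete, here is a minimal userspace sketch of the
contract as I understand it (illustrative only; it assumes a kernel
and glibc headers that expose RWF_ATOMIC, and the file name and 4k
block size are just placeholders):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        char *buf;
        int fd = open("datafile", O_RDWR);

        if (fd < 0 || posix_memalign((void **)&buf, 4096, 4096))
                return 1;
        memset(buf, 0xab, 4096);

        struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

        /*
         * The untorn guarantee only covers this IO: the block on disk
         * ends up either all-old or all-new.  Until the IO completes
         * (for buffered IO, until the data has been written back or
         * synced), userspace must not modify buf or issue overlapping
         * writes to the same range, otherwise it can still see tearing.
         */
        if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) < 0)
                perror("pwritev2(RWF_ATOMIC)");

        /* Persist before reusing or modifying the range. */
        if (fdatasync(fd) < 0)
                perror("fdatasync");

        free(buf);
        close(fd);
        return 0;
}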

> 
> This gives us the same physical IO semantics for buffered and direct
> atomic IO, and it allows the same software fallbacks for larger IO
> to be used as well.
> 
> > > Why didn't you use the existing COW buffered write IO path to
> > > implement atomic semantics for buffered writes? The XFS
> > > functionality is already all there, and it doesn't require any
> > > changes to the page cache or iomap to support...
> > 
> > This patch set focuses on HW accelerated single block atomic writes with
> > buffered IO, to get some early reviews on the core design.
> 
> What hardware acceleration? Hardware atomic writes do not make
> IO faster; they only change IO failure semantics in certain corner
> cases. Making buffered writeback IO use REQ_ATOMIC does not change
> the failure semantics of buffered writeback from the point of view
> of an application; the application still has no idea just how much
> data or which files lost data when the system crashes.
> 
> Further, writeback does not retain application write ordering, so
> the application also has no control over the order that structured
> data is updated on physical media.  Hence if the application needs
> specific IO ordering for crash recovery (e.g. to avoid using a WAL)
> it cannot use background buffered writeback for atomic writes
> because that does not guarantee ordering.
> 
> What happens when you do two atomic buffered writes to the same file
> range? The second one hits the page cache, so now the crash recovery
> semantic is no longer "old or new", it's "some random older version
> or new". If the application rewrites a range frequently enough,
> on-disk updates could skip dozens of versions between "old" and
> "new", whilst other ranges of the file move one version at a time.
> The application has -zero control- of this behaviour because it is
> background writeback that determines when something gets written to
> disk, not the application.
> 
> IOWs, the only way to guarantee single version "old or new" atomic
> buffered overwrites for any given write would be to force flushing
> of the data post-write() completion.  That means either O_DSYNC,
> fdatasync() or sync_file_range(). And this turns the atomic writes
> into -write-through- IO, not write back IO...

I agree that there is no ordering guarantee without calls to sync and
friends, but as with all other IO paths, it has always been the
application that needs to enforce the ordering. Applications like DBs
are well aware of this; however, there are still areas where they can
benefit from unordered atomic IO, e.g. background writes of a bunch of
dirty buffers, which only need to be synced once during a checkpoint
(rough sketch below).
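
Something along these lines is the pattern I have in mind (purely a
hypothetical sketch; struct db_page, checkpoint_flush and the field
names are made up, and RWF_ATOMIC support in the headers is assumed):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

struct db_page {
        void   *data;           /* block-aligned buffer */
        size_t  len;            /* one filesystem block */
        off_t   offset;         /* block-aligned file offset */
};

/*
 * Flush a batch of dirty DB pages with buffered RWF_ATOMIC writes.
 * No per-write ordering is required; a single fdatasync() at
 * checkpoint time persists the whole batch.  After a crash, each
 * page is either entirely old or entirely new.
 */
static int checkpoint_flush(int fd, struct db_page *pages, int npages)
{
        for (int i = 0; i < npages; i++) {
                struct iovec iov = {
                        .iov_base = pages[i].data,
                        .iov_len  = pages[i].len,
                };

                if (pwritev2(fd, &iov, 1, pages[i].offset, RWF_ATOMIC) < 0)
                        return -1;
        }

        return fdatasync(fd);
}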

> 
> > Just like we did for direct IO atomic writes, the software fallback with
> > COW and multi block support can be added eventually.
> 
> If the reason for this functionality is "maybe someone
> can use it in future", then you're not implementing this
> functionality to optimise an existing workload. It's a research
> project looking for a user.
> 
> Work with the database engineers to build a buffered atomic write
> based engine that implements atomic writes with RWF_DSYNC.
> Make it work, and optimise it to be competitive with existing
> database engines, and then show how much faster it is using
> RWF_ATOMIC buffered writes.
> 
> Alternatively - write an algorithm that assumes the filesystem is
> using COW for overwrites, and optimise the data integrity algorithm
> based on this knowledge. e.g. use always-cow mode on XFS, or just
> optimise for normal bcachefs or btrfs buffered writes. Use O_DSYNC
> when completion to submission ordering is required. Now you have
> an application algorithm that is optimised for old-or-new behaviour,
> and that can then be accelerated on overwrite-in-place capable
> filesystems by using a direct-to-hw REQ_ATOMIC overwrite to provide
> old-or-new semantics instead of using COW.
> 
> Yes, there are corner cases - partial writeback, fragmented files,
> etc - where data will be a mix of old and new when using COW without
> RWF_DSYNC.  Those are the cases that RWF_ATOMIC needs to
> mitigate, but we don't need whacky page cache and writeback stuff to
> implement RWF_ATOMIC semantics in COW capable filesystems.
> 
> i.e. enhance the applications to take advantage of native COW
> old-or-new data semantics for buffered writes, then we can look at
> direct-to-hw fast paths to optimise those algorithms.
> 
> Trying to go direct-to-hw first without having any clue of how
> applications are going to use such functionality is backwards.
> Design the application-level code that needs highly performant
> old-or-new buffered write guarantees, then we can optimise the data
> paths for it...

Got it, thanks for the pointers, Dave. We will look into this.
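
Just to check my understanding of the RWF_DSYNC suggestion, the
per-write path of such an engine would presumably look roughly like
the sketch below (illustrative only; atomic_writethrough is a made-up
helper and RWF_ATOMIC support in the headers is assumed):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Write-through buffered atomic overwrite: RWF_DSYNC makes the data
 * durable as part of this call, so the "old or new" guarantee applies
 * to exactly the data passed in here, not to whatever the page cache
 * happens to contain at background writeback time.
 */
static ssize_t atomic_writethrough(int fd, const void *buf, size_t len,
                                   off_t off)
{
        struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

        return pwritev2(fd, &iov, 1, off, RWF_ATOMIC | RWF_DSYNC);
}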

Regards,
ojaswin

> 
> -Dave.
> -- 
> Dave Chinner
> david@...morbit.com
