Message-ID: <kpk2od2fuqofdoneqse2l3gvn7wbqx3y4vckmnvl6gc2jcaw4m@hsxqmxshckpj>
Date: Mon, 3 Nov 2025 12:14:06 +0100
From: Jan Kara <jack@...e.cz>
To: Christoph Hellwig <hch@....de>
Cc: Keith Busch <kbusch@...nel.org>, Dave Chinner <david@...morbit.com>, 
	Carlos Maiolino <cem@...nel.org>, Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>, 
	"Martin K. Petersen" <martin.petersen@...cle.com>, linux-kernel@...r.kernel.org, linux-xfs@...r.kernel.org, 
	linux-fsdevel@...r.kernel.org, linux-raid@...r.kernel.org, linux-block@...r.kernel.org
Subject: Re: fall back from direct to buffered I/O when stable writes are required

On Fri 31-10-25 17:47:01, Christoph Hellwig wrote:
> On Fri, Oct 31, 2025 at 09:57:35AM -0600, Keith Busch wrote:
> > Not sure of any official statement to that effect, but storage in
> > general always says the behavior of modifying data concurrently with
> > in-flight operations on that data produces non-deterministic results.
> 
> Yes, it's pretty clear that the result is non-deterministic in what you
> get.  But that still does not result in corruption, because there is a
> clear boundary (either the sector size, or for NVMe optionally even a
> larger boundary) that designates the atomicity boundary.

Well, is that boundary really guaranteed? I mean, if you modify the buffer
under IO, couldn't it happen that the DMA sees part of the sector new and
part of the sector old? I agree the window is small, but I think the real
guarantee is architecture-dependent and likely cacheline granularity or
something like that.
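
For illustration, here is a minimal userspace sketch of the racy pattern
under discussion (the device path /dev/sdX and the sizes are made up, and
this is not code from this thread): an O_DIRECT write whose buffer gets
scribbled on while the I/O is in flight, so the device may observe a torn
sector:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE 4096

static char *buf;

static void *scribble(void *arg)
{
	/* Modify the buffer with no synchronization against the write. */
	memset(buf, 'B', BUF_SIZE);
	return NULL;
}

int main(void)
{
	pthread_t t;
	int fd = open("/dev/sdX", O_WRONLY | O_DIRECT);	/* hypothetical device */

	posix_memalign((void **)&buf, 4096, BUF_SIZE);
	memset(buf, 'A', BUF_SIZE);

	pthread_create(&t, NULL, scribble, NULL);
	pwrite(fd, buf, BUF_SIZE, 0);	/* DMA may see a mix of old and new bytes */
	pthread_join(t, NULL);

	close(fd);
	return 0;
}

With PI enabled this is also the kind of pattern that would trip the
deterministic guard check error Keith mentions.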

> > An
> > application with such behavior sounds like a bug to me as I can't
> > imagine anyone purposefully choosing to persist data with a random
> > outcome. If PI is enabled, I think they'd rather get a deterministic
> > guard check error so they know they did something with undefined
> > behavior.
> 
> As long as you clearly define your transaction boundaries, that
> non-atomicity is not a problem per se.
> 
> > It's like having reads and writes to overlapping LBA and/or memory
> > ranges concurrently outstanding. There's no guaranteed result there
> > either; specs just say it's the host's responsibility to not do that.
> 
> There is no guaranteed result as in an enforced ordering.  But there
> is a pretty clear model that you get either the old or the new data at
> a well-defined boundary.
> 
> > The kernel doesn't stop an application from trying that on raw block
> > direct-io, but I'd say that's an application bug.
> 
> If it corrupts other applications' data, as in the RAID case, it's
> pretty clearly not an application bug.  It's also pretty clear that
> at least some applications (qemu and other VMs) have been doing this
> for 20+ years.

Well, I'm mostly of the opinion that modifying IO buffers in flight is an
application bug (as much as most current storage stacks tolerate it), but
on the other hand returning IO errors later, or even corrupting RAID5 on
resync, is in my opinion not sane error handling on the kernel side either,
so I think we need to do better.

I also think the performance cost of unconditional bounce buffering is so
heavy that it's just a polite way of pushing the app to do proper IO buffer
synchronization itself (assuming it cares about IO performance, but given
it bothered with direct IO, it presumably does).
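
To make "proper IO buffer synchronization" concrete, here is a sketch of
the well-behaved pattern (file name, sizes and queue depth are made up):
double buffering with io_uring, where a buffer is refilled only after its
previous write has completed, so nothing is ever modified while the device
may still be reading from it:

#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE	4096
#define NBUF		2
#define NWRITES		8

int main(void)
{
	struct io_uring ring;
	char *bufs[NBUF];
	int busy[NBUF] = { 0 };
	int fd, i;

	fd = open("data.img", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	io_uring_queue_init(NBUF, &ring, 0);
	for (i = 0; i < NBUF; i++)
		posix_memalign((void **)&bufs[i], 4096, BUF_SIZE);

	for (i = 0; i < NWRITES; i++) {
		int slot = i % NBUF;
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;

		/* Reap completions until this slot's previous write is done. */
		while (busy[slot]) {
			io_uring_wait_cqe(&ring, &cqe);
			busy[(int)(uintptr_t)io_uring_cqe_get_data(cqe)] = 0;
			io_uring_cqe_seen(&ring, cqe);
		}

		/* Only now is it safe to touch this buffer again. */
		memset(bufs[slot], 'A' + slot, BUF_SIZE);

		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_write(sqe, fd, bufs[slot], BUF_SIZE,
				    (off_t)i * BUF_SIZE);
		io_uring_sqe_set_data(sqe, (void *)(uintptr_t)slot);
		busy[slot] = 1;
		io_uring_submit(&ring);
	}

	/* Drain the remaining completions before tearing down. */
	for (i = 0; i < NBUF; i++) {
		while (busy[i]) {
			struct io_uring_cqe *cqe;

			io_uring_wait_cqe(&ring, &cqe);
			busy[(int)(uintptr_t)io_uring_cqe_get_data(cqe)] = 0;
			io_uring_cqe_seen(&ring, cqe);
		}
	}

	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}

The wait-before-reuse is what a well-behaved app pays instead of
unconditional bounce buffering, and it only costs anything when the app
actually wants to recycle a buffer that is still in flight.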

So the question is how to get out of this mess with the least disruption
possible, which IMO also means providing an easy way for well-behaved apps
to avoid the overhead.

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR
