Message-ID: <20251104233824.GO196370@frogsfrogsfrogs>
Date: Tue, 4 Nov 2025 15:38:24 -0800
From: "Darrick J. Wong" <djwong@...nel.org>
To: Christoph Hellwig <hch@....de>
Cc: Jan Kara <jack@...e.cz>, Keith Busch <kbusch@...nel.org>,
	Dave Chinner <david@...morbit.com>,
	Carlos Maiolino <cem@...nel.org>,
	Christian Brauner <brauner@...nel.org>,
	"Martin K. Petersen" <martin.petersen@...cle.com>,
	linux-kernel@...r.kernel.org, linux-xfs@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, linux-raid@...r.kernel.org,
	linux-block@...r.kernel.org
Subject: Re: fall back from direct to buffered I/O when stable writes are
 required

On Mon, Nov 03, 2025 at 01:21:11PM +0100, Christoph Hellwig wrote:
> On Mon, Nov 03, 2025 at 12:14:06PM +0100, Jan Kara wrote:
> > > Yes, it's pretty clear that the result is non-deterministic in what you
> > > get.  But that result still does not amount to corruption, because
> > > there is a clear boundary (either the sector size, or for NVMe
> > > optionally an even larger one) that designates the atomicity boundary.
> > 
> > Well, is that boundary really guaranteed? I mean if you modify the buffer
> > under IO couldn't it happen that the DMA sees part of the sector new and
> > part of the sector old? I agree the window is small but I think the real
> > guarantee is architecture dependent and likely cacheline granularity or
> > something like that.
> 
> If you actually modify it: yes.  But I think Keith's argument was just
> about regular racing reads vs writes.
> 
> > > pretty clearly not an application bug.  It's also pretty clear that
> > > at least some applications (qemu and other VMs) have been doing this
> > > for 20+ years.
> > 
> > Well, I'm mostly of the opinion that modifying IO buffers in flight is an
> > application bug (as much as most current storage stacks tolerate it) but on
> > the other hand returning IO errors later or even corrupting RAID5 on resync
> > is, in my opinion, not sane error handling on the kernel side either, so I
> > think we need to do better.
> 
> Yes.  Also if you look at the man page, which is about as official as it
> gets for the semantics, you can't find anything requiring the buffers to be
> stable (but all kinds of other odd rants).
> 
> > I also think the performance cost of the unconditional bounce buffering is
> > so heavy that it's just a polite way of pushing the app to do proper IO
> > buffer synchronization itself (assuming it cares about IO performance,
> > but given that it bothered with direct IO it presumably does).
> >
> > So the question is how to get out of this mess with the least disruption
> > possible which IMO also means providing easy way for well-behaved apps to
> > avoid the overhead.
> 
> Remember the cases where this matters are checksumming and parity, where
> we touch all the cache lines anyway and consume the DRAM bandwidth,
> although bounce buffering upgrades this from pure reads to reads plus
> writes.  So the overhead is heavy, but if we handle it the right way,
> that is, doing the checksum/parity calculation while the cache line is
> still hot, it should not be prohibitive.  And getting this right in the
> direct I/O code means that the low-level code could stop bounce buffering
> for buffered I/O, providing a major speedup there.
> 
> I've been thinking a bit more on how to better get the copy close to the
> checksumming at least for PI, and to avoid the extra copies for RAID5
> buffered I/O.  Maybe a better way is to mark a bio as trusted/untrusted
> so that the checksumming/raid code can bounce buffer it, and I'm starting
> to like that idea.  A complication is that PI could relax that requirement
> if we support PI passthrough from userspace (currently only for block
> devices, but I plan to add file system support), where the device checks
> it, but we can't do that for parity RAID.

IIRC, a PI disk is supposed to check the supplied CRC against the
supplied data, and fail the write if there's a discrepancy, right?  In
that case, an application can't actually corrupt its own data because
hardware will catch it.

For reads, the kernel will check the supplied CRC against the data
buffer, right?  So a program can blow itself up, but that only affects
the buggy program.
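
To spell out the check I'm assuming here -- purely an illustration, not
the actual bio_integrity/t10-pi code path:

#include <asm/byteorder.h>
#include <linux/crc-t10dif.h>
#include <linux/types.h>

/*
 * The guard tag is a CRC generated over each protection interval when
 * the I/O is submitted; the drive (for writes) and the kernel (for
 * reads) recompute it over the data they actually see.  A buffer that
 * was scribbled on in flight shows up as a mismatch, so the I/O fails
 * instead of silently landing corrupt data.
 */
static bool pi_guard_matches(const u8 *data, size_t interval_len,
			     __be16 supplied_guard)
{
	return cpu_to_be16(crc_t10dif(data, interval_len)) == supplied_guard;
}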

I think that means the following:

A. We can allow mutant directio to non-PI devices because buggy programs
   can only screw themselves over.  Not great, but we've allowed this
   forever.

B. We can also allow it to PI devices because those buggy programs will
   get hit with EIOs immediately.

C. Mutant directio reads from a RAID1/5 on non-PI devices are ok-ish
   because the broken application can decide to retry and that's just
   wasting resources.

D. Mutant directio reads from a RAID1/5 on PI devices are not good
   because the read failure will result in an unnecessary rebuild, which
   could turn really bad if the other disks are corrupt.

E. Mutant directio writes to a RAID5 are bad bad bad because you corrupt
   the stripe and now unsuspecting users on other strips lose data.

I think the btrfs corruption problems are akin to a RAID5 where you can
persist the wrong CRC to storage and you'll only see it on re-read; but
at least the blast radius is contained to the buggy application's file.

I wonder if that means we really need a way to convey the potential
damage of a mutant write through the block layer / address space so that
the filesystem can do the right thing?  IOWs, instead of a single
stable-pages flag, something along the lines of:

enum mutation_blast_radius {
	/* nobody will notice a thing */
	MBR_UNCHECKED,

	/* program doing the corruption will notice */
	MBR_BADAPP,

	/* everyone else's data get corrupted too */
	MBR_EVERYONE,
};

AS_STABLE_WRITES is set for MBR_BADAPP and MBR_EVERYONE, and the
directio -> dontcache flag change is done for a write to a MBR_EVERYONE
bdev.
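
Roughly the wiring I have in mind -- sketch only; bdev_mutation_blast_radius()
is a made-up helper standing in for however the driver advertises its class,
while mapping_set_stable_writes() is the existing AS_STABLE_WRITES setter:

#include <linux/blkdev.h>
#include <linux/pagemap.h>

static void apply_blast_radius(struct address_space *mapping,
			       struct block_device *bdev)
{
	switch (bdev_mutation_blast_radius(bdev)) {	/* hypothetical */
	case MBR_UNCHECKED:
		/* racing writers only get nondeterministic sector contents */
		break;
	case MBR_BADAPP:
		/* racing writers only hurt themselves; still keep pages
		 * stable while integrity data is generated */
		mapping_set_stable_writes(mapping);
		break;
	case MBR_EVERYONE:
		/* racing writers can corrupt other users' data: stable
		 * pages, and the directio path would also check this to
		 * fall back to buffered+dontcache writes */
		mapping_set_stable_writes(mapping);
		break;
	}
}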

Hm?

--D
