Message-ID: <20250407165411.GA6262@frogsfrogsfrogs>
Date: Mon, 7 Apr 2025 09:54:11 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: Christoph Hellwig <hch@...radead.org>
Cc: Catherine Hoang <catherine.hoang@...cle.com>,
linux-ext4@...r.kernel.org
Subject: Re: [RFC PATCH v2 0/4] remove buffer heads from ext2
On Sun, Apr 06, 2025 at 11:25:09PM -0700, Christoph Hellwig wrote:
> On Fri, Apr 04, 2025 at 09:43:22AM -0700, Darrick J. Wong wrote:
> > On Thu, Mar 27, 2025 at 03:32:42AM -0700, Christoph Hellwig wrote:
> > > On Tue, Mar 25, 2025 at 06:49:24PM -0700, Catherine Hoang wrote:
> > > > Hi all,
> > > >
> > > > This series is an effort to begin removing buffer heads from ext2.
> > >
> > > Why is that desirable?
> >
> > struct buffer_head is a mishmash of things -- originally it was a landing
> > place for the old buffer cache, right?
>
> Yes.
>
> > So it has the necessary things
> > like a pointer to a memory page, the disk address, a length, buffer
> > state flags (uptodate/dirty), and some locks.
>
> Yes. (although the page to folio migration is almost done).

Yeah, now that your folio conversion has landed for xfs_buf, Catherine
should crib your implementation into this one. Though at least ext2
doesn't have all the discontiguous multi-fsblock stuff to deal with.
> > For filesystem metadata
> > blocks I think that's all that most filesystems really need.
>
> Yes.
>
> > Assuming
> > that filesystems /never/ want overlapping metadata buffers, I think it's
> > more efficient to look up buffer objects via an rhashtable instead of
> > walking the address_space xarray to find a folio, and then walking a
> > linked list from that folio to find the particular bh.
>
> Probably. But does it make a practical difference for ext2?

Not really; ext2 is just the test vehicle for getting this going.
> > Unfortunately, it also has a bunch of file mapping state information
> > (e.g. BH_Delalloc) that aren't needed for caching metadata blocks. All
> > the confusion that results from the incohesive mixing of these two
> > usecases goes away by separating out the metadata buffers into a
> > separate cache and (ha) leaving the filesystems to port the file IO
> > paths to iomap.
>
> Exactly.
>
> > Separating filesystem metadata buffers into a private datastructure
> > instead of using the blockdev pagecache also closes off an entire class
> > of attack surface where evil userspace can wait for a filesystem to load
> > a metadata block into memory and validate it;
>
> It does close that window. OTOH that behavior has traditionally been
> expected very much for extN for use with some tools (IIRC tunefs), so
> changing that would actually break existing userspace.

Ted at least would like to get rid of tune2fs' scribbling on the block
device. It's really gross in ext4 since tune2fs can race with the
kernel to update sb fields and recalculate the checksum.
> > and then scribble on the
> > pagecache block to cause the filesystem driver to make the wrong
> > decisions -- look at all the ext4 metadata_csum bugs where syzkaller
> > discovered that the decision to call the crc32c driver was gated on a
> > bit in a bufferhead, and setting that bit without having initialized the
> > crc32c driver would lead to a kernel crash. Nowadays we have
> > CONFIG_BLK_DEV_WRITE_MOUNTED to shut that down, though it defaults to y
> > and I think that might actually break leased layout things like pnfs.
>
> pNFS scsi/nvme only works for remote nodes, so it's not affected by
> this. The block layout could theoretically work locally, but it's
> a pretty broken protocol to start with.

Ah.
> > So the upsides are: faster lookups, a more cohesive data structure that
> > only tries to do one thing, and closing attack surfaces.
> >
> > The downsides: this new buffer cache code still needs: an explicit hook
> > into the dirty pagecache timeout to start its own writeback; to provide
> > its own shrinker; and some sort of solution for file mapping metadata so
> > that fsync can flush just those blocks and not the whole cache.
>
> The other major downside is instead of one maintained code base we now
> have one per file system. That's probably okay for well maintained
> file systems with somewhat special needs (say btrfs or xfs), but I'd
> rather not have that for every misc file system using buffer_heads
> right now.
>
> So if we want to reduce problems with buffer_heads I'd much rather see
> all data path usage go away ASAP. After that the weird entanglement
> with jbd2 would be the next step (and maybe a private buffer cache for
> ext4/gfs2 is the answer there). But copy and pasting a new buffer cache
> into simple and barely maintained file systems like ext2 feels at least
> a bit questionable.

The next step is to yank all that code up to fs/ and port another
filesystem (e.g. fat) to it. But let's let Catherine get it working for
all the metadata types within the confines of ext2.

--D