[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z_NvxebhNgXSIWkQ@infradead.org>
Date: Sun, 6 Apr 2025 23:25:09 -0700
From: Christoph Hellwig <hch@...radead.org>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: Christoph Hellwig <hch@...radead.org>,
Catherine Hoang <catherine.hoang@...cle.com>,
linux-ext4@...r.kernel.org
Subject: Re: [RFC PATCH v2 0/4] remove buffer heads from ext2
On Fri, Apr 04, 2025 at 09:43:22AM -0700, Darrick J. Wong wrote:
> On Thu, Mar 27, 2025 at 03:32:42AM -0700, Christoph Hellwig wrote:
> > On Tue, Mar 25, 2025 at 06:49:24PM -0700, Catherine Hoang wrote:
> > > Hi all,
> > >
> > > This series is an effort to begin removing buffer heads from ext2.
> >
> > Why is that desirable?
>
> struct buffer_head is a mismash of things -- originally it was a landing
> place for the old buffer cache, right?
Yes.
> So it has the necessary things
> like a pointer to a memory page, the disk address, a length, buffer
> state flags (uptodate/dirty), and some locks.
Yes. (although the page to folio migration is almost done).
> For filesystem metadata
> blocks I think that's all that most filesystems really need.
Yes.
> Assuming
> that filesystems /never/ want overlapping metadata buffers, I think it's
> more efficient to look up buffer objects via an rhashtable instead of
> walking the address_space xarray to find a folio, and then walking a
> linked list from that folio to find the particular bh.
Probably. But does it make a practical difference for ext2?
> Unfortunately, it also has a bunch of file mapping state information
> (e.g. BH_Delalloc) that aren't needed for caching metadata blocks. All
> the confusion that results from the incohesive mixing of these two
> usecases goes away by separating out the metadata buffers into a
> separate cache and (ha) leaving the filesystems to port the file IO
> paths to iomap.
Exactly.
> Separating filesystem metadata buffers into a private datastructure
> instead of using the blockdev pagecache also closes off an entire class
> of attack surface where evil userspace can wait for a filesystem to load
> a metadata block into memory and validate it;
It does close that window. OTOH that behavior has traditionally been
expected very much for extN for use with some tools (IIRC tunefs), so
changing that would actually break existing userspace.
> and then scribble on the
> pagecache block to cause the filesystem driver to make the wrong
> decisions -- look at all the ext4 metadata_csum bugs where syzkaller
> discovered that the decision to call the crc32c driver was gated on a
> bit in a bufferhead, and setting that bit having not initialized the
> crc32c driver would lead to a kernel crash. Nowadays we have
> CONFIG_BLK_DEV_WRITE_MOUNTED to shut that down, though it defaults to y
> and I think that might actually break leased layout things like pnfs.
pNFS scsi/nvme only works for remote nodes, so it's not affected by
this. The block layout could theoretically work locally, but it's
a pretty broken protocol to start with.
> So the upsides are: faster lookups, a more cohesive data structure that
> only tries to do one thing, and closing attack surfaces.
>
> The downsides: this new buffer cache code still needs: an explicit hook
> into the dirty pagecache timeout to start its own writeback; to provide
> its own shrinker; and some sort of solution for file mapping metadata so
> that fsync can flush just those blocks and not the whole cache.
The other major downside is instead of one maintained code base we now
have one per file system. That's probably okay for well maintained
file systems with somewhat special needs (say btrfs or xfs), but I'd
rather not have that for every misc file system using buffer_heads
right now.
So if we want to reduce problems with buffer_heads I'd much rather see
all data path usage go away ASAP. After that the weird entanglement
with jbd2 would be the next step (and maybe a private buffer cache for
ext4/gfs2 is the answer there). But copy and pasting a new buffer cache
into simple and barely maintained file systems like ext2 feels at least
a bit questionable.
Powered by blists - more mailing lists