Message-ID: <ZvNWqhnUgqk5BlS4@dread.disaster.area>
Date: Wed, 25 Sep 2024 10:17:46 +1000
From: Dave Chinner <david@...morbit.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Kent Overstreet <kent.overstreet@...ux.dev>,
	linux-bcachefs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org, Dave Chinner <dchinner@...hat.com>
Subject: Re: [GIT PULL] bcachefs changes for 6.12-rc1

On Tue, Sep 24, 2024 at 09:57:13AM -0700, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 20:55, Dave Chinner <david@...morbit.com> wrote:
> >
> > That's effectively what the patch did - it added a spinlock per hash
> > list.
> 
> Yeah, no, I like your patches, they all seem to be doing sane things
> and improve the code.
> 
> The part I don't like is how the basic model that your patches improve
> seems quite a bit broken.
> 
> For example, that whole superblock list lock contention just makes me
> go "Christ, that's stupid". Not because your patches to fix it with
> Waiman's fancy list would be wrong or bad, but because the whole pain
> is self-inflicted garbage.

I'm not about to disagree with that assessment.

> And it's all historically very very understandable. It wasn't broken
> when it was written.
> 
> That singly linked list is 100% sane in the historical context of "we
> really don't want anything fancy for this". The whole list of inodes
> for a superblock is basically unimportant, and its main use (only
> use?) is for the final "check that we don't have busy inodes at umount
> time, remove any pending stuff".
> 
> So using a stupid linked list was absolutely the right thing to do,
> but now that just the *locking* for that list is a pain, it turns out
> that we probably shouldn't use a list at all. Not because the list was
> wrong, but because a flat list is such a pain for locking, since
> there's no hierarchy to spread the locking out over.

Right, that's effectively what the dlist infrastructure has taught
us - we need some kind of hierarchy to spread the locking over. But
what that hierarchy looks like for an "iterate every object" list
isn't really clear cut...
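
For illustration only - this is not Waiman's dlock-list or the actual
patch, just the general "spread a flat list over independently locked
buckets" shape, with made-up names:

#include <linux/hash.h>
#include <linux/list.h>
#include <linux/log2.h>
#include <linux/spinlock.h>

#define NR_BUCKETS	64

struct bucketed_list {
	struct {
		spinlock_t		lock;
		struct list_head	head;
	} b[NR_BUCKETS];
};

/* adds/removes on different buckets never contend on the same lock */
static void bucketed_list_add(struct bucketed_list *bl,
			      struct list_head *node, unsigned long key)
{
	unsigned int i = hash_long(key, ilog2(NR_BUCKETS));

	spin_lock(&bl->b[i].lock);
	list_add(node, &bl->b[i].head);
	spin_unlock(&bl->b[i].lock);
}

Iteration still has to walk every bucket, though, which is where the
"what should the hierarchy look like" question comes in.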

Another option I'd considered was to offload the iteration to
filesystems that have internal tracking mechanisms. e.g. use a
superblock method (say ->for_all_cached_inodes()) that gets passed a
callback function for the operation to perform on a stabilised
inode.

This would work for XFS - we already have such an internal callback
based cache-walking infrastructure - xfs_iwalk() - that is used
extensively for internal admin, gc and scrub/repair functions.
If we could use this for VFS icache traversals, then
we wouldn't need to maintain the sb->s_inodes list at all in XFS.

But I didn't go down that route because I didn't think we wanted to
encourage each major filesystem to have its own unique internal
inode caching implementation with its own special subtle
differences. The XFS icache code is pretty complex and really
requires a lot of XFS expertise to understand - that makes changing
global inode caching behaviour or life cycle semantics much more
difficult than it already is.

That said, if we do decide that a model where filesystems will
provide their own inode caches is acceptable, then as a first step
we could convert the generic s_inodes list iteration to the callback
model fairly easily....
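
Something like this for the generic side - sketch only, the method
name and the exact stabilisation rules are assumptions rather than a
settled interface:

#include <linux/fs.h>
#include <linux/spinlock.h>

typedef void (*icache_cb_t)(struct inode *inode, void *arg);

/*
 * Generic fallback for filesystems that don't supply their own
 * ->for_all_cached_inodes() method: walk sb->s_inodes, pinning each
 * inode so it can't be freed while the callback runs. A filesystem
 * like XFS would wire the method up to its own walker instead.
 */
static void generic_for_all_cached_inodes(struct super_block *sb,
					  icache_cb_t cb, void *arg)
{
	struct inode *inode, *prev = NULL;

	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(&sb->s_inode_list_lock);

		cb(inode, arg);

		/*
		 * Drop the pin on the previous inode only now; the
		 * current inode stays pinned across re-taking the
		 * list lock so the list cursor remains valid, and
		 * iput() is never called under the list lock.
		 */
		iput(prev);
		prev = inode;

		spin_lock(&sb->s_inode_list_lock);
	}
	spin_unlock(&sb->s_inode_list_lock);
	iput(prev);
}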

> (We used to have that kind of "flat lock" for the dcache too, but
> "dcache_lock" went away many moons ago, and good riddance - but the
> only reason it could go away is that the dcache has a hierarchy, so
> that you can always lock just the local dentry (or the parent dentry)
> if you are just careful).

> 
> > [ filesystems doing their own optimized thing ]
> >
> > IOWs, it's not clear to me that this is a caching model we really
> > want to pursue in general because of a) the code duplication and b)
> > the issues such an inode life-cycle model has interacting with the
> > VFS life-cycle expectations...
> 
> No, I'm sure you are right, and I'm just frustrated with this code
> that probably _should_ look a lot more like the dcache code, but
> doesn't.
> 
> I get the very strong feeling that we should have a per-superblock
> hash table that we could also traverse the entries of. That way the
> superblock inode list would get some structure (the hash buckets) that
> would allow the locking to be distributed (and we'd only need one lock
> and just share it between the "hash inode" and "add it to the
> superblock list").

The only problem with this is the size of the per-sb hash tables
needed for scalability - we can't allocate system-sized hash tables
for every superblock just in case a superblock might be asked to
cache 100 million inodes. That's why Kent used rhashtables for the
bcachefs implementation - they resize according to how many objects
are being indexed, and hence scale both up and down.
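
Roughly, and just as an illustration of the resizing-hash approach -
this is not bcachefs's actual code and the names are made up:

#include <linux/fs.h>
#include <linux/rhashtable.h>

/* one entry per cached inode, keyed by inode number */
struct icache_entry {
	struct rhash_head	hash;
	u64			ino;
	struct inode		*inode;
};

static const struct rhashtable_params icache_params = {
	.head_offset		= offsetof(struct icache_entry, hash),
	.key_offset		= offsetof(struct icache_entry, ino),
	.key_len		= sizeof(u64),
	.automatic_shrinking	= true,	/* shrinks as well as grows */
};

/* per-superblock table: sized by how many inodes are actually cached */
static int sb_icache_init(struct rhashtable *ht)
{
	return rhashtable_init(ht, &icache_params);
}

static struct icache_entry *sb_icache_lookup(struct rhashtable *ht,
					     u64 ino)
{
	return rhashtable_lookup_fast(ht, &ino, icache_params);
}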

That is also why XFS uses multiple radix trees per-sb in its icache
implementation - they scale up efficiently, yet have a small
memory footprint when only a few inodes are in cache in a little
used filesystem.

> But that would require something much more involved than "improve the
> current code".

Yup.

FWIW, I think all this "how do we cache inodes better" discussion is
somewhat glossing over a more important question we need to think
about first: do we even need a fully fledged inode cache anymore?

Every inode brought into cache is pinned by a dentry. The dentry
cache has an LRU and cache aging, and so by the time a dentry is
aged out of the dentry cache and the inode is finally dropped, it
has not been in use for some time. It has aged out of the current
working set.

Given that we've already aged the inode out of the working set, why
do we then need to dump the inode into another LRU and age that out
again before we reclaim the inode? What does this double caching
actually gain us?

I've done experiments on XFS marking all inodes with I_DONTCACHE,
which means an inode gets removed from the inode cache the moment
its reference count goes to zero (i.e. when the dentry is dropped
from cache). This leaves the inode life cycle mirroring the dentry
life cycle, i.e. the inode is evicted when the dentry is aged out as
per normal.
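
The experiment amounts to something like this (sketch only, not the
actual XFS patch): set the flag on each inode as it is instantiated,
so the final iput() evicts it immediately instead of parking it on
the LRU.

#include <linux/fs.h>
#include <linux/spinlock.h>

static void fs_mark_inode_dontcache(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	inode->i_state |= I_DONTCACHE;	/* evict on last iput() */
	spin_unlock(&inode->i_lock);
}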

On SSD-based systems, I really don't see any performance degradation
for my typical workstation workloads like git tree operations and
kernel compiles. I don't see any noticeable impact on streaming
inode/data workloads, either. IOWs, the dentry cache appears to be
handling the working set maintenance duties pretty well for most
common workloads. Hence I question whether LRU-based inode cache
aging is needed at all.

So: should we be looking towards gutting the inode cache so that the
in-memory VFS inode lifecycle only tracks actively referenced inodes?
If so, then the management of VFS inodes becomes a lot simpler as
the LRU lock, maintenance and shrinker-based reclaim go away
entirely. Lots of code gets simpler if we trim down the VFS inode
life cycle to remove the caching of unreferenced inodes...

-Dave.
-- 
Dave Chinner
david@...morbit.com
