Date: Tue, 11 Jun 2024 20:17:16 +1000
From: Dave Chinner <david@...morbit.com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: brauner@...nel.org, viro@...iv.linux.org.uk, jack@...e.cz,
	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [RFC PATCH] vfs: add rcu-based find_inode variants for iget ops

On Fri, Jun 07, 2024 at 09:51:51AM +0200, Mateusz Guzik wrote:
> On Fri, Jun 07, 2024 at 12:04:58PM +1000, Dave Chinner wrote:
> > On Thu, Jun 06, 2024 at 04:05:15PM +0200, Mateusz Guzik wrote:
> > > Instantiating a new inode normally takes the global inode hash lock
> > > twice:
> > > 1. once to check if it happens to already be present
> > > 2. once to add it to the hash
> > > 
> > > The back-to-back lock/unlock pattern is known to degrade performance
> > > significantly, which is further exacerbated if the hash is heavily
> > > populated (long chains to walk, extending hold time). Arguably hash
> > > sizing and hashing algo need to be revisited, but that's beyond the
> > > scope of this patch.
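
In userspace terms, the pattern being described is roughly the sketch
below (hypothetical names and a plain mutex standing in for the inode
hash lock, not the actual iget/find_inode code):

#include <pthread.h>
#include <stdlib.h>

#define NBUCKETS 1024

struct node {
        struct node *next;
        unsigned long key;
};

static struct node *table[NBUCKETS];
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

static struct node *find_locked(unsigned long key)
{
        struct node *n;

        for (n = table[key % NBUCKETS]; n; n = n->next)
                if (n->key == key)
                        return n;
        return NULL;
}

/* Instantiation takes the global lock twice: once to check, once to add. */
struct node *get_node(unsigned long key)
{
        struct node *n, *new;

        pthread_mutex_lock(&table_lock);        /* lock #1: the miss check */
        n = find_locked(key);
        pthread_mutex_unlock(&table_lock);
        if (n)
                return n;

        new = calloc(1, sizeof(*new));          /* set up the new entry unlocked */
        if (!new)
                return NULL;
        new->key = key;

        pthread_mutex_lock(&table_lock);        /* lock #2: the insert */
        n = find_locked(key);                   /* re-check: someone may have raced us */
        if (!n) {
                new->next = table[key % NBUCKETS];
                table[key % NBUCKETS] = new;
                n = new;
                new = NULL;
        }
        pthread_mutex_unlock(&table_lock);
        free(new);
        return n;
}
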
> > > 
> > > A long term fix would introduce fine-grained locking, this was attempted
> > > in [1], but that patchset was already posted several times and appears
> > > stalled.
> > 
> > Why not just pick up those patches and drive them to completion?
> > 
> 
> Time constraints on my end aside.
> 
> From your own e-mail [1] last year, the problems are:
> 
> > - A lack of recent validation against ext4, btrfs and other
> > filesystems.
> > - the loss of lockdep coverage by moving to bit locks
> > - it breaks CONFIG_PREEMPT_RT=y because we nest other spinlocks
> >   inside the inode_hash_lock and we can't do that if we convert the
> >   inode hash to bit locks because RT makes spinlocks sleeping locks.
> > - There's been additions for lockless RCU inode hash lookups from
> >   AFS and ext4 in weird, uncommon corner cases and I have no idea
> >   how to validate they still work correctly with hash-bl. I suspect
> >   they should just go away with hash-bl, but....
> 
> > There's more, but these are the big ones.
> 
> I did see the lockdep and preempt_rt problem were patched up in a later
> iteration ([2]).
> 
> What we both agree on is that the patchset adds enough complexity that
> it needs solid justification. I assumed one was there on your end when
> you posted it.
> 
> For that entire patchset I don't have one. I can however justify the
> comparatively trivial thing I posted in this thread.

I didn't suggest you pick up the entire patchset, I suggested you
pull the inode cache conversion to hash-bl from it and use that
instead of hacking RCU lookups into the inode cache.

> That aside if I had to make the entire thing scale I would approach
> things differently, most notably in terms of locking granularity. Per
> your own statement things can be made to look great in microbenchmarks,
> but that does not necessarily mean they help. A lot of it is a tradeoff
> and making everything per-cpu for this particular problem may be taking
> it too far.

The hash-bl conversion doesn't make anything per-cpu, so I don't
know what you are complaining about here.

> For the inode hash it may be that the old hack of having a lock array
> would do it more than well enough -- say 32 locks (or some other number
> dependent on hash size) each covering a dedicated subset.

No, that's a mid 1990s-era hack that was done because we didn't know
any better ways to scale global algorithms back then. 
It's not a scalable algorithm, it's a global algorithm that
has been sharded a few times to try to keep global scope operations
apart. The moment you have subset collisions because of a specific
workload pattern, then the scalability problem comes straight back.
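
To be concrete about what that looks like, here is a hypothetical
userspace sketch of the lock-array idea (nothing from the tree, just
the shape of it):

#include <pthread.h>

#define NBUCKETS        4096
#define NLOCKS          32      /* "or some other number dependent on hash size" */

static struct node *table[NBUCKETS];
static pthread_mutex_t shard_lock[NLOCKS];

static void table_init(void)
{
        int i;

        for (i = 0; i < NLOCKS; i++)
                pthread_mutex_init(&shard_lock[i], NULL);
}

/* Every bucket in a given shard shares one lock. */
static pthread_mutex_t *lock_for(unsigned long key)
{
        return &shard_lock[(key % NBUCKETS) % NLOCKS];
}

Any two hot keys whose buckets land in the same shard serialise on
that one lock exactly as they would on a global lock, which is the
collision problem described above.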

We have been doing better than this for years, and it's not a viable
solution for things like core OS cache indexing. We need a construct
that is inherently scalable and has little cost to that inherent
scalability.

We have an inherently scalable solution already - hash-bl adds
fine-grained locking without changing locking behaviour or
consuming extra memory. And it scales lookup, insert and remove with
a common locking scheme across all operations.

Your patch, however, just converts *some* of the lookup API
operations to use RCU. It adds complexity for things like inserts
which are going to need inode hash locking if the RCU lookup fails,
anyway.

Hence your patch optimises the case where the inode is in cache but
the dentry isn't, yet we'll still get massive contention on cold
cache lookups, where the RCU lookup misses and a locked insert is
always going to be required.
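
Roughly, that change has the shape of the sketch below - hypothetical
names, plain C11 atomics standing in for rcu_dereference() and
rcu_assign_pointer(), and nodes that are simply never freed (which is
the part real RCU grace periods have to handle):

#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define NBUCKETS 1024

struct node {
        struct node *_Atomic next;
        unsigned long key;
};

static struct node *_Atomic table[NBUCKETS];
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hot path: walk the chain with no lock at all. */
static struct node *find_lockless(unsigned long key)
{
        struct node *n;

        n = atomic_load_explicit(&table[key % NBUCKETS], memory_order_acquire);
        while (n) {
                if (n->key == key)
                        return n;
                n = atomic_load_explicit(&n->next, memory_order_acquire);
        }
        return NULL;
}

struct node *get_node(unsigned long key)
{
        struct node *n = find_lockless(key);

        if (n)
                return n;                       /* cache hit: no lock taken */

        /* Cache miss: the global lock is back, for every caller. */
        pthread_mutex_lock(&table_lock);
        n = find_lockless(key);                 /* re-check under the lock */
        if (!n) {
                struct node *new = calloc(1, sizeof(*new));

                if (new) {
                        new->key = key;
                        atomic_store_explicit(&new->next,
                                        table[key % NBUCKETS],
                                        memory_order_relaxed);
                        atomic_store_explicit(&table[key % NBUCKETS], new,
                                        memory_order_release);
                }
                n = new;
        }
        pthread_mutex_unlock(&table_lock);
        return n;
}

The hit path avoids the lock entirely, but every miss still funnels
into table_lock for the insert, so cold cache lookups contend just as
before.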

IOWs, even RCU lookups are not going to prevent inode hash lock
contention for parallel cold cache lookups. Hence, with RCU,
applications are going to see unpredictable contention behaviour
dependent on the memory footprint of the caches at the time of the
lookup. Users will have no way of predicting when the behaviour will
change, let alone have any way of mitigating it. Unpredictable
variable behaviour is the thing we want to avoid the most with core
OS caches.

In comparison, hash-bl based inode cache lookups will have the same
behaviour for both cache hits and cache misses. There will be
nearly zero contention for both types of lookups, as the insert can
be done immediately under the same lock that is held for the
lookup. Hence performance will be predictable and largely
deterministic under a wide variety of workloads and changing memory
pressure. This is how we want the inode cache to behave.
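
Here is a userspace sketch of that shape - hypothetical names, and one
pthread mutex per bucket standing in for the hlist_bl bit lock that
the real conversion packs into the bucket head pointer:

#include <pthread.h>
#include <stdlib.h>

#define NBUCKETS 1024

struct node {
        struct node *next;
        unsigned long key;
};

struct bucket {
        pthread_mutex_t lock;   /* hash-bl hides this in the head pointer */
        struct node *head;
};

static struct bucket table[NBUCKETS];

static void table_init(void)
{
        int i;

        for (i = 0; i < NBUCKETS; i++)
                pthread_mutex_init(&table[i].lock, NULL);
}

struct node *get_node(unsigned long key)
{
        struct bucket *b = &table[key % NBUCKETS];
        struct node *n;

        pthread_mutex_lock(&b->lock);
        for (n = b->head; n; n = n->next)
                if (n->key == key)
                        goto out;               /* hit: found under the bucket lock */

        /* miss: insert immediately, under the same bucket lock */
        n = calloc(1, sizeof(*n));
        if (n) {
                n->key = key;
                n->next = b->head;
                b->head = n;
        }
out:
        pthread_mutex_unlock(&b->lock);
        return n;
}

Hits and misses take exactly the same path and the same single bucket
lock, and two lookups only contend when they hash to the same bucket.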

IOWs, the hash-bl solution is superior to RCU lookups for many
reasons. RCU lookups might be fast, but they are not the best solution
to every problem that requires scalable behaviour.

> All that said, if someone(tm) wants to pick up your patchset and even
> commit it the way it is right now I'm not going to protest anything.
>
> I don't have time nor justification to do full work My Way(tm).

You *always* say this sort of thing when someone asks you to do
something different.

If you don't agree with the reviewer's comments, you make some
statement about how you "don't care enough to fix it properly" or
that you don't have time to fix it properly, or that you think the
reviewer is asking you to "fix everything" rather than just one
line in your patch so you reject all the reviewer's comments, or you
say "if the maintainers want something" as a way of saying "you
don't have any authority, so I'm not going to listen to you at all",
or ....

> I did have time to hack up the initial rcu lookup, which imo does not
> require much to justify inclusion and does help the case I'm interested
> in.

I think it requires a lot more justification than you have given,
especially given the fact that changing memory pressure, access
patterns and cache balance will have a great impact on whether the
RCU path makes any difference to inode hash lock contention at all.

The hash-bl conversion is a much better solution and the
work is mostly done. If you want your workload case to perform
better, then please pick up the inode hash lock conversion to
hash-bl.

-Dave.
-- 
Dave Chinner
david@...morbit.com
