linux-kernel - Re: [RFC PATCH] vfs: add rcu-based find

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ly3nx2r4cbnvhlwcficugq7zj62xij54mzuyjowcwaucbvkwns@nuxskfzj6lvc>
Date: Thu, 13 Jun 2024 12:50:35 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Dave Chinner <david@...morbit.com>
Cc: Jan Kara <jack@...e.cz>, brauner@...nel.org, viro@...iv.linux.org.uk, 
	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [RFC PATCH] vfs: add rcu-based find_inode variants for iget ops

On Wed, Jun 12, 2024 at 08:00:42AM +1000, Dave Chinner wrote:
> On Tue, Jun 11, 2024 at 01:28:46PM +0200, Jan Kara wrote:
> > On Tue 11-06-24 20:17:16, Dave Chinner wrote:
> > > Your patch, however, just converts *some* of the lookup API
> > > operations to use RCU. It adds complexity for things like inserts
> > > which are going to need inode hash locking if the RCU lookup fails,
> > > anyway.
> > > 
> > > Hence your patch optimises the case where the inode is in cache but
> > > the dentry isn't, but we'll still get massive contention on lookup
> > > when the RCU lookup on the inode cache and inserts are always going
> > > to be required.
> > > 
> > > IOWs, even RCU lookups are not going to prevent inode hash lock
> > > contention for parallel cold cache lookups. Hence, with RCU,
> > > applications are going to see unpredictable contention behaviour
> > > dependent on the memory footprint of the caches at the time of the
> > > lookup. Users will have no way of predicting when the behaviour will
> > > change, let alone have any way of mitigating it. Unpredictable
> > > variable behaviour is the thing we want to avoid the most with core
> > > OS caches.
> > 
> > I don't believe this is what Mateusz's patches do (but maybe I've terribly
> > misread them). iget_locked() does:
> > 
> > 	spin_lock(&inode_hash_lock);
> > 	inode = find_inode_fast(...);
> > 	spin_unlock(&inode_hash_lock);
> > 	if (inode)
> > 		we are happy and return
> > 	inode = alloc_inode(sb);
> > 	spin_lock(&inode_hash_lock);
> > 	old = find_inode_fast(...)
> > 	the rest of insert code
> > 	spin_unlock(&inode_hash_lock);
> > 
> > And Mateusz got rid of the first lock-unlock pair by teaching
> > find_inode_fast() to *also* operate under RCU. The second lookup &
> > insertion stays under inode_hash_lock as it is now.
> 
> Yes, I understand that. I also understand what that does to
> performance characteristics when memory pressure causes the working
> set cache footprint to change. This will result in currently 
> workloads that hit the fast path falling off the fast path and
> hitting contention and performing no better than they do today.
> 
> Remember, the inode has lock is taken when inode are evicted from
> memory, too, so contention on the inode hash lock will be much worse
> when we are cycling inodes through the cache compared to when we are
> just doing hot cache lookups.
> 
> > So his optimization is
> > orthogonal to your hash bit lock improvements AFAICT.
> 
> Not really. RCU for lookups is not necessary when hash-bl is used.
> The new apis and conditional locking changes needed for RCU to work
> are not needed with hash-bl. hash-bl scales and performs the same
> regardless of whether the workload is cache hot or cache-cold.
> 
> And the work is almost all done already - it just needs someone with
> time to polish it for merge.
> 
> > Sure his optimization
> > just ~halves the lock hold time for uncached cases (for cached it
> > completely eliminates the lock acquisition but I agree these are not that
> > interesting) so it is not a fundamental scalability improvement but still
> > it is a nice win for a contended lock AFAICT.
> 
> Yes, but my point is that it doesn't get rid of the scalability
> problem - it just kicks it down the road for small machines and
> people with big machines will continue to suffer from the global
> lock contention cycling inodes through the inode cache causes...
> 
> That's kinda my point - adding RCU doesn't fix the scalability
> problem, it simply hides a specific symptom of the problem *in some
> environments* for *some worklaods*. hash-bl works for everyone,
> everywhere and for all workloads, doesn't require new APIs or
> conditional locking changes, and th work is mostly done. Why take a
> poor solution when the same amount of verification effort would
> finish off a complete solution?
> 
> The hash-bl patchset is out there - I don't have time to finish it,
> so anyone who has time to work on inode hash lock scalability issues
> is free to take it and work on it. I may have written it, but I
> don't care who gets credit for getting it merged. Again: why take
> a poor solution just because the person who wants the scalability
> problem fixed won't pick up the almost finished work that has all
> ready been done?
> 

My patch submission explicitly states that it does not fix the
scalability problem but merely reduces it, so we are in agreement on
this bit. My primary point being that absent full solutions the
benefit/effort ratio is pretty good, but people are free to disagree
with it.

This conversation is going in circles, so how about this as a way
forward:

1. you could isolate the hash-bl parts from your original patchset (+
rebase) and write a quick rundown what needs to be done for this to be
committable
2. if you think you can find the time to do the work yourself in the
foreseeable future you could state the rough timeline (the thing will
probably want to miss this merge cycle anyway so there is plenty of
time)
3. if you can't commit to the work yourself you can look for someone to
do it for you. while you suggested that on the list there were no takers
(for example someone else could have stepped in after I said I'm not
going to do it, but that did not happen). perhaps you can prod people at
your dayjob and whatever non-list spots.

If you can't do the work in the foreseeable future (understandable) and
there are no takers (realistically I suspect there wont be) then you are
going to have stop opposing my patch on the grounds that hash-bl exists.

I don't know how much work is needed to sort it out, it is definitely
much more than what was needed for my thing, which is in part why I did
not go for hash-bl myself.

Of course there may be reasons to not include my patch regardless of the
state of hash-bl. If you have any then I suggest you state them for the
vfs folk to consider. If you intend to write it should not go in on the
because it does not fully fix the problem, then I note the commit
message both concedes there is a limitation and provides a justification
for inclusion despite of it. Merely stating there is still a scalability
ceiling does not address it. Claiming it adds too much complexity for
the reported benefit is imo not legit, but again it is a judgment call
to make by the vfs folk.

Right now the v4 landed in a vfs.inode.rcu branch. It really does not
have to reach master if someone gets hash-bl to a state where the vfs
cabal is happy with it. I don't know if Christian intends to submit it
to master in the upcomming merge cycle, it is perfectly fine with me if
it does not happen. Perhaps it even should not happen if the hash-bl
gets a sensible timeline. Even so, should my patch land in master and
hash-bl get work done at a much later date, there is no difficulty added
to it stemming from my thing -- at worst some trivial editing to resolve
a merge conflict.

And with this message I'm done with the entire ordeal, with 2 exceptions:
- if there is a bug reported against my patch i'm going to investigate
- if my patch is stalled in the vfs.inode.rcu branch for weeks and there
  is no replacement in sight (hash-bl, rhashtable, ${whatever_else}), I'm
  going to prod about it

Cheers.