Message-ID: <CAHk-=whagTfq=EgwpuywUUu0ti7PPQuE_ftVjZOVrjnLwtS0Ng@mail.gmail.com>
Date: Tue, 2 Jul 2024 15:14:54 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Christian Brauner <brauner@...nel.org>, kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
lkp@...el.com, Linux Memory Management List <linux-mm@...ck.org>, linux-kernel@...r.kernel.org,
ying.huang@...el.com, feng.tang@...el.com, fengwei.yin@...el.com
Subject: Re: [linux-next:master] [lockref] d042dae6ad: unixbench.throughput -33.7% regression
On Tue, 2 Jul 2024 at 14:15, Mateusz Guzik <mjguzik@...il.com> wrote:
>
> I asked you something in the previous e-mail though (with some nastiness
> of the problem pointed out) concerning handling of slow vs fastpath,
> here it is again:
>
> [..]for example did you know xfs does not honor rcu grace periods when
> recycling inodes?
I don't think it should matter.
Yes, the vfs layer will access the inode, but then the sequence number
checks should always verify that whatever access we did is then
validated after-the-fact.
And yes, any "stat under RCU" will obviously have to do the sequence
number check after having copied the data, the same way we now do it
after the "lockref_get_not_dead()".
So the RCU part just means that at least the allocation still exists.
But it might not have been the *right* inode, and that's the sequence
number check.
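That's just the usual seqcount dance, roughly (using the dentry
sequence count as the example):

        unsigned seq = read_seqcount_begin(&dentry->d_seq);
        /* speculatively copy whatever we need, still under RCU */
        ...
        if (read_seqcount_retry(&dentry->d_seq, seq))
                return -ECHILD;  /* changed under us, discard the copy */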
Or do you see something I'm missing? I'm not loving how we deal with
dentry->d_inode, but considering that the whole point of the RCU
lookup is that we don't hold any locks, I think it's almost
unavoidable.
> Sounds like you are deadset on the callback approach.
Hey, I happen to think that it's the easiest model, but you can easily
prove me wrong.
Almost all of my worries can probably be handled by just having a
separate "do this after the operation" helper function.
Also, maybe the "callback" model is the wrong one, and instead the
right one is to just say "the only thing that wants this is the 'stat'
family", and simply have a "path_statx()" function that is all inside
fs/namei.c, and that doesn't take any callbacks, but instead just
fills in the 'struct kstat'.
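Something with roughly this shape, say (signature completely made up):

        /* fs/namei.c - no callbacks, just fill in *stat or fail */
        int path_statx(int dfd, struct filename *name, unsigned flags,
                       struct kstat *stat, u32 request_mask);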
But if you have better ideas, go wild.
The only thing I really *don't* want is some kind of expansion of the
"filename_lookup()" function that takes a new flag, and that then
random people can start using, and that is so fragile that everybody
will get it wrong.
Your 'vfs_rcu_magic_lookup()' looks like it would work, and we'd just
have to make the rule that it's never exported to modules or anything
like that, simply because I don't trust random drivers or vendor
modules to get it right.
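(In practice that presumably just means never adding an
EXPORT_SYMBOL() for it, and maybe keeping the declaration in
fs/internal.h, so modules can't even get at it.)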
> Can you pseudo-code how would the consumer look like in your case? Do
> you want the callback to execute for both slow and fastpath and switch
> on the flag? It is rather unclear what you are proposing here.
So my half-arsed thinking on how I'd do it was roughly:
- add a new "rcu_callback" function pointer to 'struct nameidata'
  (or, if deciding to just fill in 'struct kstat', just the pointer
  to that)
- in complete_walk(), where we currently do that

        if (!try_to_unlazy(nd))
                return -ECHILD;

  instead do something like this:

        if (nd->rcu_callback) {
                int ret = nd->rcu_callback(nd);
                if (ret) {
                        if (!validate_sequence_counts(nd))
                                ret = -ECHILD;
                        nd->path.mnt = NULL;
                        nd->path.dentry = NULL;
                        leave_rcu(nd);
                        return ret;
                }
        }

        if (!try_to_unlazy(nd))
                return -ECHILD;
- add a new version of filename_lookup() that takes that new
  callback, and just sets it in nameidata before it does that

        retval = path_lookupat(&nd, flags | LOOKUP_RCU, path);

  thing.
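Concretely, the two new pieces might look something like this - the
wrapper name is invented, and all the set_nameidata()/audit/ESTALE
plumbing is elided:

        /* in 'struct nameidata': */
        int (*rcu_callback)(struct nameidata *nd);

        /* hypothetical wrapper - only ever called from inside fs/ */
        static int filename_lookup_with_callback(int dfd, struct filename *name,
                        unsigned flags, struct path *path,
                        int (*cb)(struct nameidata *))
        {
                struct nameidata nd;
                int retval;

                set_nameidata(&nd, dfd, name, NULL);
                nd.rcu_callback = cb;
                retval = path_lookupat(&nd, flags | LOOKUP_RCU, path);
                /* -ECHILD means a sequence count failed somewhere:
                 * re-do the whole walk without RCU, like today. The
                 * callback only fires in RCU mode (see the
                 * complete_walk() hunk above), so it stays quiet on
                 * the retry. */
                if (unlikely(retval == -ECHILD))
                        retval = path_lookupat(&nd, flags, path);
                ...
                restore_nameidata();
                return retval;
        }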
IOW, the point is that the callback can decide that it has gathered
enough information, and basically "abort" the RCU path walk by
returning an error.
And if it does *not* return an error to abort (or if we never even get
there, because something else caused us to exit RCU mode), we act
exactly like we used to, and nothing has changed.
But if it *does* return an error, we stop walking, never take the refs
on the path, and we validate the sequence count.
Again, if the sequence count validation failed, we turn the error into
ECHILD, so that the path walking just gets re-done without RCU. So a
sequence count failure means that we did end up calling the callback
function, but we basically discard the result.
End result: the path walking will return that new error only if the
callback (a) existed, (b) returned that error and (c) the sequence
counts still validated properly afterwards.
And if for some reason it *cannot* fill in stat data (say, because the
filesystem is one that doesn't support "stat under RCU" or whatever),
the callback returns zero, and we do all the things we used to do.
And the *user* of this would basically just put in a "stat_rcu()"
callback that fills in the stat data and returns some error code that
path walking can't return, which then informs the caller that the stat
data has already been filled in and can be used as-is.
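And that callback itself could be about as dumb as this (both helpers
are made-up names - the real meat is the filesystem-specific "fill in
the kstat without blocking" part):

        static int stat_rcu(struct nameidata *nd)
        {
                /* Can this inode be stat'ed safely under RCU at all?
                 * If not, return 0 and the normal ref-walk path runs
                 * exactly as before. */
                if (!can_stat_under_rcu(nd))
                        return 0;

                /* Fill in the kstat the caller hung off nameidata */
                fill_kstat_under_rcu(nd);
                return -ECALLBACK;
        }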
So now the caller can look something like this:

        ret = filename_lookup_with_callback(... callback);
        if (ret == -ECALLBACK) {
                // this means that the stat info has already been
                // filled in, copy it to user mode
                copy_stat_to_user(..);
                return 0;
        }

        // "Real" error
        if (ret)
                return ret;

        // Ok, we have a path, we do what we used to do before
        error = vfs_getattr(&path, stat, request_mask, flags);
        ...
        path_put(&path);
Anyway, that was my thinking. It *feels* fairly minimal. But no, I
never did the patch, and I might not have thought of some
complication, and it is all pretty messy, and no, I do not want to
force you into this kind of model if you hate it.
I don't think your vfs_rcu_magic_lookup() function is all that
fundamentally different, but the advantage of the callback model is
that I *think* the above would be a fairly surgical and minimal
addition to the path walk.
I _think_ your vfs_rcu_magic_lookup() model would require a fair
amount of moving code around to break into that
        path_lookupat -> complete_walk -> try_to_unlazy
place and be able to restart etc. But maybe you have a clever plan.
Linus