[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3g3arsrwnyvv562v2rsfv2ms4ht4mk45vwdkvssxkrjhfjtpdz@umyx5tl2du7o>
Date: Tue, 2 Jul 2024 22:33:45 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Christian Brauner <brauner@...nel.org>,
kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev, lkp@...el.com,
Linux Memory Management List <linux-mm@...ck.org>, linux-kernel@...r.kernel.org, ying.huang@...el.com,
feng.tang@...el.com, fengwei.yin@...el.com
Subject: Re: [linux-next:master] [lockref] d042dae6ad: unixbench.throughput
-33.7% regression
On Tue, Jul 02, 2024 at 11:41:44AM -0700, Linus Torvalds wrote:
> On Tue, 2 Jul 2024 at 10:58, Mateusz Guzik <mjguzik@...il.com> wrote:
> >
> > Suppose the rcu fast path lookup reads the dentry seqc, then does all
> > the legitimize_mnt and other work. Everything, except modifying the
> > lockref. The caller is given a mnt to put (per-cpu scalable), dentry
> > seqc read before any of the path validation and an indication this is
> > rcu.
>
> Yes.
>
> > Then after whatever is done if the seqc still matches this is the same
> > as if there was lockref get/put around it.
>
> So this is partly why I was thinking of a callback. That "check
> sequence number afterwards" is still important. And if it's a
> callback, it can be done in the path walking code, and it can go on
> and say "oh, I'll need to redo this without RCU".
>
> If it's a "we returned a dentry under RCU", suddenly the caller has to
> know this about the name lookup and do the repeating by hand.
>
> And as long as we don't expose it to modules and only use it for
> "stat()" and friends, I'm ok with it, but I'm just saying that it's
> all a bit scary.
>
> > The only worry is pointers suddenly going NULL or similar as
> > dentry/inode is looked at. To be worked out on per-syscall basis.
>
> We have subtle rules wrt dentry->d_inode. It can indeed become NULL at
> any time during the RCU walk, since what protects it is the d_lock and
> the dentry count.
>
> The inode itself is then RCU-free'd, so it will *exist*, but you can't
> just blindly use dentry->d_inode itself while under RCU.
>
Not only that, I think dentry can transition back to positive with
another inode.
> Which is why it's cached in 'struct nameidata', and we validate it
> with nd->seq when it's loaded. And why things like may_lookup() use
> nd->inode, not the dentry.
>
> And then a sequence point failure will return -ECHILD and do the walk
> again, while a callback success with all the sequence numbers matching
> would return -ECALLBACK or whatever, so that the caller would know
> "the stat information was already successfully completed by the
> callback".
>
> Anyway, that was my handwavy "this is why I was thinking of a
> callback" thing. But it's also an example of just how nasty and subtle
> this all is.
>
> But I'm convinced this is all eminently *solvable*. There's nothing
> fundamental here. Just a lot of small nasty details.
>
If you are politely by lkml standards suggesting I should probably drop
the idea due to unforseen complexities, I'll note I intend to write this
for kicks as time permits. If it turns out to be buggy af or there will
be nobody with time to review it, it simply wont go in and that will be
that.
I do think I have an ok understanding what's up, for example did you
know xfs does not honor rcu grace periods when recycling inodes?
https://lore.kernel.org/all/20231205113833.1187297-1-alexjlzheng@tencent.com/
So this would have to be opt-in per filesystem, probably stuffed
somewhere within the inode or dentry. I am definitely not reviewing all
the other filesystems for sanity on this front.
Rather, one could look over tmpfs, ext4, btrfs and maybe ask Kent to
sort out bcachefs (if necessary) and call it a day.
Sounds like you are deadset on the callback approach. I'm not going to
die on the inline hill, but I will spell it out so that we are on the
same page (and I have a question too).
In pseudo-code my stuff would like this (names are for ilustrative
purposes):
struct rcunameidata {
....
bool in_rcu;
};
...
struct rcunameidata *rnd;
error = vfs_rcu_magic_lookup(&rnd, ....);
if (error)
return error;
if (rnd->in_rcu) {
/*
* fast path goes here, callback code would be identical up to
* the point below
*/
/*
* Now validate
*/
error = vfs_rcu_magic_lookup_validate_or_drop(&rnd, ....))
if (error == 0) /* things worked out */
return export_stuff_to_the_user(....);
if (error < 0) /* fail */
return error;
}
/*
* slowpath goes here
*/
/*
* all done, now whack the lookup state. the routine returns void
*/
vfs_rcu_magic_lookup_finish(&rnd, ....);
if (!error)
error = export_stuff_to_the_user(....);
....
Can you pseudo-code how would the consumer look like in your case? Do
you want the callback to execute for both slow and fastpath and switch
on the flag? It is rather unclear what you are proposing here.
fwiw I think the above would serve as an easy to copy-paste idiom for
the few consumers which want it. All the complexity in their case is the
in_rcu block which wont go away with a callback. If you still want the
callback, callback it is.
Powered by blists - more mailing lists