[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <afe77383-fe56-4029-848e-1401e3297139@maowtm.org>
Date: Mon, 16 Jun 2025 01:24:09 +0100
From: Tingmao Wang <m@...wtm.org>
To: Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
Song Liu <song@...nel.org>
Cc: Mickaël Salaün <mic@...ikod.net>,
NeilBrown <neil@...wn.name>, bpf@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-security-module@...r.kernel.org, kernel-team@...a.com,
andrii@...nel.org, eddyz87@...il.com, ast@...nel.org, daniel@...earbox.net,
martin.lau@...ux.dev, viro@...iv.linux.org.uk, kpsingh@...nel.org,
mattbobrowski@...gle.com, amir73il@...il.com, repnop@...gle.com,
jlayton@...nel.org, josef@...icpanda.com, gnoack@...gle.com
Subject: Ref-less parent walk from Landlock (was: Re: [PATCH v3 bpf-next 1/5]
namei: Introduce new helper function path_walk_parent())
On 6/12/25 13:31, Christian Brauner wrote:
> On Thu, Jun 12, 2025 at 11:49:08AM +0200, Jan Kara wrote:
>> On Thu 12-06-25 11:01:16, Jan Kara wrote:
>>> On Wed 11-06-25 11:08:30, Song Liu wrote:
>>>> On Wed, Jun 11, 2025 at 10:50 AM Tingmao Wang <m@...wtm.org> wrote:
>>>>> [...]
>>>>> Aside from the "exposing mount seqcounts" problem, what do you think about
>>>>> the parent_iterator approach I suggested earlier? I feel that it is
>>>>> better than such a callback - more flexible, and also fits in right with
>>>>> the BPF API you already designed (i.e. with a callback you might then have
>>>>> to allow BPF to pass a callback?). There are some specifics that I can
>>>>> improve - Mickaël suggested some in our discussion:
>>>>>
>>>>> - Letting the caller take rcu_read_lock outside rather than doing it in
>>>>> path_walk_parent_start
>>>>>
>>>>> - Instead of always requiring a struct parent_iterator, allow passing in
>>>>> NULL for the iterator to path_walk_parent to do a reference walk without
>>>>> needing to call path_walk_parent_start - this way might be simpler and
>>>>> path_walk_parent_start/end can just be for rcu case.
>>>>>
>>>>> but what do you think about the overall shape of it?
>>>>
>>>> Personally, I don't have strong objections to this design. But VFS
>>>> folks may have other concerns with it.
>>>
>>> From what I've read above I'm not sure about details of the proposal but I
>>> don't think mixing of RCU & non-RCU walk in a single function / iterator is
>>> a good idea. IMHO the code would be quite messy. After all we have
>>> follow_dotdot_rcu() and follow_dotdot() as separate functions for a reason.
>>> Also given this series went through several iterations and we don't yet
>>> have an acceptable / correct solution suggests getting even the standard
>>> walk correct is hard enough. RCU walk is going to be only worse. So I'd
>>> suggest to get the standard walk finished and agreed on first and
>>> investigate feasibility of RCU variant later.
>>
>> OK, I've now read some of Tingmaon's and Christian's replies which I've
>> missed previously so I guess I now better understand why you complicate
>> things with RCU walking but still I'm of the opinion that we should start
>> with getting the standard walk working. IMHO pulling in RCU walk into the
>> iterator will bring it to a completely new complexity level...
>
> I would not want it in the first place. But I have a deep seated
> aversion to exposing two different variants.
Hi Christian, Jan, Song,
I do appreciate your thoughts here and thanks for taking the time to
explain. I just have some specific points which I would like you to
consider:
Taking a step back, maybe the specific designs need a bit more thought,
but are you at all open to the idea of letting other subsystems take
advantage of a rcu-based parent walk? Testing shows that for specific
cases of a deep directory hierarchy the speedup (for time in Landlock) can
be almost 60%, and still very significant for the average case. [1]
I think what I'm proposing here is basically what follow_dotdot_rcu
already does (aside from checking rename_seq, but actually that was mostly
a conservative check. I think we're good even if we just check dentry seq
across the path walk calls), and in fact given the latest suggestion to
base the path walk helper on a modified version of follow_dotdot (that
takes path argument instead of using nameidata), I can see one approach
here being to do the same for follow_dotdot_rcu (i.e. extracting the logic
from start to before "nd->next_seq = read_seqcount_begin(&parent->d_seq);").
That way, we will not be "inventing" any new code that messes with VFS
internals.
In respect to the comment from Jan, I'm putting the suggestion out right
now to avoid only surfacing this ask after Song's path iterator API has
just been merged. I'm not saying we have to do it here and now, but if
there is at all a possibility of incorporating rcu-based walk in this
helper (or a separate helper - I personally don't mind either way), I
would like to make sure that possibility stays open.
I'm happy to wait till Song's current patch is finished before continuing
this, but if there is strong objection to two separate APIs, I would
really appreciate if we can end up in a state where further change to
implement this is possible.
> Especially if the second variant wants or needs access to internal details
> such as mount or dentry sequence counts. I'm not at all in favor of that.
I don't want to expose VFS internals, but are you worried about even
making use of them? (well, rename_lock and d_seq is already accessible
from outside since they are defined in include/linux/dcache.h, and so it's
just the (readout of the) mount seqcount here that will gain additional
exposure, but maybe we can mark it with __private.)
Would it be less worrying if any checks against those seqcount values are
kept within follow_dotdot_rcu, but just that it is stored in an iterator
that outside code are supposed to treat as opaque? (We can maybe define
the semantic here as basically "this iterator makes sure your rcu walk
can't result in states where a reference-taking walk won't get to, as long
as you retry when the function returns -EAGAIN (or maybe -ECHILD)").
[1]: https://lore.kernel.org/all/cover.1748997840.git.m@maowtm.org/#t
Thanks a lot,
Tingmao
Powered by blists - more mailing lists