[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250611.faich0Chohg3@digikod.net>
Date: Wed, 11 Jun 2025 17:39:00 +0200
From: Mickaël Salaün <mic@...ikod.net>
To: Christian Brauner <brauner@...nel.org>
Cc: Tingmao Wang <m@...wtm.org>, Song Liu <song@...nel.org>,
Al Viro <viro@...iv.linux.org.uk>, amir73il@...il.com, andrii@...nel.org, ast@...nel.org,
bpf@...r.kernel.org, daniel@...earbox.net, eddyz87@...il.com, gnoack@...gle.com,
jack@...e.cz, jlayton@...nel.org, josef@...icpanda.com, kernel-team@...a.com,
kpsingh@...nel.org, linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-security-module@...r.kernel.org, martin.lau@...ux.dev, mattbobrowski@...gle.com,
repnop@...gle.com
Subject: Re: [PATCH v3 bpf-next 0/5] bpf path iterator
On Wed, Jun 11, 2025 at 01:36:46PM +0200, Christian Brauner wrote:
> On Mon, Jun 09, 2025 at 09:08:34AM +0100, Tingmao Wang wrote:
> > On 6/9/25 07:23, Song Liu wrote:
> > > On Sun, Jun 8, 2025 at 10:34 AM Tingmao Wang <m@...wtm.org> wrote:
> > > [...]
> > >> Hi Song, Christian, Al and others,
> > >>
> > >> Previously I proposed in [1] to add ability to do a reference-less parent
> > >> walk for Landlock. However, as Christian pointed out and I do agree in
> > >> hindsight, it is not a good idea to do things like this in non-VFS code.
> > >>
> > >> However, I still think this is valuable to consider given the performance
> > >> improvement, and after some discussion with Mickaël, I would like to
> > >> propose extending Song's helper to support such usage. While I recognize
> > >> that this patch series is already in its v3, and I do not want to delay it
> > >> by too much, putting this proposal out now is still better than after this
> > >> has merged, so that we may consider signature changes.
> > >>
> > >> I've created a proof-of-concept and did some brief testing. The
> > >> performance improvement attained here is the same as in [1] (with a "git
> > >> status" workload, median landlock overhead 35% -> 28%, median time in
> > >> landlock decreases by 26.6%).
> > >>
> > >> If this idea is accepted, I'm happy to work on it further, split out this
> > >> patch, update the comments and do more testing etc, potentially in
> > >> collaboration with Song.
> > >>
> > >> An alternative to this is perhaps to add a new helper
> > >> path_walk_parent_rcu, also living in namei.c, that will be used directly
> > >> by Landlock. I'm happy to do it either way, but with some experimentation
> > >> I personally think that the code in this patch is still clean enough, and
> > >> can avoid some duplication.
> > >>
> > >> Patch title: path_walk_parent: support reference-less walk
> > >>
> > >> A later commit will update the BPF path iterator to use this.
> > >>
> > >> Signed-off-by: Tingmao Wang <m@...wtm.org>
> > > [...]
> > >>
> > >> -bool path_walk_parent(struct path *path, const struct path *root);
> > >> +struct parent_iterator {
> > >> + struct path path;
> > >> + struct path root;
> > >> + bool rcu;
> > >> + /* expected seq of path->dentry */
> > >> + unsigned next_seq;
> > >> + unsigned m_seq, r_seq;
> > >
> > > Most of parent_iterator is not really used by reference walk.
> > > So it is probably just separate the two APIs?
> >
> > I don't mind either way, but I feel like it might be nice to just have one
> > style of APIs (i.e. an iterator with start / end / next vs just one
> > function), even though this is not totally necessary for the ref-taking
> > walk. After all, the BPF use case is iterator-based. This also means
> > that the code at the user's side (mostly thinking of Landlock here) is
> > slightly simpler.
> >
> > But I've not experimented with the other way. I'm open to both, and I'm
> > happy to send a patch later for a separate API (in that case that would
> > not depend on this and I might just start a new series).
> >
> > Would like to hear what VFS folks thinks of this first tho, and whether
> > there's any preference in one or two APIs.
>
> I really dislike exposing the sequence number for mounts an for
> dentries. That's just nonsense and a non-VFS low-level consumer of this
> API has zero business caring about any of that. It's easy to
> misunderstand, it's easy to abuse so that's not a good way of doing
> this. It's the wrong API.
>
> >
> > >
> > > Also, is it ok to make m_seq and r_seq available out of fs/?
>
> No, it's not.
>
> >
> > The struct is not intended to be used directly by code outside. Not sure
>
> That doesn't mean anything. It's simply the wrong API if it has to spill
> so much of its bowels.
>
> > what is the standard way to do this but we can make it private by e.g.
> > putting the seq values in another struct, if needed. Alternatively I
> > think we can hide the entire struct behind an opaque pointer by doing the
> > allocation ourselves.
> >
> > >
> > >> +};
> > >> +
> > >> +#define PATH_WALK_PARENT_UPDATED 0
> > >> +#define PATH_WALK_PARENT_ALREADY_ROOT -1
> > >> +#define PATH_WALK_PARENT_RETRY -2
> > >> +
> > >> +void path_walk_parent_start(struct parent_iterator *pit,
> > >> + const struct path *path, const struct path *root,
> > >> + bool ref_less);
> > >> +int path_walk_parent(struct parent_iterator *pit, struct path *next_parent);
> > >> +int path_walk_parent_end(struct parent_iterator *pit);
> > >
> > > I think it is better to make this rcu walk a separate set of APIs.
> > > IOW, we will have:
> > >
> > > int path_walk_parent(struct path *path, struct path *root);
> > >
> > > and
> > >
> > > void path_walk_parent_rcu_start(struct parent_iterator *pit,
> > > const struct path *path, const struct path *root);
> > > int path_walk_parent_rcu_next(struct parent_iterator *pit, struct path
> > > *next_parent);
> > > int path_walk_parent_rcu_end(struct parent_iterator *pit);
> >
> > (replied above)
>
> Exposing two sets of different APIs for essentially the same things is
> not going to happen.
>
> The VFS doesn't expose a rcu variant and a non-rcu variant for itself so
> we are absolutely not going to do that for outside stuff.
>
> It always does the try RCU first, then try to continue the walk by
> falling back to REF walk (e.g., via try_to_unlazy()). If that doesn't
> work then let the caller know and require them to decide whether to
> abort or redo everything in ref-walk.
That's indeed what is done by choose_mountpoint() (relying on
choose_mountpoint_rcu() when possible), but this proposal is about doing
a full path walk (i.e. multiple calls to path_walk_parent) within RCU.
>
> There's zero need in that scheme for the caller to see any of the
> internals of the VFS and that's what you should aim for.
Yes, but how could we detect if a full path walk is invalid (because of
a rename or mount change)?
Powered by blists - more mailing lists