[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-id: <175097828167.2280845.5635569182786599451@noble.neil.brown.name>
Date: Fri, 27 Jun 2025 08:51:21 +1000
From: "NeilBrown" <neil@...wn.name>
To: "Song Liu" <songliubraving@...a.com>
Cc: "Tingmao Wang" <m@...wtm.org>,
Mickaël Salaün <mic@...ikod.net>,
"Song Liu" <song@...nel.org>, "bpf@...r.kernel.org" <bpf@...r.kernel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-security-module@...r.kernel.org"
<linux-security-module@...r.kernel.org>,
"brauner@...nel.org" <brauner@...nel.org>,
"Kernel Team" <kernel-team@...a.com>, "andrii@...nel.org" <andrii@...nel.org>,
"eddyz87@...il.com" <eddyz87@...il.com>, "ast@...nel.org" <ast@...nel.org>,
"daniel@...earbox.net" <daniel@...earbox.net>,
"martin.lau@...ux.dev" <martin.lau@...ux.dev>,
"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
"jack@...e.cz" <jack@...e.cz>, "kpsingh@...nel.org" <kpsingh@...nel.org>,
"mattbobrowski@...gle.com" <mattbobrowski@...gle.com>,
Günther Noack <gnoack@...gle.com>
Subject: Re: [PATCH v5 bpf-next 0/5] bpf path iterator
On Fri, 27 Jun 2025, Song Liu wrote:
>
>
> > On Jun 26, 2025, at 3:22 AM, NeilBrown <neil@...wn.name> wrote:
>
> [...]
>
> >> I guess I misunderstood the proposal of vfs_walk_ancestors()
> >> initially, so some clarification:
> >>
> >> I think vfs_walk_ancestors() is good for the rcu-walk, and some
> >> rcu-then-ref-walk. However, I don’t think it fits all use cases.
> >> A reliable step-by-step ref-walk, like this set, works well with
> >> BPF, and we want to keep it.
> >
> > The distinction between rcu-walk and ref-walk is an internal
> > implementation detail. You as a caller shouldn't need to think about
> > the difference. You just want to walk. Note that LOOKUP_RCU is
> > documented in namei.h as "semi-internal". The only uses outside of
> > core-VFS code is in individual filesystem's d_revalidate handler - they
> > are checking if they are allowed to sleep or not. You should never
> > expect to pass LOOKUP_RCU to an VFS API - no other code does.
> >
> > It might be reasonable for you as a caller to have some control over
> > whether the call can sleep or not. LOOKUP_CACHED is a bit like that.
> > But for dotdot lookup the code will never sleep - so that is not
> > relevant.
>
> Unfortunately, the BPF use case is more complicated. In some cases,
> the callback function cannot be call in rcu critical sections. For
> example, the callback may need to read xatter. For these cases, we
> we cannot use RCU walk at all.
I really think you should stop using the terms RCU walk and ref-walk. I
think they might be focusing your thinking in an unhelpful direction.
The key issue about reading xattrs is that it might need to sleep.
Focusing on what might need to sleep and what will never need to sleep
is a useful approach - the distinction is wide spread in the kernel and
several function take a flag indicating if they are permitted to sleep,
or if failure when sleeping would be required.
So your above observation is better described as
The vfs_walk_ancestors() API has an (implicit) requirement that the
callback mustn't sleep. This is a problem for some use-cases
where the call back might need to sleep - e.g. for accessing xattrs.
That is a good and useful observation. I can see three possibly
responses:
1/ Add a vfs_walk_ancestors_maysleep() API for which the callback is
always allowed to sleep. I don't particularly like this approach.
2/ Use repeated calls to vfs_walk_parent() when the handling of each
ancestor might need to sleep. I see no problem with supporting both
vfs_walk_ancestors() and vfs_walk_parent(). There is plenty of
precedent for having different interfaces for different use cases.
3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
If the callback finds that it needs to sleep but that "may sleep"
isn't set, it returns some well known status, like -EWOULDBLOCK (or
-ECHILD). It can expect to be called again but with "may sleep" set.
This is my preferred approach. There is precedent with the
d_revalidate callbacks which works like this.
I suspect that accessing xattrs might often be possible without
sleeping. It is conceivable that we could add a "may sleep" argument
to vfs_getxattr() so that it could still often be used without
requiring vfs_walk_ancestors() to permit sleeping.
This would almost certainly require a clear demonstration that
there was a performance cost in not having the option of non-sleeping
vfs_getxattr().
>
> > I strongly suggest you stop thinking about rcu-walk vs ref-walk. Think
> > about the needs of your code. If you need a high-performance API, then
> > ask for a high-performance API, don't assume what form it will take or
> > what the internal implementation details will be.
>
> At the moment, we need a ref-walk API on the BPF side. The RCU walk
> is a totally separate topic.
Do you mean "we need step-by-step walking" or do you mean "we need to
potentially sleep for each ancestor"? These are conceptually different
requirements, but I cannot tell which you mean when you talk about "RCU
walk".
Thanks,
NeilBrown
>
> > I think you already have a clear answer that a step-by-step API will not
> > be read-only on the dcache (i.e. it will adjust refcounts) and so will
> > not be high performance. If you want high performance, you need to
> > accept a different style of API.
>
> Agreed.
>
> Thanks,
> Song
>
>
Powered by blists - more mailing lists