Message-ID: <3643244.iIbC2pHGDl@7940hx>
Date: Thu, 17 Jul 2025 10:37:41 +0800
From: Menglong Dong <menglong.dong@...ux.dev>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Jiri Olsa <jolsa@...nel.org>, Andrii Nakryiko <andrii@...nel.org>,
Menglong Dong <menglong8.dong@...il.com>,
Steven Rostedt <rostedt@...dmis.org>, bpf <bpf@...r.kernel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>,
Eduard Zingerman <eddyz87@...il.com>, Song Liu <song@...nel.org>,
Yonghong Song <yonghong.song@...ux.dev>,
John Fastabend <john.fastabend@...il.com>, KP Singh <kpsingh@...nel.org>,
Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>,
LKML <linux-kernel@...r.kernel.org>,
Network Development <netdev@...r.kernel.org>
Subject:
Re: multi-fentry proposal. Was: [PATCH bpf-next v2 02/18] x86,bpf: add
bpf_global_caller for global trampoline
On Thursday, July 17, 2025 10:13 AM Alexei Starovoitov <alexei.starovoitov@...il.com> wrote:
> On Wed, Jul 16, 2025 at 6:51 PM Menglong Dong <menglong.dong@...ux.dev> wrote:
> >
> > On Thursday, July 17, 2025 8:59 AM Alexei Starovoitov <alexei.starovoitov@...il.com> wrote:
> > > On Wed, Jul 16, 2025 at 6:06 AM Menglong Dong <menglong.dong@...ux.dev> wrote:
> > > >
> > > > On Wednesday, July 16, 2025 12:35 AM Alexei Starovoitov <alexei.starovoitov@...il.com> wrote:
> > > > > On Tue, Jul 15, 2025 at 1:37 AM Menglong Dong <menglong.dong@...ux.dev> wrote:
> > > > > >
> > > > > >
> > > > > > On 7/15/25 10:25, Alexei Starovoitov wrote:
> > > > [......]
> > > > > >
> > > > > > According to my benchmark, saving/restoring *5* args has ~5%
> > > > > > overhead when compared with the *0*-arg variant. The save/restore
> > > > > > of regs is fast, but it still needs 12 insns, which can produce ~6% overhead.
> > > > >
> > > > > I think it's an ok trade off, because with one global trampoline
> > > > > we do not need to call rhashtable lookup before entering bpf prog.
> > > > > bpf prog will do it on demand if/when it needs to access arguments.
> > > > > This will compensate for a bit of lost performance due to extra save/restore.
> > > >
> > > > I don't understand here :/
> > > >
> > > > The rhashtable lookup is done at the beginning of the global trampoline,
> > > > which is called before we enter the bpf progs. The bpf progs are stored
> > > > in the kfunc_md, and we need to get them from the hash table.
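
For reference, a stripped-down sketch of what the global trampoline
entry does; kfunc_md_get() and the struct layout here are simplified
and may not match the series exactly:

/* sketch: simplified from the kfunc_md idea in this series */
struct kfunc_md {
        unsigned long func;             /* ip of the traced function */
        struct bpf_prog **progs;        /* progs attached to this func */
        int nr_progs;
};

/* runs at the start of bpf_global_caller, before any bpf prog */
static notrace void gtramp_run_progs(unsigned long ip, void *ctx)
{
        struct kfunc_md *md;
        int i;

        md = kfunc_md_get(ip);          /* the rhashtable lookup */
        if (!md)
                return;

        for (i = 0; i < md->nr_progs; i++)
                bpf_prog_run(md->progs[i], ctx);
}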
> > >
> > > Ahh. Right.
> > >
> > > Looking at the existing bpf trampoline... It has complicated logic
> > > to handle livepatching and tailcalls. Your global trampoline
> > > doesn't, and once that is added, it feels like it will end up looking
> > > just as complex as the current one.
> > > So I think we better repurpose what we have.
> > > Maybe we can rewrite the existing one in C too.
> >
> > You are right, the tailcalls are not handled yet. But livepatching
> > is already handled, as we always get the origin ip from the stack
> > and call it, just like how the bpf trampoline handles livepatching.
> > So no additional handling is needed here.
> >
> > >
> > > How about the following approach.
> > > I think we discussed something like this in the past
> > > and Jiri tried to implement something like this.
> > > Andrii reminded me recently about it.
> > >
> > > Say, we need to attach prog A to 30k functions.
> > > 10k with 2 args, 10k with 3 args, and 10k with 7 args.
> > > We can generate 3 _existing_ bpf trampolines for 2,3,7 args
> > > with hard coded prog A in there (the cookies would need to be
> > > fetched via binary search similar to kprobe-multi).
> > > The arch_prepare_bpf_trampoline() supports BPF_TRAMP_F_ORIG_STACK.
> > > So one 2-arg trampoline will work to invoke prog A in all 10k 2-arg functions.
> > > We don't need to match types, but we have to check that the
> > > btf_func_model-s are the same.
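
For the cookie part, a sketch along the lines of what kprobe-multi
does today, i.e. a bsearch over per-link sorted arrays (the link
struct and field names here are hypothetical):

#include <linux/bsearch.h>

static int ip_cmp(const void *a, const void *b)
{
        const unsigned long *x = a, *y = b;

        if (*x == *y)
                return 0;
        return *x < *y ? -1 : 1;
}

/* addrs[] is sorted once at link creation; cookies[] is kept in sync */
static u64 tramp_link_cookie(struct tramp_link *link, unsigned long ip)
{
        unsigned long *addr;

        addr = bsearch(&ip, link->addrs, link->cnt, sizeof(ip), ip_cmp);

        return addr ? link->cookies[addr - link->addrs] : 0;
}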
> > >
> > > Menglong, your global trampoline for 0,1,..6 args works only for x86,
> > > where the sizes of args in btf_func_model don't matter, but that's
> > > not the correct mental model to use.
> > >
> > > The above "10k with 2 args" is a simplified example.
> > > We will need an arch specific callback is_btf_func_model_equal()
> > > that will compare func models in arch specific ways.
> > > For x86-64 the number of args is all it needs.
> > > For other archs it will compare sizes and flags too.
> > > So 30k functions will be sorted into
> > > 10k with btf_func_model_1, 10k with btf_func_model_2 and so on.
> > > And the corresponding number of equivalent trampolines will be generated.
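
A sketch of what such a comparator could look like, assuming the
current struct btf_func_model fields (the #ifdef is only to
illustrate the per-arch difference):

/* arch specific: can two func models share one trampoline? */
static bool is_btf_func_model_equal(const struct btf_func_model *a,
                                    const struct btf_func_model *b)
{
        if (a->nr_args != b->nr_args)
                return false;

#ifdef CONFIG_X86_64
        /* args are passed in 8-byte regs; the count is all that matters */
        return true;
#else
        /* other archs have to compare arg sizes and flags too */
        return !memcmp(a->arg_size, b->arg_size, a->nr_args) &&
               !memcmp(a->arg_flags, b->arg_flags, a->nr_args);
#endif
}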
> > >
> > > Note there will be no actual BTF types. All args will be untyped and
> > > untrusted unlike current fentry.
> > > We can go further and sort 30k functions by comparing BTFs
> > > instead of btf_func_model-s, but I suspect 30k funcs will be split
> > > into several thousand exact BTFs. At that point multi-fentry
> > > benefits are diminishing and we might as well generate 30k unique
> > > bpf trampolines for 30k functions and avoid all the complexity.
> > > So I would sort by btf_func_model compared by arch specific comparator.
> > >
> > > Now say prog B needs to be attached to another 30k functions.
> > > If all 30k+30k functions are different then it's the same as
> > > the previous step.
> > > Say, prog A is attached to 10k funcs with btf_func_model_1.
> > > If prog B wants to attach to the exact same func set then we
> > > just regenerate bpf trampoline with hard coded progs A and B
> > > and reattach.
> > > If not then we need to split the set into up to 3 sets.
> > > Say, prog B wants 5k funcs, but only 1k funcs are common:
> > > (prog_A, 9k func with btf_func_model_1) -> bpf trampoline X
> > > (prog_A, prog_B, 1k funcs with btf_func_model_1) -> bpf trampoline Y
> > > (prog_B, 4k funcs with btf_func_model_1) -> bpf trampoline Z
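
The split itself is just a merge-walk over two sorted ip sets,
something like this (the set helpers are hypothetical), producing
the three sets behind trampolines X/Y/Z above:

/* split sorted ip arrays a and b into a-only / common / b-only */
static void split_func_sets(const unsigned long *a, int na,
                            const unsigned long *b, int nb,
                            struct tramp_set *only_a,
                            struct tramp_set *common,
                            struct tramp_set *only_b)
{
        int i = 0, j = 0;

        while (i < na && j < nb) {
                if (a[i] < b[j]) {
                        set_add_func(only_a, a[i++]);
                } else if (a[i] > b[j]) {
                        set_add_func(only_b, b[j++]);
                } else {
                        set_add_func(common, a[i++]);
                        j++;
                }
        }
        while (i < na)
                set_add_func(only_a, a[i++]);
        while (j < nb)
                set_add_func(only_b, b[j++]);
}

Each non-empty result set then gets its trampoline (re)generated
with the progs that remain attached to it.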
> > >
> > > And so on when prog C needs to be attached.
> > > At detach time we can merge sets/trampolines,
> > > but for now we can leave it all fragmented.
> > > Unlike regular fentry progs, the multi-fentry progs are not going to
> > > be attached for a long time. So we can reduce the detach complexity.
> > >
> > > The nice part of the algorithm is that coexistence of fentry
> > > and multi-fentry is easy.
> > > If fentry is already attached to some function we just
> > > attach multi-fentry prog to that bpf trampoline.
> > > If multi-fentry was attached first and fentry needs to be attached,
> > > we create a regular bpf trampoline and add both progs there.
> >
> > This is not as easy as it sounds, and it is exactly how I handle the
> > coexistence now:
> >
> > https://lore.kernel.org/bpf/20250528034712.138701-16-dongml2@chinatelecom.cn/
> > https://lore.kernel.org/bpf/20250528034712.138701-17-dongml2@chinatelecom.cn/
> > https://lore.kernel.org/bpf/20250528034712.138701-18-dongml2@chinatelecom.cn/
>
> hmm. exactly? That's very different.
> You're relying on kfunc_md for prog list.
> The above proposal doesn't need kfunc_md in the critical path.
> All progs are built into the trampolines.
>
> > The most difficult part is that we need a way to replace the
> > multi-fentry with fentry for the function in ftrace atomically. Of
> > course, we can remove the global trampoline first, and then attach
> > the bpf trampoline, which will make things much easier. But the
> > progs in fentry-multi will be suspended for a short window.
>
> I don't follow.
> In the above proposal fentry attach/detach is atomic.
> Prepare a new trampoline, single call to ftrace to modify_fentry().

modify_fentry() operates on the same ftrace_ops. For example, we
have bpf trampoline A, whose corresponding ftrace_ops is opsA.
When the image of trampolineA is updated, we call modify_fentry()
on opsA to update its direct call.

When we talk about the coexistence, it means functionA is attached
to the global trampoline B, whose ftrace_ops is opsB. We can't call
modify_fentry(trampolineA, new_addr) here, as opsA is not registered
yet. And we can't call register_fentry either, as functionA was
already added to direct_functions when we registered opsB.

So we need a way to do such a transition.
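
What seems to be missing is a primitive that re-points one ip from
one direct ops to another in a single step, something like this
(purely hypothetical, no such ftrace API exists today):

/*
 * Atomically re-point @ip from the direct addr installed via
 * @from_ops (the global trampoline) to @new_addr owned by
 * @to_ops (the per-function bpf trampoline), without a window
 * where the function has no trampoline attached at all.
 */
int ftrace_direct_migrate(struct ftrace_ops *from_ops,
                          struct ftrace_ops *to_ops,
                          unsigned long ip, unsigned long new_addr);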
>
> > >
> > > The intersecting and sorting by btf_func_model is not trivial,
> > > but we can hold the global trampoline_mutex, so races are not a concern.
> > >
> > > Example:
> > > bpf_link_A is a set of:
> > > (prog_A, funcs X,Y with btf_func_model_1)
> > > (prog_A, funcs N,M with btf_func_model_2)
> > >
> > > To attach prog B via bpf_link_B that wants:
> > > (prog_B, funcs Y,Z with btf_func_model_1)
> > > (prog_B, funcs P,Q with btf_func_model_3)
> > >
> > > walk all existing links, intersect and split, and update the links.
> > > At the end:
> > >
> > > bpf_link_A:
> > > (prog_A, funcs X with btf_func_model_1)
> > > (prog_A, prog_B funcs Y with btf_func_model_1)
> > > (prog_A, funcs N,M with btf_func_model_2)
> > >
> > > bpf_link_B:
> > > (prog_A, prog_B funcs Y with btf_func_model_1)
> > > (prog_B, funcs Z with btf_func_model_1)
> > > (prog_B, funcs P,Q with btf_func_model_3)
> > >
> > > When a link is detached: walk its own tuples, remove the prog,
> > > if nr_progs == 0 -> detach corresponding trampoline,
> > > if nr_progs > 0 -> remove prog and regenerate trampoline.
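
In code, the detach path would then be mechanical (sketch; the
types and helpers here are hypothetical):

static void mfentry_link_release(struct mfentry_link *link)
{
        struct tramp_set *set, *tmp;

        list_for_each_entry_safe(set, tmp, &link->sets, node) {
                set_del_prog(set, link->prog);
                if (!set->nr_progs)
                        tramp_set_detach(set);  /* last prog: tear down */
                else
                        tramp_set_regen(set);   /* re-JIT without the prog */
        }
}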
> > >
> > > If fentry prog C needs to be attached to N it might split bpf_link_A:
> > > (prog_A, funcs X with btf_func_model_1)
> > > (prog_A, prog_B funcs Y with btf_func_model_1)
> > > (prog_A, funcs M with btf_func_model_2)
> > > (prog_A, prog_C funcs N with _fentry_)
> > >
> > > Last time we gave up on it because we discovered that
> > > overlap support was too complicated, but I cannot recall now
> > > what it was :)
> > > Maybe all of the above is repeating some old mistakes.
> >
> > As far as I remember, this is exactly Jiri's solution, and this is
> > the part of the discussion that I know of:
> >
> > https://lore.kernel.org/bpf/ZfKY6E8xhSgzYL1I@krava/
>
> Yes. It's similar, but somehow it feels simple enough now.
> The algorithms for both detach and attach fit on one page,
> and everything is uniform. There is no spaghetti of corner cases.
>