Message-ID: <CAADnVQJ47PJXxjqES8BvtWkPq3fj9D0oTF6qqeNNpG66-_MGCg@mail.gmail.com>
Date: Wed, 16 Jul 2025 17:59:45 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Menglong Dong <menglong.dong@...ux.dev>, Jiri Olsa <jolsa@...nel.org>,
Andrii Nakryiko <andrii@...nel.org>
Cc: Menglong Dong <menglong8.dong@...il.com>, Steven Rostedt <rostedt@...dmis.org>,
bpf <bpf@...r.kernel.org>, Martin KaFai Lau <martin.lau@...ux.dev>,
Eduard Zingerman <eddyz87@...il.com>, Song Liu <song@...nel.org>,
Yonghong Song <yonghong.song@...ux.dev>, John Fastabend <john.fastabend@...il.com>,
KP Singh <kpsingh@...nel.org>, Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>,
LKML <linux-kernel@...r.kernel.org>, Network Development <netdev@...r.kernel.org>
Subject: multi-fentry proposal. Was: [PATCH bpf-next v2 02/18] x86,bpf: add
bpf_global_caller for global trampoline
On Wed, Jul 16, 2025 at 6:06 AM Menglong Dong <menglong.dong@...ux.dev> wrote:
>
> > On Wednesday, July 16, 2025 12:35 AM Alexei Starovoitov <alexei.starovoitov@...il.com> wrote:
> > On Tue, Jul 15, 2025 at 1:37 AM Menglong Dong <menglong.dong@...ux.dev> wrote:
> > >
> > >
> > > On 7/15/25 10:25, Alexei Starovoitov wrote:
> [......]
> > >
> > > According to my benchmark, the *5*-arg variant has ~5% overhead
> > > from save/restore when compared with the *0*-arg variant. The
> > > save/restore of regs is fast, but it still needs 12 insns, which
> > > can produce ~6% overhead.
> >
> > I think it's an ok trade off, because with one global trampoline
> > we do not need to call rhashtable lookup before entering bpf prog.
> > bpf prog will do it on demand if/when it needs to access arguments.
> > This will compensate for a bit of lost performance due to extra save/restore.
>
> I don't understand here :/
>
> The rhashtable lookup is done at the beginning of the global trampoline,
> which is called before we enter the bpf prog. The bpf progs are stored
> in the kfunc_md, and we need to get them from the hash table.
Ahh. Right.
Looking at the existing bpf trampoline... It has complicated logic
to handle livepatching and tailcalls. Your global trampoline
doesn't, and once that is added it will likely end up just as
complex as the current one.
So I think we'd better repurpose what we have.
Maybe we can rewrite the existing one in C too.
How about the following approach.
I think we discussed something like this in the past
and Jiri tried to implement something like this.
Andrii reminded me recently about it.
Say, we need to attach prog A to 30k functions.
10k with 2 args, 10k with 3 args, and 10k with 7 args.
We can generate 3 _existing_ bpf trampolines for 2,3,7 args
with hard coded prog A in there (the cookies would need to be
fetched via binary search similar to kprobe-multi).
The arch_prepare_bpf_trampoline() supports BPF_TRAMP_F_ORIG_STACK.
So one 2-arg trampoline will work to invoke prog A in all 10k 2-arg functions.
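Roughly (a hand-wavy sketch; im/image/tlinks setup is omitted and
model_2_args stands for the shared 2-arg func model):

	/* sketch: one trampoline image per btf_func_model, shared by every
	 * function with that model. BPF_TRAMP_F_ORIG_STACK makes the
	 * trampoline fetch the original function from the stack instead of
	 * a provided direct addr, so nothing function-specific is baked
	 * into the image.
	 */
	u32 flags = BPF_TRAMP_F_ORIG_STACK; /* plus whatever else the progs need */
	int err;

	err = arch_prepare_bpf_trampoline(im, image, image_end,
					  &model_2_args, flags,
					  &tlinks_with_prog_A,
					  NULL /* no fixed func addr */);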
We don't need to match types, but we do have to check that the
btf_func_model-s are the same.
Menglong, your global trampoline for 0,1,..,6 args works only for x86,
because it assumes btf_func_model is just the number of args and
ignores the sizes of args, but that's not the correct mental model to use.
The above "10k with 2 args" is a simplified example.
We will need an arch specific callback is_btf_func_model_equal()
that will compare func models in arch specific ways.
For x86-64 the number of args is all it needs.
For other archs it will compare sizes and flags too.
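Sketch (just to illustrate; the __weak default and the placement of
the override are my guesses):

	/* generic default: be conservative, require identical models */
	bool __weak is_btf_func_model_equal(const struct btf_func_model *a,
					    const struct btf_func_model *b)
	{
		return !memcmp(a, b, sizeof(*a));
	}

	/* x86-64 override (would live in arch/x86/net/bpf_jit_comp.c):
	 * all args are passed in 8-byte regs, so the count is enough.
	 */
	bool is_btf_func_model_equal(const struct btf_func_model *a,
				     const struct btf_func_model *b)
	{
		return a->nr_args == b->nr_args;
	}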
So 30k functions will be sorted into
10k with btf_func_model_1, 10k with btf_func_model_2 and so on.
And the corresponding number of equivalent trampolines will be generated.
Note there will be no actual BTF types. All args will be untyped and
untrusted unlike current fentry.
We can go further and sort 30k functions by comparing BTFs
instead of btf_func_model-s, but I suspect 30k funcs would be split
into several thousand distinct BTFs. At that point multi-fentry
benefits are diminishing and we might as well generate 30k unique
bpf trampolines for 30k functions and avoid all the complexity.
So I would sort by btf_func_model compared by arch specific comparator.
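In pseudo-C the sorting step could look like this (all struct and
helper names are invented):

	/* sketch: bucket target functions by func model */
	struct tramp_bucket {
		struct list_head node;
		struct btf_func_model model;
		/* member funcs, attached progs, generated trampoline, ... */
	};

	static struct tramp_bucket *get_bucket(struct list_head *buckets,
					       const struct btf_func_model *m)
	{
		struct tramp_bucket *b;

		list_for_each_entry(b, buckets, node)
			if (is_btf_func_model_equal(&b->model, m))
				return b;
		return alloc_bucket(buckets, m); /* new model -> new bucket */
	}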
Now say prog B needs to be attached to another 30k functions.
If all 30k+30k functions are different then it's the same as
the previous step.
Say, prog A is attached to 10k funcs with btf_func_model_1.
If prog B wants to attach to the exact same func set then we
just regenerate bpf trampoline with hard coded progs A and B
and reattach.
If not then we need to split the set into up to 3 sets.
Say, prog B wants 5k funcs, but only 1k funcs are common:
(prog_A, 9k funcs with btf_func_model_1) -> bpf trampoline X
(prog_A, prog_B, 1k funcs with btf_func_model_1) -> bpf trampoline Y
(prog_B, 4k funcs with btf_func_model_1) -> bpf trampoline Z
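The split itself is plain set arithmetic over sorted arrays of func
addrs, roughly (helper names invented):

	/* sketch: split prog A's 10k-func set when prog B attaches to 5k,
	 * 1k of which overlap. Sets are sorted by addr, so these are
	 * linear merge-style walks.
	 */
	common = set_intersect(setA, setB);  /* 1k -> trampoline Y: A + B */
	a_only = set_subtract(setA, common); /* 9k -> trampoline X: A only */
	b_only = set_subtract(setB, common); /* 4k -> trampoline Z: B only */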
And so on when prog C needs to be attached.
At detach time we can merge sets/trampolines,
but for now we can leave it all fragmented.
Unlike regular fentry progs, the multi-fentry progs are not going to
stay attached for a long time. So we can reduce the detach complexity.
The nice part of the algorithm is that coexistence of fentry
and multi-fentry is easy.
If fentry is already attached to some function we just
attach multi-fentry prog to that bpf trampoline.
If multi-fentry was attached first and fentry needs to be attached,
we create a regular bpf trampoline and add both progs there.
The intersection and sorting by btf_func_model are not trivial,
but we can hold global trampoline_mutex, so no concerns of races.
Example:
bpf_link_A is a set of:
(prog_A, funcs X,Y with btf_func_model_1)
(prog_A, funcs N,M with btf_func_model_2)
To attach prog B via bpf_link_B that wants:
(prog_B, funcs Y,Z with btf_func_model_1)
(prog_B, funcs P,Q with btf_func_model_3)
walk all existing links, intersect and split, and update the links.
At the end:
bpf_link_A:
(prog_A, funcs X with btf_func_model_1)
(prog_A, prog_B, funcs Y with btf_func_model_1)
(prog_A, funcs N,M with btf_func_model_2)
bpf_link_B:
(prog_A, prog_B, funcs Y with btf_func_model_1)
(prog_B, funcs Z with btf_func_model_1)
(prog_B, funcs P,Q with btf_func_model_3)
When a link is detached: walk its own tuples, remove the prog,
if nr_progs == 0 -> detach the corresponding trampoline,
if nr_progs > 0 -> regenerate the trampoline without that prog.
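i.e. something like (pseudo-C, all names invented):

	/* sketch: detach a multi-fentry link under the global trampoline_mutex */
	static void multi_link_release(struct bpf_multi_link *link)
	{
		struct tramp_tuple *t;

		mutex_lock(&trampoline_mutex);
		list_for_each_entry(t, &link->tuples, node) {
			remove_prog(t, link->prog);
			if (t->nr_progs == 0)
				tramp_unregister(t); /* poke funcs back to nops */
			else
				tramp_regen(t);      /* re-JIT without this prog */
		}
		mutex_unlock(&trampoline_mutex);
	}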
If fentry prog C needs to be attached to func N, it might split bpf_link_A:
(prog_A, funcs X with btf_func_model_1)
(prog_A, prog_B, funcs Y with btf_func_model_1)
(prog_A, funcs M with btf_func_model_2)
(prog_A, prog_C, funcs N with _fentry_)
Last time we gave up on it because we discovered that
overlap support was too complicated, but I cannot recall now
what it was :)
Maybe all of the above is repeating some old mistakes.
Jiri,
How does the above proposal look to you?