Message-ID: <CAADnVQ+G+mQPJ+O1Oc9+UW=J17CGNC5B=usCmUDxBA-ze+gZGw@mail.gmail.com>
Date: Tue, 10 Jun 2025 20:32:20 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Menglong Dong <menglong8.dong@...il.com>
Cc: Steven Rostedt <rostedt@...dmis.org>, Jiri Olsa <jolsa@...nel.org>, bpf <bpf@...r.kernel.org>,
Menglong Dong <dongml2@...natelecom.cn>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH bpf-next 00/25] bpf: tracing multi-link support
On Tue, May 27, 2025 at 8:49 PM Menglong Dong <menglong8.dong@...il.com> wrote:
>
> 1. Add per-function metadata storage support.
> 2. Add bpf global trampoline support for x86_64.
> 3. Add bpf global trampoline link support.
> 4. Add tracing multi-link support.
> 5. Compatibility between tracing and tracing_multi.
...
> ... and I think it will be a
> liberation to split it out to another series :/
There are lots of interesting ideas here and you know
already what the next step should be...
Split it into small chunks.
As presented it's hard to review, and even if maintainers take on
that challenge the set is unlandable, since it spans multiple
subsystems.
In a small reviewable patch set we can argue about
approach A vs B while the current set has too many angles
to argue about.
Like the new concept of a global trampoline.
It's nice to write bpf_global_caller() in asm
compared to arch_prepare_bpf_trampoline() that emits asm
on the fly, but it seems the only part that truly
needs asm is the register save/restore. The rest can be done in C.
I suspect the whole gtramp can be written in C.
There is an attribute(interrupt) that all compilers support...
or use no attributes and inline asm for the regs save/restore?
Or attribute(naked) and more inline asm?
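For illustration only, a very rough sketch of the naked + inline asm
variant (names are made up, not from this series; frame setup,
fexit/return hooking and unwind concerns are ignored):

struct gtramp_regs {
	unsigned long rdi, rsi, rdx, rcx, r8, r9;
	unsigned long ret_ip;	/* pushed by the call at the fentry site */
};

void bpf_gtramp_handler(struct gtramp_regs *regs)
{
	/* plain C from here on: per-function metadata lookup by
	 * regs->ret_ip, prog invocation, stats, etc.
	 */
}

__attribute__((naked)) void bpf_global_caller(void)
{
	asm("push %r9\n\t"
	    "push %r8\n\t"
	    "push %rcx\n\t"
	    "push %rdx\n\t"
	    "push %rsi\n\t"
	    "push %rdi\n\t"
	    "mov  %rsp, %rdi\n\t"	/* &gtramp_regs as the C argument */
	    "call bpf_gtramp_handler\n\t"
	    "pop  %rdi\n\t"
	    "pop  %rsi\n\t"
	    "pop  %rdx\n\t"
	    "pop  %rcx\n\t"
	    "pop  %r8\n\t"
	    "pop  %r9\n\t"
	    "ret");
}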
> no-mitigate + hash table mode
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> nop | fentry | fm_single | fm_all | km_single | km_all
> 9.014ms | 162.378ms | 180.511ms | 446.286ms | 220.634ms | 1465.133ms
> 9.038ms | 161.600ms | 178.757ms | 445.807ms | 220.656ms | 1463.714ms
> 9.048ms | 161.435ms | 180.510ms | 452.530ms | 220.943ms | 1487.494ms
> 9.030ms | 161.585ms | 178.699ms | 448.167ms | 220.107ms | 1463.785ms
> 9.056ms | 161.530ms | 178.947ms | 445.609ms | 221.026ms | 1560.584ms
...
> no-mitigate + function padding mode
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> nop | fentry | fm_single | fm_all | km_single | km_all
> 9.320ms | 166.454ms | 184.094ms | 193.884ms | 227.320ms | 1441.462ms
> 9.326ms | 166.651ms | 183.954ms | 193.912ms | 227.503ms | 1544.634ms
> 9.313ms | 170.501ms | 183.985ms | 191.738ms | 227.801ms | 1441.284ms
> 9.311ms | 166.957ms | 182.086ms | 192.063ms | 410.411ms | 1489.665ms
> 9.329ms | 166.332ms | 182.196ms | 194.154ms | 227.443ms | 1511.272ms
>
> The overhead of fentry_multi_all is a little higher than the
> fentry_multi_single. Maybe it is because the function
> ktime_get_boottime_ns(), which is used in bpf_testmod_bench_run(), is also
> traced? I haven't figured it out yet, but it doesn't matter :/
I think it matters a lot.
Looking at patch 25, fm_all (in addition to fm_single) is only
supposed to trigger from ktime_get_boottime,
but for the hash table mode the difference is huge.
10M bpf_fentry_test1() calls are supposed to dominate the 2 calls
to ktime_get and whatever else is called there,
but that is not what the numbers show.
Same discrepancy with kprobe_multi. A 7x difference has to be understood,
since it's a sign that the benchmark is not really measuring
what it is supposed to measure. Which casts doubt on all the numbers.
Another question is how come fentry is 20x slower than nop.
We don't see that in the existing benches. That's another red flag.
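(Taking the quoted numbers at face value and assuming 10M calls per
run: nop ~0.9ns/call, fentry ~16ns, fm_single ~18ns, fm_all ~45ns,
km_single ~22ns, km_all ~146ns, which is where the ~20x and ~7x
above, and the ~2.5x fm_single vs fm_all gap below, come from.)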
You need to rethink the benchmarking strategy. The bench itself
should be spotless. Don't invent new stuff. Add to the existing benches.
They already measure nop, fentry, kprobe, kprobe-multi.
Then only introduce a global trampoline with a simple hash tab.
Compare against the current numbers for fentry.
fm_single has to be within a couple of percent of fentry.
Then make fm_all attach to everything except the funcs that the bench trigger calls.
fm_all has to be exactly equal to fm_single.
If the difference is 2.5x like here (180 for fm_single vs 446 for fm_all)
something is wrong. Investigate it and don't proceed without full
understanding.
And only then introduce the 5 byte special insn that indexes into
an array for fast access to the metadata.
Your numbers are a bit suspicious, but they show that fm_single
with hash tab is the same speed as the special kfunc_md_arch_support().
Which is expected.
With fm_all that triggers a small set of kernel functions
in a tight benchmark loop the performance of hashtab vs special
should _also_ be the same, because the hashtab will perform an O(1) lookup
that is hot in the cache (or the hashtab has bad collisions and should be fixed).
fm_all should have the same speed as fm_single too,
because bench will only attach to things outside of the tight bench loop.
So attaching to thousands of kernel functions that are not being
triggered by the benchmark should not affect results.
The performance advantage of special kfunc_md_arch_support()
can probably only be seen in production when fentry.multi attaches
to thousands of kernel functions and random functions are called.
Then hash tab cache misses will be noticeable vs direct access.
There will be cache misses in both cases, but significantly more misses
for the hash tab. Only then can we decide where the special stuff is truly necessary.
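To make the hashtab vs direct access point concrete, roughly (all
names made up, not from the series, and the padding layout is just
an assumption):

struct kfunc_md {
	unsigned long func_ip;
	void *progs;			/* attached progs, details omitted */
	struct kfunc_md *next;		/* hash chain */
};

/* variant 1: hash table keyed by the traced function's IP;
 * O(1) on average, but with dependent loads that can miss in cache
 */
static struct kfunc_md *md_hash[1024];

static struct kfunc_md *md_lookup_hash(unsigned long func_ip)
{
	struct kfunc_md *md = md_hash[(func_ip >> 4) % 1024];

	for (; md; md = md->next)
		if (md->func_ip == func_ip)
			return md;
	return NULL;
}

/* variant 2: the "5 byte insn" idea; an index embedded in the
 * function padding right before the entry turns the lookup into
 * one read plus a direct array access
 */
static struct kfunc_md md_array[65536];

static struct kfunc_md *md_lookup_padding(unsigned long func_ip)
{
	unsigned int idx = *(unsigned int *)(func_ip - 4);

	return &md_array[idx];
}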
So patches 2 and 3 really come last, after everything else has already landed.