linux-kernel - Re: [PATCH] uprobes: Optimize the allocation of insn

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAEf4BzbN-+p0cDnHQPDwhVaqs-r-_Ft-LdUwY2KHG1xfrmyjBQ@mail.gmail.com>
Date: Fri, 9 Aug 2024 11:34:49 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: "Liao, Chang" <liaochang1@...wei.com>
Cc: peterz@...radead.org, mingo@...hat.com, acme@...nel.org, 
	namhyung@...nel.org, mark.rutland@....com, alexander.shishkin@...ux.intel.com, 
	jolsa@...nel.org, irogers@...gle.com, adrian.hunter@...el.com, 
	kan.liang@...ux.intel.com, 
	"oleg@...hat.com >> Oleg Nesterov" <oleg@...hat.com>, Andrii Nakryiko <andrii@...nel.org>, 
	Masami Hiramatsu <mhiramat@...nel.org>, Steven Rostedt <rostedt@...dmis.org>, paulmck@...nel.org, 
	linux-kernel@...r.kernel.org, linux-perf-users@...r.kernel.org, 
	bpf@...r.kernel.org, linux-trace-kernel@...r.kernel.org
Subject: Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

On Fri, Aug 9, 2024 at 12:16 AM Liao, Chang <liaochang1@...wei.com> wrote:
>
>
>
> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
> > On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@...wei.com> wrote:
> >>
> >> Hi Andrii and Oleg.
> >>
> >> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
> >> on arm64. I notice recent discussions on the performance and scalability of uprobes
> >> within the mailing list. Considering this interest, I've added you and other relevant
> >> maintainers to the CC list for broader visibility and potential collaboration.
> >>
> >
> > Hi Liao,
> >
> > As you can see there is an active work to improve uprobes, that
> > changes lifetime management of uprobes, removes a bunch of locks taken
> > in the uprobe/uretprobe hot path, etc. It would be nice if you can
> > hold off a bit with your changes until all that lands. And then
> > re-benchmark, as costs might shift.
>
> Andrii, I'm trying to integrate your lockless changes into the upstream
> next-20240806 kernel tree. And I ran into some conflicts. please let me
> know which kernel you're currently working on.
>

My patches are  based on tip/perf/core. But I also just pushed all the
changes I have accumulated (including patches I haven't sent for
review just yet), plus your patches for sighand lock removed applied
on top into [0]. So you can take a look and use that as a base for
now. Keep in mind, a bunch of those patches might still change, but
this should give you the best currently achievable performance with
uprobes/uretprobes. E.g., I'm getting something like below on x86-64
(note somewhat linear scalability with number of CPU cores, with
per-CPU performance *slowly* declining):

uprobe-nop            ( 1 cpus):    3.565 ± 0.004M/s  (  3.565M/s/cpu)
uprobe-nop            ( 2 cpus):    6.742 ± 0.009M/s  (  3.371M/s/cpu)
uprobe-nop            ( 3 cpus):   10.029 ± 0.056M/s  (  3.343M/s/cpu)
uprobe-nop            ( 4 cpus):   13.118 ± 0.014M/s  (  3.279M/s/cpu)
uprobe-nop            ( 5 cpus):   16.360 ± 0.011M/s  (  3.272M/s/cpu)
uprobe-nop            ( 6 cpus):   19.650 ± 0.045M/s  (  3.275M/s/cpu)
uprobe-nop            ( 7 cpus):   22.926 ± 0.010M/s  (  3.275M/s/cpu)
uprobe-nop            ( 8 cpus):   24.707 ± 0.025M/s  (  3.088M/s/cpu)
uprobe-nop            (10 cpus):   30.842 ± 0.018M/s  (  3.084M/s/cpu)
uprobe-nop            (12 cpus):   33.623 ± 0.037M/s  (  2.802M/s/cpu)
uprobe-nop            (14 cpus):   39.199 ± 0.009M/s  (  2.800M/s/cpu)
uprobe-nop            (16 cpus):   41.698 ± 0.018M/s  (  2.606M/s/cpu)
uprobe-nop            (24 cpus):   65.078 ± 0.018M/s  (  2.712M/s/cpu)
uprobe-nop            (32 cpus):   84.580 ± 0.017M/s  (  2.643M/s/cpu)
uprobe-nop            (40 cpus):  101.992 ± 0.268M/s  (  2.550M/s/cpu)
uprobe-nop            (48 cpus):  101.032 ± 1.428M/s  (  2.105M/s/cpu)
uprobe-nop            (56 cpus):  110.986 ± 0.736M/s  (  1.982M/s/cpu)
uprobe-nop            (64 cpus):  124.145 ± 0.110M/s  (  1.940M/s/cpu)
uprobe-nop            (72 cpus):  134.940 ± 0.200M/s  (  1.874M/s/cpu)
uprobe-nop            (80 cpus):  143.918 ± 0.235M/s  (  1.799M/s/cpu)

uretprobe-nop         ( 1 cpus):    1.987 ± 0.001M/s  (  1.987M/s/cpu)
uretprobe-nop         ( 2 cpus):    3.766 ± 0.003M/s  (  1.883M/s/cpu)
uretprobe-nop         ( 3 cpus):    5.638 ± 0.002M/s  (  1.879M/s/cpu)
uretprobe-nop         ( 4 cpus):    7.275 ± 0.003M/s  (  1.819M/s/cpu)
uretprobe-nop         ( 5 cpus):    9.124 ± 0.004M/s  (  1.825M/s/cpu)
uretprobe-nop         ( 6 cpus):   10.818 ± 0.007M/s  (  1.803M/s/cpu)
uretprobe-nop         ( 7 cpus):   12.721 ± 0.014M/s  (  1.817M/s/cpu)
uretprobe-nop         ( 8 cpus):   13.639 ± 0.007M/s  (  1.705M/s/cpu)
uretprobe-nop         (10 cpus):   17.023 ± 0.009M/s  (  1.702M/s/cpu)
uretprobe-nop         (12 cpus):   18.576 ± 0.014M/s  (  1.548M/s/cpu)
uretprobe-nop         (14 cpus):   21.660 ± 0.004M/s  (  1.547M/s/cpu)
uretprobe-nop         (16 cpus):   22.922 ± 0.013M/s  (  1.433M/s/cpu)
uretprobe-nop         (24 cpus):   34.756 ± 0.069M/s  (  1.448M/s/cpu)
uretprobe-nop         (32 cpus):   44.869 ± 0.153M/s  (  1.402M/s/cpu)
uretprobe-nop         (40 cpus):   53.397 ± 0.220M/s  (  1.335M/s/cpu)
uretprobe-nop         (48 cpus):   48.903 ± 2.277M/s  (  1.019M/s/cpu)
uretprobe-nop         (56 cpus):   42.144 ± 1.206M/s  (  0.753M/s/cpu)
uretprobe-nop         (64 cpus):   42.656 ± 1.104M/s  (  0.666M/s/cpu)
uretprobe-nop         (72 cpus):   46.299 ± 1.443M/s  (  0.643M/s/cpu)
uretprobe-nop         (80 cpus):   46.469 ± 0.808M/s  (  0.581M/s/cpu)

uprobe-ret            ( 1 cpus):    1.219 ± 0.008M/s  (  1.219M/s/cpu)
uprobe-ret            ( 2 cpus):    1.862 ± 0.008M/s  (  0.931M/s/cpu)
uprobe-ret            ( 3 cpus):    2.874 ± 0.005M/s  (  0.958M/s/cpu)
uprobe-ret            ( 4 cpus):    3.512 ± 0.002M/s  (  0.878M/s/cpu)
uprobe-ret            ( 5 cpus):    3.549 ± 0.001M/s  (  0.710M/s/cpu)
uprobe-ret            ( 6 cpus):    3.425 ± 0.003M/s  (  0.571M/s/cpu)
uprobe-ret            ( 7 cpus):    3.551 ± 0.009M/s  (  0.507M/s/cpu)
uprobe-ret            ( 8 cpus):    3.050 ± 0.002M/s  (  0.381M/s/cpu)
uprobe-ret            (10 cpus):    2.706 ± 0.002M/s  (  0.271M/s/cpu)
uprobe-ret            (12 cpus):    2.588 ± 0.003M/s  (  0.216M/s/cpu)
uprobe-ret            (14 cpus):    2.589 ± 0.003M/s  (  0.185M/s/cpu)
uprobe-ret            (16 cpus):    2.575 ± 0.001M/s  (  0.161M/s/cpu)
uprobe-ret            (24 cpus):    1.808 ± 0.011M/s  (  0.075M/s/cpu)
uprobe-ret            (32 cpus):    1.853 ± 0.001M/s  (  0.058M/s/cpu)
uprobe-ret            (40 cpus):    1.952 ± 0.002M/s  (  0.049M/s/cpu)
uprobe-ret            (48 cpus):    2.075 ± 0.007M/s  (  0.043M/s/cpu)
uprobe-ret            (56 cpus):    2.441 ± 0.004M/s  (  0.044M/s/cpu)
uprobe-ret            (64 cpus):    1.880 ± 0.012M/s  (  0.029M/s/cpu)
uprobe-ret            (72 cpus):    0.962 ± 0.002M/s  (  0.013M/s/cpu)
uprobe-ret            (80 cpus):    1.040 ± 0.011M/s  (  0.013M/s/cpu)

uretprobe-ret         ( 1 cpus):    0.981 ± 0.000M/s  (  0.981M/s/cpu)
uretprobe-ret         ( 2 cpus):    1.421 ± 0.001M/s  (  0.711M/s/cpu)
uretprobe-ret         ( 3 cpus):    2.050 ± 0.003M/s  (  0.683M/s/cpu)
uretprobe-ret         ( 4 cpus):    2.596 ± 0.002M/s  (  0.649M/s/cpu)
uretprobe-ret         ( 5 cpus):    3.105 ± 0.003M/s  (  0.621M/s/cpu)
uretprobe-ret         ( 6 cpus):    3.886 ± 0.002M/s  (  0.648M/s/cpu)
uretprobe-ret         ( 7 cpus):    3.016 ± 0.001M/s  (  0.431M/s/cpu)
uretprobe-ret         ( 8 cpus):    2.903 ± 0.000M/s  (  0.363M/s/cpu)
uretprobe-ret         (10 cpus):    2.755 ± 0.001M/s  (  0.276M/s/cpu)
uretprobe-ret         (12 cpus):    2.400 ± 0.001M/s  (  0.200M/s/cpu)
uretprobe-ret         (14 cpus):    3.972 ± 0.001M/s  (  0.284M/s/cpu)
uretprobe-ret         (16 cpus):    3.940 ± 0.003M/s  (  0.246M/s/cpu)
uretprobe-ret         (24 cpus):    3.002 ± 0.003M/s  (  0.125M/s/cpu)
uretprobe-ret         (32 cpus):    3.018 ± 0.003M/s  (  0.094M/s/cpu)
uretprobe-ret         (40 cpus):    1.846 ± 0.000M/s  (  0.046M/s/cpu)
uretprobe-ret         (48 cpus):    2.487 ± 0.004M/s  (  0.052M/s/cpu)
uretprobe-ret         (56 cpus):    2.470 ± 0.006M/s  (  0.044M/s/cpu)
uretprobe-ret         (64 cpus):    2.027 ± 0.014M/s  (  0.032M/s/cpu)
uretprobe-ret         (72 cpus):    1.108 ± 0.011M/s  (  0.015M/s/cpu)
uretprobe-ret         (80 cpus):    0.982 ± 0.005M/s  (  0.012M/s/cpu)


-ret variants (single-stepping case for x86-64) still suck, but they
suck 2x less now with your patches :) Clearly more work ahead for
those, though.


  [0] https://github.com/anakryiko/linux/commits/uprobes-lockless-cumulative/


> Thanks.
>
> >
> > But also see some remarks below.
> >
> >> Thanks.
> >>
> >> 在 2024/7/27 17:44, Liao Chang 写道:
> >>> The profiling result of single-thread model of selftests bench reveals
> >>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
> >>> ARM64. On my local testing machine, 5% of CPU time is consumed by
> >>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
> >>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
> >>>
> >>> This patch introduce struct uprobe_breakpoint to track previously
> >>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
> >>> need for redundant insn_slot writes and subsequent expensive cache
> >>> flush, especially on architecture like ARM64. This patch has been tested
> >>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
> >>> bench and Redis GET/SET benchmark result below reveal obivious
> >>> performance gain.
> >>>
> >>> before-opt
> >>> ----------
> >>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
> >>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
> >>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
> >
> > I'm surprised that nop and push variants are much slower than ret
> > variant. This is exactly opposite on x86-64. Do you have an
> > explanation why this might be happening? I see you are trying to
> > optimize xol_get_insn_slot(), but that is (at least for x86) a slow
> > variant of uprobe that normally shouldn't be used. Typically uprobe is
> > installed on nop (for USDT) and on function entry (which would be push
> > variant, `push %rbp` instruction).
> >
> > ret variant, for x86-64, causes one extra step to go back to user
> > space to execute original instruction out-of-line, and then trapping
> > back to kernel for running uprobe. Which is what you normally want to
> > avoid.
> >
> > What I'm getting at here. It seems like maybe arm arch is missing fast
> > emulated implementations for nops/push or whatever equivalents for
> > ARM64 that is. Please take a look at that and see why those are slow
> > and whether you can make those into fast uprobe cases?
>
> I will spend the weekend figuring out the questions you raised. Thanks for
> pointing them out.
>
> >
> >>> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
> >>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
> >>> trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
> >>> Redis SET (RPS) uprobe: 42728.52
> >>> Redis GET (RPS) uprobe: 43640.18
> >>> Redis SET (RPS) uretprobe: 40624.54
> >>> Redis GET (RPS) uretprobe: 41180.56
> >>>
> >>> after-opt
> >>> ---------
> >>> trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> >>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> >>> trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> >>> trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> >>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> >>> trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> >>> Redis SET (RPS) uprobe: 43939.69
> >>> Redis GET (RPS) uprobe: 45200.80
> >>> Redis SET (RPS) uretprobe: 41658.58
> >>> Redis GET (RPS) uretprobe: 42805.80
> >>>
> >>> While some uprobes might still need to share the same insn_slot, this
> >>> patch compare the instructions in the resued insn_slot with the
> >>> instructions execute out-of-line firstly to decides allocate a new one
> >>> or not.
> >>>
> >>> Additionally, this patch use a rbtree associated with each thread that
> >>> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
> >>> rbtree of uprobe_breakpoints has smaller node, better locality and less
> >>> contention, it result in faster lookup times compared to find_uprobe().
> >>>
> >>> The other part of this patch are some necessary memory management for
> >>> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
> >>> hit uprobe that doesn't already have a corresponding node in rbtree. All
> >>> uprobe_breakpoints will be freed when thread exit.
> >>>
> >>> Signed-off-by: Liao Chang <liaochang1@...wei.com>
> >>> ---
> >>>  include/linux/uprobes.h |   3 +
> >>>  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
> >>>  2 files changed, 211 insertions(+), 38 deletions(-)
> >>>
> >
> > [...]
>
> --
> BR
> Liao, Chang