[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a22d6d79-fa7e-62b2-0ac1-575068f176a5@huawei.com>
Date: Fri, 9 Aug 2024 15:16:25 +0800
From: "Liao, Chang" <liaochang1@...wei.com>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
CC: <peterz@...radead.org>, <mingo@...hat.com>, <acme@...nel.org>,
<namhyung@...nel.org>, <mark.rutland@....com>,
<alexander.shishkin@...ux.intel.com>, <jolsa@...nel.org>,
<irogers@...gle.com>, <adrian.hunter@...el.com>, <kan.liang@...ux.intel.com>,
"oleg@...hat.com >> Oleg Nesterov" <oleg@...hat.com>, Andrii Nakryiko
<andrii@...nel.org>, Masami Hiramatsu <mhiramat@...nel.org>, Steven Rostedt
<rostedt@...dmis.org>, <paulmck@...nel.org>, <linux-kernel@...r.kernel.org>,
<linux-perf-users@...r.kernel.org>, <bpf@...r.kernel.org>,
<linux-trace-kernel@...r.kernel.org>
Subject: Re: [PATCH] uprobes: Optimize the allocation of insn_slot for
performance
在 2024/8/9 2:26, Andrii Nakryiko 写道:
> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@...wei.com> wrote:
>>
>> Hi Andrii and Oleg.
>>
>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
>> on arm64. I notice recent discussions on the performance and scalability of uprobes
>> within the mailing list. Considering this interest, I've added you and other relevant
>> maintainers to the CC list for broader visibility and potential collaboration.
>>
>
> Hi Liao,
>
> As you can see there is an active work to improve uprobes, that
> changes lifetime management of uprobes, removes a bunch of locks taken
> in the uprobe/uretprobe hot path, etc. It would be nice if you can
> hold off a bit with your changes until all that lands. And then
> re-benchmark, as costs might shift.
Andrii, I'm trying to integrate your lockless changes into the upstream
next-20240806 kernel tree. And I ran into some conflicts. please let me
know which kernel you're currently working on.
Thanks.
>
> But also see some remarks below.
>
>> Thanks.
>>
>> 在 2024/7/27 17:44, Liao Chang 写道:
>>> The profiling result of single-thread model of selftests bench reveals
>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>
>>> This patch introduce struct uprobe_breakpoint to track previously
>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
>>> need for redundant insn_slot writes and subsequent expensive cache
>>> flush, especially on architecture like ARM64. This patch has been tested
>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>> bench and Redis GET/SET benchmark result below reveal obivious
>>> performance gain.
>>>
>>> before-opt
>>> ----------
>>> trig-uprobe-nop: 0.371 ± 0.001M/s (0.371M/prod)
>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>> trig-uprobe-ret: 1.637 ± 0.001M/s (1.647M/prod)
>
> I'm surprised that nop and push variants are much slower than ret
> variant. This is exactly opposite on x86-64. Do you have an
> explanation why this might be happening? I see you are trying to
> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
> variant of uprobe that normally shouldn't be used. Typically uprobe is
> installed on nop (for USDT) and on function entry (which would be push
> variant, `push %rbp` instruction).
>
> ret variant, for x86-64, causes one extra step to go back to user
> space to execute original instruction out-of-line, and then trapping
> back to kernel for running uprobe. Which is what you normally want to
> avoid.
>
> What I'm getting at here. It seems like maybe arm arch is missing fast
> emulated implementations for nops/push or whatever equivalents for
> ARM64 that is. Please take a look at that and see why those are slow
> and whether you can make those into fast uprobe cases?
I will spend the weekend figuring out the questions you raised. Thanks for
pointing them out.
>
>>> trig-uretprobe-nop: 0.331 ± 0.004M/s (0.331M/prod)
>>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
>>> trig-uretprobe-ret: 0.854 ± 0.002M/s (0.854M/prod)
>>> Redis SET (RPS) uprobe: 42728.52
>>> Redis GET (RPS) uprobe: 43640.18
>>> Redis SET (RPS) uretprobe: 40624.54
>>> Redis GET (RPS) uretprobe: 41180.56
>>>
>>> after-opt
>>> ---------
>>> trig-uprobe-nop: 0.916 ± 0.001M/s (0.916M/prod)
>>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>> trig-uprobe-ret: 1.855 ± 0.000M/s (1.855M/prod)
>>> trig-uretprobe-nop: 0.640 ± 0.000M/s (0.640M/prod)
>>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>> trig-uretprobe-ret: 0.978 ± 0.003M/s (0.978M/prod)
>>> Redis SET (RPS) uprobe: 43939.69
>>> Redis GET (RPS) uprobe: 45200.80
>>> Redis SET (RPS) uretprobe: 41658.58
>>> Redis GET (RPS) uretprobe: 42805.80
>>>
>>> While some uprobes might still need to share the same insn_slot, this
>>> patch compare the instructions in the resued insn_slot with the
>>> instructions execute out-of-line firstly to decides allocate a new one
>>> or not.
>>>
>>> Additionally, this patch use a rbtree associated with each thread that
>>> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
>>> rbtree of uprobe_breakpoints has smaller node, better locality and less
>>> contention, it result in faster lookup times compared to find_uprobe().
>>>
>>> The other part of this patch are some necessary memory management for
>>> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
>>> hit uprobe that doesn't already have a corresponding node in rbtree. All
>>> uprobe_breakpoints will be freed when thread exit.
>>>
>>> Signed-off-by: Liao Chang <liaochang1@...wei.com>
>>> ---
>>> include/linux/uprobes.h | 3 +
>>> kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
>>> 2 files changed, 211 insertions(+), 38 deletions(-)
>>>
>
> [...]
--
BR
Liao, Chang
Powered by blists - more mailing lists