Message-ID: <CAEf4BzaRzHFs-gyC5FGsbh4EX4=-QP4_i7A5ts++-J0JPaOb1g@mail.gmail.com>
Date: Thu, 5 Dec 2024 10:23:42 -0800
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Liao Chang <liaochang1@...wei.com>
Cc: mhiramat@...nel.org, oleg@...hat.com, peterz@...radead.org,
mingo@...hat.com, acme@...nel.org, namhyung@...nel.org, mark.rutland@....com,
alexander.shishkin@...ux.intel.com, jolsa@...nel.org, irogers@...gle.com,
adrian.hunter@...el.com, kan.liang@...ux.intel.com,
linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
linux-perf-users@...r.kernel.org, bpf@...r.kernel.org
Subject: Re: [PATCH v4 0/2] uprobes: Improve scalability by reducing the
contention on siglock
On Tue, Nov 5, 2024 at 6:07 PM Andrii Nakryiko
<andrii.nakryiko@...il.com> wrote:
>
> On Tue, Oct 22, 2024 at 12:42 AM Liao Chang <liaochang1@...wei.com> wrote:
> >
> > Profiling the BPF selftest benchmark on an ARM64 platform reveals that
> > significant contention on current->sighand->siglock is the scalability
> > bottleneck. The reason is straightforward: all producer threads of the
> > benchmark have to contend for this spinlock to restore the
> > TIF_SIGPENDING bit in thread_info, which might have been cleared in
> > uprobe_deny_signal().
> >
> > The contention on current->sighand->siglock is unnecessary, and this
> > series removes it entirely. I used the script developed by Andrii in [1]
> > to run the benchmark. The CPU used was a Kunpeng916 (Hi1616) with 4 NUMA
> > nodes and 64 cores@...GHz, running the kernel from the next tree plus
> > the optimization in [2] for get_xol_insn_slot().
> >
> > before-opt
> > ----------
> > uprobe-nop ( 1 cpus): 0.907 ± 0.003M/s ( 0.907M/s/cpu)
> > uprobe-nop ( 2 cpus): 1.676 ± 0.008M/s ( 0.838M/s/cpu)
> > uprobe-nop ( 4 cpus): 3.210 ± 0.003M/s ( 0.802M/s/cpu)
> > uprobe-nop ( 8 cpus): 4.457 ± 0.003M/s ( 0.557M/s/cpu)
> > uprobe-nop (16 cpus): 3.724 ± 0.011M/s ( 0.233M/s/cpu)
> > uprobe-nop (32 cpus): 2.761 ± 0.003M/s ( 0.086M/s/cpu)
> > uprobe-nop (64 cpus): 1.293 ± 0.015M/s ( 0.020M/s/cpu)
> >
> > uprobe-push ( 1 cpus): 0.883 ± 0.001M/s ( 0.883M/s/cpu)
> > uprobe-push ( 2 cpus): 1.642 ± 0.005M/s ( 0.821M/s/cpu)
> > uprobe-push ( 4 cpus): 3.086 ± 0.002M/s ( 0.771M/s/cpu)
> > uprobe-push ( 8 cpus): 3.390 ± 0.003M/s ( 0.424M/s/cpu)
> > uprobe-push (16 cpus): 2.652 ± 0.005M/s ( 0.166M/s/cpu)
> > uprobe-push (32 cpus): 2.713 ± 0.005M/s ( 0.085M/s/cpu)
> > uprobe-push (64 cpus): 1.313 ± 0.009M/s ( 0.021M/s/cpu)
> >
> > uprobe-ret ( 1 cpus): 1.774 ± 0.000M/s ( 1.774M/s/cpu)
> > uprobe-ret ( 2 cpus): 3.350 ± 0.001M/s ( 1.675M/s/cpu)
> > uprobe-ret ( 4 cpus): 6.604 ± 0.000M/s ( 1.651M/s/cpu)
> > uprobe-ret ( 8 cpus): 6.706 ± 0.005M/s ( 0.838M/s/cpu)
> > uprobe-ret (16 cpus): 5.231 ± 0.001M/s ( 0.327M/s/cpu)
> > uprobe-ret (32 cpus): 5.743 ± 0.003M/s ( 0.179M/s/cpu)
> > uprobe-ret (64 cpus): 4.726 ± 0.016M/s ( 0.074M/s/cpu)
> >
> > after-opt
> > ---------
> > uprobe-nop ( 1 cpus): 0.985 ± 0.002M/s ( 0.985M/s/cpu)
> > uprobe-nop ( 2 cpus): 1.773 ± 0.005M/s ( 0.887M/s/cpu)
> > uprobe-nop ( 4 cpus): 3.304 ± 0.001M/s ( 0.826M/s/cpu)
> > uprobe-nop ( 8 cpus): 5.328 ± 0.002M/s ( 0.666M/s/cpu)
> > uprobe-nop (16 cpus): 6.475 ± 0.002M/s ( 0.405M/s/cpu)
> > uprobe-nop (32 cpus): 4.831 ± 0.082M/s ( 0.151M/s/cpu)
> > uprobe-nop (64 cpus): 2.564 ± 0.053M/s ( 0.040M/s/cpu)
> >
> > uprobe-push ( 1 cpus): 0.964 ± 0.001M/s ( 0.964M/s/cpu)
> > uprobe-push ( 2 cpus): 1.766 ± 0.002M/s ( 0.883M/s/cpu)
> > uprobe-push ( 4 cpus): 3.290 ± 0.009M/s ( 0.823M/s/cpu)
> > uprobe-push ( 8 cpus): 4.670 ± 0.002M/s ( 0.584M/s/cpu)
> > uprobe-push (16 cpus): 5.197 ± 0.004M/s ( 0.325M/s/cpu)
> > uprobe-push (32 cpus): 5.068 ± 0.161M/s ( 0.158M/s/cpu)
> > uprobe-push (64 cpus): 2.605 ± 0.026M/s ( 0.041M/s/cpu)
> >
> > uprobe-ret ( 1 cpus): 1.833 ± 0.001M/s ( 1.833M/s/cpu)
> > uprobe-ret ( 2 cpus): 3.384 ± 0.003M/s ( 1.692M/s/cpu)
> > uprobe-ret ( 4 cpus): 6.677 ± 0.004M/s ( 1.669M/s/cpu)
> > uprobe-ret ( 8 cpus): 6.854 ± 0.005M/s ( 0.857M/s/cpu)
> > uprobe-ret (16 cpus): 6.508 ± 0.006M/s ( 0.407M/s/cpu)
> > uprobe-ret (32 cpus): 5.793 ± 0.009M/s ( 0.181M/s/cpu)
> > uprobe-ret (64 cpus): 4.743 ± 0.016M/s ( 0.074M/s/cpu)
> >
> > The benchmark results above demonstrate an obvious improvement in the
> > scalability of trig-uprobe-nop and trig-uprobe-push, whose peak
> > throughput rises from 4.457M/s to 6.475M/s and from 3.390M/s to
> > 5.197M/s, respectively.
> >
> > v4->v3:
> > 1. Rebase v3 [3] to the latest tip/perf/core.
> > 2. Acked-by: Masami Hiramatsu (Google) <mhiramat@...nel.org>
> > 3. Acked-by: Oleg Nesterov <oleg@...hat.com>
> >
> > v3->v2:
> > Renaming the flag in [2/2], s/deny_signal/signal_denied/g.
> >
> > v2->v1:
> > Oleg pointed out that _DENY_SIGNAL will be replaced by _ACK upon
> > completion of the single-step, which leaves handle_singlestep() with no
> > chance to restore the removed TIF_SIGPENDING [3], and raised some other
> > cases in question. So this revision uses a flag in uprobe_task to track
> > the denied TIF_SIGPENDING instead of a new UPROBE_SSTEP state.
> >
> > [1] https://lore.kernel.org/all/20240731214256.3588718-1-andrii@kernel.org
> > [2] https://lore.kernel.org/all/20240727094405.1362496-1-liaochang1@huawei.com
> > [3] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/
> >
> > Liao Chang (2):
> > uprobes: Remove redundant spinlock in uprobe_deny_signal()
> > uprobes: Remove the spinlock within handle_singlestep()
> >
> > include/linux/uprobes.h | 1 +
> > kernel/events/uprobes.c | 10 +++++-----
> > 2 files changed, 6 insertions(+), 5 deletions(-)
> >
> > --
> > 2.34.1
> >
>
> This patch set has been ready for a long while, can we please apply it
> to perf/core as well? Thank you!
Liao,
This patch set doesn't apply cleanly to perf/core anymore. Can you
please rebase one more time and resend? Thanks!
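
For readers following the thread, the idea in the quoted cover letter
(track a denied TIF_SIGPENDING in a per-task flag instead of taking
siglock to restore it) can be sketched in user-space C. This is only a
model under stated assumptions: struct task_model and both model_*
functions are hypothetical names, not kernel code. The real changes are in
uprobe_deny_signal() and handle_singlestep() in kernel/events/uprobes.c.

```c
#include <stdbool.h>

/* Hypothetical user-space stand-in for per-task state; not a kernel type. */
struct task_model {
	bool sigpending;      /* models TIF_SIGPENDING in thread_info */
	bool signal_denied;   /* models the flag kept in uprobe_task */
	bool single_stepping; /* models "task is single-stepping a probed insn" */
};

/*
 * Models uprobe_deny_signal(): while single-stepping, hide a pending
 * signal and remember that it was hidden. Returns true if signal
 * delivery is denied for now.
 */
static bool model_deny_signal(struct task_model *t)
{
	if (!t->single_stepping)
		return false;

	if (t->sigpending) {
		t->signal_denied = true;
		t->sigpending = false;
	}
	return true;
}

/*
 * Models the end of handle_singlestep(): restore the pending bit only
 * if it was hidden earlier. Only the task's own state is touched, so
 * this model needs no shared lock.
 */
static void model_finish_singlestep(struct task_model *t)
{
	t->single_stepping = false;

	if (t->signal_denied) {
		t->sigpending = true;
		t->signal_denied = false;
	}
}
```

In this model, a task that enters single-step with a signal pending has
the bit hidden by model_deny_signal() and gets it back from
model_finish_singlestep(). A task that is not single-stepping is left
untouched.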