Message-ID: <CAEf4BzY6tXrDGkW6mkxCY551pZa1G+Sgxeuex==nvHUEp9ynpg@mail.gmail.com>
Date: Mon, 8 Jul 2024 17:25:14 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Masami Hiramatsu <mhiramat@...nel.org>, Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...nel.org, andrii@...nel.org, linux-kernel@...r.kernel.org,
rostedt@...dmis.org, oleg@...hat.com, jolsa@...nel.org, clm@...a.com,
paulmck@...nel.org, bpf <bpf@...r.kernel.org>
Subject: Re: [PATCH 00/10] perf/uprobe: Optimize uprobes
On Mon, Jul 8, 2024 at 3:56 PM Masami Hiramatsu <mhiramat@...nel.org> wrote:
>
> On Mon, 08 Jul 2024 11:12:41 +0200
> Peter Zijlstra <peterz@...radead.org> wrote:
>
> > Hi!
> >
> > These patches implement the (S)RCU based proposal to optimize uprobes.
> >
> > On my c^Htrusty old IVB-EP -- where each (of the 40) CPU calls 'func' in a
> > tight loop:
> >
> > perf probe -x ./uprobes test=func
> > perf stat -ae probe_uprobe:test -- sleep 1
> >
> > perf probe -x ./uprobes test=func%return
> > perf stat -ae probe_uprobe:test__return -- sleep 1
> >
> > PRE:
> >
> > 4,038,804 probe_uprobe:test
> > 2,356,275 probe_uprobe:test__return
> >
> > POST:
> >
> > 7,216,579 probe_uprobe:test
> > 6,744,786 probe_uprobe:test__return
> >
>
> Good results! So is this an alternative to Andrii's batch-register
> series? (but maybe a simpler one)
Yes, this would be an alternative to my patches.
Peter,
I didn't have time to look at the patches just yet, but I managed to
run a quick benchmark (using the bench tool we have as part of BPF
selftests) to see both single-threaded performance and how the
performance scales with CPUs (now that we are not bottlenecked on
register_rwsem). Here are some results:
[root@...neltest003.10.atn6 ~]# for num_threads in {1..20}; do ./bench \
-a -d10 -p $num_threads trig-uprobe-nop | grep Summary; done
Summary: hits 3.278 ± 0.021M/s ( 3.278M/prod)
Summary: hits 4.364 ± 0.005M/s ( 2.182M/prod)
Summary: hits 6.517 ± 0.011M/s ( 2.172M/prod)
Summary: hits 8.203 ± 0.004M/s ( 2.051M/prod)
Summary: hits 9.520 ± 0.012M/s ( 1.904M/prod)
Summary: hits 8.316 ± 0.007M/s ( 1.386M/prod)
Summary: hits 7.893 ± 0.037M/s ( 1.128M/prod)
Summary: hits 8.490 ± 0.014M/s ( 1.061M/prod)
Summary: hits 8.022 ± 0.005M/s ( 0.891M/prod)
Summary: hits 8.471 ± 0.019M/s ( 0.847M/prod)
Summary: hits 8.156 ± 0.021M/s ( 0.741M/prod)
...
(The number in the first column is total throughput, and xxx/prod is
per-thread throughput; e.g., at 4 threads the 8.203M/s total works out
to 2.051M/s per thread.) Single-threaded performance (about 3.3M/s) is
on par with what I had with my patches. And it clearly scales better
with more threads now that register_rwsem is gone, though,
unfortunately, it doesn't scale linearly.
Quick profiling of the 8-thread benchmark shows that we spend >20% of
CPU in mmap_read_lock()/mmap_read_unlock() in find_active_uprobe(). I
think that's what prevents uprobes from scaling linearly. If you have
any good ideas on how to get rid of that, it would be extremely
beneficial. We also spend about 14% of the time in srcu_read_lock().
The rest is interrupt handling overhead, actual user-space function
overhead, and uprobe_dispatcher() calls.
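For context, here's how I read the hot path, condensed from
kernel/events/uprobes.c (the trap-check fallback and
MMF_RECALC_UPROBES handling are elided), with the contended lock
marked:

static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
{
        struct mm_struct *mm = current->mm;
        struct uprobe *uprobe = NULL;
        struct vm_area_struct *vma;

        mmap_read_lock(mm);             /* the >20% hotspot at 8 threads */
        vma = vma_lookup(mm, bp_vaddr);
        if (vma && valid_vma(vma, false)) {
                struct inode *inode = file_inode(vma->vm_file);
                loff_t offset = vaddr_to_offset(vma, bp_vaddr);

                uprobe = find_uprobe(inode, offset);
        }
        mmap_read_unlock(mm);

        return uprobe;
}

Even though it's only a read lock, every breakpoint hit bounces the
rwsem's reader count cache line across all the CPUs, which I suspect
explains the profile.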
Ramping this up to 16 threads shows the mmap lock getting even more
costly, up to 45% of CPU. SRCU overhead also grows, though more
slowly, to 19% of CPU. Is that expected? (I'm not familiar with the
SRCU implementation details.)
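For reference, the read-side pattern I have in mind is the standard
SRCU one, as in the minimal sketch below (the uprobes_srcu name and
its exact placement in the breakpoint path are my assumptions, since I
haven't read the patches yet):

        /* assumed domain name; read side uses per-CPU counters */
        DEFINE_STATIC_SRCU(uprobes_srcu);

        /* on breakpoint hit */
        int idx = srcu_read_lock(&uprobes_srcu);  /* the 14-19% we measure */
        uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
        /* ... dispatch handlers ... */
        srcu_read_unlock(&uprobes_srcu, idx);

My naive expectation was that a per-CPU counter increment would stay
roughly constant per hit as the thread count grows, hence the
question.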
P.S. Would you be able to rebase your patches on top of the latest
probes/for-next, which includes Jiri's sys_uretprobe changes? Right
now the uretprobe benchmarks are quite unrepresentative without them.
Thanks!
>
> Thank you,
>
> >
> > Patches also available here:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/uprobes
> >
> >
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@...nel.org>