Message-ID: <CAEf4BzY6tXrDGkW6mkxCY551pZa1G+Sgxeuex==nvHUEp9ynpg@mail.gmail.com>
Date: Mon, 8 Jul 2024 17:25:14 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Masami Hiramatsu <mhiramat@...nel.org>, Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...nel.org, andrii@...nel.org, linux-kernel@...r.kernel.org,
rostedt@...dmis.org, oleg@...hat.com, jolsa@...nel.org, clm@...a.com,
paulmck@...nel.org, bpf <bpf@...r.kernel.org>
Subject: Re: [PATCH 00/10] perf/uprobe: Optimize uprobes
On Mon, Jul 8, 2024 at 3:56 PM Masami Hiramatsu <mhiramat@...nel.org> wrote:
>
> On Mon, 08 Jul 2024 11:12:41 +0200
> Peter Zijlstra <peterz@...radead.org> wrote:
>
> > Hi!
> >
> > These patches implement the (S)RCU based proposal to optimize uprobes.
> >
> > On my c^Htrusty old IVB-EP -- where each (of the 40) CPU calls 'func' in a
> > tight loop:
> >
> > perf probe -x ./uprobes test=func
> > perf stat -ae probe_uprobe:test -- sleep 1
> >
> > perf probe -x ./uprobes test=func%return
> > perf stat -ae probe_uprobe:test__return -- sleep 1
> >
> > PRE:
> >
> > 4,038,804 probe_uprobe:test
> > 2,356,275 probe_uprobe:test__return
> >
> > POST:
> >
> > 7,216,579 probe_uprobe:test
> > 6,744,786 probe_uprobe:test__return
> >
>
> Good results! So is this an alternative to Andrii's batch-register
> series? (but maybe a simpler one)
Yes, this would be an alternative to my patches.
Peter,
I didn't have time to look at the patches just yet, but I managed to
run a quick benchmark (using the bench tool we have as part of BPF
selftests) to see both single-threaded performance and how the
performance scales with CPUs (now that we are not bottlenecked on
register_rwsem). Here are some results:
[root@...neltest003.10.atn6 ~]# for num_threads in {1..20}; do ./bench \
-a -d10 -p $num_threads trig-uprobe-nop | grep Summary; done
Summary: hits 3.278 ± 0.021M/s ( 3.278M/prod)
Summary: hits 4.364 ± 0.005M/s ( 2.182M/prod)
Summary: hits 6.517 ± 0.011M/s ( 2.172M/prod)
Summary: hits 8.203 ± 0.004M/s ( 2.051M/prod)
Summary: hits 9.520 ± 0.012M/s ( 1.904M/prod)
Summary: hits 8.316 ± 0.007M/s ( 1.386M/prod)
Summary: hits 7.893 ± 0.037M/s ( 1.128M/prod)
Summary: hits 8.490 ± 0.014M/s ( 1.061M/prod)
Summary: hits 8.022 ± 0.005M/s ( 0.891M/prod)
Summary: hits 8.471 ± 0.019M/s ( 0.847M/prod)
Summary: hits 8.156 ± 0.021M/s ( 0.741M/prod)
...
(The number in the first column is total throughput, and xxx/prod is
per-thread throughput; e.g., at 4 threads the 8.203M/s total works out
to 2.051M/s per thread.) Single-threaded performance (about 3.3M/s) is
on par with what I had with my patches. And it clearly scales better
with more threads now that register_rwsem is gone, though,
unfortunately, it doesn't scale linearly.
Quick profiling of the 8-thread benchmark shows that we spend >20% of
CPU in mmap_read_lock()/mmap_read_unlock() in find_active_uprobe(). I
think that's what prevents uprobes from scaling linearly. If you have
any good ideas on how to get rid of that, it would be extremely
beneficial. We also spend about 14% of the time in srcu_read_lock().
The rest is interrupt handling overhead, actual user-space function
overhead, and uprobe_dispatcher() calls.
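For context, here's how I read the hot path, condensed from
kernel/events/uprobes.c (the trap-check fallback and
MMF_RECALC_UPROBES handling are elided), with the contended lock
marked:

static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
{
        struct mm_struct *mm = current->mm;
        struct uprobe *uprobe = NULL;
        struct vm_area_struct *vma;

        mmap_read_lock(mm);             /* the >20% hotspot at 8 threads */
        vma = vma_lookup(mm, bp_vaddr);
        if (vma && valid_vma(vma, false)) {
                struct inode *inode = file_inode(vma->vm_file);
                loff_t offset = vaddr_to_offset(vma, bp_vaddr);

                uprobe = find_uprobe(inode, offset);
        }
        mmap_read_unlock(mm);

        return uprobe;
}

Even though it's only a read lock, every breakpoint hit bounces the
rwsem's reader count cache line across all the CPUs, which I suspect
explains the profile.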
Ramping this up to 16 threads shows the mmap lock getting even more
costly, up to 45% of CPU. SRCU overhead also grows, though more
slowly, to 19% of CPU. Is that expected? (I'm not familiar with the
SRCU implementation details.)
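For reference, the read-side pattern I have in mind is the standard
SRCU one, as in the minimal sketch below (the uprobes_srcu name and
its exact placement in the breakpoint path are my assumptions, since I
haven't read the patches yet):

        /* assumed domain name; read side uses per-CPU counters */
        DEFINE_STATIC_SRCU(uprobes_srcu);

        /* on breakpoint hit */
        int idx = srcu_read_lock(&uprobes_srcu);  /* the 14-19% we measure */
        uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
        /* ... dispatch handlers ... */
        srcu_read_unlock(&uprobes_srcu, idx);

My naive expectation was that a per-CPU counter increment would stay
roughly constant per hit as the thread count grows, hence the
question.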
P.S. Would you be able to rebase your patches on top of the latest
probes/for-next, which includes Jiri's sys_uretprobe changes? Right
now the uretprobe benchmarks are quite unrepresentative without them.
Thanks!
>
> Thank you,
>
> >
> > Patches also available here:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/uprobes
> >
> >
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@...nel.org>