Message-Id: <20201022082138.2322434-1-jolsa@kernel.org>
Date:   Thu, 22 Oct 2020 10:21:22 +0200
From:   Jiri Olsa <jolsa@...nel.org>
To:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andriin@...com>
Cc:     netdev@...r.kernel.org, bpf@...r.kernel.org,
        Martin KaFai Lau <kafai@...com>,
        Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
        John Fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...omium.org>, Daniel Xu <dxu@...uu.xyz>,
        Steven Rostedt <rostedt@...dmis.org>,
        Jesper Brouer <jbrouer@...hat.com>,
        Toke Høiland-Jørgensen <toke@...hat.com>,
        Viktor Malik <vmalik@...hat.com>
Subject: [RFC bpf-next 00/16] bpf: Speed up trampoline attach

hi,
this patchset tries to speed up the attach time for trampolines
and make bpftrace faster for wildcard use cases like:

  # bpftrace -ve 'kfunc:__x64_sys_s* { printf("test\n"); }'

Profiles show mostly the ftrace backend, because we add trampoline
functions one by one, and registering an ftrace direct function is
quite expensive. The main change in this patchset is therefore to
allow batch attach and to use just a single ftrace call to attach
or detach multiple ips/trampolines.
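
As a rough sketch of what the batched interface could look like; the
prototype below is an assumption inferred from the patch titles, not
the actual signature from the patches:

  /*
   * Hypothetical prototype for the batched direct registration
   * added by this series; the real signature may differ.  The
   * point is that all ips get patched in one ftrace update
   * instead of one expensive update per function.
   */
  int register_ftrace_direct_ips(unsigned long *ips,
                                 unsigned long *addrs,
                                 unsigned long count);

  static int attach_batch(unsigned long *ips, unsigned long *trampolines,
                          unsigned long cnt)
  {
          /* single ftrace call for the whole batch */
          return register_ftrace_direct_ips(ips, trampolines, cnt);
  }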

This patchset also contains other speedup changes that showed
up in profiles:

  - delayed link free
    to bypass detach cycles completely (see the first sketch
    after this list)

  - kallsyms rbtree search
    change the linear search to an rb-tree search (see the
    second sketch after this list)
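
A minimal sketch of the delayed link free idea, assuming the cost is
the synchronous RCU wait on the detach path; names and fields are
illustrative, modeled on the existing bpf_link code:

  #include <linux/workqueue.h>

  /* Tear the link down from a worker instead of synchronously in
   * bpf_link_put(), so the caller does not block in RCU waits for
   * every single detached program. */
  static void bpf_link_free_deferred(struct work_struct *work)
  {
          struct bpf_link *link = container_of(work, struct bpf_link, work);

          bpf_link_free(link);            /* detach + free, may block */
  }

  void bpf_link_put(struct bpf_link *link)
  {
          if (!atomic64_dec_and_test(&link->refcnt))
                  return;

          INIT_WORK(&link->work, bpf_link_free_deferred);
          schedule_work(&link->work);     /* caller returns immediately */
  }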
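
And a sketch of the kallsyms rb-tree lookup replacing the linear scan
in kallsyms_lookup_name(); the node type is illustrative, not taken
from the patch:

  #include <linux/rbtree.h>
  #include <linux/string.h>

  struct ksym_node {
          struct rb_node  rb;     /* keyed by symbol name */
          const char      *name;
          unsigned long   addr;
  };

  static unsigned long ksym_lookup(struct rb_root *root, const char *name)
  {
          struct rb_node *n = root->rb_node;

          while (n) {
                  struct ksym_node *k = rb_entry(n, struct ksym_node, rb);
                  int cmp = strcmp(name, k->name);

                  if (!cmp)
                          return k->addr; /* O(log n) instead of O(n) */
                  n = cmp < 0 ? n->rb_left : n->rb_right;
          }
          return 0;
  }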

For the clean attach workload I also added a new attach selftest,
which is not meant to be merged but is used to show the profile
results.

The following numbers show the speedup after applying each change
on top of (and including) the previous changes.

profiled with: 'perf stat -r 5 -e cycles:k,cycles:u ...'

For bpftrace:

  # bpftrace -ve 'kfunc:__x64_sys_s* { printf("test\n"); } i:ms:10 { printf("exit\n"); exit(); }'

  - base

      3,290,457,628      cycles:k         ( +-  0.27% )
        933,581,973      cycles:u         ( +-  0.20% )

      50.25 +- 4.79 seconds time elapsed  ( +-  9.53% )

  + delayed link free

      2,535,458,767      cycles:k         ( +-  0.55% )
        940,046,382      cycles:u         ( +-  0.27% )

      33.60 +- 3.27 seconds time elapsed  ( +-  9.73% )

  + kallsyms rbtree search

      2,199,433,771      cycles:k         ( +-  0.55% )
        936,105,469      cycles:u         ( +-  0.37% )

      26.48 +- 3.57 seconds time elapsed  ( +- 13.49% )

  + batch support

      1,456,854,867      cycles:k         ( +-  0.57% )
        937,737,431      cycles:u         ( +-  0.13% )

      12.44 +- 2.98 seconds time elapsed  ( +- 23.95% )

  + rcu fix

      1,427,959,119      cycles:k         ( +-  0.87% )
        930,833,507      cycles:u         ( +-  0.23% )

      14.53 +- 3.51 seconds time elapsed  ( +- 24.14% )


For attach_test the numbers do not show a direct time speedup when
using the batch support, but they do show a big decrease in kernel
cycles. It seems the time is spent waiting on RCU, which I tried to
address in a most likely wrong rcu fix (sketched after the numbers
below):

  # ./test_progs -t attach_test

  - base

      1,350,136,760      cycles:k         ( +-  0.07% )
         70,591,712      cycles:u         ( +-  0.26% )

      24.26 +- 2.82 seconds time elapsed  ( +- 11.62% )

  + delayed link free

        996,152,309      cycles:k         ( +-  0.37% )
         69,263,150      cycles:u         ( +-  0.50% )

      15.63 +- 1.80 seconds time elapsed  ( +- 11.51% )

  + kallsyms rbtree search

        390,217,706      cycles:k         ( +-  0.66% )
         68,999,019      cycles:u         ( +-  0.46% )

      14.11 +- 2.11 seconds time elapsed  ( +- 14.98% )

  + batch support

         37,410,887      cycles:k         ( +-  0.98% )
         70,062,158      cycles:u         ( +-  0.39% )

      26.80 +- 4.10 seconds time elapsed  ( +- 15.31% )

  + rcu fix

         36,812,432      cycles:k         ( +-  2.52% )
         69,907,191      cycles:u         ( +-  0.38% )

      15.04 +- 2.94 seconds time elapsed  ( +- 19.54% )
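
The rcu fix above refers to the "Move synchronize_rcu_mult for batch
processing" patch (marked NOT TO BE MERGED). The idea, roughly
sketched below, is to pay the tasks-RCU grace period once per batch
instead of once per trampoline update; bpf_trampoline_update_nosync()
is a hypothetical helper, and the real call sites in trampoline.c
may differ:

  #include <linux/rcupdate_wait.h>

  static int trampoline_batch_update(struct bpf_trampoline **trs, int cnt)
  {
          int i, err = 0;

          for (i = 0; i < cnt; i++)
                  /* hypothetical: update without waiting for RCU */
                  err |= bpf_trampoline_update_nosync(trs[i]);

          /* one grace-period wait for the whole batch */
          synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_rude);
          return err;
  }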


I still need to go through the changes and double check them. The
ftrace changes are most likely wrong, and I most likely broke a few
tests (hence the RFC), but I wonder whether you guys would like this
batch solution and whether there are any thoughts on it.

Also available in
  git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
  bpf/batch

thanks,
jirka


---
Jiri Olsa (16):
      ftrace: Add check_direct_entry function
      ftrace: Add adjust_direct_size function
      ftrace: Add get/put_direct_func function
      ftrace: Add ftrace_set_filter_ips function
      ftrace: Add register_ftrace_direct_ips function
      ftrace: Add unregister_ftrace_direct_ips function
      kallsyms: Use rb tree for kallsyms name search
      bpf: Use delayed link free in bpf_link_put
      bpf: Add BPF_TRAMPOLINE_BATCH_ATTACH support
      bpf: Add BPF_TRAMPOLINE_BATCH_DETACH support
      bpf: Sync uapi bpf.h to tools
      bpf: Move synchronize_rcu_mult for batch processing (NOT TO BE MERGED)
      libbpf: Add trampoline batch attach support
      libbpf: Add trampoline batch detach support
      selftests/bpf: Add trampoline batch test
      selftests/bpf: Add attach batch test (NOT TO BE MERGED)

 include/linux/bpf.h                                       |  18 +++++-
 include/linux/ftrace.h                                    |   7 +++
 include/uapi/linux/bpf.h                                  |   8 +++
 kernel/bpf/syscall.c                                      | 125 ++++++++++++++++++++++++++++++++++----
 kernel/bpf/trampoline.c                                   |  95 +++++++++++++++++++++++------
 kernel/kallsyms.c                                         |  95 ++++++++++++++++++++++++++---
 kernel/trace/ftrace.c                                     | 304 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------
 net/bpf/test_run.c                                        |  55 +++++++++++++++++
 tools/include/uapi/linux/bpf.h                            |   8 +++
 tools/lib/bpf/bpf.c                                       |  24 ++++++++
 tools/lib/bpf/bpf.h                                       |   2 +
 tools/lib/bpf/libbpf.c                                    | 126 ++++++++++++++++++++++++++++++++++++++-
 tools/lib/bpf/libbpf.h                                    |   5 +-
 tools/lib/bpf/libbpf.map                                  |   2 +
 tools/testing/selftests/bpf/prog_tests/attach_test.c      |  27 +++++++++
 tools/testing/selftests/bpf/prog_tests/trampoline_batch.c |  45 ++++++++++++++
 tools/testing/selftests/bpf/progs/attach_test.c           |  62 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/trampoline_batch_test.c |  75 +++++++++++++++++++++++
 18 files changed, 995 insertions(+), 88 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/attach_test.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/trampoline_batch.c
 create mode 100644 tools/testing/selftests/bpf/progs/attach_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/trampoline_batch_test.c
