netdev - Re: [RFC bpf-next 09/16] bpf: Add BPF_TRAMPOLINE_BATCH

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAEf4Bzb_HPmGSoUX+9+LvSP2Yb95OqEQKtjpMiW1Um-rixAM8Q@mail.gmail.com>
Date:   Fri, 23 Oct 2020 13:03:22 -0700
From:   Andrii Nakryiko <andrii.nakryiko@...il.com>
To:     Jiri Olsa <jolsa@...nel.org>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andriin@...com>,
        Networking <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
        Martin KaFai Lau <kafai@...com>,
        Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
        John Fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...omium.org>, Daniel Xu <dxu@...uu.xyz>,
        Steven Rostedt <rostedt@...dmis.org>,
        Jesper Brouer <jbrouer@...hat.com>,
        Toke Høiland-Jørgensen <toke@...hat.com>,
        Viktor Malik <vmalik@...hat.com>
Subject: Re: [RFC bpf-next 09/16] bpf: Add BPF_TRAMPOLINE_BATCH_ATTACH support

On Thu, Oct 22, 2020 at 8:01 AM Jiri Olsa <jolsa@...nel.org> wrote:
>
> Adding BPF_TRAMPOLINE_BATCH_ATTACH support, that allows to attach
> tracing multiple fentry/fexit pograms to trampolines within one
> syscall.
>
> Currently each tracing program is attached in seprate bpf syscall
> and more importantly by separate register_ftrace_direct call, which
> registers trampoline in ftrace subsystem. We can save some cycles
> by simple using its batch variant register_ftrace_direct_ips.
>
> Before:
>
>  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s*
>     { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
>
>      2,199,433,771      cycles:k               ( +-  0.55% )
>        936,105,469      cycles:u               ( +-  0.37% )
>
>              26.48 +- 3.57 seconds time elapsed  ( +- 13.49% )
>
> After:
>
>  Performance counter stats for './src/bpftrace -ve kfunc:__x64_sys_s*
>     { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}' (5 runs):
>
>      1,456,854,867      cycles:k               ( +-  0.57% )
>        937,737,431      cycles:u               ( +-  0.13% )
>
>              12.44 +- 2.98 seconds time elapsed  ( +- 23.95% )
>
> The new BPF_TRAMPOLINE_BATCH_ATTACH syscall command expects
> following data in union bpf_attr:
>
>   struct {
>           __aligned_u64   in;
>           __aligned_u64   out;
>           __u32           count;
>   } trampoline_batch;
>
>   in    - pointer to user space array with file descrptors of loaded bpf
>           programs to attach
>   out   - pointer to user space array for resulting link descriptor
>   count - number of 'in/out' file descriptors
>
> Basically the new code gets programs from 'in' file descriptors and
> attaches them the same way the current code does, apart from the last
> step that registers probe ip with trampoline. This is done at the end
> with new register_ftrace_direct_ips function.
>
> The resulting link descriptors are written in 'out' array and match
> 'in' array file descriptors order.
>

I think this is a pretty hard API to use correctly from user-space.
Think about all those partially attached and/or partially detached BPF
programs. And subsequent clean up for them. Also there is nothing even
close to atomicity, so you might get a spurious invocation a few times
before batch-attach fails mid-way and the kernel (hopefully) will
detach those already attached programs in an attempt to clean
everything up. Debugging and handling that is a big pain for users,
IMO.

Here's a raw idea, let's think if it would be possible to implement
something like this. It seems like what you need is to create a set of
logically-grouped placeholders for multiple functions you are about to
attach to. Until the BPF program is attached, those placeholders are
just no-ops (e.g., they might jump to an "inactive" single trampoline,
which just immediately returns). Then you attach the BPF program
atomically into a single place, and all those no-op jumps to a
trampoline start to call the BPF program at the same time. It's not
strictly atomic, but is much closer in time with each other. Also,
because it's still a single trampoline, you get a nice mapping to a
single bpf_link, so detaching is not an issue.

Basically, maybe ftrace subsystem could provide a set of APIs to
prepare a set of functions to attach to. Then BPF subsystem would just
do what it does today, except instead of attaching to a specific
kernel function, it would attach to ftrace's placeholder. I don't know
anything about ftrace implementation, so this might be far off. But I
thought that looking at this problem from a bit of a different angle
would benefit the discussion. Thoughts?

> Signed-off-by: Jiri Olsa <jolsa@...nel.org>
> ---
>  include/linux/bpf.h      | 15 ++++++-
>  include/uapi/linux/bpf.h |  7 ++++
>  kernel/bpf/syscall.c     | 88 ++++++++++++++++++++++++++++++++++++++--
>  kernel/bpf/trampoline.c  | 69 +++++++++++++++++++++++++++----
>  4 files changed, 164 insertions(+), 15 deletions(-)
>

[...]