Date:   Wed, 20 Dec 2017 18:03:26 -0800
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Teng Qin <qinteng@...com>, mingo@...hat.com, bgregg@...flix.com,
        daniel@...earbox.net, linux-kernel@...r.kernel.org,
        kernel-team@...com, netdev@...r.kernel.org,
        "David S. Miller" <davem@...emloft.net>, brouer@...hat.com
Subject: tracepoint_probe_register and bpf. Was: [PATCH tip 0/3] Improvements
 of scheduler related Tracepoints

On Mon, Dec 18, 2017 at 10:11:57AM +0100, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 09:09:51AM -0800, Alexei Starovoitov wrote:
> 
> > yeah. Currently bpf progs are called at the end of
> > perf_trace_##call()
> > {
> >   .. regular tracepoint copy craft
> >   perf_trace_run_bpf_submit( &copied args )
> > }
> > 
> > from bpf pov we'd rather get access to raw args passed into
> > perf_trace_##call.
> > Sounds like you're suggesting to let bpf side register its
> > progs directly via tracepoint_probe_register() ?
> > That would solve the whole thing really nicely indeed.
> 
> Not sure I thought that far; but if you want the probe arguments either
> grab them from the perf probe you're running in or indeed register your
> own, don't play silly buggers and copy them in the trace data.

So I've tried both approaches:

1. grab arguments from:
perf_trace_##call(void *__data, proto)

it works ok, but it adds 50-70 bytes of .text to every perf_trace_*
function, which, multiplied by the number of tracepoints (~500),
adds ~30k of .text to vmlinux in total.
Not too bad, but it also adds an extra branch to the critical
execution path, so it's not ideal.

2. register your own
This approach, I think, is much better.
The trick I'm doing is the following:
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)  \
/* no 'static' here. The bpf probe functions are global */              \
notrace void                                                            \
__bpf_trace_##call(void *__data, proto)                                 \
{                                                                       \
        struct trace_event_call *event_call = __data;                   \
        \
        __FN_COUNT(bpf_trace_run, args)(event_call, CAST_TO_U64(args)); \
}

Where CAST_TO_U64 is a macro to cast all args to u64 one by one.
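Roughly (only to illustrate the idea, not the exact macro), it can be
built the same way the run function is picked, one cast per argument:

/* sketch only: the __FN_COUNT plumbing is assumed here, and pointer
 * args may need an extra (unsigned long) step to stay warning-free
 * on 32-bit builds */
#define __CAST1(a)        ((u64)(a))
#define __CAST2(a, ...)   ((u64)(a)), __CAST1(__VA_ARGS__)
#define __CAST3(a, ...)   ((u64)(a)), __CAST2(__VA_ARGS__)
/* ... one per supported number of tracepoint args ... */
#define CAST_TO_U64(...)  __FN_COUNT(__CAST, __VA_ARGS__)(__VA_ARGS__)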
bpf_trace_run*() functions are defined as:
void bpf_trace_run1(struct trace_event_call *call, u64 arg1);
void bpf_trace_run2(struct trace_event_call *call, u64 arg1, u64 arg2);
void bpf_trace_run3(struct trace_event_call *call, u64 arg1, u64 arg2, u64 arg3);
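
A rough sketch of what the body of one of these could be (the 'prog'
field on the event_call and the locking details are assumptions of the
sketch, not necessarily what the patch does):

void bpf_trace_run3(struct trace_event_call *call, u64 arg1, u64 arg2, u64 arg3)
{
        u64 args[3] = { arg1, arg2, arg3 };
        struct bpf_prog *prog = READ_ONCE(call->prog);  /* field assumed */

        /* rcu/preemption and recursion protection elided for brevity */
        if (prog)
                BPF_PROG_RUN(prog, args);  /* the u64 array is the bpf ctx */
}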

so in assembler it looks like:
(gdb) disassemble __bpf_trace_xdp_exception
Dump of assembler code for function __bpf_trace_xdp_exception:
   0xffffffff81132080 <+0>:	mov    %ecx,%ecx
   0xffffffff81132082 <+2>:	jmpq   0xffffffff811231f0 <bpf_trace_run3>

where

TRACE_EVENT(xdp_exception,
        TP_PROTO(const struct net_device *dev,
                 const struct bpf_prog *xdp, u32 act),
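
so for this event the DECLARE_EVENT_CLASS macro above expands to roughly:

notrace void
__bpf_trace_xdp_exception(void *__data, const struct net_device *dev,
                          const struct bpf_prog *xdp, u32 act)
{
        struct trace_event_call *event_call = __data;

        bpf_trace_run3(event_call, (u64)dev, (u64)xdp, (u64)act);
}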

The assembler snippet above casts the 32-bit 'act' field to 'u64'
(the mov %ecx,%ecx zero-extends it) before the tail jump into
bpf_trace_run3(); all other arguments are already 64-bit and are
passed through in their registers as-is.
So all ~500 __bpf_trace_*() functions are only 5-10 _bytes_ long,
and in total this approach adds 7k bytes to .text and 8k bytes
to .rodata. The .rodata cost comes from kallsyms: I want these
functions to appear there, so I don't have to add another 8-byte
field to 'struct trace_event_class' and 'struct trace_event_call'.

This approach, I think, is the lowest overhead we can possibly have
while calling trace_xdp_exception() from kernel C code and
transitioning into bpf land.
Since many folks are starting to use tracepoint+bpf at speeds of
1M+ events per second, this would be a very valuable optimization.

Diff stat so far:
 drivers/gpu/drm/i915/i915_trace.h | 13 ++++++++++---
 fs/btrfs/super.c                  |  4 ++++
 include/linux/trace_events.h      | 27 +++++++++++++++++++++++++++
 include/trace/bpf_probe.h         | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/trace/define_trace.h      |  1 +
 include/uapi/linux/bpf.h          |  4 ++++
 kernel/trace/bpf_trace.c          | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 191 insertions(+), 3 deletions(-)

Since I'm not touching anything on the ftrace or perf side,
I'm thinking of extending the BPF_PROG_ATTACH command to specify
which tracepoint to attach to.
The user space will look like:

prog_fd = bpf_prog_load(...);
bpf_prog_attach(prog_fd, "xdp_exception");

On the kernel side I will walk kallsyms to
find the __tracepoint_xdp_exception record, check that
tp->name == "xdp_exception",
then find the __bpf_trace_xdp_exception() address (also in kallsyms),
and finally use tracepoint_probe_register() to connect the two.
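
Something along these lines (only a sketch; the helper name and the
prog/event_call plumbing are assumptions):

#include <linux/errno.h>
#include <linux/kallsyms.h>
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/tracepoint.h>
#include <linux/trace_events.h>

static int bpf_attach_to_tracepoint(struct bpf_prog *prog,
                                    struct trace_event_call *event_call,
                                    const char *tp_name)
{
        char sym[KSYM_NAME_LEN];
        struct tracepoint *tp;
        void *bpf_probe;

        /* find the __tracepoint_<name> record via kallsyms */
        snprintf(sym, sizeof(sym), "__tracepoint_%s", tp_name);
        tp = (struct tracepoint *)kallsyms_lookup_name(sym);
        if (!tp || strcmp(tp->name, tp_name))
                return -ENOENT;

        /* find the matching __bpf_trace_<name>() stub, also via kallsyms */
        snprintf(sym, sizeof(sym), "__bpf_trace_%s", tp_name);
        bpf_probe = (void *)kallsyms_lookup_name(sym);
        if (!bpf_probe)
                return -ENOENT;

        /*
         * The data registered here becomes the __data argument of the
         * __bpf_trace_<name>() stub; how the prog is stashed so that
         * bpf_trace_run*() can find it is left out of this sketch
         * (which is why 'prog' is unused here).
         */
        return tracepoint_probe_register(tp, bpf_probe, event_call);
}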

Thoughts?
