netdev - Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180921122500.GA17312@kernel.org>
Date:   Fri, 21 Sep 2018 09:25:00 -0300
From:   Arnaldo Carvalho de Melo <acme@...nel.org>
To:     Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Alexei Starovoitov <ast@...nel.org>,
        "David S . Miller" <davem@...emloft.net>, daniel@...earbox.net,
        netdev@...r.kernel.org, kernel-team@...com
Subject: Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog
 load/unload

Em Thu, Sep 20, 2018 at 08:14:46PM -0700, Alexei Starovoitov escreveu:
> On Thu, Sep 20, 2018 at 03:56:51PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote:
> > > PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due
> > > to having to cope with munmapping parts of existing mmaps, etc.
> > > 
> > > I'm still more in favour of introduce PERF_RECORD_MUNMAP, even if for
> > > now it would be used just in this clean case for undoing a
> > > PERF_RECORD_MMAP for a BPF program.
> > > 
> > > The ABI is already complicated, starting to use something called
> > > PERF_RECORD_MMAP for unmmaping by just using a NULL name... too clever,
> > > I think.
> > 
> > Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult
> > part was getting the perf tool to dtrt for that use-case. But if we need
> > unmap events, doing the unmap record now is the right thing.
> 
> Thanks for the pointers!
> The answer is a bit long. pls bear with me.

Ditto with me :-)
 
> I have considered adding MUNMAP to match existing MMAP, but went
> without it because I didn't want to introduce new bit in perf_event_attr
> and emit these new events in a misbalanced conditional way for prog load/unload.
> Like old perf is asking kernel for mmap events via mmap bit, so prog load events

By prog load events you mean that old perf, having perf_event_attr.mmap = 1 ||
perf_event_attr.mmap2 = 1 will cause the new kernel to emit
PERF_RECORD_MMAP records for the range of addresses that a BPF program
is being loaded on, right?

> will be in perf.data, but old perf report won't recognize them anyway.

Why not? It should lookup the symbol and find it in the rb_tree of maps,
with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the
BPF core, no? It'll be an unresolved symbol, but a resolved map.

> Whereas new perf would certainly want to catch bpf events and will set
> both mmap and mumap bits.

new perf with your code will find a symbol, not a map, because your code
catches a special case PERF_RECORD_MMAP and instead of creating a
'struct map' will create a 'struct symbol' and insert it in the kallsyms
'struct map', right?
 
> Then if I add MUNMAP event without new bit and emit MMAP/MUNMAP
> conditionally based on single mmap bit they will confuse old perf
> and it will print warning about 'unknown events'.

That is unfortunate and I'll turn that part into a pr_debug()
 
> Both situations are ugly, hence I went with reuse of MMAP event
> for both load/unload.

So, its doubly odd, i.e. MMAP used for mmap() and for munmap() and the
effects in the tooling is not to create or remove a 'struct map', but to
alter an existing symbol table for the kallsyms map.

> In such case old perf silently ignores them. Which is what I wanted.

In theory the old perf should catch the PERF_RECORD_MMAP with a string
in the filename part and insert a new map into the kernel mmap rb_tree,
and then samples would be resolved to this map, but since there is no
backing DSO with a symtab, it would stop at that, just stating that the
map is called NAME-OF-BPF-PROGRAM. This is all from memory, possibly
there is something in there that makes it ignore this PERF_RECORD_MMAP
emitted by the BPF kernel code when loading a new program.

> When we upgrade the kernel we cannot synchronize the kernel upgrade
> (or downgrade) with user space perf package upgrade.

sure

> Hence not confusing old perf is important.

Thanks for trying to achieve that, and its a pity that that "unknown
record" message is a pr_warning or pr_info and not a pr_debug().

> With new kernel new bpf mmap events get into perf.data and
> new perf picks them up.
> 
> Few more considerations:
> 
> I consider synthetic perf events to be non-ABI. Meaning they're
> emitted by perf user space into perf.data and there is a convention
> on names, but it's not a kernel abi. Like RECORD_MMAP with
> event.filename == "[module_name]" is an indication for perf report
> to parse elf/build-id of dso==module_name.
> There is no such support in the kernel. Kernel doesn't emit
> such events for module load/unload. If in the future
> we decide to extend kernel with such events they don't have
> to match what user space perf does today.

Right, that is another unfortunate state of affairs, kernel module
load/unload should already be supported, reported by the kernel via a
proper PERF_RECORD_MODULE_LOAD/UNLOAD
 
> Why this is important? To get to next step.
> As Arnaldo pointed out this patch set is missing support for
> JITed prog annotations and displaying asm code. Absolutely correct.
> This set only helps perf to reveal the names of bpf progs that _were_
> running at the time of perf record, but there is no way yet for
> perf report to show asm code of the prog that was running.
> In that sense bpf is drastically different from java, other jits
> and normal profiling.
> bpf JIT happens in the kernel and only kernel knows the mapping
> between original source and JITed code.
> In addition there are bpf2bpf functions. In the future there will
> be bpf libraries, more type info, line number support, etc.
> I strongly believe perf RECORD_* events should NOT care about
> the development that happens on the bpf side.
> The only thing kernel will be telling user space is that bpf prog
> with prog_id=X was loaded.
> Then based on prog_id the 'perf record' phase can query the kernel
> for bpf related information. There is already a way to fetch
> JITed image based on prog_id.
> Then perf will emit synthetic RECORD_FOOBAR into perf.data
> that will contain bpf related info (like complete JITed image)
> and perf report can process it later and annotate things in UI.

There is another longstanding TODO list entry: PERF_RECORD_MMAP records
should include a build-id, to avoid either userspace getting confused
when there is an update of some mmap DSO, for long running sessions, for
instance, or to have to scan the just recorded perf.data file for DSOs
with samples to then read it from the file system (more races).

Have you ever considered having a build-id for bpf objects that could be
used here?
 
> It may seem that there is a race here.
> Like when 'perf record' see 'bpf prog was loaded with prog_id=X' event
> it will ask the kernel about prog_id=X, but that prog could be
> unloaded already.
> In such case prog_id will not exist and perf record can ignore such event.
> So no race.
> The kernel doesn't need to pass all information about bpf prog to
> the user space via RECORD_*. Instead 'perf record' can emit
> synthetic events into perf.data.
> I was thinking to extend RECORD_MMAP with prog_id already

Right, if prog_id is a uniquely identifying cookie that is based on the
file contents, like files in a git repo, then this would serve for both
bpf progs and for everything else.

Will the JITed code from some BPF bytecode be different if the same
bytecode is JITed in several machines, all having the exact same
hardware?

> (instead of passing kallsyms's bpf prog name in event->mmap.filename)
> but bpf functions don't have their own prog_id. Multiple bpf funcs
> with different JITed blobs are considered to be a part of single prog_id.
> So as a step one I'm only extending RECORD_MMAP with addr and kallsym
> name of bpf function/prog.
> As a step two the plan is to add notification mechanism for prog load/unload
> that will include prog_id and design new synthetic RECORD_* events in
> perf user space that will contain JITed code, line info, BTF, etc.

So, will the kernel JIT a bytecode, load it somewhere and run it, then,
when unloading it, keep it somewhere (a filesystem with some limits,
etc) so that at some later time (with some timeouts) tooling can, using
its id/buildid cookie get the contents and symbol table to have a better
annotation experience?

Using /proc/kcore as right now we should be able to annotate the BPF JIT
as it is loaded, in 'perf top', for instance, with your suggested code,
because we would have a symbol resolved that spans the whole bpf
program, I'll try applying your patch and seeing how it goes.

> TLDR:
> step 1 (this patch set)
> Single bpf prog_load can call multiple
> bpf_prog_kallsyms_add() -> RECORD_MMAP with addr+kallsym only
> Similarly unload calls multiple
> bpf_prog_kallsyms_del() -> RECORD_MMAP with addr only
> 
> step 2 (future work)
> single event for bpf prog_load with prog_id only.
> Either via perf ring buffer or ftrace or tracepoints or some
> other notification mechanism.
> 
> It may seem that step 2 obsoletes step 1. It can, but I think
> it will complement it. There is a lot more code there and
> a lot more discussions to have.
> Step 1 is already big improvement.
> 
> Thoughts?

See above, and, bear with me too :-)

- Arnaldo