netdev - Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180921221343.g52n7c4edisvice3@ast-mbp>
Date:   Fri, 21 Sep 2018 15:13:44 -0700
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Arnaldo Carvalho de Melo <acme@...nel.org>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Alexei Starovoitov <ast@...nel.org>,
        "David S . Miller" <davem@...emloft.net>, daniel@...earbox.net,
        netdev@...r.kernel.org, kernel-team@...com
Subject: Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog
 load/unload

On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote:
>  
> > I have considered adding MUNMAP to match existing MMAP, but went
> > without it because I didn't want to introduce new bit in perf_event_attr
> > and emit these new events in a misbalanced conditional way for prog load/unload.
> > Like old perf is asking kernel for mmap events via mmap bit, so prog load events
> 
> By prog load events you mean that old perf, having perf_event_attr.mmap = 1 ||
> perf_event_attr.mmap2 = 1 will cause the new kernel to emit
> PERF_RECORD_MMAP records for the range of addresses that a BPF program
> is being loaded on, right?

right. it would be weird when prog load events are there, but not unload.

> > will be in perf.data, but old perf report won't recognize them anyway.
> 
> Why not? It should lookup the symbol and find it in the rb_tree of maps,
> with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the
> BPF core, no? It'll be an unresolved symbol, but a resolved map.
> 
> > Whereas new perf would certainly want to catch bpf events and will set
> > both mmap and mumap bits.
> 
> new perf with your code will find a symbol, not a map, because your code
> catches a special case PERF_RECORD_MMAP and instead of creating a
> 'struct map' will create a 'struct symbol' and insert it in the kallsyms
> 'struct map', right?

right.
bpf progs are more similar to kernel functions than to modules.
For modules it makes sense to create a new map and insert symbols into it.
For bpf JITed images there is no DSO to parse.
Single bpf elf file may contain multiple bpf progs and each prog may contain
multiple bpf functions. They will be loaded at different time and
will have different life time.

> In theory the old perf should catch the PERF_RECORD_MMAP with a string
> in the filename part and insert a new map into the kernel mmap rb_tree,
> and then samples would be resolved to this map, but since there is no
> backing DSO with a symtab, it would stop at that, just stating that the
> map is called NAME-OF-BPF-PROGRAM. This is all from memory, possibly
> there is something in there that makes it ignore this PERF_RECORD_MMAP
> emitted by the BPF kernel code when loading a new program.

In /proc/kcore there is already a section for module range.
Hence when perf processes bpf load/unload events the map is already created.
Therefore the patch 3 only searches for it and inserts new symbol into it.

In that sense the reuse of RECORD_MMAP event for bpf progs is indeed
not exactly clean, since no new map is created.
It's probably better to introduce PERF_RECORD_[INSERT|ERASE]_KSYM events ?

Such event potentially can be used for offline ksym resolution.
perf could process /proc/kallsyms during perf record and emit all of them
as synthetic PERF_RECORD_INSERT_KSYM into perf.data, so perf report can run
on a different server and still find the right symbols.

I guess, we can do bpf specific events too and keep RECORD_MMAP as-is.
How about single PERF_RECORD_BPF event with internal flag for load/unload ?

> Right, that is another unfortunate state of affairs, kernel module
> load/unload should already be supported, reported by the kernel via a
> proper PERF_RECORD_MODULE_LOAD/UNLOAD

I agree with Peter here. It would nice, but low priority.
modules are mostly static. Loaded once and stay there.

> There is another longstanding TODO list entry: PERF_RECORD_MMAP records
> should include a build-id, to avoid either userspace getting confused
> when there is an update of some mmap DSO, for long running sessions, for
> instance, or to have to scan the just recorded perf.data file for DSOs
> with samples to then read it from the file system (more races).
> 
> Have you ever considered having a build-id for bpf objects that could be
> used here?

build-id concept is not applicable to bpf.
bpf elf files on the disc don't have good correlation with what is
running in the kernel. bpf bytestream is converted and optimized
by the verifier. Then JITed.
So debug info left in .o file and original bpf bytestream in .o are
mostly useless.
For bpf programs we have 'program tag'. It is computed over original
bpf bytestream, so both kernel and user space can compute it.
In libbcc we use /var/tmp/bcc/bpf_prog_TAG/ directory to store original
source code of the program, so users looking at kernel stack traces
with bpf_prog_TAG can find the source.
It's similar to build-id, but not going to help perf to annotate
actual x86 instructions inside JITed image and show src code.
Since JIT runs in the kernel this problem cannot be solved by user space only.
It's a difficult problem and we have a plan to tackle that,
but it's step 2. A bunch of infra is needed on bpf side to
preserve the association during src_in_C -> original bpf insns ->
translated bpf insns -> JITed asm.
Then report it back to user space and then teach perf to properly annotate progs.

> Will the JITed code from some BPF bytecode be different if the same
> bytecode is JITed in several machines, all having the exact same
> hardware?

Yes. JITed code will be different depending on sysctl knobs (like const blinding)
So the same original bpf byte stream loaded at different times
may have different JITed image.

Even without security features like blinding the JIT can be different.
the bpf maps are separate from bpf progs. The bpf map is created first.
Then the same bpf instruction stream (depending on map type that it uses)
may be optimized by the verifier differently causing difference in JIT.

> > (instead of passing kallsyms's bpf prog name in event->mmap.filename)
> > but bpf functions don't have their own prog_id. Multiple bpf funcs
> > with different JITed blobs are considered to be a part of single prog_id.
> > So as a step one I'm only extending RECORD_MMAP with addr and kallsym
> > name of bpf function/prog.
> > As a step two the plan is to add notification mechanism for prog load/unload
> > that will include prog_id and design new synthetic RECORD_* events in
> > perf user space that will contain JITed code, line info, BTF, etc.
> 
> So, will the kernel JIT a bytecode, load it somewhere and run it, then,
> when unloading it, keep it somewhere (a filesystem with some limits,
> etc) so that at some later time (with some timeouts) tooling can, using
> its id/buildid cookie get the contents and symbol table to have a better
> annotation experience?

yes. The plan is to let perf fetch JITed image via BPF_OBJ_GET_INFO_BY_FD cmd
during perf record run and store it inside perf.data in synthetic records.
Then perf report can annotate it later.

> Using /proc/kcore as right now we should be able to annotate the BPF JIT
> as it is loaded, in 'perf top', for instance, with your suggested code,
> because we would have a symbol resolved that spans the whole bpf
> program, I'll try applying your patch and seeing how it goes.

yes. /proc/kcore helps to annotate during perf report when progs
are still loaded, but user experience sucks.
Like for networking bpf performance testing the user needs to:
- load the program
- start blasting traffic to trigger prog execution
- perf record -a sleep 5
- stop traffic
- perf report -> to analyze what's going on in bpf prog
- unload prog
All steps are manual.

We need to get to the point where we can:
- perf record -a ./my_bpf_test
- perf report
where my_bpf_test will load, attach, run traffic, detach, unload.
Most of selftests/bpf/* are done this way, but we cannot profile
them at the moment.