[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAPhsuW7BO3gaaTPEyn2XdFTb+y5JBxU0oocJvMak=4jGHTJTqw@mail.gmail.com>
Date: Mon, 15 Oct 2018 16:33:26 -0700
From: Song Liu <liu.song.a23@...il.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: acme@...nel.org, Peter Zijlstra <peterz@...radead.org>,
Alexei Starovoitov <ast@...nel.org>,
"David S . Miller" <davem@...emloft.net>,
Daniel Borkmann <daniel@...earbox.net>,
Networking <netdev@...r.kernel.org>, kernel-team@...com
Subject: Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On Fri, Sep 21, 2018 at 3:15 PM Alexei Starovoitov
<alexei.starovoitov@...il.com> wrote:
>
> On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote:
> >
> > > I have considered adding MUNMAP to match existing MMAP, but went
> > > without it because I didn't want to introduce new bit in perf_event_attr
> > > and emit these new events in a misbalanced conditional way for prog load/unload.
> > > Like old perf is asking kernel for mmap events via mmap bit, so prog load events
> >
> > By prog load events you mean that old perf, having perf_event_attr.mmap = 1 ||
> > perf_event_attr.mmap2 = 1 will cause the new kernel to emit
> > PERF_RECORD_MMAP records for the range of addresses that a BPF program
> > is being loaded on, right?
>
> right. it would be weird when prog load events are there, but not unload.
>
> > > will be in perf.data, but old perf report won't recognize them anyway.
> >
> > Why not? It should lookup the symbol and find it in the rb_tree of maps,
> > with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the
> > BPF core, no? It'll be an unresolved symbol, but a resolved map.
> >
> > > Whereas new perf would certainly want to catch bpf events and will set
> > > both mmap and mumap bits.
> >
> > new perf with your code will find a symbol, not a map, because your code
> > catches a special case PERF_RECORD_MMAP and instead of creating a
> > 'struct map' will create a 'struct symbol' and insert it in the kallsyms
> > 'struct map', right?
>
> right.
> bpf progs are more similar to kernel functions than to modules.
> For modules it makes sense to create a new map and insert symbols into it.
> For bpf JITed images there is no DSO to parse.
> Single bpf elf file may contain multiple bpf progs and each prog may contain
> multiple bpf functions. They will be loaded at different time and
> will have different life time.
>
> > In theory the old perf should catch the PERF_RECORD_MMAP with a string
> > in the filename part and insert a new map into the kernel mmap rb_tree,
> > and then samples would be resolved to this map, but since there is no
> > backing DSO with a symtab, it would stop at that, just stating that the
> > map is called NAME-OF-BPF-PROGRAM. This is all from memory, possibly
> > there is something in there that makes it ignore this PERF_RECORD_MMAP
> > emitted by the BPF kernel code when loading a new program.
>
> In /proc/kcore there is already a section for module range.
> Hence when perf processes bpf load/unload events the map is already created.
> Therefore the patch 3 only searches for it and inserts new symbol into it.
>
> In that sense the reuse of RECORD_MMAP event for bpf progs is indeed
> not exactly clean, since no new map is created.
> It's probably better to introduce PERF_RECORD_[INSERT|ERASE]_KSYM events ?
>
> Such event potentially can be used for offline ksym resolution.
> perf could process /proc/kallsyms during perf record and emit all of them
> as synthetic PERF_RECORD_INSERT_KSYM into perf.data, so perf report can run
> on a different server and still find the right symbols.
>
> I guess, we can do bpf specific events too and keep RECORD_MMAP as-is.
> How about single PERF_RECORD_BPF event with internal flag for load/unload ?
>
> > Right, that is another unfortunate state of affairs, kernel module
> > load/unload should already be supported, reported by the kernel via a
> > proper PERF_RECORD_MODULE_LOAD/UNLOAD
>
> I agree with Peter here. It would nice, but low priority.
> modules are mostly static. Loaded once and stay there.
>
> > There is another longstanding TODO list entry: PERF_RECORD_MMAP records
> > should include a build-id, to avoid either userspace getting confused
> > when there is an update of some mmap DSO, for long running sessions, for
> > instance, or to have to scan the just recorded perf.data file for DSOs
> > with samples to then read it from the file system (more races).
> >
> > Have you ever considered having a build-id for bpf objects that could be
> > used here?
>
> build-id concept is not applicable to bpf.
> bpf elf files on the disc don't have good correlation with what is
> running in the kernel. bpf bytestream is converted and optimized
> by the verifier. Then JITed.
> So debug info left in .o file and original bpf bytestream in .o are
> mostly useless.
> For bpf programs we have 'program tag'. It is computed over original
> bpf bytestream, so both kernel and user space can compute it.
> In libbcc we use /var/tmp/bcc/bpf_prog_TAG/ directory to store original
> source code of the program, so users looking at kernel stack traces
> with bpf_prog_TAG can find the source.
> It's similar to build-id, but not going to help perf to annotate
> actual x86 instructions inside JITed image and show src code.
> Since JIT runs in the kernel this problem cannot be solved by user space only.
> It's a difficult problem and we have a plan to tackle that,
> but it's step 2. A bunch of infra is needed on bpf side to
> preserve the association during src_in_C -> original bpf insns ->
> translated bpf insns -> JITed asm.
> Then report it back to user space and then teach perf to properly annotate progs.
>
> > Will the JITed code from some BPF bytecode be different if the same
> > bytecode is JITed in several machines, all having the exact same
> > hardware?
>
> Yes. JITed code will be different depending on sysctl knobs (like const blinding)
> So the same original bpf byte stream loaded at different times
> may have different JITed image.
>
> Even without security features like blinding the JIT can be different.
> the bpf maps are separate from bpf progs. The bpf map is created first.
> Then the same bpf instruction stream (depending on map type that it uses)
> may be optimized by the verifier differently causing difference in JIT.
>
> > > (instead of passing kallsyms's bpf prog name in event->mmap.filename)
> > > but bpf functions don't have their own prog_id. Multiple bpf funcs
> > > with different JITed blobs are considered to be a part of single prog_id.
> > > So as a step one I'm only extending RECORD_MMAP with addr and kallsym
> > > name of bpf function/prog.
> > > As a step two the plan is to add notification mechanism for prog load/unload
> > > that will include prog_id and design new synthetic RECORD_* events in
> > > perf user space that will contain JITed code, line info, BTF, etc.
> >
> > So, will the kernel JIT a bytecode, load it somewhere and run it, then,
> > when unloading it, keep it somewhere (a filesystem with some limits,
> > etc) so that at some later time (with some timeouts) tooling can, using
> > its id/buildid cookie get the contents and symbol table to have a better
> > annotation experience?
>
> yes. The plan is to let perf fetch JITed image via BPF_OBJ_GET_INFO_BY_FD cmd
> during perf record run and store it inside perf.data in synthetic records.
> Then perf report can annotate it later.
Hi Peter and Arnaldo,
I am working with Alexei on the idea of fetching BPF program information via
BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT
to perf_event_type, and dumped these events to perf event ring buffer.
I found that perf will not process event until the end of perf-record:
root@...t-test:~# ~/perf record -ag -- sleep 10
...... 10 seconds later
[ perf record: Woken up 34 times to write data ]
machine__process_bpf_event: prog_id 6 loaded
machine__process_bpf_event: prog_id 6 unloaded
[ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ]
In this example, the bpf program was loaded and then unloaded in
another terminal. When machine__process_bpf_event() processes
the load event, the bpf program is already unloaded. Therefore,
machine__process_bpf_event() will not be able to get information
about the program via BPF_OBJ_GET_INFO_BY_FD cmd.
To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD
as soon as perf get the event from kernel. I looked around the perf
code for a while. But I haven't found a good example where some
events are processed before the end of perf-record. Could you
please help me with this?
Thanks,
Song
Powered by blists - more mailing lists