lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <B1C5F28C-5359-4E03-88B6-726FA9C73BB4@fb.com>
Date:   Tue, 27 Nov 2018 19:04:35 +0000
From:   Song Liu <songliubraving@...com>
To:     Peter Zijlstra <peterz@...radead.org>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "ast@...nel.org" <ast@...nel.org>,
        "daniel@...earbox.net" <daniel@...earbox.net>,
        "acme@...nel.org" <acme@...nel.org>,
        Kernel Team <Kernel-team@...com>
Subject: Re: [PATCH perf,bpf 0/5] reveal invisible bpf programs



> On Nov 26, 2018, at 12:00 PM, Song Liu <songliubraving@...com> wrote:
> 
> 
> 
>> On Nov 26, 2018, at 6:50 AM, Peter Zijlstra <peterz@...radead.org> wrote:
>> 
>> On Thu, Nov 22, 2018 at 06:13:32PM +0000, Song Liu wrote:
>>> Hi Peter,
>>> 
>>>> On Nov 22, 2018, at 1:32 AM, Peter Zijlstra <peterz@...radead.org> wrote:
>>>> 
>>>> On Wed, Nov 21, 2018 at 11:54:57AM -0800, Song Liu wrote:
>>>>> Changes RFC -> PATCH v1:
>>>>> 
>>>>> 1. In perf-record, poll vip events in a separate thread;
>>>>> 2. Add tag to bpf prog name;
>>>>> 3. Small refactorings.
>>>>> 
>>>>> Original cover letter (with minor revisions):
>>>>> 
>>>>> This is to follow up Alexei's early effort to show bpf programs
>> 
>>>>> In this version, PERF_RECORD_BPF_EVENT is introduced to send real time BPF
>>>>> load/unload events to user space. In user space, perf-record is modified
>>>>> to listen to these events (through a dedicated ring buffer) and generate
>>>>> detailed information about the program (struct bpf_prog_info_event). Then,
>>>>> perf-report translates these events into proper symbols.
>>>>> 
>>>>> With this set, perf-report will show bpf program as:
>>>>> 
>>>>> 18.49%     0.16%  test  [kernel.vmlinux]    [k] ksys_write
>>>>> 18.01%     0.47%  test  [kernel.vmlinux]    [k] vfs_write
>>>>> 17.02%     0.40%  test  bpf_prog            [k] bpf_prog_07367f7ba80df72b_
>>>>> 16.97%     0.10%  test  [kernel.vmlinux]    [k] __vfs_write
>>>>> 16.86%     0.12%  test  [kernel.vmlinux]    [k] comm_write
>>>>> 16.67%     0.39%  test  [kernel.vmlinux]    [k] bpf_probe_read
>>>>> 
>>>>> Note that, the program name is still work in progress, it will be cleaner
>>>>> with function types in BTF.
>>>>> 
>>>>> Please share your comments on this.
>>>> 
>>>> So I see:
>>>> 
>>>> kernel/bpf/core.c:void bpf_prog_kallsyms_add(struct bpf_prog *fp)
>>>> 
>>>> which should already provide basic symbol information for extant eBPF
>>>> programs, right?
>>> 
>>> Right, if the BPF program is still loaded when perf-report runs, symbols 
>>> are available. 
>> 
>> Good, that is not something that was clear. The Changelog seems to imply
>> we need this new stuff in order to observe symbols.
> 
> I will clarify this in next version. 
> 
>> 
>>>> And (AFAIK) perf uses /proc/kcore for annotate on the current running
>>>> kernel (if not, it really should, given alternatives, jump_labels and
>>>> all other other self-modifying code).
>>>> 
>>>> So this fancy new stuff is only for the case where your profile spans
>>>> eBPF load/unload events (which should be relatively rare in the normal
>>>> case, right), or when you want source annotated asm output (I normally
>>>> don't bother with that).
>>> 
>>> This patch set adds two pieces of information:
>>> 1. At the beginning of perf-record, save info of existing BPF programs;
>>> 2. Gather information of BPF programs load/unload during perf-record. 
>>> 
>>> (1) is all in user space. It is necessary to show symbols of BPF program
>>> that are unloaded _after_ perf-record. (2) needs PERF_RECORD_BPF_EVENT 
>>> from the ring buffer. It covers BPF program loaded during perf-record 
>>> (perf record -- bpf_test). 
>> 
>> I'm saying that if you given them symbols; most people won't need any of
>> that ever.
>> 
>> And just tracking kallsyms is _much_ cheaper than either 1 or 2. Alexei
>> was talking fairly big amounts of data per BPF prog. Dumping and saving
>> that sounds like pointless overhead for 99% of the users.
> 
> Due to the kernel-module-like natural of BPF program, I think it is still
> necessary to cover cases that BPF programs are unloaded when perf-record
> runs. How about we add another step that scans all bpf_prog_XXXX from
> kallsyms, and synthesizes symbols for them?
> 
>> 
>>>> That is; I would really like this fancy stuff to be an optional extra
>>>> that is typically not needed.
>>>> 
>>>> Does that make sense?
>>> 
>>> (1) above is always enabled with this set. I added option no-bpf-events 
>>> to disable (2). I guess you prefer the (2) is disabled by default, and 
>>> enabled with an option?
>> 
>> I'm saying neither should be default enabled. Instead we should do
>> recording and tracking by default.
>> 
>> That gets people symbol information on BPF stuff, which is likely all
>> they ever need.
> 
> How about we extend PERF_RECORD_BPF_EVENT with basic symbol information
> (name, addr, length)? By default, we just record these events in the ring
> buffer, just like mmap events. This (plus scanning kallsyms before record)
> will enable symbols for all BPF programs. 
> 
> For more information, we add an option to enable more information 
> (annotated asm, etc.), with dedicated dummy event, thread, and ring buffer. 
> 

After syncing with Alexei offline, I realized this won't work cleanly. 

My initial idea is to extend PERF_RECORD_BPF_EVENT like:

struct bpf_event {
        struct perf_event_header header;
        u16 type;
        u16 flags;
        u32 id;          /* bpf prog_id */

        u64 start_addr;
        u32 length;    
        char name[KSYM_NAME_LEN];
};

This is a structure with variable length, which is OK. 

However, I missed the fact that each bpf program could have up to 256 sub
programs. To fit that into single bpf_event need some pretty ugly hack:

struct bpf_sub_prog {
        u64 start_addr;
        u32 length;    /* or length */
        u32 name_len;  /* length of name, struct bpf_event need it */
        char name[KSYM_NAME_LEN];
};

struct bpf_event {
        struct perf_event_header header;
        u16 type;
        u16 flags;
        u32 id;          /* bpf prog_id */
        u64 num_of_sub_prog;
        
	struct bpf_sub_prog sub_progs[];
};

In this case, bpf_sub_prog has variable length, thus struct bpf_event has
variable number of variable length members. This not impossible to handle
these, but it will be ugly for sure. 

One alternative to this approach is to keep one sub program per bpf_event, 
but generate multiple bpf_event (one for each sub program). Another solution
is to generate separate records (same or different PERF_RECORD_*) for 
bpf_event and bpf_sub_prog. 

However, I don't think any of these yield as clean perf ABI as current 
version. 

In summary, I can think of the following directions:

1. Current design. 
   Pros: simple perf ABI; 
   Cons: more work in perf tool.
2. One bpf_event per BPF prog, variable number of variable length variable.
   Pros: less work for perf tool, if annotation is not needed; 
   Cons: ugly perf ABI. 
3. Multiple bpf_event per BPF prog (one per sub prog). 
   Pros: less work for perf tool, if annotation is not needed; 
   Cons: need coordinate multiple records (same prog_id, different sub prog).
4. Separate bpf_event and bpf_sub_prog_event. 
   Pros: less work for perf tool, if annotation is not needed; 
   Cons: more complicated perf ABI (two PERF_RECORD_* or more complicated 
         encoding for the same PERF_RECORD_). 

Overall, I think current design is the best for simplest ABI. And it gives 
flexibility to tune what information perf tool get from bpf syscalls. 

Peter and Arnaldo, what's your recommendation on these directions (or other
better alternatives)?

Thanks,
Song

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ