linux-kernel - Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Thu, 2 Jul 2015 17:24:38 +0800
From:	"Wangnan (F)" <wangnan0@...wei.com>
To:	Alexei Starovoitov <ast@...mgrid.com>,
	He Kuang <hekuang@...wei.com>,
	Peter Zijlstra <peterz@...radead.org>
CC:	<rostedt@...dmis.org>, <masami.hiramatsu.pt@...achi.com>,
	<mingo@...hat.com>, <acme@...hat.com>, <jolsa@...nel.org>,
	<namhyung@...nel.org>, <linux-kernel@...r.kernel.org>,
	pi3orama <pi3orama@....com>
Subject: Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event



On 2015/7/2 11:52, Alexei Starovoitov wrote:
> On 7/1/15 8:38 PM, He Kuang wrote:
>>
>>
>> On 2015/7/2 10:48, Alexei Starovoitov wrote:
>>> On 7/1/15 4:58 AM, Peter Zijlstra wrote:
>>>>
>>>> But why create a separate trace buffer, it should go into the regular
>>>> perf buffer.
>>>
>>> +1
>>>
>>> I think
>>> +static char __percpu *perf_extra_trace_buf[PERF_NR_CONTEXTS];
>>> is redundant.
>>> It adds quite a bit of unnecessary complexity to the whole patch set.
>>>
>>> Also the call to bpf_output_sample() is not effective unless program
>>> returns 1. It's a confusing user interface.
>>>
>>> Also you cannot ever do:
>>>       BPF_FUNC_probe_read,
>>> +    BPF_FUNC_output_sample,
>>>       BPF_FUNC_ktime_get_ns,
>>> new functions must be added to the end.
>>>
>>> Why not just do:
>>> perf_trace_buf_prepare() + perf_trace_buf_submit() from the helper?
>>> No changes to current code.
>>> No need to call __get_data_size() and other overhead.
>>> The helper can be called multiple times from the same program.
>>> imo much cleaner.
>>>
>>
>> Invoke perf_trace_buf_submit() will generate a second perf
>> event (header->type = PERF_RECORD_SAMPLE) entry which is
>> different from the event entry outputed by the orignial
>> kprobe. So the final result of the example in 00/00 patch may
>> like this:
>>
>>    sample entry 1(from bpf_prog):
>>      comm timestamp1 generic_perform_write pmu_value=0x1234
>>    sample entry 2(from original kprobe):
>>      comm timestamp2 generic_perform_write: (ffffffff81140b60)
>> Compared with current implementation:
>>    combined sample entry:
>>      comm timestamp generic_perform_write: (ffffffff81140b60)
>> pmu_value=0x1234
>>
>> The former two entries may be discontinuous as there are multiple
>> threads and kprobes to be recorded, and there's a chance that one
>> entry is missed but the other is recorded. What we need is the
>> pmu_value read when 'generic_perform_write' enters, the two
>> entries result is not intuitive enough and userspace tools have
>> to do the work to find and combine those two sample entries to
>> get the result.
>
> Just change your example to return 0 and user space will see
> one sample.
>

Yes, by using perf_trace_buf_prepare() + perf_trace_buf_submit() in
helper function and let bpf program always returns 0 we can make data
collected by BPF programs output into samples, if following problems
are solved:

  1. In bpf program there's no way to get 'struct perf_event' or 'struct
     ftrace_event_call'. We have to deduce them through pt_regs:

     pt_regs -> ip -> kprobe -> struct trace_kprobe -> struct
      ftrace_event_call -> hlist_entry -> struct perf_event

     Which seems dirty, but without that we can't call
     perf_trace_buf_submit().

  2. Even if we finally get 'struct perf_event', I'm not sure whether
     user really concern on it. If we really concern on all information
     output through perf_trace_buf_submit() like callstack and
     register, why not make bpf program return non-zero instead? But then
     we have to consider how to connect two samples together.

So maybe writing a new function to replace perf_trace_buf_submit() and
output some light-weight information instead of full event data is
worth considering. Otherwise, maybe a dummy 'struct perf_event' for BPF
outputing is also acceptable?

What we are trying to do in previous patches is to merge data output by
BPF programs and original data output by perf_trace_buf_submit()
together. For example (expressed in CTF metadata format):

  event.header := struct {  // both output by perf_trace_buf_submit()
    integer { ... } id;
    integer { ... } timestamp;
  }
  event {
    name = "perf_bpf_probe:lock_page";
    ...
    fields := struct {
      integer { ... } perf_ip; // perf_trace_buf_submit()
      integer { ... } perf_tid; // perf_trace_buf_submit()
      ...
      integer { ... } page;   <-- Fetched using prologue
      integer { ... } cycle_cmu_counter;  <-- Output by BPF program
    }
  }

We believe that implemented should be simpler. Whether to use an extra
perf_trace_buf or not can be discussed. We have other choices. For
example, we can make BPF program write its data from the end of
bpf_trace_buf, and connect two parts of output data before calling
perf_trace_buf_submit().

Thank you.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/