linux-kernel - Re: [PATCH v3 bpf-next 1/4] tracing/probe: Add PERF_EVENT_IOC_QUERY

Message-ID: <5ecdcd72-255d-26d1-baf3-dc64498753c2@fb.com>
Date:   Wed, 21 Aug 2019 18:43:49 +0000
From:   Yonghong Song <yhs@...com>
To:     Peter Zijlstra <peterz@...radead.org>
CC:     Daniel Xu <dxu@...uu.xyz>,
        "bpf@...r.kernel.org" <bpf@...r.kernel.org>,
        Song Liu <songliubraving@...com>,
        Andrii Nakryiko <andriin@...com>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "acme@...nel.org" <acme@...nel.org>,
        Alexei Starovoitov <ast@...com>,
        "alexander.shishkin@...ux.intel.com" 
        <alexander.shishkin@...ux.intel.com>,
        "jolsa@...hat.com" <jolsa@...hat.com>,
        "namhyung@...nel.org" <namhyung@...nel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Kernel Team <Kernel-team@...com>,
        "Arnaldo Carvalho de Melo" <acme@...hat.com>
Subject: Re: [PATCH v3 bpf-next 1/4] tracing/probe: Add
 PERF_EVENT_IOC_QUERY_PROBE ioctl



On 8/21/19 11:31 AM, Peter Zijlstra wrote:
> On Wed, Aug 21, 2019 at 04:54:47PM +0000, Yonghong Song wrote:
>> Currently, in kernel/trace/bpf_trace.c, we have
>>
>> unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
>> {
>>           unsigned int ret;
>>
>>           if (in_nmi()) /* not supported yet */
>>                   return 1;
>>
>>           preempt_disable();
>>
>>           if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
> 
> Yes, I'm aware of that.
> 
>> In the above, events with a bpf program attached will be missed
>> if the context is an nmi interrupt, or if some recursion happens,
>> whether with the same or a different bpf program.
>> In case of recursion, the events will not be sent to the ring buffer.
> 
> And while that is significantly worse than what ftrace/perf have, it is
> fundamentally the same thing.
> 
> perf allows (and iirc ftrace does too) 4 nested context per CPU
> (task,softirq,irq,nmi) but any recursion within those context and we
> drop stuff.
> 
> The BPF stuff is just more eager to drop things on the floor, but it is
> fundamentally the same.
> 
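
As a rough illustration of the per-context scheme described above, here is
a minimal user-space sketch. The names (ctx_level, recursion_enter,
recursion_exit, dropped) are made up for this example and are not the
kernel's API, and a single array stands in for one CPU's state.

```c
#include <assert.h>

/* Hypothetical model: one slot per nesting context on a single CPU. */
enum ctx_level { CTX_TASK, CTX_SOFTIRQ, CTX_IRQ, CTX_NMI, CTX_MAX };

static int active[CTX_MAX];     /* 1 while that context is inside the tracer */
static unsigned long dropped;   /* events lost to same-context recursion */

/* Claim the slot for this context; return -1 (and drop) if already held. */
static int recursion_enter(enum ctx_level c)
{
        assert(c < CTX_MAX);
        if (active[c]) {
                dropped++;
                return -1;
        }
        active[c] = 1;
        return 0;
}

/* Release the slot claimed by a successful recursion_enter(). */
static void recursion_exit(enum ctx_level c)
{
        assert(c < CTX_MAX);
        active[c] = 0;
}
```

An irq arriving while the task context holds its slot still gets through,
because it uses a different slot; only a second entry from the same
context is dropped.
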
>> A lot of bpf-based tracing programs use maps to communicate and
>> do not allocate a ring buffer at all.
> 
> So extending PERF_RECORD_LOST doesn't work. But PERF_FORMAT_LOST might
> still work fine; but you get to implement it for all software events.

Could you give more specifics about PERF_FORMAT_LOST? Googling
"PERF_FORMAT_LOST" only yields the two emails in this discussion :-(

> 
>> Maybe we can still use an ioctl-based approach, which is lightweight
>> compared to the ring buffer approach? If an fd has bpf attached,
>> nhit/nmisses tells whether the kprobe was processed by the bpf program
>> or not.
> 
> There is nothing kprobe specific here. Kprobes just appear to be the
> only one actually accounting the recursion cases, but everyone has
> them.

Sorry, to be more specific: kprobe is just an example. I actually refer to
any perf event that bpf can attach to, which in theory is any perf event
that can be opened with the "perf_event_open" syscall, although some of
them (e.g., software events?) may not have bpf running hooks yet.
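
To make the hit/miss accounting concrete, here is a minimal user-space
sketch modeled loosely on the trace_call_bpf() guard quoted above. It is
an illustration under assumptions, not kernel code: struct probe_stats,
model_call_bpf() and reentrant_prog() are hypothetical names, a plain int
stands in for the per-CPU bpf_prog_active counter, and counting the nmi
case as a miss is my assumption about what a query interface would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counters that a per-event query might report. */
struct probe_stats {
        unsigned long nhit;     /* events where the program actually ran */
        unsigned long nmissed;  /* events skipped by the guard */
};

static int prog_active;         /* stands in for per-CPU bpf_prog_active */

typedef void (*prog_fn)(struct probe_stats *st, int depth);

/* Run prog unless we are in nmi or a program is already running here. */
static bool model_call_bpf(struct probe_stats *st, bool in_nmi,
                           prog_fn prog, int depth)
{
        bool ran = false;

        assert(st != NULL);
        if (in_nmi) {           /* assumption: account nmi as a miss */
                st->nmissed++;
                return false;
        }
        if (++prog_active != 1) {
                st->nmissed++;  /* recursion: the event is not processed */
        } else {
                st->nhit++;
                if (prog)
                        prog(st, depth);
                ran = true;
        }
        prog_active--;
        return ran;
}

/* A program whose body triggers the same event again while depth > 0. */
static void reentrant_prog(struct probe_stats *st, int depth)
{
        if (depth > 0)
                model_call_bpf(st, false, reentrant_prog, depth - 1);
}
```

Calling model_call_bpf(&st, false, reentrant_prog, 1) on zeroed stats
leaves nhit at 1 and nmissed at 1: the outer event runs the program, and
the nested event it triggers is dropped by the guard.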

> 
>> Currently, for debugfs, the nhit/nmisses info is exposed at
>> {k|u}probe_profile. Alternatively, we could expose the nhit/nmisses
>> in /proc/self/fdinfo/<fd>. Users can query this interface to
>> get the numbers.
> 
> No, we're not adding stuff to procfs for this.

No problem. Just a suggestion.