linux-kernel - Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20160805194505.GA63020@ast-mbp.thefacebook.com>
Date:	Fri, 5 Aug 2016 12:45:07 -0700
From:	Alexei Starovoitov <alexei.starovoitov@...il.com>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Brendan Gregg <bgregg@...flix.com>, Ingo Molnar <mingo@...hat.com>,
	Arnaldo Carvalho de Melo <acme@...nel.org>,
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
	linux-kernel@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>,
	Wang Nan <wangnan0@...wei.com>
Subject: Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

On Fri, Aug 05, 2016 at 12:52:09PM +0200, Peter Zijlstra wrote:
> > > > Currently overflow_handler is set at event alloc time. If we start
> > > > changing it on the fly with atomic xchg(), afaik things shouldn't
> > > > break, since each overflow_handler is run to completion and doesn't
> > > > change global state, right?
> 
> Yes, or even a simple WRITE_ONCE() to replace it, as long as we make
> sure to use a READ_ONCE() to load the pointer.
> 
> As long as we're sure to limit this poking to a single user its fairly
> simple to get right. The moment there can be concurrency a lot of fail
> can happen.

agreed.

> > So instead of normal __perf_event_output() writing into ringbuffer,
> > a bpf prog will be called that can optionally write into different
> > rb via bpf_perf_event_output. 
> 
> It could even chain and call into the original function once its done
> and have both outputs.

interesting idea. makes sense.
Also thinking about concurrency and the need to remember the original
handler somewhere, would it be cleaner api to add a bit to perf_event_attr
and use attr.config1 as bpf_fd ?
Then perf_event_open at event creation time will use bpf prog as
overflow_handler. That solves concurrency concerns and potential semantical
issues if we go with ioctl() approach.
Like if we perf_event_open() an event for a task, then bpf attach to it,
what children task and corresponding inherited events suppose to do?
Inherit overflow_handler, right? but then deatch of bpf in the parent
suppose to clear it in inherited events as well. A bit complicated.
I guess we can define it that way.
Just seems easier to do bpf attach at perf_event_open time only.

> > The question is what to pass into the
> > program to make the most use out of it. 'struct pt_regs' is done deal.
> > but perf_sample_data we cannot pass as-is, since it's kernel internal.
> 
> Urgh, does it have to be stable API? Can't we simply rely on the kernel
> headers to provide the right structure definition?

yes we can. The concern is about assumptions people will make about
perf_sample_data and the speed of access to it. From bpf program point
of view the pointer to perf_sample_data will be opaque unsafe pointer,
so any access to fields would have to be done via bpf_probe_read which
has non-trivial overhead.
If we go with the uapi mirror of perf_sample_data approach, it will be
fast, since mirror is not an actual struct. Like the 'struct __sk_buff' we
have in uapi/linux/bpf.h is a meta structure. It's not allocated anywhere
and no fields are copied. When bpf program does 'skb->vlan_present'
the verifier rewrites it at load time into corresponding access to
kernel internal 'struct sk_buff' fields with bitmask, shifts and such.
For this case we can define something like
struct bpf_perf_sample_data {
  __u64 period;
};
then bpf prog will only be able to access that signle field which verifier
will translate into 'data->period' where data is 'struct perf_sample_data *'
Later we can add other fields if necessary. The kernel is free to mess
around with perf_sample_data whichever way without impacting bpf progs.