linux-kernel - Re: [PATCH v2] perf: Synchronously cleanup child events

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20160121045556.GA99187@ast-mbp.thefacebook.com>
Date:	Wed, 20 Jan 2016 20:55:59 -0800
From:	Alexei Starovoitov <alexei.starovoitov@...il.com>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
	Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org,
	vince@...ter.net, eranian@...gle.com,
	Arnaldo Carvalho de Melo <acme@...radead.org>,
	Jiri Olsa <jolsa@...nel.org>
Subject: Re: [PATCH v2] perf: Synchronously cleanup child events

On Wed, Jan 20, 2016 at 09:32:22AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 19, 2016 at 01:58:19PM -0800, Alexei Starovoitov wrote:
> > On Tue, Jan 19, 2016 at 09:05:58PM +0100, Peter Zijlstra wrote:
> 
> > > The most obvious place that generates such magical references would be
> > > the bpf arraymap doing perf_event_get() on things. There are a few other
> > > places that take temp references (perf_mmap_close), but those are
> > > 'short' lived and while ugly will not cause massive grief. The BPF one
> > > OTOH is a real problem here.
> > > 
> > > And looking at the BPF stuff, that code seems to assume
> > > perf_event_kernel_release() := put_event(), so this patch breaks that
> > > too.
> > > 
> > > 
> > > Alexei, is there a reason the arraymap stuff needs a perf event ref as
> > > opposed to a file ref? I'm forever a little confused on how perf<->bpf
> > > works.
> > 
> > A file ref will not work, since user space could have closed that
> > perf_event file to avoid unnecessary FDs.
> 
> So I'm (possibly again) confused on how BPF works.
> 
> I thought the reason you handed in perf events from userspace; as
> opposed to creating your own with perf_event_create_kernel_counter();
> was because userspace was interested in the output.

yes. There are two use cases of perf_events from bpf:
1. sw_bpf_output event is used by bpf to push samples into it and
   user spaces reads it as normal via mmap
2. PERF_TYPE_HARDWARE event is used by bpf program to read
   counters to measure things like number of cycles or tlb misses
   in a given function.
   In this case user space typically leaves FDs around, but it doesn't
   use them for anything.

> Also, BPF should not be a way to get around the filedesc resource limit.

all bpf tracing stuff is root only and maps are charged for every element.

> > Program only need the stable pointer to 'struct perf_event' which
> > it will use while running.
> > At the end it will call perf_event_kernel_release() which
> > is == put_event().
> > It was the case that 'perf_events' were normal refcnt-ed structures
> > and the last guy frees it.
> 
> Sort-of, but user events are (or should be, rather) tied to the filedesc
> to account the resources used.
> 
> There is also the event->owner field, we track the task that created the
> event, with your current scheme that is left dangling once userspace
> closes the last filedesc and you still have a ref open.
> 
> > This put_event_last() logic definitely looks problematic.
> > There are no ordering guarantees.
> > User space may close FD, while struct perf_event is still alive.
> > The loop around perf_event_last() looks buggy.
> > I'm obviously missing the main goal of this patch.
> 
> Right, so the patch in question tries to synchronously clean up
> everything related to the counter when we close the file. Such that the
> file better reflects the actual resource usage.
> 
> Currently we do this async (and with holes).
> 
> In short, user created event really should be filedesc based, yes we
> have event references, but those 'should' be short lived.

I'm still missing why it's the problem.
Which counter do you want to bump as part of perf_event_get() ?
still event->refcount, right?
but the same perf_event can be stored in multiple bpf maps
and many bpf programs can be using it, while nothing can
possibly prevent the user space to do close(perf_event_fd)
while programs are still running and collecting tlb miss data
from the counters.
So what do you propose?