Message-ID: <20160121045556.GA99187@ast-mbp.thefacebook.com>
Date: Wed, 20 Jan 2016 20:55:59 -0800
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org,
vince@...ter.net, eranian@...gle.com,
Arnaldo Carvalho de Melo <acme@...radead.org>,
Jiri Olsa <jolsa@...nel.org>
Subject: Re: [PATCH v2] perf: Synchronously cleanup child events
On Wed, Jan 20, 2016 at 09:32:22AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 19, 2016 at 01:58:19PM -0800, Alexei Starovoitov wrote:
> > On Tue, Jan 19, 2016 at 09:05:58PM +0100, Peter Zijlstra wrote:
>
> > > The most obvious place that generates such magical references would be
> > > the bpf arraymap doing perf_event_get() on things. There are a few other
> > > places that take temp references (perf_mmap_close), but those are
> > > 'short' lived and while ugly will not cause massive grief. The BPF one
> > > OTOH is a real problem here.
> > >
> > > And looking at the BPF stuff, that code seems to assume
> > > perf_event_kernel_release() := put_event(), so this patch breaks that
> > > too.
> > >
> > >
> > > Alexei, is there a reason the arraymap stuff needs a perf event ref as
> > > opposed to a file ref? I'm forever a little confused on how perf<->bpf
> > > works.
> >
> > A file ref will not work, since user space could have closed that
> > perf_event file to avoid unnecessary FDs.
>
> So I'm (possibly again) confused on how BPF works.
>
> I thought the reason you handed in perf events from userspace; as
> opposed to creating your own with perf_event_create_kernel_counter();
> was because userspace was interested in the output.
Yes. There are two use cases for perf_events from bpf:
1. A sw_bpf_output event: the bpf program pushes samples into it and
   user space reads them as usual via mmap.
2. A PERF_TYPE_HARDWARE event: the bpf program reads its counters to
   measure things like the number of cycles or TLB misses in a given
   function.
In this case user space typically leaves the FDs around, but it doesn't
use them for anything.
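For illustration only, the two event types above could be described from
user space roughly as follows. This is a minimal sketch, not code from
this thread: the helper names fill_bpf_output_attr() and
fill_hw_cycles_attr() are hypothetical, and the sketch assumes UAPI
headers recent enough to define PERF_COUNT_SW_BPF_OUTPUT. It only fills
in struct perf_event_attr; opening the event with perf_event_open(2) and
storing the resulting FD in a bpf map is left out.

```c
#include <linux/perf_event.h>
#include <string.h>

/* Hypothetical helper: attr for use case 1, an event that a bpf
 * program pushes raw samples into and user space reads via mmap. */
void fill_bpf_output_attr(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_SOFTWARE;
	attr->config = PERF_COUNT_SW_BPF_OUTPUT;
	attr->sample_type = PERF_SAMPLE_RAW;
}

/* Hypothetical helper: attr for use case 2, a hardware counter
 * (CPU cycles here) whose value a bpf program reads while running. */
void fill_hw_cycles_attr(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_HARDWARE;
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
}
```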
> Also, BPF should not be a way to get around the filedesc resource limit.
All bpf tracing stuff is root-only, and maps are charged for every element.
> > Program only need the stable pointer to 'struct perf_event' which
> > it will use while running.
> > At the end it will call perf_event_kernel_release() which
> > is == put_event().
> > It was the case that 'perf_events' were normal refcnt-ed structures
> > and the last guy frees it.
>
> Sort-of, but user events are (or should be, rather) tied to the filedesc
> to account the resources used.
>
> There is also the event->owner field, we track the task that created the
> event, with your current scheme that is left dangling once userspace
> closes the last filedesc and you still have a ref open.
>
> > This put_event_last() logic definitely looks problematic.
> > There are no ordering guarantees.
> > User space may close FD, while struct perf_event is still alive.
> > The loop around perf_event_last() looks buggy.
> > I'm obviously missing the main goal of this patch.
>
> Right, so the patch in question tries to synchronously clean up
> everything related to the counter when we close the file. Such that the
> file better reflects the actual resource usage.
>
> Currently we do this async (and with holes).
>
> In short, user created event really should be filedesc based, yes we
> have event references, but those 'should' be short lived.
I still don't see why that is a problem.
Which counter do you want to bump as part of perf_event_get()?
Still event->refcount, right?
But the same perf_event can be stored in multiple bpf maps, and many
bpf programs can be using it, while nothing can prevent user space
from doing close(perf_event_fd) while the programs are still running
and collecting TLB miss data from the counters.
So what do you propose?
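
The lifetime question above can be illustrated with a toy user-space
model. This is a sketch under stated assumptions, not kernel code:
struct toy_event, toy_event_get(), toy_event_put() and toy_scenario()
are hypothetical names that only mimic a plain reference count, with one
reference for the file descriptor and one per bpf map holding the event.

```c
#include <stdlib.h>

/* Number of toy events currently allocated (for observation only). */
int live_events;

/* Toy stand-in for struct perf_event: a refcount and a counter value. */
struct toy_event {
	int refcount;
	unsigned long counter;
};

/* Create an event; the initial reference models the user-space FD. */
struct toy_event *toy_event_create(void)
{
	struct toy_event *e = calloc(1, sizeof(*e));

	if (!e)
		abort();
	e->refcount = 1;
	live_events++;
	return e;
}

/* Take one more reference, e.g. when a map stores the event. */
void toy_event_get(struct toy_event *e)
{
	e->refcount++;
}

/* Drop a reference; whoever drops the last one frees the event. */
void toy_event_put(struct toy_event *e)
{
	if (--e->refcount == 0) {
		free(e);
		live_events--;
	}
}

/* Returns how many events are still alive right after "close(fd)". */
int toy_scenario(void)
{
	struct toy_event *ev = toy_event_create();	/* FD reference */
	int alive_after_close;

	toy_event_get(ev);	/* stored in map A */
	toy_event_get(ev);	/* stored in map B */

	toy_event_put(ev);	/* user space closes the FD */
	alive_after_close = live_events;

	ev->counter++;		/* a program can still use the event */

	toy_event_put(ev);	/* map A drops its entry */
	toy_event_put(ev);	/* map B drops its entry; event is freed */
	return alive_after_close;
}
```

In this model the event survives the close and is freed only when the
last map reference goes away, which is the ordering described above.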