Message-ID: <20181016093253.GD4030@hirez.programming.kicks-ass.net>
Date: Tue, 16 Oct 2018 11:32:53 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Stephane Eranian <eranian@...gle.com>
Cc: Alexey Budankov <alexey.budankov@...ux.intel.com>,
Ingo Molnar <mingo@...nel.org>,
LKML <linux-kernel@...r.kernel.org>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...hat.com>, songliubraving@...com,
Thomas Gleixner <tglx@...utronix.de>,
Mark Rutland <mark.rutland@....com>, megha.dey@...el.com,
frederic@...nel.org
Subject: Re: [RFC][PATCH] perf: Rewrite core context handling
On Mon, Oct 15, 2018 at 11:31:24AM -0700, Stephane Eranian wrote:
> I have always had a hard time understanding the role of all these
> structs in the generic code. This is still very confusing and very
> hard to follow.
>
> In my mind, you have per-task and per-cpu perf_events contexts. And
> for each you can have multiple PMUs, some hw some sw. Each PMU has
> its own list of events maintained in RB tree. There is never any
> interactions between PMUs.
That is more or less how it was. We have per PMU task or CPU contexts:
task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
     ^                                 |    ^    |           ^
     `---------------------------------'    |    `--> pmu <--'
                                            v          ^
                                      perf_event ------'
Each task has an array of pointers to a perf_event_context. Each
perf_event_context has a direct relation to a PMU and a group of events
for that PMU. The task-related perf_event_contexts have a pointer back
to that task.
Each PMU has a per-cpu pointer to a per-cpu perf_cpu_context, which
includes a perf_event_context, which again has a direct relation to that
PMU, and a group of events for that PMU.
The perf_cpu_context also tracks which task context is currently
associated with that CPU and includes a few other things like the
hrtimer for rotation etc..
Each perf_event is then associated with its PMU and one
perf_event_context.
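
To make that concrete, here is a much-simplified sketch of the old
layout in C; the field names are illustrative (locking, RCU and
refcounting are all omitted) rather than the exact kernel definitions:

struct pmu;
struct task_struct;

struct perf_event_context {
        struct pmu *pmu;                       /* the one PMU this context serves */
        struct task_struct *task;              /* back-pointer; NULL for CPU contexts */
        /* ... RB-tree / lists holding the events for this PMU ... */
};

struct perf_cpu_context {
        struct perf_event_context ctx;         /* the per-CPU context for this PMU */
        struct perf_event_context *task_ctx;   /* task context currently on this CPU */
        /* ... hrtimer for rotation etc. ... */
};

struct perf_event {
        struct pmu *pmu;                       /* the PMU this event belongs to */
        struct perf_event_context *ctx;        /* the one context it lives in */
};

/*
 * And task_struct carries a small fixed array of context pointers:
 *
 *      struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
 */
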
> Maybe this is how this is done or proposed by your patches, but it
> certainly is not obvious.
No, my patch somewhat completely wrecks the above; and reduces it to a
single task context and a single CPU context.
There were a number of problems with the above. One is that task array
of pointers, which limited the number of task contexts we could have.
Now, we could've easily changed that to a list and called it a day.
That is not in fact a horribly difficult patch. If you combine that with
a patch that actually frees task contexts when they go empty, that
might actually work.
But there are a number of other considerations that resulted in the
patch as presented:
- there is a bunch of per context state that is simply duplicated
between contexts, like for instance the time keeping. There is no
point in tracking the time for 'n' per task/cpu contexts when in fact
they're all the same.
- on context switch we have to iterate all these 'n' contexts and
switch them one by one, instead of just switching one context and
calling it a day (see the sketch after this list).
- for big.little we'd end up with 2 per-task contexts and only ever use
1 at any one time, which increases 'n' in the above cases for no
purpose.
- the actual per-pmu-per-context state is very small (as I think Alexey
already implied).
- a single context simplifies a bunch of things; including the
move_group case (we no longer have to adjust perf_event::ctx) and the
cpu-online tests and the ctx locking and it removes a bunch of
context lists (like active_ctx_list).
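
To illustrate the context switch point from that list, a rough pseudo-C
sketch; the type and function names below are invented for illustration
and are not the actual kernel ones:

#define NR_TASK_CONTEXTS 2      /* hw + sw in the old scheme */

struct perf_event_context;
void switch_out_ctx(struct perf_event_context *ctx);  /* stand-in for the real scheduling code */

struct old_task {
        struct perf_event_context *perf_event_ctxp[NR_TASK_CONTEXTS];
};

struct new_task {
        struct perf_event_context *perf_event_ctxp;   /* just the one */
};

/* old: walk every per-PMU task context on every context switch */
static void sched_out_old(struct old_task *task)
{
        int i;

        for (i = 0; i < NR_TASK_CONTEXTS; i++) {
                if (task->perf_event_ctxp[i])
                        switch_out_ctx(task->perf_event_ctxp[i]);
        }
}

/* new: a single context, a single switch */
static void sched_out_new(struct new_task *task)
{
        if (task->perf_event_ctxp)
                switch_out_ctx(task->perf_event_ctxp);
}
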
So a single context is what I went with. That all results in:
task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
     ^                                 |   ^   ^
     `---------------------------------'   |   |
                                           |   `--> perf_event_pmu_context
                                           |          ^       ^
                                           |          |       |
                                           |    ,-----'       v
                                           |    |       perf_cpu_pmu_context
                                           |    |        ^
                                           |    |        |
                                           v    v        v
                                        perf_event ---> pmu
Because while the per-pmu-per-context state is small, it does exist;
this gives rise to perf_event_pmu_context. It tracks nr_events and
nr_active, which is used to (quickly) tell if rotation is required (it
is possible to reduce this state I think, but I've not yet gotten it
down to 0). It also tracks which events are actually active; iterating a
list is cheaper than finding them all in the RB-tree.
It also contains the task_ctx_data thing for LBR, which is a
PMU-specific extra data thingy.
We then also keep a list of (active) perf_event_pmu_context in
perf_event_context, such that we can quickly find which PMUs are in fact
involved with the context. This simplifies context scheduling a little.
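
Putting those bits together, a hedged sketch of the new glue structure;
the field names follow the description above and need not match the
patch exactly:

#include <linux/list.h>

struct pmu;
struct perf_event_context;

struct perf_event_pmu_context {
        struct pmu *pmu;                       /* which PMU this is for */
        struct perf_event_context *ctx;        /* owning (task or CPU) context */
        struct list_head pmu_ctx_entry;        /* link in ctx->pmu_ctx_list */

        int nr_events;                         /* total events for this PMU */
        int nr_active;                         /* currently scheduled-in events */
        struct list_head active_list;          /* active events, cheaper to walk than the RB-tree */
        void *task_ctx_data;                   /* PMU-specific extra state (LBR) */
};

struct perf_event_context {
        struct list_head pmu_ctx_list;         /* the perf_event_pmu_contexts in use */
        /* ... the events RB-tree, timekeeping, locks, ... */
};
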
We then also need per-pmu-per-cpu state, which gives rise to
perf_cpu_pmu_context, and that mostly includes bits to drive the event
rotation, which per ABI is per PMU, but it also includes bits to do
perf_event_attr::exclusive scheduling, which is also naturally
per-pmu-per-cpu.
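
For completeness, an equally hedged sketch of the per-pmu-per-cpu
state; the names are taken from the description above rather than from
the patch itself:

#include <linux/hrtimer.h>

struct pmu;

struct perf_cpu_pmu_context {
        struct pmu *pmu;                /* the PMU this per-CPU state belongs to */

        /* rotation is per PMU per the ABI, so the timer lives here */
        struct hrtimer rotate_timer;
        ktime_t rotate_interval;

        /* perf_event_attr::exclusive scheduling is naturally per-pmu-per-cpu */
        int exclusive;                  /* an exclusive event is on this CPU */
        int active_oncpu;               /* events currently on the PMU */
};
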
And yes, the above looks more complicated, but at the same time, a bunch
of things did get simplified. Maybe once the dust settles someone can
turn this here email into a sensible comment or something ;-)
> Also the Intel LBR is not a PMU on its own. Maybe you are talking about
> the BTS in arch/x86/events/intel/bts.c.
This thing:
https://lkml.kernel.org/r/1510970046-25387-1-git-send-email-megha.dey@linux.intel.com