Date:   Tue, 16 Oct 2018 11:32:53 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Stephane Eranian <eranian@...gle.com>
Cc:     Alexey Budankov <alexey.budankov@...ux.intel.com>,
        Ingo Molnar <mingo@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Jiri Olsa <jolsa@...hat.com>, songliubraving@...com,
        Thomas Gleixner <tglx@...utronix.de>,
        Mark Rutland <mark.rutland@....com>, megha.dey@...el.com,
        frederic@...nel.org
Subject: Re: [RFC][PATCH] perf: Rewrite core context handling

On Mon, Oct 15, 2018 at 11:31:24AM -0700, Stephane Eranian wrote:

> I have always had a hard time understanding the role of all these
> structs in the generic code. This is still very confusing and very
> hard to follow.
> 
> In my mind, you have per-task and per-cpu perf_events contexts.  And
> for each you can have multiple PMUs, some hw some sw.  Each PMU has
> its own list of events maintained in an RB tree. There are never any
> interactions between PMUs.

That is more or less how it was. We had, per PMU, task and CPU contexts:


  task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
       ^                                 |    ^     |           ^
       `---------------------------------'    |     `--> pmu <--'
                                              v           ^
                                         perf_event ------'


Each task has an array of pointers to a perf_event_context. Each
perf_event_context has a direct relation to a PMU and a group of events
for that PMU. The task-related perf_event_contexts have a pointer back
to that task.

Each PMU has a per-cpu pointer to a perf_cpu_context, which embeds a
perf_event_context, which again has a direct relation to that PMU and
a group of events for that PMU.

The perf_cpu_context also tracks which task context is currently
associated with that CPU and includes a few other things, like the
hrtimer for rotation.

Each perf_event is then associated with its PMU and one
perf_event_context.
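
In (simplified) C, that old layout is roughly the below; this is a
sketch going by the description above, and the real structures in
include/linux/perf_event.h carry a lot more state:

  /* each task: one context pointer per context type */
  struct task_struct {
          struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
  };

  /* one per PMU per task; also embedded in every perf_cpu_context */
  struct perf_event_context {
          struct pmu                *pmu;   /* the PMU this context serves */
          struct task_struct        *task;  /* NULL for a CPU context */
          /* RB-tree and lists holding this PMU's events */
  };

  /* per-cpu instance, one per PMU */
  struct perf_cpu_context {
          struct perf_event_context ctx;       /* this CPU's events */
          struct perf_event_context *task_ctx; /* task ctx on this CPU */
          struct hrtimer            hrtimer;   /* drives event rotation */
  };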

> Maybe this is how this is done or proposed by your patches, but it
> certainly is not obvious.

No, my patch rather completely wrecks the above; it reduces things to
a single task context and a single CPU context.

There were a number of problems with the above. One is that per-task
array of pointers, which limited the number of task contexts we could
have.

Now, we could've easily changed that to a list and called it a day;
that is not in fact a horribly difficult patch. If you combine that
with a patch that actually frees task contexts when they go empty, it
might actually work.

But there are a number of other considerations that resulted in the
patch as presented:

 - there is a bunch of per-context state that is simply duplicated
   between contexts, like for instance the timekeeping. There is no
   point in tracking the time for 'n' per-task/cpu contexts when in
   fact they're all the same.

 - on context switch we have to iterate all these 'n' contexts and
   switch them one by one, instead of just switching one context and
   calling it a day.

 - for big.LITTLE we'd end up with 2 per-task contexts and only ever
   use 1 at any one time, which increases 'n' in the above cases for
   no purpose.

 - the actual per-pmu-per-context state is very small (as I think Alexey
   already implied).

 - a single context simplifies a bunch of things; including the
   move_group case (we no longer have to adjust perf_event::ctx) and the
   cpu-online tests and the ctx locking and it removes a bunch of
   context lists (like active_ctx_list).

So a single context is what I went with. That all results in:


  task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
       ^                                 |    ^ ^
       `---------------------------------'    | |
                                              | `--> perf_event_pmu_context
                                              |       ^   ^
                                              |       |   |
                                              | ,-----'   v
                                              | |      perf_cpu_pmu_context
                                              | |         ^
                                              | |         |
                                              v v         v
                                         perf_event ---> pmu


Because, while the per-pmu-per-context state is small, it does exist;
this gives rise to perf_event_pmu_context. It tracks nr_events and
nr_active, which are used to (quickly) tell if rotation is required
(it is possible to reduce this state I think, but I've not yet gotten
it down to 0). It also tracks which events are actually active;
iterating a list is cheaper than finding them all in the RB-tree.

It also contains the task_ctx_data thing for LBR, which is a
PMU-specific extra data thingy.
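
Going purely by the above, a sketch of perf_event_pmu_context could
look like the below; the field names are illustrative assumptions, not
necessarily those of the actual patch:

  /* per-PMU state within one perf_event_context */
  struct perf_event_pmu_context {
          struct pmu                *pmu;
          struct perf_event_context *ctx;           /* back pointer */
          struct list_head          pmu_ctx_entry;  /* entry in ctx's PMU list */
          struct list_head          active;         /* active events; cheaper to
                                                       iterate than the RB-tree */
          int                       nr_events;      /* to (quickly) decide rotation */
          int                       nr_active;
          void                      *task_ctx_data; /* PMU-specific, e.g. LBR */
  };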

We then also keep a list of (active) perf_event_pmu_context in
perf_event_context, such that we can quickly find which PMUs are in fact
involved with the context. This simplifies context scheduling a little.
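
With that list, context scheduling only has to visit the PMUs that
actually have events in the context, along these lines (the sched-out
helper here is hypothetical):

  struct perf_event_pmu_context *pmu_ctx;

  list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
          ctx_sched_out_pmu(ctx, pmu_ctx);  /* hypothetical per-PMU sched-out */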

We then also need per-pmu-per-cpu state, which gives rise to
perf_cpu_pmu_context. That mostly includes bits to drive the event
rotation (which, per the ABI, is per PMU), but it also includes bits
to do perf_event_attr::exclusive scheduling, which is likewise
naturally per-pmu-per-cpu.
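
Again as a hedged sketch (assumed names), the per-pmu-per-cpu side and
the now singular per-cpu context might look like:

  /* per-pmu, per-cpu: rotation and attr::exclusive accounting */
  struct perf_cpu_pmu_context {
          struct perf_event_pmu_context epc;      /* this CPU's ctx for the PMU */
          struct hrtimer                hrtimer;  /* per-PMU rotation, per ABI */
          ktime_t                       hrtimer_interval;
          int                           active_oncpu; /* attr::exclusive */
          int                           exclusive;
  };

  /* a single per-cpu context, no longer one per PMU */
  struct perf_cpu_context {
          struct perf_event_context ctx;
          struct perf_event_context *task_ctx;    /* currently scheduled task ctx */
  };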

And yes, the above looks more complicated, but at the same time, a bunch
of things did get simplified. Maybe once the dust settles someone can
turn this here email into a sensible comment or something ;-)

> Also the Intel LBR is not a PMU on its own. Maybe you are talking about
> the BTS in arch/x86/events/intel/bts.c.

This thing:

  https://lkml.kernel.org/r/1510970046-25387-1-git-send-email-megha.dey@linux.intel.com
