linux-kernel - Re: [RFC][PATCH] perf: Rewrite core context handling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <FE66AE20-9785-4CF7-8D8D-CF2A9C696923@fb.com>
Date:   Tue, 16 Oct 2018 18:28:10 +0000
From:   Song Liu <songliubraving@...com>
To:     Peter Zijlstra <peterz@...radead.org>
CC:     Alexey Budankov <alexey.budankov@...ux.intel.com>,
        Ingo Molnar <mingo@...nel.org>,
        lkml <linux-kernel@...r.kernel.org>,
        "acme@...nel.org" <acme@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Jiri Olsa <jolsa@...hat.com>,
        Stephane Eranian <eranian@...gle.com>,
        "Thomas Gleixner" <tglx@...utronix.de>,
        "mark.rutland@....com" <mark.rutland@....com>,
        "megha.dey@...el.com" <megha.dey@...el.com>,
        "frederic@...nel.org" <frederic@...nel.org>
Subject: Re: [RFC][PATCH] perf: Rewrite core context handling

Hi Peter,

> On Oct 15, 2018, at 3:09 PM, Song Liu <songliubraving@...com> wrote:
> 
> 
> 
>> On Oct 15, 2018, at 1:34 AM, Peter Zijlstra <peterz@...radead.org> wrote:
>> 
>> On Mon, Oct 15, 2018 at 10:26:06AM +0300, Alexey Budankov wrote:
>>> Hi,
>>> 
>>> On 10.10.2018 13:45, Peter Zijlstra wrote:
>>>> Hi all,
>>>> 
>>>> There have been various issues and limitations with the way perf uses
>>>> (task) contexts to track events. Most notable is the single hardware PMU
>>>> task context, which has resulted in a number of yucky things (both
>>>> proposed and merged).
>>>> 
>>>> Notably:
>>>> 
>>>> - HW breakpoint PMU
>>>> - ARM big.little PMU
>>>> - Intel Branch Monitoring PMU
>>>> 
>>>> Since we now track the events in RB trees, we can 'simply' add a pmu
>>>> order to them and have them grouped that way, reducing to a single
>>>> context. Of course, reality never quite works out that simple, and below
>>>> ends up adding an intermediate data structure to bridge the context ->
>>>> pmu mapping.
>>>> 
>>>> Something a little like:
>>>> 
>>>>             ,------------------------[1:n]---------------------.
>>>>             V                                                  V
>>>>   perf_event_context <-[1:n]-> perf_event_pmu_context <--- perf_event
>>>>             ^                      ^     |                     |
>>>>             `--------[1:n]---------'     `-[n:1]-> pmu <-[1:n]-'
>>>> 
>>>> This patch builds (provided you disable CGROUP_PERF), boots and survives
>>>> perf-top without the machine catching fire.
>>>> 
>>>> There's still a fair bit of loose ends (look for XXX), but I think this
>>>> is the direction we should be going.
>>>> 
>>>> Comments?
>>>> 
>>>> Not-Quite-Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
>>>> ---
>>>> arch/powerpc/perf/core-book3s.c |    4 
>>>> arch/x86/events/core.c          |    4 
>>>> arch/x86/events/intel/core.c    |    6 
>>>> arch/x86/events/intel/ds.c      |    6 
>>>> arch/x86/events/intel/lbr.c     |   16 
>>>> arch/x86/events/perf_event.h    |    6 
>>>> include/linux/perf_event.h      |   80 +-
>>>> include/linux/sched.h           |    2 
>>>> kernel/events/core.c            | 1412 ++++++++++++++++++++--------------------
>>>> 9 files changed, 815 insertions(+), 721 deletions(-)
>>> 
>>> Rewrite is impressive however it doesn't result in code base reduction as it is.
>> 
>> Yeah.. that seems to be nature of these things ..
>> 
>>> Nonetheless there is a clear demand for per pmu events groups tracking and rotation 
>>> in single cpu context (HW breakpoints, ARM big.little, Intel LBRs) and there is 
>>> a supply thru groups ordering on RB-tree.
>>> 
>>> This might be driven into the kernel by some new Perf features that would base on 
>>> that RB-tree groups ordering or by refactoring of existing code but in the way it 
>>> would result in overall code base reduction thus lowering support cost.
>> 
>> If you have a concrete suggestion on how to reduce complexity? I tried,
>> but couldn't find any (without breaking something).
>> 
>> The active lists and pmu_ctx_list could arguably be replaced with
>> (slower) iteratons over the RB tree, but you'll still need the per pmu
>> nr_events/nr_active counts to determine if rotation is required at all.
>> 
>> And like you know, performance is quite important here too. I'd love to
>> reduce complexity while maintaining or improve performance, but that
>> rarely if ever happens :/
> 
> How about this: 
> 
> 1. Keep multiple perf_cpu_context per CPU, just like before this patch. 
> 
> 2. For perf_event_context, add PMU as an order for the RB tree. 
> 
> 3. (hw) pmu->perf_cpu_context->ctx only has events for this PMU (and sw 
>   events moved to this context).
> 
> 4. task->perf_event_ctxp has events for all PMUs. 
> 
> With this path, we keep the existing perf_cpu_context/perf_event_context
> logic as-is, which I think is simp.ler than the new logic (with extra
> *_pmu_context). And it should also solve the problem. 
> 
> Does this make sense? If this doesn't look too broken, I am happy to
> draft RFC for it. 
> 

I am not sure whether you missed this one, or found it totally insane. 
Could you please share your comments on it? My gut feeling is that this 
would be a simpler patch to solve the problem (two hw PMUs). (It might 
be less efficient though). 

Thanks,
Song