linux-kernel - Re: [PATCH 1/3 v2] perf/amd/uncore: Prepare L3 thread mask code for Family 19h support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <a78c303e-c988-20e0-9e30-6fdc63d5d75f@amd.com>
Date:   Mon, 23 Mar 2020 15:50:32 -0500
From:   Kim Phillips <kim.phillips@....com>
To:     Stephane Eranian <eranian@...gle.com>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...nel.org>, Ingo Molnar <mingo@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Borislav Petkov <bp@...en8.de>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        "H. Peter Anvin" <hpa@...or.com>, Jiri Olsa <jolsa@...hat.com>,
        Mark Rutland <mark.rutland@....com>,
        Michael Petlan <mpetlan@...hat.com>,
        Namhyung Kim <namhyung@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>, x86 <x86@...nel.org>
Subject: Re: [PATCH 1/3 v2] perf/amd/uncore: Prepare L3 thread mask code for
 Family 19h support



On 3/18/20 4:26 PM, Stephane Eranian wrote:
> On Wed, Mar 18, 2020 at 1:43 PM Peter Zijlstra <peterz@...radead.org> wrote:
>>
>> On Wed, Mar 18, 2020 at 09:46:41AM -0500, Kim Phillips wrote:
>>
>>>> But this does not work with the cpumask programmed for the amd_l3 PMU. This mask
>>>> shows, as it should, one CPU/CCX. So that means that when I do:
>>>>
>>>> $ perf stat -a amd_l3/event=llc_event/
>>>>
>>>> This only collects on the CPUs listed in the cpumask: 0,4,8,12 ....
>>>> That means that L3 events generated by the other CPUs on the CCX are
>>>> not monitored.
>>>> I can easily see the problem by pinning a memory bound program to
>>>> CPU64, for instance.
>>>
>>> Right, the higher level code calls the driver with a single cpu==0
>>> call if the perf tool is invoked with a simple -a style system-wide.
> 
> No, it does not.
> 
> With -a, when -C is not passed, the perf tool picks up the cpumask for
> the PMU from sysfs:
> $ cat /proc/sys/devices/amd_l3/cpumask
> 
> You can easily verify this by running: strace -etrace=perf_event_open
> perf stat -a -e amd_l3/event=0x00/.
> This is the default common mode.

What I meant was that with -a, the driver only gets called with the
'base' cpu for each L3 PMU domain, i.e., 0, 4, 8, and so on.  With -C, it
gets called with all the CPUs the user specifies: these are different
behaviours, and the driver can't tell the difference between e.g., -a
or -C 0,4,8, etc.

> The problem is that here to get any meaningful result, you need to force a -C.
> The CPU in the cpumask is just the CPU to which to attach the event in
> order to access the correct uncore PMU.
> Here, you have one CPU per CCX which is expected and perfectly fine.
> 
> The thread_mask is a hardware filter on the uncore L3 PMU. If you set
> by default the thread_mask to 0xff, then
> you obtain a full system view with a simple -a, or per socket with
> --per-socket. So we need to find a way to
> make this common case work properly first. Expecting the users to know

OK, I'll send a patch to revert the thread filter feature until the above
issue is addressed.

> that for some amd_l3 events you need
> to force -C 0-255 is not practical. I also think that forcing the
> cpumask to 0-255 is not right solution. This is not how
> this is done for any other uncore PMU I know of and some do have the
> thread filter, such as the Skylake CHA.

Odd, the Intel uncore driver's cpumask is 0, so not sure if AMD's
is right to set it any more...

Thanks,

Kim

>>> If the tool is invoked with supplemental switches to -a, like -C 0-255,
>>> and -A, the driver gets called multiple times with all the unique cpu
>>> values.  The latter is the expected invocation style when measuring
>>> a benchmark pinned on a subset of cpus, i.e., when evaluating
>>> the driver, and is the more deterministic behaviour for the driver
>>> to have, given it cannot tell the difference otherwise.
>>
>> That seems to suggest it is all horribly broken.