Message-ID: <a78c303e-c988-20e0-9e30-6fdc63d5d75f@amd.com>
Date: Mon, 23 Mar 2020 15:50:32 -0500
From: Kim Phillips <kim.phillips@....com>
To: Stephane Eranian <eranian@...gle.com>,
Peter Zijlstra <peterz@...radead.org>
Cc: Ingo Molnar <mingo@...nel.org>, Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Borislav Petkov <bp@...en8.de>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
"H. Peter Anvin" <hpa@...or.com>, Jiri Olsa <jolsa@...hat.com>,
Mark Rutland <mark.rutland@....com>,
Michael Petlan <mpetlan@...hat.com>,
Namhyung Kim <namhyung@...nel.org>,
LKML <linux-kernel@...r.kernel.org>, x86 <x86@...nel.org>
Subject: Re: [PATCH 1/3 v2] perf/amd/uncore: Prepare L3 thread mask code for
Family 19h support
On 3/18/20 4:26 PM, Stephane Eranian wrote:
> On Wed, Mar 18, 2020 at 1:43 PM Peter Zijlstra <peterz@...radead.org> wrote:
>>
>> On Wed, Mar 18, 2020 at 09:46:41AM -0500, Kim Phillips wrote:
>>
>>>> But this does not work with the cpumask programmed for the amd_l3 PMU. This mask
>>>> shows, as it should, one CPU/CCX. So that means that when I do:
>>>>
>>>> $ perf stat -a amd_l3/event=llc_event/
>>>>
>>>> This only collects on the CPUs listed in the cpumask: 0,4,8,12 ....
>>>> That means that L3 events generated by the other CPUs on the CCX are
>>>> not monitored.
>>>> I can easily see the problem by pinning a memory bound program to
>>>> CPU64, for instance.
>>>
>>> Right, the higher level code calls the driver with a single cpu==0
>>> call if the perf tool is invoked with a simple -a style system-wide.
>
> No, it does not.
>
> With -a, when -C is not passed, the perf tool picks up the cpumask for
> the PMU from sysfs:
> $ cat /sys/devices/amd_l3/cpumask
>
> You can easily verify this by running: strace -etrace=perf_event_open
> perf stat -a -e amd_l3/event=0x00/.
> This is the default common mode.

What I meant was that with -a, the driver only gets called with the
'base' cpu for each L3 PMU domain, i.e., 0, 4, 8, and so on. With -C, it
gets called with all the CPUs the user specifies: these are different
behaviours, and the driver can't tell the difference between e.g., -a
or -C 0,4,8, etc.

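To make the ambiguity concrete, here is a minimal sketch of the CPU
selection as described in this thread. It is a simplified model, not
the perf tool's actual code: the helper names parse_cpulist() and
cpus_for_events() are hypothetical, and the mask value "0,4,8,12" is
only an example of a one-CPU-per-CCX cpumask.

```python
def parse_cpulist(text):
    """Parse a sysfs-style CPU list such as "0,4,8" or "0-3,8" into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def cpus_for_events(pmu_cpumask, user_cpus=None):
    """Model of which CPUs the driver is called with.

    Assumption taken from the discussion: with -a alone the PMU's sysfs
    cpumask is used; with -C the user's CPU list is used as given.
    """
    if user_cpus is not None:
        return parse_cpulist(user_cpus)
    return parse_cpulist(pmu_cpumask)


# Example mask only: one CPU per CCX, as on the system discussed above.
mask = "0,4,8,12"
print(cpus_for_events(mask))               # -a alone
print(cpus_for_events(mask, "0-15"))       # -a -C 0-15
print(cpus_for_events(mask, "0,4,8,12"))   # -a -C 0,4,8,12
```

The first and third calls yield the same CPU list, which is why the
driver cannot tell a plain -a from an explicit -C 0,4,8,12.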
> The problem is that here to get any meaningful result, you need to force a -C.
> The CPU in the cpumask is just the CPU to which to attach the event in
> order to access the correct uncore PMU.
> Here, you have one CPU per CCX which is expected and perfectly fine.
>
> The thread_mask is a hardware filter on the uncore L3 PMU. If you set
> by default the thread_mask to 0xff, then
> you obtain a full system view with a simple -a, or per socket with
> --per-socket. So we need to find a way to
> make this common case work properly first. Expecting the users to know

OK, I'll send a patch to revert the thread filter feature until the above
issue is addressed.

> that for some amd_l3 events you need
> to force -C 0-255 is not practical. I also think that forcing the
> cpumask to 0-255 is not the right solution. This is not how
> this is done for any other uncore PMU I know of and some do have the
> thread filter, such as the Skylake CHA.

Odd, the Intel uncore driver's cpumask is 0, so not sure if AMD's
is right to set it any more...

Thanks,

Kim

>>> If the tool is invoked with supplemental switches to -a, like -C 0-255,
>>> and -A, the driver gets called multiple times with all the unique cpu
>>> values. The latter is the expected invocation style when measuring
>>> a benchmark pinned on a subset of cpus, i.e., when evaluating
>>> the driver, and is the more deterministic behaviour for the driver
>>> to have, given it cannot tell the difference otherwise.
>>
>> That seems to suggest it is all horribly broken.