linux-kernel - Re: [PATCH v2 3/6] perf pmu: Add support for event.cpus files in sysfs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e8872317-8e18-48aa-9f23-b98af9345bed@linux.intel.com>
Date: Fri, 19 Jul 2024 09:55:59 -0400
From: "Liang, Kan" <kan.liang@...ux.intel.com>
To: Ian Rogers <irogers@...gle.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
 Arnaldo Carvalho de Melo <acme@...nel.org>,
 Namhyung Kim <namhyung@...nel.org>, Mark Rutland <mark.rutland@....com>,
 Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
 Jiri Olsa <jolsa@...nel.org>, Adrian Hunter <adrian.hunter@...el.com>,
 Bjorn Helgaas <bhelgaas@...gle.com>, Jonathan Corbet <corbet@....net>,
 James Clark <james.clark@....com>, Ravi Bangoria <ravi.bangoria@....com>,
 Dominique Martinet <asmadeus@...ewreck.org>, linux-kernel@...r.kernel.org,
 linux-perf-users@...r.kernel.org,
 Dhananjay Ugwekar <Dhananjay.Ugwekar@....com>, ananth.narayan@....com,
 gautham.shenoy@....com, kprateek.nayak@....com, sandipan.das@....com
Subject: Re: [PATCH v2 3/6] perf pmu: Add support for event.cpus files in
 sysfs



On 2024-07-18 4:50 p.m., Ian Rogers wrote:
> On Thu, Jul 18, 2024 at 10:48 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
>>
>>
>>
>> On 2024-07-18 11:39 a.m., Ian Rogers wrote:
>>> On Thu, Jul 18, 2024 at 7:33 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2024-07-17 8:30 p.m., Ian Rogers wrote:
>>>>> If an event file exists in sysfs, check if a event.cpus file exists
>>>>> and read a perf_cpu_map from it if it does. This allows particular
>>>>> events to have a different set of CPUs compared to the PMU.
>>>>>
>>>>> One scenario where this could be useful is when a PMU is set up with a
>>>>> cpumask/events per SMT thread but some events record for all SMT
>>>>> threads. Programming an event on each SMT thread will cause
>>>>> unnecessary counters to be programmed and the aggregate value to be
>>>>> too large.
>>>>
>>>> If I understand the scenario correctly, I think the issue should have
>>>> been addressed since ICX, with the introduction of  the "SMT-aware
>>>> events". Is there a specific event which still has the issue on newer
>>>> platforms?
>>>
>>> Nothing comes to mind, but it is there on popular machines like Skylake.
>>>
>>>>>
>>>>> Another scenario where this could be useful if when a PMU has
>>>>> historically had a cpumask at the package level, but now newer per
>>>>> die, core or CPU information is available.
>>>>
>>>> The traditional way to support new counters with a different scope is to
>>>> add a new PMU.
>>>>
>>>>>
>>>>> Additional context for the motivation is in these patches and
>>>>> conversation:
>>>>> https://lore.kernel.org/lkml/20240711102436.4432-1-Dhananjay.Ugwekar@amd.com/
>>>>
>>>> I don't see the advantages of the new event.cpus way.
>>>> The aggregation should be the same.
>>>
>>> Agreed. My concern is that we may end up with a pattern of
>>> <pmu>_per_pkg, <pmu>_per_core, <pmu>_per_cpu, <pmu>_per_l3, etc. just
>>> for the sake of the cpumask.
>>
>> The cstate PMUs already do so, e.g., cstate_core, cstate_pkg.
>>
>> From another perspective, it discloses the scope information in a PMU
>> name. If a user only cares about the information of a package, they just
>> need to focus on everything under <pmu>_pkg. Otherwise, they have to go
>> through all the events.
>>
>>>
>>>> The RAPL counters are free-running counters. So there is no difference
>>>> whether grouping or not grouping.
>>>
>>> Should the RAPL counters return true for perf_pmu__is_software? In the
>>> tool we currently return false and won't allow grouping, these events
>>> with other hardware events. The intent in perf_pmu__is_software was to
>>> emulate the way the kernel allows/disallows groups - software context
>>> events can be in a hardware context but otherwise I don't believe the
>>> grouping is allowed.
>>
>> No, it's not allowed for the RAPL counters.
>>
>> If the motivation is to find another way to group counters with
>> different scope, it may not work.
>>
>> We once tried to mix the perf_invalid_context PMUs in a group. But it's
>> denied.
>> https://yhbt.net/lore/all/20150415172856.GA5029@twins.programming.kicks-ass.net/
>>
>> The event.cpus still faces the same issues.
> 
> Why so? The events would share the same perf_event_context, I thought
> the perf_event_open would succeed.

I think it breaks what groups are as well. At least, you cannot
guarantee that two events can be co-scheduled on the same CPU. Even
worse, there could be no events on some CPU.
The first thing that pops up is the sample read feature. On CPU_A, the
event_A is the leader event, but on CPU_B, the event_B could be the
leader event if event_A's cpumask doesn't include the CPU_B.
There could be more similar features we have to special handle.

> 
>>>
>>>> But it makes the kernel driver complex, since it has to maintain at
>>>> least two different cpumasks.
>>>
>>> Two drivers each maintaining a cpumask or 1 driver maintaining 2
>>> cpumasks, it seems like there is chance for code reuse in both cases.
>>> I'm not seeing how it adds to complexity particularly.
>>
>> Yes, there are some cleanup opportunities for the cpumask/hotplug codes.
>>
>> We may even abstracts some generic interfaces for pkg or core level PMUs.
>>
>> Eventually, the complexity/duplication should be able to be reduced.
>>
>>>
>>>> The tool becomes complex either, since it has to take care of the
>>>> per-event CPU override case.
>>>
>>> I'm not sure I agree with this. What we need for perf_event_open is a
>>> perf_event_attr, we dress these up as evsels which also have the
>>> ability to be for >1 CPU or thread. Longer term I think evsels can
>>> also have >1 PMU, for the wildcard cases like uncore memory
>>> controllers - this would remove the need for resorting evsels except
>>> for topdown events which have thrown us a giant bundle of challenges.
>>> Anyway, so an evsel is perf_event_attr information paired with CPU
>>> information, it makes sense to me that the parser should do this
>>> pairing. What's harder in the evsel/evlist setup is trying to fix CPU
>>> maps up not in parsing, like with __perf_evlist__propagate_maps where
>>> the parsing is trying to leave crumbs around (like system_wide,
>>> has_user_cpus, is_pmu_core) so the map propagation works properly.
>>>
>>>> The json file must also be updated to add a
>>>> new field cpumask.
>>>
>>> Yeah, I don't like this as it means we end up putting CPU information
>>> into the json that isn't the same for every CPU variant of the same
>>> architecture model. Maybe we can have some kind of "scope" enum value
>>> in the json and then when the scope differs from the PMU's, core scope
>>> vs the PMU's hyperthread scope, then in the tool we can figure out the
>>> cpumask from the topology in sysfs. Maybe we should just always use
>>> the topology and get rid of cpumask files in sysfs, replacing them
>>> with "scope" files. Will Deacon pushed back on having ARM PMUs
>>> supporting hot plugging
>>> (https://lore.kernel.org/lkml/20240701142222.GA2691@willie-the-truck/)
>>> where the main thing hot plugging handler needs to maintain is set the
>>> cpumask.
>>
>> Not just the cpumask but also migrate the context for some PMUs, see
>> perf_pmu_migrate_context().
> 
> Will do, thanks.
> 
>> It seems we really need a shared cpumask in the generic code, so the
>> drivers don't need to handle the hotplug everywhere. I will think about it.
> 
> Thanks. There are other problems on ARM PMUs like having no or empty
> cpumasks, which the tool take to mean open the event on every online
> CPU (there is no cpus or cpumask file on core PMUs historically, so we
> adopt this behavior when the cpumask is empty or not present). 

The no cpus/cpumasks and empty cpumasks should be different.
No cpus/cpumasks file means that the counters/events are available for
all the CPUs.
But if it's an empty cpus/cpumasks file, it means that there are no
proper CPUs. It may happen on a hybrid machine when all e-core CPUs are
hot-removed. Since it's possible that the CPUs can be hot-added later,
the kernel driver doesn't unregister the cpu_atom PMU.

> I've
> been thinking to expand `tools/perf/tests/pmu.c` with basic PMU sanity
> tests. Some ideas:
> 

Thanks.

> 1) some kind of cpumask sanity check - we could open an event with the
> cpumask and see if it yields multiplexing.. which would highlight the
> ARM no cpumask bug

The multiplexing is triggered when there are not enough counters. It
should not related to the cpumask.

For the PMU without cpumask, I think the test case should try to open an
event on all CPUs and check if the open succeeds.
For the PMU with cpumask, the test case should try to open an event on
the masked CPUs and check if the open succeeds.

The behavior of opening an event on unmasked CPUs seems not defined.
For uncore, it's OK. The kernel treats the CPUs from the same socket
exactly the same. But I'm not sure about the other PMUs.

> 2) do the /sys/devices/<pmu>/events/event.(unit|scale|per-pkg|snapshot)
> files parse correctly and have a corresponding event.
> 3) keep adding opening events on the PMU to a group to make sure that
> when counters are exhausted the perf_event_open fails (I've seen this
> bug on AMD)
> 4) are the values in the type file unique
> 

The rest sounds good to me.

Thanks,
Kan