[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <64030eab-6e95-494a-ab72-bc33792ef723@linux.intel.com>
Date: Thu, 18 Jul 2024 13:47:47 -0400
From: "Liang, Kan" <kan.liang@...ux.intel.com>
To: Ian Rogers <irogers@...gle.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Namhyung Kim <namhyung@...nel.org>, Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>, Adrian Hunter <adrian.hunter@...el.com>,
Bjorn Helgaas <bhelgaas@...gle.com>, Jonathan Corbet <corbet@....net>,
James Clark <james.clark@....com>, Ravi Bangoria <ravi.bangoria@....com>,
Dominique Martinet <asmadeus@...ewreck.org>, linux-kernel@...r.kernel.org,
linux-perf-users@...r.kernel.org,
Dhananjay Ugwekar <Dhananjay.Ugwekar@....com>, ananth.narayan@....com,
gautham.shenoy@....com, kprateek.nayak@....com, sandipan.das@....com
Subject: Re: [PATCH v2 3/6] perf pmu: Add support for event.cpus files in
sysfs
On 2024-07-18 11:39 a.m., Ian Rogers wrote:
> On Thu, Jul 18, 2024 at 7:33 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
>>
>>
>>
>> On 2024-07-17 8:30 p.m., Ian Rogers wrote:
>>> If an event file exists in sysfs, check if a event.cpus file exists
>>> and read a perf_cpu_map from it if it does. This allows particular
>>> events to have a different set of CPUs compared to the PMU.
>>>
>>> One scenario where this could be useful is when a PMU is set up with a
>>> cpumask/events per SMT thread but some events record for all SMT
>>> threads. Programming an event on each SMT thread will cause
>>> unnecessary counters to be programmed and the aggregate value to be
>>> too large.
>>
>> If I understand the scenario correctly, I think the issue should have
>> been addressed since ICX, with the introduction of the "SMT-aware
>> events". Is there a specific event which still has the issue on newer
>> platforms?
>
> Nothing comes to mind, but it is there on popular machines like Skylake.
>
>>>
>>> Another scenario where this could be useful if when a PMU has
>>> historically had a cpumask at the package level, but now newer per
>>> die, core or CPU information is available.
>>
>> The traditional way to support new counters with a different scope is to
>> add a new PMU.
>>
>>>
>>> Additional context for the motivation is in these patches and
>>> conversation:
>>> https://lore.kernel.org/lkml/20240711102436.4432-1-Dhananjay.Ugwekar@amd.com/
>>
>> I don't see the advantages of the new event.cpus way.
>> The aggregation should be the same.
>
> Agreed. My concern is that we may end up with a pattern of
> <pmu>_per_pkg, <pmu>_per_core, <pmu>_per_cpu, <pmu>_per_l3, etc. just
> for the sake of the cpumask.
The cstate PMUs already do so, e.g., cstate_core, cstate_pkg.
>From another perspective, it discloses the scope information in a PMU
name. If a user only cares about the information of a package, they just
need to focus on everything under <pmu>_pkg. Otherwise, they have to go
through all the events.
>
>> The RAPL counters are free-running counters. So there is no difference
>> whether grouping or not grouping.
>
> Should the RAPL counters return true for perf_pmu__is_software? In the
> tool we currently return false and won't allow grouping, these events
> with other hardware events. The intent in perf_pmu__is_software was to
> emulate the way the kernel allows/disallows groups - software context
> events can be in a hardware context but otherwise I don't believe the
> grouping is allowed.
No, it's not allowed for the RAPL counters.
If the motivation is to find another way to group counters with
different scope, it may not work.
We once tried to mix the perf_invalid_context PMUs in a group. But it's
denied.
https://yhbt.net/lore/all/20150415172856.GA5029@twins.programming.kicks-ass.net/
The event.cpus still faces the same issues.
>
>> But it makes the kernel driver complex, since it has to maintain at
>> least two different cpumasks.
>
> Two drivers each maintaining a cpumask or 1 driver maintaining 2
> cpumasks, it seems like there is chance for code reuse in both cases.
> I'm not seeing how it adds to complexity particularly.
Yes, there are some cleanup opportunities for the cpumask/hotplug codes.
We may even abstracts some generic interfaces for pkg or core level PMUs.
Eventually, the complexity/duplication should be able to be reduced.
>
>> The tool becomes complex either, since it has to take care of the
>> per-event CPU override case.
>
> I'm not sure I agree with this. What we need for perf_event_open is a
> perf_event_attr, we dress these up as evsels which also have the
> ability to be for >1 CPU or thread. Longer term I think evsels can
> also have >1 PMU, for the wildcard cases like uncore memory
> controllers - this would remove the need for resorting evsels except
> for topdown events which have thrown us a giant bundle of challenges.
> Anyway, so an evsel is perf_event_attr information paired with CPU
> information, it makes sense to me that the parser should do this
> pairing. What's harder in the evsel/evlist setup is trying to fix CPU
> maps up not in parsing, like with __perf_evlist__propagate_maps where
> the parsing is trying to leave crumbs around (like system_wide,
> has_user_cpus, is_pmu_core) so the map propagation works properly.
>
>> The json file must also be updated to add a
>> new field cpumask.
>
> Yeah, I don't like this as it means we end up putting CPU information
> into the json that isn't the same for every CPU variant of the same
> architecture model. Maybe we can have some kind of "scope" enum value
> in the json and then when the scope differs from the PMU's, core scope
> vs the PMU's hyperthread scope, then in the tool we can figure out the
> cpumask from the topology in sysfs. Maybe we should just always use
> the topology and get rid of cpumask files in sysfs, replacing them
> with "scope" files. Will Deacon pushed back on having ARM PMUs
> supporting hot plugging
> (https://lore.kernel.org/lkml/20240701142222.GA2691@willie-the-truck/)
> where the main thing hot plugging handler needs to maintain is set the
> cpumask.
Not just the cpumask but also migrate the context for some PMUs, see
perf_pmu_migrate_context().
It seems we really need a shared cpumask in the generic code, so the
drivers don't need to handle the hotplug everywhere. I will think about it.
Thanks,
Kan
Powered by blists - more mailing lists