linux-kernel - Re: [PATCH v2 3/6] perf pmu: Add support for event.cpus files in sysfs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAP-5=fUjEYwdOdmfa5N7b8OOLWDitJKBdeOr8-+UOYWA5+ehkA@mail.gmail.com>
Date: Fri, 19 Jul 2024 07:59:25 -0700
From: Ian Rogers <irogers@...gle.com>
To: "Liang, Kan" <kan.liang@...ux.intel.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
	Arnaldo Carvalho de Melo <acme@...nel.org>, Namhyung Kim <namhyung@...nel.org>, 
	Mark Rutland <mark.rutland@....com>, 
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>, 
	Adrian Hunter <adrian.hunter@...el.com>, Bjorn Helgaas <bhelgaas@...gle.com>, 
	Jonathan Corbet <corbet@....net>, James Clark <james.clark@....com>, 
	Ravi Bangoria <ravi.bangoria@....com>, Dominique Martinet <asmadeus@...ewreck.org>, 
	linux-kernel@...r.kernel.org, linux-perf-users@...r.kernel.org, 
	Dhananjay Ugwekar <Dhananjay.Ugwekar@....com>, ananth.narayan@....com, gautham.shenoy@....com, 
	kprateek.nayak@....com, sandipan.das@....com
Subject: Re: [PATCH v2 3/6] perf pmu: Add support for event.cpus files in sysfs

On Fri, Jul 19, 2024 at 6:56 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
>
>
>
> On 2024-07-18 4:50 p.m., Ian Rogers wrote:
> > On Thu, Jul 18, 2024 at 10:48 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
> >>
> >>
> >>
> >> On 2024-07-18 11:39 a.m., Ian Rogers wrote:
> >>> On Thu, Jul 18, 2024 at 7:33 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2024-07-17 8:30 p.m., Ian Rogers wrote:
> >>>>> If an event file exists in sysfs, check if a event.cpus file exists
> >>>>> and read a perf_cpu_map from it if it does. This allows particular
> >>>>> events to have a different set of CPUs compared to the PMU.
> >>>>>
> >>>>> One scenario where this could be useful is when a PMU is set up with a
> >>>>> cpumask/events per SMT thread but some events record for all SMT
> >>>>> threads. Programming an event on each SMT thread will cause
> >>>>> unnecessary counters to be programmed and the aggregate value to be
> >>>>> too large.
> >>>>
> >>>> If I understand the scenario correctly, I think the issue should have
> >>>> been addressed since ICX, with the introduction of  the "SMT-aware
> >>>> events". Is there a specific event which still has the issue on newer
> >>>> platforms?
> >>>
> >>> Nothing comes to mind, but it is there on popular machines like Skylake.
> >>>
> >>>>>
> >>>>> Another scenario where this could be useful if when a PMU has
> >>>>> historically had a cpumask at the package level, but now newer per
> >>>>> die, core or CPU information is available.
> >>>>
> >>>> The traditional way to support new counters with a different scope is to
> >>>> add a new PMU.
> >>>>
> >>>>>
> >>>>> Additional context for the motivation is in these patches and
> >>>>> conversation:
> >>>>> https://lore.kernel.org/lkml/20240711102436.4432-1-Dhananjay.Ugwekar@amd.com/
> >>>>
> >>>> I don't see the advantages of the new event.cpus way.
> >>>> The aggregation should be the same.
> >>>
> >>> Agreed. My concern is that we may end up with a pattern of
> >>> <pmu>_per_pkg, <pmu>_per_core, <pmu>_per_cpu, <pmu>_per_l3, etc. just
> >>> for the sake of the cpumask.
> >>
> >> The cstate PMUs already do so, e.g., cstate_core, cstate_pkg.
> >>
> >> From another perspective, it discloses the scope information in a PMU
> >> name. If a user only cares about the information of a package, they just
> >> need to focus on everything under <pmu>_pkg. Otherwise, they have to go
> >> through all the events.
> >>
> >>>
> >>>> The RAPL counters are free-running counters. So there is no difference
> >>>> whether grouping or not grouping.
> >>>
> >>> Should the RAPL counters return true for perf_pmu__is_software? In the
> >>> tool we currently return false and won't allow grouping, these events
> >>> with other hardware events. The intent in perf_pmu__is_software was to
> >>> emulate the way the kernel allows/disallows groups - software context
> >>> events can be in a hardware context but otherwise I don't believe the
> >>> grouping is allowed.
> >>
> >> No, it's not allowed for the RAPL counters.
> >>
> >> If the motivation is to find another way to group counters with
> >> different scope, it may not work.
> >>
> >> We once tried to mix the perf_invalid_context PMUs in a group. But it's
> >> denied.
> >> https://yhbt.net/lore/all/20150415172856.GA5029@twins.programming.kicks-ass.net/
> >>
> >> The event.cpus still faces the same issues.
> >
> > Why so? The events would share the same perf_event_context, I thought
> > the perf_event_open would succeed.
>
> I think it breaks what groups are as well. At least, you cannot
> guarantee that two events can be co-scheduled on the same CPU. Even
> worse, there could be no events on some CPU.
> The first thing that pops up is the sample read feature. On CPU_A, the
> event_A is the leader event, but on CPU_B, the event_B could be the
> leader event if event_A's cpumask doesn't include the CPU_B.
> There could be more similar features we have to special handle.

Thanks Kan. I'm not wondering about a case of 2 CPUs, say on CPU0 and
solely its perf event context, I want to know its core power and
package power as a group so I never record one without the other. That
grouping wouldn't be possible with 2 PMUs.

> >
> >>>
> >>>> But it makes the kernel driver complex, since it has to maintain at
> >>>> least two different cpumasks.
> >>>
> >>> Two drivers each maintaining a cpumask or 1 driver maintaining 2
> >>> cpumasks, it seems like there is chance for code reuse in both cases.
> >>> I'm not seeing how it adds to complexity particularly.
> >>
> >> Yes, there are some cleanup opportunities for the cpumask/hotplug codes.
> >>
> >> We may even abstracts some generic interfaces for pkg or core level PMUs.
> >>
> >> Eventually, the complexity/duplication should be able to be reduced.
> >>
> >>>
> >>>> The tool becomes complex either, since it has to take care of the
> >>>> per-event CPU override case.
> >>>
> >>> I'm not sure I agree with this. What we need for perf_event_open is a
> >>> perf_event_attr, we dress these up as evsels which also have the
> >>> ability to be for >1 CPU or thread. Longer term I think evsels can
> >>> also have >1 PMU, for the wildcard cases like uncore memory
> >>> controllers - this would remove the need for resorting evsels except
> >>> for topdown events which have thrown us a giant bundle of challenges.
> >>> Anyway, so an evsel is perf_event_attr information paired with CPU
> >>> information, it makes sense to me that the parser should do this
> >>> pairing. What's harder in the evsel/evlist setup is trying to fix CPU
> >>> maps up not in parsing, like with __perf_evlist__propagate_maps where
> >>> the parsing is trying to leave crumbs around (like system_wide,
> >>> has_user_cpus, is_pmu_core) so the map propagation works properly.
> >>>
> >>>> The json file must also be updated to add a
> >>>> new field cpumask.
> >>>
> >>> Yeah, I don't like this as it means we end up putting CPU information
> >>> into the json that isn't the same for every CPU variant of the same
> >>> architecture model. Maybe we can have some kind of "scope" enum value
> >>> in the json and then when the scope differs from the PMU's, core scope
> >>> vs the PMU's hyperthread scope, then in the tool we can figure out the
> >>> cpumask from the topology in sysfs. Maybe we should just always use
> >>> the topology and get rid of cpumask files in sysfs, replacing them
> >>> with "scope" files. Will Deacon pushed back on having ARM PMUs
> >>> supporting hot plugging
> >>> (https://lore.kernel.org/lkml/20240701142222.GA2691@willie-the-truck/)
> >>> where the main thing hot plugging handler needs to maintain is set the
> >>> cpumask.
> >>
> >> Not just the cpumask but also migrate the context for some PMUs, see
> >> perf_pmu_migrate_context().
> >
> > Will do, thanks.
> >
> >> It seems we really need a shared cpumask in the generic code, so the
> >> drivers don't need to handle the hotplug everywhere. I will think about it.
> >
> > Thanks. There are other problems on ARM PMUs like having no or empty
> > cpumasks, which the tool take to mean open the event on every online
> > CPU (there is no cpus or cpumask file on core PMUs historically, so we
> > adopt this behavior when the cpumask is empty or not present).
>
> The no cpus/cpumasks and empty cpumasks should be different.
> No cpus/cpumasks file means that the counters/events are available for
> all the CPUs.
> But if it's an empty cpus/cpumasks file, it means that there are no
> proper CPUs. It may happen on a hybrid machine when all e-core CPUs are
> hot-removed. Since it's possible that the CPUs can be hot-added later,
> the kernel driver doesn't unregister the cpu_atom PMU.
>
> > I've
> > been thinking to expand `tools/perf/tests/pmu.c` with basic PMU sanity
> > tests. Some ideas:
> >
>
> Thanks.
>
> > 1) some kind of cpumask sanity check - we could open an event with the
> > cpumask and see if it yields multiplexing.. which would highlight the
> > ARM no cpumask bug
>
> The multiplexing is triggered when there are not enough counters. It
> should not related to the cpumask.

Agreed. Here I'm thinking about the bugs I see in PMUs. One of them is
to always succeed in opening siblings within a group and for the
unavailable counters to just report "not counted". This breaks weak
groups used by metrics.

> For the PMU without cpumask, I think the test case should try to open an
> event on all CPUs and check if the open succeeds.
> For the PMU with cpumask, the test case should try to open an event on
> the masked CPUs and check if the open succeeds.

I agree without cpumask means all CPUs, the bug I see on ARM PMUs is
that they have uncore PMUs with no or empty cpumasks leading to the
uncore events being programmed on every CPU and over counting,
multiplexing counters and so on. I'm trying to devise tests so that
they can see they are broken.

> The behavior of opening an event on unmasked CPUs seems not defined.
> For uncore, it's OK. The kernel treats the CPUs from the same socket
> exactly the same. But I'm not sure about the other PMUs.

My understanding had been that for core PMUs a "perf stat -C" option
would choose the particular CPU to count the event on, for an uncore
PMU the -C option would override the cpumask's "default" value. We
have code to validate this:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/evlist.c?h=perf-tools-next#n2522
But it seems now that overriding an uncore PMU's default CPU is
ignored. If you did:
$ perf stat -C 1 -e data_read -a sleep 0.1
Then the tool thinks data_read is on CPU1 and will set its thread
affinity to CPU1 to avoid IPIs. It seems to fix this we need to just
throw away the -C option.

> > 2) do the /sys/devices/<pmu>/events/event.(unit|scale|per-pkg|snapshot)
> > files parse correctly and have a corresponding event.
> > 3) keep adding opening events on the PMU to a group to make sure that
> > when counters are exhausted the perf_event_open fails (I've seen this
> > bug on AMD)
> > 4) are the values in the type file unique
> >
>
> The rest sounds good to me.

Cool. Let me know if you can think of more.

Thanks,
Ian

> Thanks,
> Kan