[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAP-5=fXBYVH=qBJPm1d0-QEp8+HP3WtTLuukfF0g-2WU3gTofQ@mail.gmail.com>
Date: Tue, 16 Jul 2024 15:47:22 -0700
From: Ian Rogers <irogers@...gle.com>
To: Dhananjay Ugwekar <Dhananjay.Ugwekar@....com>
Cc: peterz@...radead.org, mingo@...hat.com, acme@...nel.org,
namhyung@...nel.org, mark.rutland@....com, alexander.shishkin@...ux.intel.com,
jolsa@...nel.org, adrian.hunter@...el.com, kan.liang@...ux.intel.com,
tglx@...utronix.de, bp@...en8.de, dave.hansen@...ux.intel.com, x86@...nel.org,
kees@...nel.org, gustavoars@...nel.org, rui.zhang@...el.com,
oleksandr@...alenko.name, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-hardening@...r.kernel.org,
ananth.narayan@....com, gautham.shenoy@....com, kprateek.nayak@....com,
ravi.bangoria@....com, sandipan.das@....com, linux-pm@...r.kernel.org
Subject: Re: [PATCH v4 00/11] Add per-core RAPL energy counter support for AMD CPUs
On Tue, Jul 16, 2024 at 1:42 AM Dhananjay Ugwekar
<Dhananjay.Ugwekar@....com> wrote:
>
> Hello Ian,
>
> On 7/15/2024 8:52 PM, Ian Rogers wrote:
> > On Mon, Jul 15, 2024 at 2:36 AM Dhananjay Ugwekar
> > <Dhananjay.Ugwekar@....com> wrote:
> >>
> >> Hello Ian,
> >>
> >> On 7/12/2024 3:53 AM, Ian Rogers wrote:
> >>> On Thu, Jul 11, 2024 at 3:25 AM Dhananjay Ugwekar
> >>> <Dhananjay.Ugwekar@....com> wrote:
> >>>>
> >>>> Currently the energy-cores event in the power PMU aggregates energy
> >>>> consumption data at a package level. On the other hand the core energy
> >>>> RAPL counter in AMD CPUs has a core scope (which means the energy
> >>>> consumption is recorded separately for each core). Earlier efforts to add
> >>>> the core event in the power PMU had failed [1], due to the difference in
> >>>> the scope of these two events. Hence, there is a need for a new core scope
> >>>> PMU.
> >>>>
> >>>> This patchset adds a new "power_per_core" PMU alongside the existing
> >>>> "power" PMU, which will be responsible for collecting the new
> >>>> "energy-per-core" event.
> >>>
> >>> Sorry for being naive, is the only reason for adding the new PMU for
> >>> the sake of the cpumask? Perhaps we can add per event cpumasks like
> >>> say `/sys/devices/power/events/energy-per-core.cpumask` which contains
> >>> the CPUs of the different cores in this case. There's supporting
> >>> hotplugging CPUs as an issue. Adding the tool support for this
> >>> wouldn't be hard and it may be less messy (although old perf tools on
> >>> new kernels wouldn't know about these files).
> >>
> >> I went over the two approaches and below are my thoughts,
> >>
> >> New PMU approach:
> >> Pros
> >> * It will work with older perf tools, hence these patches can be backported to an older kernel and the new per-core event will work there as well.
> >> Cons
> >> * More code changes in rapl.c
> >>
> >> Event specific cpumask approach:
> >> Pros
> >> * It might be easier to add diff scope events within the same PMU in future(although currently I'm not able to find such a usecase, apart from the RAPL pkg and core energy counters)
> >> Cons
> >> * Both new kernel and perf tool will be required to use the new per-core event.
> >>
> >> I feel that while the event-specific cpumask is a viable alternative to the new PMU addition approach, I dont see any clear pros to select that over the current approach. Please let me know if you have any design related concerns to the addition of new PMU or your concern is mostly about the amount of code changes in this approach.
> >
> > Thanks Dhananjay, and thanks for taking the time for an objective
> > discussion on the mailing list. I'm very supportive of seeing the work
> > you are enabling land.
> >
> > My concern comes from the tool side. If every PMU starts to have
> > variants for the sake of the cpumask what does this mean for
> > aggregation in the perf tool? There is another issue around event
> > grouping, you can't group events across PMUs, but my feeling is that
> > event grouping needs to be rethought. By default the power_per_core
> > events are going to be aggregated together by the perf tool, which
> > then loses their per_core-ness.
>
> Yea right, maybe we need to fix this behavior.
>
> >
> > I was trying to think of the same problem but in other PMUs. One
> > thought I had was the difference between hyperthread and core events.
> > At least on Intel, some events can only count for the whole core not
> > per hyperthread. The events don't have a cpu_per_core PMU, they just
> > use the regular cpu one, and so the cpumask is set to all online
> > hyperthreads. When a per-core event is programmed it will get
> > programmed on every hyperthread and so counted twice for the core.
> > This at the least wastes a counter, but it probably also yields twice
> > the expected count as every event is counted twice then aggregated. So
> > this is just wrong and the user is expected to be smart and fix it
> > (checking the x86 events there is a convention to use a ".ALL" or
> > ".ANY" suffix for core wide events iirc). If we had a cpumask for
> > these events then we could avoid the double setting, free up a counter
> > and avoid double counting. Were we to fix things the way it is done in
> > this patch series we'd add another PMU.
>
> Yes, this seems like a valid usecase for event-specific cpumasks.
>
> >
> > My feeling is that in the longer term a per event cpumask looks
> > cleaner. I think either way you need a new kernel for the new RAPL
> > events. The problem with an old perf tool and a new kernel, this
> > doesn't normally happen with distributions as they match the perf tool
> > to the kernel version needlessly losing features and fixes along the
> > way. If the new PMU is going to get backported through fixes.. then we
> > can do similar for reading the per event cpumask. I'd be tempted not
> > to do this and focus on the next LTS kernel, getting the kernel and
> > tool fixes in as necessary.
>
> Makes sense, even though this approach will require more effort but it seems
> to be worthwhile as it would help things down the line (make it easier to have
> heterogenous-scope events within a PMU). I'll need to go through the perf tool
> to see how we can design this. I'll get back with an RFC series probably once
> I have an initial design in mind.
Hi Dhananjay,
I can have a go at the perf tool side of this - I probably know the
way around the code best. Basically we need to do something similar to
how other "<event>.<setting>" values are parsed:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.c?h=perf-tools-next#n335
The cpumask handling in the perf tool is a little weird as there is a
summary value in the evlist. Anyway, I can do that if you want to spin
the RAPL/power PMU side of things.
Thanks,
Ian
> Thanks,
> Dhananjay
>
> >
> > Thanks,
> > Ian
> >
> >
> >> Thanks,
> >> Dhananjay
> >>
> >>>
> >>> Thanks,
> >>> Ian
> >>>
> >>>
> >>>> Tested the package level and core level PMU counters with workloads
> >>>> pinned to different CPUs.
> >>>>
> >>>> Results with workload pinned to CPU 1 in Core 1 on an AMD Zen4 Genoa
> >>>> machine:
> >>>>
> >>>> $ perf stat -a --per-core -e power_per_core/energy-per-core/ -- sleep 1
> >>>>
> >>>> Performance counter stats for 'system wide':
> >>>>
> >>>> S0-D0-C0 1 0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C1 1 5.72 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C2 1 0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C3 1 0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C4 1 0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C5 1 0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C6 1 0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C7 1 0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C8 1 0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C9 1 0.02 Joules power_per_core/energy-per-core/
> >>>> S0-D0-C10 1 0.02 Joules power_per_core/energy-per-core/
> >>>>
> >>>> [1]: https://lore.kernel.org/lkml/3e766f0e-37d4-0f82-3868-31b14228868d@linux.intel.com/
> >>>>
> >>>> This patchset applies cleanly on top of v6.10-rc7 as well as latest
> >>>> tip/master.
> >>>>
> >>>> v4 changes:
> >>>> * Add patch 11 which removes the unused function cpu_to_rapl_pmu()
> >>>> * Add Rui's rb tag for patch 1
> >>>> * Invert the pmu scope check logic in patch 2 (Peter)
> >>>> * Add comments explaining the scope check in patch 2 (Peter)
> >>>> * Use cpumask_var_t instead of cpumask_t in patch 5 (Peter)
> >>>> * Move renaming code to patch 8 (Rui)
> >>>> * Reorder the cleanup order of per-core and per-pkg PMU in patch 10 (Rui)
> >>>> * Add rapl_core_hw_unit variable to store the per-core PMU unit in patch
> >>>> 10 (Rui)
> >>>>
> >>>> PS: Scope check logic is still kept the same (i.e., all Intel systems being
> >>>> considered as die scope), Rui will be modifying it to limit the die-scope
> >>>> only to Cascadelake-AP in a future patch on top of this patchset.
> >>>>
> >>>> v3 changes:
> >>>> * Patch 1 added to introduce the logical_core_id which is unique across
> >>>> the system (Prateek)
> >>>> * Use the unique topology_logical_core_id() instead of
> >>>> topology_core_id() (which is only unique within a package on tested
> >>>> AMD and Intel systems) in Patch 10
> >>>>
> >>>> v2 changes:
> >>>> * Patches 6,7,8 added to split some changes out of the last patch
> >>>> * Use container_of to get the rapl_pmus from event variable (Rui)
> >>>> * Set PERF_EV_CAP_READ_ACTIVE_PKG flag only for pkg scope PMU (Rui)
> >>>> * Use event id 0x1 for energy-per-core event (Rui)
> >>>> * Use PERF_RAPL_PER_CORE bit instead of adding a new flag to check for
> >>>> per-core counter hw support (Rui)
> >>>>
> >>>> Dhananjay Ugwekar (10):
> >>>> perf/x86/rapl: Fix the energy-pkg event for AMD CPUs
> >>>> perf/x86/rapl: Rename rapl_pmu variables
> >>>> perf/x86/rapl: Make rapl_model struct global
> >>>> perf/x86/rapl: Move cpumask variable to rapl_pmus struct
> >>>> perf/x86/rapl: Add wrapper for online/offline functions
> >>>> perf/x86/rapl: Add an argument to the cleanup and init functions
> >>>> perf/x86/rapl: Modify the generic variable names to *_pkg*
> >>>> perf/x86/rapl: Remove the global variable rapl_msrs
> >>>> perf/x86/rapl: Add per-core energy counter support for AMD CPUs
> >>>> perf/x86/rapl: Remove the unused function cpu_to_rapl_pmu
> >>>>
> >>>> K Prateek Nayak (1):
> >>>> x86/topology: Introduce topology_logical_core_id()
> >>>>
> >>>> Documentation/arch/x86/topology.rst | 4 +
> >>>> arch/x86/events/rapl.c | 454 ++++++++++++++++++--------
> >>>> arch/x86/include/asm/processor.h | 1 +
> >>>> arch/x86/include/asm/topology.h | 1 +
> >>>> arch/x86/kernel/cpu/debugfs.c | 1 +
> >>>> arch/x86/kernel/cpu/topology_common.c | 1 +
> >>>> 6 files changed, 328 insertions(+), 134 deletions(-)
> >>>>
> >>>> --
> >>>> 2.34.1
> >>>>
Powered by blists - more mailing lists