[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aAaVH4W72fOzQhnh@agluck-desk3>
Date: Mon, 21 Apr 2025 11:57:35 -0700
From: "Luck, Tony" <tony.luck@...el.com>
To: Reinette Chatre <reinette.chatre@...el.com>
Cc: Fenghua Yu <fenghuay@...dia.com>,
Maciej Wieczor-Retman <maciej.wieczor-retman@...el.com>,
Peter Newman <peternewman@...gle.com>,
James Morse <james.morse@....com>, Babu Moger <babu.moger@....com>,
Drew Fustini <dfustini@...libre.com>,
Dave Martin <Dave.Martin@....com>,
Anil Keshavamurthy <anil.s.keshavamurthy@...el.com>,
linux-kernel@...r.kernel.org, patches@...ts.linux.dev
Subject: Re: [PATCH v3 00/26] x86/resctrl telemetry monitoring
On Fri, Apr 18, 2025 at 02:13:39PM -0700, Reinette Chatre wrote:
> One aspect that is only hinted to in the final documentation patch is
> how users are expected to use this feature. As I understand the number of
> monitor groups supported by resctrl is still guided by the number of RMIDs
> supported by L3 monitoring. This work hints that the telemetry feature may
> not match that number of RMIDs and a monitor group may thus exist but
> when a user attempts to ready any of these perf files it will return
> "unavailable".
>
> The series attempts to address it by placing the number of RMIDs available
> for this feature in a "num_rmids" file, but since the RMID assigned to a monitor
> group is not exposed to user space (unless debugging enabled) the user does
> not know if a monitor group will support this feature or not. This seems awkward
> to me. Why not limit the number of monitor groups that can be created to the
> minimum number of RMIDs across these resources like what is done for CLOSid?
Reinette,
The mismatch between number of RMIDs supported by different components
is a thorny one, and may keep repeating since it feels like systems are
composed of a bunch of lego-like bricks snapped together from a box of
parts available to the h/w architect.
In this case we have three meanings for "number of RMIDs":
1) The number for legacy features enumerated by CPUID leaf 0xF.
2) The number of registers in MMIO space for each event. This is
enumerated in the XML files and is the value I placed into telem_entry::num_rmids.
3) The number of "h/w counters" (this isn't a strictly accurate
description of how things work, but serves as a useful analogy that
does describe the limitations) feeding to those MMIO registers. This is
enumerated in telemetry_region::num_rmids returned from the call to
intel_pmt_get_regions_by_feature()
If "1" is the smallest of these values, the OS will be limited in
which values can be written to the IA32_PQR_ASSOC MSR. Existing
code will do the right thing by limiting RMID allocation to this
value.
If "2" is greater than "1", then the extra MMIO registers will
sit unused.
If "2" is less than "1" my v3 returns the (problematic) -ENOENT
This can't happen in the CPU that debuts this feature, but the check
is there to prevent running past the end of the MMIO space in case
this does occur some day. I'll fix error path in next version to
make sure this end up with "Unavailable".
If "3" is less than "2" then the system will attach "h/w counters" to
MMIO registers in a "most recently used" algorithm. So if the number
of active RMIDs in some time interval is less than "3" the user will
get good values. But if the number of active RMIDs rises above "3"
then the user will see "Unavailable" returns as "h/w counters" are
reassigned to different RMIDs (making the feature really hard to use).
In the debut CPU the "energy" feature has sufficient "energy" counters
to avoid this. But not enough "perf" counters. I've pushed and the
next CPU with the feature will have enough "h/w counters".
My proposal for v4:
Add new options to the "rdt=" kernel boot parameter for "energy"
and "perf".
Treat the case where there are not enough "h/w counters" as an erratum
and do not enable the feature. User can override with "rdt=perf"
if they want the counters for some special case where they limit
the number of simultaneous active RMIDs.
User can use "rdt=!energy,!perf" if they don't want to see the
clutter of all the new files in each mon_data directory.
I'll maybe look at moving resctrl_mon_resource_init() to rdt_get_tree()
and add a "take min of all RMID limits". But since this is a "can't
happen" scenario I may skip this if it starts to get complicated.
Which leaves what should be in info/PERF_PKG_MON/num_rmids? It's
possible that some CPU implementation will have different MMIO
register counts for "perf" and "energy". It's more than possible
that number of "h/w counters" will be different. But they share the
same info file. My v3 code reports the minimum of the number
of "h/w counters" which is the most conservative option. It tells
the user not to make more mon_data directories than this if they
want usable counters across *both* perf and energy. Though they
will actually keep getting good "energy" results even if then
go past this limit.
-Tony
>
Powered by blists - more mailing lists