Message-ID: <ccb33985-e1d8-449e-b39e-3fccb5fc0783@intel.com>
Date: Mon, 21 Apr 2025 15:59:15 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: "Luck, Tony" <tony.luck@...el.com>
CC: Fenghua Yu <fenghuay@...dia.com>, Maciej Wieczor-Retman
<maciej.wieczor-retman@...el.com>, Peter Newman <peternewman@...gle.com>,
James Morse <james.morse@....com>, Babu Moger <babu.moger@....com>, "Drew
Fustini" <dfustini@...libre.com>, Dave Martin <Dave.Martin@....com>, "Anil
Keshavamurthy" <anil.s.keshavamurthy@...el.com>,
<linux-kernel@...r.kernel.org>, <patches@...ts.linux.dev>
Subject: Re: [PATCH v3 00/26] x86/resctrl telemetry monitoring
Hi Tony,
On 4/21/25 11:57 AM, Luck, Tony wrote:
> On Fri, Apr 18, 2025 at 02:13:39PM -0700, Reinette Chatre wrote:
>> One aspect that is only hinted at in the final documentation patch is
>> how users are expected to use this feature. As I understand it, the number
>> of monitor groups supported by resctrl is still guided by the number of RMIDs
>> supported by L3 monitoring. This work hints that the telemetry feature may
>> not match that number of RMIDs, so a monitor group may exist but,
>> when a user attempts to read any of its perf files, the read will return
>> "unavailable".
>>
>> The series attempts to address this by placing the number of RMIDs available
>> for this feature in a "num_rmids" file, but since the RMID assigned to a
>> monitor group is not exposed to user space (unless debugging is enabled) the
>> user does not know whether a monitor group will support this feature. This
>> seems awkward to me. Why not limit the number of monitor groups that can be
>> created to the minimum number of RMIDs across these resources, like what is
>> done for CLOSIDs?
>
> Reinette,
>
> The mismatch between number of RMIDs supported by different components
> is a thorny one, and may keep repeating since it feels like systems are
> composed of a bunch of lego-like bricks snapped together from a box of
> parts available to the h/w architect.
With resctrl needing to support multiple architectures' ways of doing things,
needing to support variety within an architecture just seems like another step.
>
> In this case we have three meanings for "number of RMIDs":
>
> 1) The number for legacy features enumerated by CPUID leaf 0xF.
>
> 2) The number of registers in MMIO space for each event. This is
> enumerated in the XML files and is the value I placed into telem_entry::num_rmids.
>
> 3) The number of "h/w counters" (this isn't a strictly accurate
> description of how things work, but serves as a useful analogy that
> does describe the limitations) feeding to those MMIO registers. This is
> enumerated in telemetry_region::num_rmids returned from the call to
> intel_pmt_get_regions_by_feature()
Thank you for explaining this. This was not clear to me.
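To check my understanding, here is a minimal sketch of the clamping I had in
mind, with made-up values (none of these numbers or variable names come from
the series):

```shell
# Illustrative only: the effective limit on monitor groups would be the
# minimum of the three enumerations above. All values here are made up.
cpuid_rmids=576     # (1) legacy enumeration from CPUID leaf 0xF
mmio_rmids=576      # (2) MMIO registers per event, from the XML files
hw_counters=512     # (3) "h/w counters" feeding those MMIO registers

limit=$cpuid_rmids
if [ "$mmio_rmids" -lt "$limit" ]; then limit=$mmio_rmids; fi
if [ "$hw_counters" -lt "$limit" ]; then limit=$hw_counters; fi
echo "$limit"       # most conservative count of usable monitor groups
```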
>
> If "1" is the smallest of these values, the OS will be limited in
> which values can be written to the IA32_PQR_ASSOC MSR. Existing
> code will do the right thing by limiting RMID allocation to this
> value.
>
> If "2" is greater than "1", then the extra MMIO registers will
> sit unused.
This is also an issue with this implementation, no? resctrl will not
allow creating more monitor groups than "1".
> If "2" is less than "1" my v3 returns the (problematic) -ENOENT
> This can't happen in the CPU that debuts this feature, but the check
> is there to prevent running past the end of the MMIO space in case
> this does occur some day. I'll fix the error path in the next version
> to make sure this ends up with "Unavailable".
This is a concern since it means the interface becomes a "try and see"
one for user space. As I understand from a later statement, the idea is that
"2" should be used by user space to know how many "mon_groups" directories
can be created while still getting telemetry support. To me this looks like
a space that will create a lot of confusion. The moment user space
creates "2" + 1 "mon_groups" directories it becomes a guessing game
of what any new monitor group actually supports. After crossing that
threshold I do not see a good way of going back, since if user space
removes one monitor group it does get back to "2", but then needs to
rely on resctrl internals or debugging to know for sure what a new
monitor group supports.
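As a hypothetical transcript of what I mean (group names and the event file
name are made up here; the series' actual file names may differ):

```shell
# Hypothetical session, illustrative only.
# Suppose "2" (MMIO registers per event) is 3 and four monitor groups exist:
# cat /sys/fs/resctrl/mon_groups/m4/mon_data/mon_PERF_PKG_00/<some_event>
# Unavailable
# User space cannot tell, without relying on resctrl internals or debugging,
# which of m1..m4 landed on an RMID the telemetry hardware actually backs.
```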
>
> If "3" is less than "2" then the system will attach "h/w counters" to
> MMIO registers in a "most recently used" algorithm. So if the number
> of active RMIDs in some time interval is less than "3" the user will
> get good values. But if the number of active RMIDs rises above "3"
> then the user will see "Unavailable" returns as "h/w counters" are
> reassigned to different RMIDs (making the feature really hard to use).
Could the next step be for the architecture to allow user space to
specify which hardware counters should be assigned? With a new user
interface being created for such a capability it may be worthwhile to
consider how it could be used/adapted for this feature. [1]
>
> In the debut CPU the "energy" feature has sufficient "energy" counters
> to avoid this. But not enough "perf" counters. I've pushed and the
> next CPU with the feature will have enough "h/w counters".
>
> My proposal for v4:
>
> Add new options to the "rdt=" kernel boot parameter for "energy"
> and "perf".
>
> Treat the case where there are not enough "h/w counters" as an erratum
> and do not enable the feature. User can override with "rdt=perf"
> if they want the counters for some special case where they limit
> the number of simultaneous active RMIDs.
This only seems to address the "3" is less than "2" issue. It is not
so obvious to me that it should be treated as an erratum, although
I could not tell from your description how visible this issue will be
to user space. For example, is it clear that if user space
gets *any* value then it is "good", and "Unavailable" means ... "Unavailable",
or could a returned value mean "this is partial data that was collected
during a timeframe in which the hardware counter was re-assigned at some point"?
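For reference, the proposed boot options from above would presumably be used
like this (a sketch based only on the names in this mail; the final syntax
depends on the v4 patches):

```shell
# Kernel command line fragments, illustrative only:
#   rdt=perf            opt in to "perf" telemetry despite too few h/w counters
#   rdt=!energy,!perf   hide both features' files from the mon_data directories
```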
>
> User can use "rdt=!energy,!perf" if they don't want to see the
> clutter of all the new files in each mon_data directory.
>
> I'll maybe look at moving resctrl_mon_resource_init() to rdt_get_tree()
> and add a "take min of all RMID limits". But since this is a "can't
> happen" scenario I may skip this if it starts to get complicated.
I do not think that the "2" is less than "1" scenario should be
ignored, for reasons stated above and in the review of this version.
What if we enhance resctrl's RMID assignment (setting aside
PMG assignment for a moment) to be directed by user space?
Below is an idea for an interface that can give user space
control over what monitor groups are monitoring. This is very likely not
the ideal interface but I would like to present it as a start towards
better ideas.
For example, monitor groups are by default created with the most abundant
(and thus supporting the fewest features on the fewest resources) RMID.
The user is then presented with a new file (within each monitor group)
that lists all available features and which one(s) are active. Consider
a hypothetical example where PERF_PKG perf has x RMIDs, PERF_PKG energy
has y RMIDs, and L3_MON has z RMIDs, with x < y < z. By default, when user
space creates a monitor group, resctrl will pick an "abundant" RMID from the
range y + 1 to z that only supports L3 monitoring:
# cat /sys/fs/resctrl/mon_groups/m1/new_file_mon_resource_and_features
[L3]
PERF_PKG:energy
PERF_PKG:perf
In the above case there will be *no* mon_PERF_PKG_XX directories in
/sys/fs/resctrl/mon_groups/m1/mon_data.
*If* user space wants perf/energy telemetry for this monitor
group then they can enable the needed feature(s) with the clear understanding
that it is disruptive to all ongoing monitoring since a new RMID will be
assigned. For example, if the user wants PERF_PKG:energy and PERF_PKG:perf
then they can do so with:
# echo PERF_PKG:perf > /sys/fs/resctrl/mon_groups/m1/new_file_mon_resource_and_features
# cat /sys/fs/resctrl/mon_groups/m1/new_file_mon_resource_and_features
[L3]
[PERF_PKG:energy]
[PERF_PKG:perf]
After the above, all the energy and perf files will appear in new
mon_PERF_PKG_XX directories.
User space can then have full control over what is monitored by which monitor
group. If no RMIDs are available in a particular pool then user space gets
an "out of space" error and is the one to decide how it should be managed.
This could also be a way in which the "2" is larger than "1" scenario
can be addressed.
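The error path in this proposal could then look like the following
(hypothetical transcript; the exact errno is open for discussion):

```shell
# Hypothetical session, illustrative only: all PERF_PKG RMIDs are in use.
# echo PERF_PKG:perf > /sys/fs/resctrl/mon_groups/m1/new_file_mon_resource_and_features
# sh: echo: write error: No space left on device
```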
> Which leaves what should be in info/PERF_PKG_MON/num_rmids? It's
> possible that some CPU implementation will have different MMIO
> register counts for "perf" and "energy". It's more than possible
> that number of "h/w counters" will be different. But they share the
> same info file. My v3 code reports the minimum of the number
> of "h/w counters" which is the most conservative option. It tells
> the user not to make more mon_data directories than this if they
> want usable counters across *both* perf and energy. Though they
> will actually keep getting good "energy" results even if they
> go past this limit.
num_rmids is a source of complications. It does not have a good equivalent
for MPAM, and there have been a few attempts at proposing alternatives that
may be worth keeping in mind while making changes here:
https://lore.kernel.org/all/cbe665c2-fe83-e446-1696-7115c0f9fd76@arm.com/
https://lore.kernel.org/lkml/46767ca7-1f1b-48e8-8ce6-be4b00d129f9@intel.com/
>
> -Tony
>
Reinette
[1] https://lore.kernel.org/lkml/cover.1743725907.git.babu.moger@amd.com/
ps. I needed to go back and re-read the original cover letter a couple of
times; while doing so I noticed one typo in the Background section: OOMMSM -> OOBMSM.