Message-ID: <679dcd01-05e5-476a-91dd-6d1d08637b3e@intel.com>
Date: Wed, 11 Feb 2026 14:22:55 -0800
From: Reinette Chatre <reinette.chatre@...el.com>
To: Ben Horgan <ben.horgan@....com>
CC: "Moger, Babu" <bmoger@....com>, "Moger, Babu" <Babu.Moger@....com>, "Luck,
Tony" <tony.luck@...el.com>, Drew Fustini <fustini@...nel.org>,
"corbet@....net" <corbet@....net>, "Dave.Martin@....com"
<Dave.Martin@....com>, "james.morse@....com" <james.morse@....com>,
"tglx@...nel.org" <tglx@...nel.org>, "mingo@...hat.com" <mingo@...hat.com>,
"bp@...en8.de" <bp@...en8.de>, "dave.hansen@...ux.intel.com"
<dave.hansen@...ux.intel.com>, "x86@...nel.org" <x86@...nel.org>,
"hpa@...or.com" <hpa@...or.com>, "peterz@...radead.org"
<peterz@...radead.org>, "juri.lelli@...hat.com" <juri.lelli@...hat.com>,
"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
"dietmar.eggemann@....com" <dietmar.eggemann@....com>, "rostedt@...dmis.org"
<rostedt@...dmis.org>, "bsegall@...gle.com" <bsegall@...gle.com>,
"mgorman@...e.de" <mgorman@...e.de>, "vschneid@...hat.com"
<vschneid@...hat.com>, "akpm@...ux-foundation.org"
<akpm@...ux-foundation.org>, "pawan.kumar.gupta@...ux.intel.com"
<pawan.kumar.gupta@...ux.intel.com>, "pmladek@...e.com" <pmladek@...e.com>,
"feng.tang@...ux.alibaba.com" <feng.tang@...ux.alibaba.com>,
"kees@...nel.org" <kees@...nel.org>, "arnd@...db.de" <arnd@...db.de>,
"fvdl@...gle.com" <fvdl@...gle.com>, "lirongqing@...du.com"
<lirongqing@...du.com>, "bhelgaas@...gle.com" <bhelgaas@...gle.com>,
"seanjc@...gle.com" <seanjc@...gle.com>, "xin@...or.com" <xin@...or.com>,
"Shukla, Manali" <Manali.Shukla@....com>, "dapeng1.mi@...ux.intel.com"
<dapeng1.mi@...ux.intel.com>, "chang.seok.bae@...el.com"
<chang.seok.bae@...el.com>, "Limonciello, Mario" <Mario.Limonciello@....com>,
"naveen@...nel.org" <naveen@...nel.org>, "elena.reshetova@...el.com"
<elena.reshetova@...el.com>, "Lendacky, Thomas" <Thomas.Lendacky@....com>,
"linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "peternewman@...gle.com"
<peternewman@...gle.com>, "eranian@...gle.com" <eranian@...gle.com>, "Shenoy,
Gautham Ranjal" <gautham.shenoy@....com>
Subject: Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and
context switch handling
Hi Ben,
On 2/11/26 8:40 AM, Ben Horgan wrote:
> On Tue, Feb 10, 2026 at 10:04:48AM -0800, Reinette Chatre wrote:
>> On 2/10/26 8:17 AM, Reinette Chatre wrote:
>>> On 1/28/26 9:44 AM, Moger, Babu wrote:
>>>>
>>>>
>>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
>>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>>>>> Babu,
>>>>>>
>>>>>> I've read a bit more of the code now and I think I understand more.
>>>>>>
>>>>>> Some useful additions to your explanation.
>>>>>>
>>>>>> 1) Only one CTRL group can be marked as PLZA
>>>>>
>>>>> Yes. Correct.
>>>
>>> Why limit it to one CTRL_MON group and why not support it for MON groups?
>>>
>>> Limiting it to a single CTRL group seems restrictive in a few ways:
>>> 1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
>>> number of use cases that can be supported. Consider, for example, an existing
>>> "high priority" resource group and a "low priority" resource group. The user may
>>> just want to let the tasks in the "low priority" resource group run as "high priority"
>>> when in CPL0. This of course may depend on what resources are allocated, for example
>>> cache may need more care, but if, for example, user is only interested in memory
>>> bandwidth allocation this seems a reasonable use case?
>>> 2) Similar to what Tony [1] mentioned this does not enable what the hardware is
>>> capable of in terms of number of different control groups/CLOSID that can be
>>> assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
>>> 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
>>> MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
>>> example, create a resource group that contains tasks of interest and create
>>> a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
>>> This will give user space better insight into system behavior and from what I can
>>> tell is supported by the feature but not enabled?
>>>
>>>>>
>>>>>> 2) It can't be the root/default group
>>>>>
>>>>> This is something I added to keep the default group in a un-disturbed,
>>>
>>> Why was this needed?
>>>
>>>>>
>>>>>> 3) It can't have sub monitor groups
>>>
>>> Why not?
>>>
>>>>>> 4) It can't be pseudo-locked
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>> Would a potential use case involve putting *all* tasks into the PLZA group? That
>>>>>> would avoid any additional context switch overhead as the PLZA MSR would never
>>>>>> need to change.
>>>>>
>>>>> Yes. That can be one use case.
>>>>>
>>>>>>
>>>>>> If that is the case, maybe for the PLZA group we should allow user to
>>>>>> do:
>>>>>>
>>>>>> # echo '*' > tasks
>>>
>>> Dedicating a resource group to "PLZA" seems restrictive while also adding many
>>> complications since this designation makes resource group behave differently and
>>> thus the files need to get extra "treatments" to handle this "PLZA" designation.
>>>
>>> I am wondering if it will not be simpler to introduce just one new file, for example
>>> "tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
>>> file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
>>> task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
>>> resource group to manage user space and kernel space allocations while also supporting
>>> various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
>>> use case where user space can create a new resource group with certain allocations but the
>>> "tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
>>> the resource group's allocations when in CPL0.
>
> If there is a "tasks_cpl0" then I'd expect a "cpus_cpl0" too.
That is reasonable, yes.
>> It looks like MPAM has a few more capabilities here and the Arm levels are numbered differently
>> with EL0 meaning user space. We should thus aim to keep things as generic as possible. For example,
>> instead of CPL0 using something like "kernel" or ... ?
>
> Yes, PLZA does open up more possibilities for MPAM usage. I've talked to James
> internally and here are a few thoughts.
>
> If the use case is just an option to run all tasks with the same closid/rmid
> (partid/pmg) configuration when they are running in the kernel then I'd favour a
> mount option. The resctrl filesystem interface doesn't need to change and
I view mount options as an interface of last resort. Why would a mount option be needed
in this case? The existence of the file used to configure the feature seems sufficient?
Also ...
I do not think resctrl should unnecessarily place constraints on what the hardware
features are capable of. As I understand, both PLZA and MPAM support use cases where
tasks may use different CLOSID/RMID (PARTID/PMG) when running in the kernel. Limiting
this to only one CLOSID/PARTID seems like an unmotivated constraint to me at the moment.
This may be because I am not familiar with all the requirements here so please do
help with insight on how the hardware feature is intended to be used as it relates
to its design.
We have to be very careful when constraining a feature this much. If resctrl does something
like this it essentially restricts what users could do forever.
> userspace software doesn't need to change. This could either take away a
> closid/rmid from userspace and dedicate it to the kernel or perhaps have a
> policy to have the default group as the kernel group. If you use the default
Similar to above, I do not see PLZA or MPAM preventing sharing of CLOSID/RMID (PARTID/PMG)
between user space and kernel. I do not see a motivation for resctrl to place such a
constraint.
> configuration, at least for MPAM, the kernel may not be running at the highest
> priority as a minimum bandwidth can be used to give a priority boost. (Once we
> have a resctrl schema for this.)
>
> It could be useful to have something a bit more featureful though. Is there a
> need for the two mappings, task->cpl0 config and task->cpl1 config, to be independent or
> would a task->(cpl0 config, cpl1 config) mapping be sufficient? It seems awkward that
> it's not a single write to move a task. If a single mapping is sufficient, then
Moving a task in x86 currently takes two writes since the CLOSID and RMID are written
separately. I think the MPAM approach is better and there may be an opportunity to do this in
a similar way, with both architectures using the same field(s) in the task_struct.
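
As a rough illustration of the "single write" idea only: below is a user space C sketch
that packs both IDs into one 64-bit field so that a move is one store. This is not kernel
code and not how either architecture implements it today; struct task_ids, pack_ids(),
task_set_ids() and task_get_ids() are names made up for this sketch, and the 32/32 bit
split is an assumption.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical per-task field holding both IDs:
 * CLOSID in the upper 32 bits, RMID in the lower 32 bits.
 */
struct task_ids {
	_Atomic uint64_t closid_rmid;
};

static inline uint64_t pack_ids(uint32_t closid, uint32_t rmid)
{
	return ((uint64_t)closid << 32) | rmid;
}

/*
 * Move a task with a single store so that a reader never sees a new
 * CLOSID paired with an old RMID (or the reverse).
 */
static inline void task_set_ids(struct task_ids *t, uint32_t closid, uint32_t rmid)
{
	atomic_store_explicit(&t->closid_rmid, pack_ids(closid, rmid),
			      memory_order_relaxed);
}

/* Read both IDs from one load, as a context switch path might. */
static inline void task_get_ids(struct task_ids *t, uint32_t *closid, uint32_t *rmid)
{
	uint64_t v = atomic_load_explicit(&t->closid_rmid, memory_order_relaxed);

	*closid = (uint32_t)(v >> 32);
	*rmid = (uint32_t)v;
}
```

With separate fields a reader can observe the task between the two writes; with one
packed field it always sees a consistent pair.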
> a single new file, kernel_group, per CTRL_MON group (maybe MON groups) as
> suggested above but rather than a task that file could hold a path to the
> CTRL_MON/MON group that provides the kernel configuration for tasks running in
> that group. So that this can be transparent to existing software an empty string
Something like this would force all tasks of a group to run with the same CLOSID/RMID
(PARTID/PMG) when in kernel space. This seems to restrict what the hardware supports
and may reduce the possible use cases of this feature.
For example,
- There may be a scenario where there is a set of tasks with a particular allocation
when running in user space but when in kernel these tasks benefit from different
allocations. Consider for example below arrangement where tasks 1, 2, and 3 run in
user space with allocations from resource_groupA. While these tasks are ok with this
allocation when in user space they have different requirements when it comes to
kernel space. There may be a resource_groupB that allocates a lot of resources ("high
priority") that task 1 should use for kernel work and a resource_groupC that allocates
fewer resources that tasks 2 and 3 should use for kernel work ("medium priority").
	resource_groupA:
		schemata: <average allocations that work for tasks 1, 2, and 3 when in user space>
		tasks when in user space: 1, 2, 3

	resource_groupB:
		schemata: <high priority allocations>
		tasks when in kernel space: 1

	resource_groupC:
		schemata: <medium priority allocations>
		tasks when in kernel space: 2, 3

If tasks that share user space allocations are forced to also share kernel allocations
then user space must create additional resource groups, using up CLOSID/PARTID, which
are a scarce resource.
- There may be a scenario where the user is attempting to understand system behavior by
monitoring individual or subsets of tasks' bandwidth usage when in kernel space.
- From what I can tell PLZA also supports *different* allocations when in user vs
kernel space while using the *same* monitoring group for both. This does not seem
transferable to MPAM and would take more effort to support in resctrl but it is
a use case that the hardware enables.
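
To put a number on the first scenario above, here is a small C sketch that counts how
many resource groups the task 1, 2, 3 arrangement needs under each model. It is only a
counting model under stated assumptions, not resctrl code: the enum, struct task_map,
example_tasks and both helper functions are hypothetical names, and each group is
assumed to consume one CLOSID/PARTID.

```c
#include <stdio.h>

/* Hypothetical model: one entry per distinct set of allocations (schemata). */
enum group { GROUP_A, GROUP_B, GROUP_C, NUM_GROUPS };

struct task_map {
	int pid;
	enum group user_group;   /* allocations used in user space */
	enum group kernel_group; /* allocations used in kernel space */
};

/* The arrangement from the example: all in A for user space, B or C in kernel. */
static const struct task_map example_tasks[] = {
	{ 1, GROUP_A, GROUP_B },
	{ 2, GROUP_A, GROUP_C },
	{ 3, GROUP_A, GROUP_C },
};

/* Groups needed when each task picks its kernel group independently. */
static int groups_needed_independent(const struct task_map *t, int n)
{
	int used[NUM_GROUPS] = { 0 };
	int count = 0;

	for (int i = 0; i < n; i++) {
		used[t[i].user_group] = 1;
		used[t[i].kernel_group] = 1;
	}
	for (int g = 0; g < NUM_GROUPS; g++)
		count += used[g];
	return count;
}

/*
 * Groups needed when all tasks of a group must share one kernel group:
 * every distinct (user, kernel) combination needs its own group, plus
 * the kernel-only groups that no task uses in user space.
 */
static int groups_needed_shared(const struct task_map *t, int n)
{
	int pair[NUM_GROUPS][NUM_GROUPS] = { { 0 } };
	int is_user[NUM_GROUPS] = { 0 };
	int is_kernel[NUM_GROUPS] = { 0 };
	int count = 0;

	for (int i = 0; i < n; i++) {
		pair[t[i].user_group][t[i].kernel_group] = 1;
		is_user[t[i].user_group] = 1;
		is_kernel[t[i].kernel_group] = 1;
	}
	for (int u = 0; u < NUM_GROUPS; u++)
		for (int k = 0; k < NUM_GROUPS; k++)
			count += pair[u][k];
	for (int g = 0; g < NUM_GROUPS; g++)
		if (is_kernel[g] && !is_user[g])
			count++;
	return count;
}
```

For example_tasks the independent model needs groups A, B and C, while the shared model
has to split A into one copy pointing at B and one copy pointing at C, so it needs one
more group for the same behavior.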
When enabling a feature I would of course prefer not to add unnecessary complexity. Even so,
resctrl is expected to expose hardware capabilities to user space. There seem to be some
opinions on how user space will now and forever interact with these features that
are not clear to me so I would appreciate more insight in why these constraints are
appropriate.
Reinette
> can mean use the current group's when in the kernel (as well as for
> userspace). A slash, /, could be used to refer to the default group. This would
> give something like the below under /sys/fs/resctrl.
>
> .
> ├── cpus
> ├── tasks
> ├── ctrl1
> │   ├── cpus
> │   ├── kernel_group -> mon_groups/mon1
> │   └── tasks
> ├── kernel_group -> ctrl1
> └── mon_groups
>     └── mon1
>         ├── cpus
>         ├── kernel_group -> ctrl1
>         └── tasks
>
>>
>> I have not read anything about the RISC-V side of this yet.
>>
>> Reinette
>>
>>>
>>> Reinette
>>>
>>> [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
>>
>
> Thanks,
>
> Ben