Message-ID: <dd5ba9e5-9809-4792-966a-e35368ab89f0@intel.com>
Date: Thu, 16 Oct 2025 09:31:45 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Dave Martin <Dave.Martin@....com>
CC: "Luck, Tony" <tony.luck@...el.com>, <linux-kernel@...r.kernel.org>, "James
Morse" <james.morse@....com>, Thomas Gleixner <tglx@...utronix.de>, "Ingo
Molnar" <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, Dave Hansen
<dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>, "Jonathan
Corbet" <corbet@....net>, <x86@...nel.org>, <linux-doc@...r.kernel.org>
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be
per-arch
Hi Dave,
On 10/15/25 8:47 AM, Dave Martin wrote:
> Hi Reinette,
>
> On Tue, Oct 14, 2025 at 03:55:40PM -0700, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 10/13/25 7:36 AM, Dave Martin wrote:
>>> Hi Reinette,
>>>
>>> On Fri, Oct 10, 2025 at 09:48:21AM -0700, Reinette Chatre wrote:
>>>> Hi Dave,
>>>>
>>>> On 9/30/25 8:40 AM, Dave Martin wrote:
>>>>> On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote:
>>>>>> On 9/29/25 6:56 AM, Dave Martin wrote:
>>>
>>> [...]
>>>
>>>>>> 1) Commented schema are "inactive"
>>>>>> This is unclear to me. In the MB example the commented lines show the
>>>>>> finer grained controls. Since the original MB resource is an approximation
>>>>>> and the hardware must already be configured to support it, would the #-prefixed
>>>>>> lines not show the actual "active" configuration?
>>>>>
>>>>> They would show the active configuration (possibly more precisely than
>>>>> "MB" does).
>>>>
>>>> That is how I see it also. This is specific to MB as we try to maintain
>>>> backward compatibility.
>>>>
>>>> If we are going to make user interface changes to resource allocation then
>>>> ideally it should consider all known future usage. I am trying to navigate
>>>> and understand the discussion on how resctrl can support MPAM and these
>>>> RDT region-aware requirements.
>>>>
>>>> I scanned the MPAM spec and from what I understand a resource may support
>>>> multiple controls at the same time, each with its own properties, and then
>>>> there was this:
>>>>
>>>> When multiple partitioning controls are active, each affects the partition’s
>>>> bandwidth usage. However, some combinations of controls may not make sense,
>>>> because the regulation of that pair of controls cannot be made to work in concert.
>>>>
>>>> resctrl may thus present an "active configuration" that is not a configuration
>>>> that "makes sense" ... this may be ok as resctrl would present what hardware
>>>> supports combined with what user requested.
>>>
>>> This is analogous to what the MPAM spec says, though if resctrl offers
>>> two different schemata for the same hardware control, the control cannot be
>>> configured with both values simultaneously.
>>>
>>> For the MPAM hardware controls affecting the same hardware resource,
>>> they can be programmed to combinations of values that have no sensible
>>> interpretation, and the values can be read back just fine. The
>>> performance effects may not be what the user expected / wanted, but
>>> this is not directly visible to resctrl.
>>>
>>> So, if we offer independent schemata for MBW_MIN and MBW_MAX, the user
>>> can program MBW_MIN=75% and MBW_MAX=25% for the same PARTID, and that
>>> will read back just as programmed. The architecture does not promise
>>> what the performance effect of this will be, but resctrl does not need
>>> to care.
>>
>> The same appears to be true for Intel RDT where the spec warns ("Undesirable
>> and undefined performance effects may result if cap programming guidelines
>> are not followed.") but does not seem to prevent such configurations.
>
> Right. We _could_ block such a configuration from reaching the hardware,
> if the arch backend overrides the MIN limit when the MAX limit is
> written and vice-versa, when not doing so would result in crossed-over
> bounds.
>
> If software wants to program both bounds, then that would be fine; in:
>
> # cat <<-EOF >/sys/fs/resctrl/schemata
> MB_MAX: 0=128
> EOF
>
> # cat <<-EOF >/sys/fs/resctrl/schemata
> MB_MIN: 0=256
> MB_MAX: 0=1024
> EOF
>
> ... internally programming some value >=256 before programming the
> hardware with the new min bound would not stop the final requested
> change to MB_MAX from working as userspace expected.
>
> (There will be inevitable regulation glitches unless the hardware
> provides a way to program both bounds atomically. MPAM doesn't; I
> don't think RDT does either?)
>
>
> But we only _need_ to do this if the hardware architecture forbids
> programming cross bounds or says that it is unsafe to do so. So, I am
> thinking that the generic code doesn't need to handle this.
>
> [...]
Sounds reasonable to me.
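
For reference, the override behaviour described above can be modelled in a few lines. This is only a sketch of the idea, not code from the MPAM driver or from any posted patch: the Bounds class and the write_min()/write_max() helpers are hypothetical names, and the starting minimum of 0 is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Bounds:
    """Hypothetical per-domain state; the names are illustrative, not kernel API."""
    mb_min: int
    mb_max: int


def write_min(b: Bounds, value: int) -> None:
    # Pull the max up first if the new min would otherwise cross it.
    if value > b.mb_max:
        b.mb_max = value
    b.mb_min = value


def write_max(b: Bounds, value: int) -> None:
    # Pull the min down first if the new max would otherwise cross it.
    if value < b.mb_min:
        b.mb_min = value
    b.mb_max = value


b = Bounds(mb_min=0, mb_max=128)  # assumed state after "MB_MAX: 0=128"
write_min(b, 256)                 # max is temporarily raised to 256
write_max(b, 1024)                # final state matches what userspace requested
```

The intermediate max of 256 is never visible in the final state, which is why the second write of the example behaves as userspace expects.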
...
>>>>> MB: 0=50, 1=50
>>>>> # MB_HW: 0=32, 1=32
>>>>> # MB_MIN: 0=16, 1=16
>>>>> # MB_MAX: 0=32, 1=32
>>>>
>>>> Could/should resctrl uncomment the lines after userspace modified them?
>>>
>>> The '#' wasn't meant to be a state that gets turned on and off.
>>
>> Thank you for clarifying.
>>
>>> Rather, userspace would use this to indicate which entries are
>>> intentionally being modified.
>>
>> I see. I assume that we should not see many of these '#' entries and expect
>> the ones we do see to shadow the legacy schemata entries. New schemata entries
>> (that do not shadow legacy ones) should not have the '#' prefix even if
>> their initial support does not include all controls.
>>> So long as the entries affecting a single resource are ordered so that
>>> each entry is strictly more specific than the previous entries (as
>>> illustrated above), then reading schemata and stripping all the hashes
>>> would allow a previous configuration to be restored; to change just one
>>> entry, userspace can uncomment just that one, or write only that entry
>>> (which is what I think we should recommend for new software).
>>
>> This is a good rule of thumb.
>
> To avoid printing entries in the wrong order, do we want to track some
> parent/child relationship between schemata?
>
> In the above example,
>
> * MB is the parent of MB_HW;
>
> * MB_HW is the parent of MB_MIN and MB_MAX.
>
> (for MPAM, at least).
Could you please elaborate on this relationship? I envisioned the MB_HW to be
something similar to Intel RDT's "optimal" bandwidth setting ... something
that is expected to be somewhere between the "min" and the "max".
But, now I think I'm a bit lost in MPAM since it is not clear to me what
MB_HW represents ... would this be the "memory bandwidth portion
partitioning"? Although, that uses a completely different format from
"min" and "max".
>
> When schemata is read, parents should always be printed before their
> child schemata. But really, we just need to make sure that the
> rdt_schema_all list is correctly ordered.
>
>
> Do you think that this relationship needs to be reported to userspace?
You brought up the topic of relationships in
https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/ that prompted me
to learn more from the MPAM spec, where I learned about all the other possible
namespaces and went on a tangent without circling back.
I was hoping that the namespace prefix would make the relationships clear,
something like <resource>_<control>, but I did not expect another layer in
the hierarchy like your example above. The idea of "parent" and "child" is
also not obvious to me at this point. resctrl gives us a "resource" to start
with and we are now discussing multiple controls per resource. Could you please
elaborate on what you see as "parent" and "child"?
We do have the info directory available to express relationships and a
hierarchy is already starting to take shape there.
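
To make the ordering question concrete, here is a small sketch that takes the relationship exactly as stated in your example, without settling whether that is the right model. The PARENT table and the depth()/print_order() helpers are hypothetical and do not correspond to existing resctrl code.

```python
# Parent of each schema as given in the example; None marks a top-level entry.
PARENT = {"MB": None, "MB_HW": "MB", "MB_MIN": "MB_HW", "MB_MAX": "MB_HW"}


def depth(name: str) -> int:
    """Count the ancestors between this schema and its top-level entry."""
    d = 0
    while PARENT[name] is not None:
        name = PARENT[name]
        d += 1
    return d


def print_order(names):
    # A parent always has a smaller depth than its children, so sorting by
    # depth prints parents first. sorted() is stable, so siblings keep their
    # relative input order.
    return sorted(names, key=depth)


order = print_order(["MB_MAX", "MB_HW", "MB_MIN", "MB"])
print(order)
```

If the list of schemata is kept in this order to begin with, nothing extra is needed when the schemata file is read.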
>
> Since the "#" convention is for backward compatibility, maybe we should
> not use this for new schemata, and place the burden of managing
> conflicts onto userspace going forward. What do you think?
I agree. The way I understand this is that the '#' will only be used for
new controls that shadow the default/current controls of the legacy resources.
I do not expect that the prefix will be needed for new resources, even if
the initial support of a new resource does not include all possible controls.
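
The restore procedure discussed earlier (read schemata, strip all the hashes, write the result back) could look roughly like this from userspace. It is a sketch only: strip_hashes() is a hypothetical helper, and the entry syntax simply follows the examples in this thread.

```python
def strip_hashes(schemata_text: str) -> str:
    """Uncomment every '#'-prefixed entry so the text can be written back."""
    out = []
    for line in schemata_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            stripped = stripped[1:].lstrip()
        if stripped:
            out.append(stripped)
    return "\n".join(out) + "\n"


saved = """\
MB: 0=50, 1=50
# MB_HW: 0=32, 1=32
# MB_MIN: 0=16, 1=16
# MB_MAX: 0=32, 1=32
"""
restored = strip_hashes(saved)
print(restored, end="")
```

Because each entry is more specific than the ones before it, writing the stripped text back in the same order lets the later entries refine whatever the earlier, coarser ones programmed.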
>>>>> (For hardware-specific reasons, the MPAM driver currently internally
>>>>> programs the MIN bound to be a bit less than the MAX bound, when
>>>>> userspace writes an "MB" entry into schemata. The key thing is that
>>>>> writing MB may cause the MB_MIN/MB_MAX entries to change -- at the
>>>>> resctrl level, I don't think that we necessarily need to make promises
>>>>> about what they can change _to_. The exact effect of MIN and MAX
>>>>> bounds is likely to be hardware-dependent anyway.)
>>>>
>>>> MPAM has the "HARDLIM" distinction associated with these MAX values
>>>> and from what I can tell this is per PARTID. Is this something that needs
>>>> to be supported? To do this resctrl will need to support modifying
>>>> control properties per resource group.
>>>
>>> Possibly. Since this is a boolean control that determines how the
>>> MBW_MAX control is applied, we could perhaps present it as an
>>> additional schema -- if so, it's basically orthogonal.
>>>
>>> | MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...]
>>>
>>> or
>>>
>>> | MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...]
>>>
>>> Does this look reasonable?
>>
>> It does.
>
> OK -- note, I don't think we have any immediate plan to support this in
> the MPAM driver, but it may land eventually in some form.
>
ack.
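
Either spelling of the proposed MB_HARDMAX entry is easy to parse. A minimal sketch, assuming the entry syntax shown in the two examples above; parse_hardmax() is a hypothetical helper and nothing here reflects an implemented interface.

```python
def parse_hardmax(entry: str) -> dict:
    """Parse 'MB_HARDMAX: 0=off, 1=on' or 'MB_HARDMAX: 0=0, 1=1' per domain."""
    name, _, body = entry.partition(":")
    if name.strip() != "MB_HARDMAX":
        raise ValueError("not an MB_HARDMAX entry: %r" % entry)
    # Accept both proposed spellings of the boolean value.
    truth = {"0": False, "off": False, "1": True, "on": True}
    result = {}
    for item in body.split(","):
        dom, _, val = item.partition("=")
        result[int(dom)] = truth[val.strip()]
    return result


numeric = parse_hardmax("MB_HARDMAX: 0=0, 1=1, 2=1, 3=0")
named = parse_hardmax("MB_HARDMAX: 0=off, 1=on, 2=on, 3=off")
print(numeric == named)
```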
...
>>>> MPAM allows per-PARTID configurations for secure/non-secure, physical/virtual,
>>>> ... ? Is it expected that MPAM's support of these should be exposed via resctrl?
>>>
>>> Probably not. These are best regarded as entirely separate instances
>>> of MPAM; the PARTID spaces are separate. The Non-secure physical
>>> address space is the only physical address space directly accessible to
>>> Linux -- for the others, we can't address the MMIO registers anyway.
>>>
>>> For now, the other address spaces are the firmware's problem.
>>
>> Thank you.
>
> No worries -- it's not too obvious from the spec!
>
>>>> Have you considered how to express if user wants hardware to have different
>>>> allocations for, for example, same PARTID at different execution levels?
>>>>
>>>> Reinette
>>>>
>>>> [1] https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/
>>>
>>> MPAM doesn't allow different controls for a PARTID depending on the
>>> exception level, but it is possible to program different PARTIDs for
>>> hypervisor/kernel and userspace (i.e., EL2/EL1 and EL0).
>>
>> I misunderstood this from the spec. Thank you for clarifying.
>>
>>>
>>> I think that if we wanted to go down that road, we would want to expose
>>> additional "task IDs" in resctrlfs that can be placed into groups
>>> independently, say
>>>
>>> echo 14161:kernel >>.../some_group/tasks
>>> echo 14161:user >>.../other_group/tasks
>>>
>>> However, inside the kernel, the boundary between work done on behalf of
>>> a specific userspace task, work done on behalf of userspace in general,
>>> and autonomous work inside the kernel is fuzzy and not well defined.
>>>
>>> For this reason, we currently only configure the PARTID for EL0. For
>>> EL1 (and EL2 if the kernel uses it), we just use the default PARTID (0).
>>>
>>> Hopefully this is orthogonal to the discussion of schema descriptions,
>>> though ...?
>>
>> Yes.
>
> OK; I suggest that we put this on one side, for now, then.
>
> There is a discussion to be had on this, but it feels like a separate
> thing.
agreed.
>
>
> I'll try to pull the state of this discussion together -- maybe as a
> draft update to the documentation, describing the interface as proposed
> so far. Does that work for you?
It does. Thank you very much for taking this on.
Reinette