Message-ID: <e788ca62-ec63-4552-978b-9569f369afd5@intel.com>
Date: Fri, 17 Oct 2025 08:59:45 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Dave Martin <Dave.Martin@....com>
CC: "Luck, Tony" <tony.luck@...el.com>, <linux-kernel@...r.kernel.org>, "James
Morse" <james.morse@....com>, Thomas Gleixner <tglx@...utronix.de>, "Ingo
Molnar" <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, Dave Hansen
<dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>, "Jonathan
Corbet" <corbet@....net>, <x86@...nel.org>, <linux-doc@...r.kernel.org>
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be
per-arch

Hi Dave,

On 10/17/25 7:17 AM, Dave Martin wrote:
> Hi Reinette,
>
> On Thu, Oct 16, 2025 at 09:31:45AM -0700, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 10/15/25 8:47 AM, Dave Martin wrote:
>>> Hi Reinette,
>>>
>>> On Tue, Oct 14, 2025 at 03:55:40PM -0700, Reinette Chatre wrote:
>>>> Hi Dave,
>>>>
>>>> On 10/13/25 7:36 AM, Dave Martin wrote:
...
>>>>> So long as the entries affecting a single resource are ordered so that
>>>>> each entry is strictly more specific than the previous entries (as
>>>>> illustrated above), then reading schemata and stripping all the hashes
>>>>> would allow a previous configuration to be restored; to change just one
>>>>> entry, userspace can uncomment just that one, or write only that entry
>>>>> (which is what I think we should recommend for new software).
>>>>
>>>> This is a good rule of thumb.
>>>
>>> To avoid printing entries in the wrong order, do we want to track some
>>> parent/child relationship between schemata?
>>>
>>> In the above example,
>>>
>>> * MB is the parent of MB_HW;
>>>
>>> * MB_HW is the parent of MB_MIN and MB_MAX.
>>>
>>> (for MPAM, at least).
>>
>> Could you please elaborate on this relationship? I envisioned the MB_HW to be
>> something similar to Intel RDT's "optimal" bandwidth setting ... something
>> that is expected to be somewhere between the "min" and the "max".
>>
>> But, now I think I'm a bit lost in MPAM since it is not clear to me what
>> MB_HW represents ... would this be the "memory bandwidth portion
>> partitioning"? Although, that uses a completely different format from
>> "min" and "max".
>
> I confess that I'm thinking with an MPAM mindset here.
>
> Some pseudocode might help to illustrate how these might interact:
>
> 	set_MB(partid, val) {
> 		set_MB_HW(partid, percent_to_hw_val(val));
> 	}
>
> 	set_MB_HW(partid, val) {
> 		set_MB_MAX(partid, val);
>
> 		/*
> 		 * Hysteresis to avoid steady flows from ping-ponging
> 		 * between low and high priority:
> 		 */
> 		if (hardware_has_MB_MIN())
> 			set_MB_MIN(partid, val * 95%);
> 	}
>
> 	set_MB_MIN(partid, val) {
> 		mpam->MBW_MIN[partid] = val;
> 	}
>
> 	set_MB_MAX(partid, val) {
> 		mpam->MBW_MAX[partid] = val;
> 	}
>
> with
>
> 	get_MB(partid) {
> 		return hw_val_to_percent(get_MB_HW(partid));
> 	}
>
> 	get_MB_HW(partid) { return get_MB_MAX(partid); }
>
> 	get_MB_MIN(partid) { return mpam->MBW_MIN[partid]; }
>
> 	get_MB_MAX(partid) { return mpam->MBW_MAX[partid]; }
>
>
> The parent/child relationship I suggested is basically the call-graph
> of this pseudocode. These could all be exposed as resctrl schemata,
> but the children provide finer / more broken-down control than the
> parents. Reading a parent provides a merged or approximated view of
> the configuration of the child schemata.
>
> In particular,
>
> 	set_child(partid, get_child(partid));
> 	get_parent(partid);
>
> yields the same result as
>
> 	get_parent(partid);
>
> but will not be true in general, if the roles of parent and child are
> reversed.
>
> I think this still holds true if implementing an "MB_HW" schema for
> newer revisions of RDT. The pseudocode would be different, but there
> will still be a tree-like call graph (?)

Thank you very much for the example. I missed in earlier examples that
MB_HW was being controlled via MB_MAX and MB_MIN.

I do not expect such a dependence or tree-like call graph for RDT where
the closest equivalent (termed "optimal") is programmed independently from
min and max.

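For readers who want to experiment with the MPAM call graph quoted above, here is a minimal runnable sketch of that pseudocode. It is only a model, not MPAM driver code: the class name Mpam, the 8-bit full-scale value HW_MAX and the truncating integer conversions are all assumptions made for illustration.

```python
# Hypothetical model of the quoted pseudocode; not actual MPAM driver code.
HW_MAX = 255  # assumed full-scale value of the hardware bandwidth fields


class Mpam:
    def __init__(self, has_mb_min=True):
        self.has_mb_min = has_mb_min
        self.mbw_min = {}  # partid -> MBW_MIN value
        self.mbw_max = {}  # partid -> MBW_MAX value

    # Leaf controls: write or read the modelled hardware fields directly.
    def set_mb_min(self, partid, val):
        self.mbw_min[partid] = val

    def set_mb_max(self, partid, val):
        self.mbw_max[partid] = val

    def get_mb_min(self, partid):
        return self.mbw_min[partid]

    def get_mb_max(self, partid):
        return self.mbw_max[partid]

    # MB_HW: parent of MB_MIN and MB_MAX.
    def set_mb_hw(self, partid, val):
        self.set_mb_max(partid, val)
        if self.has_mb_min:
            # 5% hysteresis below the maximum, as in the pseudocode.
            self.set_mb_min(partid, val * 95 // 100)

    def get_mb_hw(self, partid):
        return self.get_mb_max(partid)

    # MB: parent of MB_HW, expressed as a percentage.
    def set_mb(self, partid, percent):
        self.set_mb_hw(partid, percent * HW_MAX // 100)

    def get_mb(self, partid):
        return self.get_mb_hw(partid) * 100 // HW_MAX


m = Mpam()
m.set_mb(0, 50)
m.set_mb_min(0, 10)  # fine-tune a child directly

# Writing a child back to itself leaves the parent's view unchanged ...
hw_before = m.get_mb_hw(0)
m.set_mb_max(0, m.get_mb_max(0))
hw_after = m.get_mb_hw(0)

# ... but writing a parent back to itself overwrites the tuned child.
m.set_mb_hw(0, m.get_mb_hw(0))
min_after_parent_write = m.get_mb_min(0)
```

In this model hw_before equals hw_after, while min_after_parent_write is no longer the hand-tuned value of 10. That is the asymmetry between parent and child described in the quoted text.
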
>
>
> Going back to MPAM:
>
> Re MPAM memory bandwidth portion partitioning (a.k.a., MBW_PART or
> MBWPBM), this is a bitmap-type control, analogous to RDT CAT: memory
> bandwidth is split into discrete, non-overlapping chunks, and each
> PARTID is configured with a bitmap saying which chunks it can use.
> This could be done by time-slicing, or controlling which memory
> controllers/ports a PARTID can issue requests to, or something like
> that.
>
> If the MBW_MAX control isn't implemented, then the current MPAM driver
> maps this bitmap control onto the resctrl "MB" schema in a simple way,
> but we are considering dropping this, since the allocation model
> (explicit, static allocation of discrete resources) is not really the
> same as for RDT MBA (dynamic prioritisation based on recent resource
> consumption).
>
> Programming MBW_MAX=50% for four PARTIDs means that the PARTIDs contend
> on an equal footing for memory bandwidth until one exceeds 50% (when it
> will start to be penalised). Programming bitmaps can't have the same
> effect. For example, with { 1100, 0110, 0011, 1001 }, no group can use
> more than 50% of the full bandwidth, whatever happens. Worse, certain
> pairs of groups are fully isolated from each other, while others are
> always in contention, no matter how little actual traffic is generated.
> This is potentially useful, but it's not the same as the MIN/MAX model.
>
> So, it may make more sense to expose this as a separate, bitmap schema.
>
> (The same goes for "Proportional stride" partitioning. It's another,
> different, control for memory bandwidth. As of today, I don't think
> that we have a reference platform for experimenting with either of
> these.)
Thank you.
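
The bitmap example quoted above can be checked with a short sketch. The group labels A to D and the four-chunk split are assumptions for illustration only. The sketch treats each set bit as one equal share of bandwidth, which is a simplification of whatever the hardware actually does.

```python
from itertools import combinations

NUM_CHUNKS = 4  # assumed number of bandwidth portions in the bitmap

# Hypothetical group labels for the bitmaps { 1100, 0110, 0011, 1001 }.
bitmaps = {"A": 0b1100, "B": 0b0110, "C": 0b0011, "D": 0b1001}


def max_share(bitmap):
    """Upper bound on the fraction of bandwidth a group can ever use."""
    return bin(bitmap).count("1") / NUM_CHUNKS


def classify_pairs(groups):
    """Split group pairs into fully isolated and always contending."""
    isolated, contending = [], []
    for (a, bits_a), (b, bits_b) in combinations(groups.items(), 2):
        if bits_a & bits_b:
            contending.append((a, b))  # at least one shared chunk
        else:
            isolated.append((a, b))    # no chunk in common
    return isolated, contending


shares = {name: max_share(bits) for name, bits in bitmaps.items()}
isolated, contending = classify_pairs(bitmaps)
```

Here every entry in shares is 0.5, isolated holds the pairs (A, C) and (B, D), and the remaining four pairs land in contending regardless of how much traffic each group generates.
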
>
>
>>> When schemata is read, parents should always be printed before their
>>> child schemata. But really, we just need to make sure that the
>>> rdt_schema_all list is correctly ordered.
>>>
>>>
>>> Do you think that this relationship needs to be reported to userspace?
>>
>> You brought up the topic of relationships in
>> https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/ that prompted me
>> to learn more from the MPAM spec, where I learned about all the other
>> possible namespaces and went on a tangent without circling back.
>>
>> I was hoping that the namespace prefix would make the relationships clear,
>> something like <resource>_<control>, but I did not expect another layer in
>> the hierarchy like your example above. The idea of "parent" and "child" is
>> also not obvious to me at this point. resctrl gives us a "resource" to start
>> with and we are now discussing multiple controls per resource. Could you please
>> elaborate on what you see as "parent" and "child"?
>
> See above -- the parent/child concept is not an MPAM thing; apologies
> if I didn't make that clear.
>
>> We do have the info directory available to express relationships and a
>> hierarchy is already starting to take shape there.
>
> I'm wondering whether using a common prefix will be future-proof? It
> may not always be clear which part of a name counts as the common
> prefix.

Apologies for my cryptic response. I was actually musing that we already
discussed using the info directory to express relationships between
controls and resources and it does not seem a big leap to expand
this to express relationships between controls. Consider something
like below for MPAM:

info
└── MB
    └── resource_schemata
        └── MB
            └── MB_HW
                ├── MB_MAX
                └── MB_MIN

On RDT it may then look different:

info
└── MB
    └── resource_schemata
        └── MB
            ├── MB_HW
            ├── MB_MAX
            └── MB_MIN

Having the resource name as common prefix does seem consistent and makes
clear to user space which controls apply to a resource.

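
As a rough sketch of how user space might consume such a hierarchy, the two layouts above can be modelled as nested dictionaries. Nothing here reads the real info directory, whose layout is still only a proposal in this thread. The names MPAM_TREE, RDT_TREE, print_order and parent_of are hypothetical.

```python
# Nested dicts standing in for the proposed info/MB/resource_schemata trees.
MPAM_TREE = {"MB": {"MB_HW": {"MB_MAX": {}, "MB_MIN": {}}}}
RDT_TREE = {"MB": {"MB_HW": {}, "MB_MAX": {}, "MB_MIN": {}}}


def print_order(tree):
    """Pre-order walk: every parent is listed before its children."""
    order = []
    for name, children in tree.items():
        order.append(name)
        order.extend(print_order(children))
    return order


def parent_of(tree, parent=None):
    """Map each control name to its parent control (None for the root)."""
    result = {}
    for name, children in tree.items():
        result[name] = parent
        result.update(parent_of(children, name))
    return result


mpam_order = print_order(MPAM_TREE)
rdt_order = print_order(RDT_TREE)
mpam_parents = parent_of(MPAM_TREE)
rdt_parents = parent_of(RDT_TREE)
```

Both trees give the same print order (MB, MB_HW, MB_MAX, MB_MIN), yet parent_of reports MB_HW as the parent of MB_MIN for the MPAM layout and MB for the RDT layout. The order alone therefore cannot tell user space which relationship applies.
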
>
> There were already discussions about appending a number to a schema
> name in order to control different memory regions -- that's another
> prefix/suffix relationship, if so...
>
> We could handle all of this by documenting all the relationships
> explicitly. But I'm thinking that it could be easier for maintenance
> if the resctrl core code has explicit knowledge of the relationships.

Not just for resctrl itself but to make clear to user space which
controls impact others and which are independent.

> That said, using a common prefix is still a good idea. But maybe we
> shouldn't lean on it too heavily as a way of actually describing the
> relationships?

I do not think we can rely on the order in the schemata file though. For example,
I think MPAM's MB_HW is close enough to RDT's "optimal bandwidth" for RDT to
also use the MB_HW name (or maybe MPAM and RDT can both use MB_OPT?). In either
case the schemata may print something like below on both platforms (copied from
your original example) where for MPAM it implies a relationship but for RDT it
does not:

	MB: 0=50, 1=50
	# MB_HW: 0=32, 1=32
	# MB_MIN: 0=31, 1=31
	# MB_MAX: 0=32, 1=32

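
To make the '#' convention concrete, below is a minimal sketch of how user space could parse the excerpt above, strip the hashes to rebuild a full configuration, or pick out a single entry to write. It assumes the comma-separated syntax exactly as printed in this example, which is a proposal and may differ from what resctrl ends up accepting. All function names are hypothetical.

```python
SCHEMATA = """\
MB: 0=50, 1=50
# MB_HW: 0=32, 1=32
# MB_MIN: 0=31, 1=31
# MB_MAX: 0=32, 1=32
"""


def parse_schemata(text):
    """Return a list of (name, is_hashed, {domain: value}) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        hashed = line.startswith("#")
        name, _, values = line.lstrip("# ").partition(":")
        domains = {}
        for item in values.split(","):
            dom, _, val = item.strip().partition("=")
            domains[int(dom)] = int(val)
        entries.append((name.strip(), hashed, domains))
    return entries


def format_entry(name, domains):
    return name + ": " + ", ".join(f"{d}={v}" for d, v in domains.items())


def restore_text(entries):
    """All entries with the hashes stripped, in their original order."""
    return "\n".join(format_entry(n, d) for n, _, d in entries) + "\n"


def single_entry(entries, wanted):
    """Only the one entry user space actually wants to change."""
    for name, _, domains in entries:
        if name == wanted:
            return format_entry(name, domains) + "\n"
    raise KeyError(wanted)


entries = parse_schemata(SCHEMATA)
full_restore = restore_text(entries)
only_min = single_entry(entries, "MB_MIN")
```

Writing only_min corresponds to the "write only that entry" approach recommended earlier in the thread for new software, while full_restore is the strip-all-hashes path that relies on each entry being at least as specific as the ones before it.
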
>>> Since the "#" convention is for backward compatibility, maybe we should
>>> not use this for new schemata, and place the burden of managing
>>> conflicts onto userspace going forward. What do you think?
>>
>> I agree. The way I understand this is that the '#' will only be used for
>> new controls that shadow the default/current controls of the legacy resources.
>> I do not expect that the prefix will be needed for new resources, even if
>> the initial support of a new resource does not include all possible controls.
>
> OK. Note, relating this to the above, the # could be interpreted as
> meaning "this is a child of some other schema; don't mess with it
> unless you know what you are doing".

Could it be made more specific to be "this is a child of a legacy schema created
before this new format existed; don't mess with it unless you know what you are
doing"?

That is, any schema created after this new format is established does not need
the '#' prefix even if there is a parent/child relationship?

>
> Older software doesn't understand the relationships, so this is just
> there to stop it from shooting itself in the foot.

ack.

By extension I assume that software that understands a schema that is introduced
after the "relationship" format is established can be expected to understand the
format and thus these new schemata do not require the '#' prefix. Even if
a new schema is introduced with a single control it can be followed by a new child
control without a '#' prefix a couple of kernel releases later. By this point it
should hopefully be understood by user space that it should not write entries it does
not understand.

>
> [...]
>
>>>>>> MPAM has the "HARDLIM" distinction associated with these MAX values
>>>>>> and from what I can tell this is per PARTID. Is this something that needs
>>>>>> to be supported? To do this resctrl will need to support modifying
>>>>>> control properties per resource group.
>>>>>
>>>>> Possibly. Since this is a boolean control that determines how the
>>>>> MBW_MAX control is applied, we could perhaps present it as an
>>>>> additional schema -- if so, it's basically orthogonal.
>>>>>
>>>>> | MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...]
>>>>>
>>>>> or
>>>>>
>>>>> | MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...]
>>>>>
>>>>> Does this look reasonable?
>>>>
>>>> It does.
>>>
>>> OK -- note, I don't think we have any immediate plan to support this in
>>> the MPAM driver, but it may land eventually in some form.
>>>
>>
>> ack.
>
> (Or, of course, anything else that achieves the same goal...)

Right ... I did not dig into syntax that could be made to match existing
schema formats etc.; that can be filled in later.

...
>>> I'll try to pull the state of this discussion together -- maybe as a
>>> draft update to the documentation, describing the interface as proposed
>>> so far. Does that work for you?
>>
>> It does. Thank you very much for taking this on.
>>
>> Reinette
>
> OK, I'll aim to follow up on this next week.

Thank you very much.

Reinette