Message-ID: <e788ca62-ec63-4552-978b-9569f369afd5@intel.com>
Date: Fri, 17 Oct 2025 08:59:45 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Dave Martin <Dave.Martin@....com>
CC: "Luck, Tony" <tony.luck@...el.com>, <linux-kernel@...r.kernel.org>, "James
 Morse" <james.morse@....com>, Thomas Gleixner <tglx@...utronix.de>, "Ingo
 Molnar" <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, Dave Hansen
	<dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>, "Jonathan
 Corbet" <corbet@....net>, <x86@...nel.org>, <linux-doc@...r.kernel.org>
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be
 per-arch

Hi Dave,

On 10/17/25 7:17 AM, Dave Martin wrote:
> Hi Reinette,
> 
> On Thu, Oct 16, 2025 at 09:31:45AM -0700, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 10/15/25 8:47 AM, Dave Martin wrote:
>>> Hi Reinette,
>>>
>>> On Tue, Oct 14, 2025 at 03:55:40PM -0700, Reinette Chatre wrote:
>>>> Hi Dave,
>>>>
>>>> On 10/13/25 7:36 AM, Dave Martin wrote:

...

>>>>> So long as the entries affecting a single resource are ordered so that
>>>>> each entry is strictly more specific than the previous entries (as
>>>>> illustrated above), then reading schemata and stripping all the hashes
>>>>> would allow a previous configuration to be restored; to change just one
>>>>> entry, userspace can uncomment just that one, or write only that entry
>>>>> (which is what I think we should recommend for new software).
>>>>
>>>> This is a good rule of thumb.
>>>
>>> To avoid printing entries in the wrong order, do we want to track some
>>> parent/child relationship between schemata?
>>>
>>> In the above example,
>>>
>>> 	* MB is the parent of MB_HW;
>>>
>>> 	* MB_HW is the parent of MB_MIN and MB_MAX.
>>>
>>> (for MPAM, at least).
>>
>> Could you please elaborate on this relationship? I envisioned the MB_HW to be
>> something similar to Intel RDT's "optimal" bandwidth setting ... something
>> that is expected to be somewhere between the "min" and the "max".
>>
>> But, now I think I'm a bit lost in MPAM since it is not clear to me what
>> MB_HW represents ... would this be the "memory bandwidth portion
>> partitioning"? Although, that uses a completely different format from
>> "min" and "max".
> 
> I confess that I'm thinking with an MPAM mindset here.
> 
> Some pseudocode might help to illustrate how these might interact:
> 
> 	set_MB(partid, val) {
> 		set_MB_HW(partid, percent_to_hw_val(val));
> 	}
> 
> 	set_MB_HW(partid, val) {
> 		set_MB_MAX(partid, val);
> 
> 		/*
> 		 * Hysteresis to avoid steady flows from ping-ponging
> 		 * between low and high priority:
> 		 */
> 		if (hardware_has_MB_MIN())
> 			set_MB_MIN(partid, val * 95%);
> 	}
> 
> 	set_MB_MIN(partid, val) {
> 		mpam->MBW_MIN[partid] = val;
> 	}
> 
> 	set_MB_MAX(partid, val) {
> 		mpam->MBW_MAX[partid] = val;
> 	}
> 
> with
> 
> 	get_MB(partid) {
> 		return hw_val_to_percent(get_MB_HW(partid));
> 	}
> 
> 	get_MB_HW(partid) { return get_MB_MAX(partid); }
> 
> 	get_MB_MIN(partid) { return mpam->MBW_MIN[partid]; }
> 
> 	get_MB_MAX(partid) { return mpam->MBW_MAX[partid]; }
> 
> 
> The parent/child relationship I suggested is basically the call-graph
> of this pseudocode.  These could all be exposed as resctrl schemata,
> but the children provide finer / more broken-down control than the
> parents.  Reading a parent provides a merged or approximated view of
> the configuration of the child schemata.
> 
> In particular,
> 
> 	set_child(partid, get_child(partid));
> 	get_parent(partid);
> 
> yields the same result as
> 
> 	get_parent(partid);
> 
> but will not be true in general, if the roles of parent and child are
> reversed.
> 
> I think this still holds true if implementing an "MB_HW" schema for
> newer revisions of RDT.  The pseudocode would be different, but there
> will still be a tree-like call graph (?)

Thank you very much for the example. I missed in earlier examples that
MB_HW was being controlled via MB_MAX and MB_MIN.
I do not expect such a dependence or tree-like call graph for RDT where
the closest equivalent (termed "optimal") is programmed independently from
min and max.
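
The MPAM pseudocode quoted above can be turned into a small runnable model to check the round-trip property it describes. This is only a sketch: the class name, the 64-step hardware scale, and the integer rounding are assumptions made for illustration, and none of it is MPAM driver or resctrl code.

```python
class MpamModel:
    """Toy model of the quoted pseudocode; not driver code."""

    HW_SCALE = 64  # assumption: hardware value 64 corresponds to 100%

    def __init__(self, has_mb_min=True):
        self.has_mb_min = has_mb_min
        self.mbw_min = {}  # stands in for mpam->MBW_MIN[partid]
        self.mbw_max = {}  # stands in for mpam->MBW_MAX[partid]

    # Leaf controls: write or read one register value directly.
    def set_mb_min(self, partid, val):
        self.mbw_min[partid] = val

    def set_mb_max(self, partid, val):
        self.mbw_max[partid] = val

    def get_mb_min(self, partid):
        return self.mbw_min[partid]

    def get_mb_max(self, partid):
        return self.mbw_max[partid]

    # MB_HW is the parent of MB_MIN and MB_MAX.
    def set_mb_hw(self, partid, val):
        self.set_mb_max(partid, val)
        if self.has_mb_min:
            # 95% hysteresis from the pseudocode, in integer arithmetic.
            self.set_mb_min(partid, val * 95 // 100)

    def get_mb_hw(self, partid):
        return self.get_mb_max(partid)

    # MB is the parent of MB_HW.
    def set_mb(self, partid, percent):
        self.set_mb_hw(partid, percent * self.HW_SCALE // 100)

    def get_mb(self, partid):
        return self.get_mb_hw(partid) * 100 // self.HW_SCALE


def demo():
    m = MpamModel()
    m.set_mb(0, 50)      # parent write: MB_MAX=32, MB_MIN=30
    m.set_mb_min(0, 10)  # fine-grained change made through a child

    # Writing a child back to itself leaves the parent view unchanged.
    before = m.get_mb_hw(0)
    m.set_mb_min(0, m.get_mb_min(0))
    child_roundtrip_ok = m.get_mb_hw(0) == before

    # Writing a parent back to itself overwrites the child setting.
    m.set_mb_hw(0, m.get_mb_hw(0))
    parent_roundtrip_changed_child = m.get_mb_min(0) != 10
    return child_roundtrip_ok, parent_roundtrip_changed_child
```

In this model, rewriting MB_HW with its own value resets MB_MIN from 10 to 30, which is the asymmetry between parent and child that the quoted text points out.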

> 
> 
> Going back to MPAM:
> 
> Re MPAM memory bandwidth portion partitioning (a.k.a., MBW_PART or
> MBWPBM), this is a bitmap-type control, analogous to RDT CAT: memory
> bandwidth is split into discrete, non-overlapping chunks, and each
> PARTID is configured with a bitmap saying which chunks it can use.
> This could be done by time-slicing, or controlling which memory
> controllers/ports a PARTID can issue requests to, or something like
> that.
> 
> If the MBW_MAX control isn't implemented, then the current MPAM driver
> maps this bitmap control onto the resctrl "MB" schema in a simple way,
> but we are considering dropping this, since the allocation model
> (explicit, static allocation of discrete resources) is not really the
> same as for RDT MBA (dynamic prioritisation based on recent resource
> consumption).
> 
> Programming MBW_MAX=50% for four PARTIDs means that the PARTIDs contend
> on an equal footing for memory bandwidth until one exceeds 50% (when it
> will start to be penalised).  Programming bitmaps can't have the same
> effect.  For example, with { 1100, 0110, 0011, 1001 }, no group can use
> more than 50% of the full bandwidth, whatever happens.  Worse, certain
> pairs of groups are fully isolated from each other, while others are
> always in contention, no matter how little actual traffic is generated.
> This is potentially useful, but it's not the same as the MIN/MAX model.
> 
> So, it may make more sense to expose this as a separate, bitmap schema.
> 
> (The same goes for "Proportional stride" partitioning.  It's another,
> different, control for memory bandwidth.  As of today, I don't think
> that we have a reference platform for experimenting with either of
> these.)
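
The bitmap example quoted above, { 1100, 0110, 0011, 1001 }, can be checked mechanically. The sketch below assumes four equal bandwidth chunks and treats two groups as contending whenever their bitmaps share a chunk; the function name and return shape are illustrative only.

```python
from itertools import combinations


def analyse(bitmaps, width=4):
    """Return (max share per group, isolated pairs, contending pairs)."""
    # A group can never use more than the fraction of chunks in its bitmap.
    shares = [bin(b).count("1") / width for b in bitmaps]
    isolated, contending = [], []
    for (i, a), (j, b) in combinations(enumerate(bitmaps), 2):
        # A shared bit means both groups compete for the same chunk.
        (contending if a & b else isolated).append((i, j))
    return shares, isolated, contending


groups = [0b1100, 0b0110, 0b0011, 0b1001]
shares, isolated, contending = analyse(groups)
```

With these inputs every group is capped at half of the chunks, groups (0, 2) and (1, 3) never share a chunk, and the remaining four pairs always do, regardless of how much traffic each one generates.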

Thank you.

> 
> 
>>> When schemata is read, parents should always be printed before their
>>> child schemata.  But really, we just need to make sure that the
>>> rdt_schema_all list is correctly ordered.
>>>
>>>
>>> Do you think that this relationship needs to be reported to userspace?
>>
>> You brought up the topic of relationships in
>> https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/ that prompted me
>> to learn more from the MPAM spec, where I learned about all the other possible
>> namespaces and went on a tangent about them without circling back.
>>
>> I was hoping that the namespace prefix would make the relationships clear,
>> something like <resource>_<control>, but I did not expect another layer in
>> the hierarchy like your example above. The idea of "parent" and "child" is
>> also not obvious to me at this point. resctrl gives us a "resource" to start
>> with and we are now discussing multiple controls per resource. Could you please
>> elaborate on what you see as "parent" and "child"?
> 
> See above -- the parent/child concept is not an MPAM thing; apologies
> if I didn't make that clear.
> 
>> We do have the info directory available to express relationships and a
>> hierarchy is already starting to take shape there.
> 
> I'm wondering whether using a common prefix will be future-proof?  It
> may not always be clear which part of a name counts as the common
> prefix.

Apologies for my cryptic response. I was actually musing that we already
discussed using the info directory to express relationships between
controls and resources and it does not seem a big leap to expand
this to express relationships between controls. Consider something
like below for MPAM:

info
└── MB
    └── resource_schemata
        └── MB
            └── MB_HW
                ├── MB_MAX
                └── MB_MIN


On RDT it may then look different:

info
└── MB
    └── resource_schemata
        └── MB
            ├── MB_HW
            ├── MB_MAX
            └── MB_MIN

Having the resource name as common prefix does seem consistent and makes
clear to user space which controls apply to a resource. 
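
As a rough illustration of how user space might consume such a layout, the sketch below walks a resource_schemata tree and records each control's parent. It assumes every control is a directory, which this thread does not specify, and it builds the two example trees in a temporary directory rather than reading a real resctrl mount. The helper names are hypothetical.

```python
import os
import tempfile


def control_parents(resource_schemata_dir):
    """Map each control name to its parent control (None at the top level)."""
    parents = {}
    for dirpath, dirnames, _ in os.walk(resource_schemata_dir):
        at_top = dirpath == resource_schemata_dir
        parent = None if at_top else os.path.basename(dirpath)
        for name in dirnames:
            parents[name] = parent
    return parents


def build_tree(root, leaf_paths):
    """Create the directories for one example layout (illustration only)."""
    for path in leaf_paths:
        os.makedirs(os.path.join(root, path), exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    # MPAM-style example: MB_MIN and MB_MAX nest under MB_HW.
    mpam = os.path.join(tmp, "mpam")
    build_tree(mpam, ["MB/MB_HW/MB_MAX", "MB/MB_HW/MB_MIN"])
    mpam_parents = control_parents(mpam)

    # RDT-style example: the three controls are siblings under MB.
    rdt = os.path.join(tmp, "rdt")
    build_tree(rdt, ["MB/MB_HW", "MB/MB_MAX", "MB/MB_MIN"])
    rdt_parents = control_parents(rdt)
```

The same walker reports MB_HW as the parent of MB_MAX on the MPAM-style tree and MB as the parent on the RDT-style tree, so the relationship comes from the directory structure rather than from the names.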

> 
> There were already discussions about appending a number to a schema
> name in order to control different memory regions -- that's another
> prefix/suffix relationship, if so...
> 
> We could handle all of this by documenting all the relationships
> explicitly.  But I'm thinking that it could be easier for maintenance
> if the resctrl core code has explicit knowledge of the relationships.

Not just for resctrl itself but also to make clear to user space which
controls impact others and which are independent.

> That said, using a common prefix is still a good idea.  But maybe we
> shouldn't lean on it too heavily as a way of actually describing the
> relationships?

I do not think we can rely on the order in the schemata file though. For example,
I think MPAM's MB_HW is close enough to RDT's "optimal bandwidth" for RDT to
also use the MB_HW name (or maybe MPAM and RDT can both use MB_OPT?). In either
case the schemata may print something like below on both platforms (copied from
your original example), where for MPAM it implies a relationship but for RDT it
does not:

MB: 0=50, 1=50
# MB_HW: 0=32, 1=32
# MB_MIN: 0=31, 1=31
# MB_MAX: 0=32, 1=32
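
A minimal sketch of how user space could read such a listing follows. It uses the illustrative "name: dom=val, dom=val" syntax from this thread, not necessarily the separators the kernel emits, and the function names are hypothetical. Stripping the '#' yields the full set of entries that could be written back, while the commented flag lets older-style tools skip entries they do not understand.

```python
def parse_schemata(text):
    """Parse lines into (name, commented, {domain: value}) tuples."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        commented = line.startswith("#")
        if commented:
            line = line[1:].strip()  # drop the '#' marker
        name, _, values = line.partition(":")
        domains = {}
        for item in values.split(","):
            dom, _, val = item.strip().partition("=")
            domains[int(dom)] = int(val)
        entries.append((name.strip(), commented, domains))
    return entries


def restore_lines(text):
    """Rebuild every entry with the '#' removed, preserving file order."""
    return [
        f"{name}: " + ", ".join(f"{d}={v}" for d, v in doms.items())
        for name, _, doms in parse_schemata(text)
    ]


sample = """\
MB: 0=50, 1=50
# MB_HW: 0=32, 1=32
# MB_MIN: 0=31, 1=31
# MB_MAX: 0=32, 1=32
"""
entries = parse_schemata(sample)
uncommented_only = [name for name, commented, _ in entries if not commented]
```

Nothing in the parsed output distinguishes the MPAM case from the RDT case, which is the point made above: the listing alone does not say whether MB_HW depends on MB_MIN and MB_MAX.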

 
>>> Since the "#" convention is for backward compatibility, maybe we should
>>> not use this for new schemata, and place the burden of managing
>>> conflicts onto userspace going forward.  What do you think?
>>
>> I agree. The way I understand this is that the '#' will only be used for
>> new controls that shadow the default/current controls of the legacy resources.
>> I do not expect that the prefix will be needed for new resources, even if
>> the initial support of a new resource does not include all possible controls.
> 
> OK.  Note, relating this to the above, the # could be interpreted as
> meaning "this is a child of some other schema; don't mess with it
> unless you know what you are doing".

Could it be made more specific to be "this is a child of a legacy schema created
before this new format existed; don't mess with it unless you know what you are
doing"?
That is, any schema created after this new format is established does not need
the '#' prefix even if there is a parent/child relationship?

> 
> Older software doesn't understand the relationships, so this is just
> there to stop it from shooting itself in the foot.

ack.

By extension I assume that software that understands a schema that is introduced
after the "relationship" format is established can be expected to understand the
format and thus these new schemata do not require the '#' prefix. Even if
a new schema is introduced with a single control it can be followed by a new child
control without a '#' prefix a couple of kernel releases later. By this point it
should hopefully be understood by user space that it should not write entries it does
not understand.

> 
> [...]
> 
>>>>>> MPAM has the "HARDLIM" distinction associated with these MAX values
>>>>>> and from what I can tell this is per PARTID. Is this something that needs
>>>>>> to be supported? To do this resctrl will need to support modifying
>>>>>> control properties per resource group.
>>>>>
>>>>> Possibly.  Since this is a boolean control that determines how the
>>>>> MBW_MAX control is applied, we could perhaps present it as an
>>>>> additional schema -- if so, it's basically orthogonal.
>>>>>
>>>>>  | MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...]
>>>>>
>>>>> or
>>>>>
>>>>>  | MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...]
>>>>>
>>>>> Does this look reasonable?
>>>>
>>>> It does.
>>>
>>> OK -- note, I don't think we have any immediate plan to support this in
>>> the MPAM driver, but it may land eventually in some form.
>>>
>>
>> ack.
> 
> (Or, of course, anything else that achieves the same goal...)

Right ... I did not dig into syntax that could be made to match existing
schema formats etc.; that can be filled in later.

...

>>> I'll try to pull the state of this discussion together -- maybe as a
>>> draft update to the documentation, describing the interface as proposed
>>> so far.  Does that work for you?
>>
>> It does. Thank you very much for taking this on.
>>
>> Reinette
> 
> OK, I'll aim to follow up on this next week.

Thank you very much.

Reinette

