Message-ID: <aPZj1nDVEYmYytY9@agluck-desk3>
Date: Mon, 20 Oct 2025 09:31:18 -0700
From: "Luck, Tony" <tony.luck@...el.com>
To: Dave Martin <Dave.Martin@....com>
CC: Reinette Chatre <reinette.chatre@...el.com>,
<linux-kernel@...r.kernel.org>, James Morse <james.morse@....com>, "Thomas
Gleixner" <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, "Borislav
Petkov" <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter
Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>, <x86@...nel.org>,
<linux-doc@...r.kernel.org>
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be
per-arch
On Mon, Oct 20, 2025 at 04:50:38PM +0100, Dave Martin wrote:
> Hi Reinette,
>
> On Fri, Oct 17, 2025 at 08:59:45AM -0700, Reinette Chatre wrote:
> > Hi Dave,
> >
> > On 10/17/25 7:17 AM, Dave Martin wrote:
> > > Hi Reinette,
> > >
> > > On Thu, Oct 16, 2025 at 09:31:45AM -0700, Reinette Chatre wrote:
> > >> Hi Dave,
> > >>
> > >> On 10/15/25 8:47 AM, Dave Martin wrote:
>
> [...]
>
> > >>> To avoid printing entries in the wrong order, do we want to track some
> > >>> parent/child relationship between schemata.
> > >>>
> > >>> In the above example,
> > >>>
> > >>> * MB is the parent of MB_HW;
> > >>>
> > >>> * MB_HW is the parent of MB_MIN and MB_MAX.
> > >>>
> > >>> (for MPAM, at least).
> > >>
> > >> Could you please elaborate this relationship? I envisioned the MB_HW to be
> > >> something similar to Intel RDT's "optimal" bandwidth setting ... something
> > >> that is expected to be somewhere between the "min" and the "max".
> > >>
> > >> But, now I think I'm a bit lost in MPAM since it is not clear to me what
> > >> MB_HW represents ... would this be the "memory bandwidth portion
> > >> partitioning"? Although, that uses a completely different format from
> > >> "min" and "max".
> > >
> > > I confess that I'm thinking with an MPAM mindset here.
> > >
> > > Some pseudocode might help to illustrate how these might interact:
> > >
> > > set_MB(partid, val) {
> > > set_MB_HW(partid, percent_to_hw_val(val));
>
> [...]
>
> > > get_MB_MAX(partid) { return mpam->MBW_MAX[partid]; }
> > >
> > >
> > > The parent/child relationship I suggested is basically the call-graph
> > > of this pseudocode. These could all be exposed as resctrl schemata,
> > > but the children provide finer / more broken-down control than the
> > > parents. Reading a parent provides a merged or approximated view of
> > > the configuration of the child schemata.
> > >
> > > In particular,
> > >
> > > set_child(partid, get_child(partid));
> > > get_parent(partid);
> > >
> > > yields the same result as
> > >
> > > get_parent(partid);
> > >
> > > but will not be true in general, if the roles of parent and child are
> > > reversed.
> > >
> > > I think this still holds true if implementing an "MB_HW" schema for
> > > newer revisions of RDT. The pseudocode would be different, but there
> > > will still be a tree-like call graph (?)
> >
> > Thank you very much for the example. I missed in earlier examples that
> > MB_HW was being controlled via MB_MAX and MB_MIN.
> > I do not expect such a dependence or tree-like call graph for RDT where
> > the closest equivalent (termed "optimal") is programmed independently from
> > min and max.
>
> I hadn't realised that this RDT feature has three control thresholds.
>
> I'll comment in more detail on your sample info/ hierarchy, below.
>
> > >
> > > Going back to MPAM:
>
> [...]
>
> > > So, it may make more sense to expose [MBWPBM] as a separate, bitmap schema.
> > >
> > > (The same goes for "Proportional stride" partitioning. It's another,
> > > different, control for memory bandwidth. As of today, I don't think
> > > that we have a reference platform for experimenting with either of
> > > these.)
> >
> > Thank you.
> >
> > >
> > >
> > >>> When schemata is read, parents should always be printed before their
> > >>> child schemata. But really, we just need to make sure that the
> > >>> rdt_schema_all list is correctly ordered.
> > >>>
> > >>>
> > >>> Do you think that this relationship needs to be reported to userspace?
>
> [...]
>
> > >> We do have the info directory available to express relationships and a
> > >> hierarchy is already starting to take shape there.
> > >
> > > I'm wondering whether using a common prefix will be future-proof? It
> > > may not always be clear which part of a name counts as the common
> > > prefix.
> >
> > Apologies for my cryptic response. I was actually musing that we already
> > discussed using the info directory to express relationships between
> > controls and resources and it does not seem a big leap to expand
> > this to express relationships between controls. Consider something
> > like below for MPAM:
> >
> > info
> > └── MB
> >     └── resource_schemata
> >         └── MB
> >             └── MB_HW
> >                 ├── MB_MAX
> >                 └── MB_MIN
> >
> >
> > On RDT it may then look different:
> >
> > info
> > └── MB
> >     └── resource_schemata
> >         └── MB
> >             ├── MB_HW
> >             ├── MB_MAX
> >             └── MB_MIN
> >
> > Having the resource name as common prefix does seem consistent and makes
> > clear to user space which controls apply to a resource.
>
> Ack.
>
> The above hierarchies make sense, but I wonder whether we should be
> forcing software to understand the MIN and MAX limits?
>
> I can still see a benefit in having MB_HW be a generic, software-
> defined control, even on RDT. Then, this can always be available,
> with similar behaviour, on all resctrl instances that support memory
> bandwidth controls. The precise set of child controls will vary per
> arch (and on MPAM at least, between different hardware
> implementations) -- so these look like they will work less well as a
> generic interface.
>
>
> Considering RDT: to avoid random regulation behaviour, RDT says that
> you need MIN <= OPT <= MAX, so a generic "MB_HW" control that does not
> require software to understand the individual MIN, OPT and MAX
> thresholds would still need to program all of these under the hood so
> as to avoid an invalid combination being set in the hardware.
>
> If I have understood the definition of the MARC table correctly, then
> there is a separate flag to report the presence of each of MIN, MAX and
> OPT, so software _might_ be expected to use a random subset of them(?)
> (If so, that's somewhat like the MPAM situation.)
>
> So, I wonder whether we could actually have the following on RDT?
>
> info
> ├── MB
> ┆ └── resource_schemata
> ├── MB
> ┆ └── MB_HW
> ├── MB_MAX
> ├── MB_MIN
> └── MB_OPT
>
> If MB_HW is programmed by software, then MB_MAX, MB_OPT and MB_MIN
> would be programmed with some reasonable default spread (or possibly,
> all with the same value).
>
> That way, software that wants independent control over MIN, OPT and MAX
> can have it (and sweat the problem of dealing with hardware where they
> aren't all implemented -- if that's a thing). But software that
> doesn't need this fine control gets a single MB_HW knob that is more-or-
> less portable between platforms.
>
> Does that make sense, or is it an abstraction too far?
>
>
> (Going one step further, maybe we can actually put MPAM and RDT
> together with a 3-threshold model. For MPAM, we could possibly express
> the HARDLIM option using the extra threshold... that probably needs a
> bit more thought, though.)
>
> > > There were already discussions about appending a number to a schema
> > > name in order to control different memory regions -- that's another
> > > prefix/suffix relationship, if so...
> > >
> > > We could handle all of this by documenting all the relationships
> > > explicitly. But I'm thinking that it could be easier for maintenance
> > > if the resctrl core code has explicit knowledge of the relationships.
> >
> > Not just for resctrl itself but to make clear to user space which
> > controls impact others and which are independent.
> > > That said, using a common prefix is still a good idea. But maybe we
> > > shouldn't lean on it too heavily as a way of actually describing the
> > > relationships?
> > I do not think we can rely on order in schemata file though. For example,
> > I think MPAM's MB_HW is close enough to RDT's "optimal bandwidth" for RDT to
> > also use the MB_HW name (or maybe MPAM and RDT can both use MB_OPT?) in either
> > case the schemata may print something like below on both platforms (copied from
> > your original example) where for MPAM it implies a relationship but for RDT it
> > does not:
> >
> > MB: 0=50, 1=50
> > # MB_HW: 0=32, 1=32
> > # MB_MIN: 0=31, 1=31
> > # MB_MAX: 0=32, 1=32
>
> This still DTRT though? If MB_HW maps into the "optimal bandwidth"
> control on RDT, then it is still safe to program it first, before
> MB_{MIN,MAX}.
>
> The contents of the schemata file won't be sufficient to figure out the
> relationships, but that wasn't my intention. We have info/ for that.
>
> Instead, the schemata file just needs to be ordered in a way that is
> compatible with those relationships, so that one line does not
> unintentionally clobber the effect of a subsequent line.
>
>
> My concern was that if we rely totally on manual maintenance to keep the
> schemata file in a compatible order, we'll probably get that wrong
> sooner or later...
>
> > >>> Since the "#" convention is for backward compatibility, maybe we should
> > >>> not use this for new schemata, and place the burden of managing
> > >>> conflicts onto userspace going forward. What do you think?
> > >>
> > >> I agree. The way I understand this is that the '#' will only be used for
> > >> new controls that shadow the default/current controls of the legacy resources.
> > >> I do not expect that the prefix will be needed for new resources, even if
> > >> the initial support of a new resource does not include all possible controls.
> > >
> > > OK. Note, relating this to the above, the # could be interpreted as
> > > meaning "this is a child of some other schema; don't mess with it
> > > unless you know what you are doing".
> >
> > Could it be made more specific to be "this is a child of a legacy schema created
> > before this new format existed; don't mess with it unless you know what you are
> > doing"?
> > That is, any schema created after this new format is established does not need
> > the '#' prefix even if there is a parent/child relationship?
>
> Yes, I think so.
>
> Except: if some schema is advertised and documented with no children,
> then is it reasonable for software to assume that it will never have
> children?
>
> I think that the answer is probably "yes", in which case would it make
> sense to # any schema that is a child of some schema that did not have
> children in some previous upstream kernel?
>
> > >
> > > Older software doesn't understand the relationships, so this is just
> > > there to stop it from shooting itself in the foot.
> >
> > ack.
> >
> > By extension I assume that software that understands a schema that is introduced
> > after the "relationship" format is established can be expected to understand the
> > format and thus these new schemata do not require the '#' prefix. Even if
> > a new schema is introduced with a single control it can be followed by a new child
> > control without a '#' prefix a couple of kernel releases later. By this point it
> > should hopefully be understood by user space that it should not write entries it does
> > not understand.
>
> Generally, yes.
>
> I think that boils down to: "OK, previously you could just tweak bits
> of the whole schemata file you read and write the whole thing back,
> and the effect would be what you intuitively expected. But in future
> different schemata in the file may not be independent of one another.
> We'll warn you which things might not be independent, but we may not
> describe exactly how they affect each other.

Changes to the schemata file are currently "staged" and then applied.
There's some filesystem level error/sanity checking during the parsing
phase, but maybe for MB some parts can also be delayed, and re-ordered
when architecture code applies the changes.

E.g. filesystem code could check min <= opt <= max, while architecture
code would be responsible for writing the values to h/w in a sane
manner (assuming the architecture cares about transient effects when
things don't conform to the ordering).

E.g. user requests moving from min,opt,max = 10,20,30 to 40,50,60.
Regardless of the order in which those requests appeared in the
write(2) syscall, architecture code bumps max to 60, then opt to 50,
and finally min to 40.
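
A rough sketch of that ordering, in Python purely for illustration.
This is not resctrl code: plan_writes() and the (min, opt, max) tuples
are hypothetical names. It assumes both the old and the new settings
satisfy min <= opt <= max, and that each threshold is written
separately. Raising thresholds from the top down and then lowering
from the bottom up keeps min <= opt <= max true after every write:

```python
NAMES = ("min", "opt", "max")


def plan_writes(old, new):
    """Order the writes that move (min, opt, max) from old to new.

    Hypothetical helper: returns a list of (name, value) writes such
    that min <= opt <= max holds after each individual write.
    """
    for triple in (old, new):
        if not (triple[0] <= triple[1] <= triple[2]):
            raise ValueError("need min <= opt <= max")

    plan = []
    # Thresholds that go up: write max first, min last, so a raised
    # value never overtakes the threshold above it.
    for i in (2, 1, 0):
        if new[i] > old[i]:
            plan.append((NAMES[i], new[i]))
    # Thresholds that go down: write min first, max last, so a lowered
    # value never drops below the threshold beneath it.
    for i in (0, 1, 2):
        if new[i] < old[i]:
            plan.append((NAMES[i], new[i]))
    return plan


def apply_plan(old, plan):
    """Replay a plan and return every intermediate (min, opt, max)."""
    state = list(old)
    states = []
    for name, value in plan:
        state[NAMES.index(name)] = value
        states.append(tuple(state))
    return states
```

For the 10,20,30 to 40,50,60 case this gives max, then opt, then min.
Going the other way it gives min, then opt, then max.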
>
> "So, from now on, only write the things that you actually want to set."
>
> Does that sound about right?

Users might still use their favorite editor on the schemata file and
so write everything, while only changing a subset. So if we don't go
for the full two-phase update I described above, this would be:

"only *change* the things that you actually want to set".

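
A user-space tool could get the same effect by writing back only the
lines it edited. A minimal sketch, again in Python for illustration
only: changed_lines() is a hypothetical helper, and it assumes each
schemata line has the form "NAME:value" as in today's schemata file.

```python
def changed_lines(before, after):
    """Return the lines of `after` whose value differs from `before`.

    Hypothetical helper: `before` is the schemata text as read, and
    `after` is the same text once the user has edited it.
    """
    def parse(text):
        entries = {}
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            # Split "NAME:value" at the first colon only.
            name, _, value = line.partition(":")
            entries[name.strip()] = value.strip()
        return entries

    old = parse(before)
    return [
        f"{name}:{value}"
        for name, value in parse(after).items()
        if old.get(name) != value
    ]
```

Only the lines returned would then be written to the schemata file, so
untouched schemata cannot clobber the ones that were changed.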
> [...]
>
> > >>>
> > >>> OK -- note, I don't think we have any immediate plan to support [HARDLIM] in
> > >>> the MPAM driver, but it may land eventually in some form.
> > >>>
> > >>
> > >> ack.
> > >
> > > (Or, of course, anything else that achieves the same goal...)
> >
> > Right ... I did not dig into syntax that could be made to match existing
> > schema formats etc. that can be filled in later.
>
> Ack
>
> > ...
> >
> > >>> I'll try to pull the state of this discussion together -- maybe as a
> > >>> draft update to the documentation, describing the interface as proposed
> > >>> so far. Does that work for you?
> > >>
> > >> It does. Thank you very much for taking this on.
> > >>
> > >> Reinette
> > >
> > > OK, I'll aim to follow up on this next week.
> >
> > Thank you very much.
> >
> > Reinette
>
> Cheers
> ---Dave
-Tony