[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aO/CEuyaIyZ5L28d@e133380.arm.com>
Date: Wed, 15 Oct 2025 16:47:30 +0100
From: Dave Martin <Dave.Martin@....com>
To: Reinette Chatre <reinette.chatre@...el.com>
Cc: "Luck, Tony" <tony.luck@...el.com>, linux-kernel@...r.kernel.org,
James Morse <james.morse@....com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
"H. Peter Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>,
x86@...nel.org, linux-doc@...r.kernel.org
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be
per-arch
Hi Reinette,
On Tue, Oct 14, 2025 at 03:55:40PM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> On 10/13/25 7:36 AM, Dave Martin wrote:
> > Hi Reinette,
> >
> > On Fri, Oct 10, 2025 at 09:48:21AM -0700, Reinette Chatre wrote:
> >> Hi Dave,
> >>
> >> On 9/30/25 8:40 AM, Dave Martin wrote:
> >>> On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote:
> >>>> On 9/29/25 6:56 AM, Dave Martin wrote:
> >
> > [...]
> >
> >>>> 1) Commented schema are "inactive"
> >>>> This is unclear to me. In the MB example the commented lines show the
> >>>> finer grained controls. Since the original MB resource is an approximation
> >>>> and the hardware must already be configured to support it, would the #-prefixed
> >>>> lines not show the actual "active" configuration?
> >>>
> >>> They would show the active configuration (possibly more precisely than
> >>> "MB" does).
> >>
> >> That is how I see it also. This is specific to MB as we try to maintain
> >> backward compatibility.
> >>
> >> If we are going to make user interface changes to resource allocation then
> >> ideally it should consider all known future usage. I am trying to navigate
> >> and understand the discussion on how resctrl can support MPAM and this
> >> RDT region aware requirements.
> >>
> >> I scanned the MPAM spec and from what I understand a resource may support
> >> multiple controls at the same time, each with its own properties, and then
> >> there was this:
> >>
> >> When multiple partitioning controls are active, each affects the partition’s
> >> bandwidth usage. However, some combinations of controls may not make sense,
> >> because the regulation of that pair of controls cannot be made to work in concert.
> >>
> >> resctrl may thus present an "active configuration" that is not a configuration
> >> that "makes sense" ... this may be ok as resctrl would present what hardware
> >> supports combined with what user requested.
> >
> > This is analogous to what the MPAM spec says, though if resctrl offers
> > two different schemata for the same hardware control, the control cannot be
> > configured with both values simultaneously.
> >
> > For the MPAM hardware controls affecting the same hardware resource,
> > they can be programmed to combinations of values that have no sensible
> > interpretation, and the values can be read back just fine. The
> > performance effects may not be what the user expected / wanted, but
> > this is not directly visible to resctrl.
> >
> > So, if we offer independent schemata for MBW_MIN and MBW_MAX, the user
> > can program MBW_MIN=75% and MBW_MAX=25% for the same PARTID, and that
> > will read back just as programmed. The architecture does not promise
> > what the performance effect of this will be, but resctrl does not need
> > to care.
>
> The same appears to be true for Intel RDT where the spec warns ("Undesirable
> and undefined performance effects may result if cap programming guidelines
> are not followed.") but does not seem to prevent such configurations.
Right. We _could_ block such a configuration from reaching the hardware,
if the arch backend overrides the MIN limit when the MAX limit is
written and vice-versa, when not doing to would result in crossed-over
bounds.
If software wants to program both bounds, then that would be fine: in:
# cat <<-EOF >/sys/fs/resctrl/schemata
MB_MAX: 0=128
EOF
# cat <<-EOF >/sys/fs/resctrl/schemata
MB_MIN: 0=256
MB_MAX: 0=1024
EOF
... internally programming some value >=256 before programming the
hardware with the new min bound would not stop the final requested
change to MB_MAX from working as userspace expected.
(There will be inevitable regulation glitches unless the hardware
provides a way to program both bounds atomically. MPAM doesn't; I
don't think RDT does either?)
But we only _need_ to do this if the hardware architecture forbids
programming cross bounds or says that it is unsafe to do so. So, I am
thinking that the generic code doesn't need to handle this.
[...]
> >> To be specific, the original proposal [1] introduced a set of files for
> >> a numeric control and that seems to work for existing and upcoming
> >> schema that need a value in a range. Different controls need different
> >> parameters so to integrate this solution I think it needs another parameter
> >> (presented as a directory, a file, or within a file) that indicates the
> >> type of the control so that user space knows which files/parameters to expect
> >> and how to interpret them.
> >
> > Agreed. I wasn't meaning to imply that this proposal shouldn't be
> > integrated into something more general. If we want a richer
> > description than the current one, it makes sense to incorporate bitmap
> > controls -- this just wasn't my focus.
>
> Understood.
>
> >
> >> Since different controls have different parameters we need to consider
> >> whether it is easier to create/parse unique files for each control or
> >> present all the parameters within one file with another file noting the type
> >> of control.
> >
> > Separate files works quite well for low-tech tooling built using shell
> > scripts, and this seems to follow the sysfs philosophy. Since there is
> > no need to keep re-reading these parameters, simplicity feels more
> > important than efficiency?
> >
> > But we could equally have a single file with multiple pieces of
> > information in it.
> >
> > I don't have a strong view on this.
>
> If by sysfs philosophy you men "one value per file" then resctrl split from that from
> the beginning (with the schemata file). I am also not advocating for one or the other
> at this time but believe we have some flexibility when faced with implementation
> options/challenges.
Agreed -- it works either way.
[...]
> >> At this time I am envisioning the proposal to result in something like below where
> >> there is one resource directory and one directory per schema entry with a (added by me)
> >> "schema_type" file to help user find out what the schema type is to know which files are present:
> >>
> >> MB
> >> ├── bandwidth_gran
> >> ├── delay_linear
> >> ├── MB
> >> │ ├── map
> >> │ ├── max
> >> │ ├── min
> >> │ ├── scale
> >> │ ├── schema_type
> >> │ └── unit
> >> ├── MB_HW
> >> │ ├── map
[...]
> >> ├── min_bandwidth
> >> ├── num_closids
> >> └── thread_throttle_mode
> >
> > I see no reason not to do that. Either way, older userspace just
> > ignores the new files and directories.
> >
> > Perhaps add an intermediate subdirectory to clarify the relationship
> > between the resource dir and the individual schema descriptions?
> >
> > This may also avoid the new descriptions getting mixed up with the old
> > description files.
> >
> > Say,
> >
> > info
> > ├── MB
> > │ ├── resource_schemata
> > │ │ ├── MB
> > │ │ │ ├── map
> > │ │ │ ├── max
> > │ ┆ │ ├── min
> > │ │ ┆
> > ┆ │
> > ├── MB_HW
> > │ ├── map
> > │ ┆
> > ┆
>
> Looks good to me.
OK
> >> Something else related to control that caught my eye in MPAM spec is this gem:
> >> MPAM provides discoverable vendor extensions to permit partners
> >> to invent partitioning controls.
> >
> > Yup.
> >
> > Since we have no way to know what vendor-specific controls look like or
> > what they mean, we can't do much about this.
> >
> > So, it's the vendor's job to implement support for it, and we might
> > still say no (if there is no sane way to integrate it).
>
> ack.
>
> >
> >>> MB may be hard to describe in a useful way, though -- at least in the
> >>> MPAM case, where the number of steps does not divide into 100, and the
> >>> AMD cases where the meaning of the MB control values is different.
> >>
> >> Above I do assume that MB would be represented in a new interface since it
> >> is a schema entry, if that causes trouble then we could drop it.
> >
> > Since MB is described by the existing files and the documentation,
> > perhaps this it doesn't need an additional description.
> >
> > Alternatively though, could we just have a special schema_type for this,
> > and omit the other properties? This would mean that we at least have
> > an entry for every schema.
>
> We could do this, yes.
I guess I'll go with this approach, then, and see if anyone objects.
[...]
> >>> MB: 0=50, 1=50
> >>> # MB_HW: 0=32, 1=32
> >>> # MB_MIN: 0=16, 1=16
> >>> # MB_MAX: 0=32, 1=32
> >>
> >> Could/should resctrl uncomment the lines after userspace modified them?
> >
> > The '#' wasn't meant to be a state that gets turned on and off.
>
> Thank you for clarifying.
>
> > Rather, userspace would use this to indicate which entries are
> > intentionally being modified.
>
> I see. I assume that we should not see many of these '#' entries and expect
> the ones we do see to shadow the legacy schemata entries. New schemata entries
> (that do not shadow legacy ones) should not have the '#' prefix even if
> their initial support does not include all controls.
> > So long as the entries affecting a single resource are ordered so that
> > each entry is strictly more specific than the previous entries (as
> > illustrated above), then reading schemata and stripping all the hashes
> > would allow a previous configuration to be restored; to change just one
> > entry, userspace can uncomment just that one, or write only that entry
> > (which is what I think we should recommend for new software).
>
> This is a good rule of thumb.
To avoid printing entries in the wrong order, do we want to track some
parent/child relationship between schemata.
In the above example,
* MB is the parent of MB_HW;
* MB_HW is the parent of MB_MIN and MB_MAX.
(for MPAM, at least).
When schemata is read, parents should always be printed before their
child schemata. But really, we just need to make sure that the
rdt_schema_all list is correctly ordered.
Do you think that this relationship needs to be reported to userspace?
Since the "#" convention is for backward compatibility, maybe we should
not use this for new schemata, and place the burden of managing
conflicts onto userspace going forward. What do you think?
> >>> (For hardware-specific reasons, the MPAM driver currently internally
> >>> programs the MIN bound to be a bit less than the MAX bound, when
> >>> userspace writes an "MB" entry into schemata. The key thing is that
> >>> writing MB may cause the MB_MIN/MB_MAX entries to change -- at the
> >>> resctrl level, I don't that that we necessarily need to make promises
> >>> about what they can change _to_. The exact effect of MIN and MAX
> >>> bounds is likely to be hardware-dependent anyway.)
> >>
> >> MPAM has the "HARDLIM" distinction associated with these MAX values
> >> and from what I can tell this is per PARTID. Is this something that needs
> >> to be supported? To do this resctrl will need to support modifying
> >> control properties per resource group.
> >
> > Possibly. Since this is a boolean control that determines how the
> > MBW_MAX control is applied, we could perhaps present it as an
> > additional schema -- if so, it's basically orthogonal.
> >
> > | MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...]
> >
> > or
> >
> > | MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...]
> >
> > Does this look reasonable?
>
> It does.
OK -- note, I don't think we have any immediate plan to support this in
the MPAM driver, but it may land eventually in some form.
[...]
> >>> Regarding new userspce:
> >>>
> >>> Going forward, we can explicitly document that there should be no
> >>> conflicting or "passenger" entries in a schemata write: don't include
> >>> an entry for somehing that you don't explicitly want to set, and if
> >>> multiple entries affect the same resource, we don't promise what
> >>> happens.
> >>>
> >>> (But sadly, we can't impose that rule on existing software after the
> >>> fact.)
> >>
> >> It may thus not be worth it to make such a rule.
> >
> > Ack. Perhaps we could recommend it, though.
>
> We could, yes.
OK
[...]
> >> MPAM allows per-PARTID configurations for secure/non-secure, physical/virtual,
> >> ... ? Is it expected that MPAM's support of these should be exposed via resctrl?
> >
> > Probably not. These are best regarded as entirely separate instances
> > of MPAM; the PARTID spaces are separate. The Non-secure physical
> > address space is the only physical address space directly accessible to
> > Linux -- for the others, we can't address the MMIO registers anyway.
> >
> > For now, the other address spaces are the firmware's problem.
>
> Thank you.
No worries -- it's not too obvious from the spec!
> >> Have you considered how to express if user wants hardware to have different
> >> allocations for, for example, same PARTID at different execution levels?
> >>
> >> Reinette
> >>
> >> [1] https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/
> >
> > MPAM doesn't allow different controls for a PARTID depending on the
> > exception level, but it is possible to program different PARTIDs for
> > hypervisor/kernel and userspace (i.e., EL2/EL1 and EL0).
>
> I misunderstood this from the spec. Thank you for clarifying.
>
> >
> > I think that if we wanted to go down that road, we would want to expose
> > additional "task IDs" in resctrlfs that can be placed into groups
> > independently, say
> >
> > echo 14161:kernel >>.../some_group/tasks
> > echo 14161:user >>.../other_group/tasks
> >
> > However, inside the kernel, the boundary between work done on behalf of
> > a specific userspace task, work done on behalf of userspace in general,
> > and autonomous work inside the kernel is fuzzy and not well defined.
> >
> > For this reason, we currently only configure the PARTID for EL0. For
> > EL1 (and EL2 if the kernel uses it), we just use the default PARTID (0).
> >
> > Hopefully this is orthogonal to the discussion of schema descriptions,
> > though ...?
>
> Yes.
OK; I suggest that we put this on one side, for now, then.
There is a discussion to be had on this, but it feels like a separate
thing.
I'll try to pull the state of this discussion together -- maybe as a
draft update to the documentation, describing the interface as proposed
so far. Does that work for you?
Cheers
--Dave
Powered by blists - more mailing lists