Message-ID: <aXUK7XFsHl+gnwA/@x1>
Date: Sat, 24 Jan 2026 10:09:49 -0800
From: Drew Fustini <fustini@...nel.org>
To: Reinette Chatre <reinette.chatre@...el.com>
Cc: Dave Martin <Dave.Martin@....com>, linux-kernel@...r.kernel.org,
Babu Moger <babu.moger@....com>, Fenghua Yu <fenghuay@...dia.com>,
Tony Luck <tony.luck@...el.com>, James Morse <james.morse@....com>,
"Chen, Yu C" <yu.c.chen@...el.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
"H. Peter Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>,
x86@...nel.org
Subject: Re: [RFC] fs/resctrl: Generic schema description
On Tue, Dec 16, 2025 at 02:26:23PM -0800, Reinette Chatre wrote:
> Hi Babu and Fenghua,
>
> Could you please consider how the new AMD and MPAM features [2] may benefit
> from the new interfaces proposed here? More below ...
>
> On 10/24/25 4:12 AM, Dave Martin wrote:
> > Hi all,
> >
> > Going forward, a single resctrl resource (such as memory bandwidth) is
> > likely to require multiple schemata, either because we want to add new
> > schemata that provide finer control, or because the hardware has
> > multiple controls, covering different aspects of resource allocation.
> >
> > The fit between MPAM's memory bandwidth controls and the resctrl MB
> > schema is already awkward, and later Intel RDT features such as Region
> > Aware Memory Bandwidth Allocation are already pushing past what the MB
> > schema can describe. Both of these can involve multiple control
> > values and finer resolution than the 100 steps offered by the current
> > "MB" schema.
> >
> > The previous discussion went off in a few different directions [1], so
> > I want to focus back onto defining an extended schema description that
> > aims to cover the use cases that we know about or anticipate today, and
> > allows for future extension as needed.
> >
> > (A separate discussion is needed on how new schemata interact with
> > previously-defined schemata (such as the MB percentage schema).
> > I suggest we pause that discussion for now, in the interests of getting
> > the schema description nailed down.)
> >
> >
> > Following on from the previous mail thread, I've tried to refine and
> > flesh out the proposal for schema descriptions a bit, as follows.
> >
> > Proposal:
> >
> > * Split resource names and schema names in resctrlfs.
> >
> > Resources will be named for the unique, existing schema for each
> > resource.
> >
> > The existing schema will keep its name (the same as the resource
> > name), and new schemata defined for a resource will include that
> > name as a prefix (at least, by default).
> >
> > So, for example, we will have an MB resource with a schema called
> > MB (the schema that we have already). But we may go on to define
> > additional schemata for the MB resource, with names such as MB_MAX,
> > etc.
> >
> > * Stop adding new schema description information in the top-level
> > info/<resource>/ directory in resctrlfs.
> >
> > For backwards compatibility, we can keep the existing property
> > files under the resource info directory to describe the previously
> > defined resource, but we seem to need something richer going
> > forward.
> >
> > * Add a hierarchy to list all the schemata for each resource, along
> > with their properties. So far, the proposal looks like this,
> > taking the MB resource as an example:
> >
> >   info/
> >    └─ MB/
> >        └─ resource_schemata/
> >            ├─ MB/
> >            ├─ MB_MIN/
> >            ├─ MB_MAX/
> >            ┆
> >
> > Here, MB, MB_MIN and MB_MAX are all schemata for the "MB" resource.
> > In this proposal, these are just dummy schema names for
> > illustration purposes. The important thing is that they all
> > control aspects of the "MB" resource, and that there can be more
> > than one of them.
> >
> > It may be appropriate to have a nested hierarchy, where some
> > schemata are presented as children of other schemata if they
> > affect the same hardware controls. For now, let's put this issue
> > on one side, and consider what properties should be advertised for
> > each schema.
> >
> > * Current properties that I think we might want are:
> >
> >   info/
> >    └─ SOME_RESOURCE/
> >        └─ resource_schemata/
> >            ├─ SOME_SCHEMA/
> >            ┆   ├─ type
> >                ├─ min
> >                ├─ max
> >                ├─ tolerance
> >                ├─ resolution
> >                ├─ scale
> >                └─ unit
> >
> > (I've tweaked the properties a bit since previous postings.
> > "type" replaces "map"; "scale" is now the unit multiplier;
> > "resolution" is now a scaling divisor -- details below.)
> >
> > I assume that we expose the properties in individual files, but we
> > could also combine them into a single description file per schema,
> > per resource or (possibly) a single global file.
> > (I don't have a strong view on the best option.)
> >
> >
> > Either way, the following set of properties may be a reasonable
> > place to start:
> >
> >
> > type: the schema type, followed by optional flag specifiers:
> >
> > - "scalar": a single-valued numeric control
> >
> > A mandatory flag indicates how the control value written to
> > the schemata file is converted to an amount of resource for
> > hardware regulation.
> >
> > The flag "linear" indicates a linear mapping.
> >
> > In this case, the amount of resource E that is actually
> > allocated is derived from the control value C written to the
> > schemata file as follows:
> >
> > E = C * scale * unit / resolution
> >
> > Other flag values could be defined later, if we encounter
> > hardware with non-linear controls.
> >
> > - "bitmap": a bitmap control
> >
> > The optional flag "sparse" is present if the control accepts
> > sparse bitmaps.
> >
> > In this case, E = bitmap_weight(C) * scale * unit / resolution.
> >
> > As before, each bit controls access to a specific chunk of
> > resource in the hardware, such as a group of cache lines. All
> > chunks are equally sized.
> >
> > (Different CTRL_MON groups may still contend within the
> > allocation E, when they have bits in common between their
> > bitmaps.)
> >
> > min:
> >
> > - For a scalar schema, the minimum value that can be written to
> > the control when writing the schemata file.
> >
> > - For a bitmap schema, a bitmap of the minimum weight that the
> > schema accepts: if an empty bitmap is accepted, this can be 0.
> > Otherwise, if bitmaps with a single bit set are acceptable,
> > this can just have the lowest-order bit set.
> >
> > Most commonly, the value will probably be "1".
> >
> > For bitmap schemata, we might report this in hex. In the
> > interest of generic parsing, we could include a "0x" prefix if
> > so.
> >
> > max:
> >
> > - For a scalar schema, the maximum value that can be written to
> > the control when writing the schemata file.
> >
> > - For a bitmap schema, the mask with all bits set.
> >
> > Possibly reported in hex for bitmap schemata (as for "min").
> >
> > tolerance:
> >
> > (See below for discussion on this.)
> >
> > - "0": the control is exact
> >
> > - "1": the effective control value is within ±1 of the control
> > value written to the schemata file. (Similarly, positive "n" ->
> > ±n.)
> >
> > A negative value could be used to indicate that the tolerance
> > is unknown. (Possibly we could also just omit the property,
> > though it seems better to warn userspace explicitly if we
> > don't know.)
> >
> > Tests might make use of this parameter in order to determine
> > how picky to be about exact measurement results.
> >
> > resolution:
> >
> > - For a proportional scalar schema: the number of divisions that
> > the whole resource is divided into. (See below for
> > "proportional scalar schema".)
> >
> > Typically, this will be the same as the "max" value.
> >
> > - For an absolute scalar schema: the divisor applied to the
> > control value.
> >
> > - For a bitmap schema: the size of the bitmap in bits.
> >
> > scale:
> >
> > - For a scalar schema: the scale-up multiplier applied to
> > "unit".
> >
> > - For a bitmap schema: probably "1".
> >
> > unit:
> >
> > - The base unit of the quantity measured by the control value.
> >
> > The special unit "all" denotes a proportional schema. In this
> > case, the resource is a finite, physical thing such as a cache
> > or maxed-out data throughput of a memory controller. The
> > entire physical resource is available for allocation, and the
> > control value indicates what proportion of it is allocated.
> >
> > Bitmap schemata will probably all be proportional and use the
> > unit "all". (This applies to cache bitmaps, at least.)
> >
> > Absolute schemata will require specification of the base unit
> > here, say, "MBps". The "scale" parameter can be used to avoid
> > proliferation of unit strings:
> >
> > For example, {scale=1000, unit="MBps"} would be equivalent to
> > {scale=1, unit="GBps"}.
> >
> >
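
To make the linear mapping quoted above concrete, here is a minimal
Python sketch of E = C * scale * unit / resolution, returning E in
multiples of "unit", for the "scalar" and "bitmap" types. The function
name effective_allocation and its parameters are my own, purely for
illustration; none of this is an existing resctrl interface, and the
example control values are made up.

```python
from fractions import Fraction


def effective_allocation(control, scale, resolution, schema_type="scalar"):
    """Return E in multiples of "unit" for a linear control (illustrative only).

    scalar: E = C * scale / resolution
    bitmap: E = bitmap_weight(C) * scale / resolution
    """
    if resolution <= 0:
        raise ValueError("resolution must be positive")
    if schema_type == "bitmap":
        # bitmap_weight(C): the number of bits set in the control value
        control = bin(control).count("1")
    elif schema_type != "scalar":
        raise ValueError(f"unknown schema type: {schema_type!r}")
    # Fraction keeps the result exact instead of rounding to a float
    return Fraction(control * scale, resolution)


# Proportional scalar (unit "all"): 50 out of 100 divisions
print(effective_allocation(50, scale=1, resolution=100))
# Bitmap (unit "all"): 4 bits set in a 12-bit mask
print(effective_allocation(0xF0, scale=1, resolution=12, schema_type="bitmap"))
# Absolute scalar: C=2 with {scale=1000, unit="MBps"}
print(effective_allocation(2, scale=1000, resolution=1))
```

With unit "all" the result is the fraction of the whole resource; with
an absolute unit such as "MBps" it is the quantity expressed in that unit.
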
> > Note on the "tolerance" parameter:
> >
> > This is a new addition. On the MPAM side, the hardware has a choice
> > about how to interpret the control value in some edge-case situations.
> > We may not reasonably be able to probe for this, so it may be useful
> > to warn software that there is an uncertainty margin.
> >
> > We might also be able to use the "tolerance" parameter to accommodate
> > the rounding behaviour of the existing "MB" schema (otherwise, we
> > might want a special "type" for this schema, if it doesn't comply
> > closely enough).
> >
> >
> > If we want to deploy resctrl under virtualisation, resctrl on the host
> > could dynamically affect the actual amount of resource that is
> > available for allocation inside a VM.
> >
> > Whether or not we ever want to do that, it might be useful to have a
> > way to warn software that the effective control values hitting the
> > hardware may not be entirely predictable.
> >
> > Thoughts?
> >
> > Cheers
> > ---Dave
>
>
> One thing I was pondering is that resctrl currently uses L3 interchangeably
> as a scope and a resource but if instead that is separated then it should be
> easier to support interactions with resource at a different scope.
>
> I am concerned that, for example, support for Global Memory Bandwidth Allocation
> (GMBA) is planned to be done with a new resource. resctrl already has a
> "memory bandwidth allocation" resource and introducing a new resource to essentially
> manage the same resource, but at a different scope, sounds like a risk of fragmentation
> and duplication to me.
>
> What if the "resource control" instead gains a new property, for example, "scope" that
> essentially communicates to user space what a domain ID in the schemata file means.
>
> It is not clear to me what a "domain ID" of GMBA means so I will use the MPAM CPU-less
> MBM as an example that I expect will build on SMBA that supports CXL.mem. Consider an interface
> like below:
>
> info
> └── SMBA
>     └── resource_schemata
>         ├── SMBA
>         │   ├── max
>         │   ├── min
>         │   ├── resolution
>         │   ├── scale
>         │   ├── scope        <== contains "L3"
>         │   ├── tolerance
>         │   ├── type
>         │   └── unit
>         └── SMBA_NODE
>             ├── max
>             ├── min
>             ├── resolution
>             ├── scale
>             ├── scope        <== contains "NODE"
>             ├── tolerance
>             ├── type
>             └── unit
>
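
As a rough illustration of how user space might consume a layout like
the one quoted above, the Python sketch below builds a throwaway copy of
the tree in a temporary directory and reads every property file back.
Only the "scope" values ("L3" and "NODE") come from the example; the
helper names are hypothetical, and the real files do not exist in
resctrl today.

```python
import tempfile
from pathlib import Path


def read_schema_properties(info_dir, resource):
    """Return {schema name: {property name: value}} for one resource."""
    base = Path(info_dir) / resource / "resource_schemata"
    result = {}
    for schema in sorted(base.iterdir()):
        if not schema.is_dir():
            continue
        result[schema.name] = {
            prop.name: prop.read_text().strip()
            for prop in sorted(schema.iterdir())
            if prop.is_file()
        }
    return result


def build_example_tree(info_dir, resource, scopes):
    """Create <resource>/resource_schemata/<schema>/scope files (mock data)."""
    for schema, scope in scopes.items():
        schema_dir = Path(info_dir) / resource / "resource_schemata" / schema
        schema_dir.mkdir(parents=True)
        (schema_dir / "scope").write_text(scope + "\n")


# Mock tree only: a real tool would point info_dir at the resctrl info/ directory.
with tempfile.TemporaryDirectory() as root:
    build_example_tree(root, "SMBA", {"SMBA": "L3", "SMBA_NODE": "NODE"})
    props = read_schema_properties(root, "SMBA")

scopes = {name: values["scope"] for name, values in props.items()}
print(scopes)
```

A tool written this way would not need to know schema names in advance:
it discovers them from the directory listing and learns from "scope"
what a domain ID means for each one.
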
> With an interface like above there is a single resource and allocating it at a different
> scope is just another control. This correlates to how other parts of resctrl are managed.
> For example, it can become explicit that the monitor groups' mon_data directory contains
> sub-directories organized by scope. For example:
>
> mon_data
> ├── mon_L3_00          <== monitoring data at scope L3
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_L3_01          <== monitoring data at scope L3
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_NODE_00        <== monitoring data at scope NODE
> │   └── mbm_total_bytes
> └── mon_NODE_01        <== monitoring data at scope NODE
>     └── mbm_total_bytes
>
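
For illustration, the directory naming quoted above could be split into
a scope and a domain ID along these lines. This is only a sketch: it
assumes names always have the form mon_<SCOPE>_<ID> as in the example,
and the function names are made up.

```python
import re
from collections import defaultdict

# Assumed pattern: "mon_" + scope name + "_" + decimal domain ID
MON_DIR = re.compile(r"^mon_(?P<scope>[A-Za-z0-9]+)_(?P<domain>[0-9]+)$")


def parse_mon_dir(name):
    """Split a mon_data sub-directory name into (scope, domain ID)."""
    match = MON_DIR.match(name)
    if match is None:
        raise ValueError(f"unexpected mon_data entry: {name!r}")
    return match.group("scope"), int(match.group("domain"))


def domains_by_scope(names):
    """Group domain IDs by the scope encoded in each directory name."""
    grouped = defaultdict(list)
    for name in names:
        scope, domain = parse_mon_dir(name)
        grouped[scope].append(domain)
    return {scope: sorted(ids) for scope, ids in grouped.items()}


example = ["mon_L3_00", "mon_L3_01", "mon_NODE_00", "mon_NODE_01"]
print(domains_by_scope(example))
```
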
> What do you think?
I think that the ability to have different scopes for a resource would
work well for QoS on RISC-V. The CBQRI spec [1] defines bandwidth
controller operations which can be anywhere in the system. I've been
having trouble trying to decide what to do about a CBQRI-enabled memory
controller as all bandwidth monitoring is currently assumed to be L3.
Therefore, my RFC series [2] that adds resctrl support for RISC-V does
not support bandwidth monitoring, but I think the scope concept could make
it work.
Thanks,
Drew
[1] https://github.com/riscv-non-isa/riscv-cbqri/releases/tag/v1.0
[2] https://lore.kernel.org/all/20260119-ssqosid-cbqri-v1-0-aa2a75153832@kernel.org/