Message-ID: <fb1e2686-237b-4536-acd6-15159abafcba@intel.com>
Date: Tue, 16 Dec 2025 14:26:23 -0800
From: Reinette Chatre <reinette.chatre@...el.com>
To: Dave Martin <Dave.Martin@....com>, <linux-kernel@...r.kernel.org>, "Babu
Moger" <babu.moger@....com>, Fenghua Yu <fenghuay@...dia.com>,
<fustini@...nel.org>
CC: Tony Luck <tony.luck@...el.com>, James Morse <james.morse@....com>, "Chen,
Yu C" <yu.c.chen@...el.com>, Thomas Gleixner <tglx@...utronix.de>, "Ingo
Molnar" <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, Dave Hansen
<dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>, "Jonathan
Corbet" <corbet@....net>, <x86@...nel.org>
Subject: Re: [RFC] fs/resctrl: Generic schema description
Hi Babu and Fenghua,
Could you please consider how the new AMD and MPAM features [2] may benefit
from the new interfaces proposed here? More below ...
On 10/24/25 4:12 AM, Dave Martin wrote:
> Hi all,
>
> Going forward, a single resctrl resource (such as memory bandwidth) is
> likely to require multiple schemata, either because we want to add new
> schemata that provide finer control, or because the hardware has
> multiple controls, covering different aspects of resource allocation.
>
> The fit between MPAM's memory bandwidth controls and the resctrl MB
> schema is already awkward, and later Intel RDT features such as Region
> Aware Memory Bandwidth Allocation are already pushing past what the MB
> schema can describe. Both of these can involve multiple control
> values and finer resolution than the 100 steps offered by the current
> "MB" schema.
>
> The previous discussion went off in a few different directions [1], so
> I want to focus back onto defining an extended schema description that
> aims to cover the use cases that we know about or anticipate today, and
> allows for future extension as needed.
>
> (A separate discussion is needed on how new schemata interact with
> previously-defined schemata (such as the MB percentage schema).
> I suggest we pause that discussion for now, in the interests of getting
> the schema description nailed down.)
>
>
> Following on from the previous mail thread, I've tried to refine and
> flesh out the proposal for schema descriptions a bit, as follows.
>
> Proposal:
>
> * Split resource names and schema names in resctrlfs.
>
> Resources will be named for the unique, existing schema for each
> resource.
>
> The existing schema will keep its name (the same as the resource
> name), and new schemata defined for a resource will include that
> name as a prefix (at least, by default).
>
> So, for example, we will have an MB resource with a schema called
> MB (the schema that we have already). But we may go on to define
> additional schemata for the MB resource, with names such as MB_MAX,
> etc.
>
> * Stop adding new schema description information in the top-level
> info/<resource>/ directory in resctrlfs.
>
> For backwards compatibility, we can keep the existing property
> files under the resource info directory to describe the previously
> defined resource, but we seem to need something richer going
> forward.
>
> * Add a hierarchy to list all the schemata for each resource, along
> with their properties. So far, the proposal looks like this,
> taking the MB resource as an example:
>
>   info/
>    └─ MB/
>        └─ resource_schemata/
>            ├─ MB/
>            ├─ MB_MIN/
>            ├─ MB_MAX/
>            ┆
>
> Here, MB, MB_MIN and MB_MAX are all schemata for the "MB" resource.
> In this proposal, these are just dummy schema names for
> illustration purposes. The important thing is that they all
> control aspects of the "MB" resource, and that there can be more
> than one of them.
>
> It may be appropriate to have a nested hierarchy, where some
> schemata are presented as children of other schemata if they
> affect the same hardware controls. For now, let's put this issue
> on one side, and consider what properties should be advertised for
> each schema.
>
> * Current properties that I think we might want are:
>
>   info/
>    └─ SOME_RESOURCE/
>        └─ resource_schemata/
>            ├─ SOME_SCHEMA/
>            ┆   ├─ type
>                ├─ min
>                ├─ max
>                ├─ tolerance
>                ├─ resolution
>                ├─ scale
>                └─ unit
>
> (I've tweaked the properties a bit since previous postings.
> "type" replaces "map"; "scale" is now the unit multiplier;
> "resolution" is now a scaling divisor -- details below.)
>
> I assume that we expose the properties in individual files, but we
> could also combine them into a single description file per schema,
> per resource or (possibly) a single global file.
> (I don't have a strong view on the best option.)
>
>
> Either way, the following set of properties may be a reasonable
> place to start:
>
>
> type: the schema type, followed by optional flag specifiers:
>
> - "scalar": a single-valued numeric control
>
> A mandatory flag indicates how the control value written to
> the schemata file is converted to an amount of resource for
> hardware regulation.
>
> The flag "linear" indicates a linear mapping.
>
> In this case, the amount of resource E that is actually
> allocated is derived from the control value C written to the
> schemata file as follows:
>
> E = C * scale * unit / resolution
>
> Other flag values could be defined later, if we encounter
> hardware with non-linear controls.
>
> - "bitmap": a bitmap control
>
> The optional flag "sparse" is present if the control accepts
> sparse bitmaps.
>
> In this case, E = bitmap_weight(C) * scale * unit / resolution.
>
> As before, each bit controls access to a specific chunk of
> resource in the hardware, such as a group of cache lines. All
> chunks are equally sized.
>
> (Different CTRL_MON groups may still contend within the
> allocation E, when they have bits in common between their
> bitmaps.)
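
As a rough illustration only (not part of the proposal), the two formulas above can be sketched in Python. The function names and the string arguments are hypothetical; the result is expressed as a multiple of "unit", and only the "linear" flag is handled for scalar schemata.

```python
def bitmap_weight(mask: int) -> int:
    """Number of set bits in mask (the population count)."""
    return bin(mask).count("1")


def effective_amount(schema_type: str, control: int,
                     scale: int, resolution: int) -> float:
    """Return E as a multiple of "unit".

    scalar (linear): E = C * scale / resolution
    bitmap:          E = bitmap_weight(C) * scale / resolution
    """
    if resolution <= 0:
        raise ValueError("resolution must be positive")
    if schema_type == "scalar":
        count = control
    elif schema_type == "bitmap":
        count = bitmap_weight(control)
    else:
        raise ValueError(f"unknown schema type: {schema_type!r}")
    return count * scale / resolution
```

For a proportional schema with resolution 100 and scale 1, a control value of 50 would then map to half of the resource; for a 12-bit cache bitmap, a mask with 8 bits set would map to two thirds of it.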
>
> min:
>
> - For a scalar schema, the minimum value that can be written to
> the control when writing the schemata file.
>
> - For a bitmap schema, a bitmap of the minimum weight that the
> schema accepts: if an empty bitmap is accepted, this can be 0.
> Otherwise, if bitmaps with a single bit set are acceptable,
> this can just have the lowest-order bit set.
>
> Most commonly, the value will probably be "1".
>
> For bitmap schemata, we might report this in hex. In the
> interest of generic parsing, we could include a "0x" prefix if
> so.
>
> max:
>
> - For a scalar schema, the maximum value that can be written to
> the control when writing the schemata file.
>
> - For a bitmap schema, the mask with all bits set.
>
> Possibly reported in hex for bitmap schemata (as for "min").
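
A minimal sketch of how a generic parser might read "min" and "max", assuming values are reported either in decimal or in hex with a "0x" prefix as suggested above. The helper names are hypothetical, and the bitmap check deliberately ignores the "sparse" flag.

```python
def parse_control_value(text: str) -> int:
    """Parse a decimal value or a "0x"-prefixed hex value."""
    value = text.strip()
    if value.lower().startswith("0x"):
        return int(value, 16)
    return int(value, 10)


def bitmap_in_range(mask: int, min_mask: int, max_mask: int) -> bool:
    """True if mask uses only bits of max_mask and has at least
    as many bits set as min_mask (contiguity is not checked)."""
    fits = (mask & ~max_mask) == 0
    heavy_enough = bin(mask).count("1") >= bin(min_mask).count("1")
    return fits and heavy_enough
```

With min = "0x1" and max = "0xfff", the mask 0x0f0 would be accepted, while an empty mask or a mask with bit 12 set would be rejected.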
>
> tolerance:
>
> (See below for discussion on this.)
>
> - "0": the control is exact
>
> - "1": the effective control value is within ±1 of the control
> value written to the schemata file. (Similarly, positive "n" ->
> ±n.)
>
> A negative value could be used to indicate that the tolerance
> is unknown. (Possibly we could also just omit the property,
> though it seems better to warn userspace explicitly if we
> don't know.)
>
> Tests might make use of this parameter in order to determine
> how picky to be about exact measurement results.
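
One way a test could consume "tolerance", sketched under the semantics described above (0 is exact, positive n means ±n, negative means unknown). The function name and the choice of returning None for an unknown tolerance are assumptions made for illustration.

```python
def within_tolerance(written: int, effective: int, tolerance: int):
    """Compare the written control value with the effective one.

    Returns True or False when the tolerance is known, and None when
    the tolerance is negative (unknown), so the caller can decide
    whether to skip the check or only warn.
    """
    if tolerance < 0:
        return None
    return abs(effective - written) <= tolerance
```

A test suite could treat None as "skip the exactness check" rather than as a failure.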
>
> resolution:
>
> - For a proportional scalar schema: the number of divisions that
> the whole resource is divided into. (See below for
> "proportional scalar schema".)
>
> Typically, this will be the same as the "max" value.
>
> - For an absolute scalar schema: the divisor applied to the
> control value.
>
> - For a bitmap schema: the size of the bitmap in bits.
>
> scale:
>
> - For a scalar schema: the scale-up multiplier applied to
> "unit".
>
> - For a bitmap schema: probably "1".
>
> unit:
>
> - The base unit of the quantity measured by the control value.
>
> The special unit "all" denotes a proportional schema. In this
> case, the resource is a finite, physical thing such as a cache
> or maxed-out data throughput of a memory controller. The
> entire physical resource is available for allocation, and the
> control value indicates what proportion of it is allocated.
>
> Bitmap schemata will probably all be proportional and use the
> unit "all". (This applies to cache bitmaps, at least.)
>
> Absolute schemata will require specification of the base unit
> here, say, "MBps". The "scale" parameter can be used to avoid
> proliferation of unit strings:
>
> For example, {scale=1000, unit="MBps"} would be equivalent to
> {scale=1, unit="GBps"}.
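
To illustrate the scale/unit equivalence above, here is a small sketch for an absolute schema. The unit table is hypothetical (only the two unit strings from the example, with decimal multipliers assumed), as are the function names.

```python
# Hypothetical table: value of each unit string expressed in MBps.
UNIT_IN_MBPS = {"MBps": 1, "GBps": 1000}


def base_quantum_mbps(scale: int, unit: str) -> int:
    """Return scale * unit, expressed in MBps."""
    return scale * UNIT_IN_MBPS[unit]


def absolute_bandwidth_mbps(control: int, scale: int, unit: str,
                            resolution: int) -> float:
    """E = C * scale * unit / resolution, expressed in MBps."""
    if resolution <= 0:
        raise ValueError("resolution must be positive")
    return control * base_quantum_mbps(scale, unit) / resolution
```

Under these assumptions, {scale=1000, unit="MBps"} and {scale=1, unit="GBps"} yield the same quantum, so userspace only needs to understand a small set of unit strings.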
>
>
> Note on the "tolerance" parameter:
>
> This is a new addition. On the MPAM side, the hardware has a choice
> about how to interpret the control value in some edge-case situations.
> We may not reasonably be able to probe for this, so it may be useful
> to warn software that there is an uncertainty margin.
>
> We might also be able to use the "tolerance" parameter to accommodate
> the rounding behaviour of the existing "MB" schema (otherwise, we
> might want a special "type" for this schema, if it doesn't comply
> closely enough).
>
>
> If we want to deploy resctrl under virtualisation, resctrl on the host
> could dynamically affect the actual amount of resource that is
> available for allocation inside a VM.
>
> Whether or not we ever want to do that, it might be useful to have a
> way to warn software that the effective control values hitting the
> hardware may not be entirely predictable.
>
> Thoughts?
>
> Cheers
> ---Dave
One thing I was pondering is that resctrl currently uses L3 interchangeably
as a scope and as a resource. If the two are separated instead, it should be
easier to support interactions with a resource at a different scope.
I am concerned that, for example, support for Global Memory Bandwidth Allocation
(GMBA) is planned to be done with a new resource. resctrl already has a
"memory bandwidth allocation" resource and introducing a new resource to essentially
manage the same resource, but at a different scope, sounds like a risk of fragmentation
and duplication to me.
What if the "resource control" instead gains a new property, for example "scope", that
essentially communicates to user space what a domain ID in the schemata file means?
It is not clear to me what a "domain ID" of GMBA means, so I will use the MPAM CPU-less
MBM as an example that I expect will build on SMBA that supports CXL.mem. Consider an interface
like the one below:
info
└── SMBA
    └── resource_schemata
        ├── SMBA
        │   ├── max
        │   ├── min
        │   ├── resolution
        │   ├── scale
        │   ├── scope       <== contains "L3"
        │   ├── tolerance
        │   ├── type
        │   └── unit
        └── SMBA_NODE
            ├── max
            ├── min
            ├── resolution
            ├── scale
            ├── scope       <== contains "NODE"
            ├── tolerance
            ├── type
            └── unit
With an interface like the one above there is a single resource, and allocating it at a different
scope is just another control. This correlates to how other parts of resctrl are managed.
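
To make the user-space side concrete, here is a rough sketch of how a tool might discover the schemata of one resource and their scopes. It assumes the directory layout proposed above, which is not an existing kernel interface, and that each property is a small text file; the function name and property list are hypothetical.

```python
from pathlib import Path

# Property files from the proposal, plus the suggested "scope".
PROPERTIES = ("type", "min", "max", "tolerance",
              "resolution", "scale", "unit", "scope")


def read_schemata(info_dir, resource):
    """Map schema name -> {property: text} for one resource.

    Walks <info_dir>/<resource>/resource_schemata/<schema>/ and reads
    whichever of the known property files are present.
    """
    base = Path(info_dir) / resource / "resource_schemata"
    result = {}
    for schema_dir in sorted(p for p in base.iterdir() if p.is_dir()):
        props = {}
        for name in PROPERTIES:
            path = schema_dir / name
            if path.is_file():
                props[name] = path.read_text().strip()
        result[schema_dir.name] = props
    return result
```

A caller could then look up, for example, read_schemata("/sys/fs/resctrl/info", "SMBA")["SMBA_NODE"]["scope"] to learn what a domain ID means for that control.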
For example, it can become explicit that the monitor groups' mon_data directory contains
sub-directories organized by scope. For example:
mon_data
├── mon_L3_00       <== monitoring data at scope L3
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_L3_01       <== monitoring data at scope L3
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_NODE_00     <== monitoring data at scope NODE
│   └── mbm_total_bytes
└── mon_NODE_01     <== monitoring data at scope NODE
    └── mbm_total_bytes
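
As an illustration of how user space could rely on that naming, the sketch below splits a mon_data sub-directory name into scope and domain ID. The mon_NODE_* names are part of this suggestion rather than an existing interface, and the pattern and function names are hypothetical.

```python
import re

# mon_<SCOPE>_<domain id>, e.g. mon_L3_00 or (proposed) mon_NODE_01
MON_DIR = re.compile(r"^mon_(?P<scope>[A-Za-z0-9]+)_(?P<domain>\d+)$")


def parse_mon_dir(name):
    """Return (scope, domain_id) for a mon_data sub-directory name."""
    match = MON_DIR.match(name)
    if match is None:
        raise ValueError(f"unrecognised mon_data entry: {name!r}")
    return match.group("scope"), int(match.group("domain"))


def group_by_scope(names):
    """Map scope -> list of domain IDs, in the order given."""
    grouped = {}
    for name in names:
        scope, domain = parse_mon_dir(name)
        grouped.setdefault(scope, []).append(domain)
    return grouped
```

Applied to the four directories above, this would group domains 0 and 1 under "L3" and domains 0 and 1 under "NODE".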
What do you think?
Reinette
> [1] Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
> https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/
[2] https://lpc.events/event/19/contributions/2093/attachments/1958/4172/resctrl%20Microconference%20LPC%202025%20Tokyo.pdf