linux-kernel - Re: [RFC] fs/resctrl: Generic schema description

Message-ID: <fb1e2686-237b-4536-acd6-15159abafcba@intel.com>
Date: Tue, 16 Dec 2025 14:26:23 -0800
From: Reinette Chatre <reinette.chatre@...el.com>
To: Dave Martin <Dave.Martin@....com>, <linux-kernel@...r.kernel.org>, "Babu
 Moger" <babu.moger@....com>, Fenghua Yu <fenghuay@...dia.com>,
	<fustini@...nel.org>
CC: Tony Luck <tony.luck@...el.com>, James Morse <james.morse@....com>, "Chen,
 Yu C" <yu.c.chen@...el.com>, Thomas Gleixner <tglx@...utronix.de>, "Ingo
 Molnar" <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, Dave Hansen
	<dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>, "Jonathan
 Corbet" <corbet@....net>, <x86@...nel.org>
Subject: Re: [RFC] fs/resctrl: Generic schema description

Hi Babu and Fenghua,

Could you please consider how the new AMD and MPAM features [2] may benefit
from the new interfaces proposed here? More below ...

On 10/24/25 4:12 AM, Dave Martin wrote:
> Hi all,
> 
> Going forward, a single resctrl resource (such as memory bandwidth) is
> likely to require multiple schemata, either because we want to add new
> schemata that provide finer control, or because the hardware has
> multiple controls, covering different aspects of resource allocation.
> 
> The fit between MPAM's memory bandwidth controls and the resctrl MB
> schema is already awkward, and later Intel RDT features such as Region
> Aware Memory Bandwidth Allocation are already pushing past what the MB
> schema can describe.  Both of these can involve multiple control
> values and finer resolution than the 100 steps offered by the current
> "MB" schema.
> 
> The previous discussion went off in a few different directions [1], so
> I want to focus back onto defining an extended schema description that
> aims to cover the use cases that we know about or anticipate today, and
> allows for future extension as needed.
> 
> (A separate discussion is needed on how new schemata interact with
> previously-defined schemata (such as the MB percentage schema).  I
> suggest we pause that discussion for now, in the interests of getting
> the schema description nailed down.)
> 
> 
> Following on from the previous mail thread, I've tried to refine and
> flesh out the proposal for schema descriptions a bit, as follows.
> 
> Proposal:
> 
>   * Split resource names and schema names in resctrlfs.
> 
>     Resources will be named for the unique, existing schema for each
>     resource.
> 
>     The existing schema will keep its name (the same as the resource
>     name), and new schemata defined for a resource will include that
>     name as a prefix (at least, by default).
> 
>     So, for example, we will have an MB resource with a schema called
>     MB (the schema that we have already).  But we may go on to define
>     additional schemata for the MB resource, with names such as MB_MAX,
>     etc.
> 
>   * Stop adding new schema description information in the top-level
>     info/<resource>/ directory in resctrlfs.
> 
>     For backwards compatibility, we can keep the existing property
>     files under the resource info directory to describe the previously
>     defined resource, but we seem to need something richer going
>     forward.
> 
>   * Add a hierarchy to list all the schemata for each resource, along
>     with their properties.  So far, the proposal looks like this,
>     taking the MB resource as an example:
> 
> 	info/
> 	 └─ MB/
> 	     └─ resource_schemata/
> 	         ├─ MB/
> 	         ├─ MB_MIN/
> 	         ├─ MB_MAX/
> 	         ┆
> 
>     Here, MB, MB_MIN and MB_MAX are all schemata for the "MB" resource.
>     In this proposal, these are just dummy schema names for
>     illustration purposes.  The important thing is that they all
>     control aspects of the "MB" resource, and that there can be more
>     than one of them.
> 
>     It may be appropriate to have a nested hierarchy, where some
>     schemata are presented as children of other schemata if they
>     affect the same hardware controls.  For now, let's put this issue
>     on one side, and consider what properties should be advertised for
>     each schema.
> 
>   * Current properties that I think we might want are:
> 
> 	info/
> 	 └─ SOME_RESOURCE/
> 	     └─ resource_schemata/
> 	         ├─ SOME_SCHEMA/
> 	         ┆   ├─ type
> 	             ├─ min
> 	             ├─ max
> 	             ├─ tolerance
> 	             ├─ resolution
> 	             ├─ scale
> 	             └─ unit
> 
>     (I've tweaked the properties a bit since previous postings.
>     "type" replaces "map"; "scale" is now the unit multiplier;
>     "resolution" is now a scaling divisor -- details below.)
> 
>     I assume that we expose the properties in individual files, but we
>     could also combine them into a single description file per schema,
>     per resource or (possibly) a single global file.
>     (I don't have a strong view on the best option.)
> 
> 
>     Either way, the following set of properties may be a reasonable
>     place to start:
> 
> 
>     type: the schema type, followed by optional flag specifiers:
> 
>       - "scalar": a single-valued numeric control
> 
>         A mandatory flag indicates how the control value written to
>         the schemata file is converted to an amount of resource for
>         hardware regulation.
> 
> 	The flag "linear" indicates a linear mapping.
> 
> 	In this case, the amount of resource E that is actually
> 	allocated is derived from the control value C written to the
> 	schemata file as follows:
> 
>     	E = C * scale * unit / resolution
> 
> 	Other flag values could be defined later, if we encounter
> 	hardware with non-linear controls.
> 
>       - "bitmap": a bitmap control
> 
>         The optional flag "sparse" is present if the control accepts
>         sparse bitmaps.
> 
> 	In this case, E = bitmap_weight(C) * scale * unit / resolution.
> 
> 	As before, each bit controls access to a specific chunk of
> 	resource in the hardware, such as a group of cache lines.  All
> 	chunks are equally sized.
> 
> 	(Different CTRL_MON groups may still contend within the
> 	allocation E, when they have bits in common between their
> 	bitmaps.)
> 
>     min:
> 
>       - For a scalar schema, the minimum value that can be written to
>         the control when writing the schemata file.
> 
>       - For a bitmap schema, a bitmap of the minimum weight that the
>         schema accepts: if an empty bitmap is accepted, this can be 0.
>         Otherwise, if bitmaps with a single bit set are acceptable,
>         this can just have the lowest-order bit set.
> 
> 	Most commonly, the value will probably be "1".
> 
> 	For bitmap schemata, we might report this in hex.  In the
> 	interest of generic parsing, we could include a "0x" prefix if
> 	so.
> 
>     max:
> 
>       - For a scalar schema, the maximum value that can be written to
>         the control when writing the schemata file.
> 
>       - For a bitmap schema, the mask with all bits set.
> 
>         Possibly reported in hex for bitmap schemata (as for "min").
> 
>     tolerance:
> 
>         (See below for discussion on this.)
> 
>       - "0": the control is exact
>       
>       - "1": the effective control value is within ±1 of the control
>         value written to the schemata file.  (Similarly, positive "n" ->
>         ±n.)
> 
>         A negative value could be used to indicate that the tolerance
>         is unknown.  (Possibly we could also just omit the property,
>         though it seems better to warn userspace explicitly if we
>         don't know.)
> 
> 	Tests might make use of this parameter in order to determine
> 	how picky to be about exact measurement results.
> 
>     resolution:
> 
>       - For a proportional scalar schema: the number of divisions that
>         the whole resource is divided into.  (See below for
>         "proportional scalar schema".)
> 
> 	Typically, this will be the same as the "max" value.
> 
>       - For an absolute scalar schema: the divisor applied to the
>         control value.
> 
>       - For a bitmap schema: the size of the bitmap in bits.
> 
>     scale:
> 
>       - For a scalar schema: the scale-up multiplier applied to
>         "unit".
> 
>       - For a bitmap schema: probably "1".
> 
>     unit:
> 
>       - The base unit of the quantity measured by the control value.
> 
>         The special unit "all" denotes a proportional schema.  In this
>         case, the resource is a finite, physical thing such as a cache
>         or maxed-out data throughput of a memory controller.  The
>         entire physical resource is available for allocation, and the
>         control value indicates what proportion of it is allocated.
> 
> 	Bitmap schemata will probably all be proportional and use the
> 	unit "all".  (This applies to cache bitmaps, at least.)
> 
> 	Absolute schemata will require specification of the base unit
> 	here, say, "MBps".  The "scale" parameter can be used to avoid
> 	proliferation of unit strings:
> 
> 	For example, {scale=1000, unit="MBps"} would be equivalent to
> 	{scale=1, unit="GBps"}.
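
To make the arithmetic above concrete, here is a minimal sketch of how user space might turn a control value into the effective amount E, expressed as a multiple of "unit". The function names are hypothetical and nothing here is an existing resctrl interface; it only restates the formulas E = C * scale * unit / resolution and E = bitmap_weight(C) * scale * unit / resolution from the proposal.

```python
from fractions import Fraction


def bitmap_weight(mask: int) -> int:
    """Number of set bits in the mask."""
    return bin(mask).count("1")


def effective_amount(schema_type: str, control: int, scale: int, resolution: int) -> Fraction:
    """Return E as a multiple of the schema's "unit" (hypothetical helper)."""
    if schema_type == "scalar":
        quantity = control                 # "linear" flag: used as-is
    elif schema_type == "bitmap":
        quantity = bitmap_weight(control)  # each set bit is one equal chunk
    else:
        raise ValueError(f"unknown schema type: {schema_type}")
    return Fraction(quantity * scale, resolution)


# Proportional scalar schema (unit "all"), resolution 100: 50 is half of it.
print(effective_amount("scalar", 50, 1, 100))   # 1/2
# 8-bit bitmap with 4 bits set: also half of the resource.
print(effective_amount("bitmap", 0xF0, 1, 8))   # 1/2
# Absolute schema with {scale=1000, unit="MBps"}: control 2 is 2000 MBps,
# the same amount as control 2 with {scale=1, unit="GBps"}.
print(effective_amount("scalar", 2, 1000, 1))   # 2000
```

Exact fractions are used so that no rounding is introduced beyond what the hardware itself does.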
> 
> 
> Note on the "tolerance" parameter:
> 
> This is a new addition.  On the MPAM side, the hardware has a choice
> about how to interpret the control value in some edge-case situations.
> We may not reasonably be able to probe for this, so it may be useful
> to warn software that there is an uncertainty margin.
> 
> We might also be able to use the "tolerance" parameter to accommodate
> the rounding behaviour of the existing "MB" schema (otherwise, we
> might want a special "type" for this schema, if it doesn't comply
> closely enough).
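
As an illustration of how a test might consume "tolerance", the sketch below compares a written control value with an observed effective value. The helper name and the choice of returning None for an unknown (negative) tolerance are assumptions made for this example, not part of the proposal.

```python
def matches_within_tolerance(written: int, effective: int, tolerance: int):
    """Check an observed control value against the advertised tolerance.

    Returns True or False when the tolerance is known (>= 0), and None
    when it is negative, which the proposal suggests could mean "unknown".
    """
    if tolerance < 0:
        return None
    return abs(effective - written) <= tolerance


print(matches_within_tolerance(50, 50, 0))    # True: exact control
print(matches_within_tolerance(50, 51, 1))    # True: within +/-1
print(matches_within_tolerance(50, 52, 1))    # False: outside +/-1
print(matches_within_tolerance(50, 52, -1))   # None: tolerance unknown
```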
> 
> 
> If we want to deploy resctrl under virtualisation, resctrl on the host
> could dynamically affect the actual amount of resource that is
> available for allocation inside a VM.
> 
> Whether or not we ever want to do that, it might be useful to have a
> way to warn software that the effective control values hitting the
> hardware may not be entirely predictable.
> 
> Thoughts?
> 
> Cheers
> ---Dave


One thing I was pondering is that resctrl currently uses L3 interchangeably
as a scope and a resource but if instead that is separated then it should be
easier to support interactions with resources at a different scope.

I am concerned that, for example, support for Global Memory Bandwidth Allocation
(GMBA) is planned to be done with a new resource. resctrl already has a
"memory bandwidth allocation" resource and introducing a new resource to essentially
manage the same resource, but at a different scope, sounds like a risk of fragmentation
and duplication to me.

What if the "resource control" instead gains a new property, for example, "scope" that
essentially communicates to user space what a domain ID in the schemata file means?

It is not clear to me what a "domain ID" of GMBA means so I will use the MPAM CPU-less
MBM as example that I expect will build on SMBA that supports CXL.mem. Consider an interface
like below:

info
└── SMBA
    └── resource_schemata
        ├── SMBA
        │   ├── max
        │   ├── min
        │   ├── resolution
        │   ├── scale
        │   ├── scope <== contains "L3"
        │   ├── tolerance
        │   ├── type
        │   └── unit
        └── SMBA_NODE
            ├── max
            ├── min
            ├── resolution
            ├── scale
            ├── scope <== contains "NODE"
            ├── tolerance
            ├── type
            └── unit
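
A rough sketch of how user space could discover the scope of each control under such a layout follows. It assumes one property per file, as in the tree above; the directory layout is only proposed, and the helper names are hypothetical.

```python
from pathlib import Path

# Property file names taken from the tree above.
PROPERTIES = ("type", "min", "max", "tolerance", "resolution", "scale", "scope", "unit")


def read_schema_properties(schema_dir):
    """Read each property file that exists under one schema directory."""
    props = {}
    for name in PROPERTIES:
        path = Path(schema_dir) / name
        if path.is_file():
            props[name] = path.read_text().strip()
    return props


def schemata_by_scope(resource_dir):
    """Group the schema names of one resource by their "scope" property."""
    by_scope = {}
    root = Path(resource_dir) / "resource_schemata"
    for schema_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        scope = read_schema_properties(schema_dir).get("scope", "unknown")
        by_scope.setdefault(scope, []).append(schema_dir.name)
    return by_scope
```

Pointed at the info/SMBA directory in the example, schemata_by_scope() would be expected to return {"L3": ["SMBA"], "NODE": ["SMBA_NODE"]}.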

With an interface like above there is a single resource and allocating it at a different
scope is just another control. This correlates to how other parts of resctrl are managed.
For example, it can become explicit that the monitor groups' mon_data directory contains
sub-directories organized by scope. For example:

mon_data
├── mon_L3_00       <== monitoring data at scope L3
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_L3_01       <== monitoring data at scope L3
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_NODE_00     <== monitoring data at scope NODE
│   └── mbm_total_bytes
└── mon_NODE_01     <== monitoring data at scope NODE
    └── mbm_total_bytes
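
If the directory names follow the mon_<scope>_<domain ID> pattern shown above, user space could split them as sketched below. The regular expression is an assumption that covers only the names in this example; it deliberately rejects anything else rather than guessing.

```python
import re

# Matches names such as "mon_L3_00" or "mon_NODE_01" from the example tree.
MON_DIR = re.compile(r"^mon_(?P<scope>[A-Za-z0-9]+)_(?P<domain>\d+)$")


def parse_mon_dir(name):
    """Split a mon_data sub-directory name into (scope, domain ID)."""
    match = MON_DIR.match(name)
    if match is None:
        raise ValueError(f"unrecognised mon_data directory name: {name}")
    return match.group("scope"), int(match.group("domain"))


def group_by_scope(names):
    """Map each scope to the sorted list of domain IDs found for it."""
    grouped = {}
    for name in names:
        scope, domain = parse_mon_dir(name)
        grouped.setdefault(scope, []).append(domain)
    return {scope: sorted(ids) for scope, ids in grouped.items()}


print(group_by_scope(["mon_L3_00", "mon_L3_01", "mon_NODE_00", "mon_NODE_01"]))
```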

What do you think?

Reinette
 
> [1] Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
> https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/

[2] https://lpc.events/event/19/contributions/2093/attachments/1958/4172/resctrl%20Microconference%20LPC%202025%20Tokyo.pdf