linux-kernel - Re: [RFC] fs/resctrl: Generic schema description

Message-ID: <aXUK7XFsHl+gnwA/@x1>
Date: Sat, 24 Jan 2026 10:09:49 -0800
From: Drew Fustini <fustini@...nel.org>
To: Reinette Chatre <reinette.chatre@...el.com>
Cc: Dave Martin <Dave.Martin@....com>, linux-kernel@...r.kernel.org,
	Babu Moger <babu.moger@....com>, Fenghua Yu <fenghuay@...dia.com>,
	Tony Luck <tony.luck@...el.com>, James Morse <james.morse@....com>,
	"Chen, Yu C" <yu.c.chen@...el.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
	Dave Hansen <dave.hansen@...ux.intel.com>,
	"H. Peter Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>,
	x86@...nel.org
Subject: Re: [RFC] fs/resctrl: Generic schema description

On Tue, Dec 16, 2025 at 02:26:23PM -0800, Reinette Chatre wrote:
> Hi Babu and Fenghua,
> 
> Could you please consider how the new AMD and MPAM features [2] may benefit
> from the new interfaces proposed here? More below ...
> 
> On 10/24/25 4:12 AM, Dave Martin wrote:
> > Hi all,
> > 
> > Going forward, a single resctrl resource (such as memory bandwidth) is
> > likely to require multiple schemata, either because we want to add new
> > schemata that provide finer control, or because the hardware has
> > multiple controls, covering different aspects of resource allocation.
> > 
> > The fit between MPAM's memory bandwidth controls and the resctrl MB
> > schema is already awkward, and later Intel RDT features such as Region
> > Aware Memory Bandwidth Allocation are already pushing past what the MB
> > schema can describe.  Both of these can involve multiple control
> > values and finer resolution than the 100 steps offered by the current
> > "MB" schema.
> > 
> > The previous discussion went off in a few different directions [1], so
> > I want to focus back onto defining an extended schema description that
> > aims to cover the use cases that we know about or anticipate today, and
> > allows for future extension as needed.
> > 
> > (A separate discussion is needed on how new schemata interact with
> > previously-defined schemata (such as the MB percentage schema).  I
> > suggest we pause that discussion for now, in the interests of getting
> > the schema description nailed down.)
> > 
> > 
> > Following on from the previous mail thread, I've tried to refine and
> > flesh out the proposal for schema descriptions a bit, as follows.
> > 
> > Proposal:
> > 
> >   * Split resource names and schema names in resctrlfs.
> > 
> >     Resources will be named for the unique, existing schema for each
> >     resource.
> > 
> >     The existing schema will keep its name (the same as the resource
> >     name), and new schemata defined for a resource will include that
> >     name as a prefix (at least, by default).
> > 
> >     So, for example, we will have an MB resource with a schema called
> >     MB (the schema that we have already).  But we may go on to define
> >     additional schemata for the MB resource, with names such as MB_MAX,
> >     etc.
> > 
> >   * Stop adding new schema description information in the top-level
> >     info/<resource>/ directory in resctrlfs.
> > 
> >     For backwards compatibility, we can keep the existing property
> >     files under the resource info directory to describe the previously
> >     defined resource, but we seem to need something richer going
> >     forward.
> > 
> >   * Add a hierarchy to list all the schemata for each resource, along
> >     with their properties.  So far, the proposal looks like this,
> >     taking the MB resource as an example:
> > 
> > 	info/
> > 	 └─ MB/
> > 	     └─ resource_schemata/
> > 	         ├─ MB/
> > 	         ├─ MB_MIN/
> > 	         ├─ MB_MAX/
> > 	         ┆
> > 
> >     Here, MB, MB_MIN and MB_MAX are all schemata for the "MB" resource.
> >     In this proposal, these are just dummy schema names for
> >     illustration purposes.  The important thing is that they all
> >     control aspects of the "MB" resource, and that there can be more
> >     than one of them.
> > 
> >     It may be appropriate to have a nested hierarchy, where some
> >     schemata are presented as children of other schemata if they
> >     affect the same hardware controls.  For now, let's put this issue
> >     on one side, and consider what properties should be advertised for
> >     each schema.
> > 
> >   * Current properties that I think we might want are:
> > 
> > 	info/
> > 	 └─ SOME_RESOURCE/
> > 	     └─ resource_schemata/
> > 	         ├─ SOME_SCHEMA/
> > 	         ┆   ├─ type
> > 	             ├─ min
> > 	             ├─ max
> > 	             ├─ tolerance
> > 	             ├─ resolution
> > 	             ├─ scale
> > 	             └─ unit
> > 
> >     (I've tweaked the properties a bit since previous postings.
> >     "type" replaces "map"; "scale" is now the unit multiplier;
> >     "resolution" is now a scaling divisor -- details below.)
> > 
> >     I assume that we expose the properties in individual files, but we
> >     could also combine them into a single description file per schema,
> >     per resource or (possibly) a single global file.
> >     (I don't have a strong view on the best option.)
> > 
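To make the one-file-per-property layout concrete, here is a minimal
sketch of how userspace might read one schema description.  This is
only an illustration under assumptions: the directory layout is the
one proposed above and does not exist in any kernel today, and the
helper name read_schema_properties() is made up for this example.

```python
from pathlib import Path

# Property file names taken from the proposal quoted above.
PROPERTIES = ("type", "min", "max", "tolerance", "resolution", "scale", "unit")


def read_schema_properties(schema_dir: Path) -> dict:
    """Read the per-schema property files into a dict of raw strings.

    schema_dir is assumed to be something like
    <resctrl mount>/info/MB/resource_schemata/MB_MAX (hypothetical).
    Files that are absent are skipped, so a caller can tell an omitted
    property apart from one with an explicit value.
    """
    props = {}
    for name in PROPERTIES:
        path = schema_dir / name
        if path.is_file():
            props[name] = path.read_text().strip()
    return props
```

Values are kept as strings here because the proposal leaves open
whether bitmap properties are reported in hex.
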
> > 
> >     Either way, the following set of properties may be a reasonable
> >     place to start:
> > 
> > 
> >     type: the schema type, followed by optional flag specifiers:
> > 
> >       - "scalar": a single-valued numeric control
> > 
> >         A mandatory flag indicates how the control value written to
> >         the schemata file is converted to an amount of resource for
> >         hardware regulation.
> > 
> > 	The flag "linear" indicates a linear mapping.
> > 
> > 	In this case, the amount of resource E that is actually
> > 	allocated is derived from the control value C written to the
> > 	schemata file as follows:
> > 
> >     	E = C * scale * unit / resolution
> > 
> > 	Other flag values could be defined later, if we encounter
> > 	hardware with non-linear controls.
> > 
> >       - "bitmap": a bitmap control
> > 
> >         The optional flag "sparse" is present if the control accepts
> >         sparse bitmaps.
> > 
> > 	In this case, E = bitmap_weight(C) * scale * unit / resolution.
> > 
> > 	As before, each bit controls access to a specific chunk of
> > 	resource in the hardware, such as a group of cache lines.  All
> > 	chunks are equally sized.
> > 
> > 	(Different CTRL_MON groups may still contend within the
> > 	allocation E, when they have bits in common between their
> > 	bitmaps.)
> > 
> >     min:
> > 
> >       - For a scalar schema, the minimum value that can be written to
> >         the control when writing the schemata file.
> > 
> >       - For a bitmap schema, a bitmap of the minimum weight that the
> >         schema accepts: if an empty bitmap is accepted, this can be 0.
> >         Otherwise, if bitmaps with a single bit set are acceptable,
> >         this can just have the lowest-order bit set.
> > 
> > 	Most commonly, the value will probably be "1".
> > 
> > 	For bitmap schemata, we might report this in hex.  In the
> > 	interest of generic parsing, we could include a "0x" prefix if
> > 	so.
> > 
> >     max:
> > 
> >       - For a scalar schema, the maximum value that can be written to
> >         the control when writing the schemata file.
> > 
> >       - For a bitmap schema, the mask with all bits set.
> > 
> >         Possibly reported in hex for bitmap schemata (as for "min").
> > 
> >     tolerance:
> > 
> >         (See below for discussion on this.)
> > 
> >       - "0": the control is exact
> >       
> >       - "1": the effective control value is within ±1 of the control
> >         value written to the schemata file.  (Similarly, positive "n" ->
> >         ±n.)
> > 
> >         A negative value could be used to indicate that the tolerance
> >         is unknown.  (Possibly we could also just omit the property,
> >         though it seems better to warn userspace explicitly if we
> >         don't know.)
> > 
> > 	Tests might make use of this parameter in order to determine
> > 	how picky to be about exact measurement results.
> > 
> >     resolution:
> > 
> >       - For a proportional scalar schema: the number of divisions that
> >         the whole resource is divided into.  (See below for
> >         "proportional scalar schema".)
> > 
> > 	Typically, this will be the same as the "max" value.
> > 
> >       - For an absolute scalar schema: the divisor applied to the
> >         control value.
> > 
> >       - For a bitmap schema: the size of the bitmap in bits.
> > 
> >     scale:
> > 
> >       - For a scalar schema: the scale-up multiplier applied to
> >         "unit".
> > 
> >       - For a bitmap schema: probably "1".
> > 
> >     unit:
> > 
> >       - The base unit of the quantity measured by the control value.
> > 
> >         The special unit "all" denotes a proportional schema.  In this
> >         case, the resource is a finite, physical thing such as a cache
> >         or maxed-out data throughput of a memory controller.  The
> >         entire physical resource is available for allocation, and the
> >         control value indicates what proportion of it is allocated.
> > 
> > 	Bitmap schemata will probably all be proportional and use the
> > 	unit "all".  (This applies to cache bitmaps, at least.)
> > 
> > 	Absolute schemata will require specification of the base unit
> > 	here, say, "MBps".  The "scale" parameter can be used to avoid
> > 	proliferation of unit strings:
> > 
> > 	For example, {scale=1000, unit="MBps"} would be equivalent to
> > 	{scale=1, unit="GBps"}.
> > 
> > 
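As a rough worked example of the mapping quoted above, the following
sketch computes E = C * scale * unit / resolution for scalar schemata
and E = bitmap_weight(C) * scale * unit / resolution for bitmap
schemata.  The SchemaDesc class and effective_amount() are
hypothetical names for this illustration, and the result is expressed
as a multiple of "unit" rather than trying to interpret the unit
string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaDesc:
    """Hypothetical in-memory copy of one schema's properties."""
    type: str        # "scalar" or "bitmap" (flags such as "linear" omitted)
    resolution: int  # divisor, or bitmap size in bits
    scale: int       # multiplier applied to the unit
    unit: str        # e.g. "MBps"; "all" marks a proportional schema


def effective_amount(desc: SchemaDesc, control: int) -> float:
    """Return E as a multiple of desc.unit for control value C."""
    if desc.type == "bitmap":
        weight = bin(control).count("1")  # equivalent of bitmap_weight(C)
    else:
        weight = control                  # linear scalar mapping
    return weight * desc.scale / desc.resolution


# Proportional scalar: 50 of 100 divisions is half of "all".
percent = SchemaDesc("scalar", resolution=100, scale=1, unit="all")
# Absolute scalar: {scale=1000, unit="MBps"} describes GBps-sized steps.
absolute = SchemaDesc("scalar", resolution=1, scale=1000, unit="MBps")
# Bitmap: 4 of 8 bits set is half of "all".
cache = SchemaDesc("bitmap", resolution=8, scale=1, unit="all")
```

With these made-up descriptions, effective_amount(percent, 50) and
effective_amount(cache, 0xf0) both give 0.5 (of "all"), and
effective_amount(absolute, 2) gives 2000.0 (MBps).
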
> > Note on the "tolerance" parameter:
> > 
> > This is a new addition.  On the MPAM side, the hardware has a choice
> > about how to interpret the control value in some edge-case situations.
> > We may not reasonably be able to probe for this, so it may be useful
> > to warn software that there is an uncertainty margin.
> > 
> > We might also be able to use the "tolerance" parameter to accommodate
> > the rounding behaviour of the existing "MB" schema (otherwise, we
> > might want a special "type" for this schema, if it doesn't comply
> > closely enough).
> > 
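One way a test could consume "tolerance" is sketched below.  It is
only an illustration: the function name is made up, and clamping the
result to the advertised "min"/"max" of a scalar schema is an
assumption of this sketch, not something the proposal states.

```python
from typing import Optional, Tuple


def effective_control_range(control: int, tolerance: int,
                            lo: int, hi: int) -> Optional[Tuple[int, int]]:
    """Bounds on the effective control value for a scalar schema.

    tolerance == 0 means the control is exact, a positive n means the
    effective value is within +/-n of the value written, and a negative
    value means the tolerance is unknown (None is returned).
    Clamping to [lo, hi] is an assumption made for this sketch.
    """
    if tolerance < 0:
        return None
    return (max(lo, control - tolerance), min(hi, control + tolerance))
```

A test could then convert both bounds to resource amounts and accept
any measurement that falls between them, or skip strict checking
entirely when None comes back.
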
> > 
> > If we want to deploy resctrl under virtualisation, resctrl on the host
> > could dynamically affect the actual amount of resource that is
> > available for allocation inside a VM.
> > 
> > Whether or not we ever want to do that, it might be useful to have a
> > way to warn software that the effective control values hitting the
> > hardware may not be entirely predictable.
> > 
> > Thoughts?
> > 
> > Cheers
> > ---Dave
> 
> 
> One thing I was pondering is that resctrl currently uses L3 interchangeably
> as a scope and a resource but if instead that is separated then it should be
> easier to support interactions with a resource at a different scope.
> 
> I am concerned that, for example, support for Global Memory Bandwidth Allocation
> (GMBA) is planned to be done with a new resource. resctrl already has a
> "memory bandwidth allocation" resource and introducing a new resource to essentially
> manage the same resource, but at a different scope, sounds like a risk of fragmentation
> and duplication to me.
> 
> What if the "resource control" instead gains a new property, for example, "scope" that
> essentially communicates to user space what a domain ID in the schemata file means.
> 
> It is not clear to me what a "domain ID" of GMBA means so I will use the MPAM CPU-less
> MBM as an example that I expect will build on SMBA that supports CXL.mem. Consider an
> interface like the one below:
> 
> info
> └── SMBA
>     └── resource_schemata
>         ├── SMBA
>         │   ├── max
>         │   ├── min
>         │   ├── resolution
>         │   ├── scale
>         │   ├── scope <== contains "L3"
>         │   ├── tolerance
>         │   ├── type
>         │   └── unit
>         └── SMBA_NODE
>             ├── max
>             ├── min
>             ├── resolution
>             ├── scale
>             ├── scope <== contains "NODE"
>             ├── tolerance
>             ├── type
>             └── unit
> 
> With an interface like above there is a single resource and allocating it at a different
> scope is just another control. This correlates to how other parts of resctrl are managed.
> For example, it can become explicit that the monitor groups' mon_data directory contains
> sub-directories organized by scope. For example:
> 
> mon_data
> ├── mon_L3_00       <== monitoring data at scope L3
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_L3_01       <== monitoring data at scope L3
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_NODE_00     <== monitoring data at scope NODE
> │   └── mbm_total_bytes
> └── mon_NODE_01     <== monitoring data at scope NODE
>     └── mbm_total_bytes
> 
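To illustrate the mon_data layout quoted above, here is a small
sketch that groups directory names of the form mon_<SCOPE>_<ID> by
scope.  The mon_NODE_* names are part of the suggestion, not an
existing interface, and group_by_scope() is a made-up helper that
only handles names shaped like the ones in the example.

```python
import re
from collections import defaultdict

# Matches names such as "mon_L3_00" or "mon_NODE_01".
MON_DIR = re.compile(r"mon_(?P<scope>[A-Za-z0-9]+)_(?P<domain>\d+)")


def group_by_scope(names):
    """Map each scope to the sorted list of domain IDs found for it."""
    groups = defaultdict(list)
    for name in names:
        m = MON_DIR.fullmatch(name)
        if m is None:
            continue  # not a mon_<SCOPE>_<ID> directory, ignore it
        groups[m.group("scope")].append(int(m.group("domain")))
    return {scope: sorted(ids) for scope, ids in groups.items()}


example = ["mon_L3_00", "mon_L3_01", "mon_NODE_00", "mon_NODE_01"]
```

For the example list this yields {"L3": [0, 1], "NODE": [0, 1]}, so a
tool could pick the scope it cares about without hard-coding L3.
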
> What do you think?

I think that the ability to have different scopes for a resource would
work well for QoS on RISC-V. The CBQRI spec [1] defines bandwidth
controller operations which can be anywhere in the system. I've been
having trouble trying to decide what to do about a CBQRI-enabled memory
controller as all bandwidth monitoring is currently assumed to be L3.

Therefore, my RFC series [2] that adds resctrl support for RISC-V does
not support bandwidth monitoring, but I think the scope concept could make
it work.

Thanks,
Drew

[1] https://github.com/riscv-non-isa/riscv-cbqri/releases/tag/v1.0
[2] https://lore.kernel.org/all/20260119-ssqosid-cbqri-v1-0-aa2a75153832@kernel.org/