Message-ID: <aV6Ba/hboKcJjyhY@e133380.arm.com>
Date: Wed, 7 Jan 2026 15:53:15 +0000
From: Dave Martin <Dave.Martin@....com>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: Reinette Chatre <reinette.chatre@...el.com>,
	Babu Moger <babu.moger@....com>, Fenghua Yu <fenghuay@...dia.com>,
	Tony Luck <tony.luck@...el.com>, James Morse <james.morse@....com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
	Dave Hansen <dave.hansen@...ux.intel.com>,
	"H. Peter Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>,
	x86@...nel.org, linux-kernel@...r.kernel.org, fustini@...nel.org
Subject: Re: [RFC] fs/resctrl: Generic schema description

Hi,

On Fri, Dec 26, 2025 at 06:38:52PM +0800, Chen, Yu C wrote:
> Hi Reinette and all,
> 
> On 12/17/2025 6:26 AM, Reinette Chatre wrote:
> > Hi Babu and Fenghua,
> > 
> > Could you please consider how the new AMD and MPAM features [2] may benefit
> > from the new interfaces proposed here? More below ...
> > 
> > On 10/24/25 4:12 AM, Dave Martin wrote:
> 
> [snip]
> 
> > 
> > One thing I was pondering is that resctrl currently uses L3 interchangeably
> > as a scope and a resource. If those were instead separated, it should be
> > easier to support interactions with resources at a different scope.
> > 
> > I am concerned that, for example, support for Global Memory Bandwidth Allocation
> > (GMBA) is planned to be done with a new resource. resctrl already has a
> > "memory bandwidth allocation" resource and introducing a new resource to essentially
> > manage the same resource, but at a different scope, sounds like a risk of fragmentation
> > and duplication to me.
> > 
> > What if the "resource control" instead gains a new property, for example, "scope" that
> > essentially communicates to user space what a domain ID in the schemata file means.
> > 
> > It is not clear to me what a "domain ID" of GMBA means, so I will use MPAM
> > CPU-less MBM as an example; I expect it will build on SMBA, which supports
> > CXL.mem. Consider an interface like the one below:
> > 
> > info
> > └── SMBA
> >      └── resource_schemata
> >          ├── SMBA
> >          │   ├── max
> >          │   ├── min
> >          │   ├── resolution
> >          │   ├── scale
> >          │   ├── scope <== contains "L3"

I guess we already have this confusion about domain IDs with monitoring
domains not necessarily being the same as control domains.

(The generic schema description does not try to address monitoring
domains, but the concept is still valid...)

"scope" seems a reasonable name.

What values would be expected here for the pre-existing schemata?

I'm thinking

	"L2" for L2_foo schemata

	"L3" for L3_foo

	"L3" for MB (at least for the old MB schema)


Is it worth splitting out the level as a separate value?  e.g.,

	scope = "cache"
	level = 3

Not all scopes will need a "level" parameter.

(This may not be sufficient for the region-aware case that Chenyu
outlines below.)
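
To make the split concrete, here is a rough Python sketch (the Scope type and
the schema-to-scope mapping are illustrative assumptions, not anything that
exists in resctrl):

```python
# Hypothetical sketch: represent a schema's scope as a (kind, level) pair,
# where "level" is only meaningful for cache scopes. Names are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Scope:
    kind: str                    # e.g. "cache", "node"
    level: Optional[int] = None  # only set for kind == "cache"

# Possible values for the pre-existing schemata discussed above:
SCHEMA_SCOPES = {
    "L2": Scope("cache", 2),  # L2_foo schemata
    "L3": Scope("cache", 3),  # L3_foo schemata
    "MB": Scope("cache", 3),  # the old MB schema is L3-scoped
}

def describe(scope: Scope) -> str:
    """Render the scope the way the info files might expose it."""
    if scope.level is not None:
        return f'scope = "{scope.kind}"\nlevel = {scope.level}'
    return f'scope = "{scope.kind}"'
```

Scopes without a level (e.g. a NUMA-node scope) would simply omit the "level"
line in this sketch.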

> >          │   ├── tolerance
> >          │   ├── type
> >          │   └── unit
> >          └── SMBA_NODE
> >              ├── max
> >              ├── min
> >              ├── resolution
> >              ├── scale
> >              ├── scope <== contains "NODE"
> 
> Would it be more user-friendly to explicitly show "node0, node1, ..."
> rather than "NODE"? After all, we can already infer the "NODE" type from
> the schemata name "SMBA_NODE".

I think that having an explicit declaration of the scope is probably
useful even for things that are included in the schema name.

Part of the reason for describing the schema explicitly is because
inferring everything from the name does not feel scalable as we add
more different schemata and resource types.

Having said that, the schema names should still provide a good clue as
to what the schema represents.


I'm not sure that we should simply list possible domain IDs here:

For MPAM, the domain IDs can be huge, random-looking numbers that do
not necessarily start from 0 (as currently implemented in the MPAM
driver).

In any case, we need not just names for the individual domain IDs, but
an idea of what they represent.


Maybe we could stick with opaque "scope" names as in Reinette's
proposal, and solve the problem of enumerating the domain IDs separately.


For the commonly-used scopes, we probably don't need to bother, since
the enumeration is available elsewhere:

 * for NUMA nodes, /sys/devices/system/node/node*
   (or /sys/devices/system/node/possible) ?

 * for cache IDs at level <n>, the set of values present in all the
   files /sys/devices/system/cpu/cpu*/cache/index<n>/id ?
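As an illustration of that last bullet, a minimal sketch (in practice each
value would be read from the per-CPU sysfs files above; the mapping here is a
stand-in so the logic is self-contained):

```python
# Sketch: collect the set of distinct cache IDs at a given level from
# per-CPU values, i.e. the domain IDs user space would expect to see.
# The per-CPU IDs would come from
# /sys/devices/system/cpu/cpu*/cache/index<n>/id; here they are passed in
# as a plain mapping.
def cache_ids_at_level(per_cpu_ids: dict[int, int]) -> list[int]:
    """per_cpu_ids maps a CPU number to its cache ID at the level of
    interest; the result is the sorted set of distinct domain IDs."""
    return sorted(set(per_cpu_ids.values()))

# e.g. 4 CPUs sharing two L3 caches:
# cache_ids_at_level({0: 0, 1: 0, 2: 1, 3: 1}) -> [0, 1]
```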

> >              ├── tolerance
> >              ├── type
> >              └── unit
> > 
> > With an interface like above there is a single resource and allocating it at a different
> > scope is just another control. This correlates to how other parts of resctrl are managed.
> > It can become explicit, for instance, that the monitor groups' mon_data directory contains
> > sub-directories organized by scope. For example:
> > 
> > mon_data
> > ├── mon_L3_00       <== monitoring data at scope L3
> > │   ├── llc_occupancy
> > │   ├── mbm_local_bytes
> > │   └── mbm_total_bytes
> > ├── mon_L3_01       <== monitoring data at scope L3
> > │   ├── llc_occupancy
> > │   ├── mbm_local_bytes
> > │   └── mbm_total_bytes
> > ├── mon_NODE_00     <== monitoring data at scope NODE
> 
> Does this mean the domain ID is "0", which corresponds to node0?
> This seems to align with Fenghua's presentation at LPC, where he
> mentioned that for CPU-less resctrl, the domain ID changes from an
> L3 ID to a node ID.

In an ideal world, we would have a generic description for the monitors.
Coming up with a "scope" concept that works for monitoring domains
feels like something we should aim for, even if we don't yet describe
this explicitly for monitors.

Then, we could say that mon_L3_00 has

	scope = "cache"
	level = 3
	domain = 0

(assuming that the monitoring domain really does align with the cache
control domain).
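
A toy parser for the mon_<scope>_<id> naming could look like this
(hypothetical helper, assuming the cache/level/domain split sketched above;
this is not resctrl code):

```python
# Illustrative: split a mon_data directory name such as "mon_L3_00" into
# the scope/level/domain description discussed above.
import re

def parse_mon_dir(name: str) -> dict:
    m = re.fullmatch(r"mon_([A-Za-z]+?)(\d*)_(\d+)", name)
    if not m:
        raise ValueError(f"not a mon_data directory name: {name}")
    scope, level, domain = m.groups()
    info = {"domain": int(domain)}
    if scope == "L" and level:   # L2 / L3 -> cache scope with a level
        info["scope"] = "cache"
        info["level"] = int(level)
    else:                        # NODE etc. -> scope with no level
        info["scope"] = (scope + level).lower()
    return info

# parse_mon_dir("mon_L3_00")   -> {"domain": 0, "scope": "cache", "level": 3}
# parse_mon_dir("mon_NODE_01") -> {"domain": 1, "scope": "node"}
```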

> 
> > │   └── mbm_total_bytes
> > └── mon_NODE_01     <== monitoring data at scope NODE
> >      └── mbm_total_bytes
> > 
> 
> Please let me take this chance to elaborate on region-aware RDT
> in more detail. I am wondering if the interface could be further
> extended to support this feature.
> 
> A "region" can be defined as a set of physical addresses that
> belong to the same memory tier. The region ID is per socket
> (i.e., unique within a single socket). Suppose we have a 2-socket
> platform as follows:
> 
> 
> S0: 1LM Direct DDR ==> NUMA node 0
>  CXL HDM (Tier2)   ==> NUMA node 2
> S1: 1LM Direct DDR ==> NUMA node 1
>  CXL HDM (Tier2)   ==> NUMA node 3
> 
> region0 on S0 is node0, region1 on S0 is node2,
> region0 on S1 is node1, region1 on S1 is node3.
> 
> Let us assume that each socket has 2 LLC domains.
> For example, S0 has LLC domain0 and LLC domain1,
> S1 has LLC domain2 and LLC domain3.
> 
> We propose the following schemata:
> <resource name>_<region>_<control>
> for example,
> MB_REGION1_OPT:0=511;1=510;2=509;3=508
> This means: for LLC domain0 on S0, the throttle
> level for node2 (because region1 on S0 is node2)
> is 511. For LLC domain2 on S1, the throttle
> level for node3 (because region1 on S1 is
> node3) is 509.
> 
> Users could query the exact definition of REGION1
> by checking the info directory.
> 
> info
> └── MB
>       └── resource_schemata
>           ├── MB_REGION1_OPT
>           │   ├── max
>           │   ├── min
>           │   ├── resolution
>           │   ├── scale
>           │   ├── scope <== "0=node2;1=node3" (node2 on S0, node3 on S1)
>           │   ├── tolerance
>           │   ├── type
>           │   └── unit
> 
> 
> thanks,
> Chenyu

Hmmm, that's interesting.

If there is a grouping on NUMA nodes, is that advertised anywhere in
sysfs already?

Ideally, there would already be a definition of what "region 0" is in
terms of the NUMA topology, and we could just refer to it.
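
For concreteness, a sketch of how the proposed <resource>_<region>_<control>
line format decomposes (hypothetical helper, not the proposed implementation):

```python
# Decompose a region-aware schemata line such as
# "MB_REGION1_OPT:0=511;1=510;2=509;3=508" into its parts:
# resource name, region ID, control name, and an LLC-domain -> value map.
def parse_region_schema_line(line: str):
    name, _, values = line.partition(":")
    resource, region, control = name.split("_")    # e.g. MB, REGION1, OPT
    region_id = int(region.removeprefix("REGION"))
    per_domain = {}
    for item in values.split(";"):
        dom, _, val = item.partition("=")
        per_domain[int(dom)] = int(val)            # LLC domain -> throttle level
    return resource, region_id, control, per_domain
```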

Cheers
---Dave
