linux-kernel - Re: [RFC] fs/resctrl: Generic schema description

Message-ID: <d5c807b0-d2a6-4fd9-8ad4-a36b334e0b88@intel.com>
Date: Fri, 26 Dec 2025 18:38:52 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Reinette Chatre <reinette.chatre@...el.com>, Babu Moger
	<babu.moger@....com>, Fenghua Yu <fenghuay@...dia.com>, Dave Martin
	<Dave.Martin@....com>
CC: Tony Luck <tony.luck@...el.com>, James Morse <james.morse@....com>,
	"Thomas Gleixner" <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
	"Borislav Petkov" <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
	"H. Peter Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>,
	<x86@...nel.org>, <linux-kernel@...r.kernel.org>, <fustini@...nel.org>
Subject: Re: [RFC] fs/resctrl: Generic schema description

Hi Reinette and all,

On 12/17/2025 6:26 AM, Reinette Chatre wrote:
> Hi Babu and Fenghua,
> 
> Could you please consider how the new AMD and MPAM features [2] may benefit
> from the new interfaces proposed here? More below ...
> 
> On 10/24/25 4:12 AM, Dave Martin wrote:

[snip]

> 
> One thing I was pondering is that resctrl currently uses L3 interchangeably
> as a scope and a resource but if instead that is separated then it should be
> easier to support interactions with resource at a different scope.
> 
> I am concerned that, for example, support for Global Memory Bandwidth Allocation
> (GMBA) is planned to be done with a new resource. resctrl already has a
> "memory bandwidth allocation" resource and introducing a new resource to essentially
> manage the same resource, but at a different scope, sounds like a risk of fragmentation
> and duplication to me.
> 
> What if the "resource control" instead gains a new property, for example, "scope" that
> essentially communicates to user space what a domain ID in the schemata file means.
> 
> It is not clear to me what a "domain ID" of GMBA means so I will use the MPAM CPU-less
> MBM as example that I expect will build on SMBA that supports CXL.mem. Consider, an interface
> like below:
> 
> info
> └── SMBA
>      └── resource_schemata
>          ├── SMBA
>          │   ├── max
>          │   ├── min
>          │   ├── resolution
>          │   ├── scale
>          │   ├── scope <== contains "L3"
>          │   ├── tolerance
>          │   ├── type
>          │   └── unit
>          └── SMBA_NODE
>              ├── max
>              ├── min
>              ├── resolution
>              ├── scale
>              ├── scope <== contains "NODE"

Would it be more user-friendly to explicitly show "node0, node1, ..."
rather than "NODE"? After all, we can already infer the "NODE" type from
the schemata name "SMBA_NODE".

>              ├── tolerance
>              ├── type
>              └── unit
> 
> With an interface like above there is a single resource and allocating it at a different
> scope is just another control. This correlates to how other parts of resctrl is managed.
> For example, it can become explicit that the monitor groups' mon_data  directory contains
> sub-directories organized by scope. For example:
> 
> mon_data
> ├── mon_L3_00       <== monitoring data at scope L3
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_L3_01       <== monitoring data at scope L3
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_NODE_00     <== monitoring data at scope NODE

Does this mean the domain ID is "0", which corresponds to node0?
This seems to align with Fenghua's presentation at LPC,
where he mentioned that for CPU-less resctrl, the domain ID changes
from an L3 ID to a node ID.

> │   └── mbm_total_bytes
> └── mon_NODE_01     <== monitoring data at scope NODE
>      └── mbm_total_bytes
> 

Please let me take this chance to elaborate on region-aware RDT
in more detail. I am wondering if the interface could be further
extended to support this feature.

A "region" can be defined as a set of physical addresses that
belong to the same memory tier. The region ID is per socket
(i.e., unique within a single socket). Suppose we have a 2-socket
platform as follows:


S0: 1LM Direct DDR  ==> NUMA node 0
    CXL HDM (Tier2) ==> NUMA node 2
S1: 1LM Direct DDR  ==> NUMA node 1
    CXL HDM (Tier2) ==> NUMA node 3

region0 on S0 is node0, region1 on S0 is node2,
region0 on S1 is node1, region1 on S1 is node3.

Let us assume that each socket has 2 LLC domains.
For example, S0 has LLC domain0 and LLC domain1,
S1 has LLC domain2 and LLC domain3.

We propose the following schemata format:
<resource name>_<region>_<control>
For example:
MB_REGION1_OPT:0=511;1=510;2=509;3=508
This means that for LLC domain0 on S0, the throttle
level for node2 (because region1 on S0 is node2)
is 511. For LLC domain2 on S1, the throttle
level for node3 (because region1 on S1 is node3)
is 509.
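
To make the lookup above concrete, here is a rough user-space sketch
of how a tool might decode such a line. It is only an illustration of
the proposed naming, not an existing interface: the function names are
hypothetical, and the two topology tables are hard-coded from the
2-socket example in this mail rather than discovered from the system.

```python
import re

# Assumption: topology copied from the example platform in this mail.
# LLC domain ID -> socket ID
LLC_DOMAIN_TO_SOCKET = {0: 0, 1: 0, 2: 1, 3: 1}
# (socket ID, region ID) -> NUMA node ID
REGION_TO_NODE = {(0, 0): 0, (0, 1): 2, (1, 0): 1, (1, 1): 3}


def parse_schemata_line(line):
    """Split '<resource>_REGION<n>_<control>:<dom>=<val>;...' into parts."""
    name, _, values = line.partition(":")
    m = re.fullmatch(r"([A-Za-z0-9]+)_REGION(\d+)_([A-Za-z0-9]+)", name.strip())
    if m is None:
        raise ValueError("unexpected schemata name: %r" % name)
    resource, region, control = m.group(1), int(m.group(2)), m.group(3)
    levels = {}
    for item in values.strip().split(";"):
        dom, _, val = item.partition("=")
        levels[int(dom)] = int(val)
    return resource, region, control, levels


def resolve_nodes(line):
    """Map each LLC domain to (NUMA node being throttled, throttle level)."""
    _, region, _, levels = parse_schemata_line(line)
    result = {}
    for dom, level in levels.items():
        socket = LLC_DOMAIN_TO_SOCKET[dom]
        result[dom] = (REGION_TO_NODE[(socket, region)], level)
    return result


if __name__ == "__main__":
    print(resolve_nodes("MB_REGION1_OPT:0=511;1=510;2=509;3=508"))
```

The point is that the region ID alone is not enough to find the node:
the socket that owns the LLC domain has to be known as well.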

Users could query the exact definition of REGION1
by checking the info directory.

info
└── MB
       └── resource_schemata
           ├── MB_REGION1_OPT
           │   ├── max
           │   ├── min
           │   ├── resolution
           │   ├── scale
           │   ├── scope <== "0=node2;1=node3" (node2 on S0, node3 on S1)
           │   ├── tolerance
           │   ├── type
           │   └── unit
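
As a sketch of how user space might consume the "scope" file shown
above, assuming its content really is a ';'-separated list of
"<socket>=node<N>" pairs as in this example (the format is only a
proposal, and parse_scope() is a hypothetical helper):

```python
def parse_scope(text):
    """Parse '0=node2;1=node3' into {socket ID: NUMA node ID}."""
    mapping = {}
    for item in text.strip().split(";"):
        socket, _, node = item.partition("=")
        if not node.startswith("node"):
            raise ValueError("unexpected scope entry: %r" % item)
        mapping[int(socket)] = int(node[len("node"):])
    return mapping


if __name__ == "__main__":
    # Content assumed for info/MB/resource_schemata/MB_REGION1_OPT/scope
    print(parse_scope("0=node2;1=node3"))
```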


thanks,
Chenyu