linux-kernel - Re: [RFC] fs/resctrl: Generic schema description

Message-ID: <6e8d3645-cb0d-4bfe-a170-6306e3c60582@intel.com>
Date: Thu, 6 Nov 2025 09:45:59 -0800
From: Reinette Chatre <reinette.chatre@...el.com>
To: Dave Martin <Dave.Martin@....com>
CC: <linux-kernel@...r.kernel.org>, Tony Luck <tony.luck@...el.com>, "James
 Morse" <james.morse@....com>, "Chen, Yu C" <yu.c.chen@...el.com>, "Thomas
 Gleixner" <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, "Borislav
 Petkov" <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter
 Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>, <x86@...nel.org>,
	Drew Fustini <dfustini@...libre.com>
Subject: Re: [RFC] fs/resctrl: Generic schema description

+Drew

On 11/4/25 2:26 PM, Reinette Chatre wrote:
> Hi Dave,
> 
> On 10/30/25 9:36 AM, Dave Martin wrote:
>> Hi Reinette,
>>
>> On Tue, Oct 28, 2025 at 04:17:05PM -0700, Reinette Chatre wrote:
>>> Hi Dave,
>>>
>>> On 10/24/25 4:12 AM, Dave Martin wrote:
>>>> Hi all,
>>>>
>>>> Going forward, a single resctrl resource (such as memory bandwidth) is
>>>> likely to require multiple schemata, either because we want to add new
>>>> schemata that provide finer control, or because the hardware has
>>>> multiple controls, covering different aspects of resource allocation.
>>>>
>>>> The fit between MPAM's memory bandwidth controls and the resctrl MB
>>>> schema is already awkward, and later Intel RDT features such as Region
>>>> Aware Memory Bandwidth Allocation are already pushing past what the MB
>>>> schema can describe.  Both of these can involve multiple control
>>>> values and finer resolution than the 100 steps offered by the current
>>>> "MB" schema.
>>>>
>>>> The previous discussion went off in a few different directions [1], so
>>>> I want to focus back onto defining an extended schema description that
>>>> aims to cover the use cases that we know about or anticipate today, and
>>>> allows for future extension as needed.
>>>>
>>>> (A separate discussion is needed on how new schemata interact with
>>>> previously-defined schemata (such as the MB percentage schema).  I
>>>> suggest we pause that discussion for now, in the interests of getting
>>>> the schema description nailed down.)
>>>
>>> ok, but let's keep this as "open #1"
>>>
>>>> Following on from the previous mail thread, I've tried to refine and
>>>> flesh out the proposal for schema descriptions a bit, as follows.
>>>>
>>>> Proposal:
>>>>
>>>>   * Split resource names and schema names in resctrlfs.
>>>>
>>>>     Resources will be named for the unique, existing schema for each
>>>>     resource.
>>>
>>> Are you referring to the implementation or how things are exposed to user
>>> space? I am trying to understand how the existing L3CODE/L3DATA schemata
>>> fit in ... they are presented to user space as two separate resources since
>>> they each have their own directory in "info" while internally they are
>>> schemata of the L3 resource.
>>
>> Good question -- I didn't take into account here the fact that some
>> physical resources already have multiple schemata exposed to userspace.
>>
>> I've probably overformalised, here.  I'm not proposing to refactor the
>> arrangement of existing schemata and resources.	
>>
>> So we would continue to have
>> info/L3CODE/resource_schemata/L3CODE/ and
>> info/L3DATA/resource_schemata/L3DATA/.
>>
>>
>> I think that the decision to combine these under a single resctrl
>> resource internally is the most logical one, but I'm proposing just to
>> extend the info/ content, without unnecessary changes.
> 
> Thank you for confirming. This matches the way I was thinking about this work.
> 
>>
>> The current arrangement does have one shortcoming, which is that
>> software doesn't know (other than by built-in knowledge) that L3CODE
>> and L3DATA claim resource from the same hardware pool, so
>>
>> 	L3CODE:0=0001
>> 	L3DATA:0=0001
>>
>> implies that the transactions on the I-side and D-side contend for
>> cache lines (unless there are separate L3 I- and D-caches -- but I
>> don't think that's a thing on any relevant system...)
>>
>> So, we might want some way to indicate that L3CODE and L3DATA are
>> linked.  But I think that CDP is a unique case where we can reasonably
>> expect some built-in userspace knowledge.
> 
> I'll admit that it is not as obvious as this new interface would make it
> for new schemata but userspace is not entirely left to its own devices. 
> resctrl will ensure that these resources do not overlap when, for example,
> a resource group is exclusive. For example, an L3CODE allocation in one
> resource group cannot be created to overlap with an L3DATA allocation in
> another when one of the resource groups is exclusive.
> 
>>
>> I didn't currently plan to address this, but it could come later if we
>> think it's important.
>>
>>> Just trying to understand if you are talking about reverting
>>> https://lore.kernel.org/all/20210728170637.25610-1-james.morse@arm.com/ ?
>>
>> No...
>>
>>> The current implementation appears to match this proposal so we may need to
>>> have special cases to keep CDP backwards compatible.
>>>
>>> SMBA may also need some extra care ... especially if other architectures start
>>> to allocate memory bandwidth to CXL resource via their "MB" resource.
>>
>> Perhaps.  I think it may be necessary to hack up an implementation of
>> these changes, to flush out things that don't quite fit.
> 
> Have you considered how MPAM may want to deal with different memory "types"?
> With SMBA there is a "CXL memory" resource while the MB resource has mostly
> been "anything that misses L3". From a user space perspective it is not obvious
> to me how users prefer to refer to different memory types.
> 
>>
>>>  
>>>>     The existing schema will keep its name (the same as the resource
>>>>     name), and new schemata defined for a resource will include that
>>>>     name as a prefix (at least, by default).
> 
> We may have to be explicit on expectations wrt which schema can be observed in
> which area (schemata file vs new info hierarchy). resctrl.rst currently contains:
> 	"schemata":
> 		A list of all the resources available to this group.
> With the above in existing documentation resctrl may be forced to always keep
> existing schema/resource in the schemata file and be careful when considering whether to
> drop them as mused in https://lore.kernel.org/lkml/aPkEb4CkJHZVDt0V@agluck-desk3/
> 
> Theoretically it may be possible in the future for it to vary which resources a
> resource group may allocate. Consider for example when resources support different
> numbers of CLOSID/PARTID and there is a desire to expose that to user space instead of
> constraining all resource groups to lowest CLOSID/PARTID. In such a scenario it should
> be clear to user space which resources it can allocate to a resource group so it is
> reasonable to expect the existing documentation for "schemata" being "A list of all
> the resources available to this group." to be respected.
> 
> On the flip side, it may not be required that a new schema in new info hierarchy always
> appears in the schemata file. Reason I think this is after seeing in MPAM that
> controls could be enabled/disabled (like MPAMCFG_MBW_PROP.EN for proportional-stride
> partitioning).
> 
> resctrl may thus have support for more partitioning controls than what is exposed by
> schemata file with ability for user space to choose which partitioning controls to expose
> in schemata file to use to manage a resource. It may then turn out that in addition to
> (read-only) schema "properties" there may also be (writable) schema "controls" (bad name
> since this would "control" a "partitioning control") where user space can modify behavior
> of a partitioning control.
> 
>>>>
>>>>     So, for example, we will have an MB resource with a schema called
>>>>     MB (the schema that we have already).  But we may go on to define
>>>>     additional schemata for the MB resource, with names such as MB_MAX,
>>>>     etc.
>>>>
>>>>   * Stop adding new schema description information in the top-level
>>>>     info/<resource>/ directory in resctrlfs.
>>>>
>>>>     For backwards compatibility, we can keep the existing property
>>>>     files under the resource info directory to describe the previously
>>>>     defined resource, but we seem to need something richer going
>>>>     forward.
> 
> ack.
> 
>>>>
>>>>   * Add a hierarchy to list all the schemata for each resource, along
>>>>     with their properties.  So far, the proposal looks like this,
>>>>     taking the MB resource as an example:
>>>>
>>>> 	info/
>>>> 	 └─ MB/
>>>> 	     └─ resource_schemata/
>>>> 	         ├─ MB/
>>>> 	         ├─ MB_MIN/
>>>> 	         ├─ MB_MAX/
>>>> 	         ┆
>>>>
>>>>     Here, MB, MB_MIN and MB_MAX are all schemata for the "MB" resource.
>>>>     In this proposal, these are just dummy schema names for
>>>>     illustration purposes.  The important thing is that they all
>>>>     control aspects of the "MB" resource, and that there can be more
>>>>     than one of them.
>>>>
>>>>     It may be appropriate to have a nested hierarchy, where some
>>>>     schemata are presented as children of other schemata if they
>>>>     affect the same hardware controls.  For now, let's put this issue
>>>>     on one side, and consider what properties should be advertised for
>>>>     each schema.
>>>
>>> ok to put this aside but I think we should keep including it, "open #2" ?
>>
>> Yes; I'm not abandoning this, but I wanted to focus on the schema
>> description, here.
> 
> Understood. There may be some connection with this work if there is a hierarchy
> since one schema's description may then be in terms of another. For example,
> the relationships described via pseudocode in https://lore.kernel.org/lkml/aPJP52jXJvRYAjjV@e133380.arm.com/
> 
> As a sidenote (related to the '#' prefix discussion), while trying to understand how
> this work may impact user expectations I did come across this in section
> "Reading/writing the schemata file" of resctrl.rst:
> 	When writing you only need to specify those values which you wish to change.
> 
> This seems quite close to addressing the concern raised in
> https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/ :
> 	The reason why I think that this convention may be needed is that we
> 	never told (old) userspace what it was supposed to do with schemata 
> 	entries that it does not recognise.
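
A minimal sketch of such a partial write, from the user-space side: it formats a single "<schema>:<domain>=<value>" line that names only the one value to be changed. The helper name format_scalar_update() is hypothetical, and the sketch assumes a decimal scalar control such as the existing MB schema; how entries that are not named in the write are treated is what the quoted resctrl.rst sentence addresses.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>

/*
 * Format one schemata update line, e.g. "MB:1=50\n", naming only the
 * schema and domain whose value should change.
 * Hypothetical helper: returns the line length, or -1 if buf is too small.
 */
static int format_scalar_update(char *buf, size_t len, const char *schema,
				unsigned int domain, unsigned long value)
{
	int n;

	assert(buf && schema);
	n = snprintf(buf, len, "%s:%u=%lu\n", schema, domain, value);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

The resulting line would then be written to the resource group's schemata file in a single write().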
>  
>>>>   * Current properties that I think we might want are:
>>>>
>>>> 	info/
>>>> 	 └─ SOME_RESOURCE/
>>>> 	     └─ resource_schemata/
>>>> 	         ├─ SOME_SCHEMA/
>>>> 	         ┆   ├─ type
>>>> 	             ├─ min
>>>> 	             ├─ max
>>>> 	             ├─ tolerance
>>>> 	             ├─ resolution
>>>> 	             ├─ scale
>>>> 	             └─ unit
>>>>
>>>>     (I've tweaked the properties a bit since previous postings.
>>>>     "type" replaces "map"; "scale" is now the unit multiplier;
>>>>     "resolution" is now a scaling divisor -- details below.)
>>>>
>>>>     I assume that we expose the properties in individual files, but we
>>>>     could also combine them into a single description file per schema,
>>>>     per resource or (possibly) a single global file.
>>>>     (I don't have a strong view on the best option.)
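
If the one-file-per-property option is chosen, user space could locate a property as sketched below. This is only an illustration of the proposed layout, which is not an existing kernel interface: the helper name schema_prop_path() is hypothetical, and /sys/fs/resctrl is assumed as the mount point.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed mount point of resctrlfs. */
#define RESCTRL_ROOT "/sys/fs/resctrl"

/*
 * Build the path of one property file in the proposed layout:
 *   info/<resource>/resource_schemata/<schema>/<property>
 * Hypothetical helper: returns 0 on success, -1 if the path does not fit.
 */
static int schema_prop_path(char *buf, size_t len, const char *resource,
			    const char *schema, const char *prop)
{
	int n;

	assert(buf && resource && schema && prop);
	n = snprintf(buf, len, "%s/info/%s/resource_schemata/%s/%s",
		     RESCTRL_ROOT, resource, schema, prop);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For example, schema_prop_path(buf, sizeof(buf), "MB", "MB_MAX", "resolution") would yield /sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/resolution, using the dummy schema names from the tree above.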
>>>>
>>>>
>>>>     Either way, the following set of properties may be a reasonable
>>>>     place to start:
>>>>
>>>>
>>>>     type: the schema type, followed by optional flag specifiers:
>>>>
>>>>       - "scalar": a single-valued numeric control
>>>>
>>>>         A mandatory flag indicates how the control value written to
>>>>         the schemata file is converted to an amount of resource for
>>>>         hardware regulation.
>>>>
>>>> 	The flag "linear" indicates a linear mapping.
>>>>
>>>> 	In this case, the amount of resource E that is actually
>>>> 	allocated is derived from the control value C written to the
>>>> 	schemata file as follows:
>>>>
>>>>     	E = C * scale * unit / resolution
>>>>
>>>> 	Other flag values could be defined later, if we encounter
>>>> 	hardware with non-linear controls.
>>>>
>>>>       - "bitmap": a bitmap control
>>>>
>>>>         The optional flag "sparse" is present if the control accepts
>>>>         sparse bitmaps.
>>>>
>>>> 	In this case, E = bitmap_weight(C) * scale * unit / resolution.
>>>>
>>>> 	As before, each bit controls access to a specific chunk of
>>>> 	resource in the hardware, such as a group of cache lines.  All
>>>> 	chunks are equally sized.
>>>>
>>>> 	(Different CTRL_MON groups may still contend within the
>>>> 	allocation E, when they have bits in common between their
>>>> 	bitmaps.)
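
The two conversions above can be sketched in C as follows. This is an illustration under stated assumptions, not part of the proposal: struct schema_props and the function names are hypothetical, "unit" is treated as a label so the result is a count of units, and the sketch uses truncating integer division and ignores overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory copy of two of the proposed schema properties. */
struct schema_props {
	uint64_t scale;      /* unit multiplier */
	uint64_t resolution; /* scaling divisor, must be non-zero */
};

/* Count the set bits in v (what the formula calls bitmap_weight). */
static unsigned int bitmap_weight64(uint64_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1) /* clear the lowest set bit each round */
		n++;
	return n;
}

/* "scalar" with "linear" flag: E = C * scale / resolution, in units. */
static uint64_t scalar_linear_amount(const struct schema_props *p, uint64_t c)
{
	assert(p && p->resolution != 0);
	return c * p->scale / p->resolution;
}

/* "bitmap": E = bitmap_weight(C) * scale / resolution, in units. */
static uint64_t bitmap_amount(const struct schema_props *p, uint64_t c)
{
	assert(p && p->resolution != 0);
	return (uint64_t)bitmap_weight64(c) * p->scale / p->resolution;
}
```

With scale = 1000 and resolution = 4, a control value of 6 gives 1500 units; with scale = 512 and resolution = 1, the bitmap 0x0f0f (eight bits set) gives 4096 units.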
>>>
>>> Would it not be simpler to have the files/properties depend on the
>>> schema type? It almost seems as though some of the properties are forced
>>> to have some meaning for bitmap when they do not seem to be needed. Instead,
>>> for a bitmap type there can be bitmap specific properties like, for example,
>>> bit_usage. This may also create more flexibility when there is a future
>>> mapping function needed that depends on some new property?
>>>
>>> Reinette
>>
>> Sure, there is no reason why the set of properties has to be identical
>> for different schema types.
>>
>> It turned out that a single set of properties fitted better than I
>> expected, so I presented things that way to see what people thought
>> about it.
>>
>> For bitmaps, there isn't a strong need to change the set of properties
>> already available in the top-level info/ directories.  These can be
>> adopted into the new info under resource_schemata/, but I might be
>> tempted to rename them to remove the "cbm" string so that the names are
>> applicable to all bitmap-style resources.  I might also rename the
>> min_cbm_bits property if we can think of a more intuitive name -- it's
>> not obvious how this should apply to sparse bitmaps.
> 
> yes, this is a good time to rename things.
> 
>>
>>
>> Thinking about bit_usage, is that really per-schema?
> 
> Good point. This is per resource.
> 
> This may create complexity if multiple controls are available for a resource. For
> example, if there is a MB resource with both a proportional schema and a max then
> it sounds like it may be possible to program the proportional schema with 100% while
> setting the max to 50%. On the hardware side these values may be legal, albeit with
> unpredictable performance, but it will be difficult for resctrl to visualize the
> "bit_usage" of such an allocation.
> 
>>
>> If L3CODE and L3DATA are really allocating the same underlying
>> resource, I wonder whether their bit_usage should be combined,
>> somehow.
> 
> Related to earlier comment this is done internally by resctrl but not exposed to
> user space. I earlier mentioned how exclusive groups take this into account, there
> is also the bitmasks used when creating new resource groups. You will, for example,
> find in __init_one_rdt_domain() that their bit usage is combined as below:
> 
> 		if (resctrl_arch_get_cdp_enabled(r->rid))               
> 			peer_ctl = resctrl_arch_get_config(r, d, i, peer_type);  
> 		else                                                    
> 			peer_ctl = 0;                                   
> 		ctrl_val = resctrl_arch_get_config(r, d, i, s->conf_type);       
> 		used_b |= ctrl_val | peer_ctl;                     
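
The same idea, seen from user space, might look like the sketch below. It is not kernel code: struct cdp_masks and both helpers are hypothetical names, CDP is assumed to be enabled, and only a single cache domain is considered. It ORs the L3CODE and L3DATA bitmaps of every group together, and checks whether two groups contend once code and data are treated as one pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space view of one group's CDP bitmaps for one domain. */
struct cdp_masks {
	uint32_t code; /* L3CODE bitmap */
	uint32_t data; /* L3DATA bitmap */
};

/* OR together the code and data bitmaps of all groups (compare used_b). */
static uint32_t cdp_used_bits(const struct cdp_masks *groups, size_t n)
{
	uint32_t used = 0;
	size_t i;

	assert(groups || n == 0);
	for (i = 0; i < n; i++)
		used |= groups[i].code | groups[i].data;
	return used;
}

/* Non-zero if a and b share any bit, regardless of code/data side. */
static int cdp_overlaps(const struct cdp_masks *a, const struct cdp_masks *b)
{
	assert(a && b);
	return ((a->code | a->data) & (b->code | b->data)) != 0;
}
```

With this view, L3CODE:0=0001 in one group and L3DATA:0=0001 in another are reported as overlapping, which matches the contention described earlier in the thread.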
> 
>>
>> This might be one for later, though.
>>
>> It doesn't look necessary to adopt all existing properties into the
>> extended schema description immediately -- if there are some that don't
>> quite fit, we could adopt them later on without breaking backwards
>> compatibility.
> 
> It is not obvious to me that it will be simple to add a property to an
> existing schema type. We may be forced to create a new schema type when needing to
> do so.
> 
> I also think there may be more schema types that will eventually need to be
> supported, for example MPAM's priority partitioning?
> 
> Reinette