linux-kernel - Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch

Message-ID: <aNFliMZTTUiXyZzd@e133380.arm.com>
Date: Mon, 22 Sep 2025 16:04:40 +0100
From: Dave Martin <Dave.Martin@....com>
To: linux-kernel@...r.kernel.org
Cc: Tony Luck <tony.luck@...el.com>,
	Reinette Chatre <reinette.chatre@...el.com>,
	James Morse <james.morse@....com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
	Dave Hansen <dave.hansen@...ux.intel.com>,
	"H. Peter Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>,
	x86@...nel.org, linux-doc@...r.kernel.org
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be
 per-arch

Hi again,

On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:

[...]

> > Clamping to bw_min and bw_max still feels generic: leave it in the core
> > code, for now.
> 
> Sounds like MPAM may be ready to start the schema parsing discussion again?
> I understand that MPAM has a few more ways to describe memory bandwidth as
> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing
> schema format to user space, which seems like a good idea for new schema.

On this topic, specifically:


My own ideas in this area are a little different, though I agree with
the general idea.

Bitmap controls are distinct from numeric values, but for numbers, I'm
not sure that distinguishing percentages from other values is required,
since this is really just a specific case of a linear scale.

I imagined a generic numeric schema, described by a set of files like
the following in a schema's info directory:

	min: minimum value, e.g., 1
	max: maximum value, e.g., 1023
	scale: value that corresponds to one unit
	unit: quantified base unit, e.g., "100pc", "64MBps"
	map: mapping function name

If s is the value written in a schemata entry and p is the
corresponding physical amount of resource, then

	min <= s <= max

and

	p = map(s / scale) * unit

One reason why I prefer this scaling scheme over the floating-point
approach is that it can be exact (at least for currently known
platforms), and it doesn't require a new floating-point parser/
formatter to be written for this one thing in the kernel (which I
suspect is likely to be error-prone and poorly defined around
subtleties such as rounding behaviour).
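
As a rough illustration of the formula above, here is a minimal userspace
sketch in Python that evaluates p = map(s / scale) * unit with exact rational
arithmetic.  The function names, the unit-string parsing and the MAPS table are
assumptions made for this example; only the identity mapping is modelled.

```python
import re
from fractions import Fraction

# Hypothetical table of mapping functions; only the identity is modelled here.
MAPS = {"none": lambda x: x}


def parse_unit(unit):
    """Split a unit string such as "100pc" or "64MBps" into (quantity, suffix)."""
    m = re.fullmatch(r"(\d+)(\D.*)", unit)
    if m is None:
        raise ValueError(f"unrecognised unit: {unit!r}")
    return int(m.group(1)), m.group(2)


def physical_amount(s, vmin, vmax, scale, unit, map_name="none"):
    """Return (exact amount, unit suffix) for schemata value s."""
    if not vmin <= s <= vmax:
        raise ValueError(f"{s} is outside [{vmin}, {vmax}]")
    quantity, suffix = parse_unit(unit)
    # Fraction keeps s / scale exact, so no floating-point rounding occurs.
    return MAPS[map_name](Fraction(s, scale)) * quantity, suffix
```

With illustrative values, physical_amount(3, 1, 100, 8, "1GBps") evaluates to
exactly 3/8 of a GBps, with no floating-point parsing or formatting involved.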

"map" anticipates non-linear ramps, but this is only really here as a
forwards compatibility get-out.  For now, this might just be set to
"none", meaning the identity mapping (i.e., a no-op).  This may shadow
the existing "delay_linear" parameter, but with more general
applicability if we need it.


The idea is that userspace reads the info files and then does the
appropriate conversions itself.  This might or might not be seen as a
burden, but would give exact control over the hardware configuration
with a generic interface, with possibly greater precision than the
existing schemata allow (when the hardware supports it), and without
having to second-guess the rounding that the kernel may or may not do
on the values.
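
A sketch of what the userspace side might look like, in Python.  The file names
follow the list above, but the directory layout, the helper names and the
choice to treat a missing "map" file as "none" are all assumptions for
illustration, not an existing resctrl interface.

```python
import math
from fractions import Fraction
from pathlib import Path


def read_schema_info(info_dir):
    """Read the proposed min/max/scale/unit/map files from an info directory."""
    d = Path(info_dir)
    info = {name: int((d / name).read_text().strip())
            for name in ("min", "max", "scale")}
    info["unit"] = (d / "unit").read_text().strip()
    map_file = d / "map"
    # Assumption: an absent "map" file means the identity mapping.
    info["map"] = map_file.read_text().strip() if map_file.exists() else "none"
    return info


def value_for_fraction(info, fraction, round_up=False):
    """Pick the schemata value s whose s / scale approximates `fraction`.

    The rounding direction is chosen by the caller, not by the kernel.
    """
    if info["map"] != "none":
        raise NotImplementedError("only the identity mapping is sketched here")
    exact = Fraction(fraction) * info["scale"]
    s = math.ceil(exact) if round_up else math.floor(exact)
    # Clamp to the advertised range.
    return max(info["min"], min(info["max"], s))
```

The point of the design is that the rounding decision (round_up or not) is
made visibly in userspace, where the caller knows which direction is safe.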

For RDT MBA, we might have

	min: 10
	max: 100
	scale: 100
	unit: 100pc
	map: none

The schemata entry

	MB: 0=10, 1=100

would allocate the minimum possible bandwidth to domain 0, and 100%
bandwidth to domain 1.


For AMD SMBA, we might have:

	min: 1
	max: 100
	scale: 8
	unit: 1GBps

(if I've understood this correctly from resctrl.rst.)


For MPAM MBW_MAX with, say, 6 bits of resolution, we might have:

	min: 1
	max: 64
	scale: 64
	unit: 100pc
	map: none

The schemata entry

	MB: 0=1,1=64

would allocate the minimum possible bandwidth to domain 0, and 100%
bandwidth to domain 1.  This would probably need to be a new schema,
since we already have "MB" mimicking x86.

Exposing the hardware scale in this way would give userspace precise
control (including in sub-1% increments on capable hardware), without
having to second-guess the way the kernel will round the values.
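
To make the precision argument concrete, here is a small Python sketch for the
6-bit case.  The percentage-to-hardware conversion in hw_value() is a
hypothetical rounding rule (round down, then clamp to the minimum) chosen for
the example; it is not a statement about what any kernel code does.

```python
from fractions import Fraction


def hw_value(percent, scale=64, vmin=1):
    """Hypothetical conversion of a whole percentage to a hardware value."""
    return max(vmin, percent * scale // 100)


def exact_percent(s, scale=64):
    """Exact percentage represented by hardware value s."""
    return Fraction(100 * s, scale)


# Group the percentages 1..100 by the hardware value they would land on.
groups = {}
for percent in range(1, 101):
    groups.setdefault(hw_value(percent), []).append(percent)
```

Under that assumed rule, several whole percentages share one hardware value,
while each hardware step is exactly 100/64 percent.  Writing the hardware value
directly avoids having to know which rule is in use.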


> Is this something MPAM is still considering? For example, the minimum
> and maximum ranges that can be specified, is this something you already
> have some ideas for? Have you perhaps considered Tony's RFD [3] that includes
> discussion on how to handle min/max ranges for bandwidth? 

This seems to be a different thing.  I think James had some thoughts on
this already -- I haven't checked on his current idea, but one option
would be simply to expose this as two distinct schemata, say MB_MIN,
MB_MAX.

There's a question of how to cope with multiple different schemata
entries that shadow each other (i.e., control the same hardware
resource).


Would something like the following work?  A read from schemata might
produce something like this:

MB: 0=50, 1=50
# MB_HW: 0=32, 1=32
# MB_MIN: 0=31, 1=31
# MB_MAX: 0=32, 1=32

(Where MB_HW is the MPAM schema with 6-bit resolution that I
illustrated above, and MB_MIN and MB_MAX are similar schemata for the
specific MIN and MAX controls in the hardware.)

Userspace that does not understand the new entries would need to ignore
the commented lines, but can otherwise safely alter and write back the
schemata with the expected results.  The kernel would in turn ignore
the commented lines on write.  The commented lines are meaningful but
"inactive": they describe the current hardware configuration on read,
but (unless explicitly uncommented) won't change anything on write.
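
A minimal Python sketch of how a tool might read such a file, keeping the
active and the commented ("inactive") entries apart.  The function name is made
up for the example, and the parser accepts both "," (as in the examples here)
and ";" as domain separators, which is an assumption.

```python
import re


def parse_schemata(text):
    """Return (active, inactive) dicts mapping schema name -> {domain: value}."""
    active, inactive = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        target = active
        if line.startswith("#"):
            # Commented entries describe the hardware state but are inactive.
            target = inactive
            line = line[1:].strip()
        name, _, body = line.partition(":")
        domains = {}
        for item in re.split(r"[;,]", body):
            dom, _, val = item.partition("=")
            domains[int(dom)] = int(val)
        target[name.strip()] = domains
    return active, inactive
```

A tool that does not know about the new entries would only ever look at the
first dictionary; a newer tool can inspect both.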

Software that understands the new entries can uncomment them and write
them back instead of (or in addition to) the entries that they conflict
with.  For example, userspace might write the following:

MB_MIN: 0=16, 1=16
MB_MAX: 0=32, 1=32

Which might then read back as follows:

MB: 0=50, 1=50
# MB_HW: 0=32, 1=32
# MB_MIN: 0=16, 1=16
# MB_MAX: 0=32, 1=32
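
The write side could be sketched in Python along these lines: take the text
read from schemata, keep only the entries the tool wants to control, and strip
the comment marker from them.  The helper name is hypothetical, and the
behaviour of the kernel on such a write is the proposal above, not something
that exists today.

```python
def select_for_write(schemata_text, wanted):
    """Build write-back text containing only the entries named in `wanted`.

    Commented entries are uncommented; everything else is dropped.
    """
    out = []
    for raw in schemata_text.splitlines():
        line = raw.strip().lstrip("#").strip()
        name = line.partition(":")[0].strip()
        if name in wanted:
            out.append(line)
    return "\n".join(out) + "\n"
```

The tool would then edit the values in the returned text before writing it
back to schemata.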


I haven't tried to develop this idea further, for now.

I'd be interested in people's thoughts on it, though.

Cheers
---Dave