Message-ID: <aNXJGw9r_k3BB4Xk@agluck-desk3>
Date: Thu, 25 Sep 2025 15:58:35 -0700
From: "Luck, Tony" <tony.luck@...el.com>
To: Dave Martin <Dave.Martin@....com>
CC: <linux-kernel@...r.kernel.org>, Reinette Chatre
<reinette.chatre@...el.com>, James Morse <james.morse@....com>, "Thomas
Gleixner" <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, "Borislav
Petkov" <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter
Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>, <x86@...nel.org>,
<linux-doc@...r.kernel.org>
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch

On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:
> Hi again,
>
> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
>
> [...]
>
> > > Clamping to bw_min and bw_max still feels generic: leave it in the core
> > > code, for now.
> >
> > Sounds like MPAM may be ready to start the schema parsing discussion again?
> > I understand that MPAM has a few more ways to describe memory bandwidth as
> > well as cache portion partitioning. Previously ([1] [2]) James mused about exposing
> > schema format to user space, which seems like a good idea for new schema.
>
> On this topic, specifically:
>
>
> My own ideas in this area are a little different, though I agree with
> the general idea.
>
> Bitmap controls are distinct from numeric values, but for numbers, I'm
> not sure that distinguishing percentages from other values is required,
> since this is really just a specific case of a linear scale.
>
> I imagined a generic numeric schema, described by a set of files like
> the following in a schema's info directory:
>
> min: minimum value, e.g., 1
> max: maximum value, e.g., 1023
> scale: value that corresponds to one unit
> unit: quantified base unit, e.g., "100pc", "64MBps"
> map: mapping function name
>
> If s is the value written in a schemata entry and p is the
> corresponding physical amount of resource, then
>
> min <= s <= max
>
> and
>
> p = map(s / scale) * unit
>
> One reason why I prefer this scaling scheme over the floating-point
> approach is that it can be exact (at least for currently known
> platforms), and it doesn't require a new floating-point parser/
> formatter to be written for this one thing in the kernel (which I
> suspect is likely to be error-prone and poorly defined around
> subtleties such as rounding behaviour).
>
> "map" anticipates non-linear ramps, but this is only really here as a
> forwards compatibility get-out. For now, this might just be set to
> "none", meaning the identity mapping (i.e., a no-op). This may shadow
> the existing "delay_linear" parameter, but with more general
> applicability if we need it.
>
>
> The idea is that userspace reads the info files and then does the
> appropriate conversions itself. This might or might not be seen as a
> burden, but would give exact control over the hardware configuration
> with a generic interface, with possibly greater precision than the
> existing schemata allow (when the hardware supports it), and without
> having to second-guess the rounding that the kernel may or may not do
> on the values.
>
> For RDT MBA, we might have
>
> min: 10
> max: 100
> scale: 100
> unit: 100pc
> map: none
>
> The schemata entry
>
> MB: 0=10, 1=100
>
> would allocate the minimum possible bandwidth to domain 0, and 100%
> bandwidth to domain 1.
>
>
> For AMD SMBA, we might have:
>
> min: 1
> max: 100
> scale: 8
> unit: 1GBps
>
> (if I've understood this correctly from resctrl.rst.)
>
>
> For MPAM MBW_MAX with, say, 6 bits of resolution, we might have:
>
> min: 1
> max: 64
> scale: 64
> unit: 100pc
> map: none
>
> The schemata entry
>
> MB: 0=1,1=64
>
> would allocate the minimum possible bandwidth to domain 0, and 100%
> bandwidth to domain 1. This would probably need to be a new schema,
> since we already have "MB" mimicking x86.
>
> Exposing the hardware scale in this way would give userspace precise
> control (including in sub-1% increments on capable hardware), without
> having to second-guess the way the kernel will round the values.
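
The conversion described above, p = map(s / scale) * unit, can be sketched in userspace roughly as follows. This is only an illustration under stated assumptions: "map" is taken to be "none" (identity), the function name physical_fraction() is hypothetical, and the min/max/scale numbers are the example values quoted above rather than anything read from real info files. Exact rationals are used so that no rounding creeps in.

```python
from fractions import Fraction


def physical_fraction(s, vmin, vmax, scale):
    """Return map(s / scale) as an exact fraction, assuming map == "none".

    Hypothetical helper: vmin, vmax and scale stand in for the contents of
    the proposed "min", "max" and "scale" info files.
    """
    if not vmin <= s <= vmax:
        raise ValueError(f"{s} is outside [{vmin}, {vmax}]")
    return Fraction(s, scale)


# MPAM MBW_MAX example with 6 bits of resolution: unit is "100pc",
# so multiply by 100 to get a percentage of full bandwidth.
mpam = dict(vmin=1, vmax=64, scale=64)
lowest = physical_fraction(1, **mpam) * 100   # 25/16, i.e. 1.5625%
full = physical_fraction(64, **mpam) * 100    # exactly 100%

# AMD SMBA example: scale is 8 and unit is "1GBps", so s=100 is 12.5 GBps.
smba_max_gbps = physical_fraction(100, vmin=1, vmax=100, scale=8)
```

Because the result is kept as a fraction, the sub-1% step of the MPAM example (1.5625% per increment) is represented exactly, which is the property the scaling scheme is meant to preserve.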
>
>
> > Is this something MPAM is still considering? For example, the minimum
> > and maximum ranges that can be specified, is this something you already
> > have some ideas for? Have you perhaps considered Tony's RFD [3] that includes
> > discussion on how to handle min/max ranges for bandwidth?
>
> This seems to be a different thing. I think James had some thoughts on
> this already -- I haven't checked on his current idea, but one option
> would be simply to expose this as two distinct schemata, say MB_MIN,
> MB_MAX.
>
> There's a question of how to cope with multiple different schemata
> entries that shadow each other (i.e., control the same hardware
> resource).
>
>
> Would something like the following work? A read from schemata might
> produce something like this:
>
> MB: 0=50, 1=50
> # MB_HW: 0=32, 1=32
> # MB_MIN: 0=31, 1=31
> # MB_MAX: 0=32, 1=32
>
> (Where MB_HW is the MPAM schema with 6-bit resolution that I
> illustrated above, and MB_MIN and MB_MAX are similar schemata for the
> specific MIN and MAX controls in the hardware.)
>
> Userspace that does not understand the new entries would need to ignore
> the commented lines, but can otherwise safely alter and write back the
> schemata with the expected results. The kernel would in turn ignore
> the commented lines on write. The commented lines are meaningful but
> "inactive": they describe the current hardware configuration on read,
> but (unless explicitly uncommented) won't change anything on write.
>
> Software that understands the new entries can uncomment the conflicting
> entries and write them back instead of (or in addition to) the
> conflicting entries. For example, userspace might write the following:
>
> MB_MIN: 0=16, 1=16
> MB_MAX: 0=32, 1=32
>
> Which might then read back as follows:
>
> MB: 0=50, 1=50
> # MB_HW: 0=32, 1=32
> # MB_MIN: 0=16, 1=16
> # MB_MAX: 0=32, 1=32
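
To make the "commented lines are inactive on write" rule concrete, here is a minimal parsing sketch. It is not the kernel parser: active_entries() is a hypothetical name, and accepting both "," and ";" as domain separators is an assumption made only so that the examples in this thread parse as written.

```python
import re


def active_entries(schemata_text):
    """Parse schemata text, skipping blank lines and '#'-commented lines.

    Returns {schema_name: {domain_id: value}} for the active lines only.
    """
    entries = {}
    for raw in schemata_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # commented lines describe state but change nothing
        name, _, body = line.partition(":")
        domains = {}
        for item in re.split(r"[;,]", body):
            item = item.strip()
            if not item:
                continue
            dom, _, val = item.partition("=")
            domains[int(dom)] = int(val)
        entries[name.strip()] = domains
    return entries


readback = """\
MB: 0=50, 1=50
# MB_HW: 0=32, 1=32
# MB_MIN: 0=16, 1=16
# MB_MAX: 0=32, 1=32
"""

# A legacy tool that writes this text back unmodified only touches "MB".
legacy_view = active_entries(readback)

# A newer tool uncomments the lines it wants to take effect.
new_write = active_entries("MB_MIN: 0=16, 1=16\nMB_MAX: 0=32, 1=32\n")
```

With this rule a read-modify-write cycle by software that knows nothing about MB_HW, MB_MIN or MB_MAX never submits conflicting values for the same hardware control.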
>
>
> I haven't tried to develop this idea further, for now.
>
> I'd be interested in people's thoughts on it, though.

Applying this to Intel's upcoming region-aware memory bandwidth
controls, which support 255 steps and h/w min/max limits, we would
have info files with "min = 1, max = 255" and a schemata file that
looks like this to legacy apps:

MB: 0=50;1=75
#MB_HW: 0=128;1=191
#MB_MIN: 0=128;1=191
#MB_MAX: 0=128;1=191

But a newer app that is aware of the extensions can write:

# cat > schemata << 'EOF'
MB_HW: 0=10
MB_MIN: 0=10
MB_MAX: 0=64
EOF

which then reads back as:

MB: 0=4;1=75
#MB_HW: 0=10;1=191
#MB_MIN: 0=10;1=191
#MB_MAX: 0=64;1=191

with the legacy line updated to the rounded value of the MB_HW
supplied by the user: 10/255 = 3.921%, so call it "4".
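
The rounding for the legacy line could be sketched like this. The helper name legacy_percent() and the choice of round-half-up in integer arithmetic are assumptions for illustration only; how the kernel would actually round is not settled here.

```python
HW_MAX = 255  # number of steps from the example above


def legacy_percent(hw_value, hw_max=HW_MAX):
    """Map an MB_HW step count to the nearest whole percentage.

    Uses integer round-half-up, so no floating point is involved.
    """
    if not 1 <= hw_value <= hw_max:
        raise ValueError(f"{hw_value} is outside [1, {hw_max}]")
    return (100 * hw_value + hw_max // 2) // hw_max


# Values taken from the schemata examples above.
shown = {hw: legacy_percent(hw) for hw in (10, 128, 191)}
# 10 -> 4, 128 -> 50, 191 -> 75
```

Several MB_HW values necessarily map to the same legacy percentage, so an application that reads "MB" alone cannot recover the exact hardware setting.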

The region-aware h/w supports separate bandwidth controls for each
region. We could hope (or perhaps update the spec to define) that
region0 is always node-local DDR memory and keep the "MB" tag for
that.

Then use some other tag naming for other regions. Remote DDR,
local CXL, and remote CXL are the ones we think are next in the h/w
memory sequence. But the "region" concept would allow for other
options as other memory technologies come into use.

>
> Cheers
> ---Dave