linux-kernel - Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <aNq11fmlac6dH4pH@agluck-desk3>
Date: Mon, 29 Sep 2025 09:37:41 -0700
From: "Luck, Tony" <tony.luck@...el.com>
To: Dave Martin <Dave.Martin@....com>
CC: <linux-kernel@...r.kernel.org>, Reinette Chatre
	<reinette.chatre@...el.com>, James Morse <james.morse@....com>, "Thomas
 Gleixner" <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, "Borislav
 Petkov" <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter
 Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>, <x86@...nel.org>,
	<linux-doc@...r.kernel.org>
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be
 per-arch

On Mon, Sep 29, 2025 at 02:56:19PM +0100, Dave Martin wrote:
> Hi Tony,
> 
> Thanks for taking at look at this -- comments below.
> 
> [...]
> 
> On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote:
> > On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:
> 
> [...]
> 
> > > Would something like the following work?  A read from schemata might
> > > produce something like this:
> > > 
> > > MB: 0=50, 1=50
> > > # MB_HW: 0=32, 1=32
> > > # MB_MIN: 0=31, 1=31
> > > # MB_MAX: 0=32, 1=32
> 
> [...]
> 
> > > I'd be interested in people's thoughts on it, though.
> > 
> > Applying this to Intel upcoming region aware memory bandwidth
> > that supports 255 steps and h/w min/max limits.
> 
> Following the MPAM example, would you also expect:
> 
> 	scale: 255
> 	unit: 100pc
> 
> ...?

Yes. 255 (or whatever "Q" value is provided in the ACPI table)
corresponds to no throttling, so 100% bandwidth.

> 
> > We would have info files with "min = 1, max = 255" and a schemata
> > file that looks like this to legacy apps:
> > 
> > MB: 0=50;1=75
> > #MB_HW: 0=128;1=191
> > #MB_MIN: 0=128;1=191
> > #MB_MAX: 0=128;1=191
> > 
> > But a newer app that is aware of the extensions can write:
> > 
> > # cat > schemata << 'EOF'
> > MB_HW: 0=10
> > MB_MIN: 0=10
> > MB_MAX: 0=64
> > EOF
> > 
> > which then reads back as:
> > MB: 0=4;1=75
> > #MB_HW: 0=10;1=191
> > #MB_MIN: 0=10;1=191
> > #MB_MAX: 0=64;1=191
> > 
> > with the legacy line updated with the rounded value of the MB_HW
> > supplied by the user. 10/255 = 3.921% ... so call it "4".
> 
> I'm suggesting that this always be rounded up, so that you have a
> guarantee that the steps are no smaller than the reported value.

Round up, rather than round-to-nearest, make sense. Though perhaps
only cosmetic as I would be surprised if anyone has a mix of tools
looking at the legacy schemata lines while programming using the
direct h/w controls.

> 
> (In this case, round-up and round-to-nearest give the same answer
> anyway, though!)
> 
> > 
> > The region aware h/w supports separate bandwidth controls for each
> > region. We could hope (or perhaps update the spec to define) that
> > region0 is always node-local DDR memory and keep the "MB" tag for
> > that.
> 
> Do you have concerns about existing software choking on the #-prefixed
> lines?

Do they even need a # prefix? We already mix lines for multiple
resources in the schemata file with a separate prefix for each resource.
The schemata file also allows writes to just update one resource (or
one domain in a single resource). The schemata file started with just
"L3". Then we added "L2", "MB", and "SMBA" with no concern that the
initial "L3" manipulating tools would be confused.

> > Then use some other tag naming for other regions. Remote DDR,
> > local CXL, remote CXL are the ones we think are next in the h/w
> > memory sequence. But the "region" concept would allow for other
> > options as other memory technologies come into use.
> 
> Would it be reasnable just to have a set of these schema instances, per
> region, so:
> 
> MB_HW: ... // implicitly region 0
> MB_HW_1: ...
> MB_HW_2: ...

Chen Yu is currently looking at putting the word "TIER" into the
name, since there's some precedent for describing memory in "tiers".

Whatever naming scheme is used, the important part is how will users
find out what each schemata line actually means/controls.

> etc.
> 
> Or, did you have something else in mind?
> 
> My thinking is that we avoid adding complexity in the schemata file if
> we treat mapping these schema instances onto the hardware topology as
> an orthogonal problem.  So long as we have unique names in the schemata
> file, we can describe elsewhere what they relate to in the hardware.

Yes, exactly this.

> 
> Cheers
> ---Dave

-Tony