Message-ID: <aPkEb4CkJHZVDt0V@agluck-desk3>
Date: Wed, 22 Oct 2025 09:21:03 -0700
From: "Luck, Tony" <tony.luck@...el.com>
To: Dave Martin <Dave.Martin@....com>
CC: Reinette Chatre <reinette.chatre@...el.com>,
<linux-kernel@...r.kernel.org>, James Morse <james.morse@....com>, "Thomas
Gleixner" <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, "Borislav
Petkov" <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter
Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>, <x86@...nel.org>,
<linux-doc@...r.kernel.org>
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch

Hi Dave,

On Wed, Oct 22, 2025 at 03:58:08PM +0100, Dave Martin wrote:
> Hi Tony,
>
> On Tue, Oct 21, 2025 at 01:59:36PM -0700, Luck, Tony wrote:
> > Hi Dave,
> >
> > On Tue, Oct 21, 2025 at 03:37:35PM +0100, Dave Martin wrote:
> > > Hi Tony,
> > >
> > > On Mon, Oct 20, 2025 at 09:31:18AM -0700, Luck, Tony wrote:
>
> [...]
>
> > > > Changes to the schemata file are currently "staged" and then applied.
> > > > There's some filesystem level error/sanity checking during the parsing
> > > > phase, but maybe for MB some parts can also be delayed, and re-ordered
> > > > when architecture code applies the changes.
> > > >
> > > > E.g. while filesystem code could check min <= opt <= max. Architecture
> > > > code would be responsible to write the values to h/w in a sane manner
> > > > (assuming architecture cares about transient effects when things don't
> > > > conform to the ordering).
> > > >
> > > > E.g. User requests moving from min,opt,max = 10,20,30 to 40,50,60
> > > > Regardless of the order those requests appeared in the write(2) syscall
> > > > architecture bumps max to 60, then opt to 50, and finally min to 40.
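A minimal sketch of the ordering idea quoted above, written in Python rather than kernel C and not taken from any posted patch. The names `plan_updates` and `replay` are hypothetical. It assumes each of min/opt/max can be written independently and that the only constraint is min <= opt <= max after every individual write: fields that go up are written from the top down, then fields that go down are written from the bottom up.

```python
FIELDS = ("min", "opt", "max")  # hypothetical control names, lowest to highest


def plan_updates(old, new):
    """Return (field, value) steps that keep min <= opt <= max after each step."""
    for vals in (old, new):
        if not vals["min"] <= vals["opt"] <= vals["max"]:
            raise ValueError("values must satisfy min <= opt <= max")
    # Raise from the top down so a raised field never overtakes the one above it.
    raises = [(f, new[f]) for f in reversed(FIELDS) if new[f] > old[f]]
    # Lower from the bottom up so a lowered field never undercuts the one below it.
    lowers = [(f, new[f]) for f in FIELDS if new[f] < old[f]]
    return raises + lowers


def replay(old, steps):
    """Apply the steps one at a time, checking the invariant after each write."""
    state = dict(old)
    for field, value in steps:
        state[field] = value
        assert state["min"] <= state["opt"] <= state["max"]
    return state


old = {"min": 10, "opt": 20, "max": 30}
new = {"min": 40, "opt": 50, "max": 60}
print(plan_updates(old, new))
```

For the 10,20,30 to 40,50,60 example this yields max, then opt, then min, which is the order described above. Unchanged fields produce no step at all.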
> > >
> > > This could indeed be sorted out during staging, but I'm not
> > > sure that we can/should rely on it.
> > >
> > > If we treat the data coming from a single write() as a transaction, and
> > > stage the whole thing before executing it, that's fine. But I think
> > > this has to be viewed as an optimisation rather than guaranteed
> > > semantics.
> > >
> > >
> > > We told userspace that schemata is an S_IFREG regular file, so we have
> > > to accept a write() boundary anywhere in the stream.
> > >
> > > (In fact, resctrl chokes if a write boundary occurs in the middle of a
> > > line. In practice, stdio buffering and similar means that this issue
> > > turns out to be difficult to hit, except with shell scripts that try to
> > > emit a line piecemeal -- I have a partial fix for that knocking around,
> > > but this throws up other problems, so I gave up for the time being.)
> >
> > Is this worth the pain and complexity? Maybe just document the reality
> > of the implementation since day 1 of resctrl that each write(2) must
> > contain one or more lines, each terminated with "\n".
>
> <soapbox>
>
> We could, in the same way that a vendor could wire a UART directly to
> the pins of a regular mains power plug. They could stick a big label
> on it saying exactly how the pins should be hooked up to another low-
> voltage UART and not plugged into a mains power outlet... but you know
> what's going to happen.

The PDP 11/03 for undergraduate Comp Sci student use at my university had
allegedly been student-proofed against such things. Oral history said you could
wire 240V mains across input pins to get a 50 Hz clock. I didn't test this theory.

> The whole point of a file-like interface is that the user doesn't (or
> shouldn't) have to craft I/O directly at the syscall level. If they
> have to do that, then the reasons for not relying on ioctl() or a
> binary protocol melt away (like that UART).
>
> Because the easy, unsafe way of working with these files almost always
> works, people are almost certainly going to use it, even if we tell
> them not to (IMHO).
>
> </soapbox>
>
>
> That said, for practical purposes, the interface is reliable enough
> (for now). We probably shouldn't mess with it unless we can come up
> with something that is clearly better.
>
> (I have some ideas, but I think it's off-topic, here.)

Agreed off-topic ... but fixing it seems hard. What if I do:

  # echo -n "L3:0=" > schemata

and then my control program dies?
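
For contrast, the "each write(2) must contain one or more complete lines" convention mentioned earlier might look like this from userspace. This is only a sketch: `write_schemata_update` is a hypothetical helper, the path is whatever schemata file the caller is managing, and nothing here is an existing resctrl tool or API.

```python
import os


def write_schemata_update(path, updates):
    """Send every update in a single write(2), one newline-terminated line each."""
    # Normalise so each update ends with exactly one newline.
    payload = "".join(line.rstrip("\n") + "\n" for line in updates).encode("ascii")
    fd = os.open(path, os.O_WRONLY)
    try:
        written = os.write(fd, payload)  # one syscall, whole lines only
        if written != len(payload):
            raise OSError("short write: %d of %d bytes" % (written, len(payload)))
    finally:
        os.close(fd)
    return payload
```

A caller would pass something like `["L3:2=0xff", "MB:0=50"]`. Building the buffer first means a program that dies early has written either nothing or a set of complete lines, never a fragment such as "L3:0=".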
> > There are already so many ways that the schemata file does not behave
> > like a regular S_IFREG file. E.g. accepting a write to just update
> > one domain in a resource: # echo L3:2=0xff > schemata
>
> That still feels basically file-like. I can write something into a
> file, then something else can read what I wrote, interpret it in any
> way it likes, and write back something different for me to read.
>
> In our case, it is as if after each write() the kernel magically reads
> and rewrites the file before userspace gets a chance to do anything
> else. This doesn't work as a protocol between userspace processes, but
> the kernel can pull tricks that are not available to userspace -- so it
> can be made to work for user <-> kernel protocols (modulo the issues
> about write() boundaries etc.)
>
> > So describe schemata in terms of writing "update commands" rather
> > than "Lines"?
>
> That's reasonable. In practice, each line written is a request to the
> kernel to do something, but it's already the case that the kernel
> doesn't necessarily do exactly what was asked for (due to rounding,
> etc.)
>
>
> Overall, I think the current state of play is that we need to consider
> the lines to be independent "commands", and execute them in the order
> given.
>
> That's the model I've been assuming here.
>
>
> > > We also cannot currently rely on userspace closing the fd between
> > > "transactions". We never told userspace to do that, previously. We
> > > could make a new requirement, but it feels unexpected/unreasonable (?)
> > >
> > > > >
> > > > > "So, from now on, only write the things that you actually want to set."
> > > > >
> > > > > Does that sound about right?
> > > >
> > > > Users might still use their favorite editor on the schemata file and
> > > > so write everything, while only changing a subset. So if we don't go
> > > > for the full two-phase update I describe above this would be:
> > > >
> > > > "only *change* the things that you actually want to set".
> >
> > I misremembered where the check for "did the user change the value"
> > happened. I thought it was during parsing, but it is actually in
> > resctrl_arch_update_domains() after all input parsing is complete
> > and resctrl is applying changes. So unless we change things to work
> > the way I hallucinated, then ordering does matter the way you
> > described.
>
> Ah, right.
>
> There would be different ways to do this, but yes, that was my
> understanding of how things work today.
>
> > >
> > > [...]
> > >
> > > > -Tony
> > >
> > > This works if the schemata file is output in the right order (and the
> > > user doesn't change the order):
> > >
> > > # cat schemata
> > > MB:0=100;1=100
> > > # MB_HW:0=1024;1=1024
> > >
> > > ->
> > >
> > > # cat <<EOF >schemata
> > > MB:0=100;1=100
> > > MB_HW:0=512;1=512
> > > EOF
> > >
> > > ... though it may still be inefficient, if the lines are not staged
> > > together. The hardware memory bandwidth controls may get programmed
> > > twice, here -- though the final result is probably what was intended.
> > >
> > > I'd still prefer that we tell people that they should be doing this:
> > > # cat <<EOF >schemata
> > > MB_HW:0=512;1=512
> > > EOF
> > >
> > > ...if they are really trying to set MB_HW and don't care about the
> > > effect on MB?
> >
> > I'm starting to worry about this co-existence of old/new syntax for
> > Intel region aware. Life seems simple if there is only one MB_HW
> > connected to the legacy "MB". Updates to either will make both
> > appear with new values when the schemata is read. E.g.
> >
> > # cat schemata
> > MB:0=100
> > #MB_HW=255
> >
> > # echo MB:0=50 > schemata
> >
> > # cat schemata
> > MB:0=50
> > #MB_HW=127
> >
> > But Intel will have several MB_HW controls, one for each region.
> > [Schemata names TBD, but I'll just call them 0, 1, 2, 3 here]
> >
> > # cat schemata
> > MB:0=100
> > #MB_HW0=255
> > #MB_HW1=255
> > #MB_HW2=255
> > #MB_HW3=255
> >
> > If the user sets just one of the HW controls:
> >
> > # echo MB_HW1=64 > schemata
> >
> > what should resctrl display for the legacy "MB:" line?
> >
> > -Tony
>
> Erm, good question. I hadn't thought too carefully about the
> region-aware case.
>
> I think it's reasonable to expect software that writes MB_HW<n>
> independently to pay attention only to these specific schemata when
> reading back -- a bit like accessing a C union.
>
> # echo 'MB:0=100' >schemata
> # cat schemata
> ->
> MB:0=100
> # MB_HW:0=255
> # MB_HW0:0=255
> # MB_HW1:0=255
> # MB_HW2:0=255
> # MB_HW3:0=255
>
> # echo 'MB:0=50' >schemata
> # cat schemata
> ->
> MB:0=50
> # MB_HW:0=128
> # MB_HW0:0=128
> # MB_HW1:0=128
> # MB_HW2:0=128
> # MB_HW3:0=128
>
> # echo 'MB_HW:0=127' >schemata
> # cat schemata
> ->
> MB:0=50
> # MB_HW:0=127
> # MB_HW0:0=127
> # MB_HW1:0=127
> # MB_HW2:0=127
> # MB_HW3:0=127
>
> # echo 'MB_HW1:0=64' >schemata
> # cat schemata
> ->
> MB:0=???
> # MB_HW:0=???
> # MB_HW0:0=127
> # MB_HW1:0=64
> # MB_HW2:0=127
> # MB_HW3:0=127
>
> The rules for populating the ??? entries could be designed to be
> somewhat intuitive, or we could just do the easiest thing.
>
> So, could we just pick one, fixed, region to read the MB_HW value from?
> Say, MB_HW0:
>
> MB:0=50
> # MB_HW:0=127
> # MB_HW0:0=127
> # MB_HW1:0=64
> # MB_HW2:0=127
> # MB_HW3:0=127
>
> Or take the average across all regions:
>
> MB:0=44
> # MB_HW:0=111
> # MB_HW0:0=127
> # MB_HW1:0=64
> # MB_HW2:0=127
> # MB_HW3:0=127
>
> The latter may be more costly or complex to implement, and I don't
> know whether it is really useful. Software that knows about the
> MB_HW<n> entries also knows that once you have looked at these, MB_HW
> and MB tell you nothing else.
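
The numbers in the two options above can be checked with a few lines of Python. The scaling is an assumption inferred from the examples in this thread (MB_HW 255 corresponds to MB 100, rounded to nearest); the real conversion would be per-arch and is not specified here. The function names are hypothetical.

```python
HW_MAX = 255  # assumed full-scale MB_HW value, taken from the examples in this thread


def hw_to_percent(hw):
    """Scale an MB_HW value to a legacy MB percentage, rounding to nearest."""
    return (hw * 100 + HW_MAX // 2) // HW_MAX


def summarise_fixed(regions, index=0):
    """Option 1: report one fixed region (MB_HW0 by default) as MB_HW."""
    hw = regions[index]
    return hw, hw_to_percent(hw)


def summarise_average(regions):
    """Option 2: report the truncated mean across all regions as MB_HW."""
    hw = sum(regions) // len(regions)
    return hw, hw_to_percent(hw)


regions = [127, 64, 127, 127]  # MB_HW0..MB_HW3 after writing MB_HW1:0=64
print(summarise_fixed(regions))    # (127, 50)
print(summarise_average(regions))  # (111, 44)
```

Both results match the MB_HW and MB values shown in the two listings, so under this assumed scaling the examples are self-consistent.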
>
> What do you think?
>
> I'm wondering whether setting the MB_HW<n> independently may be quite a
> specialised use case, which not everyone will want/need to do, but
> that's an assumption on my part.

It's difficult to guess what users will want to do. But it is likely
the case that total available bandwidth to regions will be different
(local DDR > remote DDR > CXL). So while the system will boot up with
no throttling on any region, it may be useful to enforce more throttling
on access to the slower regions.

Rather than trying to make up some number to fill in the ?? for the MB:
line, another option would be to stop showing the legacy MB: line in schemata
as soon as the user shows they know about the direct HW access mode
by writing any of the HW lines.

Any sysadmin trying to mix and match legacy access with direct HW access
is going to run into problems very quickly. In the spirit of not giving
them the cable to connect mains to the UART, perhaps removing the
foot-gun from the table might be a good option?

> Cheers
> ---Dave

-Tony