Message-ID: <aPoqbXmmhlbPRIb7@e133380.arm.com>
Date: Thu, 23 Oct 2025 15:04:22 +0100
From: Dave Martin <Dave.Martin@....com>
To: "Luck, Tony" <tony.luck@...el.com>
Cc: Reinette Chatre <reinette.chatre@...el.com>,
linux-kernel@...r.kernel.org, James Morse <james.morse@....com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
"H. Peter Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>,
x86@...nel.org, linux-doc@...r.kernel.org
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch

Hi Tony,

On Wed, Oct 22, 2025 at 09:21:03AM -0700, Luck, Tony wrote:
> Hi Dave,
>
> On Wed, Oct 22, 2025 at 03:58:08PM +0100, Dave Martin wrote:
> > Hi Tony,
> >
> > On Tue, Oct 21, 2025 at 01:59:36PM -0700, Luck, Tony wrote:
[...]
> > <soapbox>
> >
> > We could, in the same way that a vendor could wire a UART directly to
> > the pins of a regular mains power plug. They could stick a big label
> > on it saying exactly how the pins should be hooked up to another low-
> > voltage UART and not plugged into a mains power outlet... but you know
> > what's going to happen.
>
> The PDP 11/03 for undergraduate Comp Sci student use at my university had allegedly
> been student-proofed against such things. Oral history said you could wire 240V
> mains across input pins to get a 50 Hz clock. I didn't test this theory.
Now, there's an idea...
> > The whole point of a file-like interface is that the user doesn't (or
> > shouldn't) have to craft I/O directly at the syscall level. If they
> > have to do that, then the reasons for not relying on ioctl() or a
> > binary protocol melt away (like that UART).
> >
> > Because the easy, unsafe way of working with these files almost always
> > works, people are almost certainly going to use it, even if we tell
> > them not to (IMHO).
> >
> > </soapbox>
> >
> >
> > That said, for practical purposes, the interface is reliable enough
> > (for now). We probably shouldn't mess with it unless we can come up
> > with something that is clearly better.
> >
> > (I have some ideas, but I think it's off-topic, here.)
>
> Agreed off-topic ... but fixing it seems hard. What if I do:
>
> # echo -n "L3:0=" > schemata
>
> and then my control program dies?
Probably nothing?

In my hack for this, I buffered a partial line for each open struct file.
If the struct file survives the terminated program, something else
could append more to the incomplete line through any fd still open on
the struct file (as in my { { echo; ... echo; } >schemata; } shell
example).

Otherwise, when the file is closed with an incomplete line, an error
could be reported through close(). I implemented this, but it turns
out not to be a magic bullet -- lots of software doesn't check the
return value from close() / fclose(), and Linux's version of dup2()
just silently loses close-time errors on the fd being clobbered.
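
To make the buffering idea concrete, here is a minimal user-space sketch in
Python. It is not the kernel hack itself: the class name, the method names
and the use of ValueError to stand in for a close()-time error are all
assumptions made for illustration.

```python
from typing import List


class PartialLineBuffer:
    """Per-open-file state: holds an incomplete line until its newline arrives.

    Hypothetical stand-in for state attached to one open struct file.
    """

    def __init__(self) -> None:
        self._pending = b""

    def write(self, data: bytes) -> List[bytes]:
        """Append data and return only the lines completed by this write."""
        self._pending += data
        # Everything before the last newline is complete; the remainder
        # (possibly empty) stays buffered for a later write.
        *complete, self._pending = self._pending.split(b"\n")
        return complete

    def close(self) -> None:
        """Raise if an incomplete line is still buffered at close time."""
        if self._pending:
            leftover, self._pending = self._pending, b""
            # ValueError is an assumption; it models the error that
            # would be reported through close().
            raise ValueError(f"incomplete line at close: {leftover!r}")
```

With this model, writing b"L3:0=" completes no line, and a later write
through the same open file can finish it. If the file is closed first,
close() reports the leftover, which only helps if the caller checks.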

(dash, and probably other shells, undo redirections using dup2().
Dupping the victim fd before the dup2(), so that it can be closed
separately, can help -- as documented in the dup2() man page. But as
of today, most software probably doesn't do this. Some OSes seem to
have different dup2() behaviour that doesn't suffer from this problem.)

Anyway, all in all, I wasn't convinced that this approach created fewer
problems than it solved...
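
For reference, the dup-before-dup2() workaround mentioned above looks
roughly like this when written in Python. It is a sketch of the pattern the
dup2() man page describes, not code from any shell; the function name
redirect_checked is hypothetical.

```python
import errno
import os


def redirect_checked(src_fd: int, target_fd: int) -> None:
    """Make target_fd refer to src_fd's file without losing close errors.

    Hypothetical helper following the pattern in the dup2() man page.
    """
    try:
        # Extra descriptor for whatever target_fd refers to right now.
        saved_fd = os.dup(target_fd)
    except OSError as exc:
        if exc.errno != errno.EBADF:
            raise
        saved_fd = None  # target_fd was not open: nothing to close later

    try:
        # Atomically replace target_fd. Any error from the implicit
        # close of the old target is not reported by this call.
        os.dup2(src_fd, target_fd)
    finally:
        if saved_fd is not None:
            # Explicit close of the old file; os.close() raises OSError
            # if close() fails, so the caller gets to see the error.
            os.close(saved_fd)
```

The caller still has to handle the OSError, so this moves the problem
rather than removing it.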
[...]
> > > I'm starting to worry about this co-existence of old/new syntax for
> > > Intel region aware. Life seems simple if there is only one MB_HW
> > > connected to the legacy "MB". Updates to either will make both
> > > appear with new values when the schemata is read. E.g.
> > >
> > > # cat schemata
> > > MB:0=100
> > > #MB_HW=255
> > >
> > > # echo MB:0=50 > schemata
> > >
> > > # cat schemata
> > > MB:0=50
> > > #MB_HW=127
> > >
> > > But Intel will have several MB_HW controls, one for each region.
> > > [Schemata names TBD, but I'll just call them 0, 1, 2, 3 here]
> > >
> > > # cat schemata
> > > MB:0=100
> > > #MB_HW0=255
> > > #MB_HW1=255
> > > #MB_HW2=255
> > > #MB_HW3=255
> > >
> > > If the user sets just one of the HW controls:
> > >
> > > # echo MB_HW1=64
> > >
> > > what should resctrl display for the legacy "MB:" line?
> > >
> > > -Tony
> >
> > Erm, good question. I hadn't thought too carefully about the
> > region-aware case.
> >
> > I think it's reasonable to expect software that writes MB_HW<n>
> > independently to pay attention only to these specific schemata when
> > reading back -- a bit like accessing a C union.
> >
> > # echo 'MB:0=100' >schemata
> > # cat schemata
> > ->
> > MB:0=100
> > # MB_HW:0=255
> > # MB_HW0:0=255
> > # MB_HW1:0=255
> > # MB_HW2:0=255
> > # MB_HW3:0=255
> >
> > # echo 'MB:0=50' >schemata
> > # cat schemata
> > ->
> > MB:0=50
> > # MB_HW:0=128
> > # MB_HW0:0=128
> > # MB_HW1:0=128
> > # MB_HW2:0=128
> > # MB_HW3:0=128
> >
> > # echo 'MB_HW:0=127' >schemata
> > # cat schemata
> > ->
> > MB:0=50
> > # MB_HW:0=127
> > # MB_HW0:0=127
> > # MB_HW1:0=127
> > # MB_HW2:0=127
> > # MB_HW3:0=127
> >
> > # echo 'MB_HW1:0=64' >schemata
> > # cat schemata
> > ->
> > MB:0=???
> > # MB_HW:0=???
> > # MB_HW0:0=127
> > # MB_HW1:0=64
> > # MB_HW2:0=127
> > # MB_HW3:0=127
> >
> > The rules for populating the ??? entries could be designed to be
> > somewhat intuitive, or we could just do the easiest thing.
> >
> > So, could we just pick one, fixed, region to read the MB_HW value from?
> > Say, MB_HW0:
> >
> > MB:0=50
> > # MB_HW:0=127
> > # MB_HW0:0=127
> > # MB_HW1:0=64
> > # MB_HW2:0=127
> > # MB_HW3:0=127
> >
> > Or take the average across all regions:
> >
> > MB:0=44
> > # MB_HW:0=111
> > # MB_HW0:0=127
> > # MB_HW1:0=64
> > # MB_HW2:0=127
> > # MB_HW3:0=127
> >
> > The latter may be more costly or complex to implement, and I don't
> > know whether it is really useful. Software that knows about the
> > MB_HW<n> entries also knows that once you have looked at these, MB_HW
> > and MB tell you nothing else.
> >
> > What do you think?
> >
> > I'm wondering whether setting the MB_HW<n> independently may be quite a
> > specialised use case, which not everyone will want/need to do, but
> > that's an assumption on my part.
>
> It's difficult to guess what users will want to do. But it is likely
> the case that total available bandwidth to regions will be different
> (local DDR > remote DDR > CXL). So while the system will boot up with
> no throttling on any region, it may be useful to enforce more throttling
> on access to the slower regions.
>
> Rather than trying to make up some number to fill in the ?? for the MB:
> line, another option would be to stop showing the legacy MB: line in schemata
> as soon as the user shows they know about the direct HW access mode
> by writing any of the HW lines.
>
> Any sysadmin trying to mix and match legacy access with direct HW access
> is going to run into problems very quickly. In the spirit of not giving
> them the cable to connect mains to the UART, perhaps removing the
> foot-gun from the table might be a good option?
>
> -Tony
Quite possibly.
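
To compare the options on the table (a fixed region, an average across
regions, or hiding the legacy line), here is a small Python sketch of the
arithmetic. The 255 full-scale value comes from the examples in this
thread. The round-to-nearest scaling, the policy names and the function
names are assumptions, not what resctrl does.

```python
from typing import Dict, Optional

HW_MAX = 255  # full-scale MB_HW value used in the examples (assumed fixed)


def hw_to_percent(hw: int) -> int:
    """Scale 0..HW_MAX to a percentage, rounding to nearest (assumption)."""
    return (hw * 100 + HW_MAX // 2) // HW_MAX


def summarise_hw(regions: Dict[str, int], policy: str) -> Optional[int]:
    """Pick the value for the summary MB_HW line, or None to hide it.

    Policy names are hypothetical labels for the options discussed.
    """
    if policy == "fixed":
        # Always report one designated region, here MB_HW0.
        return regions["MB_HW0"]
    if policy == "average":
        # Integer mean across all regions, rounded to nearest.
        n = len(regions)
        return (sum(regions.values()) + n // 2) // n
    if policy == "hide":
        # Stop showing the legacy lines once a per-region control is set.
        return None
    raise ValueError(f"unknown policy: {policy}")


def legacy_mb_percent(regions: Dict[str, int], policy: str) -> Optional[int]:
    """Value for the legacy MB: line under the chosen policy."""
    hw = summarise_hw(regions, policy)
    return None if hw is None else hw_to_percent(hw)
```

With MB_HW0..3 set to 127, 64, 127, 127, the "fixed" policy gives
MB_HW 127 and MB 50, and "average" gives MB_HW 111 and MB 44, matching
the figures in the quoted examples.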

Ideally, we'd have some kind of generic interface, but (as with "MB")
there's always the risk that the hardware evolves in directions that
don't fit the abstraction.

For now, I will try to refocus the discussion back onto the schema
description topic. I think that's probably the easiest thing to get
nailed down before we try to figure out how to deal with the "shadow
schema" issue.

Cheers
---Dave