Message-ID: <aPjxAIudLd16aU4Z@e133380.arm.com>
Date: Wed, 22 Oct 2025 15:58:08 +0100
From: Dave Martin <Dave.Martin@....com>
To: "Luck, Tony" <tony.luck@...el.com>
Cc: Reinette Chatre <reinette.chatre@...el.com>,
linux-kernel@...r.kernel.org, James Morse <james.morse@....com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
"H. Peter Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>,
x86@...nel.org, linux-doc@...r.kernel.org
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be
 per-arch

Hi Tony,

On Tue, Oct 21, 2025 at 01:59:36PM -0700, Luck, Tony wrote:
> Hi Dave,
>
> On Tue, Oct 21, 2025 at 03:37:35PM +0100, Dave Martin wrote:
> > Hi Tony,
> >
> > On Mon, Oct 20, 2025 at 09:31:18AM -0700, Luck, Tony wrote:
[...]
> > > Changes to the schemata file are currently "staged" and then applied.
> > > There's some filesystem level error/sanity checking during the parsing
> > > phase, but maybe for MB some parts can also be delayed, and re-ordered
> > > when architecture code applies the changes.
> > >
> > > E.g. while filesystem code could check min <= opt <= max. Architecture
> > > code would be responsible to write the values to h/w in a sane manner
> > > (assuming architecture cares about transient effects when things don't
> > > conform to the ordering).
> > >
> > > E.g. User requests moving from min,opt,max = 10,20,30 to 40,50,60
> > > Regardless of the order those requests appeared in the write(2) syscall
> > > architecture bumps max to 60, then opt to 50, and finally min to 40.
> >
> > This could indeed be sorted out during staging, but I'm not
> > sure that we can/should rely on it.
> >
> > If we treat the data coming from a single write() as a transaction, and
> > stage the whole thing before executing it, that's fine. But I think
> > this has to be viewed as an optimisation rather than guaranteed
> > semantics.
> >
> >
> > We told userspace that schemata is an S_IFREG regular file, so we have
> > to accept a write() boundary anywhere in the stream.
> >
> > (In fact, resctrl chokes if a write boundary occurs in the middle of a
> > line. In practice, stdio buffering and similar means that this issue
> > turns out to be difficult to hit, except with shell scripts that try to
> > emit a line piecemeal -- I have a partial fix for that knocking around,
> > but this throws up other problems, so I gave up for the time being.)
>
> Is this worth the pain and complexity? Maybe just document the reality
> of the implementation since day 1 of resctrl that each write(2) must
> contain one or more lines, each terminated with "\n".

<soapbox>

We could, in the same way that a vendor could wire a UART directly to
the pins of a regular mains power plug. They could stick a big label
on it saying exactly how the pins should be hooked up to another
low-voltage UART and not plugged into a mains power outlet... but you
know what's going to happen.

The whole point of a file-like interface is that the user doesn't (or
shouldn't) have to craft I/O directly at the syscall level. If they
have to do that, then the reasons for not relying on ioctl() or a
binary protocol melt away (like that UART).

Because the easy, unsafe way of working with these files almost always
works, people are almost certainly going to use it, even if we tell
them not to (IMHO).

</soapbox>

That said, for practical purposes, the interface is reliable enough
(for now). We probably shouldn't mess with it unless we can come up
with something that is clearly better.

(I have some ideas, but I think it's off-topic, here.)

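For what it's worth, here is a minimal sketch of what "crafting I/O at
the syscall level" looks like for userspace that wants to be sure each
update line reaches the kernel whole: build the complete,
newline-terminated text first and hand it over in one write(2). This is
Python purely for illustration; write_schemata() is a made-up helper
name, not an existing API, and it only demonstrates the single-write
idea.

```python
import os


def write_schemata(path, updates):
    """Send all update lines to `path` in a single write(2) call."""
    # Terminate every line with "\n" so no write boundary can fall mid-line.
    payload = "".join(line.rstrip("\n") + "\n" for line in updates).encode("ascii")
    fd = os.open(path, os.O_WRONLY)
    try:
        written = os.write(fd, payload)  # exactly one write(2) for the whole batch
        if written != len(payload):
            raise OSError("short write to " + path)
    finally:
        os.close(fd)
    return payload
```

On a real system the path would be the schemata file of a control
group under the resctrl mount (conventionally /sys/fs/resctrl), e.g.
write_schemata("/sys/fs/resctrl/schemata", ["MB:0=50"]).
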
> There are already so many ways that the schemata file does not behave
> like a regular S_IFREG file. E.g. accepting a write to just update
> one domain in a resource: # echo L3:2=0xff > schemata

That still feels basically file-like. I can write something into a
file, then something else can read what I wrote, interpret it in any
way it likes, and write back something different for me to read.

In our case, it is as if after each write() the kernel magically reads
and rewrites the file before userspace gets a chance to do anything
else. This doesn't work as a protocol between userspace processes, but
the kernel can pull tricks that are not available to userspace -- so it
can be made to work for user <-> kernel protocols (modulo the issues
about write() boundaries etc.)

> So describe schemata in terms of writing "update commands" rather
> than "Lines"?

That's reasonable. In practice, each line written is a request to the
kernel to do something, but it's already the case that the kernel
doesn't necessarily do exactly what was asked for (due to rounding,
etc.)

Overall, I think the current state of play is that we need to consider
the lines to be independent "commands", and execute them in the order
given.

That's the model I've been assuming here.

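To make that model concrete, a toy sketch (Python, illustration only):
each line is parsed into a command and applied immediately, in the
order given, so a later line for the same resource and domain overrides
an earlier one. parse_commands() and apply_in_order() are hypothetical
names; values are kept as opaque strings, and none of the rounding or
MB/MB_HW coupling discussed in this thread is modelled.

```python
def parse_commands(text):
    """Split schemata-style text into (resource, {domain: value}) commands."""
    commands = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        resource, sep, settings = line.partition(":")
        if not sep:
            raise ValueError("missing ':' in " + repr(line))
        domains = {}
        for item in settings.split(";"):
            dom, eq, val = item.partition("=")
            if not eq:
                raise ValueError("missing '=' in " + repr(item))
            domains[int(dom)] = val.strip()  # value stays an opaque string
        commands.append((resource.strip(), domains))
    return commands


def apply_in_order(state, text):
    """Apply each command in the order written; later lines win."""
    for resource, domains in parse_commands(text):
        state.setdefault(resource, {}).update(domains)
    return state
```

Only the domains named in a command are touched, which mirrors the
"echo L3:2=0xff > schemata" style of partial update mentioned above.
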
> > We also cannot currently rely on userspace closing the fd between
> > "transactions". We never told userspace to do that, previously. We
> > could make a new requirement, but it feels unexpected/unreasonable (?)
> >
> > > >
> > > > "So, from now on, only write the things that you actually want to set."
> > > >
> > > > Does that sound about right?
> > >
> > > Users might still use their favorite editor on the schemata file and
> > > so write everything, while only changing a subset. So if we don't go
> > > for the full two-phase update I describe above this would be:
> > >
> > > "only *change* the things that you actually want to set".
>
> I misremembered where the check for "did the user change the value"
> happened. I thought it was during parsing, but it is actually in
> resctrl_arch_update_domains() after all input parsing is complete
> and resctrl is applying changes. So unless we change things to work
> the way I hallucinated, then ordering does matter the way you
> described.

Ah, right.

There would be different ways to do this, but yes, that was my
understanding of how things work today.

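For the min,opt,max example quoted earlier, one way to pick a write
order that never violates min <= opt <= max is: widen the range first
(raise max, lower min), then move opt, then narrow the range. A rough
Python sketch of that idea follows; safe_write_order() is a
hypothetical helper, and whether any architecture would actually want
this is an open question.

```python
def safe_write_order(old, new):
    """Order the writes so that min <= opt <= max holds after every step.

    `old` and `new` are dicts with keys "min", "opt" and "max", and both
    are assumed to satisfy the invariant themselves.
    """
    widen, narrow = [], []
    # Raising max or lowering min can never break the invariant: do them first.
    if new["max"] > old["max"]:
        widen.append(("max", new["max"]))
    elif new["max"] < old["max"]:
        narrow.append(("max", new["max"]))
    if new["min"] < old["min"]:
        widen.append(("min", new["min"]))
    elif new["min"] > old["min"]:
        narrow.append(("min", new["min"]))
    # opt moves while the range is at its widest.
    middle = [("opt", new["opt"])] if new["opt"] != old["opt"] else []
    # Lowering max or raising min happens last, once opt is in place.
    return widen + middle + narrow
```

For the move from 10,20,30 to 40,50,60 this yields max=60, then opt=50,
then min=40, which is the order Tony described.
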
> >
> > [...]
> >
> > > -Tony
> >
> > This works if the schemata file is output in the right order (and the
> > user doesn't change the order):
> >
> > # cat schemata
> > MB:0=100;1=100
> > # MB_HW:0=1024;1=1024
> >
> > ->
> >
> > # cat <<EOF >schemata
> > MB:0=100;1=100
> > MB_HW:0=512;1=512
> > EOF
> >
> > ... though it may still be inefficient, if the lines are not staged
> > together. The hardware memory bandwidth controls may get programmed
> > twice, here -- though the final result is probably what was intended.
> >
> > I'd still prefer that we tell people that they should be doing this:
> > # cat <<EOF >schemata
> > MB_HW:0=512;1=512
> > EOF
> >
> > ...if they are really trying to set MB_HW and don't care about the
> > effect on MB?
>
> I'm starting to worry about this co-existence of old/new syntax for
> Intel region aware. Life seems simple if there is only one MB_HW
> connected to the legacy "MB". Updates to either will make both
> appear with new values when the schemata is read. E.g.
>
> # cat schemata
> MB:0=100
> #MB_HW=255
>
> # echo MB:0=50 > schemata
>
> # cat schemata
> MB:0=50
> #MB_HW=127
>
> But Intel will have several MB_HW controls, one for each region.
> [Schemata names TBD, but I'll just call them 0, 1, 2, 3 here]
>
> # cat schemata
> MB:0=100
> #MB_HW0=255
> #MB_HW1=255
> #MB_HW2=255
> #MB_HW3=255
>
> If the user sets just one of the HW controls:
>
> # echo MB_HW1=64
>
> what should resctrl display for the legacy "MB:" line?
>
> -Tony

Erm, good question. I hadn't thought too carefully about the
region-aware case.

I think it's reasonable to expect software that writes MB_HW<n>
independently to pay attention only to these specific schemata when
reading back -- a bit like accessing a C union.

# echo 'MB:0=100' >schemata
# cat schemata
->
MB:0=100
# MB_HW:0=255
# MB_HW0:0=255
# MB_HW1:0=255
# MB_HW2:0=255
# MB_HW3:0=255

# echo 'MB:0=50' >schemata
# cat schemata
->
MB:0=50
# MB_HW:0=128
# MB_HW0:0=128
# MB_HW1:0=128
# MB_HW2:0=128
# MB_HW3:0=128

# echo 'MB_HW:0=127' >schemata
# cat schemata
->
MB:0=50
# MB_HW:0=127
# MB_HW0:0=127
# MB_HW1:0=127
# MB_HW2:0=127
# MB_HW3:0=127

# echo 'MB_HW1:0=64' >schemata
# cat schemata
->
MB:0=???
# MB_HW:0=???
# MB_HW0:0=127
# MB_HW1:0=64
# MB_HW2:0=127
# MB_HW3:0=127

The rules for populating the ??? entries could be designed to be
somewhat intuitive, or we could just do the easiest thing.

So, could we just pick one, fixed, region to read the MB_HW value from?
Say, MB_HW0:

MB:0=50
# MB_HW:0=127
# MB_HW0:0=127
# MB_HW1:0=64
# MB_HW2:0=127
# MB_HW3:0=127

Or take the average across all regions:

MB:0=44
# MB_HW:0=111
# MB_HW0:0=127
# MB_HW1:0=64
# MB_HW2:0=127
# MB_HW3:0=127

The latter may be more costly or complex to implement, and I don't
know whether it is really useful. Software that knows about the
MB_HW<n> entries also knows that once you have looked at these, MB_HW
and MB tell you nothing else.

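The two options can be sketched as follows (Python, illustration
only). The 255 full-scale value and the round-to-nearest conversion are
assumptions read off the example numbers above, and legacy_view() is a
made-up name rather than anything in resctrl.

```python
HW_MAX = 255  # assumption: full scale of MB_HW, taken from the examples


def hw_to_percent(hw):
    """Convert an MB_HW value to a legacy MB percentage, rounded to nearest."""
    return (hw * 100 + HW_MAX // 2) // HW_MAX


def legacy_view(regions, policy="fixed"):
    """Return (MB, MB_HW) for one domain, given its per-region MB_HW<n> values."""
    if policy == "fixed":
        hw = regions[0]                    # always report region 0
    elif policy == "average":
        hw = sum(regions) // len(regions)  # integer mean, truncated
    else:
        raise ValueError("unknown policy: " + policy)
    return hw_to_percent(hw), hw
```

With the per-region values [127, 64, 127, 127], the fixed policy gives
MB=50, MB_HW=127 and the average policy gives MB=44, MB_HW=111.
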
What do you think?

I'm wondering whether setting the MB_HW<n> independently may be quite a
specialised use case, which not everyone will want/need to do, but
that's an assumption on my part.

Cheers
---Dave