Message-ID: <aPearyfcnpJJ/e06@e133380.arm.com>
Date: Tue, 21 Oct 2025 15:37:35 +0100
From: Dave Martin <Dave.Martin@....com>
To: "Luck, Tony" <tony.luck@...el.com>
Cc: Reinette Chatre <reinette.chatre@...el.com>,
linux-kernel@...r.kernel.org, James Morse <james.morse@....com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
"H. Peter Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>,
x86@...nel.org, linux-doc@...r.kernel.org
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
Hi Tony,
On Mon, Oct 20, 2025 at 09:31:18AM -0700, Luck, Tony wrote:
> On Mon, Oct 20, 2025 at 04:50:38PM +0100, Dave Martin wrote:
> > Hi Reinette,
> >
> > On Fri, Oct 17, 2025 at 08:59:45AM -0700, Reinette Chatre wrote:
[...]
> > > By extension I assume that software that understands a schema that is introduced
> > > after the "relationship" format is established can be expected to understand the
> > > format and thus these new schemata do not require the '#' prefix. Even if
> > > a new schema is introduced with a single control it can be followed by a new child
> > > control without a '#' prefix a couple of kernel releases later. By this point it
> > > should hopefully be understood by user space that it should not write entries it does
> > > not understand.
> >
> > Generally, yes.
> >
> > I think that boils down to: "OK, previously you could just tweak bits
> > of the whole schemata file you read and write the whole thing back,
> > and the effect would be what you intuitively expected. But in future
> > different schemata in the file may not be independent of one another.
> > We'll warn you which things might not be independent, but we may not
> > describe exactly how they affect each other."
>
> Changes to the schemata file are currently "staged" and then applied.
> There's some filesystem level error/sanity checking during the parsing
> phase, but maybe for MB some parts can also be delayed, and re-ordered
> when architecture code applies the changes.
>
> E.g. while filesystem code could check min <= opt <= max. Architecture
> code would be responsible to write the values to h/w in a sane manner
> (assuming architecture cares about transient effects when things don't
> conform to the ordering).
>
> E.g. User requests moving from min,opt,max = 10,20,30 to 40,50,60
> Regardless of the order those requests appeared in the write(2) syscall
> architecture bumps max to 60, then opt to 50, and finally min to 40.
This could indeed be sorted out during staging, but I'm not
sure that we can/should rely on it.
If we treat the data coming from a single write() as a transaction, and
stage the whole thing before executing it, that's fine. But I think
this has to be viewed as an optimisation rather than guaranteed
semantics.
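
To illustrate the ordering in Tony's example, here is a minimal userspace
sketch, not existing resctrl code. It assumes the min/opt/max controls
discussed in this thread and a hypothetical helper name, plan_update. It
lowers any decreasing values lowest-first, then raises any increasing
values highest-first, so that min <= opt <= max holds after each
individual step:

```shell
#!/bin/sh
# Hypothetical helper (illustration only, not part of resctrl).
# Prints the order in which the three controls could be programmed so that
# min <= opt <= max stays true after every single step.
# Usage: plan_update OLD_MIN OLD_OPT OLD_MAX NEW_MIN NEW_OPT NEW_MAX
plan_update() {
    old_min=$1; old_opt=$2; old_max=$3
    new_min=$4; new_opt=$5; new_max=$6

    # Pass 1: apply decreases, lowest control first.
    if [ "$new_min" -lt "$old_min" ]; then echo "min=$new_min"; fi
    if [ "$new_opt" -lt "$old_opt" ]; then echo "opt=$new_opt"; fi
    if [ "$new_max" -lt "$old_max" ]; then echo "max=$new_max"; fi

    # Pass 2: apply increases, highest control first.
    if [ "$new_max" -gt "$old_max" ]; then echo "max=$new_max"; fi
    if [ "$new_opt" -gt "$old_opt" ]; then echo "opt=$new_opt"; fi
    if [ "$new_min" -gt "$old_min" ]; then echo "min=$new_min"; fi

    return 0
}
```

For the 10,20,30 to 40,50,60 case, `plan_update 10 20 30 40 50 60` prints
max=60, then opt=50, then min=40, one per line. Unchanged values are
skipped. Whether an architecture would need or want this ordering is a
separate question.
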
We told userspace that schemata is an S_IFREG regular file, so we have
to accept a write() boundary anywhere in the stream.
(In fact, resctrl chokes if a write boundary occurs in the middle of a
line. In practice, stdio buffering and the like mean that this issue
turns out to be difficult to hit, except with shell scripts that try to
emit a line piecemeal -- I have a partial fix for that knocking around,
but this throws up other problems, so I gave up for the time being.)
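
As a userspace-side illustration of avoiding the piecemeal case, a script
can assemble each line in a variable and emit it with one printf. This is
only a sketch: write_schema_line is a hypothetical helper name, and it
assumes that the shell's printf builtin flushes a short line as a single
write(), which is usual but not guaranteed:

```shell
#!/bin/sh
# Hypothetical helper (illustration only): build one complete schemata line
# in a shell variable, then emit it with a single printf, rather than
# emitting the pieces with separate commands such as
#   { printf 'MB:0=50;'; printf '1=50\n'; } >schemata
# Usage: write_schema_line FILE RESOURCE DOMAIN=VALUE...
write_schema_line() {
    file=$1; resource=$2
    shift 2

    # Join the DOMAIN=VALUE entries with ';'.
    line=
    for entry in "$@"; do
        line="${line:+$line;}$entry"
    done

    # One printf call for the whole line, including the trailing newline.
    printf '%s:%s\n' "$resource" "$line" >"$file"
}
```

For example, `write_schema_line schemata MB 0=50 1=50` emits the line
MB:0=50;1=50 in one go.
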
We also cannot currently rely on userspace closing the fd between
"transactions". We never told userspace to do that, previously. We
could make a new requirement, but it feels unexpected/unreasonable (?)
> >
> > "So, from now on, only write the things that you actually want to set."
> >
> > Does that sound about right?
>
> Users might still use their favorite editor on the schemata file and
> so write everything, while only changing a subset. So if we don't go
> for the full two-phase update I describe above this would be:
>
> "only *change* the things that you actually want to set".
[...]
> -Tony
This works if the schemata file is output in the right order (and the
user doesn't change the order):
# cat schemata
MB:0=100;1=100
# MB_HW:0=1024;1=1024
->
# cat <<EOF >schemata
MB:0=100;1=100
MB_HW:0=512;1=512
EOF
... though it may still be inefficient, if the lines are not staged
together. The hardware memory bandwidth controls may get programmed
twice, here -- though the final result is probably what was intended.
I'd still prefer that we tell people that they should be doing this:
# cat <<EOF >schemata
MB_HW:0=512;1=512
EOF
...if they are really trying to set MB_HW and don't care about the
effect on MB?
Cheers
---Dave