[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aNU5nCklRhuc4u3X@e133380.arm.com>
Date: Thu, 25 Sep 2025 13:46:20 +0100
From: Dave Martin <Dave.Martin@....com>
To: Reinette Chatre <reinette.chatre@...el.com>
Cc: linux-kernel@...r.kernel.org, Tony Luck <tony.luck@...el.com>,
James Morse <james.morse@....com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
"H. Peter Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>,
x86@...nel.org, linux-doc@...r.kernel.org
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be
per-arch
Hi Reinette,
On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> On 9/22/25 7:39 AM, Dave Martin wrote:
> > Hi Reinette,
> >
> > Thanks for the review.
> >
> > On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
> >> Hi Dave,
> >>
> >> nits:
> >> Please use the subject prefix "x86,fs/resctrl" to be consistent with other
> >> resctrl code (and was established by Arm :)).
> >> Also please use upper case for acronym mba->MBA.
> >
> > Ack (the local custom in the MPAM code is to use "mba", but arguably,
> > the meaning is not quite the same -- I'll change it.)
>
> I am curious what the motivation is for the custom? Knowing this will help
> me to keep things consistent when the two worlds meet.
I think this has just evolved over time. On the x86 side, MBA is a
specific architectural feature, but on the MPAM side the architecture
doesn't really have a name for the same thing. Memory bandwidth is a
concept, but a few different types of control are defined for it, with
different names.
So, for the MPAM driver "mba" is more of a software concept than
something in a published spec: it's the glue that attaches to "MB"
resource as seen through resctrl.
(This isn't official though; it's just the mental model that I have
formed.)
>
> >>> The control value parser for the MB resource currently coerces the
> >>> memory bandwidth percentage value from userspace to be an exact
> >>> multiple of the bw_gran parameter.
> >>
> >> (to help be specific)
> >> "the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"?
> >
> > "bw_gran" was intended as an informal shorthand for the abstract
> > parameter (exposed both in the field you mention and through the
> > bandiwidth_gran file in resctrl).
>
> I do not see a need for being abstract since the bandwidth_gran file exposes
> the field verbatim.
Sure; that was just my thought process.
> > I can rewrite it as per your suggestion, but this could be read as
> > excluding the bandwidth_gran file. Would it make sense just to write
> > it out longhand? For now, I've rewritten it as follows:
>
> Since the bandwidth_gran file exposes rdt_resource::resctrl_membw::bw_gran
> it is not clear to me how being specific excludes the bandwidth_gran file.
>
> >
> > | The control value parser for the MB resource currently coerces the
> > | memory bandwidth percentage value from userspace to be an exact
> > | multiple of the bandwidth granularity parameter.
>
> If you want to include the bandwidth_gran file then the above could be
> something like:
>
> The control value parser for the MB resource coerces the memory
> bandwidth percentage value from userspace to be an exact multiple
> of the bandwidth granularity parameter that is exposed by the
> bandwidth_gran resctrl file.
>
> I still think that replacing "the bandwidth granularity parameter" with
> "rdt_resource::resctrl_membw::bw_gran" will help to be more specific.
That's fine. I'll change as per your original suggestion.
> > |
> > | On MPAM systems, this results in somewhat worse-than-worst-case
> > | rounding, since the bandwidth granularity advertised to resctrl by the
> > | MPAM driver is in general only an approximation [...]
> >
> > (I'm happy to go with your suggestion if you're not keen on this,
> > though.)
> >
> >>> On MPAM systems, this results in somewhat worse-than-worst-case
> >>> rounding, since bw_gran is in general only an approximation to the
> >>> actual hardware granularity on these systems, and the hardware
> >>> bandwidth allocation control value is not natively a percentage --
> >>> necessitating a further conversion in the resctrl_arch_update_domains()
> >>> path, regardless of the conversion done at parse time.
> >>>
> >>> Allow the arch to provide its own parse-time conversion that is
> >>> appropriate for the hardware, and move the existing conversion to x86.
> >>> This will avoid accumulated error from rounding the value twice on MPAM
> >>> systems.
> >>>
> >>> Clarify the documentation, but avoid overly exact promises.
> >>>
> >>> Clamping to bw_min and bw_max still feels generic: leave it in the core
> >>> code, for now.
> >>
> >> Sounds like MPAM may be ready to start the schema parsing discussion again?
> >> I understand that MPAM has a few more ways to describe memory bandwidth as
> >> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing
> >> schema format to user space, which seems like a good idea for new schema.
> >
> > My own ideas in this area are a little different, though I agree with
> > the general idea.
>
> Should we expect a separate proposal from James?
At some point, yes. We still need to have a chat about it.
Right now, I was just throwing an idea out there.
> > I'll respond separately on that, to avoid this thread getting off-topic.
>
> Much appreciated.
>
> >
> > For this patch, my aim was to avoid changing anything unnecessarily.
>
> Understood. More below as I try to understand the details but it does not
> really sound as though the current interface works that great for MPAM. If I
> understand correctly this patch enables MPAM to use existing interface for
> its memory bandwidth allocations but doing so does not enable users to
> obtain benefit of hardware capabilities. For that users would want to use
> the new interface?
In ideal world, probably, yes.
Since not all use cases will care about full precision, the MB resource
(approximated for MPAM) should be fine for a lot of people, but I
expect that sooner or later somebody will want more exact control.
> >>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
> >>> the other tests except for the NONCONT_CAT tests, which do not seem to
> >>> be supported in my configuration -- and have nothing to do with the
> >>> code touched by this patch).
> >>
> >> Is the NONCONT_CAT test failing (i.e printing "not ok")?
> >>
> >> The NONCONT_CAT tests may print error messages as debug information as part of
> >> running, but these errors are expected as part of the test. The test should accurately
> >> state whether it passed or failed though. For example, below attempts to write
> >> a non-contiguous CBM to a system that does not support non-contiguous masks.
> >> This fails as expected, error messages printed as debugging and thus the test passes
> >> with an "ok".
> >>
> >> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument
> >> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected
> >> ok 5 L3_NONCONT_CAT: test
> >
> > I don't think that this was anything to do with my changes, but I don't
> > still seem to have the test output. (Since this test has to do with
> > bitmap schemata (?), it seemed unlikely to be affected by changes to
> > bw_validate().)
>
> I agree that this should not have anything to do with this patch. My concern
> is that I understood that the test failed for a feature that is not supported.
> If this is the case then there may be a problem with the test. The test should
> not fail if the feature is not supported but instead skip the test.
I'll try to capture more output from this when I re-run it, so that we
can figure out what this is.
> >>> Notes:
> >>>
> >>> I put the x86 version out of line in order to avoid having to move
> >>> struct rdt_resource and its dependencies into resctrl_types.h -- which
> >>> would create a lot of diff noise. Schemata writes from userspace have
> >>> a high overhead in any case.
> >>
> >> Sounds good, I expect compiler will inline.
> >
> > The function and caller are in separate translation units, so unless
> > LTO is used, I don't think the function will be inlined.
>
> Thanks, yes, indeed.
>
> >
> >>>
> >>> For MPAM the conversion will be a no-op, because the incoming
> >>> percentage from the core resctrl code needs to be converted to hardware
> >>> representation in the driver anyway.
> >>
> >> (addressed below)
> >>
> >>>
> >>> Perhaps _all_ the types should move to resctrl_types.h.
> >>
> >> Can surely consider when there is a good motivation.
> >>
> >>>
> >>> For now, I went for the smallest diffstat...
> >
> > I'll assume the motivation is not strong enough for now, but shout if
> > you disagree.
>
> I agree.
OK, I'll leave that as-is for now, then.
> >
> > [...]
> >
> >>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> >>> index c7949dd44f2f..a1d0469d6dfb 100644
> >>> --- a/Documentation/filesystems/resctrl.rst
> >>> +++ b/Documentation/filesystems/resctrl.rst
> >>> @@ -143,12 +143,11 @@ with respect to allocation:
> >>> user can request.
> >>>
> >>> "bandwidth_gran":
> >>> - The granularity in which the memory bandwidth
> >>> + The approximate granularity in which the memory bandwidth
> >>> percentage is allocated. The allocated
> >>> b/w percentage is rounded off to the next
> >>> - control step available on the hardware. The
> >>> - available bandwidth control steps are:
> >>> - min_bandwidth + N * bandwidth_gran.
> >>> + control step available on the hardware. The available
> >>> + steps are at least as small as this value.
> >>
> >> A bit difficult to parse for me.
> >> Is "at least as small as" same as "at least"?
> >
> > It was supposed to mean: "The available steps are no larger than this
> > value."
>
> This is clear to me, especially when compared with the planned addition to
> "Memory bandwidth Allocation and monitoring" ... but I do find it contradicting
> the paragraph below (more below).
>
> >
> > Formally My expectation is that this value is the smallest integer
> > number of percent which is not smaller than the apparent size of any
> > individual rounding step. Equivalently, this is the smallest number g
>
> Considering the two statements:
> - "The available steps are no larger than this value."
> - "this value ... is not smaller than the apparent size of any individual rounding step"
>
> The "not larger" and "not smaller" sounds like all these words just end up saying that
> this is the step size?
They are intended to be the same statement: A <= B versus
B >= A respectively.
But I'd be the first to admit that the wording is a bit twisted!
(I wouldn't be astonshed if I got something wrong somewhere.)
See below for an alternative way of describing this that might be more
intuitive.
>
> > for which writing "MB: 0=x" and "MB: 0=y" yield different
> > configurations for every in-range x and where y = x + g and y is also
> > in-range.
> >
> > That's a bit of a mouthful, though. If you can think of a more
> > succinct way of putting it, I'm open to suggestions!
> >
> >> Please note that the documentation has a section "Memory bandwidth Allocation
> >> and monitoring" that also contains these exact promises.
> >
> > Hmmm, somehow I completely missed that.
> >
> > Does the following make sense? Ideally, there would be a simpler way
> > to describe the discrepancy between the reported and actual values of
> > bw_gran...
> >
> > | Memory bandwidth Allocation and monitoring
> > | ==========================================
> > |
> > | [...]
> > |
> > | The minimum bandwidth percentage value for each cpu model is predefined
> > | and can be looked up through "info/MB/min_bandwidth". The bandwidth
> > | granularity that is allocated is also dependent on the cpu model and can
> > | be looked up at "info/MB/bandwidth_gran". The available bandwidth
> > | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
> > | -to the next control step available on the hardware.
> > | +control steps are: min_bw + N * (bw_gran - e), where e is a
> > | +non-negative, hardware-defined real constant that is less than 1.
> > | +Intermediate values are rounded to the next control step available on
> > | +the hardware.
> > | +
> > | +At the time of writing, the constant e referred to in the preceding
> > | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran
> > | +describes the step size exactly), but this may not be the case on other
> > | +hardware when the actual granularity is not an exact divisor of 100.
>
> Have you considered how to share the value of "e" with users?
Perhaps introducing this "e" as an explicit parameter is a bad idea and
overly formal. In practice, there are likely to various sources of
skid and approximation in the hardware, so exposing an actual value may
be counterproductive -- i.e., what usable guarantee is this providing
to userspace, if this is likely to be swamped by approximations
elsewhere?
Instead, maybe we can just say something like:
| The available steps are spaced at roughly equal intervals between the
| value reported by info/MB/min_bandwidth and 100%, inclusive. Reading
| info/MB/bandwidth_gran gives the worst-case precision of these
| interval steps, in per cent.
What do you think?
If that's adequate, then the wording under the definition of
"bandwidth_gran" could be aligned with this.
> >>> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> >>> index d98e0d2de09f..c5e73b75aaa0 100644
> >>> --- a/fs/resctrl/ctrlmondata.c
> >>> +++ b/fs/resctrl/ctrlmondata.c
> >>> @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
> >>> return false;
> >>> }
> >>>
> >>> - *data = roundup(bw, (unsigned long)r->membw.bw_gran);
> >>> + *data = resctrl_arch_round_bw(bw, r);
> >>
> >> Please check that function comments remain accurate after changes (specifically
> >> if making the conversion more generic as proposed below).
> >
> > I hoped that the comment for this function was still applicable, though
> > it can probably be improved. How about the following?
> >
> > | - * hardware. The allocated bandwidth percentage is rounded to the next
> > | - * control step available on the hardware.
> > | + * hardware. The allocated bandwidth percentage is converted as
> > | + * appropriate for consumption by the specific hardware driver.
> >
> > [...]
>
> Looks good to me.
OK.
> >
> >>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> >>> index 6fb4894b8cfd..5b2a555cf2dd 100644
> >>> --- a/include/linux/resctrl.h
> >>> +++ b/include/linux/resctrl.h
> >>> @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid,
> >>> bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l);
> >>> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
> >>>
> >>> +/*
> >>> + * Round a bandwidth control value to the nearest value acceptable to
> >>> + * the arch code for resource r:
> >>> + */
> >>> +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r);
> >>> +
> >>
> >> I do not think that resctrl should make any assumptions on what the
> >> architecture's conversion does (i.e "round"). That architecture needs to be
> >> asked to "round a bandwidth control value" also sounds strange since resctrl really
> >> should be able to do something like rounding itself. As I understand from
> >> the notes this will be a no-op for MPAM making this even more confusing.
> >>
> >> How about naming the helper something like resctrl_arch_convert_bw()?
> >> (Open to other ideas of course).
> >>
> >> If you make such a change, please check that subject of patch still fits.
> >
> > I struggled a bit with the name. Really, this is converting the value
> > to an intermediate form (which might or might not involve rounding).
> > For historical reasons, this is a value suitable for writing directly
> > to the relevant x86 MSR without any further interpretation.
> >
> > For MPAM, it is convenient to do this conversion later, rather than
> > during parsing of the value.
> >
> >
> > Would a name like resctrl_arch_preconvert_bw() be acceptable?
>
> Yes.
>
> >
> > This isn't more informative than your suggestion regarding what the
> > conversion is expected to do, but may convey the expectation that the
> > output value may still not be in its final (i.e., hardware) form.
>
> Sounds good, yes.
OK, I'll hack that in.
>
> >
> >> I think that using const to pass data to architecture is great, thanks.
> >>
> >> Reinette
> >
> > I try to constify by default when straightforward to do so, since the
> > compiler can then find which cases need to change; the reverse
> > direction is harder to automate...
>
> Could you please elaborate what you mean with "reverse direction"?
I just meant that over-consting tends to result in violations of the
language that the compiler will detect, but under-consting doesn't:
static void foo(int *nonconstp, const int *constp)
{
*constp = 0; // compiler error
(*nonconstp); // silently accpeted, though it could have been const
}
So, the compiler will tell you places where const needs to be removed
(or something else needs to change), but to find places where const
could be _added_, you have to hunt them down yourself, or use some
other tool that is probably not part of the usual workflow.
Cheers
---Dave
Powered by blists - more mailing lists