Message-ID: <aNW1vAd6Jhq6IkyJ@agluck-desk3>
Date: Thu, 25 Sep 2025 14:35:56 -0700
From: "Luck, Tony" <tony.luck@...el.com>
To: Reinette Chatre <reinette.chatre@...el.com>
CC: Dave Martin <Dave.Martin@....com>, <linux-kernel@...r.kernel.org>,
	"James Morse" <james.morse@....com>, Thomas Gleixner <tglx@...utronix.de>,
	"Ingo Molnar" <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
	Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>,
	"Jonathan Corbet" <corbet@....net>, <x86@...nel.org>, <linux-doc@...r.kernel.org>
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be
per-arch
On Thu, Sep 25, 2025 at 01:53:37PM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> On 9/25/25 5:46 AM, Dave Martin wrote:
> > On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote:
> >> On 9/22/25 7:39 AM, Dave Martin wrote:
> >>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
> >>>> Hi Dave,
> >>>>
> >>>> nits:
> >>>> Please use the subject prefix "x86,fs/resctrl" to be consistent with other
> >>>> resctrl code (and was established by Arm :)).
> >>>> Also please use upper case for acronym mba->MBA.
> >>>
> >>> Ack (the local custom in the MPAM code is to use "mba", but arguably,
> >>> the meaning is not quite the same -- I'll change it.)
> >>
> >> I am curious what the motivation is for the custom? Knowing this will help
> >> me to keep things consistent when the two worlds meet.
> >
> > I think this has just evolved over time. On the x86 side, MBA is a
> > specific architectural feature, but on the MPAM side the architecture
> > doesn't really have a name for the same thing. Memory bandwidth is a
> > concept, but a few different types of control are defined for it, with
> > different names.
> >
> > So, for the MPAM driver "mba" is more of a software concept than
> > something in a published spec: it's the glue that attaches to "MB"
> > resource as seen through resctrl.
> >
> > (This isn't official though; it's just the mental model that I have
> > formed.)
>
> I see. Thank you for the details. My mental model is simpler: write acronyms
> in upper case.
>
> ...
>
> >>>>> On MPAM systems, this results in somewhat worse-than-worst-case
> >>>>> rounding, since bw_gran is in general only an approximation to the
> >>>>> actual hardware granularity on these systems, and the hardware
> >>>>> bandwidth allocation control value is not natively a percentage --
> >>>>> necessitating a further conversion in the resctrl_arch_update_domains()
> >>>>> path, regardless of the conversion done at parse time.
> >>>>>
> >>>>> Allow the arch to provide its own parse-time conversion that is
> >>>>> appropriate for the hardware, and move the existing conversion to x86.
> >>>>> This will avoid accumulated error from rounding the value twice on MPAM
> >>>>> systems.
> >>>>>
> >>>>> Clarify the documentation, but avoid overly exact promises.
> >>>>>
> >>>>> Clamping to bw_min and bw_max still feels generic: leave it in the core
> >>>>> code, for now.
> >>>>
> >>>> Sounds like MPAM may be ready to start the schema parsing discussion again?
> >>>> I understand that MPAM has a few more ways to describe memory bandwidth as
> >>>> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing
> >>>> schema format to user space, which seems like a good idea for new schema.
> >>>
> >>> My own ideas in this area are a little different, though I agree with
> >>> the general idea.
> >>
> >> Should we expect a separate proposal from James?
> >
> > At some point, yes. We still need to have a chat about it.
> >
> > Right now, I was just throwing an idea out there.
>
> Thank you very much for doing so. We are digesting it.
>
>
> ...
>
> >>> For this patch, my aim was to avoid changing anything unnecessarily.
> >>
> >> Understood. More below as I try to understand the details, but it does not
> >> really sound as though the current interface works that well for MPAM. If I
> >> understand correctly, this patch enables MPAM to use the existing interface for
> >> its memory bandwidth allocations, but doing so does not enable users to
> >> obtain the full benefit of the hardware capabilities. For that, users would
> >> want to use the new interface?
> >
> > In an ideal world, probably, yes.
> >
> > Since not all use cases will care about full precision, the MB resource
> > (approximated for MPAM) should be fine for a lot of people, but I
> > expect that sooner or later somebody will want more exact control.
>
> ack.
>
> >
> >>>>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
> >>>>> the other tests except for the NONCONT_CAT tests, which do not seem to
> >>>>> be supported in my configuration -- and have nothing to do with the
> >>>>> code touched by this patch).
> >>>>
> >>>> Is the NONCONT_CAT test failing (i.e printing "not ok")?
> >>>>
> >>>> The NONCONT_CAT tests may print error messages as debug information as part of
> >>>> running, but these errors are expected as part of the test. The test should accurately
> >>>> state whether it passed or failed though. For example, below attempts to write
> >>>> a non-contiguous CBM to a system that does not support non-contiguous masks.
> >>>> This fails as expected, the error messages are printed as debug output, and
> >>>> thus the test passes with an "ok".
> >>>>
> >>>> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument
> >>>> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected
> >>>> ok 5 L3_NONCONT_CAT: test
> >>>
> >>> I don't think that this was anything to do with my changes, but I don't
> >>> seem to still have the test output. (Since this test has to do with
> >>> bitmap schemata (?), it seemed unlikely to be affected by changes to
> >>> bw_validate().)
> >>
> >> I agree that this should not have anything to do with this patch. My concern
> >> is that I understood that the test failed for a feature that is not supported.
> >> If this is the case then there may be a problem with the test. The test should
> >> not fail if the feature is not supported but instead skip the test.
> >
> > I'll try to capture more output from this when I re-run it, so that we
> > can figure out what this is.
>
> Thank you.
>
>
> ...
>
> >>>
> >>> [...]
> >>>
> >>>>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> >>>>> index c7949dd44f2f..a1d0469d6dfb 100644
> >>>>> --- a/Documentation/filesystems/resctrl.rst
> >>>>> +++ b/Documentation/filesystems/resctrl.rst
> >>>>> @@ -143,12 +143,11 @@ with respect to allocation:
> >>>>> user can request.
> >>>>>
> >>>>> "bandwidth_gran":
> >>>>> - The granularity in which the memory bandwidth
> >>>>> + The approximate granularity in which the memory bandwidth
> >>>>> percentage is allocated. The allocated
> >>>>> b/w percentage is rounded off to the next
> >>>>> - control step available on the hardware. The
> >>>>> - available bandwidth control steps are:
> >>>>> - min_bandwidth + N * bandwidth_gran.
> >>>>> + control step available on the hardware. The available
> >>>>> + steps are at least as small as this value.
> >>>>
> >>>> A bit difficult to parse for me.
> >>>> Is "at least as small as" same as "at least"?
> >>>
> >>> It was supposed to mean: "The available steps are no larger than this
> >>> value."
> >>
> >> This is clear to me, especially when compared with the planned addition to
> >> "Memory bandwidth Allocation and monitoring" ... but I do find it contradicting
> >> the paragraph below (more below).
> >>
> >>>
> >>> Formally, my expectation is that this value is the smallest integer
> >>> number of percent which is not smaller than the apparent size of any
> >>> individual rounding step. Equivalently, this is the smallest number g
> >>
> >> Considering the two statements:
> >> - "The available steps are no larger than this value."
> >> - "this value ... is not smaller than the apparent size of any individual rounding step"
> >>
> >> The "not larger" and "not smaller" sounds like all these words just end up saying that
> >> this is the step size?
> >
> > They are intended to be the same statement: A <= B versus
> > B >= A respectively.
>
> This is what I understood from the words ... and that made me think that it
> can be simplified to A = B ... but no need to digress ... onto the alternatives below ...
>
> >
> > But I'd be the first to admit that the wording is a bit twisted!
> > (I wouldn't be astonished if I got something wrong somewhere.)
> >
> > See below for an alternative way of describing this that might be more
> > intuitive.
> >
> >>
> >>> for which writing "MB: 0=x" and "MB: 0=y" yield different
> >>> configurations for every in-range x and where y = x + g and y is also
> >>> in-range.
> >>>
> >>> That's a bit of a mouthful, though. If you can think of a more
> >>> succinct way of putting it, I'm open to suggestions!
> >>>
> >>>> Please note that the documentation has a section "Memory bandwidth Allocation
> >>>> and monitoring" that also contains these exact promises.
> >>>
> >>> Hmmm, somehow I completely missed that.
> >>>
> >>> Does the following make sense? Ideally, there would be a simpler way
> >>> to describe the discrepancy between the reported and actual values of
> >>> bw_gran...
> >>>
> >>> | Memory bandwidth Allocation and monitoring
> >>> | ==========================================
> >>> |
> >>> | [...]
> >>> |
> >>> | The minimum bandwidth percentage value for each cpu model is predefined
> >>> | and can be looked up through "info/MB/min_bandwidth". The bandwidth
> >>> | granularity that is allocated is also dependent on the cpu model and can
> >>> | be looked up at "info/MB/bandwidth_gran". The available bandwidth
> >>> | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
> >>> | -to the next control step available on the hardware.
> >>> | +control steps are: min_bw + N * (bw_gran - e), where e is a
> >>> | +non-negative, hardware-defined real constant that is less than 1.
> >>> | +Intermediate values are rounded to the next control step available on
> >>> | +the hardware.
> >>> | +
> >>> | +At the time of writing, the constant e referred to in the preceding
> >>> | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran
> >>> | +describes the step size exactly), but this may not be the case on other
> >>> | +hardware when the actual granularity is not an exact divisor of 100.
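(Trying to make "e" concrete with a made-up number, not anything from a
spec: suppose the real hardware step is 1/64 = 1.5625%. Then:

	bandwidth_gran = ceil(1.5625) = 2%
	e = 2 - 1.5625 = 0.4375

i.e. bandwidth_gran is the real step rounded up to a whole percent, and
e is just whatever that rounding added.)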
> >>
> >> Have you considered how to share the value of "e" with users?
> >
> > Perhaps introducing this "e" as an explicit parameter is a bad idea and
> > overly formal. In practice, there are likely to be various sources of
> > skid and approximation in the hardware, so exposing an actual value may
> > be counterproductive -- i.e., what usable guarantee is this providing
> > to userspace, if this is likely to be swamped by approximations
> > elsewhere?
> >
> > Instead, maybe we can just say something like:
> >
> > | The available steps are spaced at roughly equal intervals between the
> > | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading
> > | info/MB/bandwidth_gran gives the worst-case precision of these
> > | interval steps, in per cent.
> >
> > What do you think?
>
> I find "worst-case precision" a bit confusing, consider for example, what
> would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives
> the upper limit of these interval steps"? I believe this matches what you
> mentioned a couple of messages ago: "The available steps are no larger than this
> value."
>
> (and "per cent" -> "percent")
>
> >
> > If that's adequate, then the wording under the definition of
> > "bandwidth_gran" could be aligned with this.
>
> I think putting together a couple of your proposals and statements while making the
> text more accurate may work:
>
> "bandwidth_gran":
> The approximate granularity in which the memory bandwidth
> percentage is allocated. The allocated bandwidth percentage
> is rounded up to the next control step available on the
> hardware. The available hardware steps are no larger than
> this value.
>
> I assume "available" is needed because, even though the steps are not larger
> than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth"
> to 100% range?
What values are allowed for "bandwidth_gran"? The "Intel® Resource
Director Technology (Intel® RDT) Architecture Specification"
https://cdrdv2.intel.com/v1/dl/getContent/789566
describes the upcoming region aware memory bandwidth allocation
controls as being a number from "1" to "Q" (enumerated in an ACPI
table). The first implementation looks like Q == 255, which means a
granularity of 0.392%. The spec has headroom to allow Q == 511.
I don't expect users to need that granularity at the high bandwidth
end of the range, but I do expect them to care about it for highly
throttled background/batch jobs, to make sure those can't affect the
performance of the high priority jobs.
I'd hate to have to round all low bandwidth controls to 1% steps.
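To put a number on what that double rounding would cost, here is a rough
userspace sketch (not from the patch or the spec, just the Q == 255 case
worked through): asking for 1.5% and programming the nearest native step
lands on ~1.57%, but rounding up to a 1% step first and then programming
lands on ~1.96%.

#include <stdio.h>

#define Q 255	/* hypothetical maximum hardware control value */

/* Map a requested percentage to the nearest 1..Q control value. */
static unsigned int pct_to_ctrl(double pct)
{
	return (unsigned int)(pct * Q / 100.0 + 0.5);
}

int main(void)
{
	double req = 1.5;			/* requested bandwidth, percent */
	unsigned int direct = pct_to_ctrl(req);	/* 4 -> ~1.57% */
	unsigned int rounded = pct_to_ctrl(2.0);/* via 1% steps: 5 -> ~1.96% */

	printf("direct:       %u (%.2f%%)\n", direct, 100.0 * direct / Q);
	printf("via 1%% steps: %u (%.2f%%)\n", rounded, 100.0 * rounded / Q);
	return 0;
}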
>
> >>>> I think that using const to pass data to architecture is great, thanks.
> >>>>
> >>>> Reinette
> >>>
> >>> I try to constify by default when straightforward to do so, since the
> >>> compiler can then find which cases need to change; the reverse
> >>> direction is harder to automate...
> >>
> >> Could you please elaborate what you mean with "reverse direction"?
> >
> > I just meant that over-consting tends to result in violations of the
> > language that the compiler will detect, but under-consting doesn't:
> >
> > static void foo(int *nonconstp, const int *constp)
> > {
> > *constp = 0; // compiler error
> > (*nonconstp); // silently accepted, though it could have been const
> > }
> >
> > So, the compiler will tell you places where const needs to be removed
> > (or something else needs to change), but to find places where const
> > could be _added_, you have to hunt them down yourself, or use some
> > other tool that is probably not part of the usual workflow.
>
> Got it, thanks.
>
> Reinette
>
-Tony