linux-kernel - Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aNU5nCklRhuc4u3X@e133380.arm.com>
Date: Thu, 25 Sep 2025 13:46:20 +0100
From: Dave Martin <Dave.Martin@....com>
To: Reinette Chatre <reinette.chatre@...el.com>
Cc: linux-kernel@...r.kernel.org, Tony Luck <tony.luck@...el.com>,
	James Morse <james.morse@....com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
	Dave Hansen <dave.hansen@...ux.intel.com>,
	"H. Peter Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>,
	x86@...nel.org, linux-doc@...r.kernel.org
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be
 per-arch

Hi Reinette,

On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote:
> Hi Dave,
> 
> On 9/22/25 7:39 AM, Dave Martin wrote:
> > Hi Reinette,
> > 
> > Thanks for the review.
> > 
> > On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
> >> Hi Dave,
> >>
> >> nits:
> >> Please use the subject prefix "x86,fs/resctrl" to be consistent with other
> >> resctrl code (and was established by Arm :)).
> >> Also please use upper case for acronym mba->MBA.
> > 
> > Ack (the local custom in the MPAM code is to use "mba", but arguably,
> > the meaning is not quite the same -- I'll change it.)
> 
> I am curious what the motivation is for the custom? Knowing this will help
> me to keep things consistent when the two worlds meet.

I think this has just evolved over time.  On the x86 side, MBA is a
specific architectural feature, but on the MPAM side the architecture
doesn't really have a name for the same thing.  Memory bandwidth is a
concept, but a few different types of control are defined for it, with
different names.

So, for the MPAM driver "mba" is more of a software concept than
something in a published spec: it's the glue that attaches to "MB"
resource as seen through resctrl.

(This isn't official though; it's just the mental model that I have
formed.)

> 
> >>> The control value parser for the MB resource currently coerces the
> >>> memory bandwidth percentage value from userspace to be an exact
> >>> multiple of the bw_gran parameter.
> >>
> >> (to help be specific)
> >> "the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"?
> > 
> > "bw_gran" was intended as an informal shorthand for the abstract
> > parameter (exposed both in the field you mention and through the
> > bandiwidth_gran file in resctrl).
> 
> I do not see a need for being abstract since the bandwidth_gran file exposes
> the field verbatim.

Sure; that was just my thought process.

> > I can rewrite it as per your suggestion, but this could be read as
> > excluding the bandwidth_gran file.  Would it make sense just to write
> > it out longhand?  For now, I've rewritten it as follows:
> 
> Since the bandwidth_gran file exposes rdt_resource::resctrl_membw::bw_gran
> it is not clear to me how being specific excludes the bandwidth_gran file. 
> 
> > 
> >  | The control value parser for the MB resource currently coerces the
> >  | memory bandwidth percentage value from userspace to be an exact
> >  | multiple of the bandwidth granularity parameter.
> 
> If you want to include the bandwidth_gran file then the above could be
> something like:
> 
> 	The control value parser for the MB resource coerces the memory
> 	bandwidth percentage value from userspace to be an exact multiple
> 	of the bandwidth granularity parameter that is exposed by the
> 	bandwidth_gran resctrl file.
> 
> I still think that replacing "the bandwidth granularity parameter" with
> "rdt_resource::resctrl_membw::bw_gran" will help to be more specific.

That's fine.  I'll change as per your original suggestion.

> >  |
> >  | On MPAM systems, this results in somewhat worse-than-worst-case
> >  | rounding, since the bandwidth granularity advertised to resctrl by the
> >  | MPAM driver is in general only an approximation [...]
> > 
> > (I'm happy to go with your suggestion if you're not keen on this,
> > though.)
> > 
> >>> On MPAM systems, this results in somewhat worse-than-worst-case
> >>> rounding, since bw_gran is in general only an approximation to the
> >>> actual hardware granularity on these systems, and the hardware
> >>> bandwidth allocation control value is not natively a percentage --
> >>> necessitating a further conversion in the resctrl_arch_update_domains()
> >>> path, regardless of the conversion done at parse time.
> >>>
> >>> Allow the arch to provide its own parse-time conversion that is
> >>> appropriate for the hardware, and move the existing conversion to x86.
> >>> This will avoid accumulated error from rounding the value twice on MPAM
> >>> systems.
> >>>
> >>> Clarify the documentation, but avoid overly exact promises.
> >>>
> >>> Clamping to bw_min and bw_max still feels generic: leave it in the core
> >>> code, for now.
> >>
> >> Sounds like MPAM may be ready to start the schema parsing discussion again?
> >> I understand that MPAM has a few more ways to describe memory bandwidth as
> >> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing
> >> schema format to user space, which seems like a good idea for new schema.
> > 
> > My own ideas in this area are a little different, though I agree with
> > the general idea.
> 
> Should we expect a separate proposal from James?

At some point, yes.  We still need to have a chat about it.

Right now, I was just throwing an idea out there.

> > I'll respond separately on that, to avoid this thread getting off-topic.
> 
> Much appreciated.
> 
> > 
> > For this patch, my aim was to avoid changing anything unnecessarily.
> 
> Understood. More below as I try to understand the details but it does not
> really sound as though the current interface works that great for MPAM. If I
> understand correctly this patch enables MPAM to use existing interface for
> its memory bandwidth allocations but doing so does not enable users to 
> obtain benefit of hardware capabilities. For that users would want to use
> the new interface?

In ideal world, probably, yes.

Since not all use cases will care about full precision, the MB resource
(approximated for MPAM) should be fine for a lot of people, but I
expect that sooner or later somebody will want more exact control.

> >>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
> >>> the other tests except for the NONCONT_CAT tests, which do not seem to
> >>> be supported in my configuration -- and have nothing to do with the
> >>> code touched by this patch).
> >>
> >> Is the NONCONT_CAT test failing (i.e printing "not ok")?
> >>
> >> The NONCONT_CAT tests may print error messages as debug information as part of
> >> running, but these errors are expected as part of the test. The test should accurately
> >> state whether it passed or failed though. For example, below attempts to write
> >> a non-contiguous CBM to a system that does not support non-contiguous masks.
> >> This fails as expected, error messages printed as debugging and thus the test passes
> >> with an "ok".
> >>
> >> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument                                      
> >> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected                             
> >> ok 5 L3_NONCONT_CAT: test                             
> > 
> > I don't think that this was anything to do with my changes, but I don't
> > still seem to have the test output.  (Since this test has to do with
> > bitmap schemata (?), it seemed unlikely to be affected by changes to
> > bw_validate().)
> 
> I agree that this should not have anything to do with this patch. My concern
> is that I understood that the test failed for a feature that is not supported.
> If this is the case then there may be a problem with the test. The test should
> not fail if the feature is not supported but instead skip the test.

I'll try to capture more output from this when I re-run it, so that we
can figure out what this is.

> >>> Notes:
> >>>
> >>> I put the x86 version out of line in order to avoid having to move
> >>> struct rdt_resource and its dependencies into resctrl_types.h -- which
> >>> would create a lot of diff noise.  Schemata writes from userspace have
> >>> a high overhead in any case.
> >>
> >> Sounds good, I expect compiler will inline.
> > 
> > The function and caller are in separate translation units, so unless
> > LTO is used, I don't think the function will be inlined.
> 
> Thanks, yes, indeed.
> 
> > 
> >>>
> >>> For MPAM the conversion will be a no-op, because the incoming
> >>> percentage from the core resctrl code needs to be converted to hardware
> >>> representation in the driver anyway.
> >>
> >> (addressed below)
> >>
> >>>
> >>> Perhaps _all_ the types should move to resctrl_types.h.
> >>
> >> Can surely consider when there is a good motivation.
> >>
> >>>
> >>> For now, I went for the smallest diffstat...
> > 
> > I'll assume the motivation is not strong enough for now, but shout if
> > you disagree.
> 
> I agree.

OK, I'll leave that as-is for now, then.

> > 
> > [...]
> > 
> >>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> >>> index c7949dd44f2f..a1d0469d6dfb 100644
> >>> --- a/Documentation/filesystems/resctrl.rst
> >>> +++ b/Documentation/filesystems/resctrl.rst
> >>> @@ -143,12 +143,11 @@ with respect to allocation:
> >>>  		user can request.
> >>>  
> >>>  "bandwidth_gran":
> >>> -		The granularity in which the memory bandwidth
> >>> +		The approximate granularity in which the memory bandwidth
> >>>  		percentage is allocated. The allocated
> >>>  		b/w percentage is rounded off to the next
> >>> -		control step available on the hardware. The
> >>> -		available bandwidth control steps are:
> >>> -		min_bandwidth + N * bandwidth_gran.
> >>> +		control step available on the hardware. The available
> >>> +		steps are at least as small as this value.
> >>
> >> A bit difficult to parse for me. 
> >> Is "at least as small as" same as "at least"?
> > 
> > It was supposed to mean: "The available steps are no larger than this
> > value."
> 
> This is clear to me, especially when compared with the planned addition to
> "Memory bandwidth Allocation and monitoring" ... but I do find it contradicting
> the paragraph below (more below).
> 
> > 
> > Formally My expectation is that this value is the smallest integer
> > number of percent which is not smaller than the apparent size of any
> > individual rounding step.  Equivalently, this is the smallest number g
> 
> Considering the two statements:
> - "The available steps are no larger than this value."
> - "this value ... is not smaller than the apparent size of any individual rounding step"
> 
> The "not larger" and "not smaller" sounds like all these words just end up saying that
> this is the step size?

They are intended to be the same statement: A <= B versus
B >= A respectively.

But I'd be the first to admit that the wording is a bit twisted!
(I wouldn't be astonshed if I got something wrong somewhere.)

See below for an alternative way of describing this that might be more
intuitive.

> 
> > for which writing "MB: 0=x" and "MB: 0=y" yield different
> > configurations for every in-range x and where y = x + g and y is also
> > in-range.
> > 
> > That's a bit of a mouthful, though.  If you can think of a more
> > succinct way of putting it, I'm open to suggestions!
> > 
> >> Please note that the documentation has a section "Memory bandwidth Allocation
> >> and monitoring" that also contains these exact promises.
> > 
> > Hmmm, somehow I completely missed that.
> > 
> > Does the following make sense?  Ideally, there would be a simpler way
> > to describe the discrepancy between the reported and actual values of
> > bw_gran...
> > 
> >  |  Memory bandwidth Allocation and monitoring
> >  |  ==========================================
> >  |
> >  |  [...]
> >  |
> >  |  The minimum bandwidth percentage value for each cpu model is predefined
> >  |  and can be looked up through "info/MB/min_bandwidth". The bandwidth
> >  |  granularity that is allocated is also dependent on the cpu model and can
> >  |  be looked up at "info/MB/bandwidth_gran". The available bandwidth
> >  | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
> >  | -to the next control step available on the hardware.
> >  | +control steps are: min_bw + N * (bw_gran - e), where e is a
> >  | +non-negative, hardware-defined real constant that is less than 1.
> >  | +Intermediate values are rounded to the next control step available on
> >  | +the hardware.
> >  | +
> >  | +At the time of writing, the constant e referred to in the preceding
> >  | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran
> >  | +describes the step size exactly), but this may not be the case on other
> >  | +hardware when the actual granularity is not an exact divisor of 100.
> 
> Have you considered how to share the value of "e" with users?

Perhaps introducing this "e" as an explicit parameter is a bad idea and
overly formal.  In practice, there are likely to various sources of
skid and approximation in the hardware, so exposing an actual value may
be counterproductive -- i.e., what usable guarantee is this providing
to userspace, if this is likely to be swamped by approximations
elsewhere?

Instead, maybe we can just say something like:

 | The available steps are spaced at roughly equal intervals between the
 | value reported by info/MB/min_bandwidth and 100%, inclusive.  Reading
 | info/MB/bandwidth_gran gives the worst-case precision of these
 | interval steps, in per cent.

What do you think?

If that's adequate, then the wording under the definition of
"bandwidth_gran" could be aligned with this.

> >>> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> >>> index d98e0d2de09f..c5e73b75aaa0 100644
> >>> --- a/fs/resctrl/ctrlmondata.c
> >>> +++ b/fs/resctrl/ctrlmondata.c
> >>> @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
> >>>  		return false;
> >>>  	}
> >>>  
> >>> -	*data = roundup(bw, (unsigned long)r->membw.bw_gran);
> >>> +	*data = resctrl_arch_round_bw(bw, r);
> >>
> >> Please check that function comments remain accurate after changes (specifically
> >> if making the conversion more generic as proposed below).
> > 
> > I hoped that the comment for this function was still applicable, though
> > it can probably be improved.  How about the following?
> > 
> >  | - * hardware. The allocated bandwidth percentage is rounded to the next
> >  | - * control step available on the hardware.
> >  | + * hardware. The allocated bandwidth percentage is converted as
> >  | + * appropriate for consumption by the specific hardware driver.
> >  
> > [...]
> 
> Looks good to me.

OK.

> > 
> >>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> >>> index 6fb4894b8cfd..5b2a555cf2dd 100644
> >>> --- a/include/linux/resctrl.h
> >>> +++ b/include/linux/resctrl.h
> >>> @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid,
> >>>  bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l);
> >>>  int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
> >>>  
> >>> +/*
> >>> + * Round a bandwidth control value to the nearest value acceptable to
> >>> + * the arch code for resource r:
> >>> + */
> >>> +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r);
> >>> +
> >>
> >> I do not think that resctrl should make any assumptions on what the
> >> architecture's conversion does (i.e "round"). That architecture needs to be
> >> asked to "round a bandwidth control value" also sounds strange since resctrl really
> >> should be able to do something like rounding itself. As I understand from
> >> the notes this will be a no-op for MPAM making this even more confusing.
> >>
> >> How about naming the helper something like resctrl_arch_convert_bw()?
> >> (Open to other ideas of course).
> >>
> >> If you make such a change, please check that subject of patch still fits.
> > 
> > I struggled a bit with the name.  Really, this is converting the value
> > to an intermediate form (which might or might not involve rounding).
> > For historical reasons, this is a value suitable for writing directly
> > to the relevant x86 MSR without any further interpretation.
> > 
> > For MPAM, it is convenient to do this conversion later, rather than
> > during parsing of the value.
> > 
> > 
> > Would a name like resctrl_arch_preconvert_bw() be acceptable?
> 
> Yes.
> 
> > 
> > This isn't more informative than your suggestion regarding what the
> > conversion is expected to do, but may convey the expectation that the
> > output value may still not be in its final (i.e., hardware) form.
> 
> Sounds good, yes. 

OK, I'll hack that in.

> 
> > 
> >> I think that using const to pass data to architecture is great, thanks.
> >>
> >> Reinette
> > 
> > I try to constify by default when straightforward to do so, since the
> > compiler can then find which cases need to change; the reverse
> > direction is harder to automate...
> 
> Could you please elaborate what you mean with "reverse direction"?

I just meant that over-consting tends to result in violations of the
language that the compiler will detect, but under-consting doesn't:

static void foo(int *nonconstp, const int *constp)
{
	*constp = 0; // compiler error
	(*nonconstp); // silently accpeted, though it could have been const
}

So, the compiler will tell you places where const needs to be removed
(or something else needs to change), but to find places where const
could be _added_, you have to hunt them down yourself, or use some
other tool that is probably not part of the usual workflow.

Cheers
---Dave