lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aNFfs43UBp6tjqPM@e133380.arm.com>
Date: Mon, 22 Sep 2025 15:39:47 +0100
From: Dave Martin <Dave.Martin@....com>
To: Reinette Chatre <reinette.chatre@...el.com>
Cc: linux-kernel@...r.kernel.org, Tony Luck <tony.luck@...el.com>,
	James Morse <james.morse@....com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
	Dave Hansen <dave.hansen@...ux.intel.com>,
	"H. Peter Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>,
	x86@...nel.org, linux-doc@...r.kernel.org
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be
 per-arch

Hi Reinette,

Thanks for the review.

On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
> Hi Dave,
> 
> nits:
> Please use the subject prefix "x86,fs/resctrl" to be consistent with other
> resctrl code (and was established by Arm :)).
> Also please use upper case for acronym mba->MBA.

Ack (the local custom in the MPAM code is to use "mba", but arguably,
the meaning is not quite the same -- I'll change it.)

> On 9/2/25 9:24 AM, Dave Martin wrote:
> > The control value parser for the MB resource currently coerces the
> > memory bandwidth percentage value from userspace to be an exact
> > multiple of the bw_gran parameter.
> 
> (to help be specific)
> "the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"?

"bw_gran" was intended as an informal shorthand for the abstract
parameter (exposed both in the field you mention and through the
bandiwidth_gran file in resctrl).

I can rewrite it as per your suggestion, but this could be read as
excluding the bandwidth_gran file.  Would it make sense just to write
it out longhand?  For now, I've rewritten it as follows:

 | The control value parser for the MB resource currently coerces the
 | memory bandwidth percentage value from userspace to be an exact
 | multiple of the bandwidth granularity parameter.
 |
 | On MPAM systems, this results in somewhat worse-than-worst-case
 | rounding, since the bandwidth granularity advertised to resctrl by the
 | MPAM driver is in general only an approximation [...]

(I'm happy to go with your suggestion if you're not keen on this,
though.)

> > On MPAM systems, this results in somewhat worse-than-worst-case
> > rounding, since bw_gran is in general only an approximation to the
> > actual hardware granularity on these systems, and the hardware
> > bandwidth allocation control value is not natively a percentage --
> > necessitating a further conversion in the resctrl_arch_update_domains()
> > path, regardless of the conversion done at parse time.
> > 
> > Allow the arch to provide its own parse-time conversion that is
> > appropriate for the hardware, and move the existing conversion to x86.
> > This will avoid accumulated error from rounding the value twice on MPAM
> > systems.
> > 
> > Clarify the documentation, but avoid overly exact promises.
> > 
> > Clamping to bw_min and bw_max still feels generic: leave it in the core
> > code, for now.
> 
> Sounds like MPAM may be ready to start the schema parsing discussion again?
> I understand that MPAM has a few more ways to describe memory bandwidth as
> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing
> schema format to user space, which seems like a good idea for new schema.

My own ideas in this area are a little different, though I agree with
the general idea.

I'll respond separately on that, to avoid this thread getting off-topic.

For this patch, my aim was to avoid changing anything unnecessarily.

> Is this something MPAM is still considering? For example, the minimum
> and maximum ranges that can be specified, is this something you already
> have some ideas for? Have you perhaps considered Tony's RFD [3] that includes
> discussion on how to handle min/max ranges for bandwidth? 

This is another thing that we probably do want to support at some point,
but it feels like a different thing from the minimum and maximum bounds 
acceptable to an individual schema -- especially since in the hardware
they may behave more like trigger points than hard limits.

Again, I'll respond separately.

[...]

> > Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
> > the other tests except for the NONCONT_CAT tests, which do not seem to
> > be supported in my configuration -- and have nothing to do with the
> > code touched by this patch).
> 
> Is the NONCONT_CAT test failing (i.e printing "not ok")?
> 
> The NONCONT_CAT tests may print error messages as debug information as part of
> running, but these errors are expected as part of the test. The test should accurately
> state whether it passed or failed though. For example, below attempts to write
> a non-contiguous CBM to a system that does not support non-contiguous masks.
> This fails as expected, error messages printed as debugging and thus the test passes
> with an "ok".
> 
> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument                                      
> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected                             
> ok 5 L3_NONCONT_CAT: test                             

I don't think that this was anything to do with my changes, but I don't
still seem to have the test output.  (Since this test has to do with
bitmap schemata (?), it seemed unlikely to be affected by changes to
bw_validate().)

I'll need to re-test with and without this patch to check whether it
makes any difference.

> > Notes:
> > 
> > I put the x86 version out of line in order to avoid having to move
> > struct rdt_resource and its dependencies into resctrl_types.h -- which
> > would create a lot of diff noise.  Schemata writes from userspace have
> > a high overhead in any case.
> 
> Sounds good, I expect compiler will inline.

The function and caller are in separate translation units, so unless
LTO is used, I don't think the function will be inlined.

> > 
> > For MPAM the conversion will be a no-op, because the incoming
> > percentage from the core resctrl code needs to be converted to hardware
> > representation in the driver anyway.
> 
> (addressed below)
> 
> > 
> > Perhaps _all_ the types should move to resctrl_types.h.
> 
> Can surely consider when there is a good motivation.
> 
> > 
> > For now, I went for the smallest diffstat...

I'll assume the motivation is not strong enough for now, but shout if
you disagree.

[...]

> > diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> > index c7949dd44f2f..a1d0469d6dfb 100644
> > --- a/Documentation/filesystems/resctrl.rst
> > +++ b/Documentation/filesystems/resctrl.rst
> > @@ -143,12 +143,11 @@ with respect to allocation:
> >  		user can request.
> >  
> >  "bandwidth_gran":
> > -		The granularity in which the memory bandwidth
> > +		The approximate granularity in which the memory bandwidth
> >  		percentage is allocated. The allocated
> >  		b/w percentage is rounded off to the next
> > -		control step available on the hardware. The
> > -		available bandwidth control steps are:
> > -		min_bandwidth + N * bandwidth_gran.
> > +		control step available on the hardware. The available
> > +		steps are at least as small as this value.
> 
> A bit difficult to parse for me. 
> Is "at least as small as" same as "at least"?

It was supposed to mean: "The available steps are no larger than this
value."

Formally My expectation is that this value is the smallest integer
number of percent which is not smaller than the apparent size of any
individual rounding step.  Equivalently, this is the smallest number g
for which writing "MB: 0=x" and "MB: 0=y" yield different
configurations for every in-range x and where y = x + g and y is also
in-range.

That's a bit of a mouthful, though.  If you can think of a more
succinct way of putting it, I'm open to suggestions!

> Please note that the documentation has a section "Memory bandwidth Allocation
> and monitoring" that also contains these exact promises.

Hmmm, somehow I completely missed that.

Does the following make sense?  Ideally, there would be a simpler way
to describe the discrepancy between the reported and actual values of
bw_gran...

 |  Memory bandwidth Allocation and monitoring
 |  ==========================================
 |
 |  [...]
 |
 |  The minimum bandwidth percentage value for each cpu model is predefined
 |  and can be looked up through "info/MB/min_bandwidth". The bandwidth
 |  granularity that is allocated is also dependent on the cpu model and can
 |  be looked up at "info/MB/bandwidth_gran". The available bandwidth
 | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
 | -to the next control step available on the hardware.
 | +control steps are: min_bw + N * (bw_gran - e), where e is a
 | +non-negative, hardware-defined real constant that is less than 1.
 | +Intermediate values are rounded to the next control step available on
 | +the hardware.
 | +
 | +At the time of writing, the constant e referred to in the preceding
 | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran
 | +describes the step size exactly), but this may not be the case on other
 | +hardware when the actual granularity is not an exact divisor of 100.

> 
> >  
> >  "delay_linear":
> >  		Indicates if the delay scale is linear or
> > diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> > index feb93b50e990..8bec2b9cc503 100644
> > --- a/arch/x86/include/asm/resctrl.h
> > +++ b/arch/x86/include/asm/resctrl.h
> > @@ -18,6 +18,8 @@
> >   */
> >  #define X86_RESCTRL_EMPTY_CLOSID         ((u32)~0)
> >  
> > +struct rdt_resource;
> > +
> 
> I'm missing something here. Why is this needed?

Oops, it's not.  This got left behind from when I had the function
in-line here.

Removed.

[...]

> > diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> > index d98e0d2de09f..c5e73b75aaa0 100644
> > --- a/fs/resctrl/ctrlmondata.c
> > +++ b/fs/resctrl/ctrlmondata.c
> > @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
> >  		return false;
> >  	}
> >  
> > -	*data = roundup(bw, (unsigned long)r->membw.bw_gran);
> > +	*data = resctrl_arch_round_bw(bw, r);
> 
> Please check that function comments remain accurate after changes (specifically
> if making the conversion more generic as proposed below).

I hoped that the comment for this function was still applicable, though
it can probably be improved.  How about the following?

 | - * hardware. The allocated bandwidth percentage is rounded to the next
 | - * control step available on the hardware.
 | + * hardware. The allocated bandwidth percentage is converted as
 | + * appropriate for consumption by the specific hardware driver.
 
[...]

> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 6fb4894b8cfd..5b2a555cf2dd 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid,
> >  bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l);
> >  int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
> >  
> > +/*
> > + * Round a bandwidth control value to the nearest value acceptable to
> > + * the arch code for resource r:
> > + */
> > +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r);
> > +
> 
> I do not think that resctrl should make any assumptions on what the
> architecture's conversion does (i.e "round"). That architecture needs to be
> asked to "round a bandwidth control value" also sounds strange since resctrl really
> should be able to do something like rounding itself. As I understand from
> the notes this will be a no-op for MPAM making this even more confusing.
> 
> How about naming the helper something like resctrl_arch_convert_bw()?
> (Open to other ideas of course).
> 
> If you make such a change, please check that subject of patch still fits.

I struggled a bit with the name.  Really, this is converting the value
to an intermediate form (which might or might not involve rounding).
For historical reasons, this is a value suitable for writing directly
to the relevant x86 MSR without any further interpretation.

For MPAM, it is convenient to do this conversion later, rather than
during parsing of the value.


Would a name like resctrl_arch_preconvert_bw() be acceptable?

This isn't more informative than your suggestion regarding what the
conversion is expected to do, but may convey the expectation that the
output value may still not be in its final (i.e., hardware) form.

> I think that using const to pass data to architecture is great, thanks.
> 
> Reinette

I try to constify by default when straightforward to do so, since the
compiler can then find which cases need to change; the reverse
direction is harder to automate...

Cheers
---Dave

[...]

> [1] https://lore.kernel.org/lkml/fa93564a-45b0-ccdd-c139-ae4867eacfb5@arm.com/
> [2] https://lore.kernel.org/all/acefb432-6388-44ed-b444-1e52335c6c3d@arm.com/
> [3] https://lore.kernel.org/lkml/Z_mB-gmQe_LR4FWP@agluck-desk3/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ