[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <b38f0459-1373-42d3-8526-e8ef9ac4d2e7@intel.com>
Date: Fri, 12 Sep 2025 15:19:04 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Dave Martin <Dave.Martin@....com>, <linux-kernel@...r.kernel.org>
CC: Tony Luck <tony.luck@...el.com>, James Morse <james.morse@....com>,
"Thomas Gleixner" <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
"Borislav Petkov" <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
"H. Peter Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>,
<x86@...nel.org>, <linux-doc@...r.kernel.org>
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be
per-arch
Hi Dave,
nits:
Please use the subject prefix "x86,fs/resctrl" to be consistent with other
resctrl code (and was established by Arm :)).
Also please use upper case for acronym mba->MBA.
On 9/2/25 9:24 AM, Dave Martin wrote:
> The control value parser for the MB resource currently coerces the
> memory bandwidth percentage value from userspace to be an exact
> multiple of the bw_gran parameter.
(to help be specific)
"the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"?
>
> On MPAM systems, this results in somewhat worse-than-worst-case
> rounding, since bw_gran is in general only an approximation to the
> actual hardware granularity on these systems, and the hardware
> bandwidth allocation control value is not natively a percentage --
> necessitating a further conversion in the resctrl_arch_update_domains()
> path, regardless of the conversion done at parse time.
>
> Allow the arch to provide its own parse-time conversion that is
> appropriate for the hardware, and move the existing conversion to x86.
> This will avoid accumulated error from rounding the value twice on MPAM
> systems.
>
> Clarify the documentation, but avoid overly exact promises.
>
> Clamping to bw_min and bw_max still feels generic: leave it in the core
> code, for now.
Sounds like MPAM may be ready to start the schema parsing discussion again?
I understand that MPAM has a few more ways to describe memory bandwidth as
well as cache portion partitioning. Previously ([1] [2]) James mused about exposing
schema format to user space, which seems like a good idea for new schema.
Is this something MPAM is still considering? For example, the minimum
and maximum ranges that can be specified, is this something you already
have some ideas for? Have you perhaps considered Tony's RFD [3] that includes
discussion on how to handle min/max ranges for bandwidth?
>
> No functional change.
>
> Signed-off-by: Dave Martin <Dave.Martin@....com>
>
> ---
>
> Based on v6.17-rc3.
>
> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
> the other tests except for the NONCONT_CAT tests, which do not seem to
> be supported in my configuration -- and have nothing to do with the
> code touched by this patch).
Is the NONCONT_CAT test failing (i.e printing "not ok")?
The NONCONT_CAT tests may print error messages as debug information as part of
running, but these errors are expected as part of the test. The test should accurately
state whether it passed or failed though. For example, below attempts to write
a non-contiguous CBM to a system that does not support non-contiguous masks.
This fails as expected, error messages printed as debugging and thus the test passes
with an "ok".
# Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument
# Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected
ok 5 L3_NONCONT_CAT: test
>
> Notes:
>
> I put the x86 version out of line in order to avoid having to move
> struct rdt_resource and its dependencies into resctrl_types.h -- which
> would create a lot of diff noise. Schemata writes from userspace have
> a high overhead in any case.
Sounds good, I expect compiler will inline.
>
> For MPAM the conversion will be a no-op, because the incoming
> percentage from the core resctrl code needs to be converted to hardware
> representation in the driver anyway.
(addressed below)
>
> Perhaps _all_ the types should move to resctrl_types.h.
Can surely consider when there is a good motivation.
>
> For now, I went for the smallest diffstat...
>
> ---
> Documentation/filesystems/resctrl.rst | 7 +++----
> arch/x86/include/asm/resctrl.h | 2 ++
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 6 ++++++
> fs/resctrl/ctrlmondata.c | 2 +-
> include/linux/resctrl.h | 6 ++++++
> 5 files changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> index c7949dd44f2f..a1d0469d6dfb 100644
> --- a/Documentation/filesystems/resctrl.rst
> +++ b/Documentation/filesystems/resctrl.rst
> @@ -143,12 +143,11 @@ with respect to allocation:
> user can request.
>
> "bandwidth_gran":
> - The granularity in which the memory bandwidth
> + The approximate granularity in which the memory bandwidth
> percentage is allocated. The allocated
> b/w percentage is rounded off to the next
> - control step available on the hardware. The
> - available bandwidth control steps are:
> - min_bandwidth + N * bandwidth_gran.
> + control step available on the hardware. The available
> + steps are at least as small as this value.
A bit difficult to parse for me.
Is "at least as small as" same as "at least"?
Please note that the documentation has a section "Memory bandwidth Allocation
and monitoring" that also contains these exact promises.
>
> "delay_linear":
> Indicates if the delay scale is linear or
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index feb93b50e990..8bec2b9cc503 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -18,6 +18,8 @@
> */
> #define X86_RESCTRL_EMPTY_CLOSID ((u32)~0)
>
> +struct rdt_resource;
> +
I'm missing something here. Why is this needed?
> /**
> * struct resctrl_pqr_state - State cache for the PQR MSR
> * @cur_rmid: The cached Resource Monitoring ID
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 1189c0df4ad7..cf9b30b5df3c 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -16,9 +16,15 @@
> #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>
> #include <linux/cpu.h>
> +#include <linux/math.h>
>
> #include "internal.h"
>
> +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r)
> +{
> + return roundup(val, (unsigned long)r->membw.bw_gran);
> +}
> +
> int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> u32 closid, enum resctrl_conf_type t, u32 cfg_val)
> {
> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> index d98e0d2de09f..c5e73b75aaa0 100644
> --- a/fs/resctrl/ctrlmondata.c
> +++ b/fs/resctrl/ctrlmondata.c
> @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
> return false;
> }
>
> - *data = roundup(bw, (unsigned long)r->membw.bw_gran);
> + *data = resctrl_arch_round_bw(bw, r);
Please check that function comments remain accurate after changes (specifically
if making the conversion more generic as proposed below).
> return true;
> }
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 6fb4894b8cfd..5b2a555cf2dd 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid,
> bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l);
> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
>
> +/*
> + * Round a bandwidth control value to the nearest value acceptable to
> + * the arch code for resource r:
> + */
> +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r);
> +
I do not think that resctrl should make any assumptions on what the
architecture's conversion does (i.e "round"). That architecture needs to be
asked to "round a bandwidth control value" also sounds strange since resctrl really
should be able to do something like rounding itself. As I understand from
the notes this will be a no-op for MPAM making this even more confusing.
How about naming the helper something like resctrl_arch_convert_bw()?
(Open to other ideas of course).
If you make such a change, please check that subject of patch still fits.
I think that using const to pass data to architecture is great, thanks.
Reinette
[1] https://lore.kernel.org/lkml/fa93564a-45b0-ccdd-c139-ae4867eacfb5@arm.com/
[2] https://lore.kernel.org/all/acefb432-6388-44ed-b444-1e52335c6c3d@arm.com/
[3] https://lore.kernel.org/lkml/Z_mB-gmQe_LR4FWP@agluck-desk3/
Powered by blists - more mailing lists