[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <45b96e99-ac5e-4546-b58b-f4d062d2f823@intel.com>
Date: Thu, 20 Nov 2025 08:46:24 -0800
From: Reinette Chatre <reinette.chatre@...el.com>
To: Dave Martin <Dave.Martin@....com>, <linux-kernel@...r.kernel.org>
CC: Tony Luck <tony.luck@...el.com>, James Morse <james.morse@....com>, "Ben
Horgan" <ben.horgan@....com>, Thomas Gleixner <tglx@...utronix.de>, "Ingo
Molnar" <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, Dave Hansen
<dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>, "Jonathan
Corbet" <corbet@....net>, <x86@...nel.org>, <linux-doc@...r.kernel.org>
Subject: Re: [PATCH v2] x86,fs/resctrl: Factor MBA parse-time conversion to be
per-arch
Hi Dave,
On 10/31/25 8:41 AM, Dave Martin wrote:
> The control value parser for the MB resource currently coerces the
> memory bandwidth percentage value from userspace to be an exact
> multiple of the rdt_resource::resctrl_membw::bw_gran parameter.
>
> On MPAM systems, this results in somewhat worse-than-worst-case
> rounding, since the bandwidth granularity advertised to resctrl by the
> MPAM driver is in general only an approximation to the actual hardware
> granularity on these systems, and the hardware bandwidth allocation
> control value is not natively a percentage -- necessitating a further
> conversion in the resctrl_arch_update_domains() path, regardless of the
> conversion done at parse time.
>
> Allow the arch to provide its own parse-time conversion that is
> appropriate for the hardware, and move the existing conversion to x86.
> This will avoid accumulated error from rounding the value twice on MPAM
> systems.
>
> Clarify the documentation, but avoid overly exact promises.
>
> Clamping to bw_min and bw_max still feels generic: leave it in the core
> code, for now.
>
> No functional change.
Please use max line length available. Changelogs have to conform before merge
and having patch ready saves this work.
>
> Signed-off-by: Dave Martin <Dave.Martin@....com>
>
> ---
...
> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> index b7f35b07876a..fbbcf5421346 100644
> --- a/Documentation/filesystems/resctrl.rst
> +++ b/Documentation/filesystems/resctrl.rst
> @@ -144,12 +144,11 @@ with respect to allocation:
> user can request.
>
> "bandwidth_gran":
> - The granularity in which the memory bandwidth
> - percentage is allocated. The allocated
> - b/w percentage is rounded off to the next
> - control step available on the hardware. The
> - available bandwidth control steps are:
> - min_bandwidth + N * bandwidth_gran.
> + The approximate granularity in which the memory bandwidth
> + percentage is allocated. The allocated bandwidth percentage
> + is rounded up to the next control step available on the
> + hardware. The available hardware steps are no larger than
> + this value.
>
> "delay_linear":
> Indicates if the delay scale is linear or
> @@ -737,8 +736,10 @@ The minimum bandwidth percentage value for each cpu model is predefined
> and can be looked up through "info/MB/min_bandwidth". The bandwidth
> granularity that is allocated is also dependent on the cpu model and can
> be looked up at "info/MB/bandwidth_gran". The available bandwidth
> -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
> -to the next control step available on the hardware.
> +control steps are, approximately, min_bw + N * bw_gran. The steps may
> +appear irregular due to rounding to an exact percentage: bw_gran is the
> +maximum interval between the percentage values corresponding to any two
> +adjacent steps in the hardware.
>
> The bandwidth throttling is a core specific mechanism on some of Intel
> SKUs. Using a high bandwidth and a low bandwidth setting on two threads
The documentation changes look good to me. Thank you.
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 1189c0df4ad7..dc05a50e80fb 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -16,9 +16,15 @@
> #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>
> #include <linux/cpu.h>
> +#include <linux/math.h>
>
> #include "internal.h"
>
> +u32 resctrl_arch_preconvert_bw(u32 val, const struct rdt_resource *r)
> +{
> + return roundup(val, (unsigned long)r->membw.bw_gran);
> +}
> +
> int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> u32 closid, enum resctrl_conf_type t, u32 cfg_val)
> {
> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> index 0d0ef54fc4de..26e3f7c88c86 100644
> --- a/fs/resctrl/ctrlmondata.c
> +++ b/fs/resctrl/ctrlmondata.c
> @@ -35,8 +35,8 @@ typedef int (ctrlval_parser_t)(struct rdt_parse_data *data,
> /*
> * Check whether MBA bandwidth percentage value is correct. The value is
> * checked against the minimum and max bandwidth values specified by the
> - * hardware. The allocated bandwidth percentage is rounded to the next
> - * control step available on the hardware.
> + * hardware. The allocated bandwidth percentage is converted as
> + * appropriate for consumption by the specific hardware driver.
The text looks good but adjusting the right margin mid-paragraph seems unnecessary?
> */
> static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
> {
> @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
> return false;
> }
>
> - *data = roundup(bw, (unsigned long)r->membw.bw_gran);
> + *data = resctrl_arch_preconvert_bw(bw, r);
> return true;
> }
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index a7d92718b653..1fb6e2118b61 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -485,6 +485,20 @@ bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r);
> */
> int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable);
>
> +/*
> + * Convert a bandwidth control value to the appropriate form for
> + * consumption by the hardware driver for resource r.
When being as descriptive, please switch to proper kernel-doc instead. There are
a couple of examples for reference in this header file.
> + *
> + * For example, it simplifies the x86 RDT implementation to round the
> + * value to a suitable step here and then treat the resulting value as
> + * opaque when programming the hardware MSRs later on.
This "For example" can be dropped.
> + *
> + * Architectures for which this pre-conversion hook is not useful
> + * should supply an implementation of this function that just returns
> + * val unmodified.
> + */
> +u32 resctrl_arch_preconvert_bw(u32 val, const struct rdt_resource *r);
> +
> /*
> * Update the ctrl_val and apply this config right now.
> * Must be called on one of the domain's CPUs.
>
> base-commit: dcb6fa37fd7bc9c3d2b066329b0d27dedf8becaa
Reinette
Powered by blists - more mailing lists