Message-ID: <1c7cc78f-c5ba-4fbc-9b17-61e5b72415ad@intel.com>
Date: Thu, 25 Sep 2025 15:18:51 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: "Luck, Tony" <tony.luck@...el.com>
CC: Dave Martin <Dave.Martin@....com>, <linux-kernel@...r.kernel.org>, "James
Morse" <james.morse@....com>, Thomas Gleixner <tglx@...utronix.de>, "Ingo
Molnar" <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, Dave Hansen
<dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>, "Jonathan
Corbet" <corbet@....net>, <x86@...nel.org>, <linux-doc@...r.kernel.org>
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be
per-arch
Hi Tony,
On 9/25/25 2:35 PM, Luck, Tony wrote:
> On Thu, Sep 25, 2025 at 01:53:37PM -0700, Reinette Chatre wrote:
>> On 9/25/25 5:46 AM, Dave Martin wrote:
>>> On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote:
>>>> On 9/22/25 7:39 AM, Dave Martin wrote:
>>>>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
...
>>>>> for which writing "MB: 0=x" and "MB: 0=y" yield different
>>>>> configurations for every in-range x and where y = x + g and y is also
>>>>> in-range.
>>>>>
>>>>> That's a bit of a mouthful, though. If you can think of a more
>>>>> succinct way of putting it, I'm open to suggestions!
>>>>>
>>>>>> Please note that the documentation has a section "Memory bandwidth Allocation
>>>>>> and monitoring" that also contains these exact promises.
>>>>>
>>>>> Hmmm, somehow I completely missed that.
>>>>>
>>>>> Does the following make sense? Ideally, there would be a simpler way
>>>>> to describe the discrepancy between the reported and actual values of
>>>>> bw_gran...
>>>>>
>>>>> | Memory bandwidth Allocation and monitoring
>>>>> | ==========================================
>>>>> |
>>>>> | [...]
>>>>> |
>>>>> | The minimum bandwidth percentage value for each cpu model is predefined
>>>>> | and can be looked up through "info/MB/min_bandwidth". The bandwidth
>>>>> | granularity that is allocated is also dependent on the cpu model and can
>>>>> | be looked up at "info/MB/bandwidth_gran". The available bandwidth
>>>>> | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
>>>>> | -to the next control step available on the hardware.
>>>>> | +control steps are: min_bw + N * (bw_gran - e), where e is a
>>>>> | +non-negative, hardware-defined real constant that is less than 1.
>>>>> | +Intermediate values are rounded to the next control step available on
>>>>> | +the hardware.
>>>>> | +
>>>>> | +At the time of writing, the constant e referred to in the preceding
>>>>> | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran
>>>>> | +describes the step size exactly), but this may not be the case on other
>>>>> | +hardware when the actual granularity is not an exact divisor of 100.
>>>>
>>>> Have you considered how to share the value of "e" with users?
>>>
>>> Perhaps introducing this "e" as an explicit parameter is a bad idea and
>>> overly formal. In practice, there are likely to be various sources of
>>> skid and approximation in the hardware, so exposing an actual value may
>>> be counterproductive -- i.e., what usable guarantee is this providing
>>> to userspace, if this is likely to be swamped by approximations
>>> elsewhere?
>>>
>>> Instead, maybe we can just say something like:
>>>
>>> | The available steps are spaced at roughly equal intervals between the
>>> | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading
>>> | info/MB/bandwidth_gran gives the worst-case precision of these
>>> | interval steps, in per cent.
>>>
>>> What do you think?
>>
>> I find "worst-case precision" a bit confusing, consider for example, what
>> would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives
>> the upper limit of these interval steps"? I believe this matches what you
>> mentioned a couple of messages ago: "The available steps are no larger than this
>> value."
>>
>> (and "per cent" -> "percent")
>>
>>>
>>> If that's adequate, then the wording under the definition of
>>> "bandwidth_gran" could be aligned with this.
>>
>> I think putting together a couple of your proposals and statements while making the
>> text more accurate may work:
>>
>> "bandwidth_gran":
>> The approximate granularity in which the memory bandwidth
>> percentage is allocated. The allocated bandwidth percentage
>> is rounded up to the next control step available on the
>> hardware. The available hardware steps are no larger than
>> this value.
>>
>> I assume "available" is needed because, even though the steps are not larger
>> than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth"
>> to 100% range?
>
> What values are allowed for "bandwidth_gran"? The "Intel® Resource
This is a property of the MB resource where the ABI is to express allocations
as a percentage. Current doc:
"bandwidth_gran":
The granularity in which the memory bandwidth
percentage is allocated. The allocated
b/w percentage is rounded off to the next
control step available on the hardware. The
available bandwidth control steps are:
min_bandwidth + N * bandwidth_gran.
I do not expect we can switch it to fractions so I would say that
integer values are allowed, starting at 1.
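To make the control steps in the current doc text concrete, here is a minimal
sketch in C of rounding a requested percentage to min_bandwidth +
N * bandwidth_gran. The helper name mb_round_to_step() is hypothetical and
this is not the resctrl implementation. It assumes integer percentages,
rounding up to the next step, and a cap at 100%.

```c
/*
 * Hypothetical illustration only, not kernel code.
 *
 * Round a requested bandwidth percentage up to the next control step of
 * the form min_bw + N * bw_gran, capped at 100. All values are integer
 * percentages, matching the MB schema as documented today.
 */
static unsigned int mb_round_to_step(unsigned int req,
				     unsigned int min_bw,
				     unsigned int bw_gran)
{
	unsigned int n, step;

	if (req >= 100)
		return 100;
	if (req <= min_bw)
		return min_bw;
	if (bw_gran == 0)	/* assumption: treat 0 as "no rounding" */
		return req;

	/* Smallest N such that min_bw + N * bw_gran >= req. */
	n = (req - min_bw + bw_gran - 1) / bw_gran;
	step = min_bw + n * bw_gran;

	return step > 100 ? 100 : step;
}
```

With min_bw = 10 and bw_gran = 10, a request of 33 lands on the step at 40,
and a request that is already on a step, such as 20, is left unchanged.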
I understand that the MB resource on AMD supports different ranges and
I find that ABI discrepancy unfortunate. I do not think it should be taken
as a sign that "anything goes" when it comes to MB, or used as an excuse to
pile on another range of hardware-dependent inputs. Instead, I believe we
should keep the MB interface as-is and work on a generic interface that
lets user space benefit from all hardware capabilities through resctrl
without needing to know which hardware is underneath.
> Director Technology (Intel® RDT) Architecture Specification"
>
> https://cdrdv2.intel.com/v1/dl/getContent/789566
>
> describes the upcoming region aware memory bandwidth allocation
> controls as being a number from "1" to "Q" (enumerated in an ACPI
> table). First implementation looks like Q == 255 which means a
> granularity of 0.392%. The spec has headroom to allow Q == 511.
>
> I don't expect users to need that granularity at the high bandwidth
> end of the range, but I do expect them to care for highly throttled
> background/batch jobs to make sure they can't affect performance of
> the high priority jobs.
>
> I'd hate to have to round all low bandwidth controls to 1% steps.
This is the limitation if choosing to expose this feature as an MB resource
and seems to be the same problem that Dave is facing. For finer granularity
allocations I expect that we would need a new schema/resource backed by new
properties as proposed by Dave in
https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/
This will require updates to user space (that will anyway be needed if wedging
another non-ABI input into MB).
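For reference, the granularity figures quoted above can be checked with a
short sketch in C. It assumes the "1" to "Q" control values are evenly spaced
across 0-100% of bandwidth, which is an assumption drawn from the 0.392%
figure and not something taken from the spec. The function names are
hypothetical.

```c
/*
 * Hypothetical illustration only.
 *
 * Size of one control step, in percent, when the control is a number
 * from 1 to q and the steps are assumed to be evenly spaced.
 */
static double mb_step_percent(unsigned int q)
{
	return 100.0 / (double)q;
}

/*
 * Number of distinct control values at or below pct percent, under the
 * same even-spacing assumption.
 */
static unsigned int mb_steps_at_or_below(unsigned int q, unsigned int pct)
{
	return (q * pct) / 100;
}
```

With Q == 255 one step is about 0.392%, and 25 control values fall at or
below 10%. An interface limited to integer percentages can name only 10 of
them, which is the loss of resolution at the low end described above.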
Reinette