linux-kernel - Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9dba03c5-cf45-4510-ab6c-2a945e73fd1c@intel.com>
Date: Thu, 25 Sep 2025 13:53:37 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Dave Martin <Dave.Martin@....com>
CC: <linux-kernel@...r.kernel.org>, Tony Luck <tony.luck@...el.com>, "James
 Morse" <james.morse@....com>, Thomas Gleixner <tglx@...utronix.de>, "Ingo
 Molnar" <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, Dave Hansen
	<dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>, "Jonathan
 Corbet" <corbet@....net>, <x86@...nel.org>, <linux-doc@...r.kernel.org>
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be
 per-arch

Hi Dave,

On 9/25/25 5:46 AM, Dave Martin wrote:
> On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote:
>> On 9/22/25 7:39 AM, Dave Martin wrote:
>>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
>>>> Hi Dave,
>>>>
>>>> nits:
>>>> Please use the subject prefix "x86,fs/resctrl" to be consistent with other
>>>> resctrl code (and was established by Arm :)).
>>>> Also please use upper case for acronym mba->MBA.
>>>
>>> Ack (the local custom in the MPAM code is to use "mba", but arguably,
>>> the meaning is not quite the same -- I'll change it.)
>>
>> I am curious what the motivation is for the custom? Knowing this will help
>> me to keep things consistent when the two worlds meet.
> 
> I think this has just evolved over time.  On the x86 side, MBA is a
> specific architectural feature, but on the MPAM side the architecture
> doesn't really have a name for the same thing.  Memory bandwidth is a
> concept, but a few different types of control are defined for it, with
> different names.
> 
> So, for the MPAM driver "mba" is more of a software concept than
> something in a published spec: it's the glue that attaches to "MB"
> resource as seen through resctrl.
> 
> (This isn't official though; it's just the mental model that I have
> formed.)

I see. Thank you for the details. My mental model is simpler: write acronyms
in upper case.

...

>>>>> On MPAM systems, this results in somewhat worse-than-worst-case
>>>>> rounding, since bw_gran is in general only an approximation to the
>>>>> actual hardware granularity on these systems, and the hardware
>>>>> bandwidth allocation control value is not natively a percentage --
>>>>> necessitating a further conversion in the resctrl_arch_update_domains()
>>>>> path, regardless of the conversion done at parse time.
>>>>>
>>>>> Allow the arch to provide its own parse-time conversion that is
>>>>> appropriate for the hardware, and move the existing conversion to x86.
>>>>> This will avoid accumulated error from rounding the value twice on MPAM
>>>>> systems.
>>>>>
>>>>> Clarify the documentation, but avoid overly exact promises.
>>>>>
>>>>> Clamping to bw_min and bw_max still feels generic: leave it in the core
>>>>> code, for now.
>>>>
>>>> Sounds like MPAM may be ready to start the schema parsing discussion again?
>>>> I understand that MPAM has a few more ways to describe memory bandwidth as
>>>> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing
>>>> schema format to user space, which seems like a good idea for new schema.
>>>
>>> My own ideas in this area are a little different, though I agree with
>>> the general idea.
>>
>> Should we expect a separate proposal from James?
> 
> At some point, yes.  We still need to have a chat about it.
> 
> Right now, I was just throwing an idea out there.

Thank you very much for doing so. We are digesting it.


...

>>> For this patch, my aim was to avoid changing anything unnecessarily.
>>
>> Understood. More below as I try to understand the details but it does not
>> really sound as though the current interface works that great for MPAM. If I
>> understand correctly this patch enables MPAM to use existing interface for
>> its memory bandwidth allocations but doing so does not enable users to 
>> obtain benefit of hardware capabilities. For that users would want to use
>> the new interface?
> 
> In ideal world, probably, yes.
> 
> Since not all use cases will care about full precision, the MB resource
> (approximated for MPAM) should be fine for a lot of people, but I
> expect that sooner or later somebody will want more exact control.

ack.

> 
>>>>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
>>>>> the other tests except for the NONCONT_CAT tests, which do not seem to
>>>>> be supported in my configuration -- and have nothing to do with the
>>>>> code touched by this patch).
>>>>
>>>> Is the NONCONT_CAT test failing (i.e printing "not ok")?
>>>>
>>>> The NONCONT_CAT tests may print error messages as debug information as part of
>>>> running, but these errors are expected as part of the test. The test should accurately
>>>> state whether it passed or failed though. For example, below attempts to write
>>>> a non-contiguous CBM to a system that does not support non-contiguous masks.
>>>> This fails as expected, error messages printed as debugging and thus the test passes
>>>> with an "ok".
>>>>
>>>> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument                                      
>>>> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected                             
>>>> ok 5 L3_NONCONT_CAT: test                             
>>>
>>> I don't think that this was anything to do with my changes, but I don't
>>> still seem to have the test output.  (Since this test has to do with
>>> bitmap schemata (?), it seemed unlikely to be affected by changes to
>>> bw_validate().)
>>
>> I agree that this should not have anything to do with this patch. My concern
>> is that I understood that the test failed for a feature that is not supported.
>> If this is the case then there may be a problem with the test. The test should
>> not fail if the feature is not supported but instead skip the test.
> 
> I'll try to capture more output from this when I re-run it, so that we
> can figure out what this is.

Thank you.


...

>>>
>>> [...]
>>>
>>>>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
>>>>> index c7949dd44f2f..a1d0469d6dfb 100644
>>>>> --- a/Documentation/filesystems/resctrl.rst
>>>>> +++ b/Documentation/filesystems/resctrl.rst
>>>>> @@ -143,12 +143,11 @@ with respect to allocation:
>>>>>  		user can request.
>>>>>  
>>>>>  "bandwidth_gran":
>>>>> -		The granularity in which the memory bandwidth
>>>>> +		The approximate granularity in which the memory bandwidth
>>>>>  		percentage is allocated. The allocated
>>>>>  		b/w percentage is rounded off to the next
>>>>> -		control step available on the hardware. The
>>>>> -		available bandwidth control steps are:
>>>>> -		min_bandwidth + N * bandwidth_gran.
>>>>> +		control step available on the hardware. The available
>>>>> +		steps are at least as small as this value.
>>>>
>>>> A bit difficult to parse for me. 
>>>> Is "at least as small as" same as "at least"?
>>>
>>> It was supposed to mean: "The available steps are no larger than this
>>> value."
>>
>> This is clear to me, especially when compared with the planned addition to
>> "Memory bandwidth Allocation and monitoring" ... but I do find it contradicting
>> the paragraph below (more below).
>>
>>>
>>> Formally My expectation is that this value is the smallest integer
>>> number of percent which is not smaller than the apparent size of any
>>> individual rounding step.  Equivalently, this is the smallest number g
>>
>> Considering the two statements:
>> - "The available steps are no larger than this value."
>> - "this value ... is not smaller than the apparent size of any individual rounding step"
>>
>> The "not larger" and "not smaller" sounds like all these words just end up saying that
>> this is the step size?
> 
> They are intended to be the same statement: A <= B versus
> B >= A respectively.

This is what I understood from the words ... and that made me think that it
can be simplified to A = B ... but no need to digress ... onto the alternatives below ...

> 
> But I'd be the first to admit that the wording is a bit twisted!
> (I wouldn't be astonshed if I got something wrong somewhere.)
> 
> See below for an alternative way of describing this that might be more
> intuitive.
> 
>>
>>> for which writing "MB: 0=x" and "MB: 0=y" yield different
>>> configurations for every in-range x and where y = x + g and y is also
>>> in-range.
>>>
>>> That's a bit of a mouthful, though.  If you can think of a more
>>> succinct way of putting it, I'm open to suggestions!
>>>
>>>> Please note that the documentation has a section "Memory bandwidth Allocation
>>>> and monitoring" that also contains these exact promises.
>>>
>>> Hmmm, somehow I completely missed that.
>>>
>>> Does the following make sense?  Ideally, there would be a simpler way
>>> to describe the discrepancy between the reported and actual values of
>>> bw_gran...
>>>
>>>  |  Memory bandwidth Allocation and monitoring
>>>  |  ==========================================
>>>  |
>>>  |  [...]
>>>  |
>>>  |  The minimum bandwidth percentage value for each cpu model is predefined
>>>  |  and can be looked up through "info/MB/min_bandwidth". The bandwidth
>>>  |  granularity that is allocated is also dependent on the cpu model and can
>>>  |  be looked up at "info/MB/bandwidth_gran". The available bandwidth
>>>  | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
>>>  | -to the next control step available on the hardware.
>>>  | +control steps are: min_bw + N * (bw_gran - e), where e is a
>>>  | +non-negative, hardware-defined real constant that is less than 1.
>>>  | +Intermediate values are rounded to the next control step available on
>>>  | +the hardware.
>>>  | +
>>>  | +At the time of writing, the constant e referred to in the preceding
>>>  | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran
>>>  | +describes the step size exactly), but this may not be the case on other
>>>  | +hardware when the actual granularity is not an exact divisor of 100.
>>
>> Have you considered how to share the value of "e" with users?
> 
> Perhaps introducing this "e" as an explicit parameter is a bad idea and
> overly formal.  In practice, there are likely to various sources of
> skid and approximation in the hardware, so exposing an actual value may
> be counterproductive -- i.e., what usable guarantee is this providing
> to userspace, if this is likely to be swamped by approximations
> elsewhere?
> 
> Instead, maybe we can just say something like:
> 
>  | The available steps are spaced at roughly equal intervals between the
>  | value reported by info/MB/min_bandwidth and 100%, inclusive.  Reading
>  | info/MB/bandwidth_gran gives the worst-case precision of these
>  | interval steps, in per cent.
> 
> What do you think?

I find "worst-case precision" a bit confusing, consider for example, what
would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives
the upper limit of these interval steps"? I believe this matches what you
mentioned a couple of messages ago: "The available steps are no larger than this
value."

(and "per cent" -> "percent")

> 
> If that's adequate, then the wording under the definition of
> "bandwidth_gran" could be aligned with this.

I think putting together a couple of your proposals and statements while making the
text more accurate may work:

	 "bandwidth_gran":
		The approximate granularity in which the memory bandwidth
 		percentage is allocated. The allocated bandwidth percentage
		is rounded up to the next control step available on the
		hardware. The available hardware steps are no larger than
		this value.

I assume "available" is needed because, even though the steps are not larger
than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth"
to 100% range?
 
>>>> I think that using const to pass data to architecture is great, thanks.
>>>>
>>>> Reinette
>>>
>>> I try to constify by default when straightforward to do so, since the
>>> compiler can then find which cases need to change; the reverse
>>> direction is harder to automate...
>>
>> Could you please elaborate what you mean with "reverse direction"?
> 
> I just meant that over-consting tends to result in violations of the
> language that the compiler will detect, but under-consting doesn't:
> 
> static void foo(int *nonconstp, const int *constp)
> {
> 	*constp = 0; // compiler error
> 	(*nonconstp); // silently accpeted, though it could have been const
> }
> 
> So, the compiler will tell you places where const needs to be removed
> (or something else needs to change), but to find places where const
> could be _added_, you have to hunt them down yourself, or use some
> other tool that is probably not part of the usual workflow.

Got it, thanks.

Reinette