lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8ef51f28-e01a-4a7d-ba86-059437edb60b@amd.com>
Date: Fri, 14 Feb 2025 12:31:00 -0600
From: "Moger, Babu" <bmoger@....com>
To: Reinette Chatre <reinette.chatre@...el.com>,
 Dave Martin <Dave.Martin@....com>
Cc: Babu Moger <babu.moger@....com>, peternewman@...gle.com, corbet@....net,
 tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com, tony.luck@...el.com, fenghua.yu@...el.com,
 x86@...nel.org, hpa@...or.com, paulmck@...nel.org,
 akpm@...ux-foundation.org, thuth@...hat.com, rostedt@...dmis.org,
 xiongwei.song@...driver.com, pawan.kumar.gupta@...ux.intel.com,
 daniel.sneddon@...ux.intel.com, jpoimboe@...nel.org, perry.yuan@....com,
 sandipan.das@....com, kai.huang@...el.com, xiaoyao.li@...el.com,
 seanjc@...gle.com, xin3.li@...el.com, andrew.cooper3@...rix.com,
 ebiggers@...gle.com, mario.limonciello@....com, james.morse@....com,
 tan.shaopeng@...itsu.com, linux-doc@...r.kernel.org,
 linux-kernel@...r.kernel.org, maciej.wieczor-retman@...el.com,
 eranian@...gle.com
Subject: Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth
 Monitoring Counters (ABMC)

Hi Dave/Reinette,

On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> Hi Dave,
> 
> On 2/13/25 9:37 AM, Dave Martin wrote:
>> Hi Reinette,
>>
>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>> Hi Dave,
>>>
>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>> Hi there,
>>>>
>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>
>>>>> This series adds the support for Assignable Bandwidth Monitoring Counters
>>>>> (ABMC). It is also called QoS RMID Pinning feature
>>>>>
>>>>> Series is written such that it is easier to support other assignable
>>>>> features supported from different vendors.
>>>>>
>>>>> The feature details are documented in the  APM listed below [1].
>>>>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>>>>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>>>>> Monitoring (ABMC). The documentation is available at
>>>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>>>>
>>>>> The patches are based on top of commit
>>>>> d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'
>>
>> [...]
>>
>>>>> b. Check how many ABMC counters are available.
>>>>>
>>>>> 	# cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
>>>>> 	32
>>>>
>>>> Is this file needed?
>>>>
>>>> With MPAM, it is more difficult to promise that the same number of
>>>> counters will be available everywhere.
>>>>
>>>> Rather than lie, or report a "safe" value here that may waste some
>>>> counters, can we just allow the number of counters to be be discovered
>>>> per domain via available_mbm_cntrs?
>>>
>>> This sounds reasonable to me. I think us having trouble with the
>>> user documentation of this file so late in development should also have been
>>> a sign to rethink its value.
>>>
>>> For a user to discover the number of counters supported via available_mbm_cntrs
>>> would require the file's contents to be captured right after mount. Since we've
>>> had scenarios where new userspace needs to discover an up-and-running system's
>>> configuration this may not be possible. I thus wonder instead of removing
>>> num_mbm_cntrs, it could be modified to return the per-domain supported counters
>>> instead of a single value?
>>
>> Is it actually useful to be able to discover the number of counters
>> that exist?  A counter that exists but is not available cannot be used,
>> so perhaps it is not useful to know about it in the first place.
> 
> An alternative perspective of what "available" means is "how many counters
> could I possibly get to do this new monitoring task". A user may be willing
> to re-assign counters if the new monitoring task is important. Knowing
> how many counters are already free and available for assignment would be
> easy from available_mbm_cntrs but to get an idea of how many counters
> could be re-assigned to help out with the new task would require
> some intricate parsing of mbm_assign_control.
> 
> 
>> But if we keep this file but make it report the number of counters for
>> each domain (similarly to mbm_available_cntrs), then I think the MPAM
>> driver should be able to work with that.
>>
>>>> num_closids and num_rmids are already problematic for MPAM, so it would
>>>> be good to avoid any more parameters of this sort from being reported
>>>> to userspace unless there is a clear understanding of why they are
>>>> needed.
>>>
>>> Yes. Appreciate your help in identifying what could be problematic for MPAM.
>>
>> For clarity: this is a background issue, mostly orthogonal to this
>> series.
>>
>> If this series is merged as-is, with a global per-resource
>> num_mbm_cntrs property, then this not really worse than the current
>> situation -- it's just a bit annoying from the MPAM perspective.
>>
>>
>> In a nutshell, the num_closids / num_rmids parameters seem to expose
>> RDT-specific hardware semantics to userspace, implying a specific
>> allocation model for control group and monitoring group identifiers.
>>
>> The guarantees that userspace is entitled to asssume when resctrl
>> reports particular values do not seem to be well described and are hard
>> to map onto the nearest-equivalent MPAM implementation.  A combination
>> of control and monitoring groups that can be created on x86 may not be
>> creatable on MPAM, even when the number of supportable control and
>> monitoring partitions is the same.
> 
> I understand. This interface was created almost a decade ago. It would have been
> wonderful if the user interface could have been created with a clear vision
> of all the use cases it would end up needing to support. I am trying to be
> very careful with this new user interface as I try to consider all the things I
> learned while working on resctrl. All help get this new interface right is
> greatly appreciated.
> 
> Since your specifically mention issues that MPAM has with num_rmids, please
> note that we have been trying (see [1], but maybe start reading thread at [2])
> to find ways to make this work with MPAM but no word from MPAM side.
> I see that you were not cc'd on the discussion so this is not a criticism of
> you personally but I would like to highlight that we do try to make things
> work well for MPAM but so far this work seems ignored, yet critisized
> for not being done. I expect the more use cases are thrown at an interface
> as it is developed the better it would get and I would gladly work with MPAM
> folks to improve things.
> 
>> Even with the ABMC series, we may still be constrained on what we can
>> report for num_rmids: we can't know in advance whether or not the user
>> is going to use mbm_cntr_assign mode -- if not, we can't promise to
>> create more monitoring groups than the number of counters in the
>> hardware.
> 
> It is the architecture that decides which modes are supported and
> which is default.
> 
>> It seems natural for the counts reported by "available_mbm_cntrs" to
>> change dynamically when the ABMC assignment mode is changed, but I
>> think userspace are likely to expect the global "num_rmids" parameters
>> to be fixed for the lifetime of the resctrl mount (and possibly fixed
>> for all time on a given hardware platform -- at least, modulo CDP).
>>
>>
>> I think it might be possible to tighten up the docmentation of
>> num_closids in particular in a way that doesn't conflict with x86 and
>> may make it easier for MPAM to fit in with, but that feels like a
>> separate conversation.
>>
>> None of this should be considered a blocker for this series, either way.
>>
>>>>
>>>> Reporting number of counters per monitoring domain is a more natural
>>>> fit for MPAM, as below:
>>>>
>>>>> c. Check how many ABMC counters are available in each domain.
>>>>>
>>>>> 	# cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
>>>>> 	0=30;1=30
>>>>
>>>> For MPAM, this seems supportable.  Each monitoring domain will have
>>>> some counters, and a well-defined number of them will be available for
>>>> allocation at any one time.
>>
>> [...]
>>
>>>>> e. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
>> [...]
>>
>>>>>         Flags can be one of the following:
>>>>>
>>>>>          t  MBM total event is enabled.
>>>>>          l  MBM local event is enabled.
>>>>>          tl Both total and local MBM events are enabled.
>>>>>          _  None of the MBM events are enabled
>>>>>
>>>>> 	Examples:
>>>>
>>>> [...]
>>>>
>>>> I think that this basically works for MPAM.
>>>>
>>>> The local/total distinction doesn't map in a consistent way onto MPAM,
>>>> but this problem is not specific to ABMC.  It feels sensible for ABMC
>>>> to be built around the same concepts that resctrl already has elsewhere
>>>> in the interface.  MPAM will do its best to fit (as already).
>>>>
>>>> Regarding Peter's use case of assiging multiple counters to a
>>>> monitoring group [1], I feel that it's probably good enough to make
>>>> sure that the ABMC interface can be extended in future in a backwards
>>>> compatible way so as to support this, without trying to support it
>>>> immediately.
>>>>
>>>> [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/
>>>>
>>>
>>> I do not think that resctrl's current support of the mbm_total_bytes and
>>> mbm_local_bytes should be considered as the "only" two available "slots"
>>> into which all possible events should be forced into. "mon_features" exists
>>> to guide user space to which events are supported and as I see it new events
>>> can be listed here to inform user space of their availability, with their
>>> associated event files available in the resource groups.
>>
>> That's fair.  I wasn't currently sure how (or if) the set of countable
>> events was expected to grow / evolve via this route.
>>
>> Either way, I think this confirms that there is at least one viable way
>> to enable more counters for a single control group, on top of this
>> series.
>>
>> (If there is more than one way, that seems fine?)
>>
>>>>
>>>> For example, if we added new generic "letters" -- say, "0" to "9",
>>>> combined with new counter files in resctrlfs, that feels like a
>>>> possible approach.  ABMC (as in this series) should just reject such
>>>> such assignments, and the new counter files wouldn't exist.
>>>>
>>>> Availability of this feature could also be reported as a distinct mode
>>>> in mbm_assign_mode, say "mbm_cntr_generic", or whatever.
>>>>
>>>>
>>>> A _sketch_ of this follows.  This is NOT a proposal -- the key
>>>> question is whether we are confident that we can extend the interface
>>>> in this way in the future without breaking anything.
>>>>
>>>> If "yes", then the ABMC interface (as proposed by this series) works as
>>>> a foundation to build on.
>>>>
>>>> --8<--
>>>>
>>>> [artists's impression]
>>>>
>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>>>   	mbm_cntr_generic
>>>>   	[mbm_cntr_assign]
>>>>   	default
>>>>
>>>> # echo mbm_cntr_generic >/sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>>> # echo '//0=01;1=23' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter0_bytes_type
>>>> # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter1_bytes_type
>>>> # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter2_bytes_type
>>>> # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter3_bytes_type
>>>>
>>>> ...
>>>>
>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_counter1_bytes
>>>>
>>>> etc.
>>>>
>>>
>>> It is not clear to me what additional features such an interface enables. It
>>> also looks like user space will need to track and manage counter IDs?
>>
>> My idea was that for these generic counters, new files could be exposed
>> to configure what they actually count (the ..._type files shown above;
>> or possibly via the ..._config files that already exist).
>>
>> The "IDs" were inteded as abstract; the number only relates the
>> assignments in mbm_assign_control to the files created elsewhere.  This
>> wouldn't be related to IDs assigned by the hardware.
> 
> I see. Yes, this sounds related to and a generalization of the AMD
> configurable event feature.
> 
>>
>> If there are multiple resctrl users then using numeric IDs might be
>> problematic; though if we go eventually in the direction of making
>> resctrlfs multi-mountable then each mount could have its own namespace.
> 
> I am not aware of "multi-mountable" direction.
> 
>>
>> Allowing counters to be named and configured with a mkdir()-style
>> interface might be possible too; that might make it easier for users to
>> coexist within a single resctrl mount (if we think that's important
>> enough).
>>
>>> It sounds to me as though the issue starts with your statement
>>> "The local/total distinction doesn't map in a consistent way onto MPAM". To
>>> address this I expect that an MPAM system will not support nor list
>>> mbm_total_bytes and/or mbm_local_bytes in its mon_features file (*)? Instead,
>>> it would list the events that are appropriate to the system? Trying to match
>>> with what Peter said [1] in the message you refer to, this may be possible:
>>>
>>> # cat /sys/fs/resctrl/info/L3_MON/mon_features
>>> mbm_local_read_bytes
>>> mbm_local_write_bytes
>>> mbm_local_bytes
>>>
>>> (*) I am including mbm_local_bytes since it could be an event that can be software
>>> defined as a sum of mbm_local_read_bytes and mbm_local_write_bytes when they are both
>>> counted.
>>>
>>> I see the support for MPAM events distinct from the support of assignable counters.
>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>> Please help me understand if you see it differently.
>>> 	
>>> Doing so would need to come up with alphabetical letters for these events,
>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>
>>> mbm_local_read_bytes a
>>> mbm_local_write_bytes b
>>>
>>> Then mbm_assign_control can be used as:
>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>> <value>
>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>
>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>
>>> Reinette
>>>
>>> [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/
>>
>> That approach would also work, where an MPAM system has events are not
>> a reasonable approximation of the generic "total" or "local".
>>
>> For now we would probably stick with "total" and "local" anyway though,
>> because the MPAM architecture doesn't natively allow the mapping onto
>> the memory system topology to be discovered, and the information in
>> ACPI / device tree is insufficient to tell us everything we'd need to
>> know.  But I guess what counts as "local" in particular will be quite
>> hardware and topology dependent even on x86, so perhaps we shouldn't
>> worry about having the behaviour match exactly (?)
>>
>> Regarding the code letters, my idea was that the event type might be
>> configured by a separate file, instead of in mbm_assign_control
>> directly, in which case running out of letters wouldn't be a problem.
> 
> This work started with individual files for counters but the issue was
> raised that this will require a large number of filesystem calls when, for
> example, a user wants to move a group of counters associated with the events
> of one set of monitoring groups to another set of monitoring groups. This
> is for the use case where there are a significant number of monitor groups
> for which there are not sufficient counters. With mbm_assign_control this
> can be done in a single write and such a monitoring transition can thus
> be accomplished more efficiently.
> 
>>
>> Alternatively, if we want to be able to expand beyond single letters,
>> could we reserve one or more characters for extension purposes?
>>
>> If braces are forbidden by the syntax today, could we add support for
>> something like the following later on, without breaking anything?
>>
>> # echo '//0={foo}{bar};1={bar}' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
> 
> Thank you for the suggestion. I think we may need something like this.
> Babu, what do you think?

I'm not quite clear on this. Do we know what 'foo' and 'bar' refer to?
It is a random text?

In his example from
https://lore.kernel.org/lkml/Z643WdXYARTADSBy@e133380.arm.com/
--------------------------------------------------------------
The numbers are not supposed to have an hardware significance.

	'//0=6'

just "means assign some unused counter for domain 0, and create files
in resctrl so I can configure and read it".

The "6" is really just a tag for labelling the resulting resctrl
file names so that the user can tell them apart.  It's not supposed
to imply any specific hardware counter or event.
------------------------------------------------------------------

It seems that 'foo' and 'bar' are tags used to create files in 
/sys/fs/resctrl/info/L3_MON/.

Given that, it looks like we're discussing entirely different things.

> 
>>
>> For now, my main concern would be whether this series prevents that
>> sort of thing being added in a backwards compatible way later.
>>
>> I don't really see anything that is a blocker.
>>
>> What do you think?
> 
> I do not fully understand the MPAM counter feature. It almost sounds like
> every counter could be configured independently with the expectation to
> configure and assign each counter independently to a domain. As I understand
> these capabilities match AMD's ABMC feature, but the planned implementation
> to support ABMC first configures events per-domain and then assign these
> events to counters. hmmm ... but in your example a file like
> "mbm_counter0_bytes_type" is global. Could you please elaborate how in
> your example writing a single letter to that file will be interpreted?
> 
> 
> Reinette
> 
> [1] https://lore.kernel.org/lkml/46767ca7-1f1b-48e8-8ce6-be4b00d129f9@intel.com/
> [2] https://lore.kernel.org/lkml/CALPaoChad6=xqz+BQQd=dB915xhj1gusmcrS9ya+T2GyhTQc5Q@mail.gmail.com/
> 


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ