Message-ID: <7628cec8-5914-4895-8289-027e7821777e@amd.com>
Date: Thu, 22 May 2025 10:44:07 -0500
From: "Moger, Babu" <babu.moger@....com>
To: Reinette Chatre <reinette.chatre@...el.com>,
 Peter Newman <peternewman@...gle.com>
Cc: "Moger, Babu" <bmoger@....com>, corbet@....net, tony.luck@...el.com,
 tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com, james.morse@....com, dave.martin@....com,
 fenghuay@...dia.com, x86@...nel.org, hpa@...or.com, paulmck@...nel.org,
 akpm@...ux-foundation.org, thuth@...hat.com, rostedt@...dmis.org,
 ardb@...nel.org, gregkh@...uxfoundation.org, daniel.sneddon@...ux.intel.com,
 jpoimboe@...nel.org, alexandre.chartre@...cle.com,
 pawan.kumar.gupta@...ux.intel.com, thomas.lendacky@....com,
 perry.yuan@....com, seanjc@...gle.com, kai.huang@...el.com,
 xiaoyao.li@...el.com, kan.liang@...ux.intel.com, xin3.li@...el.com,
 ebiggers@...gle.com, xin@...or.com, sohil.mehta@...el.com,
 andrew.cooper3@...rix.com, mario.limonciello@....com,
 linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
 maciej.wieczor-retman@...el.com, eranian@...gle.com, Xiaojian.Du@....com,
 gautham.shenoy@....com
Subject: Re: [PATCH v13 00/27] x86/resctrl : Support AMD Assignable Bandwidth
 Monitoring Counters (ABMC)

Hi Reinette,

On 5/21/25 18:03, Reinette Chatre wrote:
> Hi Peter and Babu,
> 
> On 5/21/25 2:18 AM, Peter Newman wrote:
>> Hi Babu/Reinette,
>>
>> On Wed, May 21, 2025 at 1:44 AM Reinette Chatre
>> <reinette.chatre@...el.com> wrote:
>>>
>>> Hi Babu,
>>>
>>> On 5/20/25 4:25 PM, Moger, Babu wrote:
>>>> Hi Reinette,
>>>>
>>>> On 5/20/2025 1:23 PM, Reinette Chatre wrote:
>>>>> Hi Babu,
>>>>>
>>>>> On 5/20/25 10:51 AM, Moger, Babu wrote:
>>>>>> Hi Reinette,
>>>>>>
>>>>>> On 5/20/25 11:06, Reinette Chatre wrote:
>>>>>>> Hi Babu,
>>>>>>>
>>>>>>> On 5/20/25 8:28 AM, Moger, Babu wrote:
>>>>>>>> On 5/19/25 10:59, Peter Newman wrote:
>>>>>>>>> On Fri, May 16, 2025 at 12:52 AM Babu Moger <babu.moger@....com> wrote:
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>>>>> /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs: Reports the number of monitoring
>>>>>>>>>> counters available for assignment.
>>>>>>>>>
>>>>>>>>> Earlier I discussed with Reinette[1] what num_mbm_cntrs should
>>>>>>>>> represent in a "soft-ABMC" implementation where assignment is
>>>>>>>>> implemented by assigning an RMID, which would result in all events
>>>>>>>>> being assigned at once.
>>>>>>>>>
>>>>>>>>> My main concern is how many "counters" you can assign by assigning
>>>>>>>>> RMIDs. I recall Reinette proposed reporting the number of groups which
>>>>>>>>> can be assigned separately from counters which can be assigned.
>>>>>>>>
>>>>>>>> More context may be needed here. Currently, num_mbm_cntrs indicates the
>>>>>>>> number of counters available per domain, which is 32.
>>>>>>>>
>>>>>>>> At the moment, we can assign 2 counters to each group, meaning each RMID
>>>>>>>> can be associated with 2 hardware counters. In theory, it's possible to
>>>>>>>> assign all 32 hardware counters to a group—allowing one RMID to be linked
>>>>>>>> with up to 32 counters. However, we currently lack the interface to
>>>>>>>> support that level of assignment.
>>>>>>>>
>>>>>>>> For now, the plan is to support basic assignment and expand functionality
>>>>>>>> later once we have the necessary data structure and requirements.
>>>>>>>
>>>>>>> Looks like some requirements did not make it into this implementation.
>>>>>>> Do you recall the discussion that resulted in you writing [2]? Looks like
>>>>>>> there is a question to Peter in there on how to determine how many "counters"
>>>>>>> are available in soft-ABMC. I interpreted [3] at that time to mean that this
>>>>>>> information would be available in a future AMD publication.
>>>>>>
>>>>>> We already have a method to determine the number of counters in soft-ABMC
>>>>>> mode, which Peter has addressed [4].
>>>>>>
>>>>>> [4]
>>>>>> https://lore.kernel.org/lkml/20250203132642.2746754-1-peternewman@google.com/
>>>>>>
>>>>>> This appears to be more of a workaround, and I doubt it will be included
>>>>>> in any official AMD documentation. Additionally, the long-term direction
>>>>>> is moving towards ABMC.
>>>>>>
>>>>>> I don’t believe this workaround needs to be part of the current series. It
>>>>>> can be added later when soft-ABMC is implemented.
>>>>>
>>>>> Agreed. What about the plans described in [2]? (Thanks to Peter for
>>>>> catching this!).
>>>>>
>>>>> It is important to keep track of requirements while working on a feature to
>>>>> ensure that the implementation supports the planned use cases. Re-reading that
>>>>> thread it is not clear to me how soft-ABMC's per-group assignment would look.
>>>>> Could you please share how you see it progress from this implementation?
>>>>> This includes the single event vs. multiple event assignment. I would like to
>>>>> highlight that this is not a request for this to be supported in this implementation
>>>>> but there needs to be a plan for how this can be supported on top of interfaces
>>>>> established by this work.
>>>>>
>>>>
>>>> Here’s my current understanding of soft-ABMC. Peter may have a more in-depth perspective on this.
>>>>
>>>> Soft-ABMC:
>>>> a. num_mbm_cntrs: This is a software-defined limit based on the number of active RMIDs that can be supported. The value can be obtained using the code referenced in [4].
>>
>> I would call it a hardware-defined limit that can be probed by software.
>>
>> The main question is whether this file returns the exact number of
>> RMIDs hardware can track or double that number (mbm_total_bytes +
>> mbm_local_bytes) so that the value is always measured in events.
> 
> tl;dr: I continue [3] to find it most intuitive for num_mbm_cntrs to be the exact
> number of "active" RMIDs that the system can support *and* changing the name of
> the modes to help user interpret num_mbm_cntrs: "mbm_cntr_event_assign" for ABMC,
> "mbm_cntr_group_assign" for soft-ABMC.
> 
> details
> -------
> 
> We are now back to the previous discussion about what user can expect from
> the interface. Let me try and re-cap that discussion so that we can all hopefully
> get back on the same page. Please add corrections/updates where needed.
> 
> soft-ABMC
> ---------
>   soft-ABMC manages "active" (term TBD) RMID assignment to monitor groups. When an
>   "active" RMID is assigned to a monitor group then *all* MBM events (not LLC occupancy)
>   in that monitor group are counted. "Active" RMID assignment can be done per domain.
> 
>   Requirement: resctrl should accurately reflect which events are counted. That is,
>   we do not want resctrl to pretend to allow user to assign an "active" RMID to
>   only one event in a monitor group while all events are actually counted.
> 
>   Caveat: To support rapid re-assignment of RMIDs to monitor groups, llc_occupancy
>   event is disabled when soft-ABMC is enabled.
> 
> ABMC
> ----
>   ABMC manages (hardware) counter assignment to monitor group (RMID), event pairs.
>   When a hardware counter is assigned to an RMID, event pair then only that
>   RMID, event is counted. Hardware counter assignment can be done per domain.
> 
> 
> shared assignment
> -----------------
> A shared assignment applies to both soft-ABMC and ABMC. A user can designate a
> "counter" (could be hardware counter or "active" RMID) as shared and that means
> the counter within that domain is shared between different monitor groups and actual
> assignment is scheduled by resctrl.  

Good summary. Thanks.

> 
> 
> user interface
> --------------
> 
> Next, consider the interface while keeping above definitions and requirements in mind.
> 
> This series introduces (using implementation, not cover-letter):
> 
> /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs 
> "num_mbm_cntrs":                                                               
> 	The maximum number of monitoring counters (total of available and assigned
> 	counters) in each domain when the system supports mbm_cntr_assign mode. 
> 
> /sys/fs/resctrl/mbm_L3_assignments
> "mbm_L3_assignments":                                                          
> 	This interface file is created when the mbm_cntr_assign mode is supported
> 	and shows the assignment status for each group.              
> 
> Consider "mbm_L3_assignments" first. The interface is documented for ABMC support
> where it is possible to manage individual event assignment within monitor group.
> 
> For ABMC it is possible to assign just one event at a time and doing so consumes
> one counter in that domain:
> 
> a) Starting state on system with 32 counters per domain, two events in default
>    resource group consumes two counters in that domain:
> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
> 0=30;1=32
> # cat /sys/fs/resctrl/mbm_L3_assignments
> mbm_total_bytes:0=e;1=_
> mbm_local_bytes:0=e;1=_
> 
> b) Assign counter to mbm_local_bytes in domain 1:
> # echo "mbm_local_bytes:1=e" > /sys/fs/resctrl/mbm_L3_assignments
> # cat /sys/fs/resctrl/mbm_L3_assignments
> mbm_total_bytes:0=e;1=_
> mbm_local_bytes:0=e;1=e
> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
> 0=30;1=31
> 
> The question is how this should look on soft-ABMC system. Let's say hypothetically
> that on a soft-ABMC system it is possible to have 32 "active" RMIDs.
> 
> a) Starting state on system with 32 "active RMIDs" per domain, two events in default
>    resource group consumes one RMID in that domain:
> 
> # cat /sys/fs/resctrl/mbm_L3_assignments
> mbm_total_bytes:0=e;1=_
> mbm_local_bytes:0=e;1=_
> 
> What should num_mbm_cntrs display?
> 
> Option A (counters are RMIDs):
> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
> 0=31;1=32
> 
> Option B (pretend RMIDs are events):
> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
> 0=62;1=64
> 
> b) Assign counter to mbm_local_bytes in domain 1:
> # echo "mbm_local_bytes:1=e" > /sys/fs/resctrl/mbm_L3_assignments
> # cat /sys/fs/resctrl/mbm_L3_assignments
> mbm_total_bytes:0=e;1=e
> mbm_local_bytes:0=e;1=e
> 
> Note that even though user requested only mbm_local_bytes to be assigned, it
> actually results in both mbm_total_bytes and mbm_local_bytes to be assigned. This
> ensures accurate state representation to user space but this also creates an
> inconsistent user interface between soft-ABMC and ABMC since user space intends
> to use the same interface but "sometimes" assigning one event results in assign
> of one event while "sometimes" it results in assign of multiple events.
> 
> wrt "num_mbm_cntrs"
> 
> Option A (counters are RMIDs):
> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
> 0=31;1=31
> 
> Option B (pretend RMIDs are events):
> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
> 0=62;1=62 
> 
> Neither option seems ideal to me since the interface cannot be consistent
> between ABMC and soft-ABMC.
> As I mentioned in [2] it is not possible to hide ABMC and soft-ABMC behind
> the same interface. When user space wants to monitor a particular monitor group
> then it should be clear how that can be accomplished. Not knowing if
> an assignment/unassignment to/from an event would impact one or all events
> and whether it will consume one or multiple counters does not sound like a good
> interface to me. 
> 
> As I understand current interface, user is required to know how ABMC and soft-ABMC
> is implemented to be able to configure the system. For example, if user has file like:
> 	# cat /sys/fs/resctrl/mbm_L3_assignments
> 	mbm_total_bytes:0=e;1=e
> 	mbm_local_bytes:0=e;1=e
> user must know underlying implementation to be able to manage monitoring of
> events and assigning counters otherwise it will be a surprise to lose monitoring
> of all events when unassigning one event.
> 
> This is why I proposed in [3] that the name of the mode reflects how user can interact
> with the system. Instead of one "mbm_cntr_assign" mode there can be "mbm_cntr_event_assign"
> that is used for ABMC and "mbm_cntr_group_assign" that is used for soft-ABMC. The mode should
> make it clear what the system is capable of wrt counter assignments.

Yes, that makes sense. Perhaps we can also simplify it further:

# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
[mbm_cntr_evt_assign] <- for ABMC
 mbm_cntr_grp_assign  <- for soft-ABMC
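
To illustrate how user space might consume such a file, here is a minimal Python sketch. It assumes the active mode is the one shown in square brackets, as in the listing above. The mode names are only the ones proposed in this mail, and the helper name active_assign_mode() is hypothetical, not part of the series.

```python
def active_assign_mode(text: str) -> str:
    """Return the mode name enclosed in [brackets], assumed to be the active one."""
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # Only the first token matters; trailing annotations are ignored.
        name = tokens[0]
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    raise ValueError("no active mode found")


# Sample content mirroring the proposed listing (hypothetical, not real output).
sample = "[mbm_cntr_evt_assign]\n mbm_cntr_grp_assign\n"
print(active_assign_mode(sample))
```

A tool could branch on the returned name to decide whether an assignment affects a single event or the whole monitoring group.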

> 
> Considering this the interface should be clear:
> num_mbm_cntrs: reflects the number of counters in each domain that can be assigned. In
> "mbm_cntr_event_assign" this will be the number of counters that can be assigned to 
> each event within a monitoring group, in "mbm_cntr_group_assign" this will be the number
> of counters that can be assigned to entire monitoring groups impacting all MBM events.
> 
> mbm_L3_assignments: manages the counter assignment in each group. When user knows the mode
> is "mbm_cntr_event_assign"/"mbm_cntr_group_assign" then it should be clear to user space how the
> interface behaves wrt assignment, no surprises of multiple events impacted when
> assigning/unassigning single event.
> 
> For soft-ABMC I thus find it most intuitive for num_mbm_cntrs to be the exact number
> of "active" RMIDs that the system can support *and* changing the name of the modes
> to help user interpret num_mbm_cntrs.

Sure. Option A fits well here, then.

 Option A (counters are RMIDs):
 # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
 0=31;1=31
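
As a rough sketch of the accounting behind Option A (an illustration only, not code from the series), the Python below parses the per-domain "domain=value" format used in the examples in this thread and models how many counters one assignment would consume in each mode. The mode names are the ones proposed above; parse_domains() and counters_consumed() are hypothetical helpers.

```python
from typing import Dict


def parse_domains(text: str) -> Dict[int, int]:
    """Parse a per-domain list such as '0=31;1=32' into {domain_id: value}."""
    result = {}
    for entry in text.strip().split(";"):
        domain, value = entry.split("=")
        result[int(domain)] = int(value)
    return result


def counters_consumed(mode: str, newly_assigned_events: int) -> int:
    """Model the cost of an assignment in one domain (assumed semantics)."""
    if newly_assigned_events <= 0:
        return 0
    if mode == "mbm_cntr_evt_assign":
        # ABMC: one hardware counter per (RMID, event) pair.
        return newly_assigned_events
    if mode == "mbm_cntr_grp_assign":
        # soft-ABMC: one "active" RMID covers all MBM events of the group.
        return 1
    raise ValueError("unknown mode: " + mode)


# soft-ABMC, Option A: assigning mbm_local_bytes in domain 1 uses one RMID.
counters = parse_domains("0=31;1=32")
counters[1] -= counters_consumed("mbm_cntr_grp_assign", 1)
print(counters)
```

With the starting state 0=31;1=32 this yields 31 for both domains, matching the Option A listing above.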

> 
>>
>> There's also the mongroup-RMID overcommit use case I described
>> above[1]. On Intel we can safely assume that there are counters to
>> back all RMIDs, so num_mbm_cntrs would be calculated directly from
>> num_rmids.
> 
> This is about the:
> 	There's now more interest in Google for allowing explicit control of
> 	where RMIDs are assigned on Intel platforms. Even though the number of
> 	RMIDs implemented by hardware tends to be roughly the number of
> 	containers they want to support, they often still need to create
> 	containers when all RMIDs have already been allocated, which is not
> 	currently allowed. Once the container has been created and starts
> 	running, it's no longer possible to move its threads into a monitoring
> 	group whenever RMIDs should become available again, so it's important
> 	for resctrl to maintain an accurate task list for a container even
> 	when RMIDs are not available.
> 
> I see a monitor group as a collection of tasks that need to be monitored together.
> The "task list" is the group of tasks that share a monitoring ID that
> is required to be a valid ID since when any of the tasks are scheduled that ID is
> written to the hardware. I intentionally tried to not use RMID since I believe
> this is required for all archs.
> I thus do not understand how a task can start running when it does not have
> a valid monitoring ID. The idea of "deferred assignment" is not clear to me,
> there can never be "unmonitored tasks", no? I think I am missing something here.
> 
>> I realized this use case is more difficult to implement on MPAM,
>> because a PARTID is effectively a CLOSID+RMID, so deferring assigning
>> a unique PARTID to a group also results in it being in a different
>> allocation group. It will work if the unmonitored groups could find a
>> way to share PARTIDs, but this has consequences on allocation - but
>> hopefully no worse than sharing CLOSIDs on x86.
>>
>> There's a lot of interest in monitoring ID overcommit in Google, so I
>> think it's worth it for me to investigate the additional structural
>> changes needed in resctrl (i.e., breaking the FS-level association
>> between mongroups and HW monitoring IDs). Such a framework could be a
>> better fit for soft-ABMC. For example, if overcommit is allowed, we
>> would just report the number of simultaneous RMIDs we were able to
>> probe as num_rmids. I would want the same shared assignment scheduler
>> to be able to work with RMIDs and counters, though.
>>
>> Thanks,
>> -Peter
>>
>> [1] https://lore.kernel.org/lkml/CALPaoChSzzU5mzMZsdT6CeyEn0WD1qdT9fKCoNW_ty4tojtrkw@mail.gmail.com/
> 
> Reinette
> 
> [2] https://lore.kernel.org/lkml/b9e48e8f-3035-4a7e-a983-ce829bd9215a@intel.com/
> [3] https://lore.kernel.org/lkml/b3babdac-da08-4dfd-9544-47db31d574f5@intel.com/
> 

-- 
Thanks
Babu Moger
