Message-ID: <aC5lL_qY00vd8qp4@agluck-desk3>
Date: Wed, 21 May 2025 16:43:43 -0700
From: "Luck, Tony" <tony.luck@...el.com>
To: Reinette Chatre <reinette.chatre@...el.com>
Cc: Peter Newman <peternewman@...gle.com>, "Moger, Babu" <bmoger@....com>,
	babu.moger@....com, corbet@....net, tglx@...utronix.de,
	mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com,
	james.morse@....com, dave.martin@....com, fenghuay@...dia.com,
	x86@...nel.org, hpa@...or.com, paulmck@...nel.org,
	akpm@...ux-foundation.org, thuth@...hat.com, rostedt@...dmis.org,
	ardb@...nel.org, gregkh@...uxfoundation.org,
	daniel.sneddon@...ux.intel.com, jpoimboe@...nel.org,
	alexandre.chartre@...cle.com, pawan.kumar.gupta@...ux.intel.com,
	thomas.lendacky@....com, perry.yuan@....com, seanjc@...gle.com,
	kai.huang@...el.com, xiaoyao.li@...el.com,
	kan.liang@...ux.intel.com, xin3.li@...el.com, ebiggers@...gle.com,
	xin@...or.com, sohil.mehta@...el.com, andrew.cooper3@...rix.com,
	mario.limonciello@....com, linux-doc@...r.kernel.org,
	linux-kernel@...r.kernel.org, maciej.wieczor-retman@...el.com,
	eranian@...gle.com, Xiaojian.Du@....com, gautham.shenoy@....com
Subject: Re: [PATCH v13 00/27] x86/resctrl : Support AMD Assignable Bandwidth
 Monitoring Counters (ABMC)

On Wed, May 21, 2025 at 04:03:37PM -0700, Reinette Chatre wrote:
> Hi Peter and Babu,
> 
> On 5/21/25 2:18 AM, Peter Newman wrote:
> > Hi Babu/Reinette,
> > 
> > On Wed, May 21, 2025 at 1:44 AM Reinette Chatre
> > <reinette.chatre@...el.com> wrote:
> >>
> >> Hi Babu,
> >>
> >> On 5/20/25 4:25 PM, Moger, Babu wrote:
> >>> Hi Reinette,
> >>>
> >>> On 5/20/2025 1:23 PM, Reinette Chatre wrote:
> >>>> Hi Babu,
> >>>>
> >>>> On 5/20/25 10:51 AM, Moger, Babu wrote:
> >>>>> Hi Reinette,
> >>>>>
> >>>>> On 5/20/25 11:06, Reinette Chatre wrote:
> >>>>>> Hi Babu,
> >>>>>>
> >>>>>> On 5/20/25 8:28 AM, Moger, Babu wrote:
> >>>>>>> On 5/19/25 10:59, Peter Newman wrote:
> >>>>>>>> On Fri, May 16, 2025 at 12:52 AM Babu Moger <babu.moger@....com> wrote:
> >>>>>>
> >>>>>> ...
> >>>>>>
> >>>>>>>>> /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs: Reports the number of monitoring
> >>>>>>>>> counters available for assignment.
> >>>>>>>>
> >>>>>>>> Earlier I discussed with Reinette[1] what num_mbm_cntrs should
> >>>>>>>> represent in a "soft-ABMC" implementation where assignment is
> >>>>>>>> implemented by assigning an RMID, which would result in all events
> >>>>>>>> being assigned at once.
> >>>>>>>>
> >>>>>>>> My main concern is how many "counters" you can assign by assigning
> >>>>>>>> RMIDs. I recall Reinette proposed reporting the number of groups which
> >>>>>>>> can be assigned separately from counters which can be assigned.
> >>>>>>>
> >>>>>>> More context may be needed here. Currently, num_mbm_cntrs indicates the
> >>>>>>> number of counters available per domain, which is 32.
> >>>>>>>
> >>>>>>> At the moment, we can assign 2 counters to each group, meaning each RMID
> >>>>>>> can be associated with 2 hardware counters. In theory, it's possible to
> >>>>>>> assign all 32 hardware counters to a group—allowing one RMID to be linked
> >>>>>>> with up to 32 counters. However, we currently lack the interface to
> >>>>>>> support that level of assignment.
> >>>>>>>
> >>>>>>> For now, the plan is to support basic assignment and expand functionality
> >>>>>>> later once we have the necessary data structure and requirements.
> >>>>>>
> >>>>>> Looks like some requirements did not make it into this implementation.
> >>>>>> Do you recall the discussion that resulted in you writing [2]? Looks like
> >>>>>> there is a question to Peter in there on how to determine how many "counters"
> >>>>>> are available in soft-ABMC. I interpreted [3] at that time to mean that this
> >>>>>> information would be available in a future AMD publication.
> >>>>>
> >>>>> We already have a method to determine the number of counters in soft-ABMC
> >>>>> mode, which Peter has addressed [4].
> >>>>>
> >>>>> [4]
> >>>>> https://lore.kernel.org/lkml/20250203132642.2746754-1-peternewman@google.com/
> >>>>>
> >>>>> This appears to be more of a workaround, and I doubt it will be included
> >>>>> in any official AMD documentation. Additionally, the long-term direction
> >>>>> is moving towards ABMC.
> >>>>>
> >>>>> I don’t believe this workaround needs to be part of the current series. It
> >>>>> can be added later when soft-ABMC is implemented.
> >>>>
> >>>> Agreed. What about the plans described in [2]? (Thanks to Peter for
> >>>> catching this!).
> >>>>
> >>>> It is important to keep track of requirements while working on a feature to
> >>>> ensure that the implementation supports the planned use cases. Re-reading that
> >>>> thread it is not clear to me how soft-ABMC's per-group assignment would look.
> >>>> Could you please share how you see it progress from this implementation?
> >>>> This includes the single event vs. multiple event assignment. I would like to
> >>>> highlight that this is not a request for this to be supported in this implementation
> >>>> but there needs to be a plan for how this can be supported on top of interfaces
> >>>> established by this work.
> >>>>
> >>>
> >>> Here’s my current understanding of soft-ABMC. Peter may have a more in-depth perspective on this.
> >>>
> >>> Soft-ABMC:
> >>> a. num_mbm_cntrs: This is a software-defined limit based on the number of active RMIDs that can be supported. The value can be obtained using the code referenced in [4].
> > 
> > I would call it a hardware-defined limit that can be probed by software.
> > 
> > The main question is whether this file returns the exact number of
> > RMIDs hardware can track or double that number (mbm_total_bytes +
> > mbm_local_bytes) so that the value is always measured in events.
> 
> tl;dr: I continue [3] to find it most intuitive for num_mbm_cntrs to be the exact
> number of "active" RMIDs that the system can support *and* changing the name of
> the modes to help user interpret num_mbm_cntrs: "mbm_cntr_event_assign" for ABMC,
> "mbm_cntr_group_assign" for soft-ABMC.
> 
> details
> -------
> 
> We are now back to the previous discussion about what user can expect from
> the interface. Let me try and re-cap that discussion so that we can all hopefully
> get back on the same page. Please add corrections/updates where needed.
> 
> soft-ABMC
> ---------
>   soft-ABMC manages "active" (term TBD) RMID assignment to monitor groups. When an
>   "active" RMID is assigned to a monitor group then *all* MBM events (not LLC occupancy)
>   in that monitor group are counted. "Active" RMID assignment can be done per domain.
> 
>   Requirement: resctrl should accurately reflect which events are counted. That is,
>   we do not want resctrl to pretend to allow user to assign an "active" RMID to
>   only one event in a monitor group while all events are actually counted.
> 
>   Caveat: To support rapid re-assignment of RMIDs to monitor groups, llc_occupancy
>   event is disabled when soft-ABMC is enabled.
> 
> ABMC
> ----
>   ABMC manages (hardware) counter assignment to monitor group (RMID), event pairs.
>   When a hardware counter is assigned to an RMID, event pair then only that
>   RMID, event is counted. Hardware counter assignment can be done per domain.
> 
> 
> shared assignment
> -----------------
> A shared assignment applies to both soft-ABMC and ABMC. A user can designate a
> "counter" (could be hardware counter or "active" RMID) as shared and that means
> the counter within that domain is shared between different monitor groups and actual
> assignment is scheduled by resctrl.  
> 
> 
> user interface
> --------------
> 
> Next, consider the interface while keeping above definitions and requirements in mind.
> 
> This series introduces (using implementation, not cover-letter):
> 
> /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs 
> "num_mbm_cntrs":                                                               
> 	The maximum number of monitoring counters (total of available and assigned
> 	counters) in each domain when the system supports mbm_cntr_assign mode. 
> 
> /sys/fs/resctrl/mbm_L3_assignments
> "mbm_L3_assignments":                                                          
> 	This interface file is created when the mbm_cntr_assign mode is supported
> 	and shows the assignment status for each group.              
> 
> Consider "mbm_L3_assignments" first. The interface is documented for ABMC support
> where it is possible to manage individual event assignment within monitor group.
> 
> For ABMC it is possible to assign just one event at a time and doing so consumes
> one counter in that domain:
> 
> a) Starting state on system with 32 counters per domain, two events in default
>    resource group consumes two counters in that domain:
> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
> 0=30;1=32
> # cat /sys/fs/resctrl/mbm_L3_assignments
> mbm_total_bytes:0=e;1=_
> mbm_local_bytes:0=e;1=_
> 
> b) Assign counter to mbm_local_bytes in domain 1:
> # echo "mbm_local_bytes:1=e" > /sys/fs/resctrl/mbm_L3_assignments
> # cat /sys/fs/resctrl/mbm_L3_assignments
> mbm_total_bytes:0=e;1=_
> mbm_local_bytes:0=e;1=e
> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
> 0=30;1=31
> 
> The question is how this should look on soft-ABMC system. Let's say hypothetically
> that on a soft-ABMC system it is possible to have 32 "active" RMIDs.
> 
> a) Starting state on system with 32 "active RMIDs" per domain, two events in default
>    resource group consumes one RMID in that domain:
> 
> # cat /sys/fs/resctrl/mbm_L3_assignments
> mbm_total_bytes:0=e;1=_
> mbm_local_bytes:0=e;1=_
> 
> What should num_mbm_cntrs display?
> 
> Option A (counters are RMIDs):
> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
> 0=31;1=32
> 
> Option B (pretend RMIDs are events):
> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
> 0=62;1=64
> 
> b) Assign counter to mbm_local_bytes in domain 1:
> # echo "mbm_local_bytes:1=e" > /sys/fs/resctrl/mbm_L3_assignments
> # cat /sys/fs/resctrl/mbm_L3_assignments
> mbm_total_bytes:0=e;1=e
> mbm_local_bytes:0=e;1=e
> 
> Note that even though user requested only mbm_local_bytes to be assigned, it
> actually results in both mbm_total_bytes and mbm_local_bytes to be assigned. This
> ensures accurate state representation to user space but this also creates an
> inconsistent user interface between soft-ABMC and ABMC since user space intends
> to use the same interface but "sometimes" assigning one event results in assign
> of one event while "sometimes" it results in assign of multiple events.
> 
> wrt "num_mbm_cntrs"
> 
> Option A (counters are RMIDs):
> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
> 0=31;1=31
> 
> Option B (pretend RMIDs are events):
> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
> 0=62;1=62 
> 
> Neither option seems ideal to me since the interface cannot be consistent
> between ABMC and soft-ABMC.
> As I mentioned in [2] it is not possible to hide ABMC and soft-ABMC behind
> the same interface. When user space wants to monitor a particular monitor group
> then it should be clear how that can be accomplished. Not knowing if
> an assignment/unassignment to/from an event would impact one or all events
> and whether it will consume one or multiple counters does not sound like a good
> interface to me. 
> 
> As I understand the current interface, user is required to know how ABMC and soft-ABMC
> are implemented to be able to configure the system. For example, if user has file like:
> 	# cat /sys/fs/resctrl/mbm_L3_assignments
> 	mbm_total_bytes:0=e;1=e
> 	mbm_local_bytes:0=e;1=e
> user must know underlying implementation to be able to manage monitoring of
> events and assigning counters otherwise it will be a surprise to lose monitoring
> of all events when unassigning one event.
> 
> This is why I proposed in [3] that the name of the mode reflects how user can interact
> with the system. Instead of one "mbm_cntr_assign" mode there can be "mbm_cntr_event_assign"
> that is used for ABMC and "mbm_cntr_group_assign" that is used for soft-ABMC. The mode should
> make it clear what the system is capable of wrt counter assignments.
> 
> Considering this the interface should be clear:
> num_mbm_cntrs: reflects the number of counters in each domain that can be assigned. In
> "mbm_cntr_event_assign" this will be the number of counters that can be assigned to 
> each event within a monitoring group, in "mbm_cntr_group_assign" this will be the number
> of counters that can be assigned to entire monitoring groups impacting all MBM events.
> 
> mbm_L3_assignments: manages the counter assignment in each group. When user knows the mode
> is "mbm_cntr_event_assign"/"mbm_cntr_group_assign" then it should be clear to user space how the
> interface behaves wrt assignment, no surprises of multiple events impacted when
> assigning/unassigning single event.
> 
> For soft-ABMC I thus find it most intuitive for num_mbm_cntrs to be the exact number
> of "active" RMIDs that the system can support *and* changing the name of the modes
> to help user interpret num_mbm_cntrs.
> 
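
To make the two proposed modes concrete, here is a rough user-space sketch
of the accounting described above. It is not code from the series. All
names in it (domain_sketch, group_state, assign_event, unassign_event,
free_counters) are hypothetical, and it assumes exactly two MBM events per
group and that the per-domain value shown in the examples is the number of
still-free counters.

```c
#include <stdbool.h>

#define NUM_MBM_EVENTS 2

/* Hypothetical event and mode names, for illustration only. */
enum mbm_event { MBM_TOTAL = 0, MBM_LOCAL = 1 };
enum assign_mode { EVENT_ASSIGN, GROUP_ASSIGN };

/* One monitoring domain: a pool of counters (ABMC) or "active" RMIDs (soft-ABMC). */
struct domain_sketch {
	enum assign_mode mode;
	unsigned int total;
	unsigned int used;
};

/* Per-group, per-domain assignment state as shown in mbm_L3_assignments. */
struct group_state {
	bool assigned[NUM_MBM_EVENTS];
};

unsigned int free_counters(const struct domain_sketch *d)
{
	return d->total - d->used;
}

/* Returns 0 on success (or if already assigned), -1 if the pool is exhausted. */
int assign_event(struct domain_sketch *d, struct group_state *g, enum mbm_event evt)
{
	int i;

	if (g->assigned[evt])
		return 0;
	if (d->used == d->total)
		return -1;

	if (d->mode == EVENT_ASSIGN) {
		/* One counter backs exactly one (group, event) pair. */
		g->assigned[evt] = true;
	} else {
		/* One "active" RMID backs every MBM event in the group. */
		for (i = 0; i < NUM_MBM_EVENTS; i++)
			g->assigned[i] = true;
	}
	d->used++;
	return 0;
}

void unassign_event(struct domain_sketch *d, struct group_state *g, enum mbm_event evt)
{
	int i;

	if (!g->assigned[evt])
		return;

	if (d->mode == EVENT_ASSIGN) {
		g->assigned[evt] = false;
	} else {
		/* Dropping the RMID stops counting for all events at once. */
		for (i = 0; i < NUM_MBM_EVENTS; i++)
			g->assigned[i] = false;
	}
	d->used--;
}
```

With a 32-entry pool, assigning mbm_local_bytes leaves 31 free in either
mode, but in GROUP_ASSIGN mode mbm_total_bytes becomes assigned as well,
and unassigning either event releases both. That side effect is what the
mode name would need to tell user space up front.
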
> > 
> > There's also the mongroup-RMID overcommit use case I described
> > above[1]. On Intel we can safely assume that there are counters to
> > back all RMIDs, so num_mbm_cntrs would be calculated directly from
> > num_rmids.
> 
> This is about the:
> 	There's now more interest in Google for allowing explicit control of
> 	where RMIDs are assigned on Intel platforms. Even though the number of
> 	RMIDs implemented by hardware tends to be roughly the number of
> 	containers they want to support, they often still need to create
> 	containers when all RMIDs have already been allocated, which is not
> 	currently allowed. Once the container has been created and starts
> 	running, it's no longer possible to move its threads into a monitoring
> 	group whenever RMIDs should become available again, so it's important
> 	for resctrl to maintain an accurate task list for a container even
> 	when RMIDs are not available.
> 
> I see a monitor group as a collection of tasks that need to be monitored together.
> The "task list" is the group of tasks that share a monitoring ID that
> is required to be a valid ID since when any of the tasks are scheduled that ID is
> written to the hardware. I intentionally tried to not use RMID since I believe
> this is required for all archs.
> I thus do not understand how a task can start running when it does not have
> a valid monitoring ID. The idea of "deferred assignment" is not clear to me,
> there can never be "unmonitored tasks", no? I think I am missing something here.

In the AMD/RMID implementation this might be achieved with something
extra in the task structure to denote whether a task is in a monitored
group or not. E.g. we add "task->rmid_valid" as well as "task->rmid".
Tasks in an unmonitored group retain their "task->rmid" (that's what
identifies them as a member of a group) but have task->rmid_valid set
to false.  Context switch code would be updated to load "0" into the
IA32_PQR_ASSOC.RMID field for tasks without a valid RMID. So they
would still be monitored, but activity would be bundled with all
tasks in the default resctrl group.
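
A rough user-space sketch of that idea, not actual kernel code: the
struct and helper names below (task_sketch, effective_rmid,
pqr_assoc_value) are made up for illustration, and RMID 0 is assumed to
belong to the default resctrl group. The only hardware detail relied on
is that the PQR_ASSOC MSR carries the RMID in its low bits and the CLOSID
in bits 63:32.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the resctrl fields of the task structure. */
struct task_sketch {
	uint32_t closid;
	uint32_t rmid;       /* still identifies which group the task is in */
	bool     rmid_valid; /* false while the group has no RMID backing it */
};

/* Assumption: RMID 0 is the default group's RMID. */
#define DEFAULT_GROUP_RMID 0u

/* RMID that the context switch path would program for this task. */
uint32_t effective_rmid(const struct task_sketch *t)
{
	return t->rmid_valid ? t->rmid : DEFAULT_GROUP_RMID;
}

/* Value for the PQR_ASSOC MSR: RMID in the low half, CLOSID in the high half. */
uint64_t pqr_assoc_value(const struct task_sketch *t)
{
	return ((uint64_t)t->closid << 32) | effective_rmid(t);
}
```

A task with rmid = 7 and rmid_valid = false keeps 7 in its task
structure, but the MSR gets RMID 0, so its traffic is accounted to the
default group until the flag is set again. The CLOSID is untouched, so
allocation behaviour does not change.
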

Presumably something analogous could be done for ARM/MPAM.

> > I realized this use case is more difficult to implement on MPAM,
> > because a PARTID is effectively a CLOSID+RMID, so deferring assigning
> > a unique PARTID to a group also results in it being in a different
> > allocation group. It will work if the unmonitored groups could find a
> > way to share PARTIDs, but this has consequences on allocation - but
> > hopefully no worse than sharing CLOSIDs on x86.
> > 
> > There's a lot of interest in monitoring ID overcommit in Google, so I
> > think it's worth it for me to investigate the additional structural
> > changes needed in resctrl (i.e., breaking the FS-level association
> > between mongroups and HW monitoring IDs). Such a framework could be a
> > better fit for soft-ABMC. For example, if overcommit is allowed, we
> > would just report the number of simultaneous RMIDs we were able to
> > probe as num_rmids. I would want the same shared assignment scheduler
> > to be able to work with RMIDs and counters, though.
> > 
> > Thanks,
> > -Peter
> > 
> > [1] https://lore.kernel.org/lkml/CALPaoChSzzU5mzMZsdT6CeyEn0WD1qdT9fKCoNW_ty4tojtrkw@mail.gmail.com/
> 
> Reinette
> 
> [2] https://lore.kernel.org/lkml/b9e48e8f-3035-4a7e-a983-ce829bd9215a@intel.com/
> [3] https://lore.kernel.org/lkml/b3babdac-da08-4dfd-9544-47db31d574f5@intel.com/

-Tony
