[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d07a70f5-b1ca-4766-a1d9-53988dd9164f@intel.com>
Date: Wed, 12 Mar 2025 08:07:41 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: <babu.moger@....com>, "Moger, Babu" <bmoger@....com>, "Luck, Tony"
<tony.luck@...el.com>
CC: Peter Newman <peternewman@...gle.com>, Dave Martin <Dave.Martin@....com>,
<corbet@....net>, <tglx@...utronix.de>, <mingo@...hat.com>, <bp@...en8.de>,
<dave.hansen@...ux.intel.com>, <x86@...nel.org>, <hpa@...or.com>,
<paulmck@...nel.org>, <akpm@...ux-foundation.org>, <thuth@...hat.com>,
<rostedt@...dmis.org>, <xiongwei.song@...driver.com>,
<pawan.kumar.gupta@...ux.intel.com>, <daniel.sneddon@...ux.intel.com>,
<jpoimboe@...nel.org>, <perry.yuan@....com>, <sandipan.das@....com>,
<kai.huang@...el.com>, <xiaoyao.li@...el.com>, <seanjc@...gle.com>,
<xin3.li@...el.com>, <andrew.cooper3@...rix.com>, <ebiggers@...gle.com>,
<mario.limonciello@....com>, <james.morse@....com>,
<tan.shaopeng@...itsu.com>, <linux-doc@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <maciej.wieczor-retman@...el.com>,
<eranian@...gle.com>
Subject: Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth
Monitoring Counters (ABMC)
Hi Babu,
On 3/11/25 1:35 PM, Moger, Babu wrote:
> Hi All,
>
> On 3/10/25 22:51, Reinette Chatre wrote:
>>
>>
>> On 3/10/25 6:44 PM, Moger, Babu wrote:
>>> Hi Tony,
>>>
>>> On 3/10/2025 6:22 PM, Luck, Tony wrote:
>>>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>>>>> Hi All,
>>>>>
>>>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>>>>> Hi Peter,
>>>>>>
>>>>>> On 3/5/25 04:40, Peter Newman wrote:
>>>>>>> Hi Babu,
>>>>>>>
>>>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@....com> wrote:
>>>>>>>>
>>>>>>>> Hi Peter,
>>>>>>>>
>>>>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@....com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Peter/Reinette,
>>>>>>>>>>
>>>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>>>>> Hi Babu,
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@....com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>>>>> Hi Reinette,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>>>>> <reinette.chatre@...el.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>>>>> <reinette.chatre@...el.com> wrote:
>>>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>>>>> <reinette.chatre@...el.com> wrote:
>>>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>> <reinette.chatre@...el.com> wrote:
>>>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>>> <reinette.chatre@...el.com> wrote:
>>>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill
>>>>>>>>>>>>>>>> counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 3: VictimBW
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>>>>> # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>> w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # legacy "total" configuration, effectively r+w
>>>>>>>>>>>>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - group2 is restricted to domain 0
>>>>>>>>>>>>>>> - group3 is restricted to domain 1
>>>>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>>>>
>>>>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>>>>
>>>>>>>>>>>> It is documented below.
>>>>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>>>>> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>>>>>>>>>
>>>>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>>>>> just two.
>>>>>>>>>>>>
>>>>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>>>>
>>>>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>>>>> using it as is.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>>>>
>>>>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>>>>
>>>>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>>>>> supported.
>>>>>>>>>>>
>>>>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>>>>
>>>>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>>>>
>>>>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>>>>
>>>>>>>>>> Here’s what we’ve learned so far:
>>>>>>>>>>
>>>>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>>>>>>>> read, write, victimBM, etc.). This is also called shared counter
>>>>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>>>>> evolves.
>>>>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>>>>
>>>>>>>>>> # define event configurations
>>>>>>>>>>
>>>>>>>>>> ========================================================
>>>>>>>>>> Bits Mnemonics Description
>>>>>>>>>> ==== ========================================================
>>>>>>>>>> 6 VictimBW Dirty Victims from all types of memory
>>>>>>>>>> 5 RmtSlowFill Reads to slow memory in the non-local NUMA domain
>>>>>>>>>> 4 LclSlowFill Reads to slow memory in the local NUMA domain
>>>>>>>>>> 3 RmtNTWr Non-temporal writes to non-local NUMA domain
>>>>>>>>>> 2 LclNTWr Non-temporal writes to local NUMA domain
>>>>>>>>>> 1 mtFill Reads to memory in the non-local NUMA domain
>>>>>>>>>> 0 LclFill Reads to memory in the local NUMA domain
>>>>>>>>>> ==== ========================================================
>>>>>>>>>>
>>>>>>>>>> #Define flags based on combination of above event types.
>>>>>>>>>>
>>>>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> v = VictimBW
>>>>>>>>>>
>>>>>>>>>> Peter suggested the following format earlier :
>>>>>>>>>>
>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>
>>>>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>>>>> and cleaner to implement.
>>>>>>>>>
>>>>>>>>> Roughly what I had in mind:
>>>>>>>>>
>>>>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>>>>> names for the assignable configurations rather than being restricted
>>>>>>>>> to single letters. In the resulting directory, populate a file where
>>>>>>>>> we can specify the set of events the config should represent. I think
>>>>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>>>>> values. Moving forward we could come up with portable names for common
>>>>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>>>>> want specific events and don't care about portability.
>>>>>>>>
>>>>>>>>
>>>>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>>>>
>>>>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>>>>
>>>>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>>>>
>>>>>>>> How many configurations should we allow? Do we know?
>>>>>>>
>>>>>>> Do we need an upper limit?
>>>>>>
>>>>>> I think so. This needs to be maintained in some data structure. We can
>>>>>> start with 2 default configurations for now.
>>
>> There is a big difference between no upper limit and 2. The hardware is
>> capable of supporting per-domain configurations so more flexibility is
>> certainly possible. Consider the example presented by Peter in:
>> https://lore.kernel.org/lkml/CALPaoCi0mFZ9TycyNs+SCR+2tuRJovQ2809jYMun4HtC64hJmA@mail.gmail.com/
>>
>>>>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>>>>> configure all domains the same for a group.
>>>>>>>>
>>>>>>>> What is the difference between shared and exclusive?
>>>>>>>
>>>>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>>>>> each domain will be scheduled round-robin to the groups requesting
>>>>>>> shared access to a counter. In my tests, I assigned the counters long
>>>>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>>>>> aggregation files[2].
>>>>>>>
>>>>>>> These do not need to be implemented immediately, but knowing that they
>>>>>>> work addresses the overhead and scalability concerns of reassigning
>>>>>>> counters and reading their values.
>>>>>>
>>>>>> Ok. Lets focus on exclusive assignments for now.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>>>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>>>>> results in 32 × 12 × 3 files, which is quite large.
>>>>>>>>
>>>>>>>> There should be a more efficient way to handle this.
>>>>>>>>
>>>>>>>> Initially, we started with a group-level file for this interface, but it
>>>>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>>>>
>>>>>>> I had rejected it due to the high-frequency of access of a large
>>>>>>> number of files, which has since been addressed by shared assignment
>>>>>>> (or automatic reassignment) and aggregated mbps files.
>>>>>>
>>>>>> I think we should address this as well. Creating three extra files for
>>>>>> each group isn’t ideal when there are more efficient alternatives.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>>>>
>>>>>>>> That was another problem we need to address.
>>>>>>>
>>>>>>> This is not a requirement I was aware of. If the user forgot where
>>>>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>>>>> can read multiple sysfs nodes to remind themselves.
>>>>>>
>>>>>> I suggest, we should provide users with an option to list the assignments
>>>>>> of all groups in a single command. As the number of groups increases, it
>>>>>> becomes cumbersome to query each group individually.
>>>>>>
>>>>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>>>>> for this purpose. More details on this below.
>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The configuration names listed in assign_* would result in files of
>>>>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>>>>> which the count values can be read.
>>>>>>>>>
>>>>>>>>> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>> LclFill
>>>>>>>>> LclNTWr
>>>>>>>>> LclSlowFill
>>>>>>>>
>>>>>>>> I feel we can just have the configs. event_filter file is not required.
>>>>>>>
>>>>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>>>>> only looking at struct kernfs_syscall_ops
>>>>>>>
>>>>>>>>
>>>>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>> LclFill <-rename these to generic names.
>>>>>>>> LclNTWr
>>>>>>>> LclSlowFill
>>>>>>>>
>>>>>>>
>>>>>>> I think portable and non-portable event names should both be available
>>>>>>> as options. There are simple bandwidth measurement mechanisms that
>>>>>>> will be applied in general, but when they turn up an issue, it can
>>>>>>> often lead to a more focused investigation, requiring more precise
>>>>>>> events.
>>>>>>
>>>>>> I aggree. We should provide both portable and non-portable event names.
>>>>>>
>>>>>> Here is my draft proposal based on the discussion so far and reusing some
>>>>>> of the current interface. Idea here is to start with basic assigment
>>>>>> feature with options to enhance it in the future. Feel free to
>>>>>> comment/suggest.
>>>>>>
>>>>>> 1. Event configurations will be in
>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>>>>
>>>>>> There will be two pre-defined configurations by default.
>>>>>>
>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>>>>> LclFill, LclNTWr,LclSlowFill,VictimBM,RmtSlowFill,LclSlowFill,RmtFill
>>>>>>
>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>> LclFill, LclNTWr, LclSlowFill
>>>>>>
>>>>>> 2. Users will have options to update these configurations.
>>>>>>
>>>>>> #echo "LclFill, LclNTWr, RmtFill" >
>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>
>>>> This part seems odd to me. Now the "mbm_local_bytes" files aren't
>>>> reporting "local_bytes" any more. They report something different,
>>>> and users only know if they come to check the options currently
>>>> configured in this file. Changing the contents without changing
>>>> the name seems confusing to me.
>>>
>>> It is the same behaviour right now with BMEC. It is configurable.
>>> By default it is mbm_local_bytes, but users can configure whatever they want to monitor using /info/L3_MON/mbm_local_bytes_config.
>>>
>>> We can continue the same behaviour with ABMC, but the configuration will be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
>>
>> This could be supported by following Peter's original proposal where the name
>> of the counter configuration is provided by the user via a mkdir:
>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>
>> As he mentioned there could be pre-populated mbm_local_bytes/mbm_total_bytes.
>
> Sure. We can do that. I was thinking in the first phase, just provide the
> default pre-defined configuration and option to update the configuration.
>
> We can add the mkdir support later. That way we can provide basic ABMC
> support without too much code complexity with mkdir support.
This is not clear to me how you envision the "first phase". Is it what you
proposed above, for example:
#echo "LclFill, LclNTWr, RmtFill" >
/sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
In above the counter configuration name is a file.
How could mkdir support be added to this later if there are already files present?
>
>>
>>>
>>>>
>>>>>>
>>>>>> # #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>> LclFill, LclNTWr, RmtFill
>>>>>>
>>>>>> 3. The default configurations will be used when user mounts the resctrl.
>>>>>>
>>>>>> mount -t resctrl resctrl /sys/fs/resctrl/
>>>>>> mkdir /sys/fs/resctrl/test/
>>>>>>
>>>>>> 4. The resctrl group/domains can be in one of these assingnment states.
>>>>>> e: Exclusive
>>>>>> s: Shared
>>>>>> u: Unassigned
>>>>>>
>>>>>> Exclusive mode is supported now. Shared mode will be supported in the
>>>>>> future.
>>>>>>
>>>>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>> to list the assignment state of all the groups.
>>>>>>
>>>>>> Format:
>>>>>> "<CTRL_MON group>/<MON group>/<confguration>:<domain_id>=<assign state>"
>>>>>>
>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>> test//mbm_total_bytes:0=e;1=e
>>>>>> test//mbm_local_bytes:0=e;1=e
>>>>>> //mbm_total_bytes:0=e;1=e
>>>>>> //mbm_local_bytes:0=e;1=e
>>
>> This would make mbm_assign_control even more unwieldy and quicker to exceed a
>> page of data (these examples never seem to reflect those AMD systems with the many
>> L3 domains). How to handle resctrl files larger than 4KB needs to be well understood
>> and solved when/if going this route.
>
> This problem is not specific this series. I feel it is a generic problem
> to many of the semilar interfaces. I dont know how it is addressed. May
> have to investigate on this. Any pointers would be helpful.
Dave Martin already did a lot of analysis here. What other pointers do you need?
>
>
>>
>> There seems to be two opinions about this file at moment. Would it be possible to
>> summarize the discussion with pros/cons raised to make an informed selection?
>> I understand that Google as represented by Peter no longer requires/requests this
>> file but the motivation for this change seems new and does not seem to reduce the
>> original motivation for this file. We may also want to separate requirements for reading
>> from and writing to this file.
>
> Yea. We can just use mbm_assign_control for reading the assignment states.
>
> Summary: We have two proposals.
>
> First one from Peter:
>
> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>
>
> Pros
> a. Allows flexible creation of free-form names for assignable
> configurations, stored in info/L3_MON/counter_configs/.
>
> b. Events can be accessed using corresponding free-form names in the
> mon_data directory, making it clear to users what each event represents.
>
>
> Cons:
> a. Requires three separate files for assignment in each group
> (assign_exclusive, assign_shared, unassign), which might be excessive.
>
> b. No built-in listing support, meaning users must query each group
> individually to check assignment states.
>
>
> Second Proposal (Mine)
>
> https://lore.kernel.org/lkml/a4ab53b5-03be-4299-8853-e86270d46f2e@amd.com/
>
> Pros:
>
> a. Maintains the flexibility of free-form names for assignable
> configurations (info/L3_MON/counter_configs/).
>
> b. Events remain accessible via free-form names in mon_data, ensuring
> clarity on their purpose.
>
> c. Adds the ability to list assignment states for all groups in a single
> command.
>
> Cons:
> a. Potential buffer overflow issues when handling a large number of
> groups and domains and code complexity to fix the issue.
>
>
> Third Option: A Hybrid Approach
>
> We could combine elements from both proposals:
>
> a. Retain the free-form naming approach for assignable configurations in
> info/L3_MON/counter_configs/.
>
> b. Use the assignment method from the first proposal:
> $mkdir test
> $echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
>
> c. Introduce listing support via the info/L3_MON/mbm_assign_control
> interface, enabling users to read assignment states for all groups in one
> place. Only reading support.
>
>
>>
>>>>>>
>>>>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>>>>
>>>>>> Format:
>>>>>> “<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>”
>>>>>>
>>>>>> #echo "test//mbm_local_bytes:0=e;1=e" >
>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>
>>>>>> #echo "test//mbm_local_bytes:0=u;1=u" >
>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>
>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>> test//mbm_total_bytes:0=u;1=u
>>>>>> test//mbm_local_bytes:0=u;1=u
>>>>>> //mbm_total_bytes:0=e;1=e
>>>>>> //mbm_local_bytes:0=e;1=e
>>>>>>
>>>>>> The corresponding events will be read in
>>>>>>
>>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>
>>>>>> 7. In the first stage, only two configurations(mbm_total_bytes and
>>>>>> mbm_local_bytes) will be supported.
>>>>>>
>>>>>> 8. In the future, there will be options to create multiple configurations
>>>>>> and corresponding directory will be created in
>>>>>> /sysf/fs/resctrl/test/mon_data/mon_L3_00/<configation name>.
>>>>
>>>> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
>>>> directory? Like this:
>>>>
>>>> # echo "LclFill, LclNTWr, RmtFill" >
>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
>>>>
>>>> This seems OK (dependent on the user picking meaningful names for
>>>> the set of attributes picked ... but if they want to name this
>>>> monitor file "brian" then they have to live with any confusion
>>>> that they bring on themselves).
>>>>
>>>> Would this involve an extension to kernfs? I don't see a function
>>>> pointer callback for file creation in kernfs_syscall_ops.
>>>>
>>>>>>
>>>>>
>>>>> I know you are all busy with multiple series going on parallel. I am still
>>>>> waiting for the inputs on this. It will be great if you can spend some time
>>>>> on this to see if we can find common ground on the interface.
>>>>>
>>>>> Thanks
>>>>> Babu
>>>>
>>>> -Tony
>>>>
>>>
>>>
>>> thanks
>>> Babu
>>
>> Reinette
>>
>>
>
Powered by blists - more mailing lists