[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <631798a4-303f-4e56-afd0-2c217fa10d94@intel.com>
Date: Tue, 25 Feb 2025 13:41:05 -0800
From: Reinette Chatre <reinette.chatre@...el.com>
To: Peter Newman <peternewman@...gle.com>
CC: "Moger, Babu" <bmoger@....com>, Dave Martin <Dave.Martin@....com>, "Babu
Moger" <babu.moger@....com>, <corbet@....net>, <tglx@...utronix.de>,
<mingo@...hat.com>, <bp@...en8.de>, <dave.hansen@...ux.intel.com>,
<tony.luck@...el.com>, <x86@...nel.org>, <hpa@...or.com>,
<paulmck@...nel.org>, <akpm@...ux-foundation.org>, <thuth@...hat.com>,
<rostedt@...dmis.org>, <xiongwei.song@...driver.com>,
<pawan.kumar.gupta@...ux.intel.com>, <daniel.sneddon@...ux.intel.com>,
<jpoimboe@...nel.org>, <perry.yuan@....com>, <sandipan.das@....com>,
<kai.huang@...el.com>, <xiaoyao.li@...el.com>, <seanjc@...gle.com>,
<xin3.li@...el.com>, <andrew.cooper3@...rix.com>, <ebiggers@...gle.com>,
<mario.limonciello@....com>, <james.morse@....com>,
<tan.shaopeng@...itsu.com>, <linux-doc@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <maciej.wieczor-retman@...el.com>,
<eranian@...gle.com>
Subject: Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth
Monitoring Counters (ABMC)
Hi Peter,
On 2/25/25 9:11 AM, Peter Newman wrote:
> Hi Reinette,
>
> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
> <reinette.chatre@...el.com> wrote:
>>
>> Hi Peter,
>>
>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>> <reinette.chatre@...el.com> wrote:
>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>> <reinette.chatre@...el.com> wrote:
>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>> <reinette.chatre@...el.com> wrote:
>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>> <reinette.chatre@...el.com> wrote:
>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>
>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>
>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>
>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>
>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>> for.
>>>>>>>>
>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>
>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>> customers.
>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>
>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>> event names.
>>>>>>
>>>>>> Thank you for clarifying.
>>>>>>
>>>>>>>
>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>> which events should share a counter and which should be counted by
>>>>>>> separate counters. I think the amount of information that would need
>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>
>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>> writes in ABMC would look like...
>>>>>>>
>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>
>>>>>>> (per domain)
>>>>>>> group 0:
>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>> group 1:
>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>> ...
>>>>>>>
>>>>>>
>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>> configuration is a requirement?
>>>>>
>>>>> If it's global and we want a particular group to be watched by more
>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>> for that group in all domains, or allocating counters in domains where
>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>> there's less pressure on the counters.
>>>>>
>>>>> In Dave's proposal it looks like global configuration means
>>>>> globally-defined "named counter configurations", which works because
>>>>> it's really per-domain assignment of the configurations to however
>>>>> many counters the group needs in each domain.
>>>>
>>>> I think I am becoming lost. Would a global configuration not break your
>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>> globally then it would not make it possible to support the full configurability
>>>> of the hardware.
>>>> Before I add more confusion, let me try with an example that builds on your
>>>> earlier example copied below:
>>>>
>>>>>>> (per domain)
>>>>>>> group 0:
>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>> group 1:
>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>> ...
>>>>
>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>> I understand it:
>>>>
>>>> group 0:
>>>> domain 0:
>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>> domain 1:
>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>> group 1:
>>>> domain 0:
>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>> domain 1:
>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>
>>>> You mention that you do not want counters to be allocated in domains that they
>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>> in domain 1, resulting in:
>>>>
>>>> group 0:
>>>> domain 0:
>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>> group 1:
>>>> domain 0:
>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>> domain 1:
>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>
>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>
>>>> group 0:
>>>> domain 0:
>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>> group 1:
>>>> domain 0:
>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>> domain 1:
>>>> counter 0: LclFill,RmtFill
>>>> counter 1: LclNTWr,RmtNTWr
>>>> counter 2: LclSlowFill,RmtSlowFill
>>>> counter 3: VictimBW
>>>>
>>>> The counters are shown with different per-domain configurations that seems to
>>>> match with earlier goals of (a) choose events counted by each counter and
>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>> understand the above does contradict global counter configuration though.
>>>> Or do you mean that only the *name* of the counter is global and then
>>>> that it is reconfigured as part of every assignment?
>>>
>>> Yes, I meant only the *name* is global. I assume based on a particular
>>> system configuration, the user will settle on a handful of useful
>>> groupings to count.
>>>
>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>
>>> # define global configurations (in ABMC terms), not necessarily in this
>>> # syntax and probably not in the mbm_assign_control file.
>>>
>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>> w=VictimBW,LclNTWr,RmtNTWr
>>>
>>> # legacy "total" configuration, effectively r+w
>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>
>>> /group0/0=t;1=t
>>> /group1/0=t;1=t
>>> /group2/0=_;1=t
>>> /group3/0=rw;1=_
>>>
>>> - group2 is restricted to domain 0
>>> - group3 is restricted to domain 1
>>> - the rest are unrestricted
>>> - In group3, we decided we need to separate read and write traffic
>>>
>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>
>>
>> I see. Thank you for the example.
>>
>> resctrl supports per-domain configurations with the following possible when
>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>
>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>
>> /group0/0=t;1=t
>> /group1/0=t;1=t
>>
>> Even though the flags are identical in all domains, the assigned counters will
>> be configured differently in each domain.
>>
>> With this supported by hardware and currently also supported by resctrl it seems
>> reasonable to carry this forward to what will be supported next.
>
> The hardware supports both a per-domain mode, where all groups in a
> domain use the same configurations and are limited to two events per
> group and a per-group mode where every group can be configured and
> assigned freely. This series is using the legacy counter access mode
> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
> in the domain can be read. If we chose to read the assigned counter
> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
> rather than asking the hardware to find the counter by RMID, we would
> not be limited to 2 counters per group/domain and the hardware would
> have the same flexibility as on MPAM.
>
> (I might have said something confusing in my last messages because I
> had forgotten that I switched to the extended assignment mode when
> prototyping with soft-ABMC and MPAM.)
>
> Forcing all groups on a domain to share the same 2 counter
> configurations would not be acceptable for us, as the example I gave
> earlier is one I've already been asked about.
I am surprised to hear this at this point of this work. Sounds like
we need to go back a couple of steps to determine how to best support
user requirements that now includes per-group counter assignment.
Have you perhaps looked into how users access the counter data as
part of your prototyping?
Reinette
Powered by blists - more mailing lists