Message-ID: <a19d96ac-f83a-4f5a-98ce-c5554e12afc5@intel.com>
Date: Wed, 14 Aug 2024 10:37:42 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Peter Newman <peternewman@...gle.com>
CC: <babu.moger@....com>, <corbet@....net>, <fenghua.yu@...el.com>,
<tglx@...utronix.de>, <mingo@...hat.com>, <bp@...en8.de>,
<dave.hansen@...ux.intel.com>, <x86@...nel.org>, <hpa@...or.com>,
<paulmck@...nel.org>, <rdunlap@...radead.org>, <tj@...nel.org>,
<peterz@...radead.org>, <yanjiewtw@...il.com>, <kim.phillips@....com>,
<lukas.bulwahn@...il.com>, <seanjc@...gle.com>, <jmattson@...gle.com>,
<leitao@...ian.org>, <jpoimboe@...nel.org>, <rick.p.edgecombe@...el.com>,
<kirill.shutemov@...ux.intel.com>, <jithu.joseph@...el.com>,
<kai.huang@...el.com>, <kan.liang@...ux.intel.com>,
<daniel.sneddon@...ux.intel.com>, <pbonzini@...hat.com>,
<sandipan.das@....com>, <ilpo.jarvinen@...ux.intel.com>,
<maciej.wieczor-retman@...el.com>, <linux-doc@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <eranian@...gle.com>, <james.morse@....com>
Subject: Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth
Monitoring Counters (ABMC)

Hi Peter,

On 8/2/24 3:50 PM, Peter Newman wrote:
> On Fri, Aug 2, 2024 at 1:55 PM Reinette Chatre
> <reinette.chatre@...el.com> wrote:
>> On 8/2/24 11:49 AM, Peter Newman wrote:
>>> On Fri, Aug 2, 2024 at 9:14 AM Reinette Chatre
>>>> I am of course not familiar with details of the software implementation
>>>> - could there be benefits to using it even if hardware counters are
>>>> supported?
>>>
>>> I can't see any situation where the user would want to choose software
>>> over hardware counters. The number of groups which can be monitored by
>>> software-assignable counters will always be lower than with hardware,
>>> due to the need to consume one RMID (and the counters automatically
>>> allocated to it by the AMD hardware) for all unassigned groups.
>>
>> Thank you for clarifying. This seems specific to this software implementation,
>> and I missed that there was a shift from soft-RMIDs to soft-ABMC. If I remember
>> correctly, this depends on undocumented hardware-specific knowledge.
>
> For the benefit of anyone else who needs to monitor bandwidth on a
> large number of monitoring groups on pre-ABMC AMD implementations,
> hopefully a future AMD publication will clarify, at least on some
> existing, pre-ABMC models, exactly when the QM_CTR.U bit is set.
>
>
>>>
>>> The behavior as I've implemented it today is:
>>>
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_events
>>> 0
>>>
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>>> test//0=_;1=_;
>>> //0=_;1=_;
>>>
>>> # echo "test//1+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>>> test//0=_;1=tl;
>>> //0=_;1=_;
>>>
>>> # echo "test//1-t" > /sys/fs/resctrl/info/L3_MON/mbm_control
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>>> test//0=_;1=_;
>>> //0=_;1=_;
>>>
>>>
>>
>> This highlights how there cannot be a generic/consistent interface between hardware
>> and software implementations. If resctrl implements something like the above without
>> any other hints to user space then it will push complexity to user space, since user
>> space would not know whether setting one flag results in setting more than that flag.
>> That may force a user space implementation to always follow a write with a read that
>> confirms what actually resulted from the write. Similarly, it needs to be clear that
>> removing one flag may impact other flags, without user space needing to "try and
>> see what happens".
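
(For concreteness, a minimal sketch of that "write, then read back to
confirm" pattern, reusing only the flag syntax and paths from the example
above rather than any finalized interface:

# echo "test//1+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
# grep '^test//' /sys/fs/resctrl/info/L3_MON/mbm_control
test//0=_;1=tl;

Here user space asked for "l" only, and reading back is the only way it
learns that "t" was enabled as well.)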
>
> I'll return to this topic in the context of MPAM below...
>
>> It is not clear to me how to interpret the above example when it comes to the
>> RMID management though. If the RMID assignment is per group then I expected all
>> the domains of a group to have the same flag(s)?
>
> The group RMIDs are never programmed into any MSRs and the RMID space
> is independent in each domain, so it is still possible to do
> per-domain assignment. (And, as with soft RMIDs, this enables us to
> create unlimited groups, though we have never been limited by the size
> of the RMID space.)
>
> However, in our use cases jobs are not confined to any domain, so
> bandwidth measurements must be done simultaneously in all domains, and
> we therefore have no current use for per-domain assignment. But if any
> Google users did begin to see value in confining jobs to domains, this
> could change.
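
(Illustrative only, to make the per-domain point above concrete, and again
reusing only the syntax from the earlier example: a per-domain interface
lets each domain of one group be assigned independently, e.g.

# echo "test//0+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
# echo "test//1+l" > /sys/fs/resctrl/info/L3_MON/mbm_control

whereas a strictly per-group assignment would have to affect every domain
of "test" with a single write.)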
>
>>
>>>>
>>>>> However, if we don't expect to see these semantics in any other
>>>>> implementation, these semantics could be implicit in the definition of
>>>>> a SW assignable counter.
>>>>
>>>> It is not clear to me how implementation differences between hardware
>>>> and software assignment can be hidden from user space. It is possible
>>>> to let user space enable individual events and then silently upgrade
>>>> that to all events. I see two options here: either "mbm_control" needs
>>>> to explicitly show this "silent upgrade" so that user space knows which
>>>> events are actually enabled, or "mbm_control" only shows the flags/events
>>>> enabled from the user space perspective. In the former scenario, this
>>>> needs more user space support since a generic user space cannot be
>>>> confident which flags are set after writing to "mbm_control". In the
>>>> latter scenario, the meaning of "num_mbm_cntrs" becomes unclear: user
>>>> space is expected to rely on it to know which events can be enabled,
>>>> and if some are actually "silently enabled" while user space still
>>>> thinks they need to be enabled, the number of available counters
>>>> becomes vague.
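
(A hedged illustration of the two scenarios, reusing the flag syntax from
the example earlier in the thread; neither readout is claimed to match an
actual implementation:

Former scenario, "mbm_control" reflects the silent upgrade:
# echo "test//1+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
# cat /sys/fs/resctrl/info/L3_MON/mbm_control
test//0=_;1=tl;
//0=_;1=_;

Latter scenario, "mbm_control" shows only what user space requested:
# echo "test//1+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
# cat /sys/fs/resctrl/info/L3_MON/mbm_control
test//0=_;1=l;
//0=_;1=_;
)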
>>>>
>>>> It is not clear to me how to present hardware and software assignable
>>>> counters with a single consistent interface. Actually, what if
>>>> "mbm_mode" is what distinguishes how counters are assigned instead of
>>>> how they are backed (hw vs sw)? What if, instead of the
>>>> "mbm_cntr_assignable" and "mbm_cntr_sw_assignable" MBM modes, the terms
>>>> "mbm_cntr_event_assignable" and "mbm_cntr_group_assignable" are used?
>>>> Could that replace a potential "mbm_assign_events" while also
>>>> supporting user space in its interactions with "mbm_control"?
>>>
>>> If I understand this correctly, is this a preference that the info
>>> node be named differently if its value will have different units,
>>> rather than a second node to indicate what the value of num_mbm_cntrs
>>> actually means? This sounds reasonable to me.
>>
>> Indeed. As you highlighted, user space may not need to know if
>> counters are backed by hardware or software, but user space needs to
>> know what to expect from (how to interact with) interface.
>>
>>> I think it's also important to note that in MPAM, the MBWU (memory
>>> bandwidth usage) monitors don't have a concept of local versus total
>>> bandwidth, so event assignment would likely not apply there either.
>>> What the counted bandwidth actually represents is more implicit in the
>>> monitor's position in the memory system in the particular
>>> implementation. On a theoretical multi-socket system, resctrl would
>>> require knowledge about the system's architecture to stitch together
>>> the counts from different types of monitors to produce a local and
>>> total value. I don't know if we'd program this SoC-specific knowledge
>>> into the kernel to produce a unified MBM resource like we're
>>> accustomed to now or if we'd present multiple MBM resources, each only
>>> providing an mbm_total_bytes event. In the latter case, the counters
>>> would have to be assigned separately in each MBM resource, especially
>>> if the different MBM resources support a different number of counters.
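
(A purely hypothetical sketch of the "multiple MBM resources" layout, with
invented resource names, just to show where the separate counter pools
would surface:

# cat /sys/fs/resctrl/info/MBM_SLC/num_mbm_cntrs
32
# cat /sys/fs/resctrl/info/MBM_DDR/num_mbm_cntrs
16

Each resource would then need its own assignment file and its own counter
accounting.)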
>>>
>>
>> "total" and "local" bandwidth is already in grey area after the
>> introduction of mbm_total_bytes_config/mbm_local_bytes_config where
>> user space could set values reported to not be constrained by the
>> "total" and "local" terms. We keep sticking with it though, even in
>> this implementation that uses the "t" and "l" flags, knowing that
>> what is actually monitored when "l" is set is just what the user
>> configured via mbm_local_bytes_config, which theoretically
>> can be "total" bandwidth.
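
(For example, with the existing event configuration interface the "local"
event can be redefined to count every traffic category; the values below
are the documented AMD defaults, 0x15 for local and 0x7f for all
read/write categories, shown only for illustration:

# cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
0=0x15;1=0x15
# echo "0=0x7f;1=0x7f" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config

After this, a group's mbm_local_bytes is effectively reporting "total"
bandwidth.)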
>
> If it makes sense to support a separate, group-assignment interface at
> least for MPAM, this would be a better fit for soft-ABMC, even if it
> does have to stay downstream.

(apologies for the delay)

Could we please take a step back and confirm/agree on what is meant by
"group assignment"? In a previous message [1] I latched onto the statement
"the implementation is assigning RMIDs to groups, assignment results in all
events being counted.". In this I understood "groups" to be resctrl groups
and I understood this to mean that when a (soft-ABMC) counter is assigned
it applies to the entire resctrl group (all domains, all events). The
subsequent example in [2] was thus unexpected to me: the interface was used
to assign a (soft-ABMC) counter to the group, yet not all domains were
impacted.
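
For concreteness, reusing the syntax from [2]: after
echo "test//1+l" > mbm_control I expected the read back to show the
assignment reflected in every domain and every event of the group, along
the lines of

test//0=tl;1=tl;

rather than only in domain 1:

test//0=_;1=tl;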

Considering this, could you please elaborate on what is meant by
"group assignment"?

Thank you

Reinette

[1] https://lore.kernel.org/lkml/CALPaoCi_TBZnULHQpYns+H+30jODZvyQpUHJRDHNwjQzajrD=A@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CALPaoCi1CwLy_HbFNOxPfdReEJstd3c+DvOMJHb5P9jBP+iatw@mail.gmail.com/