linux-kernel - Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <19f608a1-8119-79bb-da00-703cee6ea5b5@amd.com>
Date: Fri, 2 Aug 2024 19:49:13 -0500
From: "Moger, Babu" <bmoger@....com>
To: Reinette Chatre <reinette.chatre@...el.com>,
 Peter Newman <peternewman@...gle.com>
Cc: babu.moger@....com, corbet@....net, fenghua.yu@...el.com,
 tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com,
 paulmck@...nel.org, rdunlap@...radead.org, tj@...nel.org,
 peterz@...radead.org, yanjiewtw@...il.com, kim.phillips@....com,
 lukas.bulwahn@...il.com, seanjc@...gle.com, jmattson@...gle.com,
 leitao@...ian.org, jpoimboe@...nel.org, rick.p.edgecombe@...el.com,
 kirill.shutemov@...ux.intel.com, jithu.joseph@...el.com,
 kai.huang@...el.com, kan.liang@...ux.intel.com,
 daniel.sneddon@...ux.intel.com, pbonzini@...hat.com, sandipan.das@....com,
 ilpo.jarvinen@...ux.intel.com, maciej.wieczor-retman@...el.com,
 linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, eranian@...gle.com,
 james.morse@....com
Subject: Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth
 Monitoring Counters (ABMC)

Hi Peter/Reinette,

On 8/2/2024 3:55 PM, Reinette Chatre wrote:
> Hi Peter,
> 
> On 8/2/24 11:49 AM, Peter Newman wrote:
>> On Fri, Aug 2, 2024 at 9:14 AM Reinette Chatre
>> <reinette.chatre@...el.com> wrote:
>>> On 8/1/24 3:45 PM, Peter Newman wrote:
>>>> On Thu, Aug 1, 2024 at 2:50 PM Reinette Chatre
>>>> <reinette.chatre@...el.com> wrote:
>>>>> On 7/17/24 10:19 AM, Moger, Babu wrote:
>>>>>> On 7/12/24 17:03, Reinette Chatre wrote:
>>>>>>> On 7/3/24 2:48 PM, Babu Moger wrote:
>>>
>>>>>>>> # Examples
>>>>>>>>
>>>>>>>> a. Check if ABMC support is available
>>>>>>>>        #mount -t resctrl resctrl /sys/fs/resctrl/
>>>>>>>>
>>>>>>>>        #cat /sys/fs/resctrl/info/L3_MON/mbm_mode
>>>>>>>>        [abmc]
>>>>>>>>        legacy
>>>>>>>>
>>>>>>>>        Linux kernel detected ABMC feature and it is enabled.
>>>>>>>
>>>>>>> How about renaming "abmc" to "mbm_cntrs"? This will match the 
>>>>>>> num_mbm_cntrs
>>>>>>> info file and be the final step to make this generic so that another
>>>>>>> architecture
>>>>>>> can more easily support assignining hardware counters without 
>>>>>>> needing to call
>>>>>>> the feature AMD's "abmc".
>>>>>>
>>>>>> I think we aleady settled this with "mbm_cntr_assignable".
>>>>>>
>>>>>> For soft-RMID" it will be mbm_sw_assignable.
>>>>>
>>>>> Maybe getting a bit long but how about "mbm_cntr_sw_assignable" to 
>>>>> match
>>>>> with the term "mbm_cntr" in accompanying "num_mbm_cntrs"?
>>>>
>>>> My users are pushing for a consistent interface regardless of whether
>>>> counter assignment is implemented in hardware or software, so I would
>>>> like to avoid exposing implementation differences in the interface
>>>> where possible.
>>>
>>> This seems a reasonable ask but can we be confident that if hardware
>>> supports assignable counters then there will never be a reason to use
>>> software assignable counters? (This needs to also consider how/if Arm
>>> may use this feature.)
>>>
>>> I am of course not familiar with details of the software implementation
>>> - could there be benefits to using it even if hardware counters are
>>> supported?
>>
>> I can't see any situation where the user would want to choose software
>> over hardware counters. The number of groups which can be monitored by
>> software assignable counters will always be less than with hardware,
>> due to the need for consuming one RMID (and the counters automatically
>> allocated to it by the AMD hardware) for all unassigned groups.
> 
> Thank you for clarifying. This seems specific to this software 
> implementation,
> and I missed that there was a shift from soft-RMIDs to soft-ABMC. If I 
> remember
> correctly this depends on undocumented hardware specific knowledge.
> 
>> I consider software assignable a workaround to enable measuring
>> bandwidth reliably on a large number of groups on pre-ABMC AMD
>> hardware, or rather salvaging MBM on pre-ABMC hardware making use of
>> our users' effort to adapt to counter assignment in resctrl. We hope
>> no future implementations will choose to silently drop bandwidth
>> counts, so fingers crossed, the software implementation can be phased
>> out when these generations of AMD hardware are decommissioned.
> 
> That sounds ideal.
> 
>>
>> The MPAM specification natively supports (or requires) counter
>> assignment in hardware. From what I recall in the last of James'
>> prototypes I looked at, MBM was only supported if the implementation
>> provided as many bandwidth counters as there were possible monitoring
>> groups, so that it could assume a monitor IDs for every PARTID:PMG
>> combination.
> 
> Thank you for this insight.
> 
>>
>>>
>>> What I would like to avoid is future complexity of needing a new 
>>> mount/config
>>> option that user space needs to use to select if a single 
>>> "mbm_cntr_assignable"
>>> is backed by hardware or software.
>>
>> In my testing so far, automatically enabling counter assignment and
>> automatically allocating counters for all events in new groups works
>> well enough.
>>
>> The only configuration I need is the ability to disable the automatic
>> counter allocation so that a userspace agent can have control of where
>> all the counters are assigned at all times. It's easy to implement
>> this as a simple flag if the user accepts that they need to manually
>> deallocate any automatically-allocated counters from groups created
>> before the flag was cleared.
>>
>>>
>>>> The main semantic difference with SW assignments is that it is not
>>>> possible to assign counters to individual events. Because the
>>>> implementation is assigning RMIDs to groups, assignment results in all
>>>> events being counted.
>>>>
>>>> I was considering introducing a boolean mbm_assign_events node to
>>>> indicate whether assigning individual events is supported. If true,
>>>> num_mbm_cntrs indicates the number of events which can be counted,
>>>> otherwise it indicates the number of groups to which counters can be
>>>> assigned and attempting to assign a single event is silently upgraded
>>>> to assigning counters to all events in the group.
>>>
>>> How were you envisioning your users using the control file 
>>> ("mbm_control")
>>> in these scenarios? Does this file's interface even work for SW 
>>> assignment
>>> scenarios?
>>>
>>> Users should expect consistent interface for "mbm_control" also.
>>>
>>> It sounds to me that a potential "mbm_assign_events" will be false 
>>> for SW
>>> assignments. That would mean that "num_mbm_cntrs" will
>>> contain the number of groups to which counters can be assigned?
>>> Would user space be required to always enable all flags (enable all 
>>> events) of
>>> all domains to the same values ... or would enabling of one flag (one 
>>> event)
>>> in one domain automatically result in all flags (all events) enabled 
>>> for all
>>> domains ... or would enabling of one flag (one event) in one domain 
>>> only appear
>>> to user space to be enabled while in reality all flags/events are 
>>> actually enabled?
>>
>> I believe mbm_control should always accurately reflect which events
>> are being counted.
> 
> I agree.
> 
>>
>> The behavior as I've implemented today is:
>>
>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_events
>> 0
>>
>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>> test//0=_;1=_;
>> //0=_;1=_;
>>
>> # echo "test//1+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>> test//0=_;1=tl;
>> //0=_;1=_;
>>
>> # echo "test//1-t" > /sys/fs/resctrl/info/L3_MON/mbm_control
>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>> test//0=_;1=_;
>> //0=_;1=_;
>>
>>
> 
> This highlights how there cannot be a generic/consistent interface 
> between hardware
> and software implementation. If resctrl implements something like above 
> without any
> other hints to user space then it will push complexity to user space 
> since user space
> would not know if setting one flag results in setting more than that 
> flag, which may
> force a user space implementation to always follow a write with a read that
> needs to confirm what actually resulted from the write. Similarly, that 
> removing a
> flag impacts other flags needs to be clear without user space needing to 
> "try and
> see what happens".
> 
> It is not clear to me how to interpret the above example when it comes 
> to the
> RMID management though. If the RMID assignment is per group then I 
> expected all
> the domains of a group to have the same flag(s)?
> 
>>>
>>>> However, If we don't expect to see these semantics in any other
>>>> implementation, these semantics could be implicit in the definition of
>>>> a SW assignable counter.
>>>
>>> It is not clear to me how implementation differences between hardware
>>> and software assignment can be hidden from user space. It is possible
>>> to let user space enable individual events and then silently upgrade it
>>> to all events. I see two options here, either "mbm_control" needs to
>>> explicitly show this "silent upgrade" so that user space knows which
>>> events are actually enabled, or "mbm_control" only shows flags/events 
>>> enabled
>>> from user space perspective. In the former scenario, this needs more
>>> user space support since a generic user space cannot be confident which
>>> flags are set after writing to "mbm_control". In the latter scenario,
>>> meaning of "num_mbm_cntrs" becomes unclear since user space is expected
>>> to rely on it to know which events can be enabled and if some are
>>> actually "silently enabled" when user space still thinks it needs to be
>>> enabled the number of available counters becomes vague.
>>>
>>> It is not clear to me how to present hardware and software assignable
>>> counters with a single consistent interface. Actually, what if the
>>> "mbm_mode" is what distinguishes how counters are assigned instead of 
>>> how
>>> it is backed (hw vs sw)? What if, instead of "mbm_cntr_assignable" and
>>> "mbm_cntr_sw_assignable" MBM modes the terms "mbm_cntr_event_assignable"
>>> and "mbm_cntr_group_assignable" is used? Could that replace a
>>> potential "mbm_assign_events" while also supporting user space in
>>> interactions with "mbm_control"?
>>
>> If I understand this correctly, is this a preference that the info
>> node be named differently if its value will have different units,
>> rather than a second node to indicate what the value of num_mbm_cntrs
>> actually means? This sounds reasonable to me.
> 
> Indeed. As you highlighted, user space may not need to know if
> counters are backed by hardware or software, but user space needs to
> know what to expect from (how to interact with) interface.
> 
>> I think it's also important to note that in MPAM, the MBWU (memory
>> bandwidth usage) monitors don't have a concept of local versus total
>> bandwidth, so event assignment would likely not apply there either.
>> What the counted bandwidth actually represents is more implicit in the
>> monitor's position in the memory system in the particular
>> implementation. On a theoretical multi-socket system, resctrl would
>> require knowledge about the system's architecture to stitch together
>> the counts from different types of monitors to produce a local and
>> total value. I don't know if we'd program this SoC-specific knowledge
>> into the kernel to produce a unified MBM resource like we're
>> accustomed to now or if we'd present multiple MBM resources, each only
>> providing an mbm_total_bytes event. In this case, the counters would
>> have to be assigned separately in each MBM resource, especially if the
>> different MBM resources support a different number of counters.
>>
> 
> "total" and "local" bandwidth is already in grey area after the
> introduction of mbm_total_bytes_config/mbm_local_bytes_config where
> user space could set values reported to not be constrained by the
> "total" and "local" terms. We keep sticking with it though, even in
> this implementation that uses the "t" and "l" flags, knowing that
> what is actually monitored when "l" is set is just what the user
> configured via mbm_local_bytes_config, which theoretically
> can be "total" bandwidth.
> 
> Reinette
> 
> ps. I will be offline next week.

Thanks for heads up.

Looks like we still need to figure out few things about the interface.

However, I need resolve few issues with v5. I can go ahead and post v6 
next week. We can continue our discussion. That way we are making some 
forward progress in the series. Let me know  what do you think.

Thanks
- Babu Moger