lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <afb99efe-0de2-f7ad-d0b8-f2a0ea998efd@amd.com>
Date: Fri, 2 Aug 2024 15:38:06 -0500
From: "Moger, Babu" <bmoger@....com>
To: Peter Newman <peternewman@...gle.com>,
 Reinette Chatre <reinette.chatre@...el.com>
Cc: babu.moger@....com, corbet@....net, fenghua.yu@...el.com,
 tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com,
 paulmck@...nel.org, rdunlap@...radead.org, tj@...nel.org,
 peterz@...radead.org, yanjiewtw@...il.com, kim.phillips@....com,
 lukas.bulwahn@...il.com, seanjc@...gle.com, jmattson@...gle.com,
 leitao@...ian.org, jpoimboe@...nel.org, rick.p.edgecombe@...el.com,
 kirill.shutemov@...ux.intel.com, jithu.joseph@...el.com,
 kai.huang@...el.com, kan.liang@...ux.intel.com,
 daniel.sneddon@...ux.intel.com, pbonzini@...hat.com, sandipan.das@....com,
 ilpo.jarvinen@...ux.intel.com, maciej.wieczor-retman@...el.com,
 linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, eranian@...gle.com,
 james.morse@....com
Subject: Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth
 Monitoring Counters (ABMC)

Hi Peter/Reinette,

On 8/2/2024 1:49 PM, Peter Newman wrote:
> Hi Reinette,
> 
> On Fri, Aug 2, 2024 at 9:14 AM Reinette Chatre
> <reinette.chatre@...el.com> wrote:
>>
>> Hi Peter,
>>
>> On 8/1/24 3:45 PM, Peter Newman wrote:
>>> On Thu, Aug 1, 2024 at 2:50 PM Reinette Chatre
>>> <reinette.chatre@...el.com> wrote:
>>>> On 7/17/24 10:19 AM, Moger, Babu wrote:
>>>>> On 7/12/24 17:03, Reinette Chatre wrote:
>>>>>> On 7/3/24 2:48 PM, Babu Moger wrote:
>>
>>>>>>> # Examples
>>>>>>>
>>>>>>> a. Check if ABMC support is available
>>>>>>>        #mount -t resctrl resctrl /sys/fs/resctrl/
>>>>>>>
>>>>>>>        #cat /sys/fs/resctrl/info/L3_MON/mbm_mode
>>>>>>>        [abmc]
>>>>>>>        legacy
>>>>>>>
>>>>>>>        Linux kernel detected ABMC feature and it is enabled.
>>>>>>
>>>>>> How about renaming "abmc" to "mbm_cntrs"? This will match the num_mbm_cntrs
>>>>>> info file and be the final step to make this generic so that another
>>>>>> architecture
>>>>>> can more easily support assignining hardware counters without needing to call
>>>>>> the feature AMD's "abmc".
>>>>>
>>>>> I think we aleady settled this with "mbm_cntr_assignable".
>>>>>
>>>>> For soft-RMID" it will be mbm_sw_assignable.
>>>>
>>>> Maybe getting a bit long but how about "mbm_cntr_sw_assignable" to match
>>>> with the term "mbm_cntr" in accompanying "num_mbm_cntrs"?
>>>
>>> My users are pushing for a consistent interface regardless of whether
>>> counter assignment is implemented in hardware or software, so I would
>>> like to avoid exposing implementation differences in the interface
>>> where possible.
>>
>> This seems a reasonable ask but can we be confident that if hardware
>> supports assignable counters then there will never be a reason to use
>> software assignable counters? (This needs to also consider how/if Arm
>> may use this feature.)
>>
>> I am of course not familiar with details of the software implementation
>> - could there be benefits to using it even if hardware counters are
>> supported?
> 
> I can't see any situation where the user would want to choose software
> over hardware counters. The number of groups which can be monitored by
> software assignable counters will always be less than with hardware,
> due to the need for consuming one RMID (and the counters automatically
> allocated to it by the AMD hardware) for all unassigned groups.
> 
> I consider software assignable a workaround to enable measuring
> bandwidth reliably on a large number of groups on pre-ABMC AMD
> hardware, or rather salvaging MBM on pre-ABMC hardware making use of
> our users' effort to adapt to counter assignment in resctrl. We hope
> no future implementations will choose to silently drop bandwidth
> counts, so fingers crossed, the software implementation can be phased
> out when these generations of AMD hardware are decommissioned.
> 
> The MPAM specification natively supports (or requires) counter
> assignment in hardware. From what I recall in the last of James'
> prototypes I looked at, MBM was only supported if the implementation
> provided as many bandwidth counters as there were possible monitoring
> groups, so that it could assume a monitor IDs for every PARTID:PMG
> combination.
> 
>>
>> What I would like to avoid is future complexity of needing a new mount/config
>> option that user space needs to use to select if a single "mbm_cntr_assignable"
>> is backed by hardware or software.
> 
> In my testing so far, automatically enabling counter assignment and
> automatically allocating counters for all events in new groups works
> well enough.
> 
> The only configuration I need is the ability to disable the automatic
> counter allocation so that a userspace agent can have control of where
> all the counters are assigned at all times. It's easy to implement
> this as a simple flag if the user accepts that they need to manually
> deallocate any automatically-allocated counters from groups created
> before the flag was cleared.
> 
>>
>>> The main semantic difference with SW assignments is that it is not
>>> possible to assign counters to individual events. Because the
>>> implementation is assigning RMIDs to groups, assignment results in all
>>> events being counted.
>>>
>>> I was considering introducing a boolean mbm_assign_events node to
>>> indicate whether assigning individual events is supported. If true,
>>> num_mbm_cntrs indicates the number of events which can be counted,
>>> otherwise it indicates the number of groups to which counters can be
>>> assigned and attempting to assign a single event is silently upgraded
>>> to assigning counters to all events in the group.
>>
>> How were you envisioning your users using the control file ("mbm_control")
>> in these scenarios? Does this file's interface even work for SW assignment
>> scenarios?
>>
>> Users should expect consistent interface for "mbm_control" also.
>>
>> It sounds to me that a potential "mbm_assign_events" will be false for SW
>> assignments. That would mean that "num_mbm_cntrs" will
>> contain the number of groups to which counters can be assigned?
>> Would user space be required to always enable all flags (enable all events) of
>> all domains to the same values ... or would enabling of one flag (one event)
>> in one domain automatically result in all flags (all events) enabled for all
>> domains ... or would enabling of one flag (one event) in one domain only appear
>> to user space to be enabled while in reality all flags/events are actually enabled?
> 
> I believe mbm_control should always accurately reflect which events
> are being counted.
> 
> The behavior as I've implemented today is:
> 
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_events
> 0
> 
> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> test//0=_;1=_;
> //0=_;1=_;
> 
> # echo "test//1+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> test//0=_;1=tl;
> //0=_;1=_;
> 
> # echo "test//1-t" > /sys/fs/resctrl/info/L3_MON/mbm_control
> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> test//0=_;1=_;
> //0=_;1=_;

It enables/disables the events automatically("silent upgrade/degrade").
This looks good to me.

> 
> 
>>
>>> However, If we don't expect to see these semantics in any other
>>> implementation, these semantics could be implicit in the definition of
>>> a SW assignable counter.
>>
>> It is not clear to me how implementation differences between hardware
>> and software assignment can be hidden from user space. It is possible
>> to let user space enable individual events and then silently upgrade it
>> to all events. I see two options here, either "mbm_control" needs to
>> explicitly show this "silent upgrade" so that user space knows which
>> events are actually enabled, or "mbm_control" only shows flags/events enabled
>> from user space perspective. In the former scenario, this needs more
>> user space support since a generic user space cannot be confident which
>> flags are set after writing to "mbm_control". In the latter scenario,
>> meaning of "num_mbm_cntrs" becomes unclear since user space is expected
>> to rely on it to know which events can be enabled and if some are
>> actually "silently enabled" when user space still thinks it needs to be
>> enabled the number of available counters becomes vague.
>>
>> It is not clear to me how to present hardware and software assignable
>> counters with a single consistent interface. Actually, what if the
>> "mbm_mode" is what distinguishes how counters are assigned instead of how
>> it is backed (hw vs sw)? What if, instead of "mbm_cntr_assignable" and
>> "mbm_cntr_sw_assignable" MBM modes the terms "mbm_cntr_event_assignable"
>> and "mbm_cntr_group_assignable" is used? Could that replace a
>> potential "mbm_assign_events" while also supporting user space in
>> interactions with "mbm_control"?
> 
> If I understand this correctly, is this a preference that the info
> node be named differently if its value will have different units,
> rather than a second node to indicate what the value of num_mbm_cntrs
> actually means? This sounds reasonable to me.

Looks like we are agreeing with "silent upgrade/degrade" option.

"mbm_mode" will look like below(Replaced event with evt and group with grp).

#cat /sys/fs/resctrl/infor/L3_MON/mbm_mode
[mbm_cntr_evt_assignable]
mbm_cntr_grp_assignable
legacy

Does that look ok?

I am not clear on num_mbm_cntrs in case of mbm_cntr_grp_assignable.

Peter, How do you figure out how many counters are available in soft-ABMC?


> 
> I think it's also important to note that in MPAM, the MBWU (memory
> bandwidth usage) monitors don't have a concept of local versus total
> bandwidth, so event assignment would likely not apply there either.
> What the counted bandwidth actually represents is more implicit in the
> monitor's position in the memory system in the particular
> implementation. On a theoretical multi-socket system, resctrl would
> require knowledge about the system's architecture to stitch together
> the counts from different types of monitors to produce a local and
> total value. I don't know if we'd program this SoC-specific knowledge
> into the kernel to produce a unified MBM resource like we're
> accustomed to now or if we'd present multiple MBM resources, each only
> providing an mbm_total_bytes event. In this case, the counters would
> have to be assigned separately in each MBM resource, especially if the
> different MBM resources support a different number of counters.
> 
> Thanks,
> -Peter
> 

-- 
- Babu Moger

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ