Message-ID: <088878bd-7533-492d-838c-6b39a93aad4d@amd.com>
Date: Tue, 27 Feb 2024 12:12:19 -0600
From: "Moger, Babu" <babu.moger@....com>
To: Reinette Chatre <reinette.chatre@...el.com>,
 James Morse <james.morse@....com>, corbet@....net, fenghua.yu@...el.com,
 tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com
Cc: x86@...nel.org, hpa@...or.com, paulmck@...nel.org, rdunlap@...radead.org,
 tj@...nel.org, peterz@...radead.org, yanjiewtw@...il.com,
 kim.phillips@....com, lukas.bulwahn@...il.com, seanjc@...gle.com,
 jmattson@...gle.com, leitao@...ian.org, jpoimboe@...nel.org,
 rick.p.edgecombe@...el.com, kirill.shutemov@...ux.intel.com,
 jithu.joseph@...el.com, kai.huang@...el.com, kan.liang@...ux.intel.com,
 daniel.sneddon@...ux.intel.com, pbonzini@...hat.com, sandipan.das@....com,
 ilpo.jarvinen@...ux.intel.com, peternewman@...gle.com,
 maciej.wieczor-retman@...el.com, linux-doc@...r.kernel.org,
 linux-kernel@...r.kernel.org, eranian@...gle.com
Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth
 Monitoring Counters (ABMC)

Hi Reinette,

On 2/26/24 15:20, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>> On 2/23/24 16:21, Reinette Chatre wrote:
>>> On 2/23/2024 12:11 PM, Moger, Babu wrote:
>>>> On 2/23/24 11:17, Reinette Chatre wrote:
>>>>>
>>>>>
>>>>> On 2/20/2024 12:48 PM, Moger, Babu wrote:
>>>>>> On 2/20/24 09:21, James Morse wrote:
>>>>>>> On 19/01/2024 18:22, Babu Moger wrote:
>>>>>
>>>>>>>> e. Enable ABMC mode.
>>>>>>>>
>>>>>>>> 	#echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>>>>>         #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>>>>>         1
>>>>>>>
>>>>>>> Why does this mode need enabling? Can't it be enabled automatically on hardware that
>>>>>>> supports it, or enabled implicitly when the first assignment attempt arrives?
>>>>>>>
>>>>>>> I guess this is really needed for a reset - could we implement that instead? This way
>>>>>>> there isn't an extra step user-space has to do to make the assignments work.
>>>>>>
>>>>>> Mostly the new features are added as an opt-in method. So, kept it that
>>>>>> way. If we enable this feature automatically, then we have to provide an
>>>>>> option to disable it.
>>>>>>
>>>>>
>>>>> At the same time it sounds to me like ABMC can improve current users'
>>>>> experience without requiring them to do anything. This sounds appealing.
>>>>> For example, if I understand correctly, it may be possible to start resctrl
>>>>> with ABMC enabled by default and the number of monitoring groups (currently
>>>>> exposed to user space via "num_rmids") limited to the number of counters
>>>>> supported by ABMC. Existing users would then by default obtain better behavior
>>>>> of counters not resetting.
>>>>
>>>> Yes, I like the idea. But it will break compatibility with the pqos
>>>> tool (intel_cmt_cat utility). pqos tool monitoring will not work unless
>>>> the tool supports ABMC enablement. The ABMC feature requires an extra
>>>> step to assign the counters for monitoring to work.
>>>
>>> I am considering two scenarios, the "default behavior" is what a user will
>>> experience when booting resctrl on an ABMC system and the "new feature
>>> behavior" where a user can take full advantage of all that ABMC (and soft
>>> RMID, and MPAM) can offer.
>>>
>>> So, first, on an ABMC system in the "default behavior" scenario I expect
>>> that resctrl can do required ABMC counter configuration automatically at
>>> the time a monitor group is created. In this "default behavior" scenario
>>> resctrl would expose "num_rmids" to be half of the number of assignable
>>> counters. When a user then creates a monitor group two counters will be
>>> used and configured to count the local and total bytes respectively. If
>>> two counters are not available then ENOSPC returned, just like when system
>>> is out of closid/rmid.  With this "default behavior" user space thus gets
>>> improved behavior without making any changes on its part. I do not have
>>
>> We can automatically assign the h/w counter when monitor group is created
>> until we run out of h/w counters. That is good idea. By default user will
>> not notice any difference in ABMC mode.
>>
>>> insight into how many counters ABMC could be expected to expose though ...
>>> so some users may be surprised at how few monitor groups can be created
>>> with new hardware? This may not be an issue since that would accurately
>>> reflect how many _reliable_ monitor groups can be created and if user needs
>>> more monitor groups then that would be a time to explore the "new feature"
>>> that requires changes in how user interacts with resctrl.
>>
>> Currently, 32 h/w counters are available to configure. With two counters
>> for each group, we can create 16 groups(15 new groups plus the default
>> group). That should be fine as pqos tool creates only 16 groups when it is
>> started.
> 
> user space can never assume that a certain number of groups can
> be created. 
> 
>>> Apart from the "default behavior" there are two options to consider ...
>>> (a) the "original" behavior(? I do not know what to call it) - this would be
>>>     where user space wants(?) to have the current non-ABMC behavior on an ABMC
>>>     system, where the previous "num_rmids" monitor groups can be created but
>>>     the counters are reset unpredictably ... should this still be supported
>>>     on ABMC systems though?
>>
>> I would say yes. If for some reason (hardware or software issues) the user
>> is not able to use ABMC mode, they have an option to go back to legacy mode.
> 
> I see. Should this perhaps be protected behind the resctrl "debug" mount option?

The debug option gives the wrong impression. It is better to keep the option
to enable the feature available in normal mode.

> 
>>> (b) the "new feature" behavior where user space gets full benefit of ABMC
>>>     that allows user space to create any number of monitor groups but then
>>>     user space needs to let hardware (via resctrl) know which
>>>     events should be counted.
>>
>> Is this "new feature" enabled by default when ABMC is available?
> 
> Not in this design, no. In these scenarios ABMC will be available and enabled
> in both the "default" and "new feature" behavior. The difference is no user
> space changes are needed in "default" scenario and resctrl limits the number
> of monitor groups to support all monitor groups to be backed by hardware
> counters. 
> When "new feature" is enabled when ABMC is available and enabled then
> user space is able to create more monitor groups than available hardware
> counters and new user interface is required to manage associating counters
> with monitor events.

ok. That sounds good.

> 
>>
>> Or we need to provide an interface to enable this feature?
> 
> Yes, an interface will be needed to enable this feature.

ok.

> 
>>
>>
>>>
>>> I expect that only (b) above would require user space change. Considering
>>> that per documentation, "num_rmids" means "This is the upper bound for how
>>> many "CTRL_MON" + "MON" groups can be created" I expect that "num_rmids"
>>> becomes undefined when "new feature" is enabled. When this new feature is enabled
>>> then user space is no longer limited by number of RMIDs on how many monitor
>>
>> With ABMC, we will have a new field "mbm_assignable_counters". We don't
>> have to change the definition of "num_rmids".
> 
> The problem here is that "num_rmids" is (as per Documentation/arch/x86/resctrl.rst)
> documented to be an upper bound for how many monitor groups can be created.
> As I understand, when ABMC is enabled and its full capability exposed to user
> space then there is no limit to how many monitor groups can be created, no?

No. That is not correct. The number of monitor groups is still limited by
num_rmids. But assignment is limited by mbm_assignable_counters. More below.

> 
> For example, if I understand correctly, theoretically, when ABMC is enabled then
> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
> is not unsigned, tbd if number of directories may also be limited by kernfs).
> User space could theoretically create more monitor groups than the number of
> rmids that a resource claims to support using current upstream enumeration.

CPU or task association still uses PQR_ASSOC (MSR C8Fh). There are only 11
bits (depends on the specific h/w) to represent RMIDs. So, we cannot create
more groups than this limit (r->num_rmid).

In the case of ABMC, the h/w uses a separate set of counters
(mbm_assignable_counters) together with the RMID to assign the monitoring.
So, the assignment limit is mbm_assignable_counters. The limit on the number
of mon groups is still r->num_rmid.
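
To make the two limits concrete, here is a rough Python sketch of the
arithmetic. The 11-bit RMID width and the 32 counters are the figures quoted
in this thread and depend on the specific h/w. The function names are made up
for illustration and are not part of resctrl. The real r->num_rmid is
enumerated by the hardware and may be smaller than the field-width bound
computed here.

```python
def max_rmids(rmid_bits):
    """Upper bound on distinct RMIDs that fit in a field of rmid_bits bits."""
    return 1 << rmid_bits


def max_assigned_groups(assignable_counters, events_per_group):
    """Groups that can be fully counted when each uses events_per_group counters."""
    return assignable_counters // events_per_group


RMID_BITS = 11   # assumption: width quoted in this thread, h/w specific
COUNTERS = 32    # assumption: mbm_assignable_counters quoted in this thread

print(max_rmids(RMID_BITS))               # 2048: ceiling on mon groups
print(max_assigned_groups(COUNTERS, 2))   # 16: groups counting total and local
print(max_assigned_groups(COUNTERS, 1))   # 32: groups counting a single event
```

The point is that the two numbers are independent: the first bounds how many
groups can exist, the second bounds how many of them can be counted at once.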

> Instead, it is the "mbm_assignable_counters" that is of interest, that is what
> user space uses to determine how many of the (potentially very large number of)
> monitor groups/monitor events can be counted at any particular time.
> 
>>> groups can be created and this is the point that the user interface that you
>>> and Peter have ideas about comes into play. Specifically, user space needing
>>> a way to specify:
>>> (a) "let me create more monitor groups than the hardware can support"/"let me
>>>      control which events/monitor groups are counted"
>>>      (like the "mbm_assign" file in your proposal)
>>> (b) "here are the events that need to be counted" 
>>>      (like the "monitor_state" and "mbm_{local,total}_bytes_assigned" proposals)
>>
>> With global assignment option out of way for now(may be introduced later),
>> we can provide two interfaces.
>>
>> 1. /sys/fs/resctrl/info/L3_MON/mbm_assign
>> This will be enabled by default when ABMC is available. Users can disable
>> this option to go back to legacy mode.
> 
> Potentially (all naming placeholders that will only be visible on systems that
> actually supports particular mode):
> legacy [default] new_feature soft_rmid

ok

> 
>>
>> 2. /sys/fs/resctrl/monitor_state.
>> This can be used to individually assign or unassign the counters in each group.
>>
>> When assigned:
>> #cat /sys/fs/resctrl/monitor_state
>> 0=total-assign,local-assign;1=total-assign,local-assign
>>
>> When unassigned:
>> #cat /sys/fs/resctrl/monitor_state
>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>
>>
>> Thoughts?
> 
> How do you expect this interface to be used? I understand the mechanics
> of this interface but on a higher level, do you expect user space to
> once in a while assign a new counter to a single event or monitor group
> (for which a fine grained interface works) or do you expect user space to
> shift multiple counters across several monitor events at intervals?

I think we should provide both options. I was thinking of providing the
fine-grained interface first.

A few use cases:
1. User wants to assign only one event (total or local) per group.
   In this case, they can assign 32 events in 32 different groups.

   #echo 0=total-assign >  /sys/fs/resctrl/monitor_state
   or
   #echo 0=local-assign >  /sys/fs/resctrl/monitor_state

   When done:

   #echo 0=total-unassign >  /sys/fs/resctrl/monitor_state
   or
   #echo 0=local-unassign >  /sys/fs/resctrl/monitor_state

   Note: 0 is the domain here.


2. User wants to assign both "local" and "total" events per group. In this
case, they can assign 32 events in 16 different groups.

   #echo 0=local-assign,total-assign  >  /sys/fs/resctrl/monitor_state

   When done:

   #echo 0=local-unassign,total-unassign  >  /sys/fs/resctrl/monitor_state

3. A combination of 1 and 2.

4. Assign counters to multiple groups at once. I consider this global
assignment. This can be achieved with 1 and 2 by user space looping through
all the groups of interest. Peter is worried about system call latency
here. He wants to optimize this. I was thinking this can be done later.

> 
> Across resctrl's lifetime we have seen examples of user space wanting
> to accomplish more with a single resctrl interaction. For example moving
> multiple tasks to a group that you added support for and moving a monitor
> group feature from Peter.
> 
> I thus think that it would be valuable to consider more efficient
> interfaces from the beginning. I do not think that this is the type
> of work that is an optimization to be delayed until an unspecified later
> time, but instead multiple usage of interface can be considered from the
> start with a most optimal interface created from the beginning. Specifically,
> why does resctrl need to be "extended" to support a global assignment as proposed
> by Peter at a later time, why can it not be done as the original and (ideally)
> only mechanism?
> 
> Reinette

-- 
Thanks
Babu Moger
