Date: Fri, 1 Mar 2024 14:36:10 -0600
From: "Moger, Babu" <babu.moger@....com>
To: Reinette Chatre <reinette.chatre@...el.com>,
 James Morse <james.morse@....com>, corbet@....net, fenghua.yu@...el.com,
 tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com, Peter Newman <peternewman@...gle.com>
Cc: x86@...nel.org, hpa@...or.com, paulmck@...nel.org, rdunlap@...radead.org,
 tj@...nel.org, peterz@...radead.org, yanjiewtw@...il.com,
 kim.phillips@....com, lukas.bulwahn@...il.com, seanjc@...gle.com,
 jmattson@...gle.com, leitao@...ian.org, jpoimboe@...nel.org,
 rick.p.edgecombe@...el.com, kirill.shutemov@...ux.intel.com,
 jithu.joseph@...el.com, kai.huang@...el.com, kan.liang@...ux.intel.com,
 daniel.sneddon@...ux.intel.com, pbonzini@...hat.com, sandipan.das@....com,
 ilpo.jarvinen@...ux.intel.com, maciej.wieczor-retman@...el.com,
 linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, eranian@...gle.com
Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth
 Monitoring Counters (ABMC)

Hi Reinette,

On 2/29/24 15:50, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/29/2024 12:37 PM, Moger, Babu wrote:
>> On 2/28/24 14:04, Reinette Chatre wrote:
>>> On 2/28/2024 9:59 AM, Moger, Babu wrote:
>>>> On 2/27/24 17:50, Reinette Chatre wrote:
>>>>> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>>>>>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>>>>>> On 2/23/24 16:21, Reinette Chatre wrote:
>>>>>
>>>
>>>>>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>>>>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>>>>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>>>>>> User space could theoretically create more monitor groups than the number of
>>>>>>> rmids that a resource claims to support using current upstream enumeration.
>>>>>>
>>>>>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>>>>>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>>>>>> more than this limit(r->num_rmid).
>>>>>>
>>>>>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>>>>>> RMID to assign the monitoring. So, assignment limit is
>>>>>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
>>>>>
>>>>> I see. Thank you for clarifying. This does make enabling simpler and one
>>>>> less user interface item that needs changing.
>>>>>
>>>>> ...
>>>>>
>>>>>>>> 2. /sys/fs/resctrl/monitor_state.
>>>>>>>> This can be used to individually assign or unassign the counters in each group.
>>>>>>>>
>>>>>>>> When assigned:
>>>>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>>>>> 0=total-assign,local-assign;1=total-assign,local-assign
>>>>>>>>
>>>>>>>> When unassigned:
>>>>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>>>>>>>
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>
>>>>>>> How do you expect this interface to be used? I understand the mechanics
>>>>>>> of this interface but on a higher level, do you expect user space to
>>>>>>> once in a while assign a new counter to a single event or monitor group
>>>>>>> (for which a fine grained interface works) or do you expect user space to
>>>>>>> shift multiple counters across several monitor events at intervals?
>>>>>>
>>>>>> I think we should provide both the options. I was thinking of providing
>>>>>> fine grained interface first.
>>>>>
>>>>> Could you please provide a motivation for why two interfaces, one inefficient
>>>>> and one not, should be created and maintained? Users can still do fine grained
>>>>> assignment with a global assignment interface.
>>>>
>>>> Lets consider one by one.
>>>>
>>>> 1. Fine grained assignment.
>>>>
>>>> It will be part of the mongroup(or control mongroup). User has the access
>>>> to the group and can query the group's current status before assigning or
>>>> unassigning.
>>>>
>>>>    $cd /sys/fs/resctrl/ctrl_mon1
>>>>    $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>>        0=total-unassign,local-unassign;1=total-unassign,local-unassign;
>>>>
>>>> Assign the total event
>>>>
>>>>   $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>>
>>>> Assign the local event
>>>>
>>>>    $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>>
>>>> Assign both events:
>>>>
>>>>    $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>>
>>>> Check the assignment status.
>>>>
>>>>    $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>>        0=total-assign,local-assign;1=total-unassign,local-unassign;
>>>>
>>>> -User interface is simple.
>>>
>>> This should not be the only motivation. Please do not sacrifice efficiency
>>> and usability just to have a simple interface. One can also argue that this
>>> interface can only be considered simple from the kernel implementation perspective,
>>> from user space it seems complicated. For example, as James pointed out earlier [1],
>>> user space would need to walk the entire resctrl to find out where counters are
>>> assigned. Peter also pointed out how the multiple syscalls needed when adjusting
>>> hundreds of monitor groups is inefficient. Please take all feedback into account.
>>>
>>> You consider "simple interface" as a motivation, there seems to be at least two
>>> arguments against this interface. Please consider these in your comparison
>>> between interfaces. These are things that should be noted and make their way to
>>> the cover letter.
>>>
>>>>
>>>> -Assignment will fail if all the h/w counters are exhausted. User needs to
>>>> unassign a counter from another group and use that counter here. This can
>>>> be done just querying the monitor state of another group.
>>>
>>> Right ... and as you state there can be hundreds of monitor groups that
>>> user space would need to walk and query to get this information.
>>>
>>>>
>>>> -Monitor group's details(cpus, tasks) are part of the group. So, it is
>>>> better to have assignment state inside the group.
>>>
>>> The assignment state should be clear from the event file.
>>>
>>>> Note: Used interface names here just to give example.
>>>>
>>>>
>>>> 2. global assignment:
>>>>
>>>> I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/
>>>> directory.
>>>>
>>>> In case there are 100 mongroups, we need to have a way to list current
>>>> assignment status for these groups. I am not sure how to list status of
>>>> these 100 groups.
>>>
>>> The kernel has many examples of interfaces that manages status of a large
>>> number of entities. I am thinking, for example, we can learn a lot from
>>> how dynamic debug works. On my system I see:
>>>
>>> $ wc -l /sys/kernel/debug/dynamic_debug/control
>>> 5359 /sys/kernel/debug/dynamic_debug/control
>>>
>>>>
>>>> If the user wants to assign the local event(or total) in a specific group
>>>> in this list of 100 groups, I am not sure how to provide interface for
>>>> that. Should we pass the name of mongroup? That will involve looping
>>>> through using the call kernfs_walk_and_get. This may be ok if we are
>>>> dealing with very small number of groups.
>>>>
>>>
>>> What is your concern when needing to modify a large number of groups?
>>> Are you concerned about the size of the writes needing to be parsed? It looks
>>> like kernfs does support writes of larger than PAGE_SIZE, but it is not clear
>>> to me that such large sizes will be required.   
>>>
>>> There is also kernfs_find_and_get() that may be more convenient to use.
>>
>> Will look at this. There is also kernfs_name and kernfs_path.
>>
>>> I believe user space needs to provide control group name for a global
>>> interface (the same name can be used by monitor groups belonging to
>>> different control groups), and that can be used to narrow search.
>>>
>>> Reading your message I do not find any motivation _against_ a global
>>> interface, except that it is not obvious to you how such interface may look
>>> or work. That is fair. Peter seems to have ideas and a working implementation
>>> that can be used as reference. So far I have only seen one comment [2] from James
>>> that was skeptical about the global interface but the reason notes that MPAM
>>> allocates counters per domain, which is the same as ABMC so we will need more
>>> information from James here on what is required since he did not respond to
>>> Peter.
>>>
>>> Below is a *hypothetical* interface to start a discussion that explores how
>>> to support fine grained assignment in an interface that aims to be easy to use
>>> by user space. Obviously Peter is also working on something so there
>>> are many viewpoints to consider.
>>>
>>> File info/L3_MON/mbm_assign_control:
>>> #control_group/mon_group/flags
>>> ctrl_a/mon_a/00=_;01=_
>>> ctrl_a/mon_b/00=l;01=t
>>> ctrl_b/mon_c/00=lt;01=lt
>>
>> I think you left out a few things here (like the default control_mon group).
> 
> No. Similar to proc_resctrl_show() the fields can be empty for
> the default group or mon groups belonging to control group.

OK. I need to understand this better. I hope to learn as I do this work.

> 
>>
>> To make this more clear, let me list all the groups here based on this.
>>
>> When none of the counters assigned:
>>
>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> resctrl/00=none,none;01=none,none (#default control_mon group)
>> resctrl/mon_a/00=none,none;01=none,none (#mon group)
>> resctrl/ctrl_a/00=none,none;01=none,none (#control_mon group)
>> resctrl/ctrl_a/mon_ab/00=none,none;01=none,none (#mon group)
> 
> I am concerned that inconsistent use of "/" will make parsing hard.

Do you mean you don't want to see multiple "/"?

resctrl/ctrl_a/mon_ab/

Change to

mon_ab/

> 
> I find "resctrl" and all the "none" redundant. It is not clear what
> this improves.
> Why have:
> resctrl/00=none,none;01=none,none
> when this could do:
> //00=_;01=_

ok.

Does "//" mean the root of the resctrl filesystem?


> 
> 
>> When some counters are assigned:
>>
>> $echo "resctrl/00=total,local" >
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to
>> default group)
>>
>> $echo "resctrl/mon_a/00=total;01=total" >
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to mon
>> group)
>>
>> $echo "resctrl/ctrl_a/00=local;01=local" >
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
>> $echo "resctrl/ctrl_a/mon_ab/00=total,local;01=total,local" >
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
> 
> We could learn some more lessons from dynamic debug (see 
> Documentation/admin-guide/dynamic-debug-howto.rst). 
> For example, "=" can be used to make an assignment while "+"
> can be used to add a counter and "-" can be used to remove a counter.
> "=_" can be used to remove counters from all events in that domain.

Yes. I looked at dynamic debug. I am still learning this interface. Some
examples are below, based on my understanding.

To assign counters to the default group on domains 0 and 1.
$echo "//00=+lt;01=+lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

To assign counters to a mon group inside the default group.
$echo "mon_a/00=+t;01=+t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

To assign counters to a control mon group inside the default group.
$echo "ctrl_a/00=+l;01=+l"  > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

To assign counters to a mon group inside another control group.
$echo "mon_ab/00=+lt;01=+lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

To unassign counters from a mon group inside another control group.
$echo "mon_ab/00=-lt;01=-lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

To unassign all the counters of a specific group on domain 0.
$echo "mon_ab/00=_" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

It does not matter whether it is a control group or a mon group. We just
need the name of the group in this interface.

Listing will be

$cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
//00=lt;01=lt
/mon_a/00=t;01=t
/ctrl_a/00=l;01=l
/mon_ab/00=_;01=_
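
For illustration only, here is a rough user-space sketch (Python, not kernel
code) of how a tool might read one line of such a listing. The file format is
still under discussion, so the layout assumed here (group path, then "/", then
"domain=flags" items separated by ";") and the function name
parse_listing_line are my assumptions, not an agreed interface.

```python
def parse_listing_line(line):
    """Split '<group path>/<dom>=<flags>;<dom>=<flags>' into its parts.

    Assumed layout only: everything before the last "/" is the group path,
    the rest is a ";"-separated list of domain=flags items.
    """
    path, _, domains = line.strip().rpartition("/")
    state = {}
    for item in domains.split(";"):
        dom, _, flags = item.partition("=")
        # "_" means no counter assigned; otherwise each letter is one event.
        state[int(dom)] = set() if flags == "_" else set(flags)
    # An empty path is treated as the default (root) group.
    return path or "/", state


listing = """//00=lt;01=lt
/mon_a/00=t;01=t
/ctrl_a/00=l;01=l
/mon_ab/00=_;01=_"""

for entry in listing.splitlines():
    group, per_domain = parse_listing_line(entry)
    print(group, {dom: sorted(ev) for dom, ev in per_domain.items()})
```

Taking the last "/" as the separator keeps the sketch independent of how many
"/" end up in the group path, which is still an open question above.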

> 
> The interface should also support assign/un-assign to multiple groups with
> a single write. To start this could use '\n' as separator as is the custom
> with other resctrl interfaces. 

Yes. That should be fine.
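
From the user-space side, a batched update could look roughly like the sketch
below (Python). It assumes one request per line with '\n' as the separator, as
suggested above. The helper name write_assignments is hypothetical, and how the
kernel would handle a partially valid batch is not decided anywhere in this
thread.

```python
import os


def write_assignments(path, requests):
    """Send all requests with a single write(2), one request per line."""
    payload = "".join(req + "\n" for req in requests).encode()
    fd = os.open(path, os.O_WRONLY)
    try:
        # One syscall for the whole batch instead of one per group.
        written = os.write(fd, payload)
    finally:
        os.close(fd)
    return written


# Intended use (the control file is the proposed one, it does not exist yet):
# write_assignments("/sys/fs/resctrl/info/L3_MON/mbm_assign_control",
#                   ["mon_a/00=+t;01=+t", "ctrl_a/00=+l;01=+l"])
```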

> 
>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> resctrl/00=total,local;01=none,none (#default control_mon group)
>> resctrl/mon_a/00=total,none;01=total,none (#mon group)
>> resctrl/ctrl_a/00=none,local;01=none,local (#control_mon group)
>> resctrl/ctrl_a/mon_ab/00=total,local;01=total,local (#mon group)
>>
>>
>> Few comments about this approach:
>> 1.This will involve lots of text processing in the kernel. Will need to
>> figure out calls for these processing.
> 
> I see that additional parsing will be needed to determine control group
> and monitor group. For these it sounds like you already have a few options
> for kernfs API to use.
> Apart from that the counter assignment will be similar parsing as what
> was done in your previous versions. I think parsing will be easier if it
> does not try to use words for the events but just use one letter flags.
> For example, there is thus no need to look for "," in the parsing of the
> events, just parse one character at a time where each character has a
> specific meaning.

ok.
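
To check my understanding of the one-letter flags together with the "=", "+"
and "-" operators, here is a small model in Python (again, only a sketch and
not the kernel parser). The flag letters "l" and "t" and the "_" placeholder
come from the examples in this thread; the function name apply_flags and the
error handling are assumptions.

```python
EVENTS = {"l", "t"}  # assumed flags: l = local, t = total


def apply_flags(current, spec):
    """Return the new event set after applying spec ('+lt', '-t', 'l', '_')."""
    op, letters = "=", spec
    if spec and spec[0] in "+-":
        op, letters = spec[0], spec[1:]
    chosen = set()
    for ch in letters:  # one character at a time, no "," to look for
        if ch == "_":
            continue  # "_" stands for "no events"
        if ch not in EVENTS:
            raise ValueError(f"unknown event flag {ch!r}")
        chosen.add(ch)
    if op == "+":
        return set(current) | chosen
    if op == "-":
        return set(current) - chosen
    return chosen  # plain "=" replaces the whole set


print(sorted(apply_flags({"l"}, "+t")))
print(sorted(apply_flags({"l", "t"}, "-t")))
print(sorted(apply_flags({"l", "t"}, "_")))
```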

> 
>>
>> 2.In this approach there is no way to list assignment of a single
>> group(like group resctrl/ctrl_a/mon_ab alone).
> 
> Should the kernel be responsible for enabling this? User space can just
> do a "cat mbm_assign_control | grep mon_ab". Is this not sufficient?

That may be OK. Peter, please comment on this.
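
One detail if this is left to user space: a plain grep for "mon_a" would also
match "mon_ab", so a tool would likely want an exact match on the group path.
A minimal Python sketch, assuming the listing layout from my examples above
(group path before the last "/"); lines_for_group and SAMPLE are names made up
for this illustration.

```python
SAMPLE = """//00=lt;01=lt
/mon_a/00=t;01=t
/ctrl_a/00=l;01=l
/mon_ab/00=_;01=_"""


def lines_for_group(listing, group):
    """Return listing lines whose group path equals `group` exactly."""
    wanted = group.strip("/")
    matches = []
    for line in listing.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        path, _, _ = line.rpartition("/")
        if path.strip("/") == wanted:
            matches.append(line)
    return matches


print(lines_for_group(SAMPLE, "mon_a"))
```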

> 
>>
>> 3. This is similar to fine grained approach we discussed but in global level.
> 
> That is what I have been trying to get across. This has full benefit of the
> original implementation while also addressing all problems raised against it.
> 
>>
>> Want to get Pater/James comments about this approach.
> (Peter)
> 
> Of course. I look forward to that. Once agreed it may also be worthwhile to
> approach x86 maintainers with an RFC of the proposed new user interface to learn
> their guidance. This is where it is important to keep track of all the requirements,
> as well as pros and cons of different options.

OK, sure. I am fine with making the next version an RFC.

> 
> Reinette

-- 
Thanks
Babu Moger
