[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1738493b-3248-4c9e-82a8-1599a033440d@intel.com>
Date: Wed, 28 Feb 2024 12:04:24 -0800
From: Reinette Chatre <reinette.chatre@...el.com>
To: <babu.moger@....com>, James Morse <james.morse@....com>, <corbet@....net>,
<fenghua.yu@...el.com>, <tglx@...utronix.de>, <mingo@...hat.com>,
<bp@...en8.de>, <dave.hansen@...ux.intel.com>
CC: <x86@...nel.org>, <hpa@...or.com>, <paulmck@...nel.org>,
<rdunlap@...radead.org>, <tj@...nel.org>, <peterz@...radead.org>,
<yanjiewtw@...il.com>, <kim.phillips@....com>, <lukas.bulwahn@...il.com>,
<seanjc@...gle.com>, <jmattson@...gle.com>, <leitao@...ian.org>,
<jpoimboe@...nel.org>, <rick.p.edgecombe@...el.com>,
<kirill.shutemov@...ux.intel.com>, <jithu.joseph@...el.com>,
<kai.huang@...el.com>, <kan.liang@...ux.intel.com>,
<daniel.sneddon@...ux.intel.com>, <pbonzini@...hat.com>,
<sandipan.das@....com>, <ilpo.jarvinen@...ux.intel.com>,
<peternewman@...gle.com>, <maciej.wieczor-retman@...el.com>,
<linux-doc@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<eranian@...gle.com>
Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth
Monitoring Counters (ABMC)
Hi Babu,
On 2/28/2024 9:59 AM, Moger, Babu wrote:
> On 2/27/24 17:50, Reinette Chatre wrote:
>> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>>> On 2/23/24 16:21, Reinette Chatre wrote:
>>
>>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>>> User space could theoretically create more monitor groups than the number of
>>>> rmids that a resource claims to support using current upstream enumeration.
>>>
>>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>>> more than this limit(r->num_rmid).
>>>
>>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>>> RMID to assign the monitoring. So, assignment limit is
>>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
>>
>> I see. Thank you for clarifying. This does make enabling simpler and one
>> less user interface item that needs changing.
>>
>> ...
>>
>>>>> 2. /sys/fs/resctrl/monitor_state.
>>>>> This can used to individually assign or unassign the counters in each group.
>>>>>
>>>>> When assigned:
>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>> 0=total-assign,local-assign;1=total-assign,local-assign
>>>>>
>>>>> When unassigned:
>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>>>>
>>>>>
>>>>> Thoughts?
>>>>
>>>> How do you expect this interface to be used? I understand the mechanics
>>>> of this interface but on a higher level, do you expect user space to
>>>> once in a while assign a new counter to a single event or monitor group
>>>> (for which a fine grained interface works) or do you expect user space to
>>>> shift multiple counters across several monitor events at intervals?
>>>
>>> I think we should provide both the options. I was thinking of providing
>>> fine grained interface first.
>>
>> Could you please provide a motivation for why two interfaces, one inefficient
>> and one not, should be created and maintained? Users can still do fine grained
>> assignment with a global assignment interface.
>
> Lets consider one by one.
>
> 1. Fine grained assignment.
>
> It will be part of the mongroup(or control mongroup). User has the access
> to the group and can query the group's current status before assigning or
> unassigning.
>
> $cd /sys/fs/resctrl/ctrl_mon1
> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
> 0=total-unassign,local-unassign;1=total-unassign,local-unassign;
>
> Assign the total event
>
> $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>
> Assign the local event
>
> $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>
> Assign both events:
>
> $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>
> Check the assignment status.
>
> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
> 0=total-assign,local-assign;1=total-unassign,local-unassign;
>
> -User interface is simple.
This should not be the only motivation. Please do not sacrifice efficiency
and usability just to have a simple interface. One can also argue that this
interface can only be considered simple from the kernel implementation perspective,
from user space it seems complicated. For example, as James pointed out earlier [1],
user space would need to walk the entire resctrl to find out where counters are
assigned. Peter also pointed out how the multiple syscalls needed when adjusting
hundreds of monitor groups is inefficient. Please take all feedback into account.
You consider "simple interface" as a motivation, there seems to be at least two
arguments against this interface. Please consider these in your comparison
between interfaces. These are things that should be noted and make their way to
the cover letter.
>
> -Assignment will fail if all the h/w counters are exhausted. User needs to
> unassign a counter from another group and use that counter here. This can
> be done just querying the monitor state of another group.
Right ... and as you state there can be hundreds of monitor groups that
user space would need to walk and query to get this information.
>
> -Monitor group's details(cpus, tasks) are part of the group. So, it is
> better to have assignment state inside the group.
The assignment state should be clear from the event file.
> Note: Used interface names here just to give example.
>
>
> 2. global assignment:
>
> I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/
> directory.
>
> In case there are 100 mongroups, we need to have a way to list current
> assignment status for these groups. I am not sure how to list status of
> these 100 groups.
The kernel has many examples of interfaces that manages status of a large
number of entities. I am thinking, for example, we can learn a lot from
how dynamic debug works. On my system I see:
$ wc -l /sys/kernel/debug/dynamic_debug/control
5359 /sys/kernel/debug/dynamic_debug/control
>
> If user is wants to assign the local event(or total) in a specific group
> in this list of 100 groups, I am not sure how to provide interface for
> that. Should we pass the name of mongroup? That will involve looping
> through using the call kernfs_walk_and_get. This may be ok if we are
> dealing with very small number of groups.
>
What is your concern when needing to modify a large number of groups?
Are you concerned about the size of the writes needing to be parsed? It looks
like kernfs does support writes of larger than PAGE_SIZE, but it is not clear
to me that such large sizes will be required.
There is also kernfs_find_and_get() that may be more convenient to use.
I believe user space needs to provide control group name for a global
interface (the same name can be used by monitor groups belonging to
different control groups), and that can be used to narrow search.
Reading your message I do not find any motivation _against_ a global
interface, except that it is not obvious to you how such interface may look
or work. That is fair. Peter seems to have ideas and a working implementation
that can be used as reference. So far I have only seen one comment [2] from James
that was skeptical about the global interface but the reason notes that MPAM
allocates counters per domain, which is the same as ABMC so we will need more
information from James here on what is required since he did not respond to
Peter.
Below is a *hypothetical* interface to start a discussion that explores how
to support fine grained assignment in an interface that aims to be easy to use
by user space. Obviously Peter is also working on something so there
are many viewpoints to consider.
File info/L3_MON/mbm_assign_control:
#control_group/mon_group/flags
ctrl_a/mon_a/00=_;01=_
ctrl_a/mon_b/00=l;01=t
ctrl_b/mon_c/00=lt;01=lt
Above file displays to user:
* No counters are assigned to monitor group mon_a within control group ctrl_a
* Counter for local MBM is assigned to domain 0 of monitor group mon_b within
control group ctrl_a
* Counter for total MBM is assigned to domain 1 of monitor group mon_b within
control group ctrl_a
* Counters for local and total MBM are assigned to both domains of monitor
group mon_c within control group ctrl_b
With above interface user space can, with a single read, get insight into
how counters are assigned across all monitor groups.
User space can write to the file to modify the flags. If assigning a new
counter when no more counters are available then the write will fail.
Potentially, if changes are made in order provided by the user then
the user will be able to unassign counters from one group and re-assign to
another group with a single write.
I provide this purely to generate some ideas and gather more thoughts on
a global interface.
Reinette
[1] https://lore.kernel.org/lkml/2f373abf-f0c0-4f5d-9e22-1039a40a57f0@arm.com/
[2] https://lore.kernel.org/lkml/1a8c1cd6-a1ce-47a2-bc87-d4cccc84519b@arm.com/
Powered by blists - more mailing lists