Message-ID: <54687d59-d0e4-4fe7-b25f-dc1fead01ea1@intel.com>
Date: Thu, 29 Feb 2024 13:50:35 -0800
From: Reinette Chatre <reinette.chatre@...el.com>
To: <babu.moger@....com>, James Morse <james.morse@....com>, <corbet@....net>,
<fenghua.yu@...el.com>, <tglx@...utronix.de>, <mingo@...hat.com>,
<bp@...en8.de>, <dave.hansen@...ux.intel.com>, Peter Newman
<peternewman@...gle.com>
CC: <x86@...nel.org>, <hpa@...or.com>, <paulmck@...nel.org>,
<rdunlap@...radead.org>, <tj@...nel.org>, <peterz@...radead.org>,
<yanjiewtw@...il.com>, <kim.phillips@....com>, <lukas.bulwahn@...il.com>,
<seanjc@...gle.com>, <jmattson@...gle.com>, <leitao@...ian.org>,
<jpoimboe@...nel.org>, <rick.p.edgecombe@...el.com>,
<kirill.shutemov@...ux.intel.com>, <jithu.joseph@...el.com>,
<kai.huang@...el.com>, <kan.liang@...ux.intel.com>,
<daniel.sneddon@...ux.intel.com>, <pbonzini@...hat.com>,
<sandipan.das@....com>, <ilpo.jarvinen@...ux.intel.com>,
<maciej.wieczor-retman@...el.com>, <linux-doc@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <eranian@...gle.com>
Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth
Monitoring Counters (ABMC)

Hi Babu,

On 2/29/2024 12:37 PM, Moger, Babu wrote:
> On 2/28/24 14:04, Reinette Chatre wrote:
>> On 2/28/2024 9:59 AM, Moger, Babu wrote:
>>> On 2/27/24 17:50, Reinette Chatre wrote:
>>>> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>>>>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>>>>> On 2/23/24 16:21, Reinette Chatre wrote:
>>>>
>>
>>>>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>>>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>>>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>>>>> User space could theoretically create more monitor groups than the number of
>>>>>> rmids that a resource claims to support using current upstream enumeration.
>>>>>
>>>>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>>>>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>>>>> more than this limit(r->num_rmid).
>>>>>
>>>>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>>>>> RMID to assign the monitoring. So, assignment limit is
>>>>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
>>>>
>>>> I see. Thank you for clarifying. This does make enabling simpler and one
>>>> less user interface item that needs changing.
>>>>
>>>> ...
>>>>
>>>>>>> 2. /sys/fs/resctrl/monitor_state.
>>>>>>> This can be used to individually assign or unassign the counters in each group.
>>>>>>>
>>>>>>> When assigned:
>>>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>>>> 0=total-assign,local-assign;1=total-assign,local-assign
>>>>>>>
>>>>>>> When unassigned:
>>>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>>>>>>
>>>>>>>
>>>>>>> Thoughts?
>>>>>>
>>>>>> How do you expect this interface to be used? I understand the mechanics
>>>>>> of this interface but on a higher level, do you expect user space to
>>>>>> once in a while assign a new counter to a single event or monitor group
>>>>>> (for which a fine grained interface works) or do you expect user space to
>>>>>> shift multiple counters across several monitor events at intervals?
>>>>>
>>>>> I think we should provide both the options. I was thinking of providing
>>>>> fine grained interface first.
>>>>
>>>> Could you please provide a motivation for why two interfaces, one inefficient
>>>> and one not, should be created and maintained? Users can still do fine grained
>>>> assignment with a global assignment interface.
>>>
>>> Let's consider one by one.
>>>
>>> 1. Fine grained assignment.
>>>
>>> It will be part of the mongroup(or control mongroup). User has the access
>>> to the group and can query the group's current status before assigning or
>>> unassigning.
>>>
>>> $cd /sys/fs/resctrl/ctrl_mon1
>>> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign;
>>>
>>> Assign the total event
>>>
>>> $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>
>>> Assign the local event
>>>
>>> $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>
>>> Assign both events:
>>>
>>> $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>
>>> Check the assignment status.
>>>
>>> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>>> 0=total-assign,local-assign;1=total-unassign,local-unassign;
>>>
>>> -User interface is simple.
>>
>> This should not be the only motivation. Please do not sacrifice efficiency
>> and usability just to have a simple interface. One can also argue that this
>> interface can only be considered simple from the kernel implementation perspective,
>> from user space it seems complicated. For example, as James pointed out earlier [1],
>> user space would need to walk the entire resctrl to find out where counters are
>> assigned. Peter also pointed out how the multiple syscalls needed when adjusting
>> hundreds of monitor groups is inefficient. Please take all feedback into account.
>>
>> You consider "simple interface" as a motivation, there seems to be at least two
>> arguments against this interface. Please consider these in your comparison
>> between interfaces. These are things that should be noted and make their way to
>> the cover letter.
>>
>>>
>>> -Assignment will fail if all the h/w counters are exhausted. User needs to
>>> unassign a counter from another group and use that counter here. This can
>>> be done by just querying the monitor state of another group.
>>
>> Right ... and as you state there can be hundreds of monitor groups that
>> user space would need to walk and query to get this information.
>>
>>>
>>> -Monitor group's details(cpus, tasks) are part of the group. So, it is
>>> better to have assignment state inside the group.
>>
>> The assignment state should be clear from the event file.
>>
>>> Note: Used interface names here just to give example.
>>>
>>>
>>> 2. global assignment:
>>>
>>> I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/
>>> directory.
>>>
>>> In case there are 100 mongroups, we need to have a way to list current
>>> assignment status for these groups. I am not sure how to list status of
>>> these 100 groups.
>>
>> The kernel has many examples of interfaces that manage the status of a large
>> number of entities. I am thinking, for example, we can learn a lot from
>> how dynamic debug works. On my system I see:
>>
>> $ wc -l /sys/kernel/debug/dynamic_debug/control
>> 5359 /sys/kernel/debug/dynamic_debug/control
>>
>>>
>>> If the user wants to assign the local event (or total) in a specific group
>>> in this list of 100 groups, I am not sure how to provide interface for
>>> that. Should we pass the name of mongroup? That will involve looping
>>> through using the call kernfs_walk_and_get. This may be ok if we are
>>> dealing with very small number of groups.
>>>
>>
>> What is your concern when needing to modify a large number of groups?
>> Are you concerned about the size of the writes needing to be parsed? It looks
>> like kernfs does support writes of larger than PAGE_SIZE, but it is not clear
>> to me that such large sizes will be required.
>>
>> There is also kernfs_find_and_get() that may be more convenient to use.
>
> Will look at this. There is also kernfs_name and kernfs_path.
>
>> I believe user space needs to provide control group name for a global
>> interface (the same name can be used by monitor groups belonging to
>> different control groups), and that can be used to narrow search.
>>
>> Reading your message I do not find any motivation _against_ a global
>> interface, except that it is not obvious to you how such interface may look
>> or work. That is fair. Peter seems to have ideas and a working implementation
>> that can be used as reference. So far I have only seen one comment [2] from James
>> that was skeptical about the global interface but the reason notes that MPAM
>> allocates counters per domain, which is the same as ABMC so we will need more
>> information from James here on what is required since he did not respond to
>> Peter.
>>
>> Below is a *hypothetical* interface to start a discussion that explores how
>> to support fine grained assignment in an interface that aims to be easy to use
>> by user space. Obviously Peter is also working on something so there
>> are many viewpoints to consider.
>>
>> File info/L3_MON/mbm_assign_control:
>> #control_group/mon_group/flags
>> ctrl_a/mon_a/00=_;01=_
>> ctrl_a/mon_b/00=l;01=t
>> ctrl_b/mon_c/00=lt;01=lt
>
> I think you left out a few things here (like the default control_mon group).

No. Similar to proc_resctrl_show() the fields can be empty for
the default group or mon groups belonging to control group.
>
> To make it more clear, let me list all the groups here based on this.
>
> When none of the counters are assigned:
>
> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> resctrl/00=none,none;01=none,none (#default control_mon group)
> resctrl/mon_a/00=none,none;01=none,none (#mon group)
> resctrl/ctrl_a/00=none,none;01=none,none (#control_mon group)
> resctrl/ctrl_a/mon_ab/00=none,none;01=none,none (#mon group)

I am concerned that inconsistent use of "/" will make parsing hard.
I find "resctrl" and all the "none" redundant. It is not clear what
this improves.

Why have:
resctrl/00=none,none;01=none,none

when this could do:
//00=_;01=_
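To illustrate why consistent use of "/" keeps parsing simple, here is a minimal sketch in Python (for brevity only, this is not kernel code). It assumes the hypothetical "control_group/mon_group/flags" line format from above, where an empty field stands for the default group. The function name is made up for this example.

```python
def split_group_line(line):
    """Split '<ctrl>/<mon>/<flags>' into its three fields.

    Assumption: every line carries exactly two '/' separators before the
    flags, and an empty ctrl or mon field means "the default group".
    A line with fewer separators raises ValueError.
    """
    ctrl, mon, flags = line.strip().split("/", 2)
    return ctrl, mon, flags


# The default control group, a control group itself, and a mon group
# inside it all split the same way.
print(split_group_line("//00=_;01=_"))
print(split_group_line("ctrl_a//00=l;01=t"))
print(split_group_line("ctrl_a/mon_b/00=l;01=t"))
```

With a fixed number of separators the parser never has to guess whether the first component is a control group or a monitor group.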
> When some counters are assigned:
>
> $echo "resctrl/00=total,local" >
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to
> default group)
>
> $echo "resctrl/mon_a/00=total;01=total" >
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to mon
> group)
>
> $echo "resctrl/ctrl_a/00=local;01=local" >
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
> $echo "resctrl/ctrl_a/mon_ab/00=total,local;01=total,local" >
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>

We could learn some more lessons from dynamic debug (see
Documentation/admin-guide/dynamic-debug-howto.rst).

For example, "=" can be used to make an assignment while "+"
can be used to add a counter and "-" can be used to remove a counter.
"=_" can be used to remove counters from all events in that domain.

The interface should also support assign/un-assign to multiple groups with
a single write. To start this could use '\n' as separator as is the custom
with other resctrl interfaces.
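As a rough sketch of those semantics (Python for brevity, not kernel code): the "=", "+" and "-" operators and the '\n' separator are as described above, while the function name, the state dictionary and the exact grammar accepted by the regular expression are assumptions made only for this example.

```python
import re

# Hypothetical grammar for one domain entry: <domain id><op><flags>,
# with one-letter flags 't' (total), 'l' (local) and '_' (none).
_DOMAIN_RE = re.compile(r"^(\d+)([=+-])([tl_]+)$")


def apply_write(state, text):
    """Apply a multi-line write to `state` and return it.

    `state` maps (ctrl, mon, domain_id) to the set of assigned flags.
    Each line is '<ctrl>/<mon>/<domain entries separated by ";">'.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        ctrl, mon, domains = line.split("/", 2)
        for item in domains.split(";"):
            m = _DOMAIN_RE.match(item)
            if m is None:
                raise ValueError(f"cannot parse {item!r}")
            dom, op = int(m.group(1)), m.group(2)
            flags = set(m.group(3)) - {"_"}
            cur = state.get((ctrl, mon, dom), set())
            if op == "=":
                cur = flags        # replace the assignment, "=_" clears it
            elif op == "+":
                cur = cur | flags  # add counters
            else:
                cur = cur - flags  # remove counters
            state[(ctrl, mon, dom)] = cur
    return state


state = {}
# One write that touches two groups.
apply_write(state, "ctrl_a/mon_a/00=lt;01=l\n//00+t\n")
# Drop the local counter in domain 0, clear domain 1.
apply_write(state, "ctrl_a/mon_a/00-l;01=_\n")
```

Note that this sketch applies changes line by line, so a parse error in a later line leaves earlier lines applied. Whether a real interface should validate the whole write first is an open design question.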
> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> resctrl/00=total,local;01=none,none (#default control_mon group)
> resctrl/mon_a/00=total,none;01=total,none (#mon group)
> resctrl/ctrl_a/00=none,local;01=none,local (#control_mon group)
> resctrl/ctrl_a/mon_ab/00=total,local;01=total,local (#mon group)
>
>
> A few comments about this approach:
> 1. This will involve a lot of text processing in the kernel. Will need to
> figure out the calls for this processing.

I see that additional parsing will be needed to determine control group
and monitor group. For these it sounds like you already have a few options
for kernfs API to use.

Apart from that the counter assignment will be similar parsing as what
was done in your previous versions. I think parsing will be easier if it
does not try to use words for the events but just uses one-letter flags.
For example, there is thus no need to look for "," in the parsing of the
events, just parse one character at a time where each character has a
specific meaning.
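A minimal sketch of that one-character-at-a-time parsing, again in Python purely for illustration. The letters "t", "l" and "_" follow the hypothetical interface above. The mapping to the existing mbm_total_bytes and mbm_local_bytes event names and the function name are assumptions for this example.

```python
# Hypothetical flag table: each character has exactly one meaning.
FLAG_EVENTS = {
    "t": "mbm_total_bytes",
    "l": "mbm_local_bytes",
    "_": None,  # no event assigned
}


def parse_flags(flags):
    """Return the set of events named by a flag string such as 'lt' or '_'."""
    events = set()
    for ch in flags:
        if ch not in FLAG_EVENTS:
            raise ValueError(f"unknown flag {ch!r}")
        event = FLAG_EVENTS[ch]
        if event is not None:
            events.add(event)
    return events


print(sorted(parse_flags("lt")))
print(sorted(parse_flags("_")))
```

There is no tokenizing step and no separator handling: a table lookup per character is the whole parser, and an unknown character is rejected immediately.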
>
> 2. In this approach there is no way to list the assignment of a single
> group (like group resctrl/ctrl_a/mon_ab alone).

Should the kernel be responsible for enabling this? User space can just
do a "cat mbm_assign_control | grep mon_ab". Is this not sufficient?
>
> 3. This is similar to the fine grained approach we discussed but at a global level.

That is what I have been trying to get across. This has the full benefit of the
original implementation while also addressing all problems raised against it.
>
> Want to get Pater/James comments about this approach.

(Peter)

Of course. I look forward to that. Once agreed it may also be worthwhile to
approach x86 maintainers with an RFC of the proposed new user interface to learn
their guidance. This is where it is important to keep track of all the requirements,
as well as pros and cons of different options.

Reinette