[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4bf82744-da09-43c3-b8f1-7fc203d1c1c3@amd.com>
Date: Thu, 28 Nov 2024 13:35:09 -0600
From: "Moger, Babu" <bmoger@....com>
To: Peter Newman <peternewman@...gle.com>,
Reinette Chatre <reinette.chatre@...el.com>
Cc: babu.moger@....com, corbet@....net, tglx@...utronix.de, mingo@...hat.com,
bp@...en8.de, dave.hansen@...ux.intel.com, fenghua.yu@...el.com,
x86@...nel.org, hpa@...or.com, thuth@...hat.com, paulmck@...nel.org,
rostedt@...dmis.org, akpm@...ux-foundation.org, xiongwei.song@...driver.com,
pawan.kumar.gupta@...ux.intel.com, daniel.sneddon@...ux.intel.com,
perry.yuan@....com, sandipan.das@....com, kai.huang@...el.com,
xiaoyao.li@...el.com, seanjc@...gle.com, jithu.joseph@...el.com,
brijesh.singh@....com, xin3.li@...el.com, ebiggers@...gle.com,
andrew.cooper3@...rix.com, mario.limonciello@....com, james.morse@....com,
tan.shaopeng@...itsu.com, tony.luck@...el.com, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, maciej.wieczor-retman@...el.com,
eranian@...gle.com, jpoimboe@...nel.org, thomas.lendacky@....com
Subject: Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display
number of free counters
Hi Peter,
On 11/28/2024 5:10 AM, Peter Newman wrote:
> Hi Babu, Reinette,
>
> On Wed, Nov 27, 2024 at 8:05 PM Reinette Chatre
> <reinette.chatre@...el.com> wrote:
>>
>> Hi Babu,
>>
>> On 11/27/24 6:57 AM, Moger, Babu wrote:
>>> Hi Reinette,
>>>
>>> On 11/26/2024 5:56 PM, Reinette Chatre wrote:
>>>> Hi Babu,
>>>>
>>>> On 11/26/24 3:31 PM, Moger, Babu wrote:
>>>>> On 11/25/2024 1:00 PM, Reinette Chatre wrote:
>>>>>> On 11/22/24 3:36 PM, Moger, Babu wrote:
>>>>>>> On 11/21/2024 3:12 PM, Reinette Chatre wrote:
>>>>>>>> On 11/19/24 11:20 AM, Moger, Babu wrote:
>>>>>>>>> On 11/15/24 18:31, Reinette Chatre wrote:
>>>>>>>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>>>>>>>>> Provide the interface to display the number of free monitoring counters
>>>>>>>>>>> available for assignment in each doamin when mbm_cntr_assign is supported.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Babu Moger <babu.moger@....com>
>>>>>>>>>>> ---
>>>>>>>>>>> v9: New patch.
>>>>>>>>>>> ---
>>>>>>>>>>> Documentation/arch/x86/resctrl.rst | 4 ++++
>>>>>>>>>>> arch/x86/kernel/cpu/resctrl/monitor.c | 1 +
>>>>>>>>>>> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 33 ++++++++++++++++++++++++++
>>>>>>>>>>> 3 files changed, 38 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>>>>>>>>>>> index 2f3a86278e84..2bc58d974934 100644
>>>>>>>>>>> --- a/Documentation/arch/x86/resctrl.rst
>>>>>>>>>>> +++ b/Documentation/arch/x86/resctrl.rst
>>>>>>>>>>> @@ -302,6 +302,10 @@ with the following files:
>>>>>>>>>>> memory bandwidth tracking to a single memory bandwidth event per
>>>>>>>>>>> monitoring group.
>>>>>>>>>>> +"available_mbm_cntrs":
>>>>>>>>>>> + The number of free monitoring counters available assignment in each domain
>>>>>>>>>>
>>>>>>>>>> "The number of free monitoring counters available assignment" -> "The number of monitoring
>>>>>>>>>> counters available for assignment"?
>>>>>>>>>>
>>>>>>>>>> (not taking into account how text may change after addressing Peter's feedback)
>>>>>>>>>
>>>>>>>>> How about this?
>>>>>>>>>
>>>>>>>>> "The number of monitoring counters available for assignment in each domain
>>>>>>>>> when the architecture supports mbm_cntr_assign mode. There are a total of
>>>>>>>>> "num_mbm_cntrs" counters are available for assignment. Counters can be
>>>>>>>>> assigned or unassigned individually in each domain. A counter is available
>>>>>>>>> for new assignment if it is unassigned in all domains."
>>>>>>>>
>>>>>>>> Please consider the context of this paragraph. It follows right after the description
>>>>>>>> of "num_mbm_cntrs" that states "Up to two counters can be assigned per monitoring group".
>>>>>>>> I think it is confusing to follow that with a paragraph that states "Counters can be
>>>>>>>> assigned or unassigned individually in each domain." I wonder if it may be helpful to
>>>>>>>> use a different term ... for example a counter is *assigned* to an event of a monitoring
>>>>>>>> group but this assignment may be to specified (not yet supported) or all (this work) domains while
>>>>>>>> it is only *programmed*/*activated* to specified domains. Of course, all of this documentation
>>>>>>>> needs to remain coherent if future work decides to indeed support per-domain assignment.
>>>>>>>>
>>>>>>>
>>>>>>> Little bit lost here. Please help me.
>>>>>>
>>>>>> I think this highlights the uncertainty this interface brings. How do you expect users
>>>>>> to use this interface? At this time I think this interface can create a lot of confusion.
>>>>>> For example, consider a hypothetical system with three domains and four counters that
>>>>>> has the following state per mbm_assign_control:
>>>>>>
>>>>>> //0=tl;1=_;2=l #default group uses counters 0 and 1 to monitor total and local MBM
>>>>>> /m1/0=_;1=t;2=t #monitor group m1 uses counter 2, just for total MBM
>>>>>> /m2/0=l;1=_;2=l #monitor group m2 uses counter 3, just for local MBM
>>>>>> /m3/0=_;1=_;2=_
>>>>>>
>>>>>> Since, in this system there are only four counters available, and
>>>>>> they have all been assigned, then there are no new counters available for
>>>>>> assignment.
>>>>>>
>>>>>> If I understand correctly, available_mbm_cntrs will read:
>>>>>> 0=1;1=3;2=1
>>>>>
>>>>> Yes. Exactly. This causes confusion to the user.
>>>>>>
>>>>>> How is a user to interpret the above numbers? It does not reflect
>>>>>> that no counter can be assigned to m3, instead it reflects which of the
>>>>>> already assigned counters still need to be activated on domains.
>>>>>> If, for example, a user is expected to use this file to know how
>>>>>> many counters can still be assigned, should it not reflect the actual
>>>>>> available counters. In the above scenario it will then be:
>>>>>> 0=0;1=0;2=0
>>>>>
>>>>> We can also just print
>>>>> #cat available_mbm_cntrs
>>>>> 0
>>>>>
>>>>> The domain specific information is not important here.
>>>>> That was my original idea. We can go back to that definition. That is more clear to the user.
>>>>
>>>> Tony's response [1] still applies.
>>>>
>>>> I believe Tony's suggestion [2] considered that the available counters will be the
>>>> same for every domain for this implementation. That is why my example noted:
>>>> "0=0;1=0;2=0"
>>>
>>> yes. We can keep it like this.
>>>
>>>>
>>>> The confusion surrounding the global allocator seems to be prevalent ([3], [4]) as folks
>>>> familiar with resctrl attempt to digest the work. The struggle to make this documentation clear
>>>> makes me more concerned how this feature will be perceived by users who are not as familiar with
>>>> resctrl internals. I think that it may be worth it to take a moment and investigate what it will take
>>>> to implement a per-domain counter allocator. The hardware supports it and I suspect that the upfront
>>>> work to do the enabling will make it easier for users to adopt and understand the feature.
>>>>
>>>> What do you think?
>>>
>>> It adds more complexity for sure.
>>
>> I do see a difference in data structures used but the additional complexity is not
>> obvious to me. It seems like there will be one fewer data structure, the
>> global bitmap, and I think that will actually bring with it some simplification since
>> there is no longer the need to coordinate between the per-domain and global counters,
>> for example the logic that only frees a global counter if it is no longer used by a domain.
>>
>> This may also simplify the update of the monitor event config (BMEC) since it can be
>> done directly on counters of the domain instead of needing to go back and forth between
>> global and per-domain counters.
>>
>>>
>>> 1. Each group needs to remember counter ids in each domain for each event.
>>> For example:
>>> Resctrl group mon1
>>> Total event
>>> dom 0 cntr_id 1,
>>> dom 1 cntr_id 10
>>> dom 2 cntr_id 11
>>>
>>> Local event
>>> dom 0 cntr_id 2,
>>> dom 1 cntr_id 15
>>> dom 2 cntr_id 10
>>
>> Indeed. The challenge here is that domains may come and go so it cannot be a simple
>> static array. As an alternative it can be an xarray indexed by the domain ID with
>> pointers to a struct like below to contain the counters associated with the monitor
>> group:
>> struct cntr_id {
>> u32 mbm_total;
>> u32 mbm_local;
>> }
>>
>>
>> Thinking more about how this array needs to be managed made me wonder how the
>> current implementation deals with domains that come and go. I do not think
>> this is currently handled. For example, if a new domain comes online and
>> monitoring groups had counters dynamically assigned, then these counters are
>> not configured to the newly online domain.
I am trying to understand the details of your approach here.
>
> In my prototype, I allocated a counter id-indexed array to each
> monitoring domain structure for tracking the counter allocations,
> because the hardware counters are all domain-scoped. That way the
> tracking data goes away when the hardware does.
>
> I was focused on allowing all pending counter updates to a domain
> resulting from a single mbm_assign_control write to be batched and
> processed in a single IPI, so I structured the counter tracker
> something like this:
Not sure what you meant here. How are you batching two IPIs for two domains?
#echo "//0=t;1=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
This is still a single write. Two IPIs are sent separately, one for each
domain.
Are you doing something different?
>
> struct resctrl_monitor_cfg {
> int closid;
> int rmid;
> int evtid;
> bool dirty;
> };
>
> This mirrors the info needed in whatever register configures the
> counter, plus a dirty flag to skip over the ones that don't need to be
> updated.
This is what my understanding of your implementation.
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index d94abba1c716..9cebf065cc97 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -94,6 +94,13 @@ struct rdt_ctrl_domain {
u32 *mbps_val;
};
+struct resctrl_monitor_cfg {
+ int closid;
+ int rmid;
+ int evtid;
+ bool dirty;
+};
+
/**
* struct rdt_mon_domain - group of CPUs sharing a resctrl monitor
resource
* @hdr: common header for different domain types
@@ -116,6 +123,7 @@ struct rdt_mon_domain {
struct delayed_work cqm_limbo;
int mbm_work_cpu;
int cqm_work_cpu;
+ /* Allocate num_mbm_cntrs entries in each domain */
+ struct resctrl_monitor_cfg *mon_cfg;
};
When a user requests an assignment for total event to the default group
for domain 0, you go search in rdt_mon_domain(dom 0) for empty mon_cfg
entry.
If there is an empty entry, then use that entry for assignment and
update closid, rmid, evtid and dirty = 1. We can get all these
information from default group here.
Does this make sense?
>
> For the benefit of displaying mbm_assign_control, I put a pointer back
> to any counter array entry allocated in the mbm_state struct only
> because it's an existing structure that exists for every rmid-domain
> combination.
Pointer in mbm_state may not be required here.
We are going to loop over resctrl groups. We can search the
rdt_mon_domain to see if specific closid, rmid, evtid is already
assigned or not in that domain.
>
> I didn't need to change the rdtgroup structure.
Ok. That is good.
>
>>
>>>
>>>
>>> 2. We should have a bitmap of "available counters" in each domain. We have
>>> this already. But allocation method changes.
>>
>> Would allocation/free not be simpler with only the per-domain bitmap needing
>> to be consulted?
>>
>> One implementation change I can think of is the dynamic assign of counters when
>> a monitor group is created. Now a free counter needs to be found in each
>> domain. Here it can be discussed if it should be an "all or nothing"
>> assignment but the handling does not seem to be complex and would need to be
>> solved eventually anyway.
>>
>>> 3. Dynamic allocation/free of the counters
>>>
>>> There could be some more things which I can't think right now. It might
>>> come up when we start working on it.
>>>
>>> It is doable. But, is it worth adding this complexity? I am not sure.
>>
>> Please elaborate where you see that this is too complex.
>>
>>>
>>> Peter mentioned earlier that he was not interested in domain specific
>>> assignments. He was only interested in all domain ("*") implementation.
>>
>> Peter's most recent message indicates otherwise:
>> https://lore.kernel.org/all/CALPaoCgiHEaY_cDbCo=537JJ7mkYZDFFDs9heYvtQ80fXuuvWQ@mail.gmail.com/
>
> For now, I'm focused on managing the domains locally whenever possible
> to avoid all IPIs, as this gives me the least overhead.
>
> I'm also prototyping the 'T' vs 't' approach that Reinette
> suggested[1], as this may take a lot of performance pressure off the
> mbm_assign_control interface, as most of the routine updates to
> counter assignments would be automated.
>
> -Peter
>
> [1] https://lore.kernel.org/lkml/7ee63634-3b55-4427-8283-8e3d38105f41@intel.com/
>
--
- Babu Moger
Powered by blists - more mailing lists