[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3f3b4ca6-e11e-4258-b60c-48b823b7db4f@intel.com>
Date: Tue, 14 Oct 2025 09:24:08 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Babu Moger <bmoger@....com>, <babu.moger@....com>, <tony.luck@...el.com>,
<Dave.Martin@....com>, <james.morse@....com>, <dave.hansen@...ux.intel.com>,
<bp@...en8.de>
CC: <kas@...nel.org>, <rick.p.edgecombe@...el.com>,
<linux-kernel@...r.kernel.org>, <x86@...nel.org>,
<linux-coco@...ts.linux.dev>, <kvm@...r.kernel.org>
Subject: Re: [PATCH] fs/resctrl: Fix MBM events being unconditionally enabled
in mbm_event mode
Hi Babu,
On 10/7/25 7:38 PM, Reinette Chatre wrote:
> On 10/7/25 10:36 AM, Babu Moger wrote:
>> On 10/6/25 20:23, Reinette Chatre wrote:
>>> On 10/6/25 1:38 PM, Moger, Babu wrote:
>>>> On 10/6/25 12:56, Reinette Chatre wrote:
>>>>> On 9/30/25 1:26 PM, Babu Moger wrote:
>>>>>> resctrl features can be enabled or disabled using boot-time kernel
>>>>>> parameters. To turn off the memory bandwidth events (mbmtotal and
>>>>>> mbmlocal), users need to pass the following parameter to the kernel:
>>>>>> "rdt=!mbmtotal,!mbmlocal".
>>>>>
>>>>> ah, indeed ... although, the intention behind the mbmtotal and mbmlocal kernel
>>>>> parameters was to connect them to the actual hardware features identified
>>>>> by X86_FEATURE_CQM_MBM_TOTAL and X86_FEATURE_CQM_MBM_LOCAL respectively.
>>>>>
>>>>>
>>>>>> Found that memory bandwidth events (mbmtotal and mbmlocal) cannot be
>>>>>> disabled when mbm_event mode is enabled. resctrl_mon_resource_init()
>>>>>> unconditionally enables these events without checking if the underlying
>>>>>> hardware supports them.
>>>>>
>>>>> Technically this is correct since if hardware supports ABMC then the
>>>>> hardware is no longer required to support X86_FEATURE_CQM_MBM_TOTAL and
>>>>> X86_FEATURE_CQM_MBM_LOCAL in order to provide mbm_total_bytes
>>>>> and mbm_local_bytes.
>>>>>
>>>>> I can see how this may be confusing to user space though ...
>>>>>
>>>>>>
>>>>>> Remove the unconditional enablement of MBM features in
>>>>>> resctrl_mon_resource_init() to fix the problem. The hardware support
>>>>>> verification is already done in get_rdt_mon_resources().
>>>>>
>>>>> I believe by "hardware support" you mean hardware support for
>>>>> X86_FEATURE_CQM_MBM_TOTAL and X86_FEATURE_CQM_MBM_LOCAL. Wouldn't a fix like
>>>>> this then require any system that supports ABMC to also support
>>>>> X86_FEATURE_CQM_MBM_TOTAL and X86_FEATURE_CQM_MBM_LOCAL to be able to
>>>>> support mbm_total_bytes and mbm_local_bytes?
>>>>
>>>> Yes. That is correct. Right now, ABMC and X86_FEATURE_CQM_MBM_TOTAL/
>>>> X86_FEATURE_CQM_MBM_LOCAL are kind of tightly coupled. We have not clearly
>>>> separated the that.
>>>
>>> Are you speaking from resctrl side since from what I understand these are
>>> independent features from the hardware side?
>>
>> It is independent from hardware side. I meant we still use legacy events from "default" mode.
>
> Thank you for confirming. I was wondering if we need to fix it via cpuid_deps[]
> and resctrl_cpu_detect() to address a hardware dependency. If hardware self
> does not have the dependency then we need to fix it another way.
>
>>
>>>
>>>>> This problem seems to be similar to the one solved by [1] since
>>>>> by supporting ABMC there is no "hardware does not support mbmtotal/mbmlocal"
>>>>> but instead there only needs to be a check if the feature has been disabled
>>>>> by command line. That is, add a rdt_is_feature_enabled() check to the
>>>>> existing "!resctrl_is_mon_event_enabled()" check?
>>>>
>>>> Enable or disable needs to be done at get_rdt_mon_resources(). It needs to
>>>> be done early in the initialization before calling domain_add_cpu() where
>>>> event data structures (mbm_states aarch_mbm_states) are allocated.
>>>
>>> Good point. My mistake to suggest the event should be enabled by
>>> resctrl fs.
>>
>>
>> How about adding another check in get_rdt_mon_resources()?
>>
>> if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)
>> || rdt_is_feature_enabled(mbmtotal)) {
>> resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID);
>> ret = true;
>> }
>
> Something like this yes. I think it should be in rdt_get_mon_l3_config() though, within
> the ABMC feature settings. If not then there may be an issue if the user boots with
> rdt=!abmc? I cannot see why the rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL) check is needed,
> which flow are you addressing?
>
> Before we exchange code I would like to step back a bit just to be clear that we agree
> on the current issues and what user space may expect. After this it should be easier to
> exchange code. (more below)
>
>>
>> I need to take Tony's patch for this.
>>
>>>
>>>>
>>>>>
>>>>> But wait ... I think there may be a bigger problem when considering systems
>>>>> that support ABMC but not X86_FEATURE_CQM_MBM_TOTAL and X86_FEATURE_CQM_MBM_LOCAL.
>>>>> Shouldn't resctrl prevent such a system from switching to "default"
>>>>> mbm_assign_mode? Otherwise resctrl will happily let such a system switch
>>>>> to default mode and when user attempts to read an event file resctrl will
>>>>> attempt to read it via MSRs that are not supported.
>>>>> Looks like ABMC may need something similar to CONFIG_RESCTRL_ASSIGN_FIXED
>>>>> to handle this case in show() while preventing user space from switching to
>>>>> "default" mode on write()?
>>>>
>>>> This may not be an issue right now. When X86_FEATURE_CQM_MBM_TOTAL and
>>>> X86_FEATURE_CQM_MBM_LOCAL are not supported then mon_data files of these
>>>> events are not created.
>>>
>>> By "right now" I assume you mean the current implementation? I think your statement
>>> assumes that no CPUs come or go after resctrl_mon_resource_init() enables the MBM events?
>>> Current implementation will enable MBM events if ABMC is supported. When the
>>> first CPU of a domain comes online after that then resctrl will create the mon_data
>>> files. These files will remain if a user then switches to default mode and if
>>> the user then attempts to read one of these counters then I expect problems.
>>
>> Yes. It will be a problem in the that case.
>
> Thinking about this more the issue is not about the mon_data files being created since
> they are only created if resctrl is mounted and resctrl_mon_resource_init() is run
> before creating the mountpoint. From what I can tell current MBM events supported by
> ABMC will be enabled at the time resctrl can be mounted so if X86_FEATURE_CQM_MBM_TOTAL
> and X86_FEATURE_CQM_MBM_LOCAL are not supported but ABMC is then I believe the
> mon_data files will be created.
>
> There is a problem with the actual domain creation during resctrl initialization
> where the MBM state data structures are created and depend on the events being
> enabled then.
> resctrl assumes that if an event is enabled then that event's associated
> rdt_mon_domain::mbm_states and rdt_hw_mon_domain::arch_mbm_states exist and if
> those data structures are created (or not created) during CPU online and MBM
> event comes online later then there will be invalid memory accesses.
>
> The conclusion is the same though ... the events need to be initialized during
> resctrl initialization as you note above.
>
>>
>> I am not clear on using config option you mentioned above.
>
> This is more about what is accomplished by the config option than whether it is
> a config option that controls the flow. More below but I believe there may be
> scenarios where only mbm_event is supported and in that case I expect, even on AMD,
> it may be possible that there is no supported "default" mode and thus:
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> [mbm_event]
>
>>
>> What about using the check resctrl_is_mon_event_enabled() in
>>
>> resctrl_mbm_assign_mode_show() and resctrl_mbm_assign_mode_write() ?
>>
>
> Trying to think through how to support a system that can switch between default
> and mbm_event mode I see a couple of things to consider. This is as I am thinking
> through the flows without able to experiment. I think it may help if you could sanity
> check this with perhaps a few experiments to considering the flows yourself to see where
> I am missing things.
>
> When we are clear on the flows to support and how to interact with user space it will
> be easier to start exchanging code.
>
> a) MBM state data structures
> As mentioned above, rdt_mon_domain::mbm_states and rdt_hw_mon_domain::arch_mbm_states
> are created during CPU online based on MBM event enabled state. During runtime
> an enabled MBM event is assumed to have state.
> To me this implies that any possible MBM event should be enabled during early
> initialization.
> A consequence is that any possible MBM event will have its associated event file
> created even if the active mode of the time cannot support it. (I do not think
> we want to have event files come and go).
> b) Switching between modes.
> From what I can tell switching mode is always allowed as long as system supports
> assignable counters and that may not be correct. Consider a system that supports
> ABMC but does not support X86_FEATURE_CQM_MBM_TOTAL and/or X86_FEATURE_CQM_MBM_LOCAL ...
> should it be allowed to switch to "default" mode? At this time I believe this is allowed
> yet this is an unusable state (as far as MBM goes) and I expect any attempt at reading
> an event file will result in invalid MSR access?
> Complexity increases if there is a mismatch in supported events, for example if mbm_event
> mode supports total and local but default mode only supports one. Should it be allowed
> to switch modes? If so, user can then still read from both files, the check whether assignable
> counters is enabled will fail and resctrl will attempt to read both via the counter MSRs,
> even an unsupported event (continued below).
> c) Read of event file
> A user can read from event file any time even if active mode (default or mbm_event) does
> not support it. If mbm_event mode is enabled then resctrl will attempt to use counters,
> if default mode is enabled then resctrl will attempt to use MSRs.
> This currently entirely depends on whether mbm_event mode is enabled or not.
> Perhaps we should add checks here to prevent user from reading an event if the
> active mode does not support it? Alternatively prevent user from switching to a mode
> that cannot be supported.
>
> Look forward to how you view things and thoughts on how user may expect to interact with these
> features.
I am concerned about this issue. The original changelog only mentions that events are enabled when
they should not be but it looks to me that there is a more serious issue if the user then attempts
to read from such an event. Have you tried the scenario when a user boots with the parameters
mentioned in changelog (rdt=!mbmtotal,!mbmlocal) and then attempts to read one of these events?
Reading from the event will attempt to access its architectural state but from what I can tell
that will not be allocated since the events are not enabled at the time of the allocation.
This needs to be fixed during this cycle. A week has passed since my previous message so I do not
think that it will be possible to create a full featured solution that keeps X86_FEATURE_ABMC
and X86_FEATURE_CQM_MBM_TOTAL/X86_FEATURE_CQM_MBM_LOCAL independent.
What do you think of something like below that builds on your original change and additionally
enforces dependency between these features to support the resctrl assumptions? From what I understand
this is ok for current AMD hardware? A not-as-urgent follow-up can make these features independent
again?
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c8945610d455..fd42fe7b2fdc 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -452,7 +452,16 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
r->mon.mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS;
}
- if (rdt_cpu_has(X86_FEATURE_ABMC)) {
+ /*
+ * resctrl assumes a system that supports assignable counters can
+ * switch to "default" mode. Ensure that there is a "default" mode
+ * to switch to. This enforces a dependency between the independent
+ * X86_FEATURE_ABMC and X86_FEATURE_CQM_MBM_TOTAL/X86_FEATURE_CQM_MBM_LOCAL
+ * hardware features.
+ */
+ if (rdt_cpu_has(X86_FEATURE_ABMC) &&
+ (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL) ||
+ rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL))) {
r->mon.mbm_cntr_assignable = true;
cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 4076336fbba6..572a9925bd6c 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -1782,15 +1782,13 @@ int resctrl_mon_resource_init(void)
mba_mbps_default_event = QOS_L3_MBM_TOTAL_EVENT_ID;
if (r->mon.mbm_cntr_assignable) {
- if (!resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
- resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID);
- if (!resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
- resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID);
- mon_event_all[QOS_L3_MBM_TOTAL_EVENT_ID].evt_cfg = r->mon.mbm_cfg_mask;
- mon_event_all[QOS_L3_MBM_LOCAL_EVENT_ID].evt_cfg = r->mon.mbm_cfg_mask &
- (READS_TO_LOCAL_MEM |
- READS_TO_LOCAL_S_MEM |
- NON_TEMP_WRITE_TO_LOCAL_MEM);
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
+ mon_event_all[QOS_L3_MBM_TOTAL_EVENT_ID].evt_cfg = r->mon.mbm_cfg_mask;
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
+ mon_event_all[QOS_L3_MBM_LOCAL_EVENT_ID].evt_cfg = r->mon.mbm_cfg_mask &
+ (READS_TO_LOCAL_MEM |
+ READS_TO_LOCAL_S_MEM |
+ NON_TEMP_WRITE_TO_LOCAL_MEM);
r->mon.mbm_assign_on_mkdir = true;
resctrl_file_fflags_init("num_mbm_cntrs",
RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
Powered by blists - more mailing lists