[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fe06aa33-3351-45d4-a687-ff88a689be6e@intel.com>
Date: Tue, 7 Oct 2025 13:23:36 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Babu Moger <babu.moger@....com>, <tony.luck@...el.com>,
<Dave.Martin@....com>, <james.morse@....com>, <tglx@...utronix.de>,
<mingo@...hat.com>, <bp@...en8.de>, <dave.hansen@...ux.intel.com>
CC: <x86@...nel.org>, <hpa@...or.com>, <linux-kernel@...r.kernel.org>,
<peternewman@...gle.com>, <eranian@...gle.com>, <gautham.shenoy@....com>
Subject: Re: [PATCH] x86/resctrl: Fix buggy overflow when reactivating
previously Unavailable RMID
Hi Babu,
Thank you for catching and preparing a fix for this issue.
On 10/6/25 4:13 PM, Babu Moger wrote:
> The issue was observed during testing on systems with multiple resctrl
> domains, where tasks were dynamically moved between domains.
(please let changelog stand on its own and not be a continuation of subject line)
>
> Users can create as many monitoring groups as the number of RMIDs supported
> by the hardware. However, on AMD systems, only a limited number of RMIDs
> are guaranteed to be actively tracked by the hardware. RMIDs that exceed
> this limit are placed in an "Unavailable" state.
>
> When an RMID transitions from "Unavailable" back to active, the hardware
> sets the IA32_QM_CTR.Unavailable (bit 62) on the first read from
(can drop "the" from "sets the IA32_QM_CTR.Unavailable (bit 62)")
(IA32_QM_CTR -> MSR_IA32_QM_CTR) (learning from touchup made during merge of ABMC work)
> MSR_IA32_QM_CTR. This indicates that the RMID was not previously tracked.
This seems like a contradiction to me. Some times, like above, it reads
that Unavailable bit is set on *first* read after counter returns to
"active" but this text (and the spec) also mentions the "Unavailable state" where
where the Unavailable bit is set on *every* read while RMID is in Unavailable
state.
For example, from the spec:
Potential causes of the U bit being set include (but are not limited to):
- RMID is not currently tracked by the hardware
> Once the hardware begins tracking the RMID, subsequent reads from that RMID
> will have the Unavailable bit cleared, as long as it remains tracked.
Can we thus expect that Unavailable bit is *always* set when RMID is not tracked?
>
> Problem scenario:
> 1. The resctrl filesystem is mounted, and a task is assigned to a
> monitoring group.
>
> $mount -t resctrl resctr /sys/fs/resctrl
(resctr -> resctrl)
> $mkdir /sys/fs/resctr/mon_groups/test1
(resctr -> resctrl)
> $echo 1234 > /sys/fs/resctrl/mon_groups/test1/tasks
>
> $cat /sys/fs/resctrl/mon_groups/test1/mon_data/l3_mon_*/mbm_total_bytes
(l3_mon* -> mon_L3*)
> 21323 <- Total bytes on domain 0
> "Unavailable" <- Total bytes on domain 1
>
> Task is running on domain 0. Counter on domain 1 is "Unavailable".
This implies that the Unavailable bit is set while RMID is "in Unavailable state/
not tracked" which seems to support that this bit is always set while RMID is not
tracked, not just on first read after counter returns active.
>
> 2. The task runs on domain 0 for a while and then moves to domain 1. The
> counter starts incrementing on domain 1.
>
> $cat /sys/fs/resctrl/mon_groups/mon_data/l3_mon_*/mbm_total_bytes
(l3_mon* -> mon_L3*)
> 7345357 <- Total bytes on domain 0
> 4545 <- Total bytes on domain 1
>
>
> 3. At some point, the RMID in domain 0 transitions to the "Unavailable"
> state because the task is no longer executing in that domain.
>
> $cat /sys/fs/resctrl/mon_groups/mon_data/l3_mon_*/mbm_total_bytes
(l3_mon* -> mon_L3*)
> "Unavailable" <- Total bytes on domain 0
> 434341 <- Total bytes on domain 1
>
> 4. Since the task continues to migrate between domains, it may eventually
> return to domain 0.
>
> $cat /sys/fs/resctrl/mon_groups/mon_data/l3_mon_*/mbm_total_bytes
(l3_mon* -> mon_L3*)
> 17592178699059 <- Overflow on domain 0
> 3232332 <- Total bytes on domain 1
>
> Because the RMID transitions from the “Unavailable” state to the
This paragraph contains a few non-ascii characters ... like the quotes
around Unavailable above (and below) and a few other characters (’ and —)
below.
(This happened in ABMC work also but slipped through with Boris needing
to fix. Now I include a check for it so that it can be addressed sooner.
Please check your tools to prevent these from slipping in.)
> active state, the first read sets IA32_QM_CTR.Unavailable (bit 62).
(IA32_QM_CTR -> MSR_IA32_QM_CTR)
I expected the Unavailable bit to be set the entire time the RMID was
not tracked, not just this "first read"?
> The following read starts counting from zero, which can be lower than
> the previously saved MSR value (7345357). Consequently, the kernel’s
> overflow logic is triggered—it compares the previous value (7345357)
> with the new, smaller value and mistakenly interprets this as a counter
> overflow, adding a large delta. In reality, this is a false positive:
> the counter didn’t overflow but was simply reset when the RMID
> transitioned from “Unavailable” back to active.
>
> Fix the issue by resetting the prev_msr value to zero when hardware
"the prev_msr value" -> "arch_mbm_state::prev_msr"
Although, this can be seen from the code. It may be more useful to describe
the fix as "Reset the stored value of MSR_IA32_QM_CTR used to handle
counter overflow every time the RMID is unavailable ..."
> returns IA32_QM_CTR.Unavailable (bit 62) when the RMID becomes active from
(IA32_QM_CTR -> MSR_IA32_QM_CTR)
> "Unavailable".
Related to earlier comments I do not think "when the RMID becomes active from
Unavailable" is accurate. It seems more accurate to say that the value is
set to zero every time the RMID is unavailable/not tracked in preparation for
the RMID to be reset to zero when it starts to be tracked again.
>
> Here is the text from APM [1].
>
> "In PQOS Version 2.0 or higher, the MBM hardware will set the U bit on the
> first QM_CTR read when it begins tracking an RMID that it was not
> previously tracking. The U bit will be zero for all subsequent reads from
> that RMID while it is still tracked by the hardware. Therefore, a QM_CTR
> read with the U bit set when that RMID is in use by a processor can be
> considered 0 when calculating the difference with a subsequent read."
Indeed ... and APM also says that Unavailable bit is set on *every* read while
RMID is not tracked? :/
>
> The details are documented in APM [1] available from [2].
Seems unnecessary (copied from ABMC?)? Earlier quote can just be prefixed with:
Here is the text from APM [1] available from [2].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
> Bandwidth (MBM).
(another learning from ABMC work, this can be more readable with indentation)
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
Bandwidth (MBM).
>
> Signed-off-by: Babu Moger <babu.moger@....com>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
> ---
>
> Tested this on multiple AMD systems, but not on Intel systems.
> Need help with that. If everything goes well, this patch needs to
> go to all the stable kernels.
Needs a "Fixes" and "Cc: stable@...r.kernel.org"?
This patch will only apply to upstream though so would need backporting
support. Will you be able to support this?
> ---
> arch/x86/kernel/cpu/resctrl/monitor.c | 19 +++++++++++++++----
> 1 file changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index c8945610d455..a685370dd160 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -242,7 +242,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
> u32 unused, u32 rmid, enum resctrl_event_id eventid,
> u64 *val, void *ignored)
> {
> + struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
> int cpu = cpumask_any(&d->hdr.cpu_mask);
> + struct arch_mbm_state *am;
> u64 msr_val;
> u32 prmid;
> int ret;
> @@ -251,12 +253,21 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>
> prmid = logical_rmid_to_physical_rmid(cpu, rmid);
> ret = __rmid_read_phys(prmid, eventid, &msr_val);
> - if (ret)
> - return ret;
>
> - *val = get_corrected_val(r, d, rmid, eventid, msr_val);
> + switch (ret) {
> + case 0:
> + *val = get_corrected_val(r, d, rmid, eventid, msr_val);
> + break;
> + case -EINVAL:
> + am = get_arch_mbm_state(hw_dom, rmid, eventid);
> + if (am)
> + am->prev_msr = 0;
> + break;
> + default:
> + break;
> + }
>
> - return 0;
> + return ret;
> }
>
> static int __cntr_id_read(u32 cntr_id, u64 *val)
The fix looks good to me.
Reinette
Powered by blists - more mailing lists