linux-kernel - Re: [PATCH] x86/resctrl: Fix buggy overflow when reactivating previously Unavailable RMID

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fe06aa33-3351-45d4-a687-ff88a689be6e@intel.com>
Date: Tue, 7 Oct 2025 13:23:36 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Babu Moger <babu.moger@....com>, <tony.luck@...el.com>,
	<Dave.Martin@....com>, <james.morse@....com>, <tglx@...utronix.de>,
	<mingo@...hat.com>, <bp@...en8.de>, <dave.hansen@...ux.intel.com>
CC: <x86@...nel.org>, <hpa@...or.com>, <linux-kernel@...r.kernel.org>,
	<peternewman@...gle.com>, <eranian@...gle.com>, <gautham.shenoy@....com>
Subject: Re: [PATCH] x86/resctrl: Fix buggy overflow when reactivating
 previously Unavailable RMID

Hi Babu,

Thank you for catching and preparing a fix for this issue.

On 10/6/25 4:13 PM, Babu Moger wrote:
> The issue was observed during testing on systems with multiple resctrl
> domains, where tasks were dynamically moved between domains.

(please let changelog stand on its own and not be a continuation of subject line)

> 
> Users can create as many monitoring groups as the number of RMIDs supported
> by the hardware. However, on AMD systems, only a limited number of RMIDs
> are guaranteed to be actively tracked by the hardware. RMIDs that exceed
> this limit are placed in an "Unavailable" state.
> 
> When an RMID transitions from "Unavailable" back to active, the hardware
> sets the IA32_QM_CTR.Unavailable (bit 62) on the first read from

(can drop "the" from "sets the IA32_QM_CTR.Unavailable (bit 62)")

(IA32_QM_CTR -> MSR_IA32_QM_CTR) (learning from touchup made during merge of ABMC work)

> MSR_IA32_QM_CTR. This indicates that the RMID was not previously tracked.

This seems like a contradiction to me. Some times, like above, it reads
that Unavailable bit is set on *first* read after counter returns to
"active" but this text (and the spec) also mentions the "Unavailable state" where
where the Unavailable bit is set on *every* read while RMID is in Unavailable
state.

For example, from the spec:
	Potential causes of the U bit being set include (but are not limited to):
	- RMID is not currently tracked by the hardware

> Once the hardware begins tracking the RMID, subsequent reads from that RMID
> will have the Unavailable bit cleared, as long as it remains tracked.

Can we thus expect that Unavailable bit is *always* set when RMID is not tracked? 

> 
> Problem scenario:
> 1. The resctrl filesystem is mounted, and a task is assigned to a
>    monitoring group.
> 
>    $mount -t resctrl resctr /sys/fs/resctrl

(resctr -> resctrl)

>    $mkdir /sys/fs/resctr/mon_groups/test1

(resctr -> resctrl)

>    $echo 1234 > /sys/fs/resctrl/mon_groups/test1/tasks
> 
>    $cat /sys/fs/resctrl/mon_groups/test1/mon_data/l3_mon_*/mbm_total_bytes

(l3_mon* -> mon_L3*)

>    21323            <- Total bytes on domain 0
>    "Unavailable"    <- Total bytes on domain 1
> 
>    Task is running on domain 0. Counter on domain 1 is "Unavailable".

This implies that the Unavailable bit is set while RMID is "in Unavailable state/
not tracked" which seems to support that this bit is always set while RMID is not
tracked, not just on first read after counter returns active.

> 
> 2. The task runs on domain 0 for a while and then moves to domain 1. The
>    counter starts incrementing on domain 1.
> 
>    $cat /sys/fs/resctrl/mon_groups/mon_data/l3_mon_*/mbm_total_bytes

(l3_mon* -> mon_L3*)

>    7345357          <- Total bytes on domain 0
>    4545             <- Total bytes on domain 1
> 
> 
> 3. At some point, the RMID in domain 0 transitions to the "Unavailable"
>    state because the task is no longer executing in that domain.
> 
>    $cat /sys/fs/resctrl/mon_groups/mon_data/l3_mon_*/mbm_total_bytes

(l3_mon* -> mon_L3*)

>    "Unavailable"    <- Total bytes on domain 0
>    434341           <- Total bytes on domain 1
> 
> 4.  Since the task continues to migrate between domains, it may eventually
>     return to domain 0.
> 
>     $cat /sys/fs/resctrl/mon_groups/mon_data/l3_mon_*/mbm_total_bytes

(l3_mon* -> mon_L3*)

>     17592178699059  <- Overflow on domain 0
>     3232332         <- Total bytes on domain 1
> 
>     Because the RMID transitions from the “Unavailable” state to the

This paragraph contains a few non-ascii characters ... like the quotes
around Unavailable above (and below) and a few other characters (’ and —)
below. 

(This happened in ABMC work also but slipped through with Boris needing
to fix. Now I include a check for it so that it can be addressed sooner.
Please check your tools to prevent these from slipping in.)

>     active state, the first read sets IA32_QM_CTR.Unavailable (bit 62).

(IA32_QM_CTR -> MSR_IA32_QM_CTR)

I expected the Unavailable bit to be set the entire time the RMID was
not tracked, not just this "first read"?

>     The following read starts counting from zero, which can be lower than
>     the previously saved MSR value (7345357). Consequently, the kernel’s
>     overflow logic is triggered—it compares the previous value (7345357)
>     with the new, smaller value and mistakenly interprets this as a counter
>     overflow, adding a large delta. In reality, this is a false positive:
>     the counter didn’t overflow but was simply reset when the RMID
>     transitioned from “Unavailable” back to active.
> 
> Fix the issue by resetting the prev_msr value to zero when hardware

"the prev_msr value" -> "arch_mbm_state::prev_msr"
Although, this can be seen from the code. It may be more useful to describe
the fix as "Reset the stored value of MSR_IA32_QM_CTR used to handle
counter overflow every time the RMID is unavailable ..."

> returns IA32_QM_CTR.Unavailable (bit 62) when the RMID becomes active from

(IA32_QM_CTR -> MSR_IA32_QM_CTR)

> "Unavailable".

Related to earlier comments I do not think "when the RMID becomes active from
Unavailable" is accurate. It seems more accurate to say that the value is
set to zero every time the RMID is unavailable/not tracked in preparation for
the RMID to be reset to zero when it starts to be tracked again.

> 
> Here is the text from APM [1].
> 
> "In PQOS Version 2.0 or higher, the MBM hardware will set the U bit on the
> first QM_CTR read when it begins tracking an RMID that it was not
> previously tracking. The U bit will be zero for all subsequent reads from
> that RMID while it is still tracked by the hardware. Therefore, a QM_CTR
> read with the U bit set when that RMID is in use by a processor can be
> considered 0 when calculating the difference with a subsequent read."

Indeed ... and APM also says that Unavailable bit is set on *every* read while
RMID is not tracked? :/

> 
> The details are documented in APM [1] available from [2].

Seems unnecessary (copied from ABMC?)? Earlier quote can just be prefixed with:

	Here is the text from APM [1] available from [2].

> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
> Bandwidth (MBM).

(another learning from ABMC work, this can be more readable with indentation)

	[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
	    Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
	    Bandwidth (MBM).

> 
> Signed-off-by: Babu Moger <babu.moger@....com>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
> ---
> 
> Tested this on multiple AMD systems, but not on Intel systems.
> Need help with that. If everything goes well, this patch needs to
> go to all the stable kernels.

Needs a "Fixes" and "Cc: stable@...r.kernel.org"? 
This patch will only apply to upstream though so would need backporting
support. Will you be able to support this?

> ---
>  arch/x86/kernel/cpu/resctrl/monitor.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index c8945610d455..a685370dd160 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -242,7 +242,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>  			   u32 unused, u32 rmid, enum resctrl_event_id eventid,
>  			   u64 *val, void *ignored)
>  {
> +	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
>  	int cpu = cpumask_any(&d->hdr.cpu_mask);
> +	struct arch_mbm_state *am;
>  	u64 msr_val;
>  	u32 prmid;
>  	int ret;
> @@ -251,12 +253,21 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>  
>  	prmid = logical_rmid_to_physical_rmid(cpu, rmid);
>  	ret = __rmid_read_phys(prmid, eventid, &msr_val);
> -	if (ret)
> -		return ret;
>  
> -	*val = get_corrected_val(r, d, rmid, eventid, msr_val);
> +	switch (ret) {
> +	case 0:
> +		*val = get_corrected_val(r, d, rmid, eventid, msr_val);
> +		break;
> +	case -EINVAL:
> +		am = get_arch_mbm_state(hw_dom, rmid, eventid);
> +		if (am)
> +			am->prev_msr = 0;
> +		break;
> +	default:
> +		break;
> +	}
>  
> -	return 0;
> +	return ret;
>  }
>  
>  static int __cntr_id_read(u32 cntr_id, u64 *val)

The fix looks good to me. 

Reinette