linux-kernel - [PATCH] x86/resctrl: Fix buggy overflow when reactivating previously Unavailable RMID

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <515a38328989e48d403ef5a7d6dd321ba3343a61.1759791957.git.babu.moger@amd.com>
Date: Mon, 6 Oct 2025 18:13:44 -0500
From: Babu Moger <babu.moger@....com>
To: <babu.moger@....com>, <tony.luck@...el.com>, <reinette.chatre@...el.com>,
	<Dave.Martin@....com>, <james.morse@....com>, <tglx@...utronix.de>,
	<mingo@...hat.com>, <bp@...en8.de>, <dave.hansen@...ux.intel.com>
CC: <x86@...nel.org>, <hpa@...or.com>, <linux-kernel@...r.kernel.org>,
	<peternewman@...gle.com>, <eranian@...gle.com>, <gautham.shenoy@....com>
Subject: [PATCH] x86/resctrl: Fix buggy overflow when reactivating previously Unavailable RMID

The issue was observed during testing on systems with multiple resctrl
domains, where tasks were dynamically moved between domains.

Users can create as many monitoring groups as the number of RMIDs supported
by the hardware. However, on AMD systems, only a limited number of RMIDs
are guaranteed to be actively tracked by the hardware. RMIDs that exceed
this limit are placed in an "Unavailable" state.

When an RMID transitions from "Unavailable" back to active, the hardware
sets the IA32_QM_CTR.Unavailable (bit 62) on the first read from
MSR_IA32_QM_CTR. This indicates that the RMID was not previously tracked.
Once the hardware begins tracking the RMID, subsequent reads from that RMID
will have the Unavailable bit cleared, as long as it remains tracked.

Problem scenario:
1. The resctrl filesystem is mounted, and a task is assigned to a
   monitoring group.

   $mount -t resctrl resctr /sys/fs/resctrl
   $mkdir /sys/fs/resctr/mon_groups/test1
   $echo 1234 > /sys/fs/resctrl/mon_groups/test1/tasks

   $cat /sys/fs/resctrl/mon_groups/test1/mon_data/l3_mon_*/mbm_total_bytes
   21323            <- Total bytes on domain 0
   "Unavailable"    <- Total bytes on domain 1

   Task is running on domain 0. Counter on domain 1 is "Unavailable".

2. The task runs on domain 0 for a while and then moves to domain 1. The
   counter starts incrementing on domain 1.

   $cat /sys/fs/resctrl/mon_groups/mon_data/l3_mon_*/mbm_total_bytes
   7345357          <- Total bytes on domain 0
   4545             <- Total bytes on domain 1


3. At some point, the RMID in domain 0 transitions to the "Unavailable"
   state because the task is no longer executing in that domain.

   $cat /sys/fs/resctrl/mon_groups/mon_data/l3_mon_*/mbm_total_bytes
   "Unavailable"    <- Total bytes on domain 0
   434341           <- Total bytes on domain 1

4.  Since the task continues to migrate between domains, it may eventually
    return to domain 0.

    $cat /sys/fs/resctrl/mon_groups/mon_data/l3_mon_*/mbm_total_bytes
    17592178699059  <- Overflow on domain 0
    3232332         <- Total bytes on domain 1

    Because the RMID transitions from the “Unavailable” state to the
    active state, the first read sets IA32_QM_CTR.Unavailable (bit 62).
    The following read starts counting from zero, which can be lower than
    the previously saved MSR value (7345357). Consequently, the kernel’s
    overflow logic is triggered—it compares the previous value (7345357)
    with the new, smaller value and mistakenly interprets this as a counter
    overflow, adding a large delta. In reality, this is a false positive:
    the counter didn’t overflow but was simply reset when the RMID
    transitioned from “Unavailable” back to active.

Fix the issue by resetting the prev_msr value to zero when hardware
returns IA32_QM_CTR.Unavailable (bit 62) when the RMID becomes active from
"Unavailable".

Here is the text from APM [1].

"In PQOS Version 2.0 or higher, the MBM hardware will set the U bit on the
first QM_CTR read when it begins tracking an RMID that it was not
previously tracking. The U bit will be zero for all subsequent reads from
that RMID while it is still tracked by the hardware. Therefore, a QM_CTR
read with the U bit set when that RMID is in use by a processor can be
considered 0 when calculating the difference with a subsequent read."

The details are documented in APM [1] available from [2].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
Bandwidth (MBM).

Signed-off-by: Babu Moger <babu.moger@....com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
---

Tested this on multiple AMD systems, but not on Intel systems.
Need help with that. If everything goes well, this patch needs to
go to all the stable kernels.
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c8945610d455..a685370dd160 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -242,7 +242,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 unused, u32 rmid, enum resctrl_event_id eventid,
 			   u64 *val, void *ignored)
 {
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	int cpu = cpumask_any(&d->hdr.cpu_mask);
+	struct arch_mbm_state *am;
 	u64 msr_val;
 	u32 prmid;
 	int ret;
@@ -251,12 +253,21 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 
 	prmid = logical_rmid_to_physical_rmid(cpu, rmid);
 	ret = __rmid_read_phys(prmid, eventid, &msr_val);
-	if (ret)
-		return ret;
 
-	*val = get_corrected_val(r, d, rmid, eventid, msr_val);
+	switch (ret) {
+	case 0:
+		*val = get_corrected_val(r, d, rmid, eventid, msr_val);
+		break;
+	case -EINVAL:
+		am = get_arch_mbm_state(hw_dom, rmid, eventid);
+		if (am)
+			am->prev_msr = 0;
+		break;
+	default:
+		break;
+	}
 
-	return 0;
+	return ret;
 }
 
 static int __cntr_id_read(u32 cntr_id, u64 *val)
-- 
2.34.1