linux-kernel - Re: [PATCH v2] x86/resctrl: Fix buggy overflow when reactivating previously Unavailable RMID

Message-ID: <78dcda7c-b3f2-4149-b6f8-3da695d83bdb@intel.com>
Date: Wed, 8 Oct 2025 19:00:35 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Babu Moger <babu.moger@....com>, <tony.luck@...el.com>,
	<Dave.Martin@....com>, <james.morse@....com>, <tglx@...utronix.de>,
	<mingo@...hat.com>, <bp@...en8.de>, <dave.hansen@...ux.intel.com>
CC: <x86@...nel.org>, <hpa@...or.com>, <linux-kernel@...r.kernel.org>,
	<peternewman@...gle.com>, <eranian@...gle.com>, <gautham.shenoy@....com>
Subject: Re: [PATCH v2] x86/resctrl: Fix buggy overflow when reactivating
 previously Unavailable RMID

Hi Babu,

On 10/8/25 12:39 PM, Babu Moger wrote:
> Users can create as many monitoring groups as the number of RMIDs supported
> by the hardware. However, on AMD systems, only a limited number of RMIDs
> are guaranteed to be actively tracked by the hardware. RMIDs that exceed
> this limit are placed in an "Unavailable" state. When a bandwidth counter
> is read for such an RMID, the hardware sets MSR_IA32_QM_CTR.Unavailable
> (bit 62).

To make this context complete I think you can append something like: 
	When such an RMID starts being tracked again the hardware counter is
	reset to zero. MSR_IA32_QM_CTR.Unavailable remains set on first read after
	tracking re-starts and is clear on all subsequent reads as long as the
	RMID is tracked.
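
The read behaviour described above can be modelled in a few lines of C. This is only a simplified sketch: the struct, the helper names and the explicit untrack/retrack calls are made up for illustration (real hardware decides on its own when an RMID is tracked), and only bit 62 of MSR_IA32_QM_CTR is modelled.

```c
#include <stdbool.h>
#include <stdint.h>

#define QM_CTR_UNAVAIL (1ULL << 62) /* MSR_IA32_QM_CTR.Unavailable (bit 62) */

/* Hypothetical model of one RMID's bandwidth counter in one domain. */
struct rmid_model {
	bool tracked;     /* hardware is currently counting for this RMID */
	bool first_read;  /* no read has happened since tracking (re)started */
	uint64_t count;   /* hardware count, zeroed when tracking restarts */
};

/* Hardware stops tracking the RMID: reads report Unavailable from now on. */
static void model_untrack(struct rmid_model *m)
{
	m->tracked = false;
}

/* Hardware starts tracking the RMID again: the count restarts from zero. */
static void model_retrack(struct rmid_model *m)
{
	m->tracked = true;
	m->first_read = true;
	m->count = 0;
}

/* Traffic is only accumulated while the RMID is tracked. */
static void model_traffic(struct rmid_model *m, uint64_t n)
{
	if (m->tracked)
		m->count += n;
}

/* Model of one counter read, returning the raw value including bit 62. */
static uint64_t model_read(struct rmid_model *m)
{
	if (!m->tracked)
		return QM_CTR_UNAVAIL;
	if (m->first_read) {
		/* First read after tracking restarts still has Unavailable set. */
		m->first_read = false;
		return QM_CTR_UNAVAIL;
	}
	return m->count;
}
```

In this model a reader sees a count, then one or more Unavailable reads, then a count that started again from zero, which is the sequence the changelog needs to describe.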

> 
> The problem occurs when an RMID transitions from the “Unavailable” state

Which problem? (Please let the changelog stand on its own rather than read as a continuation of the subject.)

> back to the active state. When this happens, the hardware resets the
> counter to zero, but the kernel compares this new smaller value with the
> previously saved MSR value and mistakenly interprets it as an overflow.

I do not think this is just about overflow. Overflow is certainly the
most visible symptom, but the stored counter value may also be smaller than
the new counter value, which results in undercounting of bandwidth? (Ignoring
that not counting at all while the RMID is unavailable is technically also
undercounting.)

Would something like below be accurate?

	resctrl miscounts the bandwidth events after an RMID transitions
	from the "Unavailable" state back to being tracked. This happens
	because when the hardware starts counting again after resetting the
	counter to zero, resctrl in turn compares the new count against the
	counter value stored from the previous time the RMID was tracked.
	This results in resctrl computing an event value that is either
	undercounting (when new counter is more than stored counter) or a
	mistaken overflow (when new counter is less than stored counter).
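
Both outcomes can be reproduced with a small worked example. The sketch below assumes a 44-bit counter and a scale factor of 1 between counter units and bytes; the function name is made up, and the shift-based subtraction is only modelled on how a wrap-tolerant delta is typically computed, not copied from the patch.

```c
#include <stdint.h>

/*
 * Delta between two raw reads of a counter that is `width` bits wide
 * (1 <= width <= 64), tolerating a single wrap-around. Shifting both
 * values up to bit 63 lets plain unsigned subtraction do the
 * modulo-2^width arithmetic.
 */
static uint64_t counter_delta(uint64_t prev, uint64_t cur, unsigned int width)
{
	unsigned int shift = 64 - width;

	return ((cur << shift) - (prev << shift)) >> shift;
}

/*
 * With the stale stored value 7345357 from the scenario and width 44:
 *
 *   counter_delta(7345357, 0, 44)       == 17592178699059
 *       mistaken overflow: 2^44 - 7345357, although nothing wrapped
 *
 *   counter_delta(7345357, 8000000, 44) == 654643
 *       undercount: 8000000 units were counted since the reset
 *       (8000000 is a hypothetical new count)
 */
```

Under those assumptions the first case matches the 17592178699059 reported for domain 0 in step 4 of the scenario quoted below.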

If you agree with the summary then please update the subject to match. For example,
"x86/resctrl: Fix miscount of bandwidth event when reactivating previously Unavailable RMID"

I think Dave's feedback about changelog length is valid. The changelog can present the
fix at this point and leave the detailed description of the overflow scenario to the end of
the changelog, under a heading that the reader can use to decide whether to skip it if the
problem is clear or to use it as a reference to see the problem in action.

I also recommend that the fix be specific and avoid vague statements like "to resolve the issue".
For example,

	Reset the stored value (arch_mbm_state::prev_msr) of MSR_IA32_QM_CTR to zero
	whenever the RMID is in the "Unavailable" state to ensure accurate
	counting after the counter resets to zero when the RMID starts to be
	tracked again.
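
As a rough illustration of what that wording asks for, here is a user-space sketch of the read path. It is not the patch itself: the struct is a hypothetical stand-in for arch_mbm_state, the function name and return convention are made up, and a 44-bit counter width is assumed.

```c
#include <stdint.h>

#define QM_CTR_UNAVAIL (1ULL << 62) /* MSR_IA32_QM_CTR.Unavailable (bit 62) */
#define CTR_WIDTH      44           /* assumed counter width in bits */

/* Hypothetical stand-in for the per-RMID state kept between reads. */
struct mbm_state_sketch {
	uint64_t prev_msr; /* raw count seen on the previous successful read */
	uint64_t chunks;   /* running total of counted units */
};

/*
 * Fold one raw counter read into the running total.
 * Returns 0 on success, -1 when the read reported Unavailable.
 */
static int mbm_update_sketch(struct mbm_state_sketch *s, uint64_t raw)
{
	unsigned int shift = 64 - CTR_WIDTH;

	if (raw & QM_CTR_UNAVAIL) {
		/*
		 * The counter restarts from zero once the RMID is tracked
		 * again, so forget the stale value instead of comparing
		 * against it later.
		 */
		s->prev_msr = 0;
		return -1;
	}

	/* Wrap-tolerant delta against the previous successful read. */
	s->chunks += ((raw << shift) - (s->prev_msr << shift)) >> shift;
	s->prev_msr = raw;
	return 0;
}
```

Feeding this the reads 7345357, Unavailable, Unavailable, 4545 leaves chunks at 7349902 rather than adding a spurious value close to 2^44.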

> 
> Problem scenario:

The portion below can have a heading to help the reader identify its purpose. For example,

Example scenario that results in mistaken overflow
==================================================


> 1. The resctrl filesystem is mounted, and a task is assigned to a
>    monitoring group.
> 
>    $mount -t resctrl resctrl /sys/fs/resctrl
>    $mkdir /sys/fs/resctrl/mon_groups/test1/
>    $echo 1234 > /sys/fs/resctrl/mon_groups/test1/tasks
> 
>    $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>    21323            <- Total bytes on domain 0
>    "Unavailable"    <- Total bytes on domain 1
> 
>    Task is running on domain 0. Counter on domain 1 is "Unavailable".
> 
> 2. The task runs on domain 0 for a while and then moves to domain 1. The
>    counter starts incrementing on domain 1.
> 
>    $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>    7345357          <- Total bytes on domain 0
>    4545             <- Total bytes on domain 1
> 
> 
> 3. At some point, the RMID in domain 0 transitions to the "Unavailable"
>    state because the task is no longer executing in that domain.
> 
>    $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>    "Unavailable"    <- Total bytes on domain 0
>    434341           <- Total bytes on domain 1
> 
> 4.  Since the task continues to migrate between domains, it may eventually
>     return to domain 0.
> 
>     $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>     17592178699059  <- Overflow on domain 0
>     3232332         <- Total bytes on domain 1
> 

Is the text below intended to be indented?

>     In this case, the RMID on domain 0 transitions from “Unavailable”
>     state to the active state. The hardware sets MSR_IA32_QM_CTR.Unavailable

"active state" -> "tracked state" (to be consistent with terminology - not sure what
is preferred between "active" and "tracked" but please be consistent)

>     (bit 62) when the counter is read and begins tracking the RMID counting
>     from 0. Subsequent reads succeed but may return a value smaller than the

"may return" -> "returns"

>     previously saved MSR value (7345357). Consequently, the kernel’s overflow

"the kernel’s" -> "resctrl's"?

>     logic is triggered—it compares the previous value (7345357) with the new,
>     smaller value and incorrectly interprets this as a counter overflow,
>     adding a large delta. In reality, this is a false positive: the counter
>     did not overflow but was simply reset when the RMID transitioned from
>     “Unavailable” back to active.

Here is what I do to check for non-ascii characters:
$ b4 am <message ID>
$ grep -P '[^\t\n\x20-\x7E]' <downloaded patch>

Could you please try it out on this patch and fix the matches?

> 
> Reset the stored value (arch_mbm_state::prev_msr) of MSR_IA32_QM_CTR, used
> for handling counter overflows, whenever the RMID transitions to the
> “Unavailable” state to resolve the issue.
> 
> Here is the text from APM [1] available from [2].
> 
> "In PQOS Version 2.0 or higher, the MBM hardware will set the U bit on the
> first QM_CTR read when it begins tracking an RMID that it was not
> previously tracking. The U bit will be zero for all subsequent reads from
> that RMID while it is still tracked by the hardware. Therefore, a QM_CTR
> read with the U bit set when that RMID is in use by a processor can be
> considered 0 when calculating the difference with a subsequent read."
> 
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>     Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
>     Bandwidth (MBM).
> 
> Cc: stable@...r.kernel.org # needs adjustments for <= v6.17

The tag ordering guide ("Ordering of commit tags") in
Documentation/process/maintainer-tip.rst places the "Cc" tag just before
the "Link:" tag.

> Fixes: 4d05bf71f157d ("x86/resctrl: Introduce AMD QOS feature")
> Signed-off-by: Babu Moger <babu.moger@....com>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
> ---

Reinette