linux-kernel - Re: [PATCH] x86/resctrl: Fix buggy overflow when reactivating previously Unavailable RMID

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3e1b459a-bd9a-400b-8c10-038b330e539f@amd.com>
Date: Tue, 7 Oct 2025 18:47:23 -0500
From: "Moger, Babu" <bmoger@....com>
To: Reinette Chatre <reinette.chatre@...el.com>,
 Babu Moger <babu.moger@....com>, tony.luck@...el.com, Dave.Martin@....com,
 james.morse@....com, tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com
Cc: x86@...nel.org, hpa@...or.com, linux-kernel@...r.kernel.org,
 peternewman@...gle.com, eranian@...gle.com, gautham.shenoy@....com
Subject: Re: [PATCH] x86/resctrl: Fix buggy overflow when reactivating
 previously Unavailable RMID



On 10/7/2025 3:23 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> Thank you for catching and preparing a fix for this issue.
> 
> On 10/6/25 4:13 PM, Babu Moger wrote:
>> The issue was observed during testing on systems with multiple resctrl
>> domains, where tasks were dynamically moved between domains.
> 
> (please let changelog stand on its own and not be a continuation of subject line)

ok. Will  remove the above first line.
> 
>>
>> Users can create as many monitoring groups as the number of RMIDs supported
>> by the hardware. However, on AMD systems, only a limited number of RMIDs
>> are guaranteed to be actively tracked by the hardware. RMIDs that exceed
>> this limit are placed in an "Unavailable" state.
>>
>> When an RMID transitions from "Unavailable" back to active, the hardware
>> sets the IA32_QM_CTR.Unavailable (bit 62) on the first read from
> 
> (can drop "the" from "sets the IA32_QM_CTR.Unavailable (bit 62)")
> 
> (IA32_QM_CTR -> MSR_IA32_QM_CTR) (learning from touchup made during merge of ABMC work)

Sure.

>> MSR_IA32_QM_CTR. This indicates that the RMID was not previously tracked.
> 
> This seems like a contradiction to me. Some times, like above, it reads
> that Unavailable bit is set on *first* read after counter returns to
> "active" but this text (and the spec) also mentions the "Unavailable state" where
> where the Unavailable bit is set on *every* read while RMID is in Unavailable
> state.

You are right. Will change the text to make that clear.

> 
> For example, from the spec:
> 	Potential causes of the U bit being set include (but are not limited to):
> 	- RMID is not currently tracked by the hardware
> 
>> Once the hardware begins tracking the RMID, subsequent reads from that RMID
>> will have the Unavailable bit cleared, as long as it remains tracked.
> 
> Can we thus expect that Unavailable bit is *always* set when RMID is not tracked?

Yes.
> 
>>
>> Problem scenario:
>> 1. The resctrl filesystem is mounted, and a task is assigned to a
>>     monitoring group.
>>
>>     $mount -t resctrl resctr /sys/fs/resctrl
> 
> (resctr -> resctrl)

Sure.
> 
>>     $mkdir /sys/fs/resctr/mon_groups/test1
> 
> (resctr -> resctrl)

Sure.
> 
>>     $echo 1234 > /sys/fs/resctrl/mon_groups/test1/tasks
>>
>>     $cat /sys/fs/resctrl/mon_groups/test1/mon_data/l3_mon_*/mbm_total_bytes
> 
> (l3_mon* -> mon_L3*)

Sure.
> 
>>     21323            <- Total bytes on domain 0
>>     "Unavailable"    <- Total bytes on domain 1
>>
>>     Task is running on domain 0. Counter on domain 1 is "Unavailable".
> 
> This implies that the Unavailable bit is set while RMID is "in Unavailable state/
> not tracked" which seems to support that this bit is always set while RMID is not
> tracked, not just on first read after counter returns active.

Correct.

> 
>>
>> 2. The task runs on domain 0 for a while and then moves to domain 1. The
>>     counter starts incrementing on domain 1.
>>
>>     $cat /sys/fs/resctrl/mon_groups/mon_data/l3_mon_*/mbm_total_bytes
> 
> (l3_mon* -> mon_L3*)

Sure.

> 
>>     7345357          <- Total bytes on domain 0
>>     4545             <- Total bytes on domain 1
>>
>>
>> 3. At some point, the RMID in domain 0 transitions to the "Unavailable"
>>     state because the task is no longer executing in that domain.
>>
>>     $cat /sys/fs/resctrl/mon_groups/mon_data/l3_mon_*/mbm_total_bytes
> 
> (l3_mon* -> mon_L3*)

Sure.

> 
>>     "Unavailable"    <- Total bytes on domain 0
>>     434341           <- Total bytes on domain 1
>>
>> 4.  Since the task continues to migrate between domains, it may eventually
>>      return to domain 0.
>>
>>      $cat /sys/fs/resctrl/mon_groups/mon_data/l3_mon_*/mbm_total_bytes
> 
> (l3_mon* -> mon_L3*)

Sure.

> 
>>      17592178699059  <- Overflow on domain 0
>>      3232332         <- Total bytes on domain 1
>>
>>      Because the RMID transitions from the “Unavailable” state to the
> 
> This paragraph contains a few non-ascii characters ... like the quotes
> around Unavailable above (and below) and a few other characters (’ and —)
> below.
> 
> (This happened in ABMC work also but slipped through with Boris needing
> to fix. Now I include a check for it so that it can be addressed sooner.
> Please check your tools to prevent these from slipping in.)

Let me check that.

> 
>>      active state, the first read sets IA32_QM_CTR.Unavailable (bit 62).
> 
> (IA32_QM_CTR -> MSR_IA32_QM_CTR)
> 
> I expected the Unavailable bit to be set the entire time the RMID was
> not tracked, not just this "first read"?

Yes. That is correct.

> 
>>      The following read starts counting from zero, which can be lower than
>>      the previously saved MSR value (7345357). Consequently, the kernel’s
>>      overflow logic is triggered—it compares the previous value (7345357)
>>      with the new, smaller value and mistakenly interprets this as a counter
>>      overflow, adding a large delta. In reality, this is a false positive:
>>      the counter didn’t overflow but was simply reset when the RMID
>>      transitioned from “Unavailable” back to active.
>>
>> Fix the issue by resetting the prev_msr value to zero when hardware
> 
> "the prev_msr value" -> "arch_mbm_state::prev_msr"
> Although, this can be seen from the code. It may be more useful to describe
> the fix as "Reset the stored value of MSR_IA32_QM_CTR used to handle
> counter overflow every time the RMID is unavailable ..."

Sure.

> 
>> returns IA32_QM_CTR.Unavailable (bit 62) when the RMID becomes active from
> 
> (IA32_QM_CTR -> MSR_IA32_QM_CTR)
> 
>> "Unavailable".
> 
> Related to earlier comments I do not think "when the RMID becomes active from
> Unavailable" is accurate. It seems more accurate to say that the value is
> set to zero every time the RMID is unavailable/not tracked in preparation for
> the RMID to be reset to zero when it starts to be tracked again.

That is correct.

> 
>>
>> Here is the text from APM [1].
>>
>> "In PQOS Version 2.0 or higher, the MBM hardware will set the U bit on the
>> first QM_CTR read when it begins tracking an RMID that it was not
>> previously tracking. The U bit will be zero for all subsequent reads from
>> that RMID while it is still tracked by the hardware. Therefore, a QM_CTR
>> read with the U bit set when that RMID is in use by a processor can be
>> considered 0 when calculating the difference with a subsequent read."
> 
> Indeed ... and APM also says that Unavailable bit is set on *every* read while
> RMID is not tracked? :/

Yes.

> 
>>
>> The details are documented in APM [1] available from [2].
> 
> Seems unnecessary (copied from ABMC?)? Earlier quote can just be prefixed with:
> 
> 	Here is the text from APM [1] available from [2].

Sure.

> 
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>> Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
>> Bandwidth (MBM).
> 
> (another learning from ABMC work, this can be more readable with indentation)

Sure.

> 
> 	[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> 	    Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
> 	    Bandwidth (MBM).
> 
>>
>> Signed-off-by: Babu Moger <babu.moger@....com>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
>> ---
>>
>> Tested this on multiple AMD systems, but not on Intel systems.
>> Need help with that. If everything goes well, this patch needs to
>> go to all the stable kernels.
> 
> Needs a "Fixes" and "Cc: stable@...r.kernel.org"?

Hmm.. Not sure. Which commit to add Fixes in. This is AMD related.
We should have taken care when adding support to AMD. Does this commit 
make sense?

Fixes: 4d05bf71f157d ("x86/resctrl: Introduce AMD QOS feature")


> This patch will only apply to upstream though so would need backporting
> support. Will you be able to support this?

Yes. I can do that.

> 
>> ---
>>   arch/x86/kernel/cpu/resctrl/monitor.c | 19 +++++++++++++++----
>>   1 file changed, 15 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index c8945610d455..a685370dd160 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -242,7 +242,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>>   			   u32 unused, u32 rmid, enum resctrl_event_id eventid,
>>   			   u64 *val, void *ignored)
>>   {
>> +	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
>>   	int cpu = cpumask_any(&d->hdr.cpu_mask);
>> +	struct arch_mbm_state *am;
>>   	u64 msr_val;
>>   	u32 prmid;
>>   	int ret;
>> @@ -251,12 +253,21 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>>   
>>   	prmid = logical_rmid_to_physical_rmid(cpu, rmid);
>>   	ret = __rmid_read_phys(prmid, eventid, &msr_val);
>> -	if (ret)
>> -		return ret;
>>   
>> -	*val = get_corrected_val(r, d, rmid, eventid, msr_val);
>> +	switch (ret) {
>> +	case 0:
>> +		*val = get_corrected_val(r, d, rmid, eventid, msr_val);
>> +		break;
>> +	case -EINVAL:
>> +		am = get_arch_mbm_state(hw_dom, rmid, eventid);
>> +		if (am)
>> +			am->prev_msr = 0;
>> +		break;
>> +	default:
>> +		break;
>> +	}
>>   
>> -	return 0;
>> +	return ret;
>>   }
>>   
>>   static int __cntr_id_read(u32 cntr_id, u64 *val)
> 
> The fix looks good to me.

Thanks
Babu