[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3e1b459a-bd9a-400b-8c10-038b330e539f@amd.com>
Date: Tue, 7 Oct 2025 18:47:23 -0500
From: "Moger, Babu" <bmoger@....com>
To: Reinette Chatre <reinette.chatre@...el.com>,
Babu Moger <babu.moger@....com>, tony.luck@...el.com, Dave.Martin@....com,
james.morse@....com, tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com
Cc: x86@...nel.org, hpa@...or.com, linux-kernel@...r.kernel.org,
peternewman@...gle.com, eranian@...gle.com, gautham.shenoy@....com
Subject: Re: [PATCH] x86/resctrl: Fix buggy overflow when reactivating
previously Unavailable RMID
On 10/7/2025 3:23 PM, Reinette Chatre wrote:
> Hi Babu,
>
> Thank you for catching and preparing a fix for this issue.
>
> On 10/6/25 4:13 PM, Babu Moger wrote:
>> The issue was observed during testing on systems with multiple resctrl
>> domains, where tasks were dynamically moved between domains.
>
> (please let changelog stand on its own and not be a continuation of subject line)
ok. Will remove the above first line.
>
>>
>> Users can create as many monitoring groups as the number of RMIDs supported
>> by the hardware. However, on AMD systems, only a limited number of RMIDs
>> are guaranteed to be actively tracked by the hardware. RMIDs that exceed
>> this limit are placed in an "Unavailable" state.
>>
>> When an RMID transitions from "Unavailable" back to active, the hardware
>> sets the IA32_QM_CTR.Unavailable (bit 62) on the first read from
>
> (can drop "the" from "sets the IA32_QM_CTR.Unavailable (bit 62)")
>
> (IA32_QM_CTR -> MSR_IA32_QM_CTR) (learning from touchup made during merge of ABMC work)
Sure.
>> MSR_IA32_QM_CTR. This indicates that the RMID was not previously tracked.
>
> This seems like a contradiction to me. Some times, like above, it reads
> that Unavailable bit is set on *first* read after counter returns to
> "active" but this text (and the spec) also mentions the "Unavailable state" where
> where the Unavailable bit is set on *every* read while RMID is in Unavailable
> state.
You are right. Will change the text to make that clear.
>
> For example, from the spec:
> Potential causes of the U bit being set include (but are not limited to):
> - RMID is not currently tracked by the hardware
>
>> Once the hardware begins tracking the RMID, subsequent reads from that RMID
>> will have the Unavailable bit cleared, as long as it remains tracked.
>
> Can we thus expect that Unavailable bit is *always* set when RMID is not tracked?
Yes.
>
>>
>> Problem scenario:
>> 1. The resctrl filesystem is mounted, and a task is assigned to a
>> monitoring group.
>>
>> $mount -t resctrl resctr /sys/fs/resctrl
>
> (resctr -> resctrl)
Sure.
>
>> $mkdir /sys/fs/resctr/mon_groups/test1
>
> (resctr -> resctrl)
Sure.
>
>> $echo 1234 > /sys/fs/resctrl/mon_groups/test1/tasks
>>
>> $cat /sys/fs/resctrl/mon_groups/test1/mon_data/l3_mon_*/mbm_total_bytes
>
> (l3_mon* -> mon_L3*)
Sure.
>
>> 21323 <- Total bytes on domain 0
>> "Unavailable" <- Total bytes on domain 1
>>
>> Task is running on domain 0. Counter on domain 1 is "Unavailable".
>
> This implies that the Unavailable bit is set while RMID is "in Unavailable state/
> not tracked" which seems to support that this bit is always set while RMID is not
> tracked, not just on first read after counter returns active.
Correct.
>
>>
>> 2. The task runs on domain 0 for a while and then moves to domain 1. The
>> counter starts incrementing on domain 1.
>>
>> $cat /sys/fs/resctrl/mon_groups/mon_data/l3_mon_*/mbm_total_bytes
>
> (l3_mon* -> mon_L3*)
Sure.
>
>> 7345357 <- Total bytes on domain 0
>> 4545 <- Total bytes on domain 1
>>
>>
>> 3. At some point, the RMID in domain 0 transitions to the "Unavailable"
>> state because the task is no longer executing in that domain.
>>
>> $cat /sys/fs/resctrl/mon_groups/mon_data/l3_mon_*/mbm_total_bytes
>
> (l3_mon* -> mon_L3*)
Sure.
>
>> "Unavailable" <- Total bytes on domain 0
>> 434341 <- Total bytes on domain 1
>>
>> 4. Since the task continues to migrate between domains, it may eventually
>> return to domain 0.
>>
>> $cat /sys/fs/resctrl/mon_groups/mon_data/l3_mon_*/mbm_total_bytes
>
> (l3_mon* -> mon_L3*)
Sure.
>
>> 17592178699059 <- Overflow on domain 0
>> 3232332 <- Total bytes on domain 1
>>
>> Because the RMID transitions from the “Unavailable” state to the
>
> This paragraph contains a few non-ascii characters ... like the quotes
> around Unavailable above (and below) and a few other characters (’ and —)
> below.
>
> (This happened in ABMC work also but slipped through with Boris needing
> to fix. Now I include a check for it so that it can be addressed sooner.
> Please check your tools to prevent these from slipping in.)
Let me check that.
>
>> active state, the first read sets IA32_QM_CTR.Unavailable (bit 62).
>
> (IA32_QM_CTR -> MSR_IA32_QM_CTR)
>
> I expected the Unavailable bit to be set the entire time the RMID was
> not tracked, not just this "first read"?
Yes. That is correct.
>
>> The following read starts counting from zero, which can be lower than
>> the previously saved MSR value (7345357). Consequently, the kernel’s
>> overflow logic is triggered—it compares the previous value (7345357)
>> with the new, smaller value and mistakenly interprets this as a counter
>> overflow, adding a large delta. In reality, this is a false positive:
>> the counter didn’t overflow but was simply reset when the RMID
>> transitioned from “Unavailable” back to active.
>>
>> Fix the issue by resetting the prev_msr value to zero when hardware
>
> "the prev_msr value" -> "arch_mbm_state::prev_msr"
> Although, this can be seen from the code. It may be more useful to describe
> the fix as "Reset the stored value of MSR_IA32_QM_CTR used to handle
> counter overflow every time the RMID is unavailable ..."
Sure.
>
>> returns IA32_QM_CTR.Unavailable (bit 62) when the RMID becomes active from
>
> (IA32_QM_CTR -> MSR_IA32_QM_CTR)
>
>> "Unavailable".
>
> Related to earlier comments I do not think "when the RMID becomes active from
> Unavailable" is accurate. It seems more accurate to say that the value is
> set to zero every time the RMID is unavailable/not tracked in preparation for
> the RMID to be reset to zero when it starts to be tracked again.
That is correct.
>
>>
>> Here is the text from APM [1].
>>
>> "In PQOS Version 2.0 or higher, the MBM hardware will set the U bit on the
>> first QM_CTR read when it begins tracking an RMID that it was not
>> previously tracking. The U bit will be zero for all subsequent reads from
>> that RMID while it is still tracked by the hardware. Therefore, a QM_CTR
>> read with the U bit set when that RMID is in use by a processor can be
>> considered 0 when calculating the difference with a subsequent read."
>
> Indeed ... and APM also says that Unavailable bit is set on *every* read while
> RMID is not tracked? :/
Yes.
>
>>
>> The details are documented in APM [1] available from [2].
>
> Seems unnecessary (copied from ABMC?)? Earlier quote can just be prefixed with:
>
> Here is the text from APM [1] available from [2].
Sure.
>
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>> Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
>> Bandwidth (MBM).
>
> (another learning from ABMC work, this can be more readable with indentation)
Sure.
>
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
> Bandwidth (MBM).
>
>>
>> Signed-off-by: Babu Moger <babu.moger@....com>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
>> ---
>>
>> Tested this on multiple AMD systems, but not on Intel systems.
>> Need help with that. If everything goes well, this patch needs to
>> go to all the stable kernels.
>
> Needs a "Fixes" and "Cc: stable@...r.kernel.org"?
Hmm.. Not sure. Which commit to add Fixes in. This is AMD related.
We should have taken care when adding support to AMD. Does this commit
make sense?
Fixes: 4d05bf71f157d ("x86/resctrl: Introduce AMD QOS feature")
> This patch will only apply to upstream though so would need backporting
> support. Will you be able to support this?
Yes. I can do that.
>
>> ---
>> arch/x86/kernel/cpu/resctrl/monitor.c | 19 +++++++++++++++----
>> 1 file changed, 15 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index c8945610d455..a685370dd160 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -242,7 +242,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>> u32 unused, u32 rmid, enum resctrl_event_id eventid,
>> u64 *val, void *ignored)
>> {
>> + struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
>> int cpu = cpumask_any(&d->hdr.cpu_mask);
>> + struct arch_mbm_state *am;
>> u64 msr_val;
>> u32 prmid;
>> int ret;
>> @@ -251,12 +253,21 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>>
>> prmid = logical_rmid_to_physical_rmid(cpu, rmid);
>> ret = __rmid_read_phys(prmid, eventid, &msr_val);
>> - if (ret)
>> - return ret;
>>
>> - *val = get_corrected_val(r, d, rmid, eventid, msr_val);
>> + switch (ret) {
>> + case 0:
>> + *val = get_corrected_val(r, d, rmid, eventid, msr_val);
>> + break;
>> + case -EINVAL:
>> + am = get_arch_mbm_state(hw_dom, rmid, eventid);
>> + if (am)
>> + am->prev_msr = 0;
>> + break;
>> + default:
>> + break;
>> + }
>>
>> - return 0;
>> + return ret;
>> }
>>
>> static int __cntr_id_read(u32 cntr_id, u64 *val)
>
> The fix looks good to me.
Thanks
Babu
Powered by blists - more mailing lists