Message-ID: <dcc38ba0-cbb4-494e-bc10-2df2b4aa2cb0@google.com>
Date: Tue, 22 Jul 2025 14:51:21 -0400
From: Barret Rhoden <brho@...gle.com>
To: Reinette Chatre <reinette.chatre@...el.com>
Cc: Tony Luck <tony.luck@...el.com>, Dave Martin <Dave.Martin@....com>,
James Morse <james.morse@....com>, linux-kernel@...r.kernel.org,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [PATCH] x86/resctrl: avoid divide by 0 num_rmid

On 7/22/25 2:19 PM, Reinette Chatre wrote:
> Hi Barret,
>
> On 7/21/25 11:00 AM, Barret Rhoden wrote:
>> x86_cache_max_rmid's default is -1. If the hardware or VM doesn't set
>> the right cpuid bits, num_rmid can be 0.
>>
>> Signed-off-by: Barret Rhoden <brho@...gle.com>
>>
>> ---
>> I ran into this on a VM on Granite Rapids. I guess the VMM told the
>> kernel it was a GNR, but didn't set all the cache/resctrl bits.
>>
>
> The -1 default of x86_cache_max_rmid is assigned if the hardware does not
> support *any* L3 monitoring. Specifically:
>
> resctrl_cpu_detect():
>         if (!cpu_has(c, X86_FEATURE_CQM_LLC)) {
>                 c->x86_cache_max_rmid = -1;
>                 ...
>         }
>
> The function modified by this patch, rdt_get_mon_l3_config(), only runs if
> the hardware supports one or more of the L3 monitoring sub-features
> (X86_FEATURE_CQM_OCCUP_LLC, X86_FEATURE_CQM_MBM_TOTAL, or
> X86_FEATURE_CQM_MBM_LOCAL) that depend on X86_FEATURE_CQM_LLC per cpuid_deps[].
>
> I tried to reproduce the issue on real hardware by using clearcpuid to
> disable X86_FEATURE_CQM_LLC. The CPUID dependencies did the right thing:
> X86_FEATURE_CQM_OCCUP_LLC, X86_FEATURE_CQM_MBM_TOTAL, and
> X86_FEATURE_CQM_MBM_LOCAL were disabled automatically,
> rdt_get_mon_l3_config() did not run at all, and none of the L3 monitoring
> details were enumerated.
>
> What are the symptoms when you encounter this issue?

Linux crashes during boot with a divide error, and the splat backtrace
is in rdt_get_mon_l3_config().
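
For context, the divide error follows from the arithmetic in the patch description: with x86_cache_max_rmid at its default of -1, num_rmid comes out as 0, and any division by it traps. Below is a minimal userspace sketch of that failure mode and a guard against it. It is not the kernel code. It assumes num_rmid is the maximum RMID plus one, as the description implies; the function name, the limit parameter, and the -1 error return are hypothetical.

```c
/*
 * Hypothetical helper, not a kernel function: derive the number of RMIDs
 * from the maximum RMID and compute a per-RMID share of `limit`.
 * Returns -1 instead of dividing when no RMIDs are available.
 */
static long per_rmid_share(int cache_max_rmid, long limit)
{
	/* A maximum RMID of -1 (the default) yields num_rmid == 0. */
	long num_rmid = (long)cache_max_rmid + 1;

	if (num_rmid <= 0)
		return -1;	/* guard: an unchecked division here would trap */

	return limit / num_rmid;
}
```

With the guard in place, per_rmid_share(-1, 1000) returns -1 rather than faulting, while per_rmid_share(255, 1024) divides by 256 as usual.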

> Would it be possible to send me the CPUID flags of leaf 7, subleaf 0 as
> well as all sub-leaves of leaf 0xF?

# ./cpuid 0x7 0
CPUID for Leaf 0x00000007, Sublevel 0x00000000:
eax: 00000002
ebx: f1bf2ffb
ecx: 1b415f7e
edx: bc814410
# ./cpuid 0x7 1
CPUID for Leaf 0x00000007, Sublevel 0x00000001:
eax: 00201c30
ebx: 00000000
ecx: 00000000
edx: 00084000
# ./cpuid 0x7 2
CPUID for Leaf 0x00000007, Sublevel 0x00000002:
eax: 00000000
ebx: 00000000
ecx: 00000000
edx: 0000003f
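
As a reading aid, the RDT-related bits in the leaf 7, subleaf 0 EBX value above can be checked with a small helper. This is a sketch only: the bit positions (12 for RDT monitoring, 15 for RDT allocation) are taken from the Intel SDM's description of CPUID.(EAX=07H,ECX=0):EBX and are an assumption here, and the macro and function names are hypothetical.

```c
#include <stdint.h>

/* Bit positions in CPUID.(EAX=07H,ECX=0):EBX, assumed from the Intel SDM. */
#define LEAF7_EBX_RDT_M_BIT	12	/* RDT monitoring (PQM) */
#define LEAF7_EBX_RDT_A_BIT	15	/* RDT allocation (PQE) */

/* Hypothetical helper: return 1 if `bit` is set in `reg`, else 0. */
static int cpuid_bit_set(uint32_t reg, unsigned int bit)
{
	return (int)((reg >> bit) & 1u);
}
```

For the reported EBX value, cpuid_bit_set(0xf1bf2ffb, LEAF7_EBX_RDT_M_BIT) and cpuid_bit_set(0xf1bf2ffb, LEAF7_EBX_RDT_A_BIT) both evaluate to 0. Leaf 0xF is not part of the dump, so this check alone does not show what the guest reports there.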

> Could you please also elaborate what the impact of this issue is? Is this
> a VM that has been released with many users impacted or something encountered
> during development of this VM?

This is with cloud-hypervisor. We do have a couple of local patches for
running on machines with more than 256 CPUs. I didn't see anything in
our changes related to cpuid 0x7, but maybe it's on our end.

But I imagine the problem isn't widespread and could be considered
developmental.

I'll keep poking on my end - maybe I had some other cruft in my system
(in the kernel build or in cloud-hypervisor).

Thanks,

Barret