Date: Thu, 30 May 2024 15:35:55 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: Peter Schneider <pschneider1968@...glemail.com>
Cc: LKML <linux-kernel@...r.kernel.org>, x86@...nel.org,
 stable@...r.kernel.org, regressions@...ts.linux.dev
Subject: Re: Kernel 6.9 regression: X86: Bogus messages from topology detection

Peter!

On Thu, May 30 2024 at 12:06, Peter Schneider wrote:
> Am 30.05.24 um 10:30 schrieb Thomas Gleixner:
>
>> Can you please apply the debug patch below and provide the full dmesg
>> after boot?
>
> Here you go... The patch applied cleanly against 6.9.3, which I saw
> was just released by Greg, so I used that. If you want, I can repeat
> the test against 6.9.2, too.

6.9.3 is fine.

> Please note: to be able to boot any kernel >= 6.8.4 on my machine, I also had to apply 
> this patch by Martin Petersen, fixing another (unrelated SCSI) regression I reported some 
> time ago, see here:
>
> https://lore.kernel.org/all/20240521023040.2703884-1-martin.petersen@oracle.com/
>
> But I think these two issues are not connected in any way. It was during testing the above 
> patch by Martin that I noticed this new issue in 6.9 BTW.

Right. It's a separate problem.

> I have attached resulting file dmesg_6.9.3-dirty_Bad_wDebugInfo.txt,
> and I hope you can make some sense of it.

It's exactly what I expected but it does not make any sense at all.

>     [    0.000000] Legacy: 2 5 5

So that means that during early boot, where the topology parameters are
decoded from CPUID, the CPUID evaluation code sees that the maximum
supported CPUID leaf is 0x02 and it therefore reads complete nonsense.

Later on when the full CPUID evaluation happens it sees the full space
and uses leaf 0xb.
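
To illustrate the difference between the two passes, here is a minimal
stand-alone sketch of the decision, assuming the choice hinges only on
whether the maximum basic CPUID leaf (EAX of leaf 0) reaches 0xb. The
enum and the function name are made up for this example and are not the
kernel's identifiers; the real code has more conditions than this.

```c
/* Hypothetical names for illustration only, not kernel identifiers. */
enum topo_source {
	TOPO_SRC_LEGACY,	/* fall back to the legacy decoding */
	TOPO_SRC_LEAF_B,	/* enumerate via CPUID leaf 0xb */
};

/*
 * max_leaf is the value CPUID leaf 0 returns in EAX, i.e. the highest
 * supported basic leaf. Assumption: leaf 0xb is used whenever it is
 * within that range, otherwise the legacy path is taken.
 */
static enum topo_source pick_topo_source(unsigned int max_leaf)
{
	return max_leaf >= 0xb ? TOPO_SRC_LEAF_B : TOPO_SRC_LEGACY;
}
```

With the early value from the log, pick_topo_source(0x02) selects the
legacy path, while any value of 0xb or above selects leaf 0xb, which is
what the later evaluation evidently ends up with.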

>     [    1.687649] L:b 0 0 S:1 N:2 T:1
>     [    1.687652] D: 0
>     [    1.687653] L:b 1 1 S:5 N:24 T:2
>     [    1.687655] D: 1
>     [    1.687656] L:b 2 2 S:0 N:0 T:0
>     [    1.687658] [Firmware Bug]: CPU0: Topology domain 0 shift 1 != 5

And this obviously sees the proper numbers and complains about the
inconsistency.
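
The complaint boils down to comparing the shift recorded early with the
one found later. A rough sketch of that comparison, assuming the message
prints the later (leaf 0xb) shift first and the early one second, as the
numbers in the log suggest. check_dom_shift() is a hypothetical helper
for illustration, not the kernel function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns true when both shifts agree and leaves
 * buf empty. On a mismatch it returns false and formats a message
 * shaped like the one in the log into buf.
 */
static bool check_dom_shift(unsigned int cpu, unsigned int dom,
			    unsigned int late_shift, unsigned int early_shift,
			    char *buf, size_t len)
{
	if (late_shift == early_shift) {
		if (len)
			buf[0] = '\0';
		return true;
	}

	snprintf(buf, len, "CPU%u: Topology domain %u shift %u != %u",
		 cpu, dom, late_shift, early_shift);
	return false;
}
```

Fed with the values above (CPU 0, domain 0, S:1 from leaf 0xb, 5 from the
"Legacy:" line) this produces "CPU0: Topology domain 0 shift 1 != 5".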

So something on this CPU is broken. The same problem exists on all APs:

>     [    1.790035] .... node  #0, CPUs:        #4
>     [    1.790312] .... node  #1, CPUs:   #12 #16
>     [    0.011992] Legacy: 2 5 5
>     [    0.011992] Legacy: 2 5 5
>     [    0.011992] Legacy: 2 5 5
>     [    0.011992] Legacy: 2 5 5
      .....

Now the million-dollar question is what unlocks CPUID to read the proper
value of EAX of leaf 0. All I could come up with is to sprinkle a dozen
printks into that code. Updated debug patch below.

Thanks,

        tglx
---
--- a/arch/x86/kernel/cpu/topology_common.c
+++ b/arch/x86/kernel/cpu/topology_common.c
@@ -65,6 +65,7 @@ static void parse_legacy(struct topo_sca
 		cores <<= smt_shift;
 	}
 
+	pr_info("Legacy: %u %u %u\n", c->cpuid_level, smt_shift, core_shift);
 	topology_set_dom(tscan, TOPO_SMT_DOMAIN, smt_shift, 1U << smt_shift);
 	topology_set_dom(tscan, TOPO_CORE_DOMAIN, core_shift, cores);
 }
--- a/arch/x86/kernel/cpu/topology_ext.c
+++ b/arch/x86/kernel/cpu/topology_ext.c
@@ -72,6 +72,9 @@ static inline bool topo_subleaf(struct t
 
 	cpuid_subleaf(leaf, subleaf, &sl);
 
+	pr_info("L:%0x %0x %0x S:%u N:%u T:%u\n", leaf, subleaf, sl.level, sl.x2apic_shift,
+		sl.num_processors, sl.type);
+
 	if (!sl.num_processors || sl.type == INVALID_TYPE)
 		return false;
 
@@ -97,6 +100,7 @@ static inline bool topo_subleaf(struct t
 			     leaf, subleaf, tscan->c->topo.initial_apicid, sl.x2apic_id);
 	}
 
+	pr_info("D: %u\n", dom);
 	topology_set_dom(tscan, dom, sl.x2apic_shift, sl.num_processors);
 	return true;
 }
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1584,22 +1584,30 @@ static void __init early_identify_cpu(st
 	/* cyrix could have cpuid enabled via c_identify()*/
 	if (have_cpuid_p()) {
 		cpu_detect(c);
+		pr_info("MAXL1: %x\n", cpuid_eax(0));
 		get_cpu_vendor(c);
+		pr_info("MAXL2: %x\n", cpuid_eax(0));
 		get_cpu_cap(c);
+		pr_info("MAXL3: %x\n", cpuid_eax(0));
 		setup_force_cpu_cap(X86_FEATURE_CPUID);
 		get_cpu_address_sizes(c);
+		pr_info("MAXL4: %x\n", cpuid_eax(0));
 		cpu_parse_early_param();
+		pr_info("MAXL5: %x\n", cpuid_eax(0));
 
 		cpu_init_topology(c);
+		pr_info("MAXL6: %x\n", cpuid_eax(0));
 
 		if (this_cpu->c_early_init)
 			this_cpu->c_early_init(c);
+		pr_info("MAXL7: %x\n", cpuid_eax(0));
 
 		c->cpu_index = 0;
 		filter_cpuid_features(c, false);
 
 		if (this_cpu->c_bsp_init)
 			this_cpu->c_bsp_init(c);
+		pr_info("MAXL8: %x\n", cpuid_eax(0));
 	} else {
 		setup_clear_cpu_cap(X86_FEATURE_CPUID);
 		get_cpu_address_sizes(c);
@@ -1797,9 +1805,12 @@ static void identify_cpu(struct cpuinfo_
 #ifdef CONFIG_X86_VMX_FEATURE_NAMES
 	memset(&c->vmx_capability, 0, sizeof(c->vmx_capability));
 #endif
+	pr_info("MAXLG1: %x\n", cpuid_eax(0));
 
 	generic_identify(c);
 
+	pr_info("MAXLG2: %x\n", cpuid_eax(0));
+
 	cpu_parse_topology(c);
 
 	if (this_cpu->c_identify)
