Message-ID: <Y0UgeUIJSFNR4mQB@feng-clx>
Date:   Tue, 11 Oct 2022 15:51:21 +0800
From:   Feng Tang <feng.tang@...el.com>
To:     Dave Hansen <dave.hansen@...el.com>
CC:     Peter Zijlstra <peterz@...radead.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        "H . Peter Anvin" <hpa@...or.com>, <x86@...nel.org>,
        <linux-kernel@...r.kernel.org>, <rui.zhang@...el.com>,
        <tim.c.chen@...el.com>, Xiongfeng Wang <wangxiongfeng2@...wei.com>,
        Yu Liao <liaoyu15@...wei.com>
Subject: Re: [PATCH] x86/tsc: Extend the watchdog check exemption to 4S/8S
 machine

On Tue, Oct 11, 2022 at 09:09:12AM +0800, Feng Tang wrote:
> On Mon, Oct 10, 2022 at 07:23:10AM -0700, Dave Hansen wrote:
> > On 10/9/22 18:23, Feng Tang wrote:
> > >>> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > >>> index cafacb2e58cc..b4ea79cb1d1a 100644
> > >>> --- a/arch/x86/kernel/tsc.c
> > >>> +++ b/arch/x86/kernel/tsc.c
> > >>> @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
> > >>>  	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > >>>  	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > >>>  	    boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
> > >>> -	    nr_online_nodes <= 2)
> > >>> +	    nr_online_nodes <= 8)
> > >> So you're saying all 8 socket systems since Broadwell (?) are TSC
> > >> sync'ed ?
> > > No, I didn't mean that. I haven't had access to any 8-socket
> > > machine, and I got a report last month that on one 8S machine,
> > > the TSC was judged 'unstable' with HPET as the watchdog.
> > 
> > That's not a great check.  Think about numa=fake=4U, for instance.  Or a
> > single-socket system with persistent memory and high bandwidth memory.
> > 
> > Basically 'nr_online_nodes' is a software construct.  It's going to be
> > really hard to infer anything from it about what the _hardware_ is.
> 
> You are right! Getting the socket count was indeed a problem when
> I worked on commit b50db7095fe0; it comes down to the
> initialization order. This TSC check needs to be done in tsc_init(),
> while node_states[] only gets initialized later, in smp_init().
> 
> For the case you mentioned above, I dug out some old logs which showed
> its init order:
> 
>   numa=fake=4 on a SKL desktop
>   ================
>   [    0.000066] [tsc_early_init()]: nr_online_nodes = 1
>   [    0.000068] [tsc_early_init()]: nr_cpu_nodes = 0
>   [    0.000070] [tsc_early_init()]: nr_mem_nodes = 0
>   [    0.104015] [tsc_init()]: nr_online_nodes = 4
>   [    0.104019] [tsc_init()]: nr_cpu_nodes = 0
>   [    0.104022] [tsc_init()]: nr_mem_nodes = 4
>   [    0.124778] smp: Brought up 4 nodes, 4 CPUs
>   [    0.760915] [init_tsc_clocksource()]: nr_online_nodes = 4
>   [    0.760919] [init_tsc_clocksource()]: nr_cpu_nodes = 4
>   [    0.760922] [init_tsc_clocksource()]: nr_mem_nodes = 4
>   
>   QEMU with 2 CPU-DRAM nodes + 2 Persistent memory nodes 
>   ========================================================
>   [    0.066651] [tsc_early_init()]: nr_online_nodes = 1
>   [    0.067494] [tsc_early_init()]: nr_cpu_nodes = 0
>   [    0.068288] [tsc_early_init()]: nr_mem_nodes = 0
>   [    0.677694] [tsc_init()]: nr_online_nodes = 4
>   [    0.678862] [tsc_init()]: nr_cpu_nodes = 0
>   [    0.679962] [tsc_init()]: nr_mem_nodes = 4
>   [    1.139240] [init_tsc_clocksource()]: nr_online_nodes = 4
>   [    1.140576] [init_tsc_clocksource()]: nr_cpu_nodes = 2
>   [    1.141823] [init_tsc_clocksource()]: nr_mem_nodes = 4
>   [    1.660100] [kernel_init()]: nr_online_nodes = 4
>   [    1.661234] [kernel_init()]: nr_cpu_nodes = 2
>   [    1.662300] [kernel_init()]: nr_mem_nodes = 4
> 
> 'nr_online_nodes' was chosen in the hope that, in the worst case,
> the patch is just a nop and won't wrongly lift the check.
> 
> One possible solution for this problem is to leverage the SRAT table
> early init which is called before tsc_init(), and can provide CPU
> nodes info. Will try this way.

The simple patch below adds a dedicated CPU nodemask and sets it during
early SRAT CPU parsing. It still has a problem when sub-NUMA is enabled
in the BIOS, where there are more NUMA nodes in the SRAT table. (Also,
I'm not sure the change to amdtopology.c is right.)
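
To illustrate the counting outside the kernel, here is a minimal
user-space sketch. It is not kernel code: nodemask64_t, mask_set_node(),
mask_weight(), struct srat_entry, count_cpu_nodes() and the two layout
tables are all made up for the example, and the SRAT entries are heavily
simplified. It only shows that counting nodes seen in CPU affinity
entries ignores memory-only nodes, and that sub-NUMA still inflates the
count.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: one bit per node, at most 64 nodes. */
typedef uint64_t nodemask64_t;

static void mask_set_node(nodemask64_t *mask, unsigned int node)
{
	*mask |= (uint64_t)1 << node;
}

/* Number of set bits; plays the role nodes_weight() has in the patch. */
static unsigned int mask_weight(nodemask64_t mask)
{
	unsigned int n = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Simplified SRAT-like entry: CPU affinity (is_cpu = 1) or memory affinity. */
struct srat_entry {
	unsigned int node;
	int is_cpu;
};

/* Count only the nodes that appear in at least one CPU affinity entry. */
static unsigned int count_cpu_nodes(const struct srat_entry *e, size_t n)
{
	nodemask64_t cpu_nodes = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (e[i].is_cpu)
			mask_set_node(&cpu_nodes, e[i].node);

	return mask_weight(cpu_nodes);
}

/* 2 CPU+DRAM nodes plus 2 persistent-memory-only nodes (assumed layout). */
static const struct srat_entry pmem_layout[] = {
	{ 0, 1 }, { 0, 0 }, { 1, 1 }, { 1, 0 }, { 2, 0 }, { 3, 0 },
};
#define PMEM_ENTRIES (sizeof(pmem_layout) / sizeof(pmem_layout[0]))

/* 2 sockets with sub-NUMA enabled: all 4 nodes have CPUs (assumed layout). */
static const struct srat_entry snc_layout[] = {
	{ 0, 1 }, { 1, 1 }, { 2, 1 }, { 3, 1 },
};
#define SNC_ENTRIES (sizeof(snc_layout) / sizeof(snc_layout[0]))
```

With these tables, count_cpu_nodes() gives 2 for pmem_layout, so a
"<= 2" check would pass as intended, but 4 for snc_layout even though
that layout stands for only two sockets.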

Thanks,
Feng

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index e3bae2b60a0d..e745053a5f9a 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -31,6 +31,7 @@ extern int numa_off;
  */
 extern s16 __apicid_to_node[MAX_LOCAL_APIC];
 extern nodemask_t numa_nodes_parsed __initdata;
+extern nodemask_t numa_cpu_nodes __initdata;
 
 extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
 extern void __init numa_set_distance(int from, int to, int distance);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 179e0b1ba5cc..a2a7fc5aa15c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -29,6 +29,7 @@
 #include <asm/intel-family.h>
 #include <asm/i8259.h>
 #include <asm/uv/uv.h>
+#include <asm/numa.h>
 
 unsigned int __read_mostly cpu_khz;	/* TSC clocks / usec, not used here */
 EXPORT_SYMBOL(cpu_khz);
@@ -1218,7 +1219,7 @@ static void __init check_system_tsc_reliable(void)
 	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
 	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
 	    boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
-	    nr_online_nodes <= 2)
+	    nodes_weight(numa_cpu_nodes) <= 2)
 		tsc_disable_clocksource_watchdog();
 }
 
diff --git a/arch/x86/mm/amdtopology.c b/arch/x86/mm/amdtopology.c
index b3ca7d23e4b0..6b982a16cc38 100644
--- a/arch/x86/mm/amdtopology.c
+++ b/arch/x86/mm/amdtopology.c
@@ -152,6 +152,7 @@ int __init amd_numa_init(void)
 		prevbase = base;
 		numa_add_memblk(nodeid, base, limit);
 		node_set(nodeid, numa_nodes_parsed);
+		node_set(nodeid, numa_cpu_nodes);
 	}
 
 	if (nodes_empty(numa_nodes_parsed))
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 090125b3ee1f..82798fee97a2 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -21,6 +21,7 @@
 
 int numa_off;
 nodemask_t numa_nodes_parsed __initdata;
+nodemask_t numa_cpu_nodes __initdata;
 
 struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 7688117ac2f4..11b08b317306 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -59,6 +59,7 @@ acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa)
 	}
 	set_apicid_to_node(apic_id, node);
 	node_set(node, numa_nodes_parsed);
+	node_set(node, numa_cpu_nodes);
 
 	printk(KERN_INFO "SRAT: PXM %u -> APIC 0x%04x -> Node %u\n",
@@ -106,6 +107,7 @@ acpi_numa_processor_affinity_init(struct acpi_srat_cpu_affinity *pa)
 
 	set_apicid_to_node(apic_id, node);
 	node_set(node, numa_nodes_parsed);
+	node_set(node, numa_cpu_nodes);
 
 	printk(KERN_INFO "SRAT: PXM %u -> APIC 0x%02x -> Node %u\n",
