linux-kernel - Re: [PATCH] NUMA: Early use of cpu_to_node() returns 0 instead of the correct node id

Message-ID: <1e39190e-3cb9-41f5-bd60-1f1124823e4a@amperemail.onmicrosoft.com>
Date: Mon, 22 Jan 2024 15:32:27 +0800
From: Shijie Huang <shijie@...eremail.onmicrosoft.com>
To: Yury Norov <yury.norov@...il.com>
Cc: Mike Rapoport <rppt@...nel.org>,
 Huang Shijie <shijie@...amperecomputing.com>, gregkh@...uxfoundation.org,
 patches@...erecomputing.com, rafael@...nel.org, paul.walmsley@...ive.com,
 palmer@...belt.com, aou@...s.berkeley.edu, kuba@...nel.org,
 vschneid@...hat.com, mingo@...nel.org, akpm@...ux-foundation.org,
 vbabka@...e.cz, tglx@...utronix.de, jpoimboe@...nel.org,
 ndesaulniers@...gle.com, mikelley@...rosoft.com, mhiramat@...nel.org,
 arnd@...db.de, linux-kernel@...r.kernel.org,
 linux-riscv@...ts.infradead.org, linux-arm-kernel@...ts.infradead.org,
 catalin.marinas@....com, will@...nel.org, mark.rutland@....com,
 mpe@...erman.id.au, linuxppc-dev@...ts.ozlabs.org, chenhuacai@...nel.org,
 jiaxun.yang@...goat.com, linux-mips@...r.kernel.org,
 cl@...amperecomputing.com
Subject: Re: [PATCH] NUMA: Early use of cpu_to_node() returns 0 instead of the
 correct node id


On 2024/1/20 2:02, Yury Norov wrote:
> On Fri, Jan 19, 2024 at 04:50:53PM +0800, Shijie Huang wrote:
>> On 2024/1/19 16:42, Mike Rapoport wrote:
>>> On Fri, Jan 19, 2024 at 02:46:16PM +0800, Shijie Huang wrote:
>>>> On 2024/1/19 12:42, Yury Norov wrote:
>>>>> This adds another level of indirection, I think. Currently cpu_to_node
>>>>> is a simple inliner. After the patch it would be a real function with
>>>>> all the associated overhead. Can you share a bloat-o-meter output here?
>>>> #./scripts/bloat-o-meter vmlinux vmlinux.new
>>>> add/remove: 6/1 grow/shrink: 61/51 up/down: 1168/-588 (580)
>>>> Function                                     old     new   delta
>>>> numa_update_cpu                              148     244     +96
>>>>
>>>>    ... (too many entries, skipped)
>>>>
>>>> Total: Before=32990130, After=32990710, chg +0.00%
>>> It's not only about text size, the indirect call also hurts performance
>> cpu_to_node() is called very infrequently, and most of the calls happen
>> during kernel boot.
> That doesn't matter. This function is a simple inliner that dereferences
> a pointer, and I believe all of us want to keep it simple.

Yes, I agree. I want to keep it simple too.


>>>>> Regardless, I don't think that the approach is correct. As per your
>>>>> description, some initialization functions erroneously call
>>>>> cpu_to_node() instead of early_cpu_to_node() which exists specifically
>>>>> for that case.
>>>>>
>>>>> If the above is correct, it's clearly a caller problem, and the fix is to
>>>>> simply switch all those callers to use the early version.
>>>> It is easy to switch sched_init(), init_sched_fair_class() and
>>>> workqueue_init_early() to early_cpu_to_node(). These three places call
>>>> cpu_to_node() from an __init function.
>>>>
>>>> But it is a little harder to change early_trace_init(), since it calls
>>>> cpu_to_node() deep in the call stack:
>>>>
>>>>     early_trace_init() --> ring_buffer_alloc() --> rb_allocate_cpu_buffer()
>>>>
>>>> For early_trace_init(), we need to change more code.
>>>>
>>>> Anyway, if we think it is not a good idea to change the common code, I am
>>>> okay with that too.
>>> Is there a fundamental reason to have early_cpu_to_node() at all?
>> early_cpu_to_node() does not work on some architectures that support NUMA,
>> such as SPARC, MIPS and S390.
> So, your approach wouldn't work either, right? I think you've got a
> testing bot report on it already...

IMHO, my patch works fine for them.

They have their own cpu_to_node.
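
To illustrate what "their own cpu_to_node" means, here is a minimal userspace sketch (not kernel code) of the override pattern, modelled on the "#ifndef cpu_to_node" guard in the generic definition quoted further down. All demo_* names and the table contents are made up for illustration.

```c
#include <assert.h>

#define DEMO_NR_CPUS 4

/*
 * Hypothetical arch header: this architecture keeps its own CPU-to-node
 * table and announces that by defining the cpu_to_node macro.
 */
static const int demo_arch_node_table[DEMO_NR_CPUS] = { 0, 1, 0, 1 };

static inline int demo_arch_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	return demo_arch_node_table[cpu];
}
#define cpu_to_node demo_arch_cpu_to_node

/*
 * Hypothetical generic header: the fallback is compiled only when no
 * arch override exists, so it is skipped in this build.
 */
#ifndef cpu_to_node
static int demo_generic_numa_node[DEMO_NR_CPUS];

static inline int cpu_to_node(int cpu)
{
	return demo_generic_numa_node[cpu];
}
#endif
```

Because the macro is already defined when the generic part is reached, callers of cpu_to_node() end up in the arch version, and a change to the generic fallback does not touch such an architecture.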


x86 reported a compile error, because x86 does not build
drivers/base/arch_numa.c.

I have fixed it by moving cpu_to_node() from

      drivers/base/arch_numa.c to drivers/base/node.c

drivers/base/node.c is built in for all the NUMA architectures.


> You can make it like this:
>
>    #ifdef CONFIG_ARCH_NO_EARLY_CPU_TO_NODE
>    #define early_cpu_to_node cpu_to_node
>    #endif

Thanks, but adding this makes it more complicated.


>>> It seems that all the mappings are known by the end of setup_arch() and the
>>> initialization of numa_node can be moved earlier.
>>>>> I would also initialize the numa_node with NUMA_NO_NODE at declaration,
>>>>> so that if someone calls cpu_to_node() before the variable is properly
>>>>> initialized at runtime, he'll get NO_NODE, which is obviously an error.
>>>> Even if we initialize numa_node to NUMA_NO_NODE, it does not always
>>>> produce an error.
> You can print this error yourself:
>
>    #ifndef cpu_to_node
>    static inline int cpu_to_node(int cpu)
>    {
>          int node = per_cpu(numa_node, cpu);
>
>    #ifdef CONFIG_DEBUG_PER_CPU_MAPS
>          if (node == NUMA_NO_NODE)
>                  pr_err(...);
>    #endif
>
>          return node;
>    }
>    #endif

Thanks. I had a similar private patch to detect it.

With my patch applied, there is no need to detect the error at all.
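
For readers following the thread, the failure mode under discussion can be modelled with a small userspace sketch (not kernel code). It assumes a per-CPU slot initialized to NUMA_NO_NODE, as suggested above, and an early table that is valid before the slots are filled in. The demo_* names and the CPU-to-node values are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define DEMO_NR_CPUS 4
#define NUMA_NO_NODE (-1)

/* Stand-in for the early map, known before the per-CPU slots are set up. */
static const int demo_early_map[DEMO_NR_CPUS] = { 0, 0, 1, 1 };

/* Stand-in for the per-CPU numa_node variable, not yet initialized. */
static int demo_numa_node[DEMO_NR_CPUS] = {
	NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE
};

/* Always valid: reads the early map directly. */
static int demo_early_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	return demo_early_map[cpu];
}

/* Valid only after demo_init_numa_node(); reports a too-early call. */
static int demo_cpu_to_node(int cpu)
{
	int node;

	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	node = demo_numa_node[cpu];
	if (node == NUMA_NO_NODE)
		fprintf(stderr, "cpu %d: node looked up too early\n", cpu);
	return node;
}

/* Models the point in boot where the per-CPU slots get their real values. */
static void demo_init_numa_node(void)
{
	int cpu;

	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		demo_numa_node[cpu] = demo_early_map[cpu];
}
```

Calling demo_cpu_to_node(2) before demo_init_numa_node() returns NUMA_NO_NODE and logs a message, while demo_early_cpu_to_node(2) already returns the real node. If the slots started at 0 instead, the too-early call would silently return node 0, which is the behavior named in the subject line.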


Thanks

Huang Shijie