

Message-Id: <20191107194942.734bc867e1c9578d07cf1712@linux-foundation.org>
Date:   Thu, 7 Nov 2019 19:49:42 -0800
From:   Andrew Morton <akpm@...ux-foundation.org>
To:     Shaokun Zhang <zhangshaokun@...ilicon.com>
Cc:     <linux-kernel@...r.kernel.org>, yuqi jin <jinyuqi@...wei.com>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        Paul Burton <paul.burton@...s.com>,
        Michal Hocko <mhocko@...e.com>,
        Michael Ellerman <mpe@...erman.id.au>,
        Anshuman Khandual <anshuman.khandual@....com>
Subject: Re: [PATCH v3] lib: optimize cpumask_local_spread()

On Thu, 7 Nov 2019 09:44:08 +0800 Shaokun Zhang <zhangshaokun@...ilicon.com> wrote:

> In multi-processor NUMA systems, an I/O driver picks the CPU cores to
> which its IRQs will be bound. When the cores in the local NUMA node have
> all been used, it is better to pick cores from the node closest to the
> local node instead of immediately choosing any online CPU.
> 
> On a Huawei Kunpeng 920 server, there are 4 NUMA nodes (0 - 3) in the
> 2-cpu system (0 - 1). We performed a PS (parameter server) business test,
> in which the client initiates a request through the network card and the
> server responds to the request after calculation.
> When two PS processes run on node2 and node3 separately and the
> network card is located on 'node2', which is in cpu1, the performance
> of node2 (26W QPS) and node3 (22W QPS) differs.
> It would be better for the NIC queues to be bound to the cpu1 cores in
> turn, so that XPS is also properly initialized, but cpumask_local_spread()
> only considers the local node. When the number of NIC queues exceeds
> the number of cores in the local node, it falls back to an online core
> directly. So when PS runs on node3 sending a calculated request,
> the performance is not as good as on node2. When the NIC and other I/O
> devices initialize their interrupt binding and the cores of the local
> node are used up, it is reasonable to return cores from the node closest
> to it.
> 
> Let's optimize it and find the nearest node through NUMA distance for the
> non-local NUMA nodes. The performance will be better if it returns the
> nearest node rather than a random node.
> 
> After this patch, the performance of node3 is the same as node2,
> that is, 26W QPS, when the network card is still in 'node2'. Since it
> returns the closest non-local NUMA node rather than a random node, it
> does no harm to others at least.

This is a little nicer:

--- a/lib/cpumask.c~lib-optimize-cpumask_local_spread-v3-fix
+++ a/lib/cpumask.c
@@ -254,7 +254,6 @@ static unsigned int __cpumask_local_spre
 	BUG();
 }
 
-static DEFINE_SPINLOCK(spread_lock);
 /**
  * cpumask_local_spread - select the i'th cpu with local numa cpu's first
  * @i: index number
@@ -270,6 +269,7 @@ unsigned int cpumask_local_spread(unsign
 {
 	static int node_dist[MAX_NUMNODES];
 	static bool used[MAX_NUMNODES];
+	static DEFINE_SPINLOCK(spread_lock);
 	unsigned long flags;
 	int cpu, j, id;
 
_
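
For readers following the changelog, here is a rough userspace sketch of
the "nearest node by NUMA distance" idea it describes. It is not the code
from the patch: NR_NODES, node_distance_tbl, nearest_unused_node() and
spread_order() are hypothetical names, the distance values are made up,
and there is no locking at all, whereas the kernel function guards its
static arrays with spread_lock.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NR_NODES 4	/* hypothetical node count, matching the 4-node example */

/* Made-up SLIT-style distances; row = from node, column = to node. */
static const int node_distance_tbl[NR_NODES][NR_NODES] = {
	{ 10, 12, 20, 22 },
	{ 12, 10, 22, 24 },
	{ 20, 22, 10, 12 },
	{ 22, 24, 12, 10 },
};

/*
 * Return the not-yet-used node with the smallest distance from 'local',
 * or -1 if every node has been used. Ties go to the lowest node id.
 */
static int nearest_unused_node(int local, const bool used[NR_NODES])
{
	int best = -1;
	int best_dist = INT_MAX;

	assert(local >= 0 && local < NR_NODES);

	for (int n = 0; n < NR_NODES; n++) {
		if (used[n])
			continue;
		if (node_distance_tbl[local][n] < best_dist) {
			best_dist = node_distance_tbl[local][n];
			best = n;
		}
	}
	return best;
}

/* Fill 'order' with all nodes, nearest to 'local' first. */
static void spread_order(int local, int order[NR_NODES])
{
	bool used[NR_NODES] = { false };

	for (int i = 0; i < NR_NODES; i++) {
		int n = nearest_unused_node(local, used);

		order[i] = n;
		used[n] = true;
	}
}
```

With this made-up table, spread_order(2, order) visits the nodes as
2, 3, 0, 1: the local node first, then its nearest neighbour, and the
remote nodes last, instead of jumping straight to an arbitrary online CPU
once the local node is exhausted.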