Message-Id: <20191107194942.734bc867e1c9578d07cf1712@linux-foundation.org>
Date: Thu, 7 Nov 2019 19:49:42 -0800
From: Andrew Morton <akpm@...ux-foundation.org>
To: Shaokun Zhang <zhangshaokun@...ilicon.com>
Cc: <linux-kernel@...r.kernel.org>, yuqi jin <jinyuqi@...wei.com>,
Mike Rapoport <rppt@...ux.ibm.com>,
Paul Burton <paul.burton@...s.com>,
Michal Hocko <mhocko@...e.com>,
Michael Ellerman <mpe@...erman.id.au>,
Anshuman Khandual <anshuman.khandual@....com>
Subject: Re: [PATCH v3] lib: optimize cpumask_local_spread()
On Thu, 7 Nov 2019 09:44:08 +0800 Shaokun Zhang <zhangshaokun@...ilicon.com> wrote:
> On multi-processor NUMA systems, an I/O driver looks for CPU cores
> to bind its IRQs to. When the cores in the local NUMA node have all
> been used, it is better to pick cores from the node closest to the
> local NUMA node instead of immediately choosing any online CPU.
>
> On a Huawei Kunpeng 920 server, there are 4 NUMA nodes (0 - 3) in the
> 2-cpu system (0 - 1). We ran a PS (parameter server) business test in
> which the client initiates a request through the network card and the
> server responds to the request after calculation.
> When two PS processes run on node2 and node3 separately and the
> network card is located on 'node2', which is in cpu1, the performance
> of node2 (26W QPS) and node3 (22W QPS) was different.
> It is better for the NIC queues to be bound to the cpu1 cores in turn,
> so that XPS is also properly initialized, but cpumask_local_spread
> only considers the local node. When the number of NIC queues exceeds
> the number of cores in the local node, it falls back to any online
> core directly. So when PS runs on node3 and sends a calculated
> request, the performance is not as good as on node2. When the NIC and
> other I/O devices initialize their interrupt binding and the cores of
> the local node are used up, it is reasonable to return the node
> closest to it.
>
> Let's optimize it and find the nearest node through NUMA distance for the
> non-local NUMA nodes. The performance will be better if it returns the
> nearest node rather than a random node.
>
> After this patch, the performance of node3 is the same as node2,
> that is 26W QPS, when the network card is still in 'node2'. Since it will
> return the closest non-local NUMA node rather than a random node, it does
> no harm to others at least.
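
As a rough illustration of the nearest-node selection described in the
changelog above, here is a minimal userspace sketch. It is not the code
from the patch: NR_NODES, node_distance_tbl and find_nearest_node() are
hypothetical names, and the distance values are made up. A kernel
implementation would presumably use node_distance() rather than a
hard-coded table.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NR_NODES 4	/* hypothetical: four NUMA nodes, as in the quoted setup */

/* Hypothetical SLIT-style distance table; the values are invented. */
static const int node_distance_tbl[NR_NODES][NR_NODES] = {
	{ 10, 12, 20, 22 },
	{ 12, 10, 22, 24 },
	{ 20, 22, 10, 12 },
	{ 22, 24, 12, 10 },
};

/*
 * Return the not-yet-used node closest to 'local' and mark it used,
 * or return -1 once every node has been handed out.
 */
static int find_nearest_node(int local, bool used[NR_NODES])
{
	int best = -1, min_dist = INT_MAX;

	assert(local >= 0 && local < NR_NODES);

	for (int n = 0; n < NR_NODES; n++) {
		if (used[n])
			continue;
		if (node_distance_tbl[local][n] < min_dist) {
			min_dist = node_distance_tbl[local][n];
			best = n;
		}
	}
	if (best >= 0)
		used[best] = true;
	return best;
}
```

Called repeatedly with local = 2 and a zeroed used[] array, this hands
out the local node first and then the remaining nodes in order of
increasing distance, which is the behaviour the changelog asks for once
the local cores are exhausted.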
This is a little nicer:
--- a/lib/cpumask.c~lib-optimize-cpumask_local_spread-v3-fix
+++ a/lib/cpumask.c
@@ -254,7 +254,6 @@ static unsigned int __cpumask_local_spre
 	BUG();
 }
 
-static DEFINE_SPINLOCK(spread_lock);
 /**
  * cpumask_local_spread - select the i'th cpu with local numa cpu's first
  * @i: index number
@@ -270,6 +269,7 @@ unsigned int cpumask_local_spread(unsign
 {
 	static int node_dist[MAX_NUMNODES];
 	static bool used[MAX_NUMNODES];
+	static DEFINE_SPINLOCK(spread_lock);
 	unsigned long flags;
 	int cpu, j, id;
 
_
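
For readers unfamiliar with the idiom in the diff above: a lock declared
static inside a function still has a single program-wide instance, but
its name is only visible to that function. A minimal userspace sketch,
assuming a C11 atomic_flag as a stand-in for a kernel spinlock;
next_slot() and its variables are hypothetical names, not part of the
patch:

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Hand out slot numbers 0..nr_slots-1 in rotation. The lock and the
 * cursor are function-scope statics: one shared instance each, set up
 * once, and not nameable from anywhere else in the file.
 */
static unsigned int next_slot(unsigned int nr_slots)
{
	static atomic_flag lock = ATOMIC_FLAG_INIT;
	static unsigned int cursor;
	unsigned int slot;

	assert(nr_slots > 0);

	/* Spin until the flag is acquired (userspace stand-in for spin_lock). */
	while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
		;
	slot = cursor;
	cursor = (cursor + 1) % nr_slots;
	atomic_flag_clear_explicit(&lock, memory_order_release);

	return slot;
}
```

Keeping the lock next to the static data it protects makes it harder
for unrelated code in the same file to take it by accident.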