Message-ID: <00856999-739f-fd73-eddd-d71e4e94962e@hisilicon.com>
Date:   Wed, 13 Nov 2019 10:46:05 +0800
From:   Shaokun Zhang <zhangshaokun@...ilicon.com>
To:     Michal Hocko <mhocko@...nel.org>
CC:     <linux-kernel@...r.kernel.org>, yuqi jin <jinyuqi@...wei.com>,
        "Andrew Morton" <akpm@...ux-foundation.org>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        "Paul Burton" <paul.burton@...s.com>,
        Michael Ellerman <mpe@...erman.id.au>,
        Anshuman Khandual <anshuman.khandual@....com>,
        <netdev@...r.kernel.org>
Subject: Re: [PATCH v3] lib: optimize cpumask_local_spread()

Hi Michal,

On 2019/11/12 19:56, Michal Hocko wrote:
> On Mon 11-11-19 10:02:37, Shaokun Zhang wrote:
>> Hi Michal,
>>
>> On 2019/11/8 18:31, Michal Hocko wrote:
>>> This changelog looks better, thanks! I still have some questions though.
>>> Btw. cpumask_local_spread is used by the networking code but I do not
>>> see net guys involved (Cc netdev)
>>
>> Oh, I forgot to involve the net guys, sorry.
>>
>>>
>>> On Thu 07-11-19 09:44:08, Shaokun Zhang wrote:
>>>> From: yuqi jin <jinyuqi@...wei.com>
>>>>
>>>> On a multi-processor NUMA system, an I/O driver looks for CPU cores to
>>>> which its IRQs shall be bound. When the CPU cores in the local NUMA node
>>>> have all been used, it is better to fall back to the node closest to the
>>>> local node instead of choosing any online CPU immediately.
>>>>
>>>> On the Huawei Kunpeng 920 server, there are 4 NUMA nodes (0-3) in the
>>>> 2-CPU system (0-1).
>>>
>>> Please send a topology of this server (numactl -H).
>>>
>>
>> available: 4 nodes (0-3)
>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
>> node 0 size: 63379 MB
>> node 0 free: 61899 MB
>> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
>> node 1 size: 64509 MB
>> node 1 free: 63942 MB
>> node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
>> node 2 size: 64509 MB
>> node 2 free: 63056 MB
>> node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
>> node 3 size: 63997 MB
>> node 3 free: 63420 MB
>> node distances:
>> node   0   1   2   3
>>   0:  10  16  32  33
>>   1:  16  10  25  32
>>   2:  32  25  10  16
>>   3:  33  32  16  10
>>
>>>> We ran a PS (parameter server) business test: the client initiates a
>>>> request through the network card, and the server responds to the request
>>>> after doing the calculation.
>>>
>>> Is the benchmark publicly available?
>>>
>>
>> Sorry, the PS we test is not open, but I think Redis behaves the same as
>> PS at the macro level. With 24 Redis servers on node2 and 24 on node3, if
>> IRQs 24-47 and the XPS of the NIC are not bound to node3, the Redis
>> servers on node3 will not perform well.
> 
> Are there any other benchmarks showing improvements?
>

Sorry, I don't have any. The issue is clear, and the patch helps both the
actual Parameter Server test and the Redis test.
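
To make the proximity argument concrete, here is a tiny standalone
user-space sketch (not kernel code, just an illustration) that encodes the
distance matrix from the numactl output quoted above and prints the order
in which the other nodes should be considered for node 2:

#include <stdio.h>

#define NR_NODES 4

/* SLIT-style distances, copied from the numactl output above. */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 16, 32, 33 },
	{ 16, 10, 25, 32 },
	{ 32, 25, 10, 16 },
	{ 33, 32, 16, 10 },
};

int main(void)
{
	int local = 2, used[NR_NODES] = { 0 };

	used[local] = 1;
	printf("local node: %d\n", local);

	/* Pick the remaining nodes in order of increasing distance. */
	for (int k = 1; k < NR_NODES; k++) {
		int best = -1;

		for (int n = 0; n < NR_NODES; n++)
			if (!used[n] &&
			    (best < 0 || dist[local][n] < dist[local][best]))
				best = n;
		used[best] = 1;
		printf("fallback %d: node %d (distance %d)\n",
		       k, best, dist[local][best]);
	}
	return 0;
}

It prints node 3 (distance 16) first, then node 1 (25), then node 0 (32);
that is why the overflow IRQs ought to land on node 3 rather than node 0.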

>>>> When two PS processes run on node2 and node3 respectively, and the
>>>> network card is attached to node2, which belongs to cpu1, the
>>>> performance of node2 (26W QPS) and node3 (22W QPS) differs.
>>>> Ideally the NIC queues are bound to the cpu1 cores in turn, so that XPS
>>>> is also properly initialized, but cpumask_local_spread() only considers
>>>> the local node: when the number of NIC queues exceeds the number of
>>>> cores in the local node, it simply returns any online core. So when PS
>>>> runs on node3 and sends a calculated request, the performance is not as
>>>> good as on node2. Since the NIC and other I/O devices have to initialize
>>>> their interrupt bindings anyway, it is reasonable, once the cores of the
>>>> local node are used up, to fall back to the node closest to it.
>>>
>>> Can you post cpu affinities before and after this patch?
>>>
>>
>> Before this patch
>> Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node
>> 2
>> Euler:~ # cat /proc/irq/345/smp_affinity    #IRQ0
>> 00000000,00010000,00000000
> 
> This representation is awkward to parse. Could you add smp_affinity_list
> please? It would save quite some head scratching.
> 
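
(For what it's worth, the bitmap is read in 32-bit groups from the right,
so 00000000,00010000,00000000 has bit 16 set in the CPU 32-63 group,
i.e. CPU 32 + 16 = 48; the smp_affinity_list output below says the same
thing more directly.)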

before patch
Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node
2
Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list
48
Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
0
Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list
24
Euler:/sys/bus/pci #

after patch
Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node
2
Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list
48
Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
72
Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list
24
Euler:/sys/bus/pci #
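
The change can be summarized roughly as follows. This is a simplified
sketch of the idea only, not the actual patch; nth_closest_node() is a
hypothetical helper standing in for "walk the nodes in order of increasing
node_distance() from @node":

unsigned int cpumask_local_spread(unsigned int i, int node)
{
	int cpu, n;

	/* Wrap: we always want a cpu. */
	i %= num_online_cpus();

	if (node == NUMA_NO_NODE) {
		for_each_cpu(cpu, cpu_online_mask)
			if (i-- == 0)
				return cpu;
	} else {
		/*
		 * Visit the nodes from the closest to the most distant one;
		 * nth_closest_node(node, 0) is @node itself (hypothetical
		 * helper), so the local node is still preferred.
		 */
		for (n = 0; n < nr_node_ids; n++) {
			int next = nth_closest_node(node, n);

			for_each_cpu_and(cpu, cpumask_of_node(next),
					 cpu_online_mask)
				if (i-- == 0)
					return cpu;
		}
	}
	BUG();
}

With a fallback like that, the queue IRQs that overflow node 2 end up on
node 3 instead of wrapping to node 0 (e.g. IRQ 369 moves from CPU 0 to
CPU 72), which is what the smp_affinity_list output above shows.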

Thanks,
Shaokun

