Message-ID: <80cfa8a8-5143-df42-2524-6ce4cade1592@hisilicon.com>
Date: Thu, 25 Jul 2019 17:05:09 +0800
From: Zhangshaokun <zhangshaokun@...ilicon.com>
To: Eric Dumazet <eric.dumazet@...il.com>,
Eric Dumazet <edumazet@...gle.com>,
Jiri Pirko <jiri@...nulli.us>, <netdev@...r.kernel.org>,
<linux-kernel@...r.kernel.org>
CC: "David S. Miller" <davem@...emloft.net>,
"guoyang (C)" <guoyang2@...wei.com>,
"zhudacai@...ilicon.com" <zhudacai@...ilicon.com>
Subject: Re: [RFC] performance regression with commit-id<adb03115f459> ("net:
get rid of an signed integer overflow in ip_idents_reserve()")
Hi Eric,
Thanks for your quick reply.
On 2019/7/24 16:56, Eric Dumazet wrote:
>
>
> On 7/24/19 10:38 AM, Zhangshaokun wrote:
>> Hi,
>>
>> I've observed a significant performance regression with the following commit-id <adb03115f459>
>> ("net: get rid of an signed integer overflow in ip_idents_reserve()").
>
> Yes, this UBSAN false positive has been painful.
>
>
>
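For context, my understanding of the change: the commit replaced a single
atomic_add_return() with a read + cmpxchg retry loop so that UBSAN would not
flag the signed addition. A simplified userspace analogue with C11 atomics --
a sketch of the two approaches, not the actual kernel code:

#include <stdatomic.h>

/* Before the commit: one atomic read-modify-write per reservation. */
static unsigned int reserve_add_return(atomic_uint *id, unsigned int delta,
				       unsigned int segs)
{
	/* fetch_add returns the old value, so old + delta is the first
	 * ID of the reserved range. */
	return atomic_fetch_add(id, delta + segs) + delta;
}

/* After the commit: a compare-and-swap retry loop.  Under heavy
 * contention every failed CAS re-reads the shared cache line, which
 * is where the extra cost shows up. */
static unsigned int reserve_cmpxchg(atomic_uint *id, unsigned int delta,
				    unsigned int segs)
{
	unsigned int old, new;

	do {
		old = atomic_load(id);
		new = old + delta + segs;
	} while (!atomic_compare_exchange_weak(id, &old, new));

	return new - segs;
}

Both return the first ID of the reserved range; only the contention behaviour
differs.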
>>
>> Here are my test scenes:
>> ----Server----
>> Cmd: iperf3 -s xxx.xxx.xxxx.xxx -p 10000 -i 0 -A 0
>> Kernel: 4.19.34
>> Server number: 32
>> Port: 10000 – 10032
>> CPU affinity: 0 – 32
>> CPU architecture: aarch64
>> NUMA node0 CPU(s): 0-23
>> NUMA node1 CPU(s): 24-47
>>
>> ----Client----
>> Cmd: iperf3 -u -c xxx.xxx.xxxx.xxx -p 10000 -l 16 -b 0 -t 0 -i 0 -A 8
>> Kernel: 4.19.34
>> Client number: 32
>> Port: 10000 – 10032
>> CPU affinity: 0 – 32
>> CPU architecture: aarch64
>> NUMA node0 CPU(s): 0-23
>> NUMA node1 CPU(s): 24-47
>>
>> Firstly, with commit <adb03115f459> ("net: get rid of an signed integer overflow in ip_idents_reserve()") applied,
>> the client's CPU usage is 100% and ip_idents_reserve() accounts for most of it, but the throughput is poor:
>> 03:08:32 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
>> 03:08:33 AM eth0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>> 03:08:33 AM eth1 0.00 3461296.00 0.00 196049.97 0.00 0.00 0.00 0.00
>> 03:08:33 AM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>
>> Secondly, reverting that patch and using atomic_add_return() instead gives a better result:
>> 03:23:24 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
>> 03:23:25 AM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>> 03:23:25 AM eth1 0.00 12834590.00 0.00 726959.20 0.00 0.00 0.00 0.00
>> 03:23:25 AM eth0 7.00 11.00 0.40 2.95 0.00 0.00 0.00 0.00
>>
>> Thirdly, with atomics removed from ip_idents_reserve() entirely and each CPU core allocating from its own
>> ID segment (for example, core 0 allocates IDs 0 – 1023, core 1 allocates 1024 – 2047, and so on),
>> the result is the best:
>
> Not sure what you mean.
>
> Less entropy in IPv4 ID is not going to help when fragments _are_ needed.
>
> Send 40,000 datagrams of 2000 bytes each, add delays, reorders, and boom, most of the packets will be lost.
>
> Just because your use case does not need proper IP IDs does not mean we can mess with them.
>
Got it, thanks for the further explanation.
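To clarify the idea anyway: each core would carve IDs out of a private batch
and only touch the shared counter when the batch is exhausted. A hypothetical
userspace sketch (ID_BATCH and the helper name are made up for illustration,
not tested code):

#include <stdatomic.h>

#define ID_BATCH 1024u

static atomic_uint next_batch;                 /* shared batch allocator */
static _Thread_local unsigned int batch_base;  /* this core's current batch */
static _Thread_local unsigned int batch_used = ID_BATCH; /* force first refill */

/* Reserve one ID; the shared counter is touched only once per ID_BATCH
 * reservations, so cache-line contention drops by a factor of ID_BATCH. */
static unsigned int reserve_percpu_id(void)
{
	if (batch_used == ID_BATCH) {
		batch_base = atomic_fetch_add(&next_batch, ID_BATCH);
		batch_used = 0;
	}
	return batch_base + batch_used++;
}

I do see that this makes consecutive IDs from one core fully predictable,
which is exactly the entropy loss you are pointing out.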
> If you need to send packets very fast, maybe use AF_PACKET ?
>
Ok, I will try it later.
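For the record, the minimal AF_PACKET sketch I intend to start from (error
handling omitted; "eth1" is just an example, and the frame would of course
need real Ethernet/IP/UDP headers):

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	/* SOCK_RAW on AF_PACKET (needs CAP_NET_RAW): we hand the kernel
	 * complete L2 frames, so the IP stack -- including
	 * ip_idents_reserve() -- is bypassed entirely. */
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

	struct sockaddr_ll sll;
	memset(&sll, 0, sizeof(sll));
	sll.sll_family   = AF_PACKET;
	sll.sll_protocol = htons(ETH_P_ALL);
	sll.sll_ifindex  = if_nametoindex("eth1");
	bind(fd, (struct sockaddr *)&sll, sizeof(sll));

	unsigned char frame[60] = { 0 };  /* placeholder, not a valid frame */
	send(fd, frame, sizeof(frame), 0);

	close(fd);
	return 0;
}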
>> 03:27:06 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
>> 03:27:07 AM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>> 03:27:07 AM eth1 0.00 14275505.00 0.00 808573.53 0.00 0.00 0.00 0.00
>> 03:27:07 AM eth0 0.00 2.00 0.00 0.18 0.00 0.00 0.00 0.00
>>
>> Because atomic operations become a bottleneck as the CPU core count increases, can we revert the patch or
>> use an ID segment for each CPU core instead?
>
>
> This has been discussed in the past.
>
> https://lore.kernel.org/lkml/b0160f4b-b996-b0ee-405a-3d5f1866272e@gmail.com/
>
> We can revert now that UBSAN has been fixed.
>
> Or even use Peter's patch: https://lore.kernel.org/lkml/20181101172739.GA3196@hirez.programming.kicks-ass.net/
>
I have tried this patch, with try_cmpxchg() removed, because that API does not exist on arm64:
09:21:16 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
09:21:17 PM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:21:17 PM eth1 0.00 10434613.00 0.00 591023.00 0.00 0.00 0.00 0.00
09:21:17 PM eth0 1.00 0.00 0.12 0.00 0.00 0.00 0.00 0.00
The result is 10434613.00 pps, which is less than with atomic_add_return() (12834590.00 pps).
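For reference, I open-coded the try_cmpxchg() semantics on top of plain
atomic_cmpxchg(), roughly like this (a sketch of the substitution in kernel
context; the helper name is hypothetical and this is not my exact diff):

/* try_cmpxchg()-style helper built from atomic_cmpxchg(): returns true
 * on success; on failure *old is updated to the value actually seen,
 * so the caller's retry loop needs no extra atomic_read(). */
static inline bool ip_try_cmpxchg(atomic_t *v, int *old, int new)
{
	int seen = atomic_cmpxchg(v, *old, new);

	if (seen == *old)
		return true;
	*old = seen;
	return false;
}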
Any thoughts?
Thanks,
Shaokun
> However, you will still hit a shared cache line badly, no matter what.
>
> Some arches are known to have terrible LL/SC implementation :/
>