Date:   Thu, 25 Jul 2019 17:05:09 +0800
From:   Zhangshaokun <zhangshaokun@...ilicon.com>
To:     Eric Dumazet <eric.dumazet@...il.com>,
        Eric Dumazet <edumazet@...gle.com>,
        Jiri Pirko <jiri@...nulli.us>, <netdev@...r.kernel.org>,
        <linux-kernel@...r.kernel.org>
CC:     "David S. Miller" <davem@...emloft.net>,
        "guoyang (C)" <guoyang2@...wei.com>,
        "zhudacai@...ilicon.com" <zhudacai@...ilicon.com>
Subject: Re: [RFC] performance regression with commit-id<adb03115f459> ("net:
 get rid of an signed integer overflow in ip_idents_reserve()")

Hi Eric,

Thanks for your quick reply.

On 2019/7/24 16:56, Eric Dumazet wrote:
> 
> 
> On 7/24/19 10:38 AM, Zhangshaokun wrote:
>> Hi,
>>
>> I've observed a significant performance regression with the following commit-id <adb03115f459>
>> ("net: get rid of an signed integer overflow in ip_idents_reserve()").
> 
> Yes this UBSAN false positive has been painful
> 
> 
> 
>>
>> Here is my test setup:
>> ----Server----
>> Cmd: iperf3 -s xxx.xxx.xxxx.xxx -p 10000 -i 0 -A 0
>> Kernel: 4.19.34
>> Server number: 32
>> Port: 10000 – 10032
>> CPU affinity: 0 – 32
>> CPU architecture: aarch64
>> NUMA node0 CPU(s): 0-23
>> NUMA node1 CPU(s): 24-47
>>
>> ----Client----
>> Cmd: iperf3 -u -c xxx.xxx.xxxx.xxx -p 10000 -l 16 -b 0 -t 0 -i 0 -A 8
>> Kernel: 4.19.34
>> Client number: 32
>> Port: 10000 – 10032
>> CPU affinity: 0 – 32
>> CPU architecture: aarch64
>> NUMA node0 CPU(s): 0-23
>> NUMA node1 CPU(s): 24-47
>>
>> Firstly, with patch <adb03115f459> ("net: get rid of an signed integer overflow in ip_idents_reserve()") applied,
>> the client's CPU usage is 100% and ip_idents_reserve() accounts for a large share of it, but the throughput is poor:
>> 03:08:32 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
>> 03:08:33 AM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
>> 03:08:33 AM      eth1      0.00 3461296.00      0.00 196049.97      0.00      0.00      0.00      0.00
>> 03:08:33 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
>>
>> Secondly, with that patch reverted and atomic_add_return() used instead, the result is better, as below (the two code paths are sketched right after these numbers):
>> 03:23:24 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
>> 03:23:25 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
>> 03:23:25 AM      eth1      0.00 12834590.00      0.00 726959.20      0.00      0.00      0.00      0.00
>> 03:23:25 AM      eth0      7.00     11.00      0.40      2.95      0.00      0.00      0.00      0.00
>>
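
For clarity, the only difference between the two kernels above is the
ID-update step in ip_idents_reserve(). Roughly (paraphrased from memory,
not copied verbatim from the tree):

        /* With <adb03115f459>: a cmpxchg retry loop on the shared counter,
         * which degrades when many CPUs contend on the same cache line. */
        do {
                old = (u32)atomic_read(p_id);
                new = old + delta + segs;
        } while (atomic_cmpxchg(p_id, old, new) != old);
        return new - segs;

        /* With the patch reverted: one unconditional atomic RMW, no retry loop. */
        return atomic_add_return(segs + delta, p_id) - segs;
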
>> Thirdly, with atomics removed from ip_idents_reserve() entirely and each CPU core allocating its own ID segment
>> (such as: core 0 allocates IDs 0 – 1023, core 1 allocates 1024 – 2047, ..., etc.),
>> the result is the best:
> 
> Not sure what you mean.
> 
> Less entropy in IPv4 ID is not going to help when fragments _are_ needed.
> 
> Send 40,000 datagrams of 2000 bytes each, add delays, reorders, and boom, most of the packets will be lost.
> 
> Just because your use case does not need proper IP IDs does not mean we can mess with them.
> 

Got it, thanks for the further explanation.
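
For the record, what I meant by per-CPU ID segments was roughly the sketch
below: each CPU takes a private batch of IDs from the shared counter, so the
shared cache line is touched once per batch instead of once per packet. The
names are made up and wrap-around / segs larger than a batch are ignored --
it is only an illustration, and I understand from your point above that
giving up IP ID entropy like this is not acceptable.

        #define IP_ID_BATCH     1024

        static atomic_t ip_id_next = ATOMIC_INIT(0);    /* shared counter */
        static DEFINE_PER_CPU(u32, ip_id_cur);
        static DEFINE_PER_CPU(u32, ip_id_end);

        static u32 ip_id_alloc(int segs)
        {
                u32 id;

                preempt_disable();
                if (__this_cpu_read(ip_id_cur) + segs > __this_cpu_read(ip_id_end)) {
                        /* refill: one shared atomic op per IP_ID_BATCH IDs */
                        u32 start = atomic_add_return(IP_ID_BATCH, &ip_id_next)
                                    - IP_ID_BATCH;

                        __this_cpu_write(ip_id_cur, start);
                        __this_cpu_write(ip_id_end, start + IP_ID_BATCH);
                }
                id = __this_cpu_read(ip_id_cur);
                __this_cpu_write(ip_id_cur, id + segs);
                preempt_enable();

                return id;
        }
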

> If you need to send packets very fast, maybe use AF_PACKET?
> 

Ok, I will try it later.
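
As a starting point I would expect something along the lines of the sketch
below: a minimal AF_PACKET sender (untested, needs CAP_NET_RAW). "eth1" and
the empty 60-byte frame are placeholders; a real test would build the
Ethernet/IP/UDP headers and a 16-byte payload itself to match the
iperf3 -l 16 case.

        #include <arpa/inet.h>          /* htons */
        #include <linux/if_ether.h>     /* ETH_P_ALL, ETH_P_IP, ETH_ALEN */
        #include <linux/if_packet.h>    /* struct sockaddr_ll */
        #include <net/if.h>             /* if_nametoindex */
        #include <sys/socket.h>
        #include <unistd.h>

        int main(void)
        {
                /* Raw socket: we hand the kernel complete Ethernet frames, so
                 * the IP output path (and ip_idents_reserve()) is bypassed. */
                int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
                if (fd < 0)
                        return 1;

                struct sockaddr_ll sll = {
                        .sll_family   = AF_PACKET,
                        .sll_protocol = htons(ETH_P_IP),
                        .sll_ifindex  = (int)if_nametoindex("eth1"), /* placeholder */
                        .sll_halen    = ETH_ALEN,
                };

                unsigned char frame[60] = { 0 };        /* placeholder frame */

                for (long i = 0; i < 10000000; i++)
                        sendto(fd, frame, sizeof(frame), 0,
                               (struct sockaddr *)&sll, sizeof(sll));

                close(fd);
                return 0;
        }
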

>> 03:27:06 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
>> 03:27:07 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
>> 03:27:07 AM      eth1      0.00 14275505.00      0.00 808573.53      0.00      0.00      0.00      0.00
>> 03:27:07 AM      eth0      0.00      2.00      0.00      0.18      0.00      0.00      0.00      0.00
>>
>> Because atomic operation performance becomes a bottleneck as the CPU core count increases, can we revert the patch or
>> use an ID segment for each CPU core instead?
> 
> 
> This has been discussed in the past.
> 
> https://lore.kernel.org/lkml/b0160f4b-b996-b0ee-405a-3d5f1866272e@gmail.com/
> 
> We can revert it now that UBSAN has been fixed.
> 
> Or even use Peter's patch: https://lore.kernel.org/lkml/20181101172739.GA3196@hirez.programming.kicks-ass.net/
> 

I have tried this patch after removing try_cmpxchg, since that API is not available on arm64 here (the substitution is sketched below the numbers):
09:21:16 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
09:21:17 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
09:21:17 PM      eth1      0.00 10434613.00      0.00 591023.00      0.00      0.00      0.00      0.00
09:21:17 PM      eth0      1.00      0.00      0.12      0.00      0.00      0.00      0.00      0.00

The result is 10434613.00 pps, which is lower than the atomic_add_return() variant (12834590.00 pps).
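
In case it matters, the substitution was along these lines (a rough sketch of
the idea with types loosened for brevity, not the exact diff I ran):

        /* try_cmpxchg() style, as in the patch (roughly); atomic_try_cmpxchg()
         * updates 'old' with the observed value when it fails: */
        old = atomic_read(p_id);
        do {
                new = old + delta + segs;
        } while (!atomic_try_cmpxchg(p_id, &old, new));

        /* Open-coded equivalent with atomic_cmpxchg(), which returns the
         * previous value: */
        old = atomic_read(p_id);
        for (;;) {
                int prev;

                new = old + delta + segs;
                prev = atomic_cmpxchg(p_id, old, new);
                if (prev == old)
                        break;
                old = prev;
        }
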
Any thoughts?

Thanks,
Shaokun

> However, you will still hit a shared cache line badly, no matter what.
> 
> Some arches are known to have terrible LL/SC implementation :/
> 
