Message-ID: <beef3b28-6818-df7b-eaad-8569cac5d79b@gmail.com>
Date:   Thu, 18 Nov 2021 09:19:39 -0800
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Íñigo Huguet <ihuguet@...hat.com>,
        Edward Cree <ecree.xilinx@...il.com>, habetsm.xilinx@...il.com
Cc:     netdev@...r.kernel.org, Dinan Gunawardena <dinang@...inx.com>,
        Pablo Cascon <pabloc@...inx.com>
Subject: Re: Bad performance in RX with sfc 40G



On 11/18/21 7:14 AM, Íñigo Huguet wrote:
> Hello,
> 
> While doing some tests a few weeks ago I noticed very low RX
> performance with 40G Solarflare NICs. With iperf3 I got more than
> 30Gbps in TX, but only around 15Gbps in RX. NICs from other vendors
> could both send and receive over 30Gbps.
> 
> I was running the tests with multiple parallel streams in iperf3 (-P 8).
> 
> The models used are SFC9140 and SFC9220.
> 
> Perf showed that most of the time was being spent in
> `native_queued_spin_lock_slowpath`. Tracing the calls to it with
> bpftrace, I found that most of them came from __napi_poll > efx_poll
> > efx_fast_push_rx_descriptors > __alloc_pages >
> get_page_from_freelist > ...
> 
> Can you please help me investigate the issue? At first sight, it
> looks like a suboptimal memory allocation strategy, or maybe a
> failure in the page recycling strategy...
> 
> This is the output of bpftrace: the two most frequently repeated
> call chains, both from sfc.
> 
> @[
>     native_queued_spin_lock_slowpath+1
>     _raw_spin_lock+26
>     rmqueue_bulk+76
>     get_page_from_freelist+2295
>     __alloc_pages+214
>     efx_fast_push_rx_descriptors+640
>     efx_poll+660
>     __napi_poll+42
>     net_rx_action+547
>     __softirqentry_text_start+208
>     __irq_exit_rcu+179
>     common_interrupt+131
>     asm_common_interrupt+30
>     cpuidle_enter_state+199
>     cpuidle_enter+41
>     do_idle+462
>     cpu_startup_entry+25
>     start_kernel+2465
>     secondary_startup_64_no_verify+194
> ]: 2650
> @[
>     native_queued_spin_lock_slowpath+1
>     _raw_spin_lock+26
>     rmqueue_bulk+76
>     get_page_from_freelist+2295
>     __alloc_pages+214
>     efx_fast_push_rx_descriptors+640
>     efx_poll+660
>     __napi_poll+42
>     net_rx_action+547
>     __softirqentry_text_start+208
>     __irq_exit_rcu+179
>     common_interrupt+131
>     asm_common_interrupt+30
>     cpuidle_enter_state+199
>     cpuidle_enter+41
>     do_idle+462
>     cpu_startup_entry+25
>     secondary_startup_64_no_verify+194
> ]: 17119
> 
> --
> Íñigo Huguet
> 
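
(For reference, stack counts like the ones quoted above can be collected
with a bpftrace one-liner along these lines; the exact invocation is an
assumption about how the trace was gathered, not taken from the report:

bpftrace -e 'kprobe:native_queued_spin_lock_slowpath { @[kstack] = count(); }'

The kprobe fires on each entry into the spinlock slow path and counts
every distinct kernel stack that reaches it.)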


You could try to:

Make the RX ring buffers bigger (ethtool -G eth0 rx 8192)
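
You can check the current and maximum supported ring sizes first
(8192 is only accepted if the hardware allows a ring that large):

ethtool -g eth0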

and/or

Make sure your TCP socket receive buffer is smaller than the number of frames in the ring buffer:

echo "4096 131072 2097152" >/proc/sys/net/ipv4/tcp_rmem

You can also try the latest net-next, as TCP got a change that helps with this case:

f35f821935d8df76f9c92e2431a225bdff938169 tcp: defer skb freeing after socket lock is released
