Message-ID: <80701f7a-e7c6-eb86-4018-67033f0823bf@gmail.com>
Date: Mon, 23 Aug 2021 20:34:11 -0700
From: David Ahern <dsahern@...il.com>
To: Yunsheng Lin <linyunsheng@...wei.com>, davem@...emloft.net,
kuba@...nel.org
Cc: alexander.duyck@...il.com, linux@...linux.org.uk, mw@...ihalf.com,
linuxarm@...neuler.org, yisen.zhuang@...wei.com,
salil.mehta@...wei.com, thomas.petazzoni@...tlin.com,
hawk@...nel.org, ilias.apalodimas@...aro.org, ast@...nel.org,
daniel@...earbox.net, john.fastabend@...il.com,
akpm@...ux-foundation.org, peterz@...radead.org, will@...nel.org,
willy@...radead.org, vbabka@...e.cz, fenghua.yu@...el.com,
guro@...com, peterx@...hat.com, feng.tang@...el.com, jgg@...pe.ca,
mcroce@...rosoft.com, hughd@...gle.com, jonathan.lemon@...il.com,
alobakin@...me, willemb@...gle.com, wenxu@...oud.cn,
cong.wang@...edance.com, haokexin@...il.com, nogikh@...gle.com,
elver@...gle.com, yhs@...com, kpsingh@...nel.org,
andrii@...nel.org, kafai@...com, songliubraving@...com,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
bpf@...r.kernel.org, chenhao288@...ilicon.com, edumazet@...gle.com,
yoshfuji@...ux-ipv6.org, dsahern@...nel.org, memxor@...il.com,
linux@...pel-privat.de, atenart@...nel.org, weiwan@...gle.com,
ap420073@...il.com, arnd@...db.de,
mathew.j.martineau@...ux.intel.com, aahringo@...hat.com,
ceggers@...i.de, yangbo.lu@....com, fw@...len.de,
xiangxia.m.yue@...il.com, linmiaohe@...wei.com
Subject: Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
On 8/22/21 9:32 PM, Yunsheng Lin wrote:
>
> I assumed the "either Rx or Tx is cpu bound" meant either Rx or Tx is the
> bottleneck?
yes.
>
> It seems iperf3 supports Tx ZC. I retested using iperf3; the Rx settings
> were not changed during testing, and the MTU is 1500:
-Z == sendfile API. That works fine to a point and that point is well
below 100G.
I mean TCP with MSG_ZEROCOPY and SO_ZEROCOPY.
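For reference, a minimal sketch of that send path, roughly following
Documentation/networking/msg_zerocopy.rst (send_zc() is just an
illustrative helper, not the actual test program; error handling and
completion parsing are trimmed):

#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY	60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY	0x4000000
#endif

static void send_zc(int fd, const void *buf, size_t len)
{
	int one = 1;
	char control[128];
	struct msghdr msg = {
		.msg_control	= control,
		.msg_controllen	= sizeof(control),
	};

	/* opt in once per socket */
	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		perror("setsockopt(SO_ZEROCOPY)");

	/* request zerocopy per call: pages are pinned, not copied */
	if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
		perror("send(MSG_ZEROCOPY)");

	/* completion shows up on the error queue (poll for POLLERR in a
	 * real program); only after it is read may buf be reused
	 */
	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		perror("recvmsg(MSG_ERRQUEUE)");
}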
>
> IOMMU in strict mode:
> 1. Tx ZC case:
>    22 Gbit with Tx being the bottleneck (CPU bound)
> 2. Tx non-ZC case with pfrag pool enabled:
>    40 Gbit with Rx being the bottleneck (CPU bound)
> 3. Tx non-ZC case with pfrag pool disabled:
>    30 Gbit; the bottleneck does not seem to be CPU bound, as neither Rx
>    nor Tx has a single CPU reaching about 100% usage.
>
>>
>> At 1500 MTU, lowering CPU usage on the Tx side does not accomplish much
>> for throughput since the Rx side is at 100% CPU.
>
> As the performance data above shows, enabling ZC does not seem to help
> when the IOMMU is involved: it gives about a 30% performance degradation
> when the pfrag pool is disabled and about 50% when the pfrag pool is
> enabled.
In a past response you showed numbers for the Tx ZC API with a custom
program. That program showed a dramatic reduction in CPU cycles for Tx
with the ZC API.
>
>>
>> At 3300 MTU you have ~47% of the pps for the same throughput. Lower pps
>> reduces Rx processing and lowers the CPU needed to process the incoming
>> stream. Then, using the Tx ZC API, you lower the Tx overhead, allowing a
>> single stream to go faster - sending more data, which in the end results
>> in much higher pps and throughput. At the limit you are CPU bound (both
>> ends in my testing, as the Rx side approaches the max pps and the Tx
>> side continually tries to send data).
>>
>> Lowering CPU usage on the Tx side is a win regardless of whether there
>> is a big increase in throughput at 1500 MTU, since that configuration is
>> an Rx CPU-bound problem. Hence my point that we have a good starting
>> point for lowering CPU usage on the Tx side; we should improve it rather
>> than add per-socket page pools.
>
> Actually it is not a per-socket page pool; the page pool is still per
> NAPI. This patchset adds a multi-allocation context to the page pool, so
> that Tx can reuse the same page pool as Rx, which is quite useful if
> aRFS is enabled.
>
>>
>> You can stress the Tx side and emphasize its overhead by modifying the
>> receiver to drop the data on Rx rather than copy it to userspace, which
>> is a huge bottleneck (e.g., MSG_TRUNC on recv). This allows the single flow
>
> As frag pages are now supported in the page pool for Rx, the Rx side is
> probably not a bottleneck any more, at least not with the IOMMU in
> strict mode.
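(For anyone following along: the Rx frag support referred to here is the
page_pool frag API. Roughly, a driver with a pool created with
PP_FLAG_PAGE_FRAG | PP_FLAG_DMA_MAP does something like the below on
refill; rx_refill_one() is illustrative only, not any particular
driver's actual code.)

#include <linux/errno.h>
#include <net/page_pool.h>

/* one page is DMA-mapped once by the pool and then sliced into
 * several Rx buffers via the frag API
 */
static int rx_refill_one(struct page_pool *pool, unsigned int buf_len,
			 dma_addr_t *dma)
{
	unsigned int offset;
	struct page *page;

	page = page_pool_dev_alloc_frag(pool, &offset, buf_len);
	if (!page)
		return -ENOMEM;

	/* the pool did the mapping, so with the IOMMU in strict mode
	 * there is no per-buffer map/unmap (and no IOTLB flush) here
	 */
	*dma = page_pool_get_dma_addr(page) + offset;
	return 0;
}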
>
> It seems iperf3 does not support MSG_TRUNC yet. Is there any testing
> tool that supports MSG_TRUNC, or do I have to hack the kernel or the
> iperf3 tool to do that?
https://github.com/dsahern/iperf, mods branch
--zc_api is the Tx ZC API; --rx_drop adds MSG_TRUNC to recv.
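In effect --rx_drop just adds MSG_TRUNC to the normal recv() call; a
minimal sketch of that receive loop (rx_drop_loop() is illustrative,
not the actual iperf code):

#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* with MSG_TRUNC on a TCP socket the kernel consumes the data but
 * skips the copy to userspace, so the receiver mostly measures
 * stack/driver overhead rather than copy cost
 */
static void rx_drop_loop(int fd, char *buf, size_t len)
{
	ssize_t n;

	do {
		n = recv(fd, buf, len, MSG_TRUNC);
	} while (n > 0);

	if (n < 0)
		perror("recv(MSG_TRUNC)");
}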
>
>> stream to go faster and emphasize Tx bottlenecks as the pps at 3300
>> approaches the top pps at 1500. E.g., doing this with iperf3 shows the
>> spinlock overhead in tcp_sendmsg, overhead related to 'select', and
>> then gup_pgd_range.
>
> When the IOMMU is in strict mode, the IOMMU overhead seems to be much
> bigger than the spinlock overhead (23% vs 10%).
>
> Anyway, I still think ZC mostly benefits packets bigger than a certain
> size and the case where the IOMMU is disabled.
>