Message-ID: <80701f7a-e7c6-eb86-4018-67033f0823bf@gmail.com>
Date: Mon, 23 Aug 2021 20:34:11 -0700
From: David Ahern <dsahern@...il.com>
To: Yunsheng Lin <linyunsheng@...wei.com>, davem@...emloft.net,
kuba@...nel.org
Cc: alexander.duyck@...il.com, linux@...linux.org.uk, mw@...ihalf.com,
linuxarm@...neuler.org, yisen.zhuang@...wei.com,
salil.mehta@...wei.com, thomas.petazzoni@...tlin.com,
hawk@...nel.org, ilias.apalodimas@...aro.org, ast@...nel.org,
daniel@...earbox.net, john.fastabend@...il.com,
akpm@...ux-foundation.org, peterz@...radead.org, will@...nel.org,
willy@...radead.org, vbabka@...e.cz, fenghua.yu@...el.com,
guro@...com, peterx@...hat.com, feng.tang@...el.com, jgg@...pe.ca,
mcroce@...rosoft.com, hughd@...gle.com, jonathan.lemon@...il.com,
alobakin@...me, willemb@...gle.com, wenxu@...oud.cn,
cong.wang@...edance.com, haokexin@...il.com, nogikh@...gle.com,
elver@...gle.com, yhs@...com, kpsingh@...nel.org,
andrii@...nel.org, kafai@...com, songliubraving@...com,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
bpf@...r.kernel.org, chenhao288@...ilicon.com, edumazet@...gle.com,
yoshfuji@...ux-ipv6.org, dsahern@...nel.org, memxor@...il.com,
linux@...pel-privat.de, atenart@...nel.org, weiwan@...gle.com,
ap420073@...il.com, arnd@...db.de,
mathew.j.martineau@...ux.intel.com, aahringo@...hat.com,
ceggers@...i.de, yangbo.lu@....com, fw@...len.de,
xiangxia.m.yue@...il.com, linmiaohe@...wei.com
Subject: Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
On 8/22/21 9:32 PM, Yunsheng Lin wrote:
>
> I assumed the "either Rx or Tx is cpu bound" meant either Rx or Tx is the
> bottleneck?
yes.
>
> It seems iperf3 supports Tx ZC. I retested using iperf3; the Rx settings
> were not changed during testing, and the MTU is 1500:
-Z == sendfile API. That works fine to a point and that point is well
below 100G.
I mean TCP with MSG_ZEROCOPY and SO_ZEROCOPY.
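For reference, a minimal sketch of that send path, roughly following
Documentation/networking/msg_zerocopy.rst (send_zc() is just an
illustrative helper, not the actual test program; error handling and
completion parsing are trimmed):

#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY	60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY	0x4000000
#endif

static void send_zc(int fd, const void *buf, size_t len)
{
	int one = 1;
	char control[128];
	struct msghdr msg = {
		.msg_control	= control,
		.msg_controllen	= sizeof(control),
	};

	/* opt in once per socket */
	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		perror("setsockopt(SO_ZEROCOPY)");

	/* request zerocopy per call: pages are pinned, not copied */
	if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
		perror("send(MSG_ZEROCOPY)");

	/* completion shows up on the error queue (poll for POLLERR in a
	 * real program); only after it is read may buf be reused
	 */
	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		perror("recvmsg(MSG_ERRQUEUE)");
}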
>
> IOMMU in strict mode:
> 1. Tx ZC case:
>    22 Gbit with Tx being the bottleneck (CPU bound)
> 2. Tx non-ZC case with pfrag pool enabled:
>    40 Gbit with Rx being the bottleneck (CPU bound)
> 3. Tx non-ZC case with pfrag pool disabled:
>    30 Gbit; the bottleneck does not seem to be CPU bound, as neither Rx
>    nor Tx has a single CPU reaching about 100% usage.
>
>>
>> At 1500 MTU, lowering CPU usage on the Tx side does not accomplish much
>> for throughput since the Rx side is at 100% CPU.
>
> As the performance data above shows, enabling ZC does not seem to help
> when the IOMMU is involved: it gives about a 30% performance degradation
> when the pfrag pool is disabled and about 50% when the pfrag pool is
> enabled.
In a past response you showed numbers for the Tx ZC API with a custom
program. That program showed a dramatic reduction in CPU cycles for Tx
with the ZC API.
>
>>
>> At 3300 MTU you have ~47% of the pps for the same throughput. Lower pps
>> reduces Rx processing and lowers the CPU needed to process the incoming
>> stream. Then, using the Tx ZC API, you lower the Tx overhead, allowing a
>> single stream to go faster - sending more data, which in the end results
>> in much higher pps and throughput. At the limit you are CPU bound (both
>> ends in my testing, as the Rx side approaches the max pps and the Tx
>> side continually tries to send data).
>>
>> Lowering CPU usage on the Tx side is a win regardless of whether there
>> is a big increase in throughput at 1500 MTU, since that configuration is
>> an Rx CPU-bound problem. Hence my point that we have a good starting
>> point for lowering CPU usage on the Tx side; we should improve it rather
>> than add per-socket page pools.
>
> Actually it is not a per-socket page pool; the page pool is still per
> NAPI. This patchset adds a multi-allocation context to the page pool, so
> that Tx can reuse the same page pool as Rx, which is quite useful if
> aRFS is enabled.
>
>>
>> You can stress the Tx side and emphasize its overhead by modifying the
>> receiver to drop the data on Rx rather than copy it to userspace, which
>> is a huge bottleneck (e.g., MSG_TRUNC on recv). This allows the single flow
>
> As frag pages are now supported in the page pool for Rx, the Rx side is
> probably not a bottleneck any more, at least not with the IOMMU in
> strict mode.
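(For anyone following along: the Rx frag support referred to here is the
page_pool frag API. Roughly, a driver with a pool created with
PP_FLAG_PAGE_FRAG | PP_FLAG_DMA_MAP does something like the below on
refill; rx_refill_one() is illustrative only, not any particular
driver's actual code.)

#include <linux/errno.h>
#include <net/page_pool.h>

/* one page is DMA-mapped once by the pool and then sliced into
 * several Rx buffers via the frag API
 */
static int rx_refill_one(struct page_pool *pool, unsigned int buf_len,
			 dma_addr_t *dma)
{
	unsigned int offset;
	struct page *page;

	page = page_pool_dev_alloc_frag(pool, &offset, buf_len);
	if (!page)
		return -ENOMEM;

	/* the pool did the mapping, so with the IOMMU in strict mode
	 * there is no per-buffer map/unmap (and no IOTLB flush) here
	 */
	*dma = page_pool_get_dma_addr(page) + offset;
	return 0;
}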
>
> It seems iperf3 does not support MSG_TRUNC yet. Is there any testing
> tool that supports MSG_TRUNC, or do I have to hack the kernel or the
> iperf3 tool to do that?
https://github.com/dsahern/iperf, mods branch
--zc_api is the Tx ZC API; --rx_drop adds MSG_TRUNC to recv.
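In effect --rx_drop just adds MSG_TRUNC to the normal recv() call; a
minimal sketch of that receive loop (rx_drop_loop() is illustrative,
not the actual iperf code):

#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* with MSG_TRUNC on a TCP socket the kernel consumes the data but
 * skips the copy to userspace, so the receiver mostly measures
 * stack/driver overhead rather than copy cost
 */
static void rx_drop_loop(int fd, char *buf, size_t len)
{
	ssize_t n;

	do {
		n = recv(fd, buf, len, MSG_TRUNC);
	} while (n > 0);

	if (n < 0)
		perror("recv(MSG_TRUNC)");
}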
>
>> stream to go faster and emphasize Tx bottlenecks as the pps at 3300
>> approaches the top pps at 1500. E.g., doing this with iperf3 shows the
>> spinlock overhead in tcp_sendmsg, overhead related to 'select', and
>> then gup_pgd_range.
>
> When the IOMMU is in strict mode, the IOMMU overhead seems to be much
> bigger than the spinlock overhead (23% vs 10%).
>
> Anyway, I still think ZC mostly benefits packets bigger than a certain
> size and the case where the IOMMU is disabled.
>