Message-ID: <e9ad936f-a091-e3ed-3e18-335bc0ff009e@amd.com>
Date:   Fri, 14 Oct 2022 14:00:12 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     Eric Dumazet <eric.dumazet@...il.com>,
        "David S . Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        netdev <netdev@...r.kernel.org>,
        Soheil Hassas Yeganeh <soheil@...gle.com>,
        Wei Wang <weiwan@...gle.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        Neal Cardwell <ncardwell@...gle.com>,
        Gautham Shenoy <gautham.shenoy@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Chen Yu <yu.c.chen@...el.com>,
        Abel Wu <wuyun.abel@...edance.com>,
        Yicong Yang <yangyicong@...ilicon.com>
Subject: Re: [PATCH net-next 6/7] net: keep sk->sk_forward_alloc as small as
 possible

Hello Eric,

Thank you for taking a look at the report.

On 10/13/2022 8:05 PM, Eric Dumazet wrote:
> On Thu, Oct 13, 2022 at 6:16 AM K Prateek Nayak <kprateek.nayak@....com> wrote:
>>
>> Hello Eric,
>>
>> I might have stumbled upon a possible performance regression observed in
>> some microbenchmarks caused by this series.
>>
>> tl;dr
>>
>> o When performing regression test against tip:sched/core, I noticed a
>>   regression in tbench for the baseline kernel. After ruling out
>>   scheduler changes, bisecting on tip:sched/core, then on Linus' tree and
>>   then on netdev/net-next led me to this series. Patch 6 of the series
>>   which makes changes based on the new reclaim strategy seem to be exact
>>   commit where the regression first started. Regression is also observed
>>   for netperf-tcp but not for netperf-udp after applying this series.
>>
> 
> Hi Prateek
> 
> Thanks for this detailed report.
> 
> Possibly your version of netperf is still using very small writes ?

netperf indeed does small writes. In this line from the report:

Hmean     256       6803.96 (   0.00%)     6427.55 *  -5.53%*

           ^
           |

This is the number of bytes per message sent / received per call,
passed to netperf via the -m and -M options
(https://hewlettpackard.github.io/netperf/doc/netperf.html#Options-common-to-TCP-UDP-and-SCTP-tests).
I'm running netperf from mmtests (https://github.com/gormanm/mmtests).
tbench, too, only sends short messages.
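
To make the message-size point concrete, here is a minimal sketch (not
netperf's actual implementation) that pushes a fixed amount of data over
loopback TCP using a given number of bytes per send call and counts the
calls. The function name, the 1 MiB total, and the payload are
hypothetical and chosen only for illustration.

```python
import socket
import threading


def send_in_chunks(total_bytes, msg_size):
    """Send total_bytes over loopback TCP, msg_size bytes per send call.

    Returns (number_of_send_calls, bytes_received_by_peer).
    Illustrative only: this is not how netperf or tbench is implemented.
    """
    server = socket.create_server(("127.0.0.1", 0))  # port 0: kernel picks one
    port = server.getsockname()[1]
    received = 0

    def drain():
        # Receiver side: read until the sender closes the connection.
        nonlocal received
        conn, _ = server.accept()
        with conn:
            while True:
                data = conn.recv(65536)
                if not data:
                    break
                received += len(data)

    reader = threading.Thread(target=drain)
    reader.start()

    payload = b"x" * msg_size
    calls = 0
    sent = 0
    with socket.create_connection(("127.0.0.1", port)) as sock:
        while sent < total_bytes:
            chunk = min(msg_size, total_bytes - sent)
            sock.sendall(payload[:chunk])  # one application-level write
            sent += chunk
            calls += 1

    reader.join()
    server.close()
    return calls, received


if __name__ == "__main__":
    for size in (256, 65536):
        calls, got = send_in_chunks(1 << 20, size)
        print(f"msg_size={size}: {calls} send calls, {got} bytes received")
```

With 256-byte messages the same payload takes far more send calls than
with 64 KiB messages, which is why per-call costs in the send path show
up so strongly in this kind of benchmark.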

> netperf uses /proc/sys/net/ipv4/tcp_wmem, to read tcp_wmem[1],
> and we have increased years ago /proc/sys/net/ipv4/tcp_wmem
> to match modern era needs.
> 
> # cat /proc/sys/net/ipv4/tcp_wmem
> 4096 262144 67108864

I noticed the defaults on my machine were:

$ cat /proc/sys/net/ipv4/tcp_wmem
4096    16384   4194304

I've rerun tbench after changing tcp_wmem to the following values:

$ cat /proc/sys/net/ipv4/tcp_wmem
4096    262144  67108864
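
For reference, the three tcp_wmem fields are the minimum, default and
maximum TCP send buffer sizes in bytes. A small sketch for comparing the
two settings shown above (the TcpWmem type and parse_tcp_wmem helper are
hypothetical names, not part of any tool):

```python
from typing import NamedTuple


class TcpWmem(NamedTuple):
    """The three values of /proc/sys/net/ipv4/tcp_wmem, in bytes."""
    min_bytes: int
    default_bytes: int
    max_bytes: int


def parse_tcp_wmem(text: str) -> TcpWmem:
    """Parse the whitespace-separated contents of tcp_wmem."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {text!r}")
    return TcpWmem(*(int(field) for field in fields))


if __name__ == "__main__":
    # Values copied from the two `cat` outputs in this mail.
    machine_default = parse_tcp_wmem("4096    16384   4194304")
    larger = parse_tcp_wmem("4096    262144  67108864")
    ratio = larger.default_bytes // machine_default.default_bytes
    print(f"default send buffer grew by a factor of {ratio}")
```

On a live system the same helper could be fed the contents of
/proc/sys/net/ipv4/tcp_wmem directly.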

Following are the results:

Clients:      good                 good + series         good + series + larger wmem
    1    574.93 (0.00 pct)       554.42 (-3.56 pct)      552.92 (-3.82 pct)
    2    1135.60 (0.00 pct)      1034.76 (-8.87 pct)     1036.94 (-8.68 pct)
    4    2117.29 (0.00 pct)      1796.97 (-15.12 pct)    1539.21 (-27.30 pct)
    8    3799.57 (0.00 pct)      3020.87 (-20.49 pct)    2797.98 (-26.36 pct)
   16    6129.79 (0.00 pct)      4536.99 (-25.98 pct)    4301.20 (-29.83 pct)
   32    11630.67 (0.00 pct)     8674.74 (-25.41 pct)    8199.28 (-29.50 pct)
   64    20895.77 (0.00 pct)     14417.26 (-31.00 pct)   14473.34 (-30.73 pct)
  128    31989.55 (0.00 pct)     20611.47 (-35.56 pct)   19671.08 (-38.50 pct)
  256    56388.57 (0.00 pct)     48822.72 (-13.41 pct)   48455.77 (-14.06 pct)
  512    59326.33 (0.00 pct)     43960.03 (-25.90 pct)   43968.59 (-25.88 pct)
 1024    58281.10 (0.00 pct)     41256.18 (-29.21 pct)   40550.97 (-30.42 pct)

Given the message size is small, I think wmem size does not
impact the benchmark results much.
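
The "pct" figures are the relative change against the "good" column. A
minimal sketch of that calculation follows; the helper name pct_change
is hypothetical, and the reported values appear to be truncated rather
than rounded, so the sketch only agrees with them to within 0.01.

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of value against baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline * 100.0


if __name__ == "__main__":
    # (clients, good, good + series) rows copied from the table above.
    rows = [
        (1, 574.93, 554.42),
        (128, 31989.55, 20611.47),
        (1024, 58281.10, 41256.18),
    ]
    for clients, good, with_series in rows:
        print(f"{clients:>5} clients: {pct_change(good, with_series):.2f} pct")
```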

> 
> (Well written applications tend to use large sendmsg() sizes)
> 
> What kind of NIC is used ? It seems it does not use GRO ?

For tbench and netperf, both the client and the server run on the
same machine over the loopback interface. I'm not sure whether the
NIC comes into the picture, but the following is the detail
gathered by running:
$ lspci | egrep -i --color 'network|ethernet|wireless|wi-fi'

Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe


> The only regression that has been noticed was when memcg was in the picture.
> Shakeel Butt sent patches to address this specific mm issue.
> Not sure what happened to the series (
> https://patchwork.kernel.org/project/linux-mm/list/?series=669584 )

I'm not running the benchmark in a cgroup / container, so I doubt
I'm hitting this issue. Based on Shakeel's suggestion, I'll
rerun the tests on v6.0-rc1.

> 
> We were aware of the possible performance implications, depending on the setup.
> At Google, we use RFS (Documentation/networking/scaling.rst) so that
> incoming ACK are handled on the cpu
> who did the sendmsg(), so the same per-cpu cache is used for the
> charge/uncharge.

I've noticed tbench tasks migrate quite a bit in the system. For
2 clients the count is in the 100s, for 32 clients it is in the 1000s,
and for 128 clients I observe 8325405 task migrations over a 60 second
run. I can try some strategic pinning and see if things change.
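
One possible shape for such pinning, as a rough sketch only: spread the
benchmark tasks round-robin over a set of CPUs and restrict each task to
its assigned CPU. The helper names and the PID / CPU lists are
hypothetical; os.sched_setaffinity is Linux-only, and changing the
affinity of another process needs suitable privileges.

```python
import os


def plan_pinning(pids, cpus):
    """Assign each pid one CPU, round-robin over the sorted CPU list."""
    cpu_list = sorted(cpus)
    if not cpu_list:
        raise ValueError("need at least one CPU to pin to")
    return {pid: cpu_list[i % len(cpu_list)] for i, pid in enumerate(pids)}


def apply_pinning(plan):
    """Restrict every pid in the plan to its single assigned CPU."""
    for pid, cpu in plan.items():
        os.sched_setaffinity(pid, {cpu})


if __name__ == "__main__":
    # Hypothetical PIDs; in practice these would be the tbench client
    # and server tasks. Only the plan is printed here, nothing is pinned.
    plan = plan_pinning([1001, 1002, 1003, 1004], cpus=[0, 1])
    for pid, cpu in plan.items():
        print(f"pid {pid} -> cpu {cpu}")
```

Whether a client and its matching server thread should share a CPU or
sit on neighbouring ones is a separate choice that this sketch does not
make.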

> 
> Thanks
> 
> [..snip..]
>
 
--
Thanks and Regards,
Prateek