Message-ID: <CANn89iKpaJsqeMDQYySmUr2=n8D+dyXKtK0u7hF_8kW10mMm1A@mail.gmail.com>
Date: Thu, 13 Oct 2022 07:35:26 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Eric Dumazet <eric.dumazet@...il.com>,
"David S . Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
netdev <netdev@...r.kernel.org>,
Soheil Hassas Yeganeh <soheil@...gle.com>,
Wei Wang <weiwan@...gle.com>,
Shakeel Butt <shakeelb@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>,
Gautham Shenoy <gautham.shenoy@....com>,
Mel Gorman <mgorman@...hsingularity.net>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Chen Yu <yu.c.chen@...el.com>,
Abel Wu <wuyun.abel@...edance.com>,
Yicong Yang <yangyicong@...ilicon.com>
Subject: Re: [PATCH net-next 6/7] net: keep sk->sk_forward_alloc as small as possible
On Thu, Oct 13, 2022 at 6:16 AM K Prateek Nayak <kprateek.nayak@....com> wrote:
>
> Hello Eric,
>
> I might have stumbled upon a performance regression in some
> microbenchmarks, caused by this series.
>
> tl;dr
>
> o When performing regression testing against tip:sched/core, I noticed a
> regression in tbench for the baseline kernel. After ruling out
> scheduler changes, bisecting on tip:sched/core, then on Linus' tree, and
> then on netdev/net-next led me to this series. Patch 6 of the series,
> which makes changes based on the new reclaim strategy, seems to be the
> exact commit where the regression first started. The regression is also
> observed for netperf-tcp but not for netperf-udp after applying this
> series.
>
Hi Prateek,
Thanks for this detailed report.
Possibly your version of netperf is still using very small writes?
netperf reads tcp_wmem[1] from /proc/sys/net/ipv4/tcp_wmem to size its
default sends, and we increased /proc/sys/net/ipv4/tcp_wmem years ago
to match modern-era needs:
# cat /proc/sys/net/ipv4/tcp_wmem
4096 262144 67108864
(Well-written applications tend to use large sendmsg() sizes.)
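As an illustration, here is a minimal bulk-sender sketch (hypothetical,
not netperf code; it assumes a sink listening on 127.0.0.1:5001): the
per-send() size decides how many times the fixed per-call cost (syscall
entry, charge/uncharge accounting) is paid for the same volume of data.

/* Hypothetical minimal sender: pushes ~1 GB with the per-send() size
 * given on the command line.  With 64-byte writes the fixed per-call
 * cost is paid ~16M times; with 16 KB writes, only ~64K times.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
	size_t len = argc > 1 ? strtoul(argv[1], NULL, 0) : 16384;
	char *buf = calloc(1, len);
	struct sockaddr_in dst = {
		.sin_family = AF_INET,
		.sin_port = htons(5001),	/* assumed sink port */
	};
	long long total = 0;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);
	if (fd < 0 || connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
		perror("connect");
		return 1;
	}
	while (total < (1LL << 30)) {		/* push ~1 GB total */
		ssize_t ret = send(fd, buf, len, 0);

		if (ret < 0) {
			perror("send");
			return 1;
		}
		total += ret;
	}
	close(fd);
	return 0;
}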
What kind of NIC is used? It seems it does not use GRO?
The only regression that had been noticed was when memcg was in the picture.
Shakeel Butt sent patches to address this specific mm issue.
I am not sure what happened to the series
(https://patchwork.kernel.org/project/linux-mm/list/?series=669584).
We were aware of the possible performance implications, depending on the setup.
At Google, we use RFS (Documentation/networking/scaling.rst) so that
incoming ACKs are handled on the CPU that did the sendmsg(),
meaning the same per-cpu cache is used for the charge/uncharge.
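For context, the per-cpu reserve added in commit 3cd3399dd7a8 behaves
roughly like the sketch below (simplified, not the exact kernel code):
charges accumulate in a per-cpu counter, and the shared atomic is only
touched when the local batch crosses a ~1MB threshold.

/* Simplified sketch of the per-cpu reserve idea (commit 3cd3399dd7a8).
 * The common case stays cpu-local; the shared protocol-wide atomic is
 * touched only when the local batch grows past ~1MB worth of pages.
 */
#define SK_MEMORY_PCPU_RESERVE (1 << (20 - PAGE_SHIFT))	/* ~1MB of pages */

static inline void sk_memory_allocated_add(struct sock *sk, int amt)
{
	int local_reserve;

	preempt_disable();
	local_reserve = __this_cpu_add_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
	if (local_reserve >= SK_MEMORY_PCPU_RESERVE) {
		/* Rare slow path: flush the whole local batch to the
		 * shared counter.
		 */
		__this_cpu_sub(*sk->sk_prot->per_cpu_fw_alloc, local_reserve);
		atomic_long_add(local_reserve, sk->sk_prot->memory_allocated);
	}
	preempt_enable();
}

With RFS keeping a flow's tx and rx processing on one cpu, the
charge/uncharge pairs mostly cancel inside that per-cpu counter.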
Thanks
> I would like to know if this regression is expected based on some of the
> design considerations in the patch. I'll leave a detailed account of the
> discovery, bisection, benchmark results, and some preliminary analysis
> below. I've also attached the configs used for testing on the AMD and
> Intel systems.
>
> Details:
>
> When testing community patches, I observed a large degradation in baseline
> tbench numbers for tip:sched/core between older test reports
> (example: https://lore.kernel.org/lkml/d49aeabd-ee4e-cc81-06d1-b16029a901ee@amd.com/)
> and recent test reports on the AMD Zen3 system I was testing on
> (example: https://lore.kernel.org/lkml/7975dcbe-97b3-7e6c-4697-5f316731c287@amd.com/).
>
> Following is the direct baseline-to-baseline comparison for tbench
> from the two reports mentioned above on the AMD Zen3 system (2 x 64C/128T):
>
> NPS modes are used to logically divide a single socket into
> multiple NUMA regions.
> Following is the NUMA configuration for each NPS mode on the system:
>
> NPS1: Each socket is a NUMA node.
> Total 2 NUMA nodes in the dual-socket machine.
>
> Node 0: 0-63, 128-191
> Node 1: 64-127, 192-255
>
> NPS2: Each socket is further logically divided into 2 NUMA regions.
> Total 4 NUMA nodes exist across the 2 sockets.
>
> Node 0: 0-31, 128-159
> Node 1: 32-63, 160-191
> Node 2: 64-95, 192-223
> Node 3: 96-127, 224-255
>
> NPS4: Each socket is logically divided into 4 NUMA regions.
> Total 8 NUMA nodes exist across the 2 sockets.
>
> Node 0: 0-15, 128-143
> Node 1: 16-31, 144-159
> Node 2: 32-47, 160-175
> Node 3: 48-63, 176-191
> Node 4: 64-79, 192-207
> Node 5: 80-95, 208-223
> Node 6: 96-111, 224-239
> Node 7: 112-127, 240-255
>
> Note: All tests were performed with performance governor.
>
> The two tip:sched/core baselines compared below are:
>
> commit 5531ecffa4b9 ("sched: Add update_current_exec_runtime helper")
> commit 7e9518baed4c ("sched/fair: Move call to list_last_entry() in detach_tasks")
>
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
>
> NPS1
>
> Clients: tip (5531ecffa4b9) tip (7e9518baed4c)
> 1 573.26 (0.00 pct) 550.66 (-3.94 pct)
> 2 1131.19 (0.00 pct) 1009.69 (-10.74 pct)
> 4 2100.07 (0.00 pct) 1795.32 (-14.51 pct)
> 8 3809.88 (0.00 pct) 2971.16 (-22.01 pct)
> 16 6560.72 (0.00 pct) 4627.98 (-29.45 pct)
> 32 12203.23 (0.00 pct) 8065.15 (-33.90 pct)
> 64 22389.81 (0.00 pct) 14994.32 (-33.03 pct)
> 128 32449.37 (0.00 pct) 5175.73 (-84.04 pct) *
> 256 58962.40 (0.00 pct) 48763.57 (-17.29 pct)
> 512 59608.71 (0.00 pct) 43780.78 (-26.55 pct)
> 1024 58037.02 (0.00 pct) 40341.84 (-30.48 pct)
>
> NPS2
>
> Clients: tip (5531ecffa4b9) tip (7e9518baed4c)
> 1 574.20 (0.00 pct) 551.06 (-4.02 pct)
> 2 1131.56 (0.00 pct) 1000.76 (-11.55 pct)
> 4 2132.26 (0.00 pct) 1737.02 (-18.53 pct)
> 8 3812.20 (0.00 pct) 2992.31 (-21.50 pct)
> 16 6457.61 (0.00 pct) 4579.29 (-29.08 pct)
> 32 12263.82 (0.00 pct) 9120.73 (-25.62 pct)
> 64 22224.11 (0.00 pct) 14918.58 (-32.87 pct)
> 128 33040.38 (0.00 pct) 20830.61 (-36.95 pct)
> 256 56547.25 (0.00 pct) 47708.18 (-15.63 pct)
> 512 56220.67 (0.00 pct) 43721.79 (-22.23 pct)
> 1024 56048.88 (0.00 pct) 40920.49 (-26.99 pct)
>
> NPS4
>
> Clients: tip (5531ecffa4b9) tip (7e9518baed4c)
> 1 575.50 (0.00 pct) 549.22 (-4.56 pct)
> 2 1138.70 (0.00 pct) 1000.08 (-12.17 pct)
> 4 2070.66 (0.00 pct) 1794.78 (-13.32 pct)
> 8 3811.70 (0.00 pct) 3008.50 (-21.07 pct)
> 16 6312.80 (0.00 pct) 4804.71 (-23.88 pct)
> 32 11418.14 (0.00 pct) 9156.57 (-19.80 pct)
> 64 19671.16 (0.00 pct) 14901.45 (-24.24 pct)
> 128 30258.53 (0.00 pct) 20771.20 (-31.35 pct)
> 256 55838.10 (0.00 pct) 47033.88 (-15.76 pct)
> 512 55586.44 (0.00 pct) 43429.01 (-21.87 pct)
> 1024 56370.35 (0.00 pct) 39271.27 (-30.33 pct)
>
> * Note: Ignore this data point as tbench runs into an ACPI idle driver issue
> (https://lore.kernel.org/lkml/20220921063638.2489-1-kprateek.nayak@amd.com/)
>
> When bisecting on tip:sched/core, I found the offending commit to be the
> following merge commit:
>
> o commit: 53aa930dc4ba ("Merge branch 'sched/warnings' into sched/core, to pick up WARN_ON_ONCE() conversion commit")
>
> This regression was also observed on Linus' tree and started between
> v5.19 and v6.0-rc1. Bisecting on Linus' tree led us to the following
> merge commit as the offending commit:
>
> o commit: f86d1fbbe785 ("Merge tag 'net-next-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next")
>
> Bisecting the problem on netdev/net-next, between the changes that went
> into v6.0-rc1, led me to the following offending commit:
>
> o commit: 4890b686f408 ("net: keep sk->sk_forward_alloc as small as possible")
>
> This change was tracked back to the series "net: reduce
> tcp_memory_allocated inflation"
> (https://lore.kernel.org/netdev/20220609063412.2205738-1-eric.dumazet@gmail.com/)
>
> The commit 4890b686f408 ("net: keep sk->sk_forward_alloc as small as
> possible") alone does not make sense, as it assumes that the reclaims
> are less expensive as a result of the per-cpu reserves implemented in
>
> o commit: 0defbb0af775 ("net: add per_cpu_fw_alloc field to struct proto")
> o commit: 3cd3399dd7a8 ("net: implement per-cpu reserves for memory_allocated")
>
> which are part of this series. Following are the results of tbench and
> netperf after applying the series on Linus' tree on top of the last
> good commit:
>
> good: 526942b8134c ("Merge tag 'ata-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata")
>
> On dual socket AMD 3rd Generation EPYC Processor
> (2 x 64C/128T AMD EPYC 7713) in NPS1 mode:
>
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
>
> Clients: good good + series
> 1 574.93 (0.00 pct) 554.42 (-3.56 pct)
> 2 1135.60 (0.00 pct) 1034.76 (-8.87 pct)
> 4 2117.29 (0.00 pct) 1796.97 (-15.12 pct)
> 8 3799.57 (0.00 pct) 3020.87 (-20.49 pct)
> 16 6129.79 (0.00 pct) 4536.99 (-25.98 pct)
> 32 11630.67 (0.00 pct) 8674.74 (-25.41 pct)
> 64 20895.77 (0.00 pct) 14417.26 (-31.00 pct)
> 128 31989.55 (0.00 pct) 20611.47 (-35.56 pct)
> 256 56388.57 (0.00 pct) 48822.72 (-13.41 pct)
> 512 59326.33 (0.00 pct) 43960.03 (-25.90 pct)
> 1024 58281.10 (0.00 pct) 41256.18 (-29.21 pct)
>
> ~~~~~~~~~~~
> ~ netperf ~
> ~~~~~~~~~~~
>
> - netperf-udp
>
> kernel good good + series
> Hmean send-64 346.45 ( 0.00%) 346.53 ( 0.02%)
> Hmean send-128 688.39 ( 0.00%) 688.53 ( 0.02%)
> Hmean send-256 1355.60 ( 0.00%) 1358.59 ( 0.22%)
> Hmean send-1024 5314.81 ( 0.00%) 5302.48 ( -0.23%)
> Hmean send-2048 9757.81 ( 0.00%) 9996.26 * 2.44%*
> Hmean send-3312 15033.99 ( 0.00%) 15289.91 ( 1.70%)
> Hmean send-4096 16009.90 ( 0.00%) 16441.11 * 2.69%*
> Hmean send-8192 25039.37 ( 0.00%) 24316.10 ( -2.89%)
> Hmean send-16384 46928.16 ( 0.00%) 47746.29 ( 1.74%)
> Hmean recv-64 346.45 ( 0.00%) 346.53 ( 0.02%)
> Hmean recv-128 688.39 ( 0.00%) 688.53 ( 0.02%)
> Hmean recv-256 1355.60 ( 0.00%) 1358.59 ( 0.22%)
> Hmean recv-1024 5314.80 ( 0.00%) 5302.47 ( -0.23%)
> Hmean recv-2048 9757.76 ( 0.00%) 9996.25 * 2.44%*
> Hmean recv-3312 15033.95 ( 0.00%) 15289.83 ( 1.70%)
> Hmean recv-4096 16009.84 ( 0.00%) 16441.05 * 2.69%*
> Hmean recv-8192 25039.12 ( 0.00%) 24315.81 ( -2.89%)
> Hmean recv-16384 46927.59 ( 0.00%) 47746.12 ( 1.74%)
>
> - netperf-tcp
>
> kernel good good + series
> Hmean 64 1846.16 ( 0.00%) 1795.84 * -2.73%*
> Hmean 128 3583.91 ( 0.00%) 3448.49 * -3.78%*
> Hmean 256 6803.96 ( 0.00%) 6427.55 * -5.53%*
> Hmean 1024 21474.74 ( 0.00%) 17722.92 * -17.47%*
> Hmean 2048 32904.31 ( 0.00%) 28104.16 * -14.59%*
> Hmean 3312 42468.33 ( 0.00%) 35616.94 * -16.13%*
> Hmean 4096 45453.37 ( 0.00%) 38130.18 * -16.11%*
> Hmean 8192 54372.39 ( 0.00%) 47438.78 * -12.75%*
> Hmean 16384 61173.73 ( 0.00%) 55459.64 * -9.34%*
>
> On dual socket 3rd Generation Intel Xeon Scalable Processor
> (2 x 32C/64T Intel Xeon Platinum 8362):
>
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
>
> Clients: good good + series
> 1 424.31 (0.00 pct) 399.00 (-5.96 pct)
> 2 844.12 (0.00 pct) 797.10 (-5.57 pct)
> 4 1667.07 (0.00 pct) 1543.72 (-7.39 pct)
> 8 3289.42 (0.00 pct) 3036.96 (-7.67 pct)
> 16 6611.76 (0.00 pct) 6095.99 (-7.80 pct)
> 32 12760.69 (0.00 pct) 11451.82 (-10.25 pct)
> 64 17750.13 (0.00 pct) 15796.17 (-11.00 pct)
> 128 15282.56 (0.00 pct) 14492.78 (-5.16 pct)
> 256 36000.91 (0.00 pct) 31496.12 (-12.51 pct)
> 512 35020.84 (0.00 pct) 28975.34 (-17.26 pct)
>
> ~~~~~~~~~~~
> ~ netperf ~
> ~~~~~~~~~~~
>
> - netperf-udp
>
> kernel good good + series
> Hmean send-64 234.69 ( 0.00%) 232.32 * -1.01%*
> Hmean send-128 471.02 ( 0.00%) 469.08 * -0.41%*
> Hmean send-256 934.75 ( 0.00%) 914.79 * -2.14%*
> Hmean send-1024 3594.09 ( 0.00%) 3562.71 * -0.87%*
> Hmean send-2048 6625.58 ( 0.00%) 6720.12 * 1.43%*
> Hmean send-3312 10843.34 ( 0.00%) 10818.02 * -0.23%*
> Hmean send-4096 12291.20 ( 0.00%) 12329.75 * 0.31%*
> Hmean send-8192 19017.73 ( 0.00%) 19348.73 * 1.74%*
> Hmean send-16384 34952.23 ( 0.00%) 34886.12 * -0.19%*
> Hmean recv-64 234.69 ( 0.00%) 232.32 * -1.01%*
> Hmean recv-128 471.02 ( 0.00%) 469.08 * -0.41%*
> Hmean recv-256 934.75 ( 0.00%) 914.79 * -2.14%*
> Hmean recv-1024 3594.09 ( 0.00%) 3562.71 * -0.87%*
> Hmean recv-2048 6625.58 ( 0.00%) 6720.12 * 1.43%*
> Hmean recv-3312 10843.34 ( 0.00%) 10817.95 * -0.23%*
> Hmean recv-4096 12291.20 ( 0.00%) 12329.75 * 0.31%*
> Hmean recv-8192 19017.72 ( 0.00%) 19348.73 * 1.74%*
> Hmean recv-16384 34952.23 ( 0.00%) 34886.12 * -0.19%*
>
> - netperf-tcp
>
> kernel good good + series
> Hmean 64 2032.37 ( 0.00%) 1979.42 * -2.61%*
> Hmean 128 3951.42 ( 0.00%) 3789.31 * -4.10%*
> Hmean 256 7295.39 ( 0.00%) 6989.24 * -4.20%*
> Hmean 1024 19844.93 ( 0.00%) 18863.06 * -4.95%*
> Hmean 2048 27493.40 ( 0.00%) 25395.34 * -7.63%*
> Hmean 3312 33224.91 ( 0.00%) 30145.59 * -9.27%*
> Hmean 4096 35082.60 ( 0.00%) 31510.58 * -10.18%*
> Hmean 8192 39842.02 ( 0.00%) 36776.27 * -7.69%*
> Hmean 16384 44765.12 ( 0.00%) 41373.83 * -7.58%*
>
> On the Zen3 system, running
> perf record -a -e ibs_op//pp --raw-samples -- ./tbench_32_clients.sh
> the following are the perf reports for kernels based on:
>
> o good (11483.6 MB/sec)
>
> 3.54% swapper [kernel.vmlinux] [k] acpi_processor_ffh_cstate_enter
> 2.01% tbench_srv [kernel.vmlinux] [k] copy_user_generic_string
> 1.59% tbench [kernel.vmlinux] [k] net_rx_action
> 1.58% tbench_srv [kernel.vmlinux] [k] net_rx_action
> 1.46% swapper [kernel.vmlinux] [k] psi_group_change
> 1.45% tbench_srv [kernel.vmlinux] [k] read_tsc
> 1.43% tbench [kernel.vmlinux] [k] read_tsc
> 1.24% tbench [kernel.vmlinux] [k] copy_user_generic_string
> 1.15% swapper [kernel.vmlinux] [k] check_preemption_disabled
> 1.10% tbench [kernel.vmlinux] [k] __entry_text_start
> 1.10% tbench [kernel.vmlinux] [k] tcp_ack
> 1.00% tbench_srv [kernel.vmlinux] [k] tcp_ack
> 0.95% tbench [kernel.vmlinux] [k] psi_group_change
> 0.94% swapper [kernel.vmlinux] [k] read_tsc
> 0.93% tbench_srv [kernel.vmlinux] [k] psi_group_change
> 0.91% swapper [kernel.vmlinux] [k] menu_select
> 0.87% swapper [kernel.vmlinux] [k] __switch_to
>
> o good + series (7903.55 MB/sec)
>
> 3.66% tbench_srv [kernel.vmlinux] [k] tcp_cleanup_rbuf
> 3.31% tbench [kernel.vmlinux] [k] tcp_cleanup_rbuf
> 3.30% tbench [kernel.vmlinux] [k] tcp_recvmsg_locked
> 3.16% tbench_srv [kernel.vmlinux] [k] tcp_recvmsg_locked
> 2.76% swapper [kernel.vmlinux] [k] acpi_processor_ffh_cstate_enter
> 2.10% tbench [kernel.vmlinux] [k] tcp_ack_update_rtt
> 2.05% tbench_srv [kernel.vmlinux] [k] copy_user_generic_string
> 2.04% tbench_srv [kernel.vmlinux] [k] tcp_ack_update_rtt
> 1.84% tbench [kernel.vmlinux] [k] check_preemption_disabled
> 1.47% tbench [kernel.vmlinux] [k] __sk_mem_reduce_allocated
> 1.25% tbench_srv [kernel.vmlinux] [k] __sk_mem_reduce_allocated
> 1.23% tbench_srv [kernel.vmlinux] [k] check_preemption_disabled
> 1.11% swapper [kernel.vmlinux] [k] psi_group_change
> 1.10% tbench [kernel.vmlinux] [k] copy_user_generic_string
> 0.95% tbench [kernel.vmlinux] [k] entry_SYSCALL_64
> 0.87% swapper [kernel.vmlinux] [k] check_preemption_disabled
> 0.85% tbench [kernel.vmlinux] [k] read_tsc
> 0.84% tbench_srv [kernel.vmlinux] [k] read_tsc
> 0.82% swapper [kernel.vmlinux] [k] __switch_to
> 0.81% tbench [kernel.vmlinux] [k] __mod_memcg_state
> 0.76% tbench [kernel.vmlinux] [k] psi_group_change
>
> On the Intel system, running
> perf record -a -e cycles:ppp -- ./tbench_32_clients.sh
> the following are the perf reports for kernels based on:
>
> o good (12561 MB/sec)
>
> 20.62% swapper [kernel.vmlinux] [k] mwait_idle_with_hints.constprop.0
> 1.55% tbench_srv [kernel.vmlinux] [k] copy_user_enhanced_fast_string
> 1.37% swapper [kernel.vmlinux] [k] psi_group_change
> 0.89% swapper [kernel.vmlinux] [k] check_preemption_disabled
> 0.86% tbench tbench [.] child_run
> 0.84% tbench [kernel.vmlinux] [k] nft_do_chain
> 0.83% tbench_srv [kernel.vmlinux] [k] nft_do_chain
> 0.79% tbench_srv [kernel.vmlinux] [k] strncpy
> 0.77% tbench [kernel.vmlinux] [k] strncpy
>
> o good + series (11213 MB/sec)
>
> 19.11% swapper [kernel.vmlinux] [k] mwait_idle_with_hints.constprop.0
> 1.90% tbench_srv [kernel.vmlinux] [k] __sk_mem_reduce_allocated
> 1.86% tbench [kernel.vmlinux] [k] __sk_mem_reduce_allocated
> 1.40% tbench_srv [kernel.vmlinux] [k] copy_user_enhanced_fast_string
> 1.31% swapper [kernel.vmlinux] [k] psi_group_change
> 0.86% tbench tbench [.] child_run
> 0.83% tbench_srv [kernel.vmlinux] [k] check_preemption_disabled
> 0.82% swapper [kernel.vmlinux] [k] check_preemption_disabled
> 0.80% tbench [kernel.vmlinux] [k] check_preemption_disabled
> 0.78% tbench [kernel.vmlinux] [k] nft_do_chain
> 0.78% tbench_srv [kernel.vmlinux] [k] update_sd_lb_stats.constprop.0
> 0.77% tbench_srv [kernel.vmlinux] [k] nft_do_chain
>
> I've inlined some comments below.
>
> On 6/9/2022 12:04 PM, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@...gle.com>
> >
> > Currently, tcp_memory_allocated can hit tcp_mem[] limits quite fast.
> >
> > Each TCP socket can forward allocate up to 2 MB of memory, even after
> > flow became less active.
> >
> > 10,000 sockets can have reserved 20 GB of memory,
> > and we have no shrinker in place to reclaim that.
> >
> > Instead of trying to reclaim the extra allocations in some places,
> > just keep sk->sk_forward_alloc values as small as possible.
> >
> > This should not impact performance too much now we have per-cpu
> > reserves: Changes to tcp_memory_allocated should not be too frequent.
> >
> > For sockets not using SO_RESERVE_MEM:
> > - idle sockets (no packets in tx/rx queues) have zero forward alloc.
> > - non idle sockets have a forward alloc smaller than one page.
> >
> > Note:
> >
> > - Removal of SK_RECLAIM_CHUNK and SK_RECLAIM_THRESHOLD
> > is left to MPTCP maintainers as a follow up.
> >
> > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > ---
> > include/net/sock.h | 29 ++---------------------------
> > net/core/datagram.c | 3 ---
> > net/ipv4/tcp.c | 7 -------
> > net/ipv4/tcp_input.c | 4 ----
> > net/ipv4/tcp_timer.c | 19 ++++---------------
> > net/iucv/af_iucv.c | 2 --
> > net/mptcp/protocol.c | 2 +-
> > net/sctp/sm_statefuns.c | 2 --
> > net/sctp/socket.c | 5 -----
> > net/sctp/stream_interleave.c | 2 --
> > net/sctp/ulpqueue.c | 4 ----
> > 11 files changed, 7 insertions(+), 72 deletions(-)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index cf288f7e9019106dfb466be707d34dacf33b339c..0063e8410a4e3ed91aef9cf34eb1127f7ce33b93 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -1627,19 +1627,6 @@ static inline void sk_mem_reclaim_final(struct sock *sk)
> > sk_mem_reclaim(sk);
> > }
> >
> > -static inline void sk_mem_reclaim_partial(struct sock *sk)
> > -{
> > - int reclaimable;
> > -
> > - if (!sk_has_account(sk))
> > - return;
> > -
> > - reclaimable = sk->sk_forward_alloc - sk_unused_reserved_mem(sk);
> > -
> > - if (reclaimable > (int)PAGE_SIZE)
> > - __sk_mem_reclaim(sk, reclaimable - 1);
> > -}
> > -
> > static inline void sk_mem_charge(struct sock *sk, int size)
> > {
> > if (!sk_has_account(sk))
> > @@ -1647,29 +1634,17 @@ static inline void sk_mem_charge(struct sock *sk, int size)
> > sk->sk_forward_alloc -= size;
> > }
> >
> > -/* the following macros control memory reclaiming in sk_mem_uncharge()
> > +/* the following macros control memory reclaiming in mptcp_rmem_uncharge()
> > */
> > #define SK_RECLAIM_THRESHOLD (1 << 21)
> > #define SK_RECLAIM_CHUNK (1 << 20)
> >
> > static inline void sk_mem_uncharge(struct sock *sk, int size)
> > {
> > - int reclaimable;
> > -
> > if (!sk_has_account(sk))
> > return;
> > sk->sk_forward_alloc += size;
> > - reclaimable = sk->sk_forward_alloc - sk_unused_reserved_mem(sk);
> > -
> > - /* Avoid a possible overflow.
> > - * TCP send queues can make this happen, if sk_mem_reclaim()
> > - * is not called and more than 2 GBytes are released at once.
> > - *
> > - * If we reach 2 MBytes, reclaim 1 MBytes right now, there is
> > - * no need to hold that much forward allocation anyway.
> > - */
> > - if (unlikely(reclaimable >= SK_RECLAIM_THRESHOLD))
> > - __sk_mem_reclaim(sk, SK_RECLAIM_CHUNK);
> > + sk_mem_reclaim(sk);
>
> Following are the differences in the number of calls to a few functions
> between the good and the bad kernels when running tbench with
> 32 clients on both machines:
>
> o AMD Zen3 Machine
>
> +-------------------------+------------------+---------------+
> | Kernel | Good | Good + Series |
> +-------------------------+------------------+---------------+
> | Benchmark Result (MB/s) | 11227.9 | 7458.7 |
> | __sk_mem_reclaim | 197 | 65293205 | *
> | skb_release_head_state | 607219812 | 406127581 |
> | tcp_ack_update_rtt | 297442779 | 198937384 |
> | tcp_cleanup_rbuf | 892648815 | 596972242 |
> | tcp_recvmsg_locked | 594885088 | 397874278 |
> +-------------------------+------------------+---------------+
>
> o Intel Xeon Machine
>
> +-------------------------+------------------+---------------+
> | Kernel | Good | Good + Series |
> +-------------------------+------------------+---------------+
> | Benchmark Result (MB/s) | 11584.9 | 10509.7 |
> | __sk_mem_reclaim | 198 | 91139810 | *
> | skb_release_head_state | 623382197 | 566914077 |
> | tcp_ack_update_rtt | 305357022 | 277699272 |
> | tcp_cleanup_rbuf | 916296601 | 833239328 |
> | tcp_recvmsg_locked | 610713561 | 555398063 |
> +-------------------------+------------------+---------------+
>
> As we see, there is a sharp increase in the number of times __sk_mem_reclaim
> is called. I believe we might be doing reclaims too often, and the
> overhead is adding up.
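>
> For reference, my reading of sk_mem_reclaim() after this series is
> roughly the following (simplified): a reclaim is now attempted as soon
> as at least one page of forward allocation is reclaimable, rather than
> only past the old 2 MB threshold, which matches the jump in call counts:
>
>   static inline void sk_mem_reclaim(struct sock *sk)
>   {
>           int reclaimable;
>
>           if (!sk_has_account(sk))
>                   return;
>
>           reclaimable = sk->sk_forward_alloc - sk_unused_reserved_mem(sk);
>
>           if (reclaimable >= (int)PAGE_SIZE)
>                   __sk_mem_reclaim(sk, reclaimable);
>   }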
>
> Following is the kstack for most calls to __sk_mem_reclaim, taken on
> the AMD Zen3 system using bpftrace on good + series when running
> 32 tbench clients:
> (Found by running: bpftrace -e 'kprobe:__sk_mem_reclaim { @[kstack] = count(); }')
>
> @[
> __sk_mem_reclaim+1
> tcp_rcv_established+377
> tcp_v4_do_rcv+348
> tcp_v4_rcv+3286
> ip_protocol_deliver_rcu+33
> ip_local_deliver_finish+128
> ip_local_deliver+111
> ip_rcv+373
> __netif_receive_skb_one_core+138
> __netif_receive_skb+21
> process_backlog+150
> __napi_poll+51
> net_rx_action+335
> __softirqentry_text_start+259
> do_softirq.part.0+164
> __local_bh_enable_ip+135
> ip_finish_output2+413
> __ip_finish_output+156
> ip_finish_output+46
> ip_output+120
> ip_local_out+94
> __ip_queue_xmit+391
> ip_queue_xmit+21
> __tcp_transmit_skb+2771
> tcp_write_xmit+914
> __tcp_push_pending_frames+55
> tcp_push+264
> tcp_sendmsg_locked+697
> tcp_sendmsg+45
> inet_sendmsg+67
> sock_sendmsg+98
> __sys_sendto+286
> __x64_sys_sendto+36
> do_syscall_64+92
> entry_SYSCALL_64_after_hwframe+99
> ]: 28986799
>
> >
> > [..snip..]
> >
>
> If this is expected based on the tradeoffs this series makes, I'll
> continue using the latest baseline numbers for testing. Please let
> me know if there is something obvious that I might have missed.
>
> If you would like me to gather any data on the test systems,
> I'll be happy to get it for you.
> --
> Thanks and Regards,
> Prateek