Message-ID: <684c6220-9288-3838-a938-0792b57c5968@amd.com>
Date: Thu, 13 Oct 2022 18:45:43 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Eric Dumazet <eric.dumazet@...il.com>,
"David S . Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>
Cc: netdev <netdev@...r.kernel.org>,
Soheil Hassas Yeganeh <soheil@...gle.com>,
Wei Wang <weiwan@...gle.com>,
Shakeel Butt <shakeelb@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>,
Eric Dumazet <edumazet@...gle.com>,
Gautham Shenoy <gautham.shenoy@....com>,
Mel Gorman <mgorman@...hsingularity.net>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Chen Yu <yu.c.chen@...el.com>,
Abel Wu <wuyun.abel@...edance.com>,
Yicong Yang <yangyicong@...ilicon.com>
Subject: Re: [PATCH net-next 6/7] net: keep sk->sk_forward_alloc as small as
possible
Hello Eric,
I might have stumbled upon a performance regression in some
microbenchmarks that appears to be caused by this series.
tl;dr
o When performing regression tests against tip:sched/core, I noticed a
  regression in tbench for the baseline kernel. After ruling out
  scheduler changes, bisecting on tip:sched/core, then on Linus' tree, and
  then on netdev/net-next led me to this series. Patch 6 of the series,
  which makes changes based on the new reclaim strategy, seems to be the
  exact commit where the regression first started. The regression is also
  observed for netperf-tcp, but not for netperf-udp, after applying this
  series.
I would like to know if this regression is expected based on some of the
design considerations in the patch. I'll leave a detailed account of the
discovery, bisection, benchmark results, and some preliminary analysis
below. I've also attached the configs used for testing on the AMD and
Intel systems.
Details:
When testing community patches, I observed a large degradation in the
baseline tbench numbers for tip:sched/core between older test reports
(Example: https://lore.kernel.org/lkml/d49aeabd-ee4e-cc81-06d1-b16029a901ee@amd.com/)
and recent test reports on the AMD Zen3 system I was testing on
(Example: https://lore.kernel.org/lkml/7975dcbe-97b3-7e6c-4697-5f316731c287@amd.com/).
Following is the direct baseline-to-baseline comparison for tbench
from the two reports mentioned above on the AMD Zen3 system (2 x 64C/128T):
NPS modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:
NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual-socket machine.
Node 0: 0-63, 128-191
Node 1: 64-127, 192-255
NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 sockets.
Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 224-255
NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 sockets.
Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 224-239
Node 7: 112-127, 240-255
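For reference, this CPU-to-node layout can be confirmed on the running
system with lscpu or numactl --hardware. Illustrative lscpu output in
NPS4 mode (abridged; the exact formatting varies with the lscpu version):

  $ lscpu | grep "NUMA node"
  NUMA node(s):        8
  NUMA node0 CPU(s):   0-15,128-143
  NUMA node1 CPU(s):   16-31,144-159
  ...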
Note: All tests were performed with performance governor.
The two tip:sched/core baselines compared below are:
o commit 5531ecffa4b9 ("sched: Add update_current_exec_runtime helper")
o commit 7e9518baed4c ("sched/fair: Move call to list_last_entry() in detach_tasks")
~~~~~~~~~~
~ tbench ~
~~~~~~~~~~
NPS1
Clients: tip (5531ecffa4b9) tip (7e9518baed4c)
1 573.26 (0.00 pct) 550.66 (-3.94 pct)
2 1131.19 (0.00 pct) 1009.69 (-10.74 pct)
4 2100.07 (0.00 pct) 1795.32 (-14.51 pct)
8 3809.88 (0.00 pct) 2971.16 (-22.01 pct)
16 6560.72 (0.00 pct) 4627.98 (-29.45 pct)
32 12203.23 (0.00 pct) 8065.15 (-33.90 pct)
64 22389.81 (0.00 pct) 14994.32 (-33.03 pct)
128 32449.37 (0.00 pct) 5175.73 (-84.04 pct) *
256 58962.40 (0.00 pct) 48763.57 (-17.29 pct)
512 59608.71 (0.00 pct) 43780.78 (-26.55 pct)
1024 58037.02 (0.00 pct) 40341.84 (-30.48 pct)
NPS2
Clients: tip (5531ecffa4b9) tip (7e9518baed4c)
1 574.20 (0.00 pct) 551.06 (-4.02 pct)
2 1131.56 (0.00 pct) 1000.76 (-11.55 pct)
4 2132.26 (0.00 pct) 1737.02 (-18.53 pct)
8 3812.20 (0.00 pct) 2992.31 (-21.50 pct)
16 6457.61 (0.00 pct) 4579.29 (-29.08 pct)
32 12263.82 (0.00 pct) 9120.73 (-25.62 pct)
64 22224.11 (0.00 pct) 14918.58 (-32.87 pct)
128 33040.38 (0.00 pct) 20830.61 (-36.95 pct)
256 56547.25 (0.00 pct) 47708.18 (-15.63 pct)
512 56220.67 (0.00 pct) 43721.79 (-22.23 pct)
1024 56048.88 (0.00 pct) 40920.49 (-26.99 pct)
NPS4
Clients: tip (5531ecffa4b9) tip (7e9518baed4c)
1 575.50 (0.00 pct) 549.22 (-4.56 pct)
2 1138.70 (0.00 pct) 1000.08 (-12.17 pct)
4 2070.66 (0.00 pct) 1794.78 (-13.32 pct)
8 3811.70 (0.00 pct) 3008.50 (-21.07 pct)
16 6312.80 (0.00 pct) 4804.71 (-23.88 pct)
32 11418.14 (0.00 pct) 9156.57 (-19.80 pct)
64 19671.16 (0.00 pct) 14901.45 (-24.24 pct)
128 30258.53 (0.00 pct) 20771.20 (-31.35 pct)
256 55838.10 (0.00 pct) 47033.88 (-15.76 pct)
512 55586.44 (0.00 pct) 43429.01 (-21.87 pct)
1024 56370.35 (0.00 pct) 39271.27 (-30.33 pct)
* Note: Ignore this data point as tbench runs into an ACPI idle driver issue
  (https://lore.kernel.org/lkml/20220921063638.2489-1-kprateek.nayak@amd.com/)
When bisecting on tip:sched/core, I found the offending commit to be the
following merge commit:
o commit: 53aa930dc4ba ("Merge branch 'sched/warnings' into sched/core, to pick up WARN_ON_ONCE() conversion commit")
This regression was also observed on Linus' tree, starting between
v5.19 and v6.0-rc1. Bisecting on Linus' tree led me to the following
merge commit as the offending commit:
o commit: f86d1fbbe785 ("Merge tag 'net-next-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next")
Bisecting the problem on netdev/net-next among the changes that went into
v6.0-rc1 led me to the following commit as the offending commit:
o commit: 4890b686f408 ("net: keep sk->sk_forward_alloc as small as possible")
This change was traced back to the series "net: reduce
tcp_memory_allocated inflation"
(https://lore.kernel.org/netdev/20220609063412.2205738-1-eric.dumazet@gmail.com/).
Commit 4890b686f408 ("net: keep sk->sk_forward_alloc as small as
possible") does not make sense in isolation, as it assumes that reclaims
are less expensive as a result of the per-cpu reserves implemented
earlier in the same series in:
o commit: 0defbb0af775 ("net: add per_cpu_fw_alloc field to struct proto")
o commit: 3cd3399dd7a8 ("net: implement per-cpu reserves for memory_allocated")
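For context, following is a minimal sketch of that per-cpu batching as I
read it from those two commits (paraphrased from include/net/sock.h, not
a verbatim copy): each CPU accumulates forward allocations locally and
only touches the shared memory_allocated atomic once roughly 1 MB of
slack has built up.

  /* Per-CPU slop (in pages) allowed before the global counter is updated. */
  #define SK_MEMORY_PCPU_RESERVE (1 << (20 - PAGE_SHIFT))

  static void sk_memory_allocated_add(struct sock *sk, int amt)
  {
          int local_reserve;

          preempt_disable();
          /* amt is in pages; batch it into this CPU's reserve first */
          local_reserve = __this_cpu_add_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
          if (local_reserve >= SK_MEMORY_PCPU_RESERVE) {
                  /* flush the whole local reserve to the shared atomic
                   * (the real code also flushes when the reserve goes
                   * sufficiently negative on the uncharge side)
                   */
                  __this_cpu_sub(*sk->sk_prot->per_cpu_fw_alloc, local_reserve);
                  atomic_long_add(local_reserve, sk->sk_prot->memory_allocated);
          }
          preempt_enable();
  }

The intent, as I understand it, is that frequent small charge/uncharge
cycles should mostly hit the per-cpu counter rather than the global
atomic.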
Following are the results of tbench and netperf after applying the
series on Linus' tree on top of the last good commit:
good: 526942b8134c ("Merge tag 'ata-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata")
On dual socket AMD 3rd Generation EPYC Processor
(2 x 64C/128T AMD EPYC 7713) in NPS1 mode:
~~~~~~~~~~
~ tbench ~
~~~~~~~~~~
Clients: good good + series
1 574.93 (0.00 pct) 554.42 (-3.56 pct)
2 1135.60 (0.00 pct) 1034.76 (-8.87 pct)
4 2117.29 (0.00 pct) 1796.97 (-15.12 pct)
8 3799.57 (0.00 pct) 3020.87 (-20.49 pct)
16 6129.79 (0.00 pct) 4536.99 (-25.98 pct)
32 11630.67 (0.00 pct) 8674.74 (-25.41 pct)
64 20895.77 (0.00 pct) 14417.26 (-31.00 pct)
128 31989.55 (0.00 pct) 20611.47 (-35.56 pct)
256 56388.57 (0.00 pct) 48822.72 (-13.41 pct)
512 59326.33 (0.00 pct) 43960.03 (-25.90 pct)
1024 58281.10 (0.00 pct) 41256.18 (-29.21 pct)
~~~~~~~~~~~
~ netperf ~
~~~~~~~~~~~
- netperf-udp
kernel good good + series
Hmean send-64 346.45 ( 0.00%) 346.53 ( 0.02%)
Hmean send-128 688.39 ( 0.00%) 688.53 ( 0.02%)
Hmean send-256 1355.60 ( 0.00%) 1358.59 ( 0.22%)
Hmean send-1024 5314.81 ( 0.00%) 5302.48 ( -0.23%)
Hmean send-2048 9757.81 ( 0.00%) 9996.26 * 2.44%*
Hmean send-3312 15033.99 ( 0.00%) 15289.91 ( 1.70%)
Hmean send-4096 16009.90 ( 0.00%) 16441.11 * 2.69%*
Hmean send-8192 25039.37 ( 0.00%) 24316.10 ( -2.89%)
Hmean send-16384 46928.16 ( 0.00%) 47746.29 ( 1.74%)
Hmean recv-64 346.45 ( 0.00%) 346.53 ( 0.02%)
Hmean recv-128 688.39 ( 0.00%) 688.53 ( 0.02%)
Hmean recv-256 1355.60 ( 0.00%) 1358.59 ( 0.22%)
Hmean recv-1024 5314.80 ( 0.00%) 5302.47 ( -0.23%)
Hmean recv-2048 9757.76 ( 0.00%) 9996.25 * 2.44%*
Hmean recv-3312 15033.95 ( 0.00%) 15289.83 ( 1.70%)
Hmean recv-4096 16009.84 ( 0.00%) 16441.05 * 2.69%*
Hmean recv-8192 25039.12 ( 0.00%) 24315.81 ( -2.89%)
Hmean recv-16384 46927.59 ( 0.00%) 47746.12 ( 1.74%)
- netperf-tcp
kernel good good + series
Hmean 64 1846.16 ( 0.00%) 1795.84 * -2.73%*
Hmean 128 3583.91 ( 0.00%) 3448.49 * -3.78%*
Hmean 256 6803.96 ( 0.00%) 6427.55 * -5.53%*
Hmean 1024 21474.74 ( 0.00%) 17722.92 * -17.47%*
Hmean 2048 32904.31 ( 0.00%) 28104.16 * -14.59%*
Hmean 3312 42468.33 ( 0.00%) 35616.94 * -16.13%*
Hmean 4096 45453.37 ( 0.00%) 38130.18 * -16.11%*
Hmean 8192 54372.39 ( 0.00%) 47438.78 * -12.75%*
Hmean 16384 61173.73 ( 0.00%) 55459.64 * -9.34%*
On dual socket 3rd Generation Intel Xeon Scalable Processor
(2 x 32C/64T Intel Xeon Platinum 8362):
~~~~~~~~~~
~ tbench ~
~~~~~~~~~~
Clients: good good + series
1 424.31 (0.00 pct) 399.00 (-5.96 pct)
2 844.12 (0.00 pct) 797.10 (-5.57 pct)
4 1667.07 (0.00 pct) 1543.72 (-7.39 pct)
8 3289.42 (0.00 pct) 3036.96 (-7.67 pct)
16 6611.76 (0.00 pct) 6095.99 (-7.80 pct)
32 12760.69 (0.00 pct) 11451.82 (-10.25 pct)
64 17750.13 (0.00 pct) 15796.17 (-11.00 pct)
128 15282.56 (0.00 pct) 14492.78 (-5.16 pct)
256 36000.91 (0.00 pct) 31496.12 (-12.51 pct)
512 35020.84 (0.00 pct) 28975.34 (-17.26 pct)
~~~~~~~~~~~
~ netperf ~
~~~~~~~~~~~
- netperf-udp
kernel good good + series
Hmean send-64 234.69 ( 0.00%) 232.32 * -1.01%*
Hmean send-128 471.02 ( 0.00%) 469.08 * -0.41%*
Hmean send-256 934.75 ( 0.00%) 914.79 * -2.14%*
Hmean send-1024 3594.09 ( 0.00%) 3562.71 * -0.87%*
Hmean send-2048 6625.58 ( 0.00%) 6720.12 * 1.43%*
Hmean send-3312 10843.34 ( 0.00%) 10818.02 * -0.23%*
Hmean send-4096 12291.20 ( 0.00%) 12329.75 * 0.31%*
Hmean send-8192 19017.73 ( 0.00%) 19348.73 * 1.74%*
Hmean send-16384 34952.23 ( 0.00%) 34886.12 * -0.19%*
Hmean recv-64 234.69 ( 0.00%) 232.32 * -1.01%*
Hmean recv-128 471.02 ( 0.00%) 469.08 * -0.41%*
Hmean recv-256 934.75 ( 0.00%) 914.79 * -2.14%*
Hmean recv-1024 3594.09 ( 0.00%) 3562.71 * -0.87%*
Hmean recv-2048 6625.58 ( 0.00%) 6720.12 * 1.43%*
Hmean recv-3312 10843.34 ( 0.00%) 10817.95 * -0.23%*
Hmean recv-4096 12291.20 ( 0.00%) 12329.75 * 0.31%*
Hmean recv-8192 19017.72 ( 0.00%) 19348.73 * 1.74%*
Hmean recv-16384 34952.23 ( 0.00%) 34886.12 * -0.19%*
- netperf-tcp
kernel good good + series
Hmean 64 2032.37 ( 0.00%) 1979.42 * -2.61%*
Hmean 128 3951.42 ( 0.00%) 3789.31 * -4.10%*
Hmean 256 7295.39 ( 0.00%) 6989.24 * -4.20%*
Hmean 1024 19844.93 ( 0.00%) 18863.06 * -4.95%*
Hmean 2048 27493.40 ( 0.00%) 25395.34 * -7.63%*
Hmean 3312 33224.91 ( 0.00%) 30145.59 * -9.27%*
Hmean 4096 35082.60 ( 0.00%) 31510.58 * -10.18%*
Hmean 8192 39842.02 ( 0.00%) 36776.27 * -7.69%*
Hmean 16384 44765.12 ( 0.00%) 41373.83 * -7.58%*
On the Zen3 system, running
perf record -a -e ibs_op//pp --raw-samples -- ./tbench_32_clients.sh
the following are the perf report summaries for kernels based on:
o good (11483.6 MB/sec)
3.54% swapper [kernel.vmlinux] [k] acpi_processor_ffh_cstate_enter
2.01% tbench_srv [kernel.vmlinux] [k] copy_user_generic_string
1.59% tbench [kernel.vmlinux] [k] net_rx_action
1.58% tbench_srv [kernel.vmlinux] [k] net_rx_action
1.46% swapper [kernel.vmlinux] [k] psi_group_change
1.45% tbench_srv [kernel.vmlinux] [k] read_tsc
1.43% tbench [kernel.vmlinux] [k] read_tsc
1.24% tbench [kernel.vmlinux] [k] copy_user_generic_string
1.15% swapper [kernel.vmlinux] [k] check_preemption_disabled
1.10% tbench [kernel.vmlinux] [k] __entry_text_start
1.10% tbench [kernel.vmlinux] [k] tcp_ack
1.00% tbench_srv [kernel.vmlinux] [k] tcp_ack
0.95% tbench [kernel.vmlinux] [k] psi_group_change
0.94% swapper [kernel.vmlinux] [k] read_tsc
0.93% tbench_srv [kernel.vmlinux] [k] psi_group_change
0.91% swapper [kernel.vmlinux] [k] menu_select
0.87% swapper [kernel.vmlinux] [k] __switch_to
o good + series (7903.55 MB/sec)
3.66% tbench_srv [kernel.vmlinux] [k] tcp_cleanup_rbuf
3.31% tbench [kernel.vmlinux] [k] tcp_cleanup_rbuf
3.30% tbench [kernel.vmlinux] [k] tcp_recvmsg_locked
3.16% tbench_srv [kernel.vmlinux] [k] tcp_recvmsg_locked
2.76% swapper [kernel.vmlinux] [k] acpi_processor_ffh_cstate_enter
2.10% tbench [kernel.vmlinux] [k] tcp_ack_update_rtt
2.05% tbench_srv [kernel.vmlinux] [k] copy_user_generic_string
2.04% tbench_srv [kernel.vmlinux] [k] tcp_ack_update_rtt
1.84% tbench [kernel.vmlinux] [k] check_preemption_disabled
1.47% tbench [kernel.vmlinux] [k] __sk_mem_reduce_allocated
1.25% tbench_srv [kernel.vmlinux] [k] __sk_mem_reduce_allocated
1.23% tbench_srv [kernel.vmlinux] [k] check_preemption_disabled
1.11% swapper [kernel.vmlinux] [k] psi_group_change
1.10% tbench [kernel.vmlinux] [k] copy_user_generic_string
0.95% tbench [kernel.vmlinux] [k] entry_SYSCALL_64
0.87% swapper [kernel.vmlinux] [k] check_preemption_disabled
0.85% tbench [kernel.vmlinux] [k] read_tsc
0.84% tbench_srv [kernel.vmlinux] [k] read_tsc
0.82% swapper [kernel.vmlinux] [k] __switch_to
0.81% tbench [kernel.vmlinux] [k] __mod_memcg_state
0.76% tbench [kernel.vmlinux] [k] psi_group_change
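As an aside, the profiled workload, ./tbench_32_clients.sh, is
essentially the following (a hypothetical sketch for readers; the exact
script, client count, and runtime shown here are assumptions):

  #!/bin/bash
  # Start the loopback tbench server, run 32 clients for the duration
  # of the profile, then clean up.
  tbench_srv &
  tbench -t 60 32 127.0.0.1
  kill %1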
On the Intel system, running
perf record -a -e cycles:ppp -- ./tbench_32_clients.sh
the following are the perf report summaries for kernels based on:
o good (12561 MB/sec)
20.62% swapper [kernel.vmlinux] [k] mwait_idle_with_hints.constprop.0
1.55% tbench_srv [kernel.vmlinux] [k] copy_user_enhanced_fast_string
1.37% swapper [kernel.vmlinux] [k] psi_group_change
0.89% swapper [kernel.vmlinux] [k] check_preemption_disabled
0.86% tbench tbench [.] child_run
0.84% tbench [kernel.vmlinux] [k] nft_do_chain
0.83% tbench_srv [kernel.vmlinux] [k] nft_do_chain
0.79% tbench_srv [kernel.vmlinux] [k] strncpy
0.77% tbench [kernel.vmlinux] [k] strncpy
o good + series (11213 MB/sec):
19.11% swapper [kernel.vmlinux] [k] mwait_idle_with_hints.constprop.0
1.90% tbench_srv [kernel.vmlinux] [k] __sk_mem_reduce_allocated
1.86% tbench [kernel.vmlinux] [k] __sk_mem_reduce_allocated
1.40% tbench_srv [kernel.vmlinux] [k] copy_user_enhanced_fast_string
1.31% swapper [kernel.vmlinux] [k] psi_group_change
0.86% tbench tbench [.] child_run
0.83% tbench_srv [kernel.vmlinux] [k] check_preemption_disabled
0.82% swapper [kernel.vmlinux] [k] check_preemption_disabled
0.80% tbench [kernel.vmlinux] [k] check_preemption_disabled
0.78% tbench [kernel.vmlinux] [k] nft_do_chain
0.78% tbench_srv [kernel.vmlinux] [k] update_sd_lb_stats.constprop.0
0.77% tbench_srv [kernel.vmlinux] [k] nft_do_chain
I've inlined some comments below.
On 6/9/2022 12:04 PM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@...gle.com>
>
> Currently, tcp_memory_allocated can hit tcp_mem[] limits quite fast.
>
> Each TCP socket can forward allocate up to 2 MB of memory, even after
> flow became less active.
>
> 10,000 sockets can have reserved 20 GB of memory,
> and we have no shrinker in place to reclaim that.
>
> Instead of trying to reclaim the extra allocations in some places,
> just keep sk->sk_forward_alloc values as small as possible.
>
> This should not impact performance too much now we have per-cpu
> reserves: Changes to tcp_memory_allocated should not be too frequent.
>
> For sockets not using SO_RESERVE_MEM:
> - idle sockets (no packets in tx/rx queues) have zero forward alloc.
> - non idle sockets have a forward alloc smaller than one page.
>
> Note:
>
> - Removal of SK_RECLAIM_CHUNK and SK_RECLAIM_THRESHOLD
> is left to MPTCP maintainers as a follow up.
>
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> ---
> include/net/sock.h | 29 ++---------------------------
> net/core/datagram.c | 3 ---
> net/ipv4/tcp.c | 7 -------
> net/ipv4/tcp_input.c | 4 ----
> net/ipv4/tcp_timer.c | 19 ++++---------------
> net/iucv/af_iucv.c | 2 --
> net/mptcp/protocol.c | 2 +-
> net/sctp/sm_statefuns.c | 2 --
> net/sctp/socket.c | 5 -----
> net/sctp/stream_interleave.c | 2 --
> net/sctp/ulpqueue.c | 4 ----
> 11 files changed, 7 insertions(+), 72 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index cf288f7e9019106dfb466be707d34dacf33b339c..0063e8410a4e3ed91aef9cf34eb1127f7ce33b93 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1627,19 +1627,6 @@ static inline void sk_mem_reclaim_final(struct sock *sk)
> sk_mem_reclaim(sk);
> }
>
> -static inline void sk_mem_reclaim_partial(struct sock *sk)
> -{
> - int reclaimable;
> -
> - if (!sk_has_account(sk))
> - return;
> -
> - reclaimable = sk->sk_forward_alloc - sk_unused_reserved_mem(sk);
> -
> - if (reclaimable > (int)PAGE_SIZE)
> - __sk_mem_reclaim(sk, reclaimable - 1);
> -}
> -
> static inline void sk_mem_charge(struct sock *sk, int size)
> {
> if (!sk_has_account(sk))
> @@ -1647,29 +1634,17 @@ static inline void sk_mem_charge(struct sock *sk, int size)
> sk->sk_forward_alloc -= size;
> }
>
> -/* the following macros control memory reclaiming in sk_mem_uncharge()
> +/* the following macros control memory reclaiming in mptcp_rmem_uncharge()
> */
> #define SK_RECLAIM_THRESHOLD (1 << 21)
> #define SK_RECLAIM_CHUNK (1 << 20)
>
> static inline void sk_mem_uncharge(struct sock *sk, int size)
> {
> - int reclaimable;
> -
> if (!sk_has_account(sk))
> return;
> sk->sk_forward_alloc += size;
> - reclaimable = sk->sk_forward_alloc - sk_unused_reserved_mem(sk);
> -
> - /* Avoid a possible overflow.
> - * TCP send queues can make this happen, if sk_mem_reclaim()
> - * is not called and more than 2 GBytes are released at once.
> - *
> - * If we reach 2 MBytes, reclaim 1 MBytes right now, there is
> - * no need to hold that much forward allocation anyway.
> - */
> - if (unlikely(reclaimable >= SK_RECLAIM_THRESHOLD))
> - __sk_mem_reclaim(sk, SK_RECLAIM_CHUNK);
> + sk_mem_reclaim(sk);
Following are the differences in the call counts of a few functions
between the good and the bad kernel when running tbench with
32 clients on both machines:
o AMD Zen3 Machine
+-------------------------+------------------+---------------+
| Kernel | Good | Good + Series |
+-------------------------+------------------+---------------+
| Benchmark Result (MB/s) | 11227.9 | 7458.7 |
| __sk_mem_reclaim | 197 | 65293205 | *
| skb_release_head_state | 607219812 | 406127581 |
| tcp_ack_update_rtt | 297442779 | 198937384 |
| tcp_cleanup_rbuf | 892648815 | 596972242 |
| tcp_recvmsg_locked | 594885088 | 397874278 |
+-------------------------+------------------+---------------+
o Intel Xeon Machine
+-------------------------+------------------+---------------+
| Kernel | Good | Good + Series |
+-------------------------+------------------+---------------+
| Benchmark Result (MB/s) | 11584.9 | 10509.7 |
| __sk_mem_reclaim | 198 | 91139810 | *
| skb_release_head_state | 623382197 | 566914077 |
| tcp_ack_update_rtt | 305357022 | 277699272 |
| tcp_cleanup_rbuf | 916296601 | 833239328 |
| tcp_recvmsg_locked | 610713561 | 555398063 |
+-------------------------+------------------+---------------+
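Counts of this kind can be gathered with kprobes; for example, a
bpftrace one-liner along the following lines (shown here for two of the
functions):

  bpftrace -e 'kprobe:__sk_mem_reclaim,kprobe:tcp_cleanup_rbuf { @[probe] = count(); }'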
As we can see, there is a sharp increase in the number of times
__sk_mem_reclaim is called. I believe we might be reclaiming too often,
and the overhead is adding up.
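For reference, with this series applied, every sk_mem_uncharge() now
funnels into sk_mem_reclaim(), which, as I read the tree at
4890b686f408 (quoted from memory, see include/net/sock.h), gives the
forward allocation back as soon as a single page's worth is reclaimable:

  static inline void sk_mem_reclaim(struct sock *sk)
  {
          int reclaimable;

          if (!sk_has_account(sk))
                  return;

          reclaimable = sk->sk_forward_alloc - sk_unused_reserved_mem(sk);

          /* reclaim everything once at least one page is reclaimable */
          if (reclaimable >= (int)PAGE_SIZE)
                  __sk_mem_reclaim(sk, reclaimable);
  }

So under a steady stream of small sends and receives, __sk_mem_reclaim()
can run on nearly every uncharge instead of once per SK_RECLAIM_THRESHOLD,
which would be consistent with the call counts above.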
Following is the kstack for the majority of calls to __sk_mem_reclaim,
taken on the AMD Zen3 system using bpftrace on good + series when
running 32 tbench clients:
(Found by running: bpftrace -e 'kprobe:__sk_mem_reclaim { @[kstack] = count(); }')
@[
__sk_mem_reclaim+1
tcp_rcv_established+377
tcp_v4_do_rcv+348
tcp_v4_rcv+3286
ip_protocol_deliver_rcu+33
ip_local_deliver_finish+128
ip_local_deliver+111
ip_rcv+373
__netif_receive_skb_one_core+138
__netif_receive_skb+21
process_backlog+150
__napi_poll+51
net_rx_action+335
__softirqentry_text_start+259
do_softirq.part.0+164
__local_bh_enable_ip+135
ip_finish_output2+413
__ip_finish_output+156
ip_finish_output+46
ip_output+120
ip_local_out+94
__ip_queue_xmit+391
ip_queue_xmit+21
__tcp_transmit_skb+2771
tcp_write_xmit+914
__tcp_push_pending_frames+55
tcp_push+264
tcp_sendmsg_locked+697
tcp_sendmsg+45
inet_sendmsg+67
sock_sendmsg+98
__sys_sendto+286
__x64_sys_sendto+36
do_syscall_64+92
entry_SYSCALL_64_after_hwframe+99
]: 28986799
>
> [..snip..]
>
If this is expected based on the tradeoffs this series makes, I'll
continue using the latest baseline numbers for testing. Please let
me know if there is something obvious that I might have missed.
If you would like me to gather any data on the test systems,
I'll be happy to get it for you.
--
Thanks and Regards,
Prateek