Message-ID: <684c6220-9288-3838-a938-0792b57c5968@amd.com>
Date: Thu, 13 Oct 2022 18:45:43 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Eric Dumazet <eric.dumazet@...il.com>,
"David S . Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>
Cc: netdev <netdev@...r.kernel.org>,
Soheil Hassas Yeganeh <soheil@...gle.com>,
Wei Wang <weiwan@...gle.com>,
Shakeel Butt <shakeelb@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>,
Eric Dumazet <edumazet@...gle.com>,
Gautham Shenoy <gautham.shenoy@....com>,
Mel Gorman <mgorman@...hsingularity.net>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Chen Yu <yu.c.chen@...el.com>,
Abel Wu <wuyun.abel@...edance.com>,
Yicong Yang <yangyicong@...ilicon.com>
Subject: Re: [PATCH net-next 6/7] net: keep sk->sk_forward_alloc as small as
possible
Hello Eric,
I might have stumbled upon a performance regression in some
microbenchmarks that appears to be caused by this series.
tl;dr
o When performing regression tests against tip:sched/core, I noticed a
  regression in tbench for the baseline kernel. After ruling out
  scheduler changes, bisecting on tip:sched/core, then on Linus' tree, and
  then on netdev/net-next led me to this series. Patch 6 of the series,
  which makes changes based on the new reclaim strategy, seems to be the
  exact commit where the regression first started. The regression is also
  observed for netperf-tcp, but not for netperf-udp, after applying this
  series.
I would like to know if this regression is expected based on some of the
design considerations in the patch. I'll leave a detailed account of the
discovery, bisection, benchmark results, and some preliminary analysis
below. I've also attached the configs used for testing on the AMD and
Intel systems.
Details:
When testing community patches, I observed a large degradation in the
baseline tbench numbers for tip:sched/core between older test reports
(Example: https://lore.kernel.org/lkml/d49aeabd-ee4e-cc81-06d1-b16029a901ee@amd.com/)
and recent test reports on the AMD Zen3 system I was testing on
(Example: https://lore.kernel.org/lkml/7975dcbe-97b3-7e6c-4697-5f316731c287@amd.com/).
Following is the direct baseline-to-baseline comparison for tbench
from the two reports mentioned above on the AMD Zen3 system (2 x 64C/128T):
NPS modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:
NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual-socket machine.
Node 0: 0-63, 128-191
Node 1: 64-127, 192-255
NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 sockets.
Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 224-255
NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 sockets.
Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 224-239
Node 7: 112-127, 240-255
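For reference, this CPU-to-node layout can be confirmed on the running
system with lscpu or numactl --hardware. Illustrative lscpu output in
NPS4 mode (abridged; the exact formatting varies with the lscpu version):

  $ lscpu | grep "NUMA node"
  NUMA node(s):        8
  NUMA node0 CPU(s):   0-15,128-143
  NUMA node1 CPU(s):   16-31,144-159
  ...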
Note: All tests were performed with performance governor.
The two tip:sched/core baselines compared below are:
o commit 5531ecffa4b9 ("sched: Add update_current_exec_runtime helper")
o commit 7e9518baed4c ("sched/fair: Move call to list_last_entry() in detach_tasks")
~~~~~~~~~~
~ tbench ~
~~~~~~~~~~
NPS1
Clients: tip (5531ecffa4b9) tip (7e9518baed4c)
1 573.26 (0.00 pct) 550.66 (-3.94 pct)
2 1131.19 (0.00 pct) 1009.69 (-10.74 pct)
4 2100.07 (0.00 pct) 1795.32 (-14.51 pct)
8 3809.88 (0.00 pct) 2971.16 (-22.01 pct)
16 6560.72 (0.00 pct) 4627.98 (-29.45 pct)
32 12203.23 (0.00 pct) 8065.15 (-33.90 pct)
64 22389.81 (0.00 pct) 14994.32 (-33.03 pct)
128 32449.37 (0.00 pct) 5175.73 (-84.04 pct) *
256 58962.40 (0.00 pct) 48763.57 (-17.29 pct)
512 59608.71 (0.00 pct) 43780.78 (-26.55 pct)
1024 58037.02 (0.00 pct) 40341.84 (-30.48 pct)
NPS2
Clients: tip (5531ecffa4b9) tip (7e9518baed4c)
1 574.20 (0.00 pct) 551.06 (-4.02 pct)
2 1131.56 (0.00 pct) 1000.76 (-11.55 pct)
4 2132.26 (0.00 pct) 1737.02 (-18.53 pct)
8 3812.20 (0.00 pct) 2992.31 (-21.50 pct)
16 6457.61 (0.00 pct) 4579.29 (-29.08 pct)
32 12263.82 (0.00 pct) 9120.73 (-25.62 pct)
64 22224.11 (0.00 pct) 14918.58 (-32.87 pct)
128 33040.38 (0.00 pct) 20830.61 (-36.95 pct)
256 56547.25 (0.00 pct) 47708.18 (-15.63 pct)
512 56220.67 (0.00 pct) 43721.79 (-22.23 pct)
1024 56048.88 (0.00 pct) 40920.49 (-26.99 pct)
NPS4
Clients: tip (5531ecffa4b9) tip (7e9518baed4c)
1 575.50 (0.00 pct) 549.22 (-4.56 pct)
2 1138.70 (0.00 pct) 1000.08 (-12.17 pct)
4 2070.66 (0.00 pct) 1794.78 (-13.32 pct)
8 3811.70 (0.00 pct) 3008.50 (-21.07 pct)
16 6312.80 (0.00 pct) 4804.71 (-23.88 pct)
32 11418.14 (0.00 pct) 9156.57 (-19.80 pct)
64 19671.16 (0.00 pct) 14901.45 (-24.24 pct)
128 30258.53 (0.00 pct) 20771.20 (-31.35 pct)
256 55838.10 (0.00 pct) 47033.88 (-15.76 pct)
512 55586.44 (0.00 pct) 43429.01 (-21.87 pct)
1024 56370.35 (0.00 pct) 39271.27 (-30.33 pct)
* Note: Ignore this data point as tbench runs into an ACPI idle driver issue
  (https://lore.kernel.org/lkml/20220921063638.2489-1-kprateek.nayak@amd.com/)
When bisecting on tip:sched/core, I found the offending commit to be the
following merge commit:
o commit: 53aa930dc4ba ("Merge branch 'sched/warnings' into sched/core, to pick up WARN_ON_ONCE() conversion commit")
This regression was also observed on Linus' tree, starting between
v5.19 and v6.0-rc1. Bisecting on Linus' tree led me to the following
merge commit as the offending commit:
o commit: f86d1fbbe785 ("Merge tag 'net-next-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next")
Bisecting the problem on netdev/net-next among the changes that went into
v6.0-rc1 led me to the following commit as the offending commit:
o commit: 4890b686f408 ("net: keep sk->sk_forward_alloc as small as possible")
This change was traced back to the series "net: reduce
tcp_memory_allocated inflation"
(https://lore.kernel.org/netdev/20220609063412.2205738-1-eric.dumazet@gmail.com/).
Commit 4890b686f408 ("net: keep sk->sk_forward_alloc as small as
possible") does not make sense in isolation, as it assumes that reclaims
are less expensive as a result of the per-cpu reserves implemented
earlier in the same series in:
o commit: 0defbb0af775 ("net: add per_cpu_fw_alloc field to struct proto")
o commit: 3cd3399dd7a8 ("net: implement per-cpu reserves for memory_allocated")
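For context, following is a minimal sketch of that per-cpu batching as I
read it from those two commits (paraphrased from include/net/sock.h, not
a verbatim copy): each CPU accumulates forward allocations locally and
only touches the shared memory_allocated atomic once roughly 1 MB of
slack has built up.

  /* Per-CPU slop (in pages) allowed before the global counter is updated. */
  #define SK_MEMORY_PCPU_RESERVE (1 << (20 - PAGE_SHIFT))

  static void sk_memory_allocated_add(struct sock *sk, int amt)
  {
          int local_reserve;

          preempt_disable();
          /* amt is in pages; batch it into this CPU's reserve first */
          local_reserve = __this_cpu_add_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
          if (local_reserve >= SK_MEMORY_PCPU_RESERVE) {
                  /* flush the whole local reserve to the shared atomic
                   * (the real code also flushes when the reserve goes
                   * sufficiently negative on the uncharge side)
                   */
                  __this_cpu_sub(*sk->sk_prot->per_cpu_fw_alloc, local_reserve);
                  atomic_long_add(local_reserve, sk->sk_prot->memory_allocated);
          }
          preempt_enable();
  }

The intent, as I understand it, is that frequent small charge/uncharge
cycles should mostly hit the per-cpu counter rather than the global
atomic.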
Following are the results of tbench and netperf after applying the
series on Linus' tree on top of the last good commit:
good: 526942b8134c ("Merge tag 'ata-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata")
On dual socket AMD 3rd Generation EPYC Processor
(2 x 64C/128T AMD EPYC 7713) in NPS1 mode:
~~~~~~~~~~
~ tbench ~
~~~~~~~~~~
Clients: good good + series
1 574.93 (0.00 pct) 554.42 (-3.56 pct)
2 1135.60 (0.00 pct) 1034.76 (-8.87 pct)
4 2117.29 (0.00 pct) 1796.97 (-15.12 pct)
8 3799.57 (0.00 pct) 3020.87 (-20.49 pct)
16 6129.79 (0.00 pct) 4536.99 (-25.98 pct)
32 11630.67 (0.00 pct) 8674.74 (-25.41 pct)
64 20895.77 (0.00 pct) 14417.26 (-31.00 pct)
128 31989.55 (0.00 pct) 20611.47 (-35.56 pct)
256 56388.57 (0.00 pct) 48822.72 (-13.41 pct)
512 59326.33 (0.00 pct) 43960.03 (-25.90 pct)
1024 58281.10 (0.00 pct) 41256.18 (-29.21 pct)
~~~~~~~~~~~
~ netperf ~
~~~~~~~~~~~
- netperf-udp
kernel good good + series
Hmean send-64 346.45 ( 0.00%) 346.53 ( 0.02%)
Hmean send-128 688.39 ( 0.00%) 688.53 ( 0.02%)
Hmean send-256 1355.60 ( 0.00%) 1358.59 ( 0.22%)
Hmean send-1024 5314.81 ( 0.00%) 5302.48 ( -0.23%)
Hmean send-2048 9757.81 ( 0.00%) 9996.26 * 2.44%*
Hmean send-3312 15033.99 ( 0.00%) 15289.91 ( 1.70%)
Hmean send-4096 16009.90 ( 0.00%) 16441.11 * 2.69%*
Hmean send-8192 25039.37 ( 0.00%) 24316.10 ( -2.89%)
Hmean send-16384 46928.16 ( 0.00%) 47746.29 ( 1.74%)
Hmean recv-64 346.45 ( 0.00%) 346.53 ( 0.02%)
Hmean recv-128 688.39 ( 0.00%) 688.53 ( 0.02%)
Hmean recv-256 1355.60 ( 0.00%) 1358.59 ( 0.22%)
Hmean recv-1024 5314.80 ( 0.00%) 5302.47 ( -0.23%)
Hmean recv-2048 9757.76 ( 0.00%) 9996.25 * 2.44%*
Hmean recv-3312 15033.95 ( 0.00%) 15289.83 ( 1.70%)
Hmean recv-4096 16009.84 ( 0.00%) 16441.05 * 2.69%*
Hmean recv-8192 25039.12 ( 0.00%) 24315.81 ( -2.89%)
Hmean recv-16384 46927.59 ( 0.00%) 47746.12 ( 1.74%)
- netperf-tcp
kernel good good + series
Hmean 64 1846.16 ( 0.00%) 1795.84 * -2.73%*
Hmean 128 3583.91 ( 0.00%) 3448.49 * -3.78%*
Hmean 256 6803.96 ( 0.00%) 6427.55 * -5.53%*
Hmean 1024 21474.74 ( 0.00%) 17722.92 * -17.47%*
Hmean 2048 32904.31 ( 0.00%) 28104.16 * -14.59%*
Hmean 3312 42468.33 ( 0.00%) 35616.94 * -16.13%*
Hmean 4096 45453.37 ( 0.00%) 38130.18 * -16.11%*
Hmean 8192 54372.39 ( 0.00%) 47438.78 * -12.75%*
Hmean 16384 61173.73 ( 0.00%) 55459.64 * -9.34%*
On dual socket 3rd Generation Intel Xeon Scalable Processor
(2 x 32C/64T Intel Xeon Platinum 8362):
~~~~~~~~~~
~ tbench ~
~~~~~~~~~~
Clients: good good + series
1 424.31 (0.00 pct) 399.00 (-5.96 pct)
2 844.12 (0.00 pct) 797.10 (-5.57 pct)
4 1667.07 (0.00 pct) 1543.72 (-7.39 pct)
8 3289.42 (0.00 pct) 3036.96 (-7.67 pct)
16 6611.76 (0.00 pct) 6095.99 (-7.80 pct)
32 12760.69 (0.00 pct) 11451.82 (-10.25 pct)
64 17750.13 (0.00 pct) 15796.17 (-11.00 pct)
128 15282.56 (0.00 pct) 14492.78 (-5.16 pct)
256 36000.91 (0.00 pct) 31496.12 (-12.51 pct)
512 35020.84 (0.00 pct) 28975.34 (-17.26 pct)
~~~~~~~~~~~
~ netperf ~
~~~~~~~~~~~
- netperf-udp
kernel good good + series
Hmean send-64 234.69 ( 0.00%) 232.32 * -1.01%*
Hmean send-128 471.02 ( 0.00%) 469.08 * -0.41%*
Hmean send-256 934.75 ( 0.00%) 914.79 * -2.14%*
Hmean send-1024 3594.09 ( 0.00%) 3562.71 * -0.87%*
Hmean send-2048 6625.58 ( 0.00%) 6720.12 * 1.43%*
Hmean send-3312 10843.34 ( 0.00%) 10818.02 * -0.23%*
Hmean send-4096 12291.20 ( 0.00%) 12329.75 * 0.31%*
Hmean send-8192 19017.73 ( 0.00%) 19348.73 * 1.74%*
Hmean send-16384 34952.23 ( 0.00%) 34886.12 * -0.19%*
Hmean recv-64 234.69 ( 0.00%) 232.32 * -1.01%*
Hmean recv-128 471.02 ( 0.00%) 469.08 * -0.41%*
Hmean recv-256 934.75 ( 0.00%) 914.79 * -2.14%*
Hmean recv-1024 3594.09 ( 0.00%) 3562.71 * -0.87%*
Hmean recv-2048 6625.58 ( 0.00%) 6720.12 * 1.43%*
Hmean recv-3312 10843.34 ( 0.00%) 10817.95 * -0.23%*
Hmean recv-4096 12291.20 ( 0.00%) 12329.75 * 0.31%*
Hmean recv-8192 19017.72 ( 0.00%) 19348.73 * 1.74%*
Hmean recv-16384 34952.23 ( 0.00%) 34886.12 * -0.19%*
- netperf-tcp
kernel good good + series
Hmean 64 2032.37 ( 0.00%) 1979.42 * -2.61%*
Hmean 128 3951.42 ( 0.00%) 3789.31 * -4.10%*
Hmean 256 7295.39 ( 0.00%) 6989.24 * -4.20%*
Hmean 1024 19844.93 ( 0.00%) 18863.06 * -4.95%*
Hmean 2048 27493.40 ( 0.00%) 25395.34 * -7.63%*
Hmean 3312 33224.91 ( 0.00%) 30145.59 * -9.27%*
Hmean 4096 35082.60 ( 0.00%) 31510.58 * -10.18%*
Hmean 8192 39842.02 ( 0.00%) 36776.27 * -7.69%*
Hmean 16384 44765.12 ( 0.00%) 41373.83 * -7.58%*
On the Zen3 system, running
perf record -a -e ibs_op//pp --raw-samples -- ./tbench_32_clients.sh
the following are the perf report summaries for kernels based on:
o good (11483.6 MB/sec)
3.54% swapper [kernel.vmlinux] [k] acpi_processor_ffh_cstate_enter
2.01% tbench_srv [kernel.vmlinux] [k] copy_user_generic_string
1.59% tbench [kernel.vmlinux] [k] net_rx_action
1.58% tbench_srv [kernel.vmlinux] [k] net_rx_action
1.46% swapper [kernel.vmlinux] [k] psi_group_change
1.45% tbench_srv [kernel.vmlinux] [k] read_tsc
1.43% tbench [kernel.vmlinux] [k] read_tsc
1.24% tbench [kernel.vmlinux] [k] copy_user_generic_string
1.15% swapper [kernel.vmlinux] [k] check_preemption_disabled
1.10% tbench [kernel.vmlinux] [k] __entry_text_start
1.10% tbench [kernel.vmlinux] [k] tcp_ack
1.00% tbench_srv [kernel.vmlinux] [k] tcp_ack
0.95% tbench [kernel.vmlinux] [k] psi_group_change
0.94% swapper [kernel.vmlinux] [k] read_tsc
0.93% tbench_srv [kernel.vmlinux] [k] psi_group_change
0.91% swapper [kernel.vmlinux] [k] menu_select
0.87% swapper [kernel.vmlinux] [k] __switch_to
o good + series (7903.55 MB/sec)
3.66% tbench_srv [kernel.vmlinux] [k] tcp_cleanup_rbuf
3.31% tbench [kernel.vmlinux] [k] tcp_cleanup_rbuf
3.30% tbench [kernel.vmlinux] [k] tcp_recvmsg_locked
3.16% tbench_srv [kernel.vmlinux] [k] tcp_recvmsg_locked
2.76% swapper [kernel.vmlinux] [k] acpi_processor_ffh_cstate_enter
2.10% tbench [kernel.vmlinux] [k] tcp_ack_update_rtt
2.05% tbench_srv [kernel.vmlinux] [k] copy_user_generic_string
2.04% tbench_srv [kernel.vmlinux] [k] tcp_ack_update_rtt
1.84% tbench [kernel.vmlinux] [k] check_preemption_disabled
1.47% tbench [kernel.vmlinux] [k] __sk_mem_reduce_allocated
1.25% tbench_srv [kernel.vmlinux] [k] __sk_mem_reduce_allocated
1.23% tbench_srv [kernel.vmlinux] [k] check_preemption_disabled
1.11% swapper [kernel.vmlinux] [k] psi_group_change
1.10% tbench [kernel.vmlinux] [k] copy_user_generic_string
0.95% tbench [kernel.vmlinux] [k] entry_SYSCALL_64
0.87% swapper [kernel.vmlinux] [k] check_preemption_disabled
0.85% tbench [kernel.vmlinux] [k] read_tsc
0.84% tbench_srv [kernel.vmlinux] [k] read_tsc
0.82% swapper [kernel.vmlinux] [k] __switch_to
0.81% tbench [kernel.vmlinux] [k] __mod_memcg_state
0.76% tbench [kernel.vmlinux] [k] psi_group_change
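As an aside, the profiled workload, ./tbench_32_clients.sh, is
essentially the following (a hypothetical sketch for readers; the exact
script, client count, and runtime shown here are assumptions):

  #!/bin/bash
  # Start the loopback tbench server, run 32 clients for the duration
  # of the profile, then clean up.
  tbench_srv &
  tbench -t 60 32 127.0.0.1
  kill %1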
On the Intel system, running
perf record -a -e cycles:ppp -- ./tbench_32_clients.sh
the following are the perf report summaries for kernels based on:
o good (12561 MB/sec)
20.62% swapper [kernel.vmlinux] [k] mwait_idle_with_hints.constprop.0
1.55% tbench_srv [kernel.vmlinux] [k] copy_user_enhanced_fast_string
1.37% swapper [kernel.vmlinux] [k] psi_group_change
0.89% swapper [kernel.vmlinux] [k] check_preemption_disabled
0.86% tbench tbench [.] child_run
0.84% tbench [kernel.vmlinux] [k] nft_do_chain
0.83% tbench_srv [kernel.vmlinux] [k] nft_do_chain
0.79% tbench_srv [kernel.vmlinux] [k] strncpy
0.77% tbench [kernel.vmlinux] [k] strncpy
o good + series (11213 MB/sec):
19.11% swapper [kernel.vmlinux] [k] mwait_idle_with_hints.constprop.0
1.90% tbench_srv [kernel.vmlinux] [k] __sk_mem_reduce_allocated
1.86% tbench [kernel.vmlinux] [k] __sk_mem_reduce_allocated
1.40% tbench_srv [kernel.vmlinux] [k] copy_user_enhanced_fast_string
1.31% swapper [kernel.vmlinux] [k] psi_group_change
0.86% tbench tbench [.] child_run
0.83% tbench_srv [kernel.vmlinux] [k] check_preemption_disabled
0.82% swapper [kernel.vmlinux] [k] check_preemption_disabled
0.80% tbench [kernel.vmlinux] [k] check_preemption_disabled
0.78% tbench [kernel.vmlinux] [k] nft_do_chain
0.78% tbench_srv [kernel.vmlinux] [k] update_sd_lb_stats.constprop.0
0.77% tbench_srv [kernel.vmlinux] [k] nft_do_chain
I've inlined some comments below.
On 6/9/2022 12:04 PM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@...gle.com>
>
> Currently, tcp_memory_allocated can hit tcp_mem[] limits quite fast.
>
> Each TCP socket can forward allocate up to 2 MB of memory, even after
> flow became less active.
>
> 10,000 sockets can have reserved 20 GB of memory,
> and we have no shrinker in place to reclaim that.
>
> Instead of trying to reclaim the extra allocations in some places,
> just keep sk->sk_forward_alloc values as small as possible.
>
> This should not impact performance too much now we have per-cpu
> reserves: Changes to tcp_memory_allocated should not be too frequent.
>
> For sockets not using SO_RESERVE_MEM:
> - idle sockets (no packets in tx/rx queues) have zero forward alloc.
> - non idle sockets have a forward alloc smaller than one page.
>
> Note:
>
> - Removal of SK_RECLAIM_CHUNK and SK_RECLAIM_THRESHOLD
> is left to MPTCP maintainers as a follow up.
>
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> ---
> include/net/sock.h | 29 ++---------------------------
> net/core/datagram.c | 3 ---
> net/ipv4/tcp.c | 7 -------
> net/ipv4/tcp_input.c | 4 ----
> net/ipv4/tcp_timer.c | 19 ++++---------------
> net/iucv/af_iucv.c | 2 --
> net/mptcp/protocol.c | 2 +-
> net/sctp/sm_statefuns.c | 2 --
> net/sctp/socket.c | 5 -----
> net/sctp/stream_interleave.c | 2 --
> net/sctp/ulpqueue.c | 4 ----
> 11 files changed, 7 insertions(+), 72 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index cf288f7e9019106dfb466be707d34dacf33b339c..0063e8410a4e3ed91aef9cf34eb1127f7ce33b93 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1627,19 +1627,6 @@ static inline void sk_mem_reclaim_final(struct sock *sk)
> sk_mem_reclaim(sk);
> }
>
> -static inline void sk_mem_reclaim_partial(struct sock *sk)
> -{
> - int reclaimable;
> -
> - if (!sk_has_account(sk))
> - return;
> -
> - reclaimable = sk->sk_forward_alloc - sk_unused_reserved_mem(sk);
> -
> - if (reclaimable > (int)PAGE_SIZE)
> - __sk_mem_reclaim(sk, reclaimable - 1);
> -}
> -
> static inline void sk_mem_charge(struct sock *sk, int size)
> {
> if (!sk_has_account(sk))
> @@ -1647,29 +1634,17 @@ static inline void sk_mem_charge(struct sock *sk, int size)
> sk->sk_forward_alloc -= size;
> }
>
> -/* the following macros control memory reclaiming in sk_mem_uncharge()
> +/* the following macros control memory reclaiming in mptcp_rmem_uncharge()
> */
> #define SK_RECLAIM_THRESHOLD (1 << 21)
> #define SK_RECLAIM_CHUNK (1 << 20)
>
> static inline void sk_mem_uncharge(struct sock *sk, int size)
> {
> - int reclaimable;
> -
> if (!sk_has_account(sk))
> return;
> sk->sk_forward_alloc += size;
> - reclaimable = sk->sk_forward_alloc - sk_unused_reserved_mem(sk);
> -
> - /* Avoid a possible overflow.
> - * TCP send queues can make this happen, if sk_mem_reclaim()
> - * is not called and more than 2 GBytes are released at once.
> - *
> - * If we reach 2 MBytes, reclaim 1 MBytes right now, there is
> - * no need to hold that much forward allocation anyway.
> - */
> - if (unlikely(reclaimable >= SK_RECLAIM_THRESHOLD))
> - __sk_mem_reclaim(sk, SK_RECLAIM_CHUNK);
> + sk_mem_reclaim(sk);
Following are the differences in the call counts of a few functions
between the good and the bad kernel when running tbench with
32 clients on both machines:
o AMD Zen3 Machine
+-------------------------+------------------+---------------+
| Kernel | Good | Good + Series |
+-------------------------+------------------+---------------+
| Benchmark Result (MB/s) | 11227.9 | 7458.7 |
| __sk_mem_reclaim | 197 | 65293205 | *
| skb_release_head_state | 607219812 | 406127581 |
| tcp_ack_update_rtt | 297442779 | 198937384 |
| tcp_cleanup_rbuf | 892648815 | 596972242 |
| tcp_recvmsg_locked | 594885088 | 397874278 |
+-------------------------+------------------+---------------+
o Intel Xeon Machine
+-------------------------+------------------+---------------+
| Kernel | Good | Good + Series |
+-------------------------+------------------+---------------+
| Benchmark Result (MB/s) | 11584.9 | 10509.7 |
| __sk_mem_reclaim | 198 | 91139810 | *
| skb_release_head_state | 623382197 | 566914077 |
| tcp_ack_update_rtt | 305357022 | 277699272 |
| tcp_cleanup_rbuf | 916296601 | 833239328 |
| tcp_recvmsg_locked | 610713561 | 555398063 |
+-------------------------+------------------+---------------+
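Counts of this kind can be gathered with kprobes; for example, a
bpftrace one-liner along the following lines (shown here for two of the
functions):

  bpftrace -e 'kprobe:__sk_mem_reclaim,kprobe:tcp_cleanup_rbuf { @[probe] = count(); }'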
As we can see, there is a sharp increase in the number of times
__sk_mem_reclaim is called. I believe we might be reclaiming too often,
and the overhead is adding up.
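For reference, with this series applied, every sk_mem_uncharge() now
funnels into sk_mem_reclaim(), which, as I read the tree at
4890b686f408 (quoted from memory, see include/net/sock.h), gives the
forward allocation back as soon as a single page's worth is reclaimable:

  static inline void sk_mem_reclaim(struct sock *sk)
  {
          int reclaimable;

          if (!sk_has_account(sk))
                  return;

          reclaimable = sk->sk_forward_alloc - sk_unused_reserved_mem(sk);

          /* reclaim everything once at least one page is reclaimable */
          if (reclaimable >= (int)PAGE_SIZE)
                  __sk_mem_reclaim(sk, reclaimable);
  }

So under a steady stream of small sends and receives, __sk_mem_reclaim()
can run on nearly every uncharge instead of once per SK_RECLAIM_THRESHOLD,
which would be consistent with the call counts above.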
Following is the kstack for the majority of calls to __sk_mem_reclaim,
taken on the AMD Zen3 system using bpftrace on good + series when
running 32 tbench clients:
(Found by running: bpftrace -e 'kprobe:__sk_mem_reclaim { @[kstack] = count(); }')
@[
__sk_mem_reclaim+1
tcp_rcv_established+377
tcp_v4_do_rcv+348
tcp_v4_rcv+3286
ip_protocol_deliver_rcu+33
ip_local_deliver_finish+128
ip_local_deliver+111
ip_rcv+373
__netif_receive_skb_one_core+138
__netif_receive_skb+21
process_backlog+150
__napi_poll+51
net_rx_action+335
__softirqentry_text_start+259
do_softirq.part.0+164
__local_bh_enable_ip+135
ip_finish_output2+413
__ip_finish_output+156
ip_finish_output+46
ip_output+120
ip_local_out+94
__ip_queue_xmit+391
ip_queue_xmit+21
__tcp_transmit_skb+2771
tcp_write_xmit+914
__tcp_push_pending_frames+55
tcp_push+264
tcp_sendmsg_locked+697
tcp_sendmsg+45
inet_sendmsg+67
sock_sendmsg+98
__sys_sendto+286
__x64_sys_sendto+36
do_syscall_64+92
entry_SYSCALL_64_after_hwframe+99
]: 28986799
>
> [..snip..]
>
If this is expected based on the tradeoffs this series makes, I'll
continue using the latest baseline numbers for testing. Please let
me know if there is something obvious that I might have missed.
If you would like me to gather any data on the test systems,
I'll be happy to get it for you.
--
Thanks and Regards,
Prateek