Message-ID: <CANn89iJF2sWcxEJQF8SN4+VuAfVGUmP-s7qFXZEGYJH28iQLWQ@mail.gmail.com>
Date: Sat, 15 Oct 2022 13:19:13 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Eric Dumazet <eric.dumazet@...il.com>,
"David S . Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
netdev <netdev@...r.kernel.org>,
Soheil Hassas Yeganeh <soheil@...gle.com>,
Wei Wang <weiwan@...gle.com>,
Shakeel Butt <shakeelb@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>,
Gautham Shenoy <gautham.shenoy@....com>,
Mel Gorman <mgorman@...hsingularity.net>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Chen Yu <yu.c.chen@...el.com>,
Abel Wu <wuyun.abel@...edance.com>,
Yicong Yang <yangyicong@...ilicon.com>
Subject: Re: [PATCH net-next 6/7] net: keep sk->sk_forward_alloc as small as possible
On Fri, Oct 14, 2022 at 1:30 AM K Prateek Nayak <kprateek.nayak@....com> wrote:
>
> Hello Eric,
...
>
> Following are the results:
>
> Clients:       good                  good + series           good + series + larger wmem
>    1          574.93 (0.00 pct)      554.42  (-3.56 pct)      552.92  (-3.82 pct)
>    2         1135.60 (0.00 pct)     1034.76  (-8.87 pct)     1036.94  (-8.68 pct)
>    4         2117.29 (0.00 pct)     1796.97 (-15.12 pct)     1539.21 (-27.30 pct)
>    8         3799.57 (0.00 pct)     3020.87 (-20.49 pct)     2797.98 (-26.36 pct)
>   16         6129.79 (0.00 pct)     4536.99 (-25.98 pct)     4301.20 (-29.83 pct)
>   32        11630.67 (0.00 pct)     8674.74 (-25.41 pct)     8199.28 (-29.50 pct)
>   64        20895.77 (0.00 pct)    14417.26 (-31.00 pct)    14473.34 (-30.73 pct)
>  128        31989.55 (0.00 pct)    20611.47 (-35.56 pct)    19671.08 (-38.50 pct)
>  256        56388.57 (0.00 pct)    48822.72 (-13.41 pct)    48455.77 (-14.06 pct)
>  512        59326.33 (0.00 pct)    43960.03 (-25.90 pct)    43968.59 (-25.88 pct)
> 1024        58281.10 (0.00 pct)    41256.18 (-29.21 pct)    40550.97 (-30.42 pct)
>
> Given the message size is small, I think wmem size does not
> impact the benchmark results much.
Hmmm.
tl;dr: I cannot really reproduce the issue (tested on an AMD EPYC 7B12,
NPS1) with CONFIG_PREEMPT_NONE=y.
For each sendmsg(256 bytes):
- grab 4096 bytes of forward allocation from sk->sk_prot->per_cpu_fw_alloc
- send the skb; the softirq handler immediately sends the ACK back and queues
  the packet into the receiver socket (also grabbing bytes from
  sk->sk_prot->per_cpu_fw_alloc)
- the ACK releases the 4096 bytes to the per-cpu
  sk->sk_prot->per_cpu_fw_alloc for the sender TCP socket
per_cpu_fw_alloc has a 1MB cushion (per cpu); I am not sure why that is not
enough in your case.
Worst case would be one dirtying of tcp_memory_allocated every ~256 messages
(1MB cushion / 4096 bytes per message), but in more common cases we dirty
this cache less often...
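To make the scheme concrete, here is a stand-alone user-space sketch of the
idea (not the actual kernel code; the names local_reserve/shared_total and
the refill/return thresholds are made up for illustration): each CPU keeps a
local byte reserve, and the shared counter is only dirtied when that reserve
runs dry or grows past the cushion.

/*
 * Not the actual kernel code: a stand-alone sketch of the per-cpu
 * reserve idea behind sk->sk_prot->per_cpu_fw_alloc.  All names
 * (local_reserve, shared_total, CUSHION) are illustrative.
 */
#include <stdio.h>

#define NR_CPUS    2
#define PAGE_SIZE  4096L
#define CUSHION    (1L << 20)               /* ~1MB per-cpu cushion */

static long shared_total;                   /* stands in for tcp_memory_allocated */
static long local_reserve[NR_CPUS];         /* stands in for per_cpu_fw_alloc */
static long shared_updates;                 /* how often the shared counter is dirtied */

static void charge(int cpu, long bytes)     /* e.g. sendmsg() grabbing 4096 bytes */
{
        local_reserve[cpu] -= bytes;
        if (local_reserve[cpu] < 0) {       /* reserve exhausted: refill from shared */
                shared_total += CUSHION;
                local_reserve[cpu] += CUSHION;
                shared_updates++;
        }
}

static void uncharge(int cpu, long bytes)   /* e.g. incoming ACK releasing the bytes */
{
        local_reserve[cpu] += bytes;
        if (local_reserve[cpu] > CUSHION) { /* reserve too large: give a chunk back */
                shared_total -= CUSHION;
                local_reserve[cpu] -= CUSHION;
                shared_updates++;
        }
}

int main(void)
{
        /*
         * Worst case: the charge runs on the sender cpu and the release on
         * the softirq cpu, so each reserve touches the shared counter only
         * about once per 256 messages (1MB cushion / 4096 bytes).
         */
        for (int i = 0; i < 10000; i++) {
                charge(0, PAGE_SIZE);
                uncharge(1, PAGE_SIZE);
        }
        printf("shared counter dirtied %ld times for 10000 messages\n",
               shared_updates);
        return 0;
}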
I wonder if NPS2/NPS4 could land per-cpu variables on the wrong NUMA
node? (Or, on NPS1, incorrect NUMA information on your platform?)
Or maybe the small changes are enough for your system to hit a cliff.
AMD systems are quite sensitive to mem-bw saturation.
I ran the following on an AMD host (NPS1) with two physical CPUs (256 hardware threads total):
for i in 1 2 4 8 16 32 64 128 192 256; do echo -n $i: ;
./super_netperf $i -H ::1 -l 10 -- -m 256 -M 256; done
Before the patch series (5c281b4e529c):
1: 6956
2: 14169
4: 28311
8: 56519
16: 113621
32: 225317
64: 341658
128: 475131
192: 304515
256: 181754
After the patch series (to me this looks very close, or even much better at
high thread counts):
1: 6963
2: 14166
4: 28095
8: 56878
16: 112723
32: 202417
64: 266744
128: 482031
192: 317876
256: 293169
And if we look at "ss -tm" while tests are running, it is clearly
visible that the old kernels were pretty bad in terms of memory
control.
Old kernel:
ESTAB 0 55040
[::1]:39474 [::1]:32891
skmem:(r0,rb540000,t0,tb10243584,f1167104,w57600,o0,bl0,d0)
ESTAB 36864 0
[::1]:37733 [::1]:54752
skmem:(r55040,rb8515000,t0,tb2626560,f1710336,w0,o0,bl0,d0)
These two sockets were holding 1167104+1710336 bytes (~2.7 MB) of forward
allocations, just to 'be efficient'.
Now think of servers with millions of TCP sockets :/
New kernel: no more extra forward allocations above 4096 bytes.
sk_forward_alloc only holds the remainder of allocations,
because memcg/tcp_memory_allocated granularity is in pages.
ESTAB 35328 0 [::1]:36493
[::1]:41394
skmem:(r46848,rb7467000,t0,tb2626560,f2304,w0,o0,bl0,d0)
ESTAB 0 54272 [::1]:58680
[::1]:47859
skmem:(r0,rb540000,t0,tb6829056,f512,w56832,o0,bl0,d0)
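To illustrate the page-granularity point (a throwaway user-space sketch, not
kernel code; the show() helper is made up), the f values above are exactly
the round-up slack between the bytes sitting in the socket queues and whole
pages:

#include <stdio.h>

#define PAGE_SIZE 4096L

/* Round the queued bytes up to whole pages, as the global counters do,
 * and print the slack that stays in sk_forward_alloc. */
static void show(long used)
{
        long charged = (used + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE;

        printf("used=%6ld charged=%6ld remainder=%4ld\n",
               used, charged, charged - used);
}

int main(void)
{
        show(46848);   /* r46848 above -> remainder 2304, i.e. f2304 */
        show(56832);   /* w56832 above -> remainder  512, i.e. f512  */
        return 0;
}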
Only when enabling CONFIG_PREEMPT=y did I see some spinlock contention
in the scheduler/RCU layers, making the test results very flaky.