Open Source and information security mailing list archives
Date: Sat, 15 Oct 2022 13:19:13 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Eric Dumazet <eric.dumazet@...il.com>, "David S . Miller" <davem@...emloft.net>,
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
	netdev <netdev@...r.kernel.org>, Soheil Hassas Yeganeh <soheil@...gle.com>,
	Wei Wang <weiwan@...gle.com>, Shakeel Butt <shakeelb@...gle.com>,
	Neal Cardwell <ncardwell@...gle.com>, Gautham Shenoy <gautham.shenoy@....com>,
	Mel Gorman <mgorman@...hsingularity.net>, Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>, Vincent Guittot <vincent.guittot@...aro.org>,
	Chen Yu <yu.c.chen@...el.com>, Abel Wu <wuyun.abel@...edance.com>,
	Yicong Yang <yangyicong@...ilicon.com>
Subject: Re: [PATCH net-next 6/7] net: keep sk->sk_forward_alloc as small as possible

On Fri, Oct 14, 2022 at 1:30 AM K Prateek Nayak <kprateek.nayak@....com> wrote:
>
> Hello Eric,

...

> Following are the results:
>
> Clients:      good                  good + series          good + series + larger wmem
>     1       574.93 (0.00 pct)      554.42 (-3.56 pct)       552.92 (-3.82 pct)
>     2      1135.60 (0.00 pct)     1034.76 (-8.87 pct)      1036.94 (-8.68 pct)
>     4      2117.29 (0.00 pct)     1796.97 (-15.12 pct)     1539.21 (-27.30 pct)
>     8      3799.57 (0.00 pct)     3020.87 (-20.49 pct)     2797.98 (-26.36 pct)
>    16      6129.79 (0.00 pct)     4536.99 (-25.98 pct)     4301.20 (-29.83 pct)
>    32     11630.67 (0.00 pct)     8674.74 (-25.41 pct)     8199.28 (-29.50 pct)
>    64     20895.77 (0.00 pct)    14417.26 (-31.00 pct)    14473.34 (-30.73 pct)
>   128     31989.55 (0.00 pct)    20611.47 (-35.56 pct)    19671.08 (-38.50 pct)
>   256     56388.57 (0.00 pct)    48822.72 (-13.41 pct)    48455.77 (-14.06 pct)
>   512     59326.33 (0.00 pct)    43960.03 (-25.90 pct)    43968.59 (-25.88 pct)
>  1024     58281.10 (0.00 pct)    41256.18 (-29.21 pct)    40550.97 (-30.42 pct)
>
> Given the message size is small, I think wmem size does not
> impact the benchmark results much.

Hmmm.
tldr; I can not really repro the issues (tested on AMD EPYC 7B12, NPS1)
with CONFIG_PREEMPT_NONE=y.

sendmsg(256 bytes) grabs 4096 bytes of forward allocation from
sk->sk_prot->per_cpu_fw_alloc.

We send the skb; the softirq handler immediately sends an ACK back, and
queues the packet into the receiver socket (also grabbing bytes from
sk->sk_prot->per_cpu_fw_alloc).

The ACK releases the 4096 bytes back to per-cpu
sk->sk_prot->per_cpu_fw_alloc on the sender TCP socket.

per_cpu_fw_alloc has a 1MB cushion (per cpu); I am not sure why it is
not enough in your case. Worst case would be one dirtying of
tcp_memory_allocated every ~256 messages, but in more common cases we
dirty this cache less often...

I wonder if NPS2/NPS4 could land per-cpu variables on the wrong NUMA
node? (Or, on NPS1, incorrect NUMA information on your platform?)

Or maybe the small changes are enough for your system to hit a cliff;
AMD systems are quite sensitive to mem-bw saturation.

I ran the following on an AMD host (NPS1) with two physical cpus (256 HT total):

for i in 1 2 4 8 16 32 64 128 192 256; do echo -n "$i: "; ./super_netperf $i -H ::1 -l 10 -- -m 256 -M 256; done

Before the patch series (5c281b4e529c):

1: 6956
2: 14169
4: 28311
8: 56519
16: 113621
32: 225317
64: 341658
128: 475131
192: 304515
256: 181754

After the patch series; to me this looks very close, or even much better
at a high number of threads:

1: 6963
2: 14166
4: 28095
8: 56878
16: 112723
32: 202417
64: 266744
128: 482031
192: 317876
256: 293169

And if we look at "ss -tm" while the tests are running, it is clearly
visible that the old kernels were pretty bad in terms of memory control.
Old kernel:

ESTAB 0     55040 [::1]:39474 [::1]:32891 skmem:(r0,rb540000,t0,tb10243584,f1167104,w57600,o0,bl0,d0)
ESTAB 36864 0     [::1]:37733 [::1]:54752 skmem:(r55040,rb8515000,t0,tb2626560,f1710336,w0,o0,bl0,d0)

These two sockets were holding 1167104 + 1710336 bytes of forward
allocations, just to 'be efficient'. Now think of servers with millions
of TCP sockets :/

New kernel:

No more extra forward allocations above 4096 bytes. sk_forward_alloc
only holds the remainder of allocations, because memcg/tcp_memory_allocated
granularity is in pages.

ESTAB 35328 0     [::1]:36493 [::1]:41394 skmem:(r46848,rb7467000,t0,tb2626560,f2304,w0,o0,bl0,d0)
ESTAB 0     54272 [::1]:58680 [::1]:47859 skmem:(r0,rb540000,t0,tb6829056,f512,w56832,o0,bl0,d0)

Only when enabling CONFIG_PREEMPT=y did I see some kind of spinlock
contention in scheduler/rcu layers, making test results very flaky.