Open Source and information security mailing list archives
 
Message-ID: <abf9aae5-1497-5a68-26cd-e49d54bbe0fd@amd.com>
Date:   Mon, 17 Oct 2022 09:34:52 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     Eric Dumazet <eric.dumazet@...il.com>,
        "David S . Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        netdev <netdev@...r.kernel.org>,
        Soheil Hassas Yeganeh <soheil@...gle.com>,
        Wei Wang <weiwan@...gle.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        Neal Cardwell <ncardwell@...gle.com>,
        Gautham Shenoy <gautham.shenoy@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Chen Yu <yu.c.chen@...el.com>,
        Abel Wu <wuyun.abel@...edance.com>,
        Yicong Yang <yangyicong@...ilicon.com>
Subject: Re: [PATCH net-next 6/7] net: keep sk->sk_forward_alloc as small as
 possible

Hello Eric,

On 10/16/2022 1:49 AM, Eric Dumazet wrote:
> On Fri, Oct 14, 2022 at 1:30 AM K Prateek Nayak <kprateek.nayak@....com> wrote:
>>
>> Hello Eric,
> ...
>>
>> Following are the results:
>>
>> Clients:      good                 good + series          good + series + larger wmem
>>     1    574.93 (0.00 pct)       554.42 (-3.56 pct)      552.92 (-3.82 pct)
>>     2    1135.60 (0.00 pct)      1034.76 (-8.87 pct)     1036.94 (-8.68 pct)
>>     4    2117.29 (0.00 pct)      1796.97 (-15.12 pct)    1539.21 (-27.30 pct)
>>     8    3799.57 (0.00 pct)      3020.87 (-20.49 pct)    2797.98 (-26.36 pct)
>>    16    6129.79 (0.00 pct)      4536.99 (-25.98 pct)    4301.20 (-29.83 pct)
>>    32    11630.67 (0.00 pct)     8674.74 (-25.41 pct)    8199.28 (-29.50 pct)
>>    64    20895.77 (0.00 pct)     14417.26 (-31.00 pct)   14473.34 (-30.73 pct)
>>   128    31989.55 (0.00 pct)     20611.47 (-35.56 pct)   19671.08 (-38.50 pct)
>>   256    56388.57 (0.00 pct)     48822.72 (-13.41 pct)   48455.77 (-14.06 pct)
>>   512    59326.33 (0.00 pct)     43960.03 (-25.90 pct)   43968.59 (-25.88 pct)
>>  1024    58281.10 (0.00 pct)     41256.18 (-29.21 pct)   40550.97 (-30.42 pct)
>>
>> Given that the message size is small, I think the wmem size does not
>> impact the benchmark results much.
> 
> Hmmm.
> 
> tl;dr: I cannot really reproduce the issues (tested on an AMD EPYC 7B12,
> NPS1) with CONFIG_PREEMPT_NONE=y
> 
> sendmsg(256 bytes)
>   - grabs a 4096-byte forward allocation from sk->sk_prot->per_cpu_fw_alloc
>   - sends the skb; the softirq handler immediately sends an ACK back and
>     queues the packet into the receiver socket (also grabbing bytes from
>     sk->sk_prot->per_cpu_fw_alloc)
>   - the ACK releases the 4096 bytes back to the per-cpu
>     sk->sk_prot->per_cpu_fw_alloc of the sender TCP socket
> 
> per_cpu_fw_alloc has a 1MB cushion (per cpu); not sure why it is not
> enough in your case.
> Worst case would be one dirtying of tcp_memory_allocated every ~256 messages,
> but in more common cases we dirty this cache less often...
> 
> I wonder if NPS2/NPS4 could land per-cpu variables into the wrong NUMA
> node maybe ?
> (or on NPS1, incorrect NUMA information on your platform ?)
> Or maybe the small changes are enough for your system to hit a cliff.
> AMD systems are quite sensitive to mem-bw saturation.

We've observed unintended side effects from introducing per-cpu
variables in the past that impacted tbench performance
(https://lore.kernel.org/lkml/e000b124-afd4-28e1-fde2-393b0e38ce19@amd.com/).

In those cases, merely introducing new per-cpu variables was enough to
cause a regression. With this series, however, I only see the regression
starting from Patch 6, which is why I believed the changes in the
reclaim strategy were the cause.

> 
>  I ran the following on an AMD host (NPS1) with two physical CPUs (256 HT total):
> 
> for i in 1 2 4 8 16 32 64 128 192 256; do echo -n $i: ;
> ./super_netperf $i -H ::1 -l 10 -- -m 256 -M 256; done
> 
> Before patch series ( 5c281b4e529c )
> 1:   6956
> 2:  14169
> 4:  28311
> 8:  56519
> 16: 113621
> 32: 225317
> 64: 341658
> 128: 475131
> 192: 304515
> 256: 181754
> 
> After the patch series. To me this looks very close, or even much
> better, at high thread counts.
> 1:   6963
> 2:  14166
> 4:  28095
> 8:  56878
> 16: 112723
> 32: 202417
> 64: 266744
> 128: 482031
> 192: 317876
> 256: 293169
> 
> And if we look at "ss -tm" while tests are running, it is clearly
> visible that the old kernels were pretty bad in terms of memory
> control.
> 
> Old kernel:
> ESTAB        0              55040
> [::1]:39474                                                [::1]:32891
> skmem:(r0,rb540000,t0,tb10243584,f1167104,w57600,o0,bl0,d0)
> ESTAB        36864          0
> [::1]:37733                                                [::1]:54752
> skmem:(r55040,rb8515000,t0,tb2626560,f1710336,w0,o0,bl0,d0)
> 
> These two sockets were holding 1167104 + 1710336 bytes of forward
> allocations, just to 'be efficient'.
> Now think of servers with millions of TCP sockets :/
> 
> New kernel: no more extra forward allocations above 4096 bytes.
> sk_forward_alloc only holds the remainder of allocations,
> because memcg/tcp_memory_allocated granularity is in pages.
> 
> ESTAB   35328     0         [::1]:36493        [::1]:41394
> skmem:(r46848,rb7467000,t0,tb2626560,f2304,w0,o0,bl0,d0)
> ESTAB   0         54272     [::1]:58680        [::1]:47859
> skmem:(r0,rb540000,t0,tb6829056,f512,w56832,o0,bl0,d0)
> 
> Only when enabling CONFIG_PREEMPT=y did I hit some kind of spinlock
> contention in the scheduler/RCU layers, making test results very flaky.

Thank you for trying to reproduce the issue on your system.
The results you shared are indeed promising. I've probably
overlooked something during my testing.

Can you please share the kernel config you used during your
testing? I would like to rule out any obvious setup errors
from my side.

--
Thanks and Regards,
Prateek
