Message-ID: <CANn89iJF2sWcxEJQF8SN4+VuAfVGUmP-s7qFXZEGYJH28iQLWQ@mail.gmail.com>
Date:   Sat, 15 Oct 2022 13:19:13 -0700
From:   Eric Dumazet <edumazet@...gle.com>
To:     K Prateek Nayak <kprateek.nayak@....com>
Cc:     Eric Dumazet <eric.dumazet@...il.com>,
        "David S . Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        netdev <netdev@...r.kernel.org>,
        Soheil Hassas Yeganeh <soheil@...gle.com>,
        Wei Wang <weiwan@...gle.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        Neal Cardwell <ncardwell@...gle.com>,
        Gautham Shenoy <gautham.shenoy@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Chen Yu <yu.c.chen@...el.com>,
        Abel Wu <wuyun.abel@...edance.com>,
        Yicong Yang <yangyicong@...ilicon.com>
Subject: Re: [PATCH net-next 6/7] net: keep sk->sk_forward_alloc as small as possible

On Fri, Oct 14, 2022 at 1:30 AM K Prateek Nayak <kprateek.nayak@....com> wrote:
>
> Hello Eric,
...
>
> Following are the results:
>
> Clients:      good                    good + series           good + series + larger wmem
>     1    574.93 (0.00 pct)       554.42 (-3.56 pct)      552.92 (-3.82 pct)
>     2    1135.60 (0.00 pct)      1034.76 (-8.87 pct)     1036.94 (-8.68 pct)
>     4    2117.29 (0.00 pct)      1796.97 (-15.12 pct)    1539.21 (-27.30 pct)
>     8    3799.57 (0.00 pct)      3020.87 (-20.49 pct)    2797.98 (-26.36 pct)
>    16    6129.79 (0.00 pct)      4536.99 (-25.98 pct)    4301.20 (-29.83 pct)
>    32    11630.67 (0.00 pct)     8674.74 (-25.41 pct)    8199.28 (-29.50 pct)
>    64    20895.77 (0.00 pct)     14417.26 (-31.00 pct)   14473.34 (-30.73 pct)
>   128    31989.55 (0.00 pct)     20611.47 (-35.56 pct)   19671.08 (-38.50 pct)
>   256    56388.57 (0.00 pct)     48822.72 (-13.41 pct)   48455.77 (-14.06 pct)
>   512    59326.33 (0.00 pct)     43960.03 (-25.90 pct)   43968.59 (-25.88 pct)
>  1024    58281.10 (0.00 pct)     41256.18 (-29.21 pct)   40550.97 (-30.42 pct)
>
> Given the message size is small, I think wmem size does not
> impact the benchmark results much.

Hmmm.

tl;dr: I cannot really reproduce the issues (tested on an AMD EPYC 7B12,
NPS1) with CONFIG_PREEMPT_NONE=y.

sendmsg(256 bytes):
 - grabs 4096 bytes of forward allocation from sk->sk_prot->per_cpu_fw_alloc
 - sends the skb; the softirq handler immediately sends an ACK back and
   queues the packet into the receiver socket (also grabbing bytes from
   sk->sk_prot->per_cpu_fw_alloc)
 - the ACK releases the 4096 bytes back to the per-cpu
   sk->sk_prot->per_cpu_fw_alloc of the sender TCP socket
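
To make the cost model concrete, here is a toy user-space sketch of that
per-cpu accounting (the names, flush policy and constants are my
illustrative assumptions, not the kernel implementation):

#include <stdatomic.h>
#include <stdio.h>

#define PCPU_RESERVE (1 << 20)   /* ~1MB cushion per "cpu" */

static atomic_long memory_allocated;      /* shared: expensive to dirty */
static _Thread_local long pcpu_fw_alloc;  /* cpu-local: cheap to touch */

/* Flush the local delta to the shared counter only when it drifts
 * past the cushion: at worst once per 1MB/4096 = ~256 charges. */
static void flush_if_needed(void)
{
	if (pcpu_fw_alloc >= PCPU_RESERVE || pcpu_fw_alloc <= -PCPU_RESERVE) {
		atomic_fetch_add(&memory_allocated, pcpu_fw_alloc);
		pcpu_fw_alloc = 0;
	}
}

static void mem_charge(long bytes)    /* sendmsg path */
{
	pcpu_fw_alloc += bytes;
	flush_if_needed();
}

static void mem_uncharge(long bytes)  /* ACK path */
{
	pcpu_fw_alloc -= bytes;
	flush_if_needed();
}

int main(void)
{
	/* Each 256-byte message charges one 4096-byte page and the
	 * ACK releases it right away: the shared counter stays clean. */
	for (int i = 0; i < 100000; i++) {
		mem_charge(4096);
		mem_uncharge(4096);
	}
	printf("global=%ld local=%ld\n",
	       (long)atomic_load(&memory_allocated), pcpu_fw_alloc);
	return 0;
}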

per_cpu_fw_alloc has a 1MB cushion (per cpu); I am not sure why that is
not enough in your case.
Worst case would be one dirtying of tcp_memory_allocated every ~256
messages (1MB cushion / 4096 bytes charged per message), but in more
common cases we dirty this cache less often...

I wonder if NPS2/NPS4 could land per-cpu variables on the wrong NUMA
node? (Or, on NPS1, incorrect NUMA information on your platform?)
Or maybe these small changes are enough for your system to hit a cliff:
AMD systems are quite sensitive to memory-bandwidth saturation.

I ran the following on an AMD host (NPS1) with two physical CPUs (256 HT total):

for i in 1 2 4 8 16 32 64 128 192 256; do echo -n $i: ;
./super_netperf $i -H ::1 -l 10 -- -m 256 -M 256; done

Before the patch series (commit 5c281b4e529c):
1:   6956
2:  14169
4:  28311
8:  56519
16: 113621
32: 225317
64: 341658
128: 475131
192: 304515
256: 181754

After the patch series; to me this looks very close, or even much better
at high thread counts:
1:   6963
2:  14166
4:  28095
8:  56878
16: 112723
32: 202417
64: 266744
128: 482031
192: 317876
256: 293169

And if we look at "ss -tm" while tests are running, it is clearly
visible that the old kernels were pretty bad in terms of memory
control.

Old kernel:
ESTAB  0      55040   [::1]:39474    [::1]:32891
       skmem:(r0,rb540000,t0,tb10243584,f1167104,w57600,o0,bl0,d0)
ESTAB  36864  0       [::1]:37733    [::1]:54752
       skmem:(r55040,rb8515000,t0,tb2626560,f1710336,w0,o0,bl0,d0)

These two sockets were holding 1167104 + 1710336 bytes of forward
allocations (the skmem 'f' field), just to 'be efficient'.
Now think of servers with millions of TCP sockets :/
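
Back-of-the-envelope, using the two sockets above:

  (1167104 + 1710336) / 2   ~= 1.4MB of forward allocation per socket
  1,000,000 sockets * 1.4MB ~= 1.4TB pinned in 'spare' forward allocations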

New kernel: no more extra forward allocations above 4096 bytes.
sk_forward_alloc only holds the remainder of the last allocation,
because memcg/tcp_memory_allocated granularity is in pages.

ESTAB  35328  0       [::1]:36493    [::1]:41394
       skmem:(r46848,rb7467000,t0,tb2626560,f2304,w0,o0,bl0,d0)
ESTAB  0      54272   [::1]:58680    [::1]:47859
       skmem:(r0,rb540000,t0,tb6829056,f512,w56832,o0,bl0,d0)
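
For illustration, the f2304 above is exactly the sub-page remainder you
would expect from r46848, assuming plain round-up-to-page accounting (a
sketch of the arithmetic, not kernel code):

#include <stdio.h>

#define PAGE_SIZE 4096L

int main(void)
{
	long rmem = 46848;  /* bytes queued: the "r" field of ss -tm */

	/* memcg/tcp_memory_allocated accounting is page-granular,
	 * so the charge is rounded up to whole pages... */
	long pages = (rmem + PAGE_SIZE - 1) / PAGE_SIZE;

	/* ...and sk_forward_alloc keeps only the unused tail */
	long fwd = pages * PAGE_SIZE - rmem;

	/* prints: 12 pages charged, sk_forward_alloc=2304 */
	printf("%ld pages charged, sk_forward_alloc=%ld\n", pages, fwd);
	return 0;
}

(The second socket checks out the same way: w56832 + f512 = 57344,
i.e. exactly 14 pages.)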

Only when enabling CONFIG_PREEMPT=y did I see some kind of spinlock
contention in the scheduler/RCU layers, making test results very flaky.
