Message-ID: <CH3PR11MB73458F27EF26FACFB8A3CC8CFC779@CH3PR11MB7345.namprd11.prod.outlook.com>
Date: Wed, 10 May 2023 07:43:06 +0000
From: "Zhang, Cathy" <cathy.zhang@...el.com>
To: Eric Dumazet <edumazet@...gle.com>
CC: Paolo Abeni <pabeni@...hat.com>, "davem@...emloft.net"
<davem@...emloft.net>, "kuba@...nel.org" <kuba@...nel.org>, "Brandeburg,
Jesse" <jesse.brandeburg@...el.com>, "Srinivas, Suresh"
<suresh.srinivas@...el.com>, "Chen, Tim C" <tim.c.chen@...el.com>, "You,
Lizhen" <lizhen.you@...el.com>, "eric.dumazet@...il.com"
<eric.dumazet@...il.com>, "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Shakeel Butt <shakeelb@...gle.com>
Subject: RE: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as a proper
size
> -----Original Message-----
> From: Eric Dumazet <edumazet@...gle.com>
> Sent: Tuesday, May 9, 2023 11:43 PM
> To: Zhang, Cathy <cathy.zhang@...el.com>
> Cc: Paolo Abeni <pabeni@...hat.com>; davem@...emloft.net;
> kuba@...nel.org; Brandeburg, Jesse <jesse.brandeburg@...el.com>;
> Srinivas, Suresh <suresh.srinivas@...el.com>; Chen, Tim C
> <tim.c.chen@...el.com>; You, Lizhen <lizhen.you@...el.com>;
> eric.dumazet@...il.com; netdev@...r.kernel.org; Shakeel Butt
> <shakeelb@...gle.com>
> Subject: Re: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as a proper
> size
>
> On Tue, May 9, 2023 at 5:07 PM Zhang, Cathy <cathy.zhang@...el.com>
> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Eric Dumazet <edumazet@...gle.com>
> > > Sent: Tuesday, May 9, 2023 7:59 PM
> > > To: Zhang, Cathy <cathy.zhang@...el.com>
> > > Cc: Paolo Abeni <pabeni@...hat.com>; davem@...emloft.net;
> > > kuba@...nel.org; Brandeburg, Jesse <jesse.brandeburg@...el.com>;
> > > Srinivas, Suresh <suresh.srinivas@...el.com>; Chen, Tim C
> > > <tim.c.chen@...el.com>; You, Lizhen <lizhen.you@...el.com>;
> > > eric.dumazet@...il.com; netdev@...r.kernel.org; Shakeel Butt
> > > <shakeelb@...gle.com>
> > > Subject: Re: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as
> > > a proper size
> > >
> > > On Tue, May 9, 2023 at 1:01 PM Zhang, Cathy <cathy.zhang@...el.com>
> > > wrote:
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Zhang, Cathy
> > > > > Sent: Tuesday, May 9, 2023 6:40 PM
> > > > > To: Paolo Abeni <pabeni@...hat.com>; edumazet@...gle.com;
> > > > > davem@...emloft.net; kuba@...nel.org
> > > > > Cc: Brandeburg, Jesse <jesse.brandeburg@...el.com>; Srinivas,
> > > > > Suresh <suresh.srinivas@...el.com>; Chen, Tim C
> > > > > <tim.c.chen@...el.com>; You, Lizhen <Lizhen.You@...el.com>;
> > > > > eric.dumazet@...il.com; netdev@...r.kernel.org
> > > > > Subject: RE: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc
> > > > > as a proper size
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Paolo Abeni <pabeni@...hat.com>
> > > > > > Sent: Tuesday, May 9, 2023 5:51 PM
> > > > > > To: Zhang, Cathy <cathy.zhang@...el.com>; edumazet@...gle.com;
> > > > > > davem@...emloft.net; kuba@...nel.org
> > > > > > Cc: Brandeburg, Jesse <jesse.brandeburg@...el.com>; Srinivas,
> > > > > > Suresh <suresh.srinivas@...el.com>; Chen, Tim C
> > > > > > <tim.c.chen@...el.com>; You, Lizhen <lizhen.you@...el.com>;
> > > > > > eric.dumazet@...il.com; netdev@...r.kernel.org
> > > > > > Subject: Re: [PATCH net-next 1/2] net: Keep
> > > > > > sk->sk_forward_alloc as a proper size
> > > > > >
> > > > > > On Sun, 2023-05-07 at 19:08 -0700, Cathy Zhang wrote:
> > > > > > > Before commit 4890b686f408 ("net: keep sk->sk_forward_alloc
> > > > > > > as small as possible"), each TCP socket could forward-allocate
> > > > > > > up to 2 MB of memory, and tcp_memory_allocated might hit the
> > > > > > > tcp memory limitation quite soon. To reduce the memory
> > > > > > > pressure, that commit keeps sk->sk_forward_alloc as small as
> > > > > > > possible, which will be less than 1 page size if
> > > > > > > SO_RESERVE_MEM is not specified.
> > > > > > >
> > > > > > > However, with commit 4890b686f408 ("net: keep
> > > > > > > sk->sk_forward_alloc as small as possible"), memcg charge hot
> > > > > > > paths are observed while the system is stressed with a large
> > > > > > > number of connections. That is because sk->sk_forward_alloc is
> > > > > > > too small and always less than sk->truesize, so network
> > > > > > > handlers like tcp_rcv_established() jump to the slow path more
> > > > > > > frequently to increase sk->sk_forward_alloc. Each such memory
> > > > > > > allocation triggers a memcg charge, and perf top shows the
> > > > > > > following contention paths on the busy system.
> > > > > > >
> > > > > > > 16.77% [kernel] [k] page_counter_try_charge
> > > > > > > 16.56% [kernel] [k] page_counter_cancel
> > > > > > > 15.65% [kernel] [k] try_charge_memcg
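To make the contention a bit more concrete: the receive side only stays on
the fast path while the skb fits into the per-socket reserve, roughly like
the sketch below. This is illustrative only; rcv_fast_path_ok() is a
made-up name that mirrors the truesize check in tcp_rcv_established(), not
actual kernel code.

#include <net/sock.h>

/* Illustrative sketch: when sk_forward_alloc cannot cover the incoming
 * skb's truesize, the handler falls back to __sk_mem_schedule() via
 * sk_rmem_schedule(), which updates tcp_memory_allocated and, for a
 * socket inside a memcg, charges the memcg -- that is where
 * page_counter_try_charge()/try_charge_memcg() show up in perf top.
 */
static bool rcv_fast_path_ok(struct sock *sk, struct sk_buff *skb)
{
        /* fast path: pay for the skb from the per-socket reserve */
        if (sk->sk_forward_alloc >= (int)skb->truesize)
                return true;

        /* slow path: refill the reserve (global + memcg charge) */
        return sk_rmem_schedule(sk, skb, skb->truesize);
}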
> > > > > >
> > > > > > I'm guessing you hit memcg limits frequently. I'm wondering if
> > > > > > it's just a matter of tuning/reducing tcp limits in
> > > > > > /proc/sys/net/ipv4/tcp_mem.
> > > > >
> > > > > Hi Paolo,
> > > > >
> > > > > Do you mean hitting the limit of "--memory", which is set when
> > > > > starting a container? If the memory option is not specified when
> > > > > initializing a container, cgroup2 will create a memcg without a
> > > > > memory limitation on the system, right? We've run the test
> > > > > without this setting, and the memcg charge hot paths also exist.
> > > > >
> > > > > It seems that /proc/sys/net/ipv4/tcp_[wr]mem is not allowed to
> > > > > be changed by a simple echo write, but requires a change to
> > > > > /etc/sysctl.conf, and I'm not sure if it could be changed without
> > > > > stopping the running application. Additionally, will this type of
> > > > > change have a deeper and more complex impact on the network
> > > > > stack, compared to the reclaim_threshold, which is assumed to
> > > > > mostly affect the memory allocation paths? Considering this, we
> > > > > decided to add the reclaim_threshold directly.
> > > > >
> > > >
> > > > BTW, there was previously a SK_RECLAIM_THRESHOLD in
> > > > sk_mem_uncharge; we add it back with a smaller but sensible
> > > > setting.
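For reference, what we have in mind is roughly the shape below; the
threshold values are placeholders to illustrate the idea, and
sk_mem_uncharge_sketch() is not the actual patch code.

#include <net/sock.h>

/* Illustrative sketch: let the socket cache a bounded amount of
 * forward-allocated memory and only return it to the global/memcg
 * counters once the cache grows past a threshold, instead of
 * reclaiming on (almost) every uncharge.
 */
#define SK_RECLAIM_THRESHOLD    (1 << 20)       /* placeholder: 1 MB */
#define SK_RECLAIM_CHUNK        (1 << 19)       /* placeholder: 512 KB */

static inline void sk_mem_uncharge_sketch(struct sock *sk, int size)
{
        if (!sk_has_account(sk))
                return;
        sk->sk_forward_alloc += size;
        if (sk->sk_forward_alloc > SK_RECLAIM_THRESHOLD)
                __sk_mem_reclaim(sk, SK_RECLAIM_CHUNK);
}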
> > >
> > > The only sensible setting is as close as possible from 0 really.
> > >
> > > Per-socket caches do not scale.
> > > Sure, they make some benchmarks really look nice.
> >
> > Benchmarks aim to help get better performance in reality, I think :-)
>
> Sure, but system stability comes first.
>
> >
> > >
> > > Something must be wrong in your setup, because the only small issue
> > > that was noticed was the memcg one that Shakeel solved last year.
> >
> > As mentioned in the commit log, the test creates 8 memcached-memtier
> > pairs on the same host. When the server and client of a pair are bound
> > to the same CPU socket and share the same CPU set (28 CPUs), the memcg
> > overhead is obviously high, as shown in the commit log. If they are
> > given different CPU sets from separate CPU sockets, the overhead is
> > not as high but is still observed. Here are the server/client commands
> > in our test:
> > server:
> > memcached -p ${port_i} -t ${threads_i} -c 10240
> > client:
> > memtier_benchmark --server=${memcached_id} --port=${port_i} \
> >   --protocol=memcache_text --test-time=20 --threads=${threads_i} \
> >   -c 1 --pipeline=16 --ratio=1:100 --run-count=5
> >
> > So, do you see anything wrong here?
>
> Please post /proc/sys/net/ipv4/tcp_[rw]mem setting, and "cat
> /proc/net/sockstat" while the test is running.
Hi Eric,
Here is the output of tcp_[rw]mem:
~$ cat /proc/sys/net/ipv4/tcp_rmem
4096 131072 6291456
~$ cat /proc/sys/net/ipv4/tcp_wmem
4096 16384 4194304
Regarding /proc/net/sockstat, it changes during the run; 'mem' also
changes, but it seems to stay below 1 page. Since I checked it manually,
it might not be very accurate. 'mem' here reflects 'tcp_memory_allocated',
which is only added to/subtracted from when 'per_cpu_fw_alloc' reaches
its limit, right?
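My understanding of that batching is roughly the sketch below (simplified,
not the exact kernel code; memory_allocated_add_sketch() is a made-up name,
and I'm assuming the per-CPU reserve is about 1 MB worth of pages):

#include <linux/percpu.h>
#include <net/sock.h>

/* Simplified sketch of the per-CPU batching behind tcp_memory_allocated:
 * each CPU accumulates pages in per_cpu_fw_alloc and only folds them
 * into the global memory_allocated counter (the "mem" column of
 * /proc/net/sockstat) once the local batch passes a reserve, so the
 * reported value can lag real usage by up to about one batch per CPU.
 */
static void memory_allocated_add_sketch(struct sock *sk, int pages)
{
        int local;

        preempt_disable();
        local = __this_cpu_add_return(*sk->sk_prot->per_cpu_fw_alloc, pages);
        if (local >= SK_MEMORY_PCPU_RESERVE) {  /* ~1 MB worth of pages */
                __this_cpu_sub(*sk->sk_prot->per_cpu_fw_alloc, local);
                atomic_long_add(local, sk->sk_prot->memory_allocated);
        }
        preempt_enable();
}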
>
> Some mm experts should chime in, this is not a networking issue.
>
> I suspect some kind of accidental false sharing.
>
> Can you post this from your .config
>
> grep RANDSTRUCT .config
CONFIG_RANDSTRUCT_NONE=y