Message-ID: <CH3PR11MB73458BB403D537CFA96FD8DDFC769@CH3PR11MB7345.namprd11.prod.outlook.com>
Date: Tue, 9 May 2023 15:07:44 +0000
From: "Zhang, Cathy" <cathy.zhang@...el.com>
To: Eric Dumazet <edumazet@...gle.com>
CC: Paolo Abeni <pabeni@...hat.com>, "davem@...emloft.net"
<davem@...emloft.net>, "kuba@...nel.org" <kuba@...nel.org>, "Brandeburg,
Jesse" <jesse.brandeburg@...el.com>, "Srinivas, Suresh"
<suresh.srinivas@...el.com>, "Chen, Tim C" <tim.c.chen@...el.com>, "You,
Lizhen" <lizhen.you@...el.com>, "eric.dumazet@...il.com"
<eric.dumazet@...il.com>, "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Shakeel Butt <shakeelb@...gle.com>
Subject: RE: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as a proper
size
> -----Original Message-----
> From: Eric Dumazet <edumazet@...gle.com>
> Sent: Tuesday, May 9, 2023 7:59 PM
> To: Zhang, Cathy <cathy.zhang@...el.com>
> Cc: Paolo Abeni <pabeni@...hat.com>; davem@...emloft.net;
> kuba@...nel.org; Brandeburg, Jesse <jesse.brandeburg@...el.com>;
> Srinivas, Suresh <suresh.srinivas@...el.com>; Chen, Tim C
> <tim.c.chen@...el.com>; You, Lizhen <lizhen.you@...el.com>;
> eric.dumazet@...il.com; netdev@...r.kernel.org; Shakeel Butt
> <shakeelb@...gle.com>
> Subject: Re: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as a proper
> size
>
> On Tue, May 9, 2023 at 1:01 PM Zhang, Cathy <cathy.zhang@...el.com>
> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Zhang, Cathy
> > > Sent: Tuesday, May 9, 2023 6:40 PM
> > > To: Paolo Abeni <pabeni@...hat.com>; edumazet@...gle.com;
> > > davem@...emloft.net; kuba@...nel.org
> > > Cc: Brandeburg, Jesse <jesse.brandeburg@...el.com>; Srinivas, Suresh
> > > <suresh.srinivas@...el.com>; Chen, Tim C <tim.c.chen@...el.com>;
> > > You, Lizhen <Lizhen.You@...el.com>; eric.dumazet@...il.com;
> > > netdev@...r.kernel.org
> > > Subject: RE: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as
> > > a proper size
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Paolo Abeni <pabeni@...hat.com>
> > > > Sent: Tuesday, May 9, 2023 5:51 PM
> > > > To: Zhang, Cathy <cathy.zhang@...el.com>; edumazet@...gle.com;
> > > > davem@...emloft.net; kuba@...nel.org
> > > > Cc: Brandeburg, Jesse <jesse.brandeburg@...el.com>; Srinivas,
> > > > Suresh <suresh.srinivas@...el.com>; Chen, Tim C
> > > > <tim.c.chen@...el.com>; You, Lizhen <lizhen.you@...el.com>;
> > > > eric.dumazet@...il.com; netdev@...r.kernel.org
> > > > Subject: Re: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc
> > > > as a proper size
> > > >
> > > > On Sun, 2023-05-07 at 19:08 -0700, Cathy Zhang wrote:
> > > > > Before commit 4890b686f408 ("net: keep sk->sk_forward_alloc as
> > > > > small as possible"), each TCP can forward allocate up to 2 MB of
> > > > > memory and tcp_memory_allocated might hit tcp memory limitation quite soon.
> > > > > To reduce the memory pressure, that commit keeps
> > > > > sk->sk_forward_alloc as small as possible, which will be less
> > > > > than 1 page size if SO_RESERVE_MEM is not specified.
> > > > >
> > > > > However, with commit 4890b686f408 ("net: keep
> > > > > sk->sk_forward_alloc as small as possible"), memcg charge hot
> > > > > paths are observed while system is stressed with a large amount
> > > > > of connections. That is because
> > > > > sk->sk_forward_alloc is too small and it's always less than
> > > > > skb->truesize, network handlers like tcp_rcv_established() should
> > > > > jump to slow path more frequently to increase sk->sk_forward_alloc. Each
> > > > > memory allocation will trigger memcg charge, then perf top shows
> > > > > the following contention paths on the busy system.
> > > > >
> > > > > 16.77% [kernel] [k] page_counter_try_charge
> > > > > 16.56% [kernel] [k] page_counter_cancel
> > > > > 15.65% [kernel] [k] try_charge_memcg
> > > >
> > > > I'm guessing you hit memcg limits frequently. I'm wondering if
> > > > it's just a matter of tuning/reducing tcp limits in
> /proc/sys/net/ipv4/tcp_mem.
> > >
> > > Hi Paolo,
> > >
> > > Do you mean hitting the limit of "--memory", which is set when starting a
> > > container? If the memory option is not specified when initializing a
> > > container, cgroup2 will create a memcg without a memory limitation on the
> > > system, right? We've run the test without this setting, and the memcg
> > > charge hot paths also exist.
> > >
> > > It seems that /proc/sys/net/ipv4/tcp_[wr]mem is not allowed to be changed
> > > by a simple echo write, but requires a change to /etc/sysctl.conf, and I'm
> > > not sure if it could be changed without stopping the running application.
> > > Additionally, will this type of change have a deeper and more complex
> > > impact on the network stack, compared to reclaim_threshold, which is
> > > assumed to mostly affect the memory allocation paths? Considering this,
> > > we decided to add the reclaim_threshold directly.
> > >
> >
> > BTW, there was a SK_RECLAIM_THRESHOLD in sk_mem_uncharge previously; we
> > add it back with a smaller but sensible setting.
>
> The only sensible setting is as close as possible to 0, really.
>
> Per-socket caches do not scale.
> Sure, they make some benchmarks really look nice.
Benchmarks aim to help us get better performance in reality, I think :-)
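To make the per-socket cache argument concrete, here is a small user-space toy
model. It is purely illustrative: the struct, helper names, threshold and
packet size below are made up for this example and are neither kernel code nor
the actual patch. It only counts how often each strategy has to fall back to
the path that would charge pages to the memcg:

/* toy_forward_alloc.c - illustrative only, not kernel code.
 * Compares how often a socket must take the slow path that charges
 * pages to the memcg when the forward allocation is kept near zero
 * versus when a per-socket cache up to a reclaim threshold is allowed.
 */
#include <stdio.h>

#define SIM_PAGE_SIZE          4096L
#define SIM_RECLAIM_THRESHOLD  (64L * 1024)   /* hypothetical cache cap */

struct sim_sock {
	long forward_alloc;   /* bytes pre-charged but not yet consumed */
	long charge_calls;    /* times we entered the memcg charge path */
};

/* Make sure 'bytes' fit in forward_alloc, charging whole pages if not. */
static void sim_mem_charge(struct sim_sock *sk, long bytes)
{
	if (sk->forward_alloc < bytes) {
		long need = bytes - sk->forward_alloc;
		long pages = (need + SIM_PAGE_SIZE - 1) / SIM_PAGE_SIZE;

		sk->forward_alloc += pages * SIM_PAGE_SIZE;
		sk->charge_calls++;         /* models the memcg charge path */
	}
	sk->forward_alloc -= bytes;
}

/* Return 'bytes'; if the cache grows past 'cap', give everything back. */
static void sim_mem_uncharge(struct sim_sock *sk, long bytes, long cap)
{
	sk->forward_alloc += bytes;
	if (sk->forward_alloc > cap)
		sk->forward_alloc = 0;      /* models reclaiming to the memcg */
}

int main(void)
{
	struct sim_sock tight = { 0, 0 }, cached = { 0, 0 };
	long i, skb_bytes = 2048;           /* pretend every skb is 2 KB */

	for (i = 0; i < 1000000; i++) {
		sim_mem_charge(&tight, skb_bytes);
		sim_mem_uncharge(&tight, skb_bytes, 0);  /* keep nothing */

		sim_mem_charge(&cached, skb_bytes);
		sim_mem_uncharge(&cached, skb_bytes, SIM_RECLAIM_THRESHOLD);
	}

	printf("charge-path hits, no per-socket cache : %ld\n",
	       tight.charge_calls);
	printf("charge-path hits, with cache          : %ld\n",
	       cached.charge_calls);
	return 0;
}

In this toy model the no-cache socket re-enters the charge path on every round,
while the cached one almost never does. The real kernel paths are much more
involved, of course, but that is the shape of the overhead showing up as
page_counter_try_charge/try_charge_memcg in the perf profile above.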
>
> Something must be wrong in your setup, because the only small issue that
> was noticed was the memcg one that Shakeel solved last year.
As mentioned in the commit log, the test creates 8 memcached-memtier pairs on
the same host. When the server and client of the same pair are bound to the
same CPU socket and share the same CPU set (28 CPUs), the memcg overhead is
obviously high, as shown in the commit log. If they are given different CPU
sets from separate CPU sockets, the overhead is not as high, but it is still
observed. Here are the server and client commands used in our test:
server:
memcached -p ${port_i} -t ${threads_i} -c 10240
client:
memtier_benchmark --server=${memcached_id} --port=${port_i} \
--protocol=memcache_text --test-time=20 --threads=${threads_i} \
-c 1 --pipeline=16 --ratio=1:100 --run-count=5
So, do you see anything wrong with this setup?
>
> If under pressure, then memory allocations are going to be slow.
> Having per-socket caches is going to be unfair to sockets with empty caches.
Yeah, if the system is under pressure and even reaches OOM, it should release
memory to keep the workload running. But if the system is not under memory
pressure, better performance is what we are after.
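One possible way to reconcile the two cases, sketched with the same toy helpers
as above (again purely illustrative, not the actual patch), is to keep the
per-socket cache only while there is no memory pressure and give everything
back as soon as pressure is signalled:

/* Illustrative only, reusing sim_sock and sim_mem_uncharge() from the
 * toy model earlier in this mail.  'under_pressure' stands in for a
 * pressure signal such as the protocol's memory_pressure flag.
 */
static void sim_mem_uncharge_adaptive(struct sim_sock *sk, long bytes,
				      long cap, int under_pressure)
{
	/* Cache up to 'cap' bytes normally, but return everything
	 * immediately while the system is under memory pressure. */
	sim_mem_uncharge(sk, bytes, under_pressure ? 0 : cap);
}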