Date: Tue, 9 May 2023 17:43:21 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: "Zhang, Cathy" <cathy.zhang@...el.com>
Cc: Paolo Abeni <pabeni@...hat.com>, "davem@...emloft.net" <davem@...emloft.net>, 
	"kuba@...nel.org" <kuba@...nel.org>, "Brandeburg, Jesse" <jesse.brandeburg@...el.com>, 
	"Srinivas, Suresh" <suresh.srinivas@...el.com>, "Chen, Tim C" <tim.c.chen@...el.com>, 
	"You, Lizhen" <lizhen.you@...el.com>, "eric.dumazet@...il.com" <eric.dumazet@...il.com>, 
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>, Shakeel Butt <shakeelb@...gle.com>
Subject: Re: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as a proper size

On Tue, May 9, 2023 at 5:07 PM Zhang, Cathy <cathy.zhang@...el.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Eric Dumazet <edumazet@...gle.com>
> > Sent: Tuesday, May 9, 2023 7:59 PM
> > To: Zhang, Cathy <cathy.zhang@...el.com>
> > Cc: Paolo Abeni <pabeni@...hat.com>; davem@...emloft.net;
> > kuba@...nel.org; Brandeburg, Jesse <jesse.brandeburg@...el.com>;
> > Srinivas, Suresh <suresh.srinivas@...el.com>; Chen, Tim C
> > <tim.c.chen@...el.com>; You, Lizhen <lizhen.you@...el.com>;
> > eric.dumazet@...il.com; netdev@...r.kernel.org; Shakeel Butt
> > <shakeelb@...gle.com>
> > Subject: Re: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as a proper
> > size
> >
> > On Tue, May 9, 2023 at 1:01 PM Zhang, Cathy <cathy.zhang@...el.com>
> > wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Zhang, Cathy
> > > > Sent: Tuesday, May 9, 2023 6:40 PM
> > > > To: Paolo Abeni <pabeni@...hat.com>; edumazet@...gle.com;
> > > > davem@...emloft.net; kuba@...nel.org
> > > > Cc: Brandeburg, Jesse <jesse.brandeburg@...el.com>; Srinivas, Suresh
> > > > <suresh.srinivas@...el.com>; Chen, Tim C <tim.c.chen@...el.com>;
> > > > You, Lizhen <Lizhen.You@...el.com>; eric.dumazet@...il.com;
> > > > netdev@...r.kernel.org
> > > > Subject: RE: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as
> > > > a proper size
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Paolo Abeni <pabeni@...hat.com>
> > > > > Sent: Tuesday, May 9, 2023 5:51 PM
> > > > > To: Zhang, Cathy <cathy.zhang@...el.com>; edumazet@...gle.com;
> > > > > davem@...emloft.net; kuba@...nel.org
> > > > > Cc: Brandeburg, Jesse <jesse.brandeburg@...el.com>; Srinivas,
> > > > > Suresh <suresh.srinivas@...el.com>; Chen, Tim C
> > > > > <tim.c.chen@...el.com>; You, Lizhen <lizhen.you@...el.com>;
> > > > > eric.dumazet@...il.com; netdev@...r.kernel.org
> > > > > Subject: Re: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc
> > > > > as a proper size
> > > > >
> > > > > On Sun, 2023-05-07 at 19:08 -0700, Cathy Zhang wrote:
> > > > > > Before commit 4890b686f408 ("net: keep sk->sk_forward_alloc as
> > > > > > small as possible"), each TCP socket can forward allocate up to
> > > > > > 2 MB of memory and tcp_memory_allocated might hit the tcp memory
> > > > > > limit quite soon.
> > > > > > To reduce the memory pressure, that commit keeps
> > > > > > sk->sk_forward_alloc as small as possible, which will be less
> > > > > > than 1 page size if SO_RESERVE_MEM is not specified.
> > > > > >
> > > > > > However, with commit 4890b686f408 ("net: keep
> > > > > > sk->sk_forward_alloc as small as possible"), memcg charge hot
> > > > > > paths are observed while the system is stressed with a large
> > > > > > number of connections. That is because sk->sk_forward_alloc is
> > > > > > too small and it's always less than truesize, so network handlers
> > > > > > like tcp_rcv_established() have to jump to the
> > > > > > slow path more frequently to increase sk->sk_forward_alloc. Each
> > > > > > memory allocation will trigger memcg charge, then perf top shows
> > > > > > the following contention paths on the busy system.
> > > > > >
> > > > > >     16.77%  [kernel]            [k] page_counter_try_charge
> > > > > >     16.56%  [kernel]            [k] page_counter_cancel
> > > > > >     15.65%  [kernel]            [k] try_charge_memcg
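(For reference, a profile like the one above is typically gathered with perf
while the benchmark is running; the duration below is only a placeholder:

  perf top -g
  # or, for an offline report:
  perf record -a -g -- sleep 30 && perf report --stdio
)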
> > > > >
> > > > > I'm guessing you hit memcg limits frequently. I'm wondering if
> > > > > it's just a matter of tuning/reducing tcp limits in
> > > > > /proc/sys/net/ipv4/tcp_mem.
> > > >
> > > > Hi Paolo,
> > > >
> > > > Do you mean hitting the limit of "--memory" which is set when starting
> > > > the container? If the memory option is not specified when initializing
> > > > a container, cgroup2 will create a memcg without a memory limit on the
> > > > system, right? We've run the test without this setting, and the memcg
> > > > charge hot paths are still observed.
> > > >
> > > > It seems that /proc/sys/net/ipv4/tcp_[wr]mem cannot be changed by a
> > > > simple echo write but requires a change to /etc/sysctl.conf, and I'm
> > > > not sure whether it could be changed without stopping the running
> > > > application. Additionally, will this kind of change have a deeper and
> > > > more complex impact on the network stack, compared to
> > > > reclaim_threshold, which is assumed to mostly affect the memory
> > > > allocation paths? Considering this, we decided to add the
> > > > reclaim_threshold directly.
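(For reference, on a host where /proc/sys is writable, the TCP memory sysctls
can usually be adjusted at runtime without editing /etc/sysctl.conf; whether
that works inside a container depends on how its /proc/sys and network
namespace are set up. A minimal sketch, with placeholder values only:

  # global TCP memory limits, in pages: min / pressure / max
  sysctl -w net.ipv4.tcp_mem="765522 1020699 1531044"
  # per-socket receive/send buffer auto-tuning, in bytes: min / default / max
  sysctl -w net.ipv4.tcp_rmem="4096 131072 6291456"
  sysctl -w net.ipv4.tcp_wmem="4096 16384 4194304"

The values above are illustrative, not recommendations.)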
> > > >
> > >
> > > BTW, there was a SK_RECLAIM_THRESHOLD in sk_mem_uncharge previously;
> > > we add it back with a smaller but sensible setting.
> >
> > The only sensible setting is as close as possible to 0, really.
> >
> > Per-socket caches do not scale.
> > Sure, they make some benchmarks really look nice.
>
> Benchmarks aim to help get better performance in reality, I think :-)

Sure, but system stability comes first.

>
> >
> > Something must be wrong in your setup, because the only small issue that
> > was noticed was the memcg one that Shakeel solved last year.
>
> As mentioned in the commit log, the test creates 8 memcached-memtier pairs
> on the same host. When the server and client of the same pair are bound to
> the same CPU socket and share the same CPU set (28 CPUs), the memcg overhead
> is obviously high, as shown in the commit log. If they are given different
> CPU sets from separate CPU sockets, the overhead is not as high but is still
> observed. Here are the server/client commands in our test:
> server:
> memcached -p ${port_i} -t ${threads_i} -c 10240
> client:
> memtier_benchmark --server=${memcached_id} --port=${port_i} \
> --protocol=memcache_text --test-time=20 --threads=${threads_i} \
> -c 1 --pipeline=16 --ratio=1:100 --run-count=5
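(Purely as an illustration of the pinning described above, and not necessarily
the exact commands used in this test, binding both ends of a pair to one CPU
socket is commonly done with numactl; the address, port and thread count below
are placeholders:

  # pin one memcached/memtier pair to NUMA node 0 and its memory (illustrative)
  numactl --cpunodebind=0 --membind=0 memcached -p 11211 -t 28 -c 10240
  numactl --cpunodebind=0 --membind=0 memtier_benchmark --server=127.0.0.1 \
      --port=11211 --protocol=memcache_text --test-time=20 --threads=28 \
      -c 1 --pipeline=16 --ratio=1:100 --run-count=5
)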
>
> So, do you see anything wrong?

Please post your /proc/sys/net/ipv4/tcp_[rw]mem settings, and the output of
"cat /proc/net/sockstat" while the test is running.
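All of these can be read straight from procfs while the benchmark is running,
for example:

  cat /proc/sys/net/ipv4/tcp_rmem
  cat /proc/sys/net/ipv4/tcp_wmem
  cat /proc/net/sockstat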

Some mm experts should chime in, this is not a networking issue.

I suspect some kind of accidental false sharing.

Can you post this from your .config

grep RANDSTRUCT .config
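(RANDSTRUCT randomizes the layout of kernel structures at build time, so an
enabled setting here could itself introduce accidental false sharing between
hot fields. On recent kernels this grep typically shows one of:

  CONFIG_RANDSTRUCT_NONE=y          # no layout randomization
  CONFIG_RANDSTRUCT_FULL=y          # fully randomized layout
  CONFIG_RANDSTRUCT_PERFORMANCE=y   # randomized, but grouped to respect cache lines
)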
